mirror of
https://github.com/beshu-tech/deltaglider.git
synced 2026-05-19 13:26:54 +02:00
d81240be80
* fix(metadata): align direct-upload keys to canonical dg-* namespace
`_upload_direct` (the path taken by non-delta-eligible files like
.sha1 / .sha512) wrote user-metadata with bare underscored keys
(`original_name`, `file_sha256`, `compression`) while delta and
reference uploads correctly used the canonical dashed namespace
(`dg-original-name`, `dg-file-sha256`, `dg-compression`).
Downstream consumers — most visibly the DeltaGlider Proxy — only
recognised the dashed form, so every .sha1 / .sha512 listing on
a bucket holding deltaglider-uploaded files produced:
WARN PATHOLOGICAL | Missing/corrupt DG metadata for
bucket/key.sha1 -- falling back to passthrough.
Error: Storage error: Missing dg-original-name
This patch aligns the writer to the canonical scheme and keeps the
read path backward-compatible with already-stored bare-keyed objects
via `resolve_metadata`. No re-upload required.
Changes
-------
* `_upload_direct` emits metadata using `f"{METADATA_PREFIX}{key}"`
(the same pattern delta/reference uploads already use).
* `METADATA_KEY_ALIASES` now lists `compression` and `source_name`
so `resolve_metadata` works for both fields uniformly.
* Replaced bare `metadata.get("compression")` /
`metadata.get("original_name")` / `metadata.get("file_size")` /
`metadata.get("ref_key")` lookups in `DeltaService.get`,
`DeltaService.delete`, `_delete_delta`, the recursive-delete
listing path, `client.list_objects_v2`, and
`client_operations.stats.get_object_info` with `resolve_metadata`
calls so legacy bare-keyed objects keep working forever.
Tests
-----
* `tests/unit/test_metadata_aliases.py` (new, 11 tests) — pins the
alias table contract: new dashed keys, legacy bare underscored
keys, legacy hyphenated keys, priority rule, empty-string
handling.
* `test_direct_upload_emits_dashed_namespace` in
`tests/unit/test_core_service.py` — pins the writer to emit only
dg-* keys.
* Existing tests using the legacy bare `compression: "none"` form
in `test_s3_compat.py` and `test_recursive_delete_reference_*.py`
still pass — proving the dual-scheme read contract holds.
Full unit suite: 87/87 pass, mypy clean, ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(metadata): also resolve legacy file_sha256 in get() dispatch
Adversarial review of the original patch caught a second
asymmetry: DeltaService.get's "is this a regular S3 object or
DeltaGlider-managed?" dispatch was a literal-string check
`"dg-file-sha256" not in obj_head.metadata`. After the writer
fix, NEW direct uploads have `dg-file-sha256` so they route
correctly. But ~4400 pre-fix `.sha1` / `.sha512` files in
production have the bare `file_sha256` key, and they were
silently being routed through the "regular S3 object" branch
instead of the "direct upload" branch.
Both branches call `_get_direct` so file content was still
served correctly — but the wrong log message fired
("Downloading regular S3 object (no DeltaGlider metadata)") and
the recorded file-size for telemetry came from obj_head.size
instead of the metadata's `file_size` (same value for direct
uploads, but still semantically wrong).
Swap the literal-string check for `resolve_metadata(meta,
"file_sha256") is None` so both schemes route to the
DeltaGlider-managed branch.
Added regression test
`test_get_legacy_direct_upload_not_misclassified_as_regular_s3`
that builds a HEAD response with the legacy bare-keyed metadata
shape (exactly what's stored on Hetzner today for the .sha files),
captures the log messages, and fails if the "regular S3 object"
canary fires.
Demonstrated locally: revert the dispatch back to literal-string
check → new test fails with the canary log line. Restore →
88/88 pass.
CHANGELOG updated to document both fixes (writer + dispatch).
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
17 KiB
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Fixed
- Direct-upload metadata now uses the canonical `dg-*` dashed namespace. Pre-fix, files routed through `_upload_direct` (non-delta-eligible extensions: `.sha1`, `.sha512`, etc.) wrote metadata with bare underscored keys (`original_name`, `file_sha256`, `compression`) while delta and reference uploads correctly used the namespaced form (`dg-original-name`, `dg-file-sha256`, `dg-compression`). Downstream consumers, most visibly the DeltaGlider Proxy, only recognised the dashed form, so every `.sha1` / `.sha512` listing triggered a `PATHOLOGICAL | Missing/corrupt DG metadata` warning. Aligned the writer to the canonical scheme so new uploads stop producing log spam.
Changed
- Read path now resolves both schemes uniformly. The historical bare keys (`original_name`, `compression`, etc.) stay in `METADATA_KEY_ALIASES` so already-stored objects keep being recognised on read; no migration is required. Replaced ad-hoc `metadata.get("compression")` / `metadata.get("original_name")` / `metadata.get("file_size")` / `metadata.get("ref_key")` lookups in `DeltaService.get`, `DeltaService.delete`, `_delete_delta`, the recursive-delete listing path, `client.list_objects_v2`, and `client_operations.stats.get_object_info` with `resolve_metadata(meta, field)` calls so both schemes work transparently for the lifetime of the bucket. New `compression` and `source_name` entries added to the alias table.
- `DeltaService.get` "regular S3 vs DeltaGlider-managed" dispatch now uses `resolve_metadata` for the `file_sha256` presence check. Pre-fix, this check looked for the literal string `"dg-file-sha256"` in `obj_head.metadata`, which silently misclassified legacy bare-keyed direct uploads (`file_sha256` without the `dg-` prefix) as "regular S3 objects". They were still served correctly because both branches call `_get_direct`, but the wrong log line fired and the wrong `file_size` value was recorded for telemetry. Caught during adversarial PR review.
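The dual-scheme read path described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the contents of `METADATA_KEY_ALIASES` and the helper `is_deltaglider_managed` are assumptions made for the example; only the `resolve_metadata(meta, field)` call shape and the "dashed key first, empty string counts as missing" rules come from the entries above.

```python
from typing import Mapping, Optional

# Hypothetical alias table: canonical dashed key first, legacy bare key second.
METADATA_KEY_ALIASES: dict[str, tuple[str, ...]] = {
    "file_sha256": ("dg-file-sha256", "file_sha256"),
    "original_name": ("dg-original-name", "original_name"),
    "compression": ("dg-compression", "compression"),
}


def resolve_metadata(metadata: Mapping[str, str], field: str) -> Optional[str]:
    """Return the first non-empty value stored under any alias of `field`."""
    for key in METADATA_KEY_ALIASES.get(field, ()):
        value = metadata.get(key)
        if value:  # empty strings are treated as missing
            return value
    return None


def is_deltaglider_managed(metadata: Mapping[str, str]) -> bool:
    """Hypothetical dispatch helper: either key scheme marks a managed object."""
    return resolve_metadata(metadata, "file_sha256") is not None
```

Because the dashed alias is listed first, an object carrying both forms resolves to the canonical value, while an object carrying only the legacy bare key is still recognised.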
Added
- Regression tests for the dual-scheme contract (`tests/unit/test_metadata_aliases.py`, 11 tests): every alias resolves, new dashed keys win when both are present, empty strings count as missing, the alias-table shape is pinned (first alias dashed, bare underscored alias always present, `compression` + `source_name` present).
- `test_direct_upload_emits_dashed_namespace` in `test_core_service.py` pins the writer to emit `dg-*`-only metadata so the original underscored regression cannot return.
- `test_get_legacy_direct_upload_not_misclassified_as_regular_s3` in `test_core_service.py` pins the `get()` dispatch to route bare-keyed legacy direct uploads through the DeltaGlider-managed branch (not the "regular S3 object" passthrough). Demonstrated to fail without the corresponding `resolve_metadata` swap, pass with it.
[6.1.1] - 2026-03-23
Fixed
- S3-Compatible Endpoint Support: Disabled boto3 automatic request checksums (CRC32/CRC64) that were added in boto3 1.36+. S3-compatible stores like Hetzner Object Storage reject these headers with `BadRequest`, breaking direct (non-delta) file uploads. Sets `request_checksum_calculation="when_required"` to restore compatibility while still working with AWS S3.
- CI: LocalStack pinned to 4.4. `localstack/localstack:latest` now requires a paid license; pinned to the last free version across all workflows and docker-compose files.
Changed
- Dependency Pinning: All runtime dependencies now use major-version upper bounds (`boto3>=1.35.0,<2.0.0`, etc.) to prevent surprise breaking changes in Docker builds.
Added
- S3 Compatibility Tests: New `test_s3_compat.py` unit tests verifying the boto3 client disables automatic checksums and `put_object` doesn't pass checksum kwargs (regression protection for non-AWS S3 endpoints).
- Dependency Management Guide: Added quarterly dependency refresh checklist and known compatibility constraints to CLAUDE.md.
6.1.0 - 2025-02-07
Added
- Bucket ACL Management: New `put_bucket_acl()` and `get_bucket_acl()` methods
- boto3-compatible passthrough to native S3 ACL operations
- Supports canned ACLs (`private`, `public-read`, `public-read-write`, `authenticated-read`)
- Supports grant-based ACLs (`GrantRead`, `GrantWrite`, `GrantFullControl`, etc.)
- Supports full `AccessControlPolicy` dict for fine-grained control
- SDK method count increased from 21 to 23
- New CLI Commands: `deltaglider put-bucket-acl` and `deltaglider get-bucket-acl`
- Mirrors `aws s3api put-bucket-acl` / `get-bucket-acl` syntax
- Accepts bucket name or `s3://bucket` URL format
- JSON output for `get-bucket-acl` (compatible with AWS CLI)
- Supports `--endpoint-url`, `--region`, `--profile` flags
- Docker Publishing: Added GitHub Actions workflow for multi-arch Docker image builds (amd64/arm64)
Changed
- Refactor: Extracted `DeltaGliderConfig` dataclass for centralized configuration management
- Refactor: Introduced typed `DeleteResult` and `RecursiveDeleteResult` dataclasses replacing raw dicts
- Refactor: Centralized S3 metadata key aliases into `core/models.py` constants
- Refactor: Extracted helper methods in `DeltaService` for improved readability
Fixed
- Removed unused imports flagged by ruff in test files
Documentation
- Updated BOTO3_COMPATIBILITY.md (coverage 20% → 23%)
- Updated AWS S3 CLI compatibility docs with ACL command examples
- Refreshed README with dark mode logo and streamlined content
- Cleaned up SDK documentation and examples
6.0.0 - 2025-10-17
Added
- EC2 Region Detection & Cost Optimization
- Automatic detection of EC2 instance region using IMDSv2
- Warns when EC2 region ≠ S3 client region (potential cross-region charges)
- Different warnings for auto-detected vs. explicit `--region` flag mismatches
- Green checkmark when regions are aligned (optimal configuration)
- Can be disabled with `DG_DISABLE_EC2_DETECTION=true` environment variable
- Helps users optimize for cost and performance before migration starts
- New CLI Command: `deltaglider migrate` for S3-to-S3 bucket migration with compression
- Supports resume capability (skips already migrated files)
- Real-time progress tracking with file count and statistics
- Interactive confirmation prompt (use `--yes` to skip)
- Prefix preservation by default (use `--no-preserve-prefix` to disable)
- Dry run mode with `--dry-run` flag
- Include/exclude pattern filtering
- Shows compression statistics after migration
- EC2-aware region logging: Detects EC2 instance and warns about cross-region charges
- FIXED: Now correctly preserves original filenames during migration
- S3-to-S3 Recursive Copy: `deltaglider cp -r s3://source/ s3://dest/` now supported
- Automatically uses migration functionality with prefix preservation
- Applies delta compression during transfer
- Preserves original filenames correctly
- Version Command: Added `--version` flag to show deltaglider version
- Usage: `deltaglider --version`
- DeltaService API Enhancement: Added `override_name` parameter to `put()` method
- Allows specifying destination filename independently of source filesystem path
- Enables proper S3-to-S3 transfers without filesystem renaming tricks
- Rehydration & Purge: Automatic rehydration of delta-compressed files for presigned URL access
- New `deltaglider purge` CLI command to clean expired temporary files
- Metadata Namespace: Centralized `dg-` prefixed metadata keys for all DeltaGlider metadata
- S3-Based Stats Caching: Bucket statistics cached in S3 with automatic invalidation
Fixed
- Critical: S3-to-S3 migration now preserves original filenames
- Previously created files with temp names like `tmp1b9cpdsn.zip`
- Now correctly uses original filenames from source S3 keys
- Fixed by adding `override_name` parameter to `DeltaService.put()`
- CLI Region Support: `--region` flag now properly passes region to boto3 client
- Previously only set environment variable, relied on boto3 auto-detection
- Now explicitly passes `region_name` to `boto3.client()` via `boto3_kwargs`
- Ensures consistent behavior with `DeltaGliderClient` SDK
Changed
- Recursive S3-to-S3 copy operations now preserve source prefix structure by default
- Migration operations show formatted output with source and destination paths
Documentation
- Added comprehensive migration guide in README.md
- Updated CLI reference with migrate command examples
- Added prefix preservation behavior documentation
[5.1.1] - 2025-01-10
Fixed
- Stats Command: Fixed incorrect compression ratio calculations
- Now correctly counts ALL files including reference.bin in compressed size
- Fixed handling of orphaned reference.bin files (reference files with no delta files)
- Added prominent warnings for orphaned reference files with cleanup commands
- Fixed stats for buckets with no compression (now shows 0% instead of negative)
- SHA1 checksum files are now properly included in calculations
Improved
- Stats Performance: Optimized metadata fetching with parallel requests
- 5-10x faster for buckets with many delta files
- Uses ThreadPoolExecutor for concurrent HEAD requests
- Single-pass calculation algorithm for better efficiency
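The concurrent HEAD pattern mentioned above might look roughly like the sketch below. It is illustrative only: `fetch_metadata_parallel`, `fake_head`, and the worker count are hypothetical names and values, and a stand-in function replaces the real S3 HEAD call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_metadata_parallel(
    keys: Iterable[str],
    head: Callable[[str], dict],
    max_workers: int = 10,  # assumed value, tune for the endpoint
) -> Dict[str, dict]:
    """Call `head(key)` for every key concurrently and collect the results."""
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its result
        return dict(zip(key_list, pool.map(head, key_list)))


def fake_head(key: str) -> dict:
    """Stand-in for an S3 HEAD request; returns made-up metadata."""
    return {"file_size": str(len(key))}


result = fetch_metadata_parallel(["a.zip.delta", "b.zip.delta"], fake_head)
```

HEAD requests are I/O-bound, so threads overlap the network latency even under the GIL; the actual speed-up depends on the endpoint and the worker count.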
5.1.0 - 2025-10-10
Added
- New CLI Command: `deltaglider stats <bucket>` for bucket statistics and compression metrics
- Supports `--detailed` flag for comprehensive analysis
- Supports `--json` flag for machine-readable output
- Accepts multiple formats: `s3://bucket/`, `s3://bucket`, `bucket`
- Session-Level Statistics Caching: Bucket stats now cached per client instance
- Automatic cache invalidation on mutations (put, delete, bucket operations)
- Intelligent cache reuse (detailed stats serve quick stat requests)
- Enhanced `list_buckets()` includes cached stats when available
- Programmatic Cache Management: Added cache management APIs for long-running applications
- `clear_cache()`: Clear all cached references
- `evict_cache()`: Remove specific cached reference
- Session-scoped cache lifecycle management
Changed
- Bucket statistics are now cached within client session for performance
- `list_buckets()` response includes `DeltaGliderStats` metadata when cached
Documentation
- Added comprehensive DG_MAX_RATIO tuning guide in docs/
- Updated CLI command reference in CLAUDE.md and README.md
- Added detailed cache management documentation
5.0.3 - 2025-10-10
Security
- BREAKING: Removed all legacy shared cache code for security
- BREAKING: Encryption is now ALWAYS ON (cannot be disabled)
- Ephemeral process-isolated cache is now the ONLY mode (no opt-out)
- Content-Addressed Storage (CAS): Implemented SHA256-based cache storage
- Negligible collision risk (SHA256 collisions are computationally infeasible in practice)
- Automatic deduplication (same content = same filename)
- Tampering protection (changing content changes SHA, breaks lookup)
- Two-level directory structure for filesystem optimization
- Encrypted Cache: All cache data encrypted at rest using Fernet (AES-128-CBC + HMAC)
- Ephemeral encryption keys per process (forward secrecy)
- Optional persistent keys via `DG_CACHE_ENCRYPTION_KEY` for shared filesystems
- Automatic cleanup of corrupted cache files on decryption failures
- Fixed TOCTOU vulnerabilities with atomic SHA validation at use-time
- Added `get_validated_ref()` method to prevent cache poisoning
- Eliminated multi-user data exposure through mandatory cache isolation
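To make the content-addressed layout and use-time validation concrete, here is a small sketch under stated assumptions. The function names (`cas_path`, `store`, `load_validated`) and the exact two-level split (first two and next two hex characters) are hypothetical choices for illustration; the changelog does not specify them, and encryption is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_path(cache_root: Path, content: bytes) -> Path:
    """Derive the file path from the SHA256 of the content (assumed layout)."""
    digest = hashlib.sha256(content).hexdigest()
    return cache_root / digest[:2] / digest[2:4] / digest


def store(cache_root: Path, content: bytes) -> Path:
    """Write content once; identical content always maps to the same file."""
    path = cas_path(cache_root, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(content)
    return path


def load_validated(path: Path) -> bytes:
    """Re-hash at use-time so a tampered file is rejected instead of served."""
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != path.name:
        raise ValueError(f"cache entry {path.name} failed SHA256 validation")
    return data


# Usage: storing the same bytes twice yields one file
root = Path(tempfile.mkdtemp(prefix="cas-demo-"))
first = store(root, b"reference bytes")
second = store(root, b"reference bytes")
```

Validating the bytes that were actually read, rather than checking the file and then reopening it, is what narrows the check-then-use window.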
Removed
- BREAKING: Removed `DG_UNSAFE_SHARED_CACHE` environment variable
- BREAKING: Removed `DG_CACHE_DIR` environment variable
- BREAKING: Removed `DG_CACHE_ENCRYPTION` environment variable (encryption always on)
- BREAKING: Removed `cache_dir` parameter from `create_client()`
Changed
- Cache is now auto-created in `/tmp/deltaglider-*` and cleaned on exit
- All cache operations use file locking (Unix) and SHA validation
- Added `CacheMissError` and `CacheCorruptionError` exceptions
Added
- New `ContentAddressedCache` adapter in `adapters/cache_cas.py`
- New `EncryptedCache` wrapper in `adapters/cache_encrypted.py`
- New `MemoryCache` adapter in `adapters/cache_memory.py` with LRU eviction
- Self-describing cache structure with SHA256-based filenames
- Configurable cache backends via `DG_CACHE_BACKEND` (filesystem or memory)
- Memory cache size limit via `DG_CACHE_MEMORY_SIZE_MB` (default: 100MB)
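A size-bounded LRU cache of the kind listed above can be sketched in a few lines. The class `LRUBytesCache` and its methods are hypothetical and are not the `MemoryCache` adapter itself; the sketch only illustrates least-recently-used eviction under a byte budget.

```python
from collections import OrderedDict
from typing import Optional


class LRUBytesCache:
    """Hypothetical in-memory cache that evicts least-recently-used entries."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self._size = 0

    def put(self, key: str, data: bytes) -> None:
        if key in self._items:
            # Replacing an entry: release its old size first
            self._size -= len(self._items.pop(key))
        self._items[key] = data
        self._size += len(data)
        # Evict from the oldest end until the byte budget is respected
        while self._size > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self._size -= len(evicted)

    def get(self, key: str) -> Optional[bytes]:
        data = self._items.get(key)
        if data is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return data
```

With a 10-byte budget, inserting three 5-byte entries evicts whichever of the first two was touched least recently.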
Internal
- Updated all tests to use Content-Addressed Storage and encryption
- All 119 tests passing with zero errors (99 original + 20 new cache tests)
- Type checking: 0 errors (mypy)
- Linting: All checks passed (ruff)
- Completed Phase 1, 2, and 7 of SECURITY_FIX_ROADMAP.md
- Added comprehensive test suites for encryption (13 tests) and memory cache (10 tests)
5.0.1 - 2025-01-10
Changed
- Code Organization: Refactored client.py from 1560 to 1154 lines (26% reduction)
- Extracted client operations into modular `client_operations/` package:
- `bucket.py` - S3 bucket management operations
- `presigned.py` - Presigned URL generation
- `batch.py` - Batch upload/download operations
- `stats.py` - Analytics and statistics operations
- Improved code maintainability with logical separation of concerns
- Better developer experience with cleaner module structure
Internal
- Full type safety maintained with mypy (0 errors)
- All 99 tests passing
- Code quality checks passing (ruff)
- No breaking changes - all public APIs remain unchanged
5.0.0 - 2025-01-10
Added
- boto3-compatible TypedDict types for S3 responses (no boto3 import needed)
- Complete boto3 compatibility vision document
- Type-safe response builders using TypedDict patterns
Changed
- BREAKING: `list_objects()` now returns boto3-compatible dict instead of custom dataclass
- Use `response['Contents']` instead of `response.contents`
- Use `response.get('IsTruncated')` instead of `response.is_truncated`
- Use `response.get('NextContinuationToken')` instead of `response.next_continuation_token`
- DeltaGlider metadata now in `Metadata` field of each object
- Internal response building now uses TypedDict for compile-time type safety
- All S3 responses are dicts at runtime (TypedDict is a dict!)
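The dict-based access pattern from this breaking change can be illustrated with a hand-built response. The TypedDict classes and sample values below are assumptions made for the example (the library's own type names are not given here); only the `Contents`, `IsTruncated`, `NextContinuationToken`, and `Metadata` field names come from the entries above.

```python
from typing import List, TypedDict


class S3Object(TypedDict, total=False):
    """Hypothetical per-object shape, for illustration only."""
    Key: str
    Size: int
    Metadata: dict


class ListObjectsResponse(TypedDict, total=False):
    """Hypothetical response shape mirroring the boto3 field names."""
    Contents: List[S3Object]
    IsTruncated: bool
    NextContinuationToken: str


# A hand-built response standing in for what list_objects() would return
response: ListObjectsResponse = {
    "Contents": [{"Key": "releases/app-1.0.zip", "Size": 1024, "Metadata": {}}],
    "IsTruncated": False,
}

keys = [obj["Key"] for obj in response["Contents"]]  # was response.contents
truncated = response.get("IsTruncated")              # was response.is_truncated
token = response.get("NextContinuationToken")        # None when absent
```

Using `.get()` for the optional pagination fields avoids a `KeyError` on the last page, where no continuation token is present.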
Fixed
- Updated all documentation examples to use dict-based responses
- Fixed pagination examples in README and API docs
- Corrected SDK documentation with accurate method signatures
4.2.4 - 2025-01-10
Fixed
- Show only filename in `ls` output instead of full path for cleaner display
- Correct `ls` command path handling and prefix display logic
4.2.3 - 2025-01-07
Added
- Comprehensive test coverage for `delete_objects_recursive()` method with 19 thorough tests
- Tests cover delta suffix handling, error/warning aggregation, statistics tracking, and edge cases
- Better code organization with separate `client_models.py` and `client_delete_helpers.py` modules
Fixed
- Fixed all mypy type errors using proper `cast()` for type safety
- Improved type hints for dictionary operations in client code
Changed
- Refactored client code into logical modules for better maintainability
- Enhanced code quality with comprehensive linting and type checking
- All 99 integration/unit tests passing with zero type errors
Internal
- Better separation of concerns in client module
- Improved developer experience with clearer code structure
4.2.2 - 2024-10-06
Fixed
- Add `.delta` suffix fallback for `delete_object()` method
- Handle regular S3 objects without DeltaGlider metadata
- Update mypy type ignore comment for compatibility
4.2.1 - 2024-10-06
Fixed
- Make GitHub release creation non-blocking in workflows
4.2.0 - 2024-10-03
Added
- AWS credential parameters to `create_client()` function
- Support for custom endpoint URLs
- Enhanced boto3 compatibility
4.1.0 - 2024-09-29
Added
- boto3-compatible client API
- Bucket management methods
- Comprehensive SDK documentation
4.0.0 - 2024-09-21
Added
- Initial public release
- CLI with AWS S3 compatibility
- Delta compression for versioned artifacts
- 99%+ compression for similar files