deltaglider/CHANGELOG.md
Simone Scarduzio d81240be80 fix(metadata): align direct-upload keys to canonical dg-* namespace (#8)
* fix(metadata): align direct-upload keys to canonical dg-* namespace

`_upload_direct` (the path taken by non-delta-eligible files like
.sha1 / .sha512) wrote user-metadata with bare underscored keys
(`original_name`, `file_sha256`, `compression`) while delta and
reference uploads correctly used the canonical dashed namespace
(`dg-original-name`, `dg-file-sha256`, `dg-compression`).

Downstream consumers — most visibly the DeltaGlider Proxy — only
recognised the dashed form, so every .sha1 / .sha512 listing on
a bucket holding deltaglider-uploaded files produced:

    WARN PATHOLOGICAL | Missing/corrupt DG metadata for
    bucket/key.sha1 -- falling back to passthrough.
    Error: Storage error: Missing dg-original-name

This patch aligns the writer to the canonical scheme and keeps the
read path backward-compatible with already-stored bare-keyed objects
via `resolve_metadata`. No re-upload required.

Changes
-------
* `_upload_direct` emits metadata using `f"{METADATA_PREFIX}{key}"`
  (the same pattern delta/reference uploads already use).
* `METADATA_KEY_ALIASES` now lists `compression` and `source_name`
  so `resolve_metadata` works for both fields uniformly.
* Replaced bare `metadata.get("compression")` /
  `metadata.get("original_name")` / `metadata.get("file_size")` /
  `metadata.get("ref_key")` lookups in `DeltaService.get`,
  `DeltaService.delete`, `_delete_delta`, the recursive-delete
  listing path, `client.list_objects_v2`, and
  `client_operations.stats.get_object_info` with `resolve_metadata`
  calls so legacy bare-keyed objects keep working forever.

Tests
-----
* `tests/unit/test_metadata_aliases.py` (new, 11 tests) — pins the
  alias table contract: new dashed keys, legacy bare underscored
  keys, legacy hyphenated keys, priority rule, empty-string
  handling.
* `test_direct_upload_emits_dashed_namespace` in
  `tests/unit/test_core_service.py` — pins the writer to emit only
  dg-* keys.
* Existing tests using the legacy bare `compression: "none"` form
  in `test_s3_compat.py` and `test_recursive_delete_reference_*.py`
  still pass — proving the dual-scheme read contract holds.

Full unit suite: 87/87 pass, mypy clean, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metadata): also resolve legacy file_sha256 in get() dispatch

Adversarial review of the original patch caught a second
asymmetry: DeltaService.get's "is this a regular S3 object or
DeltaGlider-managed?" dispatch was a literal-string check
`"dg-file-sha256" not in obj_head.metadata`. After the writer
fix, NEW direct uploads have `dg-file-sha256` so they route
correctly. But ~4400 pre-fix `.sha1` / `.sha512` files in
production have the bare `file_sha256` key, and they were
silently being routed through the "regular S3 object" branch
instead of the "direct upload" branch.

Both branches call `_get_direct` so file content was still
served correctly — but the wrong log message fired
("Downloading regular S3 object (no DeltaGlider metadata)") and
the recorded file-size for telemetry came from obj_head.size
instead of the metadata's `file_size` (same value for direct
uploads, but still semantically wrong).

Swap the literal-string check for `resolve_metadata(meta,
"file_sha256") is None` so both schemes route to the
DeltaGlider-managed branch.

Added regression test
`test_get_legacy_direct_upload_not_misclassified_as_regular_s3`
that builds a HEAD response with the legacy bare-keyed metadata
shape (exactly what's stored on Hetzner today for the .sha files),
captures the log messages, and fails if the "regular S3 object"
canary fires.

Demonstrated locally: revert the dispatch back to literal-string
check → new test fails with the canary log line. Restore →
88/88 pass.

CHANGELOG updated to document both fixes (writer + dispatch).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 10:28:25 +02:00


Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Fixed

  • Direct-upload metadata now uses the canonical dg-* dashed namespace. Pre-fix, files routed through _upload_direct (non-delta-eligible extensions: .sha1, .sha512, etc.) wrote metadata with bare underscored keys (original_name, file_sha256, compression) while delta and reference uploads correctly used the namespaced form (dg-original-name, dg-file-sha256, dg-compression). Downstream consumers — most visibly the DeltaGlider Proxy — only recognised the dashed form, so every .sha1/.sha512 listing triggered a PATHOLOGICAL | Missing/corrupt DG metadata warning. Aligned the writer to the canonical scheme so new uploads stop producing log spam.
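
The writer-side change can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes `METADATA_PREFIX` is the string `"dg-"` and that underscores in field names become dashes, and the helper `build_direct_metadata` is hypothetical.

```python
METADATA_PREFIX = "dg-"  # assumed value, inferred from the dg-* key names above


def build_direct_metadata(fields: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: map bare field names to canonical dg-* keys."""
    return {
        f"{METADATA_PREFIX}{name.replace('_', '-')}": value
        for name, value in fields.items()
    }


# Metadata in the pre-fix bare underscored shape (values are placeholders).
legacy_style = {
    "original_name": "app.zip.sha1",
    "file_sha256": "ab12",
    "compression": "none",
}
print(build_direct_metadata(legacy_style))
```

With a single prefixing step like this, direct uploads would emit the same key shape that delta and reference uploads already use.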

Changed

  • Read path now resolves both schemes uniformly. The historical bare keys (original_name, compression, etc.) stay in METADATA_KEY_ALIASES so already-stored objects keep being recognised on read — no migration required. Replaced ad-hoc metadata.get("compression") / metadata.get("original_name") / metadata.get("file_size") / metadata.get("ref_key") lookups in DeltaService.get, DeltaService.delete, _delete_delta, the recursive-delete listing path, client.list_objects_v2, and client_operations.stats.get_object_info with resolve_metadata(meta, field) calls so both schemes work transparently for the lifetime of the bucket. New compression and source_name entries added to the alias table.
  • DeltaService.get "regular S3 vs DeltaGlider-managed" dispatch now uses resolve_metadata for the file_sha256 presence check. Pre-fix, this check looked for the literal string "dg-file-sha256" in obj_head.metadata, which silently misclassified legacy bare-keyed direct uploads (file_sha256 without the dg- prefix) as "regular S3 objects" — they still served correctly because both branches call _get_direct, but the wrong log line fired and the wrong file_size value was recorded for telemetry. Caught during adversarial PR review.
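
The dual-scheme read path and the presence check described above might look roughly like this. It is a sketch under assumptions: the alias table contents and ordering shown here are illustrative, and `is_deltaglider_managed` is a hypothetical name for the dispatch check, not a function confirmed by the project.

```python
from typing import Optional

# Assumed shape of the alias table: canonical dashed key first, legacy forms after.
METADATA_KEY_ALIASES: dict[str, tuple[str, ...]] = {
    "file_sha256": ("dg-file-sha256", "file-sha256", "file_sha256"),
    "original_name": ("dg-original-name", "original-name", "original_name"),
    "compression": ("dg-compression", "compression"),
}


def resolve_metadata(metadata: dict[str, str], field: str) -> Optional[str]:
    """Return the first non-empty value among the aliases for `field`, else None."""
    for key in METADATA_KEY_ALIASES.get(field, (field,)):
        value = metadata.get(key)
        if value:  # empty strings count as missing
            return value
    return None


def is_deltaglider_managed(metadata: dict[str, str]) -> bool:
    """Hypothetical dispatch check: either key scheme marks the object as managed."""
    return resolve_metadata(metadata, "file_sha256") is not None


# A legacy bare-keyed object is recognised, and the dashed key wins when both exist.
print(is_deltaglider_managed({"file_sha256": "abc"}))
print(resolve_metadata({"dg-file-sha256": "new", "file_sha256": "old"}, "file_sha256"))
```

Checking presence through the resolver rather than a literal key name keeps the writer scheme and the dispatch logic from drifting apart again.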

Added

  • Regression tests for the dual-scheme contract (tests/unit/test_metadata_aliases.py, 11 tests): every alias resolves, new dashed keys win when both are present, empty strings count as missing, the alias-table shape is pinned (first alias dashed, bare underscored alias always present, compression + source_name present).
  • test_direct_upload_emits_dashed_namespace in test_core_service.py pins the writer to emit dg-*-only metadata so the original underscored regression cannot return.
  • test_get_legacy_direct_upload_not_misclassified_as_regular_s3 in test_core_service.py pins the get() dispatch to route bare-keyed legacy direct uploads through the DeltaGlider-managed branch (not the "regular S3 object" passthrough). Demonstrated to fail without the corresponding resolve_metadata swap, pass with it.

[6.1.1] - 2026-03-23

Fixed

  • S3-Compatible Endpoint Support: Disabled boto3 automatic request checksums (CRC32/CRC64) that were added in boto3 1.36+. S3-compatible stores like Hetzner Object Storage reject these headers with BadRequest, breaking direct (non-delta) file uploads. Sets request_checksum_calculation="when_required" to restore compatibility while still working with AWS S3.
  • CI: LocalStack pinned to 4.4: localstack/localstack:latest now requires a paid license; pinned to last free version across all workflows and docker-compose files.
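
For the checksum fix above, a client for an S3-compatible endpoint could be configured along these lines. This is a sketch, not the project's adapter code: the helper names are hypothetical, and the `request_checksum_calculation` option is only accepted by boto3/botocore 1.36 and later.

```python
def s3_compat_config_kwargs() -> dict[str, str]:
    """Options that make botocore add request checksums only when an operation requires them."""
    return {"request_checksum_calculation": "when_required"}


def make_s3_client(endpoint_url: str):
    """Hypothetical helper: build an S3 client for a non-AWS endpoint (imports boto3 lazily)."""
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        config=Config(**s3_compat_config_kwargs()),
    )
```

Keeping the options in one place makes it straightforward for a unit test to assert that the setting is still applied.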

Changed

  • Dependency Pinning: All runtime dependencies now use major-version upper bounds (boto3>=1.35.0,<2.0.0, etc.) to prevent surprise breaking changes in Docker builds.

Added

  • S3 Compatibility Tests: New test_s3_compat.py unit tests verifying the boto3 client disables automatic checksums and put_object doesn't pass checksum kwargs — regression protection for non-AWS S3 endpoints.
  • Dependency Management Guide: Added quarterly dependency refresh checklist and known compatibility constraints to CLAUDE.md.

6.1.0 - 2025-02-07

Added

  • Bucket ACL Management: New put_bucket_acl() and get_bucket_acl() methods
    • boto3-compatible passthrough to native S3 ACL operations
    • Supports canned ACLs (private, public-read, public-read-write, authenticated-read)
    • Supports grant-based ACLs (GrantRead, GrantWrite, GrantFullControl, etc.)
    • Supports full AccessControlPolicy dict for fine-grained control
    • SDK method count increased from 21 to 23
  • New CLI Commands: deltaglider put-bucket-acl and deltaglider get-bucket-acl
    • Mirrors aws s3api put-bucket-acl / get-bucket-acl syntax
    • Accepts bucket name or s3://bucket URL format
    • JSON output for get-bucket-acl (compatible with AWS CLI)
    • Supports --endpoint-url, --region, --profile flags
  • Docker Publishing: Added GitHub Actions workflow for multi-arch Docker image builds (amd64/arm64)

Changed

  • Refactor: Extracted DeltaGliderConfig dataclass for centralized configuration management
  • Refactor: Introduced typed DeleteResult and RecursiveDeleteResult dataclasses replacing raw dicts
  • Refactor: Centralized S3 metadata key aliases into core/models.py constants
  • Refactor: Extracted helper methods in DeltaService for improved readability

Fixed

  • Removed unused imports flagged by ruff in test files

Documentation

  • Updated BOTO3_COMPATIBILITY.md (coverage 20% → 23%)
  • Updated AWS S3 CLI compatibility docs with ACL command examples
  • Refreshed README with dark mode logo and streamlined content
  • Cleaned up SDK documentation and examples

6.0.0 - 2025-10-17

Added

  • EC2 Region Detection & Cost Optimization
    • Automatic detection of EC2 instance region using IMDSv2
    • Warns when EC2 region ≠ S3 client region (potential cross-region charges)
    • Different warnings for auto-detected vs. explicit --region flag mismatches
    • Green checkmark when regions are aligned (optimal configuration)
    • Can be disabled with DG_DISABLE_EC2_DETECTION=true environment variable
    • Helps users optimize for cost and performance before migration starts
  • New CLI Command: deltaglider migrate for S3-to-S3 bucket migration with compression
    • Supports resume capability (skips already migrated files)
    • Real-time progress tracking with file count and statistics
    • Interactive confirmation prompt (use --yes to skip)
    • Prefix preservation by default (use --no-preserve-prefix to disable)
    • Dry run mode with --dry-run flag
    • Include/exclude pattern filtering
    • Shows compression statistics after migration
    • EC2-aware region logging: Detects EC2 instance and warns about cross-region charges
    • FIXED: Now correctly preserves original filenames during migration
  • S3-to-S3 Recursive Copy: deltaglider cp -r s3://source/ s3://dest/ now supported
    • Automatically uses migration functionality with prefix preservation
    • Applies delta compression during transfer
    • Preserves original filenames correctly
  • Version Command: Added --version flag to show deltaglider version
    • Usage: deltaglider --version
  • DeltaService API Enhancement: Added override_name parameter to put() method
    • Allows specifying destination filename independently of source filesystem path
    • Enables proper S3-to-S3 transfers without filesystem renaming tricks
  • Rehydration & Purge: Automatic rehydration of delta-compressed files for presigned URL access
    • New deltaglider purge CLI command to clean expired temporary files
  • Metadata Namespace: Centralized dg- prefixed metadata keys for all DeltaGlider metadata
  • S3-Based Stats Caching: Bucket statistics cached in S3 with automatic invalidation
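
The EC2 region detection listed above relies on IMDSv2, which requires a session token before metadata can be read. A minimal sketch of that flow follows; the function names and the warning text are hypothetical, while the IMDS endpoints and headers are the standard AWS ones.

```python
import os
import urllib.request
from typing import Optional

IMDS_BASE = "http://169.254.169.254/latest"


def detect_ec2_region(timeout: float = 1.0) -> Optional[str]:
    """Return the EC2 instance region via IMDSv2, or None when unavailable or disabled."""
    if os.environ.get("DG_DISABLE_EC2_DETECTION", "").lower() == "true":
        return None
    try:
        # Step 1: obtain a short-lived session token (IMDSv2 requires a PUT).
        token_req = urllib.request.Request(
            f"{IMDS_BASE}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        # Step 2: read the placement region using that token.
        region_req = urllib.request.Request(
            f"{IMDS_BASE}/meta-data/placement/region",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(region_req, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None


def region_warning(ec2_region: Optional[str], client_region: str) -> Optional[str]:
    """Return a warning string when the regions differ, else None (wording is illustrative)."""
    if ec2_region is None or ec2_region == client_region:
        return None
    return (
        f"EC2 region {ec2_region} differs from S3 client region {client_region}; "
        "cross-region transfer charges may apply"
    )
```

Off EC2 the token request simply fails or times out, so the function returns None and no warning is produced.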

Fixed

  • Critical: S3-to-S3 migration now preserves original filenames
    • Previously created files with temp names like tmp1b9cpdsn.zip
    • Now correctly uses original filenames from source S3 keys
    • Fixed by adding override_name parameter to DeltaService.put()
  • CLI Region Support: --region flag now properly passes region to boto3 client
    • Previously only set environment variable, relied on boto3 auto-detection
    • Now explicitly passes region_name to boto3.client() via boto3_kwargs
    • Ensures consistent behavior with DeltaGliderClient SDK

Changed

  • Recursive S3-to-S3 copy operations now preserve source prefix structure by default
  • Migration operations show formatted output with source and destination paths

Documentation

  • Added comprehensive migration guide in README.md
  • Updated CLI reference with migrate command examples
  • Added prefix preservation behavior documentation

[5.1.1] - 2025-01-10

Fixed

  • Stats Command: Fixed incorrect compression ratio calculations
    • Now correctly counts ALL files including reference.bin in compressed size
    • Fixed handling of orphaned reference.bin files (reference files with no delta files)
    • Added prominent warnings for orphaned reference files with cleanup commands
    • Fixed stats for buckets with no compression (now shows 0% instead of negative)
    • SHA1 checksum files are now properly included in calculations

Improved

  • Stats Performance: Optimized metadata fetching with parallel requests
    • 5-10x faster for buckets with many delta files
    • Uses ThreadPoolExecutor for concurrent HEAD requests
    • Single-pass calculation algorithm for better efficiency
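
The concurrent HEAD pattern mentioned above can be illustrated with a small sketch. The function `fetch_metadata_parallel` and the `fake_head` stand-in are hypothetical; a real implementation would call the storage backend instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_metadata_parallel(
    keys: list[str],
    head: Callable[[str], dict[str, str]],
    max_workers: int = 10,
) -> dict[str, dict[str, str]]:
    """Issue one HEAD-style call per key concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its own result.
        return dict(zip(keys, pool.map(head, keys)))


def fake_head(key: str) -> dict[str, str]:
    """Stand-in for an S3 HEAD request; a real one would be network-bound."""
    return {"dg-file-size": str(len(key))}


print(fetch_metadata_parallel(["a.zip.delta", "b.zip.delta"], fake_head))
```

Threads suit this workload because each request spends most of its time waiting on the network rather than on the CPU.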

5.1.0 - 2025-10-10

Added

  • New CLI Command: deltaglider stats <bucket> for bucket statistics and compression metrics
    • Supports --detailed flag for comprehensive analysis
    • Supports --json flag for machine-readable output
    • Accepts multiple formats: s3://bucket/, s3://bucket, bucket
  • Session-Level Statistics Caching: Bucket stats now cached per client instance
    • Automatic cache invalidation on mutations (put, delete, bucket operations)
    • Intelligent cache reuse (detailed stats serve quick stat requests)
    • Enhanced list_buckets() includes cached stats when available
  • Programmatic Cache Management: Added cache management APIs for long-running applications
    • clear_cache(): Clear all cached references
    • evict_cache(): Remove specific cached reference
    • Session-scoped cache lifecycle management

Changed

  • Bucket statistics are now cached within client session for performance
  • list_buckets() response includes DeltaGliderStats metadata when cached

Documentation

  • Added comprehensive DG_MAX_RATIO tuning guide in docs/
  • Updated CLI command reference in CLAUDE.md and README.md
  • Added detailed cache management documentation

5.0.3 - 2025-10-10

Security

  • BREAKING: Removed all legacy shared cache code for security
  • BREAKING: Encryption is now ALWAYS ON (cannot be disabled)
  • Ephemeral process-isolated cache is now the ONLY mode (no opt-out)
  • Content-Addressed Storage (CAS): Implemented SHA256-based cache storage
    • Negligible collision risk (entries are addressed by the SHA256 digest of their content)
    • Automatic deduplication (same content = same filename)
    • Tampering protection (changing content changes SHA, breaks lookup)
    • Two-level directory structure for filesystem optimization
  • Encrypted Cache: All cache data encrypted at rest using Fernet (AES-128-CBC + HMAC)
    • Ephemeral encryption keys per process (forward secrecy)
    • Optional persistent keys via DG_CACHE_ENCRYPTION_KEY for shared filesystems
    • Automatic cleanup of corrupted cache files on decryption failures
  • Fixed TOCTOU vulnerabilities with atomic SHA validation at use-time
  • Added get_validated_ref() method to prevent cache poisoning
  • Eliminated multi-user data exposure through mandatory cache isolation
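
The content-addressed layout and use-time validation described above can be sketched as follows. The exact directory split and the function names are assumptions made for illustration, and encryption is left out to keep the example dependency-free.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_path(root: Path, sha256_hex: str) -> Path:
    """Two-level layout root/ab/cd/abcd... (the 2+2 character split is an assumption)."""
    return root / sha256_hex[:2] / sha256_hex[2:4] / sha256_hex


def store(root: Path, data: bytes) -> str:
    """Write data under its own digest; identical content maps to the same file."""
    digest = hashlib.sha256(data).hexdigest()
    path = cas_path(root, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return digest


def read_validated(root: Path, sha256_hex: str) -> bytes:
    """Hash the bytes actually read, so a file swapped after an earlier check is rejected."""
    data = cas_path(root, sha256_hex).read_bytes()
    if hashlib.sha256(data).hexdigest() != sha256_hex:
        raise ValueError("cache entry does not match its SHA256")
    return data


with tempfile.TemporaryDirectory() as tmp:
    digest = store(Path(tmp), b"reference bytes")
    assert read_validated(Path(tmp), digest) == b"reference bytes"
```

Validating the same bytes that are returned to the caller closes the gap between checking a file and using it.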

Removed

  • BREAKING: Removed DG_UNSAFE_SHARED_CACHE environment variable
  • BREAKING: Removed DG_CACHE_DIR environment variable
  • BREAKING: Removed DG_CACHE_ENCRYPTION environment variable (encryption always on)
  • BREAKING: Removed cache_dir parameter from create_client()

Changed

  • Cache is now auto-created in /tmp/deltaglider-* and cleaned on exit
  • All cache operations use file locking (Unix) and SHA validation
  • Added CacheMissError and CacheCorruptionError exceptions

Added

  • New ContentAddressedCache adapter in adapters/cache_cas.py
  • New EncryptedCache wrapper in adapters/cache_encrypted.py
  • New MemoryCache adapter in adapters/cache_memory.py with LRU eviction
  • Self-describing cache structure with SHA256-based filenames
  • Configurable cache backends via DG_CACHE_BACKEND (filesystem or memory)
  • Memory cache size limit via DG_CACHE_MEMORY_SIZE_MB (default: 100MB)
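
A size-bounded in-memory cache with LRU eviction, as listed above, could be approximated like this. The class `LruBytesCache` is a hypothetical illustration and does not reflect the actual `MemoryCache` adapter's interface.

```python
from collections import OrderedDict
from typing import Optional


class LruBytesCache:
    """Keep entries until the total byte size exceeds max_bytes, then evict oldest-used first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self._items: OrderedDict[str, bytes] = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = value
        self.used += len(value)
        # Evict from the least recently used end until back under the limit.
        while self.used > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
```

Under this sketch, the documented 100MB default would correspond to `LruBytesCache(100 * 1024 * 1024)`.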

Internal

  • Updated all tests to use Content-Addressed Storage and encryption
  • All 119 tests passing with zero errors (99 original + 20 new cache tests)
  • Type checking: 0 errors (mypy)
  • Linting: All checks passed (ruff)
  • Completed Phase 1, 2, and 7 of SECURITY_FIX_ROADMAP.md
  • Added comprehensive test suites for encryption (13 tests) and memory cache (10 tests)

5.0.1 - 2025-01-10

Changed

  • Code Organization: Refactored client.py from 1560 to 1154 lines (26% reduction)
  • Extracted client operations into modular client_operations/ package:
    • bucket.py - S3 bucket management operations
    • presigned.py - Presigned URL generation
    • batch.py - Batch upload/download operations
    • stats.py - Analytics and statistics operations
  • Improved code maintainability with logical separation of concerns
  • Better developer experience with cleaner module structure

Internal

  • Full type safety maintained with mypy (0 errors)
  • All 99 tests passing
  • Code quality checks passing (ruff)
  • No breaking changes - all public APIs remain unchanged

5.0.0 - 2025-01-10

Added

  • boto3-compatible TypedDict types for S3 responses (no boto3 import needed)
  • Complete boto3 compatibility vision document
  • Type-safe response builders using TypedDict patterns

Changed

  • BREAKING: list_objects() now returns boto3-compatible dict instead of custom dataclass
    • Use response['Contents'] instead of response.contents
    • Use response.get('IsTruncated') instead of response.is_truncated
    • Use response.get('NextContinuationToken') instead of response.next_continuation_token
    • DeltaGlider metadata now in Metadata field of each object
  • Internal response building now uses TypedDict for compile-time type safety
  • All S3 responses are dicts at runtime (TypedDict is a dict!)
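
Consuming the dict-based responses described above might look like the following sketch. The `iter_keys` helper and the `fake_list_objects` stand-in are hypothetical, and the `ContinuationToken` keyword is assumed to mirror boto3's `list_objects_v2`.

```python
from typing import Any, Callable, Iterator, Optional


def iter_keys(list_page: Callable[..., dict[str, Any]], **kwargs: Any) -> Iterator[str]:
    """Yield every object key, following NextContinuationToken across pages."""
    token: Optional[str] = None
    while True:
        page_kwargs = dict(kwargs)
        if token:
            page_kwargs["ContinuationToken"] = token
        response = list_page(**page_kwargs)
        for obj in response.get("Contents", []):
            yield obj["Key"]
        token = response.get("NextContinuationToken")
        # Stop when the listing is complete or no token was provided to continue.
        if not response.get("IsTruncated") or not token:
            return


def fake_list_objects(Bucket: str, ContinuationToken: Optional[str] = None) -> dict[str, Any]:
    """Stand-in for a paginated listing call that returns boto3-shaped dicts."""
    pages: dict[Optional[str], dict[str, Any]] = {
        None: {"Contents": [{"Key": "a.zip"}], "IsTruncated": True, "NextContinuationToken": "t1"},
        "t1": {"Contents": [{"Key": "b.zip"}], "IsTruncated": False},
    }
    return pages[ContinuationToken]


print(list(iter_keys(fake_list_objects, Bucket="releases")))
```

Using `.get()` for the optional fields avoids a KeyError on the final page, where the truncation fields may be absent.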

Fixed

  • Updated all documentation examples to use dict-based responses
  • Fixed pagination examples in README and API docs
  • Corrected SDK documentation with accurate method signatures

4.2.4 - 2025-01-10

Fixed

  • Show only filename in ls output instead of full path for cleaner display
  • Correct ls command path handling and prefix display logic

4.2.3 - 2025-01-07

Added

  • Comprehensive test coverage for delete_objects_recursive() method with 19 thorough tests
  • Tests cover delta suffix handling, error/warning aggregation, statistics tracking, and edge cases
  • Better code organization with separate client_models.py and client_delete_helpers.py modules

Fixed

  • Fixed all mypy type errors using proper cast() for type safety
  • Improved type hints for dictionary operations in client code

Changed

  • Refactored client code into logical modules for better maintainability
  • Enhanced code quality with comprehensive linting and type checking
  • All 99 integration/unit tests passing with zero type errors

Internal

  • Better separation of concerns in client module
  • Improved developer experience with clearer code structure

4.2.2 - 2024-10-06

Fixed

  • Add .delta suffix fallback for delete_object() method
  • Handle regular S3 objects without DeltaGlider metadata
  • Update mypy type ignore comment for compatibility

4.2.1 - 2024-10-06

Fixed

  • Make GitHub release creation non-blocking in workflows

4.2.0 - 2024-10-03

Added

  • AWS credential parameters to create_client() function
  • Support for custom endpoint URLs
  • Enhanced boto3 compatibility

4.1.0 - 2024-09-29

Added

  • boto3-compatible client API
  • Bucket management methods
  • Comprehensive SDK documentation

4.0.0 - 2024-09-21

Added

  • Initial public release
  • CLI with AWS S3 compatibility
  • Delta compression for versioned artifacts
  • 99%+ compression for similar files