Commit Graph

139 Commits

Author SHA1 Message Date
Simone Scarduzio d81240be80 fix(metadata): align direct-upload keys to canonical dg-* namespace (#8)
* fix(metadata): align direct-upload keys to canonical dg-* namespace

`_upload_direct` (the path taken by non-delta-eligible files like
.sha1 / .sha512) wrote user-metadata with bare underscored keys
(`original_name`, `file_sha256`, `compression`) while delta and
reference uploads correctly used the canonical dashed namespace
(`dg-original-name`, `dg-file-sha256`, `dg-compression`).

Downstream consumers — most visibly the DeltaGlider Proxy — only
recognised the dashed form, so every .sha1 / .sha512 listing on
a bucket holding deltaglider-uploaded files produced:

    WARN PATHOLOGICAL | Missing/corrupt DG metadata for
    bucket/key.sha1 -- falling back to passthrough.
    Error: Storage error: Missing dg-original-name

This patch aligns the writer to the canonical scheme and keeps the
read path backward-compatible with already-stored bare-keyed objects
via `resolve_metadata`. No re-upload required.

Changes
-------
* `_upload_direct` emits metadata using `f"{METADATA_PREFIX}{key}"`
  (the same pattern delta/reference uploads already use).
* `METADATA_KEY_ALIASES` now lists `compression` and `source_name`
  so `resolve_metadata` works for both fields uniformly.
* Replaced bare `metadata.get("compression")` /
  `metadata.get("original_name")` / `metadata.get("file_size")` /
  `metadata.get("ref_key")` lookups in `DeltaService.get`,
  `DeltaService.delete`, `_delete_delta`, the recursive-delete
  listing path, `client.list_objects_v2`, and
  `client_operations.stats.get_object_info` with `resolve_metadata`
  calls so legacy bare-keyed objects keep working forever.

Tests
-----
* `tests/unit/test_metadata_aliases.py` (new, 11 tests) — pins the
  alias table contract: new dashed keys, legacy bare underscored
  keys, legacy hyphenated keys, priority rule, empty-string
  handling.
* `test_direct_upload_emits_dashed_namespace` in
  `tests/unit/test_core_service.py` — pins the writer to emit only
  dg-* keys.
* Existing tests using the legacy bare `compression: "none"` form
  in `test_s3_compat.py` and `test_recursive_delete_reference_*.py`
  still pass — proving the dual-scheme read contract holds.

Full unit suite: 87/87 pass, mypy clean, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metadata): also resolve legacy file_sha256 in get() dispatch

Adversarial review of the original patch caught a second
asymmetry: DeltaService.get's "is this a regular S3 object or
DeltaGlider-managed?" dispatch was a literal-string check
`"dg-file-sha256" not in obj_head.metadata`. After the writer
fix, NEW direct uploads have `dg-file-sha256` so they route
correctly. But ~4400 pre-fix `.sha1` / `.sha512` files in
production have the bare `file_sha256` key, and they were
silently being routed through the "regular S3 object" branch
instead of the "direct upload" branch.

Both branches call `_get_direct` so file content was still
served correctly — but the wrong log message fired
("Downloading regular S3 object (no DeltaGlider metadata)") and
the recorded file-size for telemetry came from obj_head.size
instead of the metadata's `file_size` (same value for direct
uploads, but still semantically wrong).

Swap the literal-string check for `resolve_metadata(meta,
"file_sha256") is None` so both schemes route to the
DeltaGlider-managed branch.

Added regression test `test_get_legacy_direct_upload_not_
misclassified_as_regular_s3` that builds a HEAD response with
the legacy bare-keyed metadata shape (exactly what's stored on
Hetzner today for the .sha files), captures the log messages,
and fails if the "regular S3 object" canary fires.

Demonstrated locally: revert the dispatch back to literal-string
check → new test fails with the canary log line. Restore →
88/88 pass.

CHANGELOG updated to document both fixes (writer + dispatch).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v6.1.2
2026-05-17 10:28:25 +02:00
Simone Scarduzio a98fc7c178 style: format storage_s3.py for ruff format compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v6.1.1
2026-03-23 19:15:09 +01:00
Simone Scarduzio 82e00623de fix: verbose diagnostic logging on put_object retries
On retry: logs bucket, key, body size, content type, metadata keys,
endpoint URL, HTTP status, error code/message, request ID, and full
HTTP response headers. Enables botocore DEBUG logging for wire-level
HTTP traces on subsequent retry attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 19:08:35 +01:00
Simone Scarduzio e8c76f1dc7 fix: add retry with backoff for put_object on transient S3 failures
S3-compatible endpoints (Hetzner) occasionally return transient
BadRequest errors. Retries up to 3 times with exponential backoff
(1s, 2s) before giving up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 19:02:51 +01:00
Simone Scarduzio c492a5087b feat: log deltaglider version on every CLI invocation
Helps verify which version is running in Docker containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 13:58:15 +01:00
Simone Scarduzio 85af5a95c8 docs: update CHANGELOG for v6.1.1 release
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 11:52:49 +01:00
Simone Scarduzio 60b70309fa fix: pin LocalStack to 4.4 (latest now requires paid license)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 11:50:05 +01:00
Simone Scarduzio b0699f952a fix: disable boto3 auto-checksums for S3-compatible endpoint support
boto3 1.36+ sends CRC32/CRC64 checksums by default on PUT requests.
S3-compatible stores like Hetzner Object Storage reject these with
BadRequest, breaking direct (non-delta) file uploads. This sets
request_checksum_calculation="when_required" to restore compatibility
while still working with AWS S3.

Also pins runtime deps to major version ranges and adds S3 compat tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 11:45:05 +01:00
Simone Scarduzio 9bfe121f44 style: format files for ruff format --check compliance
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v6.1.0
2026-02-07 16:02:40 +01:00
Simone Scarduzio 6cab3de9a0 fix: disable sha tag on tag pushes to avoid invalid Docker tag
The sha tag template `prefix={{branch}}-` produces `:-hash` on tag
pushes because {{branch}} is empty, resulting in an invalid Docker
tag like `beshultd/deltaglider:-482f45f`. Only emit sha tags on
branch pushes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 15:57:37 +01:00
Simone Scarduzio 482f45fc02 docs: update CHANGELOG for v6.1.0 release
Add v6.1.0 section with bucket ACL support, Docker publishing,
config/model refactoring. Backfill v6.0.0 section from previously
unreleased entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 15:55:50 +01:00
Simone Scarduzio 6b3245266e feat: add put_bucket_acl and get_bucket_acl support
Add boto3-compatible bucket ACL operations as pure S3 passthroughs,
following the existing create_bucket/delete_bucket pattern. Includes
CLI commands (put-bucket-acl, get-bucket-acl), 7 integration tests,
and documentation updates (method count 21→23).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 15:53:33 +01:00
Simone Scarduzio 20053acb5f fix: remove unused imports flagged by ruff
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 08:48:22 +01:00
Simone Scarduzio 87f425734f refactor: typed result dataclasses, centralized metadata aliases, config extraction
- Replace dict[str,Any] returns in delete/delete_recursive with DeleteResult
  and RecursiveDeleteResult dataclasses for type safety
- Extract _delete_reference/_delete_delta/_classify_objects_for_deletion
  helper methods from oversized delete methods in service.py
- Centralize metadata key aliases in METADATA_KEY_ALIASES dict with
  resolve_metadata() replacing duplicated _meta_value() lookups
- Add DeltaGliderConfig dataclass with from_env() for centralized config
- Add ObjectKey.full_key property, remove dead _multipart_uploads dict
- Update all consumers (client, CLI, tests) for dataclass access patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 23:16:57 +01:00
Simone Scarduzio 012662c377 updates 2025-11-11 17:20:43 +01:00
Simone Scarduzio 284f030fae updates to docs 2025-11-11 17:05:50 +01:00
Simone Scarduzio 7a4d30a007 freshen up 2025-11-11 11:18:06 +01:00
Simone Scarduzio 0d46283ff0 width 2025-11-11 09:55:52 +01:00
Simone Scarduzio 805e2967bc dark mode 2025-11-11 09:53:54 +01:00
Simone Scarduzio 2ef1741d51 freshen up readme 2025-11-11 09:48:34 +01:00
Simone Scarduzio 2c1d756e7b tweak readme 2025-11-06 16:14:29 +01:00
Simone Scarduzio c6cee7ae26 docker 2025-11-06 15:56:15 +01:00
Simone Scarduzio cee9a9fd2d higher limits why not v6.0.2 2025-10-17 18:43:46 +02:00
Simone Scarduzio 0507e6ebcd format 2025-10-16 17:14:37 +02:00
Simone Scarduzio fa9c4fa42d feat: Implement rehydration and purge functionality for deltaglider files
- Added `rehydrate_for_download` method to download and decompress deltaglider-compressed files, re-uploading them with expiration metadata.
- Introduced `generate_presigned_url_with_rehydration` method to generate presigned URLs that automatically handle rehydration for both regular and deltaglider files.
- Implemented `purge_temp_files` command in CLI to delete expired temporary files from the .deltaglider/tmp/ directory, with options for dry run and JSON output.
- Enhanced service methods to support the new rehydration and purging features, including detailed logging and metrics tracking.
2025-10-16 17:02:00 +02:00
Simone Scarduzio 934d83975c fix: format models.py v6.0.1 2025-10-16 11:21:33 +02:00
Simone Scarduzio c32d5265d9 feat: Enhance metadata handling and bucket statistics
- Added object_limit_reached attribute to BucketStats for tracking limits.
- Introduced QUICK_LIST_LIMIT and SAMPLED_LIST_LIMIT constants to manage listing limits.
- Implemented _first_metadata_value helper function for improved metadata retrieval.
- Updated get_bucket_stats to log when listing is capped due to limits.
- Refactored DeltaMeta to streamline metadata extraction with error handling.
- Enhanced object listing to support max_objects parameter and limit tracking.
2025-10-16 11:17:13 +02:00
Simone Scarduzio 1cf7e3ad21 import 2025-10-15 18:52:56 +02:00
Simone Scarduzio 9b36087438 not mandatory to have the command metadata field set 2025-10-15 18:16:43 +02:00
Simone Scarduzio 60877966f2 docs: Remove outdated METADATA_ISSUE_DIAGNOSIS.md
This document describes the old metadata format without dg- prefix.
Since v6.0.0 uses the new dg- prefixed format and requires all files
to be re-uploaded (greenfield approach), this diagnosis doc is no longer
relevant.
2025-10-15 11:45:52 +02:00
Simone Scarduzio fbd44ea3c3 style: Format integration test files with ruff v6.0.0 2025-10-15 11:38:17 +02:00
Simone Scarduzio 3f689fc601 fix: Update integration tests for new metadata format and caching behavior
- Fix sync tests: Add list_objects.side_effect = NotImplementedError() to mock
- Fix sync tests: Add side_effect for put() to avoid hanging
- Fix MockStorage: Add continuation_token parameter to list_objects()
- Fix stats tests: Update assertions to include use_cache and refresh_cache params
- Fix bucket management test: Update caching expectations for S3-based cache

All 97 integration tests now pass.
2025-10-15 11:34:43 +02:00
Simone Scarduzio 3753212f96 style: Format test file with ruff
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 11:22:00 +02:00
Simone Scarduzio db7d14f8a8 feat: Add metadata namespace and fix stats calculation
This is a major release with breaking changes to metadata format.

BREAKING CHANGES:
- All metadata keys now use 'dg-' namespace prefix (becomes 'x-amz-meta-dg-*' in S3)
- Old metadata format is not supported - all files must be re-uploaded
- Stats behavior changed: quick mode no longer shows misleading warnings

Features:
- Metadata now uses real package version (dg-tool: deltaglider/VERSION)
- All metadata keys properly namespaced with 'dg-' prefix
- Clean stats output in quick mode (no per-file warning spam)
- Fixed nonsensical negative compression ratios in quick mode

Fixes:
- Stats now correctly handles delta files without metadata
- Space saved shows 0 instead of negative numbers when metadata unavailable
- Removed misleading warnings in quick mode (metadata not fetched is expected)
- Fixed metadata keys to use hyphens instead of underscores

Documentation:
- Added comprehensive metadata documentation
- Added stats calculation behavior guide
- Added real version tracking documentation

Tests:
- Updated all tests to use new dg- prefixed metadata keys
- All 73 unit tests passing
- All quality checks passing (ruff, mypy)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 11:19:10 +02:00
Simone Scarduzio e1259b7ea8 fix: Code quality improvements for v5.2.2 release
- Fix pagination bug using continuation_token instead of start_after
- Add stats caching to prevent blocking web apps
- Improve code formatting and type checking
- Add comprehensive unit tests for new features
- Fix test mock usage in object_listing tests
v5.2.2
2025-10-14 23:54:49 +02:00
Simone Scarduzio ff05e77c24 fix: Prevent get_bucket_stats from blocking web apps indefinitely
**Performance Issues Fixed:**
1. aws_compat.py: Changed to use cached stats only (no bucket scans after uploads)
2. stats.py: Added safety mechanisms to prevent infinite hangs
   - Max 10k iterations (10M object limit)
   - 10 min timeout on metadata fetching
   - Missing pagination token detection
   - Graceful error recovery with partial stats

**Refactoring:**
- Reduced nesting in get_bucket_stats from 5 levels to 2 levels
- Extracted 5 helper functions for better maintainability
- Main function reduced from 300+ lines to 33 lines
- 100% backward compatible - no API changes

**Benefits:**
- Web apps no longer hang on upload/delete operations
- Explicit get_bucket_stats() calls complete within bounded time
- Better error handling and logging
- Easier to test and maintain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
v5.2.1
2025-10-14 14:47:39 +02:00
Simone Scarduzio c3d385bf18 fix tests v5.2.0 2025-10-13 17:26:35 +02:00
Simone Scarduzio aea5cb5d9a feat: Enhance S3 migration CLI with new commands and EC2 detection option 2025-10-12 23:12:32 +02:00
Simone Scarduzio b2ca59490b feat: Add EC2 region detection and cost optimization features 2025-10-12 22:41:48 +02:00
Simone Scarduzio 4f56c4b600 fix: Preserve original filenames during S3-to-S3 migration 2025-10-12 18:10:04 +02:00
Simone Scarduzio 14c6af0f35 handle version in cli 2025-10-12 17:47:05 +02:00
Simone Scarduzio 67792b2031 migrate CLI support 2025-10-12 17:37:44 +02:00
Simone Scarduzio a9a1396e6e style: Format test_stats_algorithm.py with ruff v5.1.1 2025-10-11 14:17:49 +02:00
Simone Scarduzio 52eb5bba21 fix: Fix unit test import issues for concurrent.futures
- Remove unnecessary concurrent.futures patches in tests
- Update test_detailed_stats_flag to match current implementation behavior
- Tests now properly handle parallel metadata fetching without mocking
2025-10-11 14:13:40 +02:00
Simone Scarduzio f75db142e8 fix: Correct logging message formatting in get_bucket_stats and update test assertionsalls for clarity. 2025-10-11 14:05:54 +02:00
Simone Scarduzio 35d34d4862 chore: Update CHANGELOG for v5.1.1 release
- Document stats command fixes
- Document performance improvements
2025-10-10 19:57:11 +02:00
Simone Scarduzio 9230cbd762 test 2025-10-10 19:52:15 +02:00
Simone Scarduzio 2eba6e8d38 optimisation 2025-10-10 19:50:33 +02:00
Simone Scarduzio 656726b57b algorithm correctness 2025-10-10 19:46:39 +02:00
Simone Scarduzio 85dd315424 ruff v5.1.0 v5.0.4 2025-10-10 18:44:46 +02:00