125 Commits

Author SHA1 Message Date
Simone Scarduzio
012662c377 updates 2025-11-11 17:20:43 +01:00
Simone Scarduzio
284f030fae updates to docs 2025-11-11 17:05:50 +01:00
Simone Scarduzio
7a4d30a007 freshen up 2025-11-11 11:18:06 +01:00
Simone Scarduzio
0d46283ff0 width 2025-11-11 09:55:52 +01:00
Simone Scarduzio
805e2967bc dark mode 2025-11-11 09:53:54 +01:00
Simone Scarduzio
2ef1741d51 freshen up readme 2025-11-11 09:48:34 +01:00
Simone Scarduzio
2c1d756e7b tweak readme 2025-11-06 16:14:29 +01:00
Simone Scarduzio
c6cee7ae26 docker 2025-11-06 15:56:15 +01:00
Simone Scarduzio
cee9a9fd2d higher limits why not v6.0.2 2025-10-17 18:43:46 +02:00
Simone Scarduzio
0507e6ebcd format 2025-10-16 17:14:37 +02:00
Simone Scarduzio
fa9c4fa42d feat: Implement rehydration and purge functionality for deltaglider files
- Added `rehydrate_for_download` method to download and decompress deltaglider-compressed files, re-uploading them with expiration metadata.
- Introduced `generate_presigned_url_with_rehydration` method to generate presigned URLs that automatically handle rehydration for both regular and deltaglider files.
- Implemented `purge_temp_files` command in CLI to delete expired temporary files from the .deltaglider/tmp/ directory, with options for dry run and JSON output.
- Enhanced service methods to support the new rehydration and purging features, including detailed logging and metrics tracking.
2025-10-16 17:02:00 +02:00
Simone Scarduzio
934d83975c fix: format models.py v6.0.1 2025-10-16 11:21:33 +02:00
Simone Scarduzio
c32d5265d9 feat: Enhance metadata handling and bucket statistics
- Added object_limit_reached attribute to BucketStats for tracking limits.
- Introduced QUICK_LIST_LIMIT and SAMPLED_LIST_LIMIT constants to manage listing limits.
- Implemented _first_metadata_value helper function for improved metadata retrieval.
- Updated get_bucket_stats to log when listing is capped due to limits.
- Refactored DeltaMeta to streamline metadata extraction with error handling.
- Enhanced object listing to support max_objects parameter and limit tracking.
2025-10-16 11:17:13 +02:00
Simone Scarduzio
1cf7e3ad21 import 2025-10-15 18:52:56 +02:00
Simone Scarduzio
9b36087438 not mandatory to have the command metadata field set 2025-10-15 18:16:43 +02:00
Simone Scarduzio
60877966f2 docs: Remove outdated METADATA_ISSUE_DIAGNOSIS.md
This document describes the old metadata format without dg- prefix.
Since v6.0.0 uses the new dg- prefixed format and requires all files
to be re-uploaded (greenfield approach), this diagnosis doc is no longer
relevant.
2025-10-15 11:45:52 +02:00
Simone Scarduzio
fbd44ea3c3 style: Format integration test files with ruff v6.0.0 2025-10-15 11:38:17 +02:00
Simone Scarduzio
3f689fc601 fix: Update integration tests for new metadata format and caching behavior
- Fix sync tests: Add list_objects.side_effect = NotImplementedError() to mock
- Fix sync tests: Add side_effect for put() to avoid hanging
- Fix MockStorage: Add continuation_token parameter to list_objects()
- Fix stats tests: Update assertions to include use_cache and refresh_cache params
- Fix bucket management test: Update caching expectations for S3-based cache

All 97 integration tests now pass.
2025-10-15 11:34:43 +02:00
Simone Scarduzio
3753212f96 style: Format test file with ruff
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 11:22:00 +02:00
Simone Scarduzio
db7d14f8a8 feat: Add metadata namespace and fix stats calculation
This is a major release with breaking changes to metadata format.

BREAKING CHANGES:
- All metadata keys now use 'dg-' namespace prefix (becomes 'x-amz-meta-dg-*' in S3)
- Old metadata format is not supported - all files must be re-uploaded
- Stats behavior changed: quick mode no longer shows misleading warnings

Features:
- Metadata now uses real package version (dg-tool: deltaglider/VERSION)
- All metadata keys properly namespaced with 'dg-' prefix
- Clean stats output in quick mode (no per-file warning spam)
- Fixed nonsensical negative compression ratios in quick mode

Fixes:
- Stats now correctly handles delta files without metadata
- Space saved shows 0 instead of negative numbers when metadata unavailable
- Removed misleading warnings in quick mode (metadata not fetched is expected)
- Fixed metadata keys to use hyphens instead of underscores
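
The key convention described above can be sketched as follows. This is an illustration only: `namespaced_metadata` and `s3_header_name` are hypothetical helper names, not DeltaGlider functions, and the `file_size` field is used only as an example input.

```python
def namespaced_metadata(fields: dict[str, str], prefix: str = "dg-") -> dict[str, str]:
    """Return a copy of `fields` with hyphenated, prefix-namespaced keys."""
    out: dict[str, str] = {}
    for key, value in fields.items():
        # Hyphens instead of underscores, as the commit message states.
        normalized = key.replace("_", "-").lower()
        if not normalized.startswith(prefix):
            normalized = prefix + normalized
        out[normalized] = str(value)
    return out


def s3_header_name(metadata_key: str) -> str:
    """S3 exposes user metadata as `x-amz-meta-<key>` headers."""
    return "x-amz-meta-" + metadata_key


# Hypothetical input; the version string is a placeholder.
meta = namespaced_metadata({"tool": "deltaglider/6.0.0", "file_size": "1024"})
headers = {s3_header_name(k): v for k, v in meta.items()}
```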

Documentation:
- Added comprehensive metadata documentation
- Added stats calculation behavior guide
- Added real version tracking documentation

Tests:
- Updated all tests to use new dg- prefixed metadata keys
- All 73 unit tests passing
- All quality checks passing (ruff, mypy)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 11:19:10 +02:00
Simone Scarduzio
e1259b7ea8 fix: Code quality improvements for v5.2.2 release
- Fix pagination bug by using continuation_token instead of start_after
- Add stats caching to prevent blocking web apps
- Improve code formatting and type checking
- Add comprehensive unit tests for new features
- Fix test mock usage in object_listing tests
v5.2.2
2025-10-14 23:54:49 +02:00
Simone Scarduzio
ff05e77c24 fix: Prevent get_bucket_stats from blocking web apps indefinitely
**Performance Issues Fixed:**
1. aws_compat.py: Changed to use cached stats only (no bucket scans after uploads)
2. stats.py: Added safety mechanisms to prevent infinite hangs
   - Max 10k iterations (10M object limit)
   - 10 min timeout on metadata fetching
   - Missing pagination token detection
   - Graceful error recovery with partial stats
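
A minimal sketch of a bounded listing loop with the safety mechanisms listed above. It is not the code in stats.py: the function name, the page shape (`objects`, `is_truncated`, `next_token`), and the choice to apply the deadline to the listing loop itself are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def list_all_bounded(
    fetch_page: Callable[[Optional[str]], dict],
    max_iterations: int = 10_000,
    timeout_seconds: float = 600.0,
) -> tuple[list, bool]:
    """Collect objects page by page; return (objects, complete).

    `complete` is False whenever a safety limit cut the listing short,
    so callers can still report partial stats.
    """
    objects: list = []
    token: Optional[str] = None
    deadline = time.monotonic() + timeout_seconds
    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            return objects, False  # timeout: keep what we have
        try:
            page = fetch_page(token)
        except Exception:
            return objects, False  # graceful recovery with partial results
        objects.extend(page["objects"])
        if not page["is_truncated"]:
            return objects, True  # normal end of listing
        token = page.get("next_token")
        if not token:
            return objects, False  # truncated page without a token
    return objects, False  # iteration cap reached
```

With 1,000 objects per page, a cap of 10,000 iterations corresponds to the 10M object limit mentioned above.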

**Refactoring:**
- Reduced nesting in get_bucket_stats from 5 levels to 2 levels
- Extracted 5 helper functions for better maintainability
- Main function reduced from 300+ lines to 33 lines
- 100% backward compatible - no API changes

**Benefits:**
- Web apps no longer hang on upload/delete operations
- Explicit get_bucket_stats() calls complete within bounded time
- Better error handling and logging
- Easier to test and maintain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
v5.2.1
2025-10-14 14:47:39 +02:00
Simone Scarduzio
c3d385bf18 fix tests v5.2.0 2025-10-13 17:26:35 +02:00
Simone Scarduzio
aea5cb5d9a feat: Enhance S3 migration CLI with new commands and EC2 detection option 2025-10-12 23:12:32 +02:00
Simone Scarduzio
b2ca59490b feat: Add EC2 region detection and cost optimization features 2025-10-12 22:41:48 +02:00
Simone Scarduzio
4f56c4b600 fix: Preserve original filenames during S3-to-S3 migration 2025-10-12 18:10:04 +02:00
Simone Scarduzio
14c6af0f35 handle version in cli 2025-10-12 17:47:05 +02:00
Simone Scarduzio
67792b2031 migrate CLI support 2025-10-12 17:37:44 +02:00
Simone Scarduzio
a9a1396e6e style: Format test_stats_algorithm.py with ruff v5.1.1 2025-10-11 14:17:49 +02:00
Simone Scarduzio
52eb5bba21 fix: Fix unit test import issues for concurrent.futures
- Remove unnecessary concurrent.futures patches in tests
- Update test_detailed_stats_flag to match current implementation behavior
- Tests now properly handle parallel metadata fetching without mocking
2025-10-11 14:13:40 +02:00
Simone Scarduzio
f75db142e8 fix: Correct logging message formatting in get_bucket_stats and update test assertions for clarity. 2025-10-11 14:05:54 +02:00
Simone Scarduzio
35d34d4862 chore: Update CHANGELOG for v5.1.1 release
- Document stats command fixes
- Document performance improvements
2025-10-10 19:57:11 +02:00
Simone Scarduzio
9230cbd762 test 2025-10-10 19:52:15 +02:00
Simone Scarduzio
2eba6e8d38 optimisation 2025-10-10 19:50:33 +02:00
Simone Scarduzio
656726b57b algorithm correctness 2025-10-10 19:46:39 +02:00
Simone Scarduzio
85dd315424 ruff v5.1.0 v5.0.4 2025-10-10 18:44:46 +02:00
Simone Scarduzio
dbd2632cae docs: Update SDK documentation for v5.1.0 features
- Add session-level caching documentation to API reference
- Document clear_cache() and evict_cache() methods
- Add comprehensive bucket statistics examples
- Update list_buckets() with DeltaGliderStats metadata
- Add cache management patterns and best practices
- Update CHANGELOG comparison links

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 18:34:44 +02:00
Simone Scarduzio
3d04a407c0 feat: Add stats command with session-level caching (v5.1.0)
New Features:
- Add 'deltaglider stats' CLI command for bucket compression metrics
- Session-level bucket statistics caching for performance
- Enhanced list_buckets() with cached stats metadata

Technical Changes:
- Automatic cache invalidation on bucket mutations
- Intelligent cache reuse (detailed → quick fallback)
- Comprehensive test coverage (106+ new test lines)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 18:30:05 +02:00
Simone Scarduzio
47f022fffe feat: Add programmatic cache management for long-running applications
Implements cache clearing functionality for SDK users who need manual
cache management in long-running applications where automatic cleanup
on process exit is not sufficient.

New Features:
- Added `clear()` method to CachePort protocol
- Implemented `clear()` in all cache adapters:
  * ContentAddressedCache: Clears files and SHA mappings
  * EncryptedCache: Clears encryption mappings and delegates to backend
  * MemoryCache: Already had clear() method
- Added `clear_cache()` method to DeltaGliderClient for public API

Cache Management API:
```python
from deltaglider import create_client

client = create_client()

# Upload files
data = b"example payload"  # placeholder bytes to upload
client.put_object(Bucket='bucket', Key='file.zip', Body=data)

# Clear cache manually (important for long-running apps!)
client.clear_cache()
```

New Documentation:
- docs/CACHE_MANAGEMENT.md (684 lines)
  * Comprehensive guide for programmatic cache management
  * Long-running application strategies (web apps, services, batch jobs)
  * Encryption key management (ephemeral vs. persistent)
  * Key rotation procedures
  * Memory vs. filesystem cache trade-offs
  * Best practices by application type
  * Monitoring and troubleshooting

Key Topics Covered:
- Why SDK requires manual cache management (vs. CLI auto-cleanup)
- When to clear cache (periodic, config changes, tests, etc.)
- Cache strategies for 5 application types:
  * Long-running background services
  * Periodic batch jobs
  * Web applications / API servers
  * Testing / CI/CD
  * AWS Lambda / Serverless
- Encryption key management:
  * Ephemeral keys (default, maximum security)
  * Persistent keys (shared cache scenarios)
  * Key rotation procedures
  * Secure key storage (Secrets Manager)
- Memory vs. filesystem cache selection
- Monitoring cache health
- Troubleshooting common issues

Use Cases:
- Long-running services: Periodic cache clearing to prevent growth
- Batch jobs: Clear cache in finally block
- Tests: Clear cache after each test for clean state
- Multi-process: Shared cache with persistent encryption keys
- High performance: Memory cache with automatic LRU eviction

Security Enhancements:
- Documented encryption key lifecycle management
- Key rotation procedures
- Secure key storage best practices
- Ephemeral vs. persistent key trade-offs

Testing:
- All 119 tests passing 
- Type checking: 0 errors (mypy) 
- Linting: All checks passed (ruff) 

Breaking Changes: None (new API only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 10:34:02 +02:00
Simone Scarduzio
7a2ed16ee7 docs: Add comprehensive DG_MAX_RATIO tuning guide
Created extensive documentation for the DG_MAX_RATIO parameter, which
controls delta compression efficiency thresholds.

New Documentation:
- docs/DG_MAX_RATIO.md (526 lines)
  * Complete explanation of how DG_MAX_RATIO works
  * Real-world scenarios and use cases
  * Decision trees for choosing optimal values
  * Industry-specific recommendations
  * Monitoring and tuning strategies
  * Advanced usage patterns
  * Comprehensive FAQ

Updates to Existing Documentation:
- README.md: Added link to DG_MAX_RATIO guide with tip callout
- CLAUDE.md: Added detailed DG_MAX_RATIO explanation and guide link
- Dockerfile: Added inline comments explaining DG_MAX_RATIO tuning
- docs/sdk/getting-started.md: Added DG_MAX_RATIO guide reference

Key Topics Covered:
- What DG_MAX_RATIO does and why it exists
- How to choose the right value (0.2-0.7 range)
- Real-world scenarios (nightly builds, major versions, etc.)
- Industry-specific use cases (SaaS, mobile apps, backups, etc.)
- Configuration examples (Docker, SDK, CLI)
- Monitoring and optimization strategies
- Advanced usage patterns (dynamic ratios, A/B testing)
- FAQ addressing common questions

Examples Included:
- Conservative (0.2-0.3): For dissimilar files or expensive storage
- Default (0.5): Balanced approach for most use cases
- Permissive (0.6-0.7): For very similar files or cheap storage
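
One way to picture the threshold, as a rough sketch: the comparison below (delta size divided by original size, checked against the ratio) is an assumption about how DG_MAX_RATIO is applied, and both function names are hypothetical. Only the variable name and the 0.5 default come from the text above.

```python
from typing import Mapping

DEFAULT_MAX_RATIO = 0.5  # default value named in the commit message


def max_ratio_from_env(environ: Mapping[str, str]) -> float:
    """Read DG_MAX_RATIO from an environment mapping, falling back to the default."""
    return float(environ.get("DG_MAX_RATIO", DEFAULT_MAX_RATIO))


def should_store_delta(
    delta_size: int, original_size: int, max_ratio: float = DEFAULT_MAX_RATIO
) -> bool:
    """Keep the delta only if it is small enough relative to the original (assumed rule)."""
    if original_size <= 0:
        return False
    return delta_size / original_size <= max_ratio
```

Under this assumed rule, a 60-byte delta of a 100-byte file would be rejected at the default 0.5 and accepted at a permissive 0.7.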

Value Proposition:
- Helps users optimize compression for their specific use case
- Prevents inefficient delta compression
- Provides data-driven tuning methodology
- Reduces support questions about compression behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 10:19:59 +02:00
Simone Scarduzio
5e333254ba docs: Comprehensive environment variable documentation
Added complete documentation for all environment variables across
Dockerfile, README.md, and SDK documentation.

Dockerfile Changes:
- Documented all DeltaGlider environment variables with defaults
- Added AWS configuration variables (commented for runtime override)
- Updated version label to 5.0.3
- Updated description to mention encryption

README.md Changes:
- Added comprehensive Docker Usage section
- Documented all environment variables with examples
- Added Docker examples for:
  * Basic usage with AWS credentials
  * Memory cache configuration for CI/CD
  * MinIO/custom endpoint usage
  * Persistent encryption key setup
- Security notes for encryption and cache behavior

SDK Documentation Changes:
- Added DeltaGlider Configuration section
- Documented all environment variables
- Added configuration examples
- Security notes for encryption behavior

Environment Variables Documented:
- DG_LOG_LEVEL (logging configuration)
- DG_MAX_RATIO (compression threshold)
- DG_CACHE_BACKEND (filesystem or memory)
- DG_CACHE_MEMORY_SIZE_MB (memory cache size)
- DG_CACHE_ENCRYPTION_KEY (optional persistent key)
- AWS_ENDPOINT_URL (custom S3 endpoints)
- AWS_ACCESS_KEY_ID (AWS credentials)
- AWS_SECRET_ACCESS_KEY (AWS credentials)
- AWS_DEFAULT_REGION (AWS region)

Quality Checks:
- All 119 tests passing 
- Type checking: 0 errors (mypy) 
- Linting: All checks passed (ruff) 
- Dockerfile syntax validated 

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
v5.0.3
2025-10-10 10:12:25 +02:00
Simone Scarduzio
04cc984d4a ruff 2025-10-10 10:09:11 +02:00
Simone Scarduzio
ac7d4e067f security: Make encryption always-on with auto-cleanup
BREAKING CHANGES:
- Encryption is now ALWAYS enabled (cannot be disabled)
- Removed DG_CACHE_ENCRYPTION environment variable

Security Enhancements:
- Encryption is mandatory for all cache operations
- Ephemeral encryption keys per process (forward secrecy)
- Automatic deletion of corrupted cache files on decryption failures
- Auto-cleanup on both decryption failures and SHA mismatches

Changes:
- Removed DG_CACHE_ENCRYPTION toggle from CLI and SDK
- Updated EncryptedCache to auto-delete corrupted files
- Simplified cache initialization (always wrapped with encryption)
- DG_CACHE_ENCRYPTION_KEY remains optional for persistent keys

Documentation:
- Updated CLAUDE.md with encryption always-on behavior
- Updated CHANGELOG.md with breaking changes
- Clarified security model and auto-cleanup behavior

Testing:
- All 119 tests passing with encryption always-on
- Type checking: 0 errors (mypy)
- Linting: All checks passed (ruff)

Rationale:
- Zero-trust cache architecture requires encryption
- A corrupted cache is a security risk; auto-deletion prevents exploitation
- Ephemeral keys provide maximum security by default
- Users who need cross-process sharing can opt-in with persistent keys

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 09:51:29 +02:00
Simone Scarduzio
e8fb926fd6 docs: Update SECURITY_FIX_ROADMAP.md - mark encryption complete 2025-10-10 09:40:02 +02:00
Simone Scarduzio
626e28eaf6 feat: Add cache encryption and memory backend support
Implements cache encryption and configurable memory backend as part of
DeltaGlider v5.0.3 security enhancements.

Features:
- EncryptedCache wrapper using Fernet (AES-128-CBC + HMAC)
- Ephemeral encryption keys per process for forward secrecy
- Optional persistent keys via DG_CACHE_ENCRYPTION_KEY env var
- MemoryCache adapter with LRU eviction and configurable size limits
- Configurable cache backend via DG_CACHE_BACKEND (filesystem/memory)
- Encryption enabled by default with opt-out via DG_CACHE_ENCRYPTION=false
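
To illustrate LRU eviction under a size limit, here is a small standalone sketch. It is not the MemoryCache adapter from this commit; the class name `LruBytesCache` and its byte-based budget are assumptions.

```python
from collections import OrderedDict
from typing import Optional


class LruBytesCache:
    """In-memory cache that evicts least-recently-used entries past a byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self._size = 0

    def get(self, key: str) -> Optional[bytes]:
        value = self._items.get(key)
        if value is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: bytes) -> None:
        old = self._items.pop(key, None)
        if old is not None:
            self._size -= len(old)
        if len(value) > self.max_bytes:
            return  # larger than the whole budget: do not cache
        self._items[key] = value
        self._size += len(value)
        while self._size > self.max_bytes:
            _, evicted = self._items.popitem(last=False)  # oldest entry first
            self._size -= len(evicted)

    def clear(self) -> None:
        self._items.clear()
        self._size = 0
```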

Security:
- Data encrypted at rest with authenticated encryption (HMAC)
- Ephemeral keys provide forward secrecy and process isolation
- SHA256 plaintext mapping maintains CAS compatibility
- Zero-knowledge architecture: encryption keys never leave process

Performance:
- Memory cache: zero I/O, perfect for CI/CD pipelines
- LRU eviction prevents memory exhaustion
- ~10-15% encryption overhead, configurable via env vars

Testing:
- Comprehensive encryption test suite (13 tests)
- Memory cache test suite (10 tests)
- All 119 tests passing with encryption enabled

Documentation:
- Updated CLAUDE.md with encryption and cache backend details
- Environment variables documented
- Security notes and performance considerations

Dependencies:
- Added cryptography>=42.0.0 for Fernet encryption

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 09:38:48 +02:00
Simone Scarduzio
90a342dc33 feat: Implement Content-Addressed Storage (CAS) cache
Implemented SHA256-based Content-Addressed Storage to eliminate
cache collisions and enable automatic deduplication.

Key Features:
- Negligible collision risk: SHA256 makes accidental collisions practically impossible
- Automatic deduplication: same content = same filename
- Tampering protection: changing content changes SHA, breaks lookup
- Two-level directory structure (ab/cd/abcdef...) for filesystem optimization
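
The two-level layout can be sketched like this. The helper names are hypothetical and this is not the ContentAddressedCache adapter; using the full digest as the filename is an assumption based on the `ab/cd/abcdef...` pattern above.

```python
import hashlib
from pathlib import Path


def cas_path(cache_root: Path, content: bytes) -> Path:
    """Map content to <root>/<first 2 hex>/<next 2 hex>/<full sha256 hex>."""
    digest = hashlib.sha256(content).hexdigest()
    return cache_root / digest[:2] / digest[2:4] / digest


def cas_write(cache_root: Path, content: bytes) -> Path:
    """Store content under its own hash; identical content lands on the same file."""
    path = cas_path(cache_root, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # deduplication: same content, same filename
        path.write_bytes(content)
    return path
```

Because the filename is derived from the content, writing the same bytes twice yields one file, and any change to the bytes changes the path.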

Changes:
- Added ContentAddressedCache adapter in adapters/cache_cas.py
- Updated CLI and SDK to use CAS instead of FsCacheAdapter
- Updated all tests to use ContentAddressedCache
- Documented CAS architecture in CLAUDE.md and SECURITY_FIX_ROADMAP.md

Security Benefits:
- Eliminates cross-endpoint collision vulnerabilities
- Self-describing cache (filename IS the checksum)
- Natural cache validation without external metadata

All quality checks passing:
- 99 tests passing (0 failures)
- Type checking: 0 errors (mypy)
- Linting: All checks passed (ruff)

Completed Phase 2 of SECURITY_FIX_ROADMAP.md
2025-10-10 09:06:29 +02:00
Simone Scarduzio
f9f2b036e3 docs: Update CHANGELOG.md for v5.0.3 release 2025-10-10 08:57:52 +02:00
Simone Scarduzio
778d7f0148 security: Remove all legacy shared cache code and env vars
BREAKING CHANGE: Removed DG_UNSAFE_SHARED_CACHE and DG_CACHE_DIR
environment variables. DeltaGlider now ONLY uses ephemeral
process-isolated cache for security.

Changes:
- Removed cache_dir parameter from create_client()
- Removed all conditional legacy cache mode logic
- Updated documentation (CLAUDE.md, docs/sdk/api.md)
- Updated tests to not pass removed cache_dir parameter
- Marked Phase 1 of SECURITY_FIX_ROADMAP.md as completed

All 99 tests passing. Ephemeral cache is now the only mode.
2025-10-10 08:56:49 +02:00
Simone Scarduzio
37ea2f138c security: Implement Phase 1 emergency hotfix (v5.0.3)
CRITICAL SECURITY FIXES:

1. Ephemeral Cache Mode (Default)
   - Process-isolated temporary cache directories
   - Automatic cleanup on exit via atexit
   - Prevents multi-user interference and cache poisoning
   - Legacy shared cache requires explicit DG_UNSAFE_SHARED_CACHE=true

2. TOCTOU Vulnerability Fix
   - New get_validated_ref() method with atomic SHA validation
   - File locking on Unix platforms (fcntl)
   - Validates SHA256 at use-time, not just check-time
   - Removes corrupted cache entries automatically
   - Prevents cache poisoning attacks

3. New Cache Error Classes
   - CacheMissError: Cache not found
   - CacheCorruptionError: SHA mismatch or tampering detected
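
A simplified sketch of use-time validation: the bytes that get hashed are the same bytes handed back to the caller, so nothing can change between the check and the use. This is not the `get_validated_ref()` implementation; `read_validated` is a hypothetical name, the error classes are local stand-ins for the ones named above, and the fcntl file locking is omitted.

```python
import hashlib
from pathlib import Path


class CacheMissError(Exception):
    """Stand-in: the cache entry does not exist."""


class CacheCorruptionError(Exception):
    """Stand-in: the cached bytes do not match the expected SHA256."""


def read_validated(path: Path, expected_sha256: str) -> bytes:
    """Read a cache entry once and validate exactly the bytes that are returned."""
    try:
        data = path.read_bytes()
    except FileNotFoundError as exc:
        raise CacheMissError(str(path)) from exc
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        path.unlink(missing_ok=True)  # remove the corrupted entry
        raise CacheCorruptionError(f"{path}: expected {expected_sha256}, got {actual}")
    return data
```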

SECURITY IMPACT:
- Eliminates multi-user cache attacks
- Closes TOCTOU attack window
- Prevents cache poisoning
- Automatic tamper detection

Files Modified:
- src/deltaglider/app/cli/main.py: Ephemeral cache for CLI
- src/deltaglider/client.py: Ephemeral cache for SDK
- src/deltaglider/ports/cache.py: get_validated_ref protocol
- src/deltaglider/adapters/cache_fs.py: TOCTOU-safe implementation
- src/deltaglider/core/service.py: Use validated refs
- src/deltaglider/core/errors.py: Cache error classes

Tests: 99/99 passing (18 unit + 81 integration)

This is the first phase of the security roadmap outlined in
SECURITY_FIX_ROADMAP.md. Addresses CVE-CRITICAL vulnerabilities
in cache system.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 08:44:41 +02:00
Simone Scarduzio
5e3b76791e fix: Exclude reference.bin from bucket stats calculations
reference.bin files are internal implementation details used for delta
compression. Their size was being incorrectly counted in both total_size
and compressed_size, resulting in 0% savings contribution.

Since delta file metadata already contains the original file_size that
the delta represents, including reference.bin would double-count storage.

This fix skips reference.bin files during stats calculation, consistent
with how they're filtered in other parts of the codebase (aws_compat.py,
sync.py, client.py).
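
A rough sketch of the accounting described above, assuming each listed object is a dict with `key`, `size`, and an optional `original_size` taken from delta metadata. The function name and the dict shape are illustrative, not the project's actual data model.

```python
def summarize_bucket(objects: list[dict]) -> dict[str, int]:
    """Sum original and stored sizes, skipping internal reference.bin files."""
    total_size = 0       # size of the files as the user sees them
    compressed_size = 0  # bytes actually stored for those files
    for obj in objects:
        filename = obj["key"].rsplit("/", 1)[-1]
        if filename == "reference.bin":
            continue  # internal detail: counting it would double-count storage
        stored = obj["size"]
        total_size += obj.get("original_size", stored)
        compressed_size += stored
    return {
        "total_size": total_size,
        "compressed_size": compressed_size,
        "space_saved": max(total_size - compressed_size, 0),
    }
```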

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
v5.0.2
2025-10-09 22:20:32 +02:00