New Features: - Add 'deltaglider stats' CLI command for bucket compression metrics - Session-level bucket statistics caching for performance - Enhanced list_buckets() with cached stats metadata Technical Changes: - Automatic cache invalidation on bucket mutations - Intelligent cache reuse (detailed → quick fallback) - Comprehensive test coverage (106+ new test lines) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
11 KiB
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
DeltaGlider is a drop-in S3 replacement that achieves 99.9% compression for versioned artifacts through intelligent binary delta compression using xdelta3. It's designed to store 4TB of similar files in 5GB by storing only the differences between versions.
Essential Commands
Development Setup
# Install with development dependencies using uv (preferred)
uv pip install -e ".[dev]"
# Or using pip
pip install -e ".[dev]"
Testing
# Run all tests
uv run pytest
# Run unit tests only
uv run pytest tests/unit
# Run integration tests only
uv run pytest tests/integration
# Run a specific test file
uv run pytest tests/integration/test_full_workflow.py
# Run a specific test
uv run pytest tests/integration/test_full_workflow.py::test_full_put_get_workflow
# Run with verbose output
uv run pytest -v
# Run with coverage
uv run pytest --cov=deltaglider
Code Quality
# Run linter (ruff)
uv run ruff check src/
# Fix linting issues automatically
uv run ruff check --fix src/
# Format code
uv run ruff format src/
# Type checking with mypy
uv run mypy src/
# Run all checks (linting + type checking)
uv run ruff check src/ && uv run mypy src/
Local Testing with MinIO
# Start MinIO for local S3 testing
docker run -p 9000:9000 -p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
minio/minio server /data --console-address ":9001"
# Test with local MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
# Now you can use deltaglider commands
deltaglider cp test.zip s3://test-bucket/
deltaglider stats test-bucket # Get bucket statistics
Available CLI Commands
cp # Copy files to/from S3 (AWS S3 compatible)
ls # List S3 buckets or objects (AWS S3 compatible)
rm # Remove S3 objects (AWS S3 compatible)
sync # Synchronize directories with S3 (AWS S3 compatible)
stats # Get bucket statistics and compression metrics
verify # Verify integrity of delta file
Architecture
Hexagonal Architecture Pattern
The codebase follows a clean hexagonal (ports and adapters) architecture:
src/deltaglider/
├── core/ # Domain logic (pure Python, no external dependencies)
│ ├── service.py # Main DeltaService orchestration
│ ├── models.py # Data models (DeltaSpace, ObjectKey, PutSummary, etc.)
│ └── errors.py # Domain-specific exceptions
├── ports/ # Abstract interfaces (protocols)
│ ├── storage.py # StoragePort protocol for S3-like operations
│ ├── diff.py # DiffPort protocol for delta operations
│ ├── hash.py # HashPort protocol for integrity checks
│ ├── cache.py # CachePort protocol for local references
│ ├── clock.py # ClockPort protocol for time operations
│ ├── logger.py # LoggerPort protocol for logging
│ └── metrics.py # MetricsPort protocol for observability
├── adapters/ # Concrete implementations
│ ├── storage_s3.py # S3StorageAdapter using boto3
│ ├── diff_xdelta.py # XdeltaAdapter using xdelta3 binary
│ ├── hash_sha256.py # Sha256Adapter for checksums
│ ├── cache_cas.py # ContentAddressedCache (SHA256-based storage)
│ ├── cache_encrypted.py # EncryptedCache (Fernet encryption wrapper)
│ ├── cache_memory.py # MemoryCache (LRU in-memory cache)
│ ├── clock_utc.py # UtcClockAdapter for UTC timestamps
│ ├── logger_std.py # StdLoggerAdapter for console output
│ └── metrics_noop.py # NoopMetricsAdapter (placeholder)
└── app/
└── cli/ # Click-based CLI application
├── main.py # Main CLI entry point with AWS S3 commands
├── aws_compat.py # AWS S3 compatibility helpers
└── sync.py # Sync command implementation
Core Concepts
-
DeltaSpace: A prefix in S3 where related files are stored for delta compression. Contains a
reference.binfile that serves as the base for delta compression. -
Delta Compression Flow:
- First file uploaded to a DeltaSpace becomes the reference (stored as
reference.bin) - Subsequent files are compared against the reference using xdelta3
- Only the differences (delta) are stored with
.deltasuffix - Metadata in S3 tags preserves original file info and delta relationships
- First file uploaded to a DeltaSpace becomes the reference (stored as
-
File Type Intelligence:
- Archive files (
.zip,.tar,.gz,.jar, etc.) use delta compression - Text files, small files, and already-compressed unique files bypass delta
- Decision made by
should_use_delta()incore/service.py
- Archive files (
-
AWS S3 CLI Compatibility:
- Commands (
cp,ls,rm,sync) mirror AWS CLI syntax exactly - Located in
app/cli/main.pywith helpers inaws_compat.py
- Commands (
Key Algorithms
-
Delta Ratio Check (
core/service.py):- After creating a delta, checks if
delta_size / file_size > max_ratio(default 0.5) - If delta is too large (>50% of original), stores file directly instead
- Prevents inefficient compression for dissimilar files
- After creating a delta, checks if
-
Reference Management (
core/service.py):- Reference stored at
{deltaspace.prefix}/reference.bin - SHA256 verification on every read/write
- Content-Addressed Storage (CAS) cache in
/tmp/deltaglider-*(ephemeral) - Cache uses SHA256 as filename with two-level directory structure (ab/cd/abcdef...)
- Automatic deduplication: same content = same SHA = same cache file
- Zero collision risk: SHA256 namespace guarantees uniqueness
- Encryption: Optional Fernet (AES-128-CBC + HMAC) encryption at rest (enabled by default)
- Ephemeral encryption keys per process for forward secrecy
- Cache Backends: Configurable filesystem or in-memory cache with LRU eviction
- Reference stored at
-
Sync Algorithm (
app/cli/sync.py):- Compares local vs S3 using size and modification time
- For delta files, uses timestamp comparison with 1-second tolerance
- Supports
--deleteflag for true mirroring
Testing Strategy
- Unit Tests (
tests/unit/): Test individual adapters and core logic with mocks - Integration Tests (
tests/integration/): Test CLI commands and workflows - E2E Tests (
tests/e2e/): Require LocalStack for full S3 simulation
Key test files:
test_full_workflow.py: Complete put/get cycle testingtest_aws_cli_commands_v2.py: AWS S3 CLI compatibility teststest_xdelta.py: Binary diff engine integration tests
Common Development Tasks
Adding a New CLI Command
- Add command function to
src/deltaglider/app/cli/main.py - Use
@cli.command()decorator and@click.pass_objfor service access - Follow AWS S3 CLI conventions for flags and arguments
- Add tests to
tests/integration/test_aws_cli_commands_v2.py
Adding a New Port/Adapter Pair
- Define protocol in
src/deltaglider/ports/ - Implement adapter in
src/deltaglider/adapters/ - Wire adapter in
create_service()inapp/cli/main.py - Add unit tests in
tests/unit/test_adapters.py
Modifying Delta Logic
Core delta logic is in src/deltaglider/core/service.py:
put(): Handles upload with delta compressionget(): Handles download with delta reconstructionshould_use_delta(): File type discrimination logic
Environment Variables
DG_LOG_LEVEL: Logging level (default: "INFO")DG_MAX_RATIO: Maximum acceptable delta/file ratio (default: "0.5", range: "0.0-1.0")- See docs/DG_MAX_RATIO.md for complete tuning guide
- Controls when to use delta vs. direct storage
- Lower (0.2-0.3) = conservative, only high-quality compression
- Higher (0.6-0.7) = permissive, accept modest savings
DG_CACHE_BACKEND: Cache backend type - "filesystem" (default) or "memory"DG_CACHE_MEMORY_SIZE_MB: Memory cache size limit in MB (default: "100")DG_CACHE_ENCRYPTION_KEY: Optional base64-encoded Fernet key for persistent encryption (ephemeral by default)AWS_ENDPOINT_URL: Override S3 endpoint for MinIO/LocalStackAWS_ACCESS_KEY_ID: AWS credentialsAWS_SECRET_ACCESS_KEY: AWS credentialsAWS_DEFAULT_REGION: AWS region
Security Notes:
- Encryption Always On: Cache data is ALWAYS encrypted (cannot be disabled)
- Ephemeral Keys: Encryption keys auto-generated per process for maximum security
- Auto-Cleanup: Corrupted cache files automatically deleted on decryption failures
- Process Isolation: Each process gets isolated cache in
/tmp/deltaglider-*, cleaned up on exit - Persistent Keys: Set
DG_CACHE_ENCRYPTION_KEYonly if you need cross-process cache sharing (e.g., shared filesystems)
Important Implementation Details
-
xdelta3 Binary Dependency: The system requires xdelta3 binary installed on the system. The
XdeltaAdapteruses subprocess to call it. -
Metadata Storage: File metadata is stored in S3 object metadata/tags, not in a separate database. This keeps the system simple and stateless.
-
SHA256 Verification: Every read and write operation includes SHA256 verification for data integrity.
-
Atomic Operations: All S3 operations are atomic - no partial states are left if operations fail.
-
Reference File Updates: Currently, the first file uploaded to a DeltaSpace becomes the permanent reference. Future versions may implement reference rotation.
Performance Considerations
- Content-Addressed Storage: SHA256-based deduplication eliminates redundant storage
- Cache Backends:
- Filesystem cache (default): persistent across processes, good for shared workflows
- Memory cache: faster, zero I/O, perfect for ephemeral CI/CD pipelines
- Encryption Overhead: ~10-15% performance impact, provides security at rest
- Delta compression is CPU-intensive; consider parallelization for bulk uploads
- The default max_ratio of 0.5 prevents storing inefficient deltas
- For files <1MB, delta overhead may exceed benefits
Security Notes
- Never store AWS credentials in code
- Use IAM roles when possible
- All S3 operations respect bucket policies and encryption settings
- SHA256 checksums prevent tampering and corruption
- Encryption Always On: Cache data is ALWAYS encrypted using Fernet (AES-128-CBC + HMAC) - cannot be disabled
- Ephemeral Keys: Encryption keys auto-generated per process for forward secrecy and process isolation
- Auto-Cleanup: Corrupted or tampered cache files automatically deleted on decryption failures
- Persistent Keys: Set
DG_CACHE_ENCRYPTION_KEYonly for cross-process cache sharing (use secrets management) - Content-Addressed Storage: SHA256-based filenames prevent collision attacks
- Zero-Trust Cache: All cache operations include cryptographic validation