deltaglider-beshu-tech/CLAUDE.md
Simone Scarduzio 3d04a407c0 feat: Add stats command with session-level caching (v5.1.0)
New Features:
- Add 'deltaglider stats' CLI command for bucket compression metrics
- Session-level bucket statistics caching for performance
- Enhanced list_buckets() with cached stats metadata

Technical Changes:
- Automatic cache invalidation on bucket mutations
- Intelligent cache reuse (detailed → quick fallback)
- Comprehensive test coverage (106+ new test lines)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 18:30:05 +02:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

DeltaGlider is a drop-in S3 replacement that achieves up to 99.9% size reduction for versioned artifacts through binary delta compression using xdelta3. It is designed to store, for example, 4TB of similar files in about 5GB by storing only the differences between versions.
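
As a quick check of how the two headline figures relate, the arithmetic can be worked through directly. This assumes decimal units (1 TB = 1000 GB); the variable names are illustrative only.

```python
# Worked arithmetic for the headline figures (decimal units assumed).
original_gb = 4 * 1000  # 4 TB of similar files, expressed in GB
stored_gb = 5           # size actually stored after delta compression

# Fraction of the original size that no longer needs to be stored.
savings = 1 - stored_gb / original_gb
print(f"{savings:.3%}")
```

The result rounds to the 99.9% quoted above. Real savings depend on how similar the stored versions are to each other.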

Essential Commands

Development Setup

# Install with development dependencies using uv (preferred)
uv pip install -e ".[dev]"

# Or using pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run unit tests only
uv run pytest tests/unit

# Run integration tests only
uv run pytest tests/integration

# Run a specific test file
uv run pytest tests/integration/test_full_workflow.py

# Run a specific test
uv run pytest tests/integration/test_full_workflow.py::test_full_put_get_workflow

# Run with verbose output
uv run pytest -v

# Run with coverage
uv run pytest --cov=deltaglider

Code Quality

# Run linter (ruff)
uv run ruff check src/

# Fix linting issues automatically
uv run ruff check --fix src/

# Format code
uv run ruff format src/

# Type checking with mypy
uv run mypy src/

# Run all checks (linting + type checking)
uv run ruff check src/ && uv run mypy src/

Local Testing with MinIO

# Start MinIO for local S3 testing
docker run -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

# Test with local MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

# Now you can use deltaglider commands
deltaglider cp test.zip s3://test-bucket/
deltaglider stats test-bucket  # Get bucket statistics

Available CLI Commands

cp      # Copy files to/from S3 (AWS S3 compatible)
ls      # List S3 buckets or objects (AWS S3 compatible)
rm      # Remove S3 objects (AWS S3 compatible)
sync    # Synchronize directories with S3 (AWS S3 compatible)
stats   # Get bucket statistics and compression metrics
verify  # Verify integrity of delta file

Architecture

Hexagonal Architecture Pattern

The codebase follows a clean hexagonal (ports and adapters) architecture:

src/deltaglider/
├── core/           # Domain logic (pure Python, no external dependencies)
│   ├── service.py  # Main DeltaService orchestration
│   ├── models.py   # Data models (DeltaSpace, ObjectKey, PutSummary, etc.)
│   └── errors.py   # Domain-specific exceptions
├── ports/          # Abstract interfaces (protocols)
│   ├── storage.py  # StoragePort protocol for S3-like operations
│   ├── diff.py     # DiffPort protocol for delta operations
│   ├── hash.py     # HashPort protocol for integrity checks
│   ├── cache.py    # CachePort protocol for local references
│   ├── clock.py    # ClockPort protocol for time operations
│   ├── logger.py   # LoggerPort protocol for logging
│   └── metrics.py  # MetricsPort protocol for observability
├── adapters/       # Concrete implementations
│   ├── storage_s3.py      # S3StorageAdapter using boto3
│   ├── diff_xdelta.py     # XdeltaAdapter using xdelta3 binary
│   ├── hash_sha256.py     # Sha256Adapter for checksums
│   ├── cache_cas.py       # ContentAddressedCache (SHA256-based storage)
│   ├── cache_encrypted.py # EncryptedCache (Fernet encryption wrapper)
│   ├── cache_memory.py    # MemoryCache (LRU in-memory cache)
│   ├── clock_utc.py       # UtcClockAdapter for UTC timestamps
│   ├── logger_std.py      # StdLoggerAdapter for console output
│   └── metrics_noop.py    # NoopMetricsAdapter (placeholder)
└── app/
    └── cli/        # Click-based CLI application
        ├── main.py          # Main CLI entry point with AWS S3 commands
        ├── aws_compat.py    # AWS S3 compatibility helpers
        └── sync.py          # Sync command implementation
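
To make the port/adapter split concrete, here is a minimal sketch of one pair. The names HashPort and Sha256Adapter come from the tree above, but the method name `sha256`, its signature, and the `verify` helper are assumptions made for illustration, not the project's actual API.

```python
import hashlib
from pathlib import Path
from typing import Protocol


class HashPort(Protocol):
    """Port: states what the core needs without naming any library."""

    def sha256(self, path: Path) -> str: ...  # hypothetical signature


class Sha256Adapter:
    """Adapter: satisfies HashPort structurally, using hashlib."""

    def sha256(self, path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large artifacts are never fully in memory.
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()


def verify(hasher: HashPort, path: Path, expected: str) -> bool:
    """Core-side helper (hypothetical): depends only on the port."""
    return hasher.sha256(path) == expected
```

Because the core only sees the protocol, a unit test can pass any object with a matching `sha256` method in place of the real adapter.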

Core Concepts

  1. DeltaSpace: A prefix in S3 where related files are stored for delta compression. Contains a reference.bin file that serves as the base for delta compression.

  2. Delta Compression Flow:

    • First file uploaded to a DeltaSpace becomes the reference (stored as reference.bin)
    • Subsequent files are compared against the reference using xdelta3
    • Only the differences (delta) are stored with .delta suffix
    • Metadata in S3 tags preserves original file info and delta relationships
  3. File Type Intelligence:

    • Archive files (.zip, .tar, .gz, .jar, etc.) use delta compression
    • Text files, small files, and already-compressed unique files bypass delta
    • Decision made by should_use_delta() in core/service.py
  4. AWS S3 CLI Compatibility:

    • Commands (cp, ls, rm, sync) mirror AWS CLI syntax exactly
    • Located in app/cli/main.py with helpers in aws_compat.py
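
The storage decision described in items 2 and 3 can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in core/service.py: the extension set covers only the types named above, the small-file rule is omitted, and `should_use_delta_sketch` and `plan_upload` are hypothetical names.

```python
from pathlib import PurePosixPath

# Assumption: only the extensions the text names; the real list is longer ("etc.").
DELTA_EXTENSIONS = {".zip", ".tar", ".gz", ".jar"}


def should_use_delta_sketch(filename: str) -> bool:
    """Return True for archive-like files (illustrative rule only)."""
    return PurePosixPath(filename).suffix.lower() in DELTA_EXTENSIONS


def plan_upload(filename: str, has_reference: bool) -> str:
    """Decide how a file would be stored inside a DeltaSpace."""
    if not should_use_delta_sketch(filename):
        return "direct"     # bypasses delta compression, stored as-is
    if not has_reference:
        return "reference"  # first eligible file becomes reference.bin
    return "delta"          # later files are stored with a .delta suffix


print(plan_upload("app-1.0.zip", has_reference=False))
print(plan_upload("app-1.1.zip", has_reference=True))
print(plan_upload("notes.txt", has_reference=True))
```

Keeping the decision separate from the S3 and xdelta3 calls makes it straightforward to unit-test without any adapters.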

Key Algorithms

  1. Delta Ratio Check (core/service.py):

    • After creating a delta, checks if delta_size / file_size > max_ratio (default 0.5)
    • If delta is too large (>50% of original), stores file directly instead
    • Prevents inefficient compression for dissimilar files
  2. Reference Management (core/service.py):

    • Reference stored at {deltaspace.prefix}/reference.bin
    • SHA256 verification on every read/write
    • Content-Addressed Storage (CAS) cache in /tmp/deltaglider-* (ephemeral)
    • Cache uses SHA256 as filename with two-level directory structure (ab/cd/abcdef...)
    • Automatic deduplication: same content = same SHA = same cache file
    • Negligible collision risk: SHA256 collisions are computationally infeasible, so distinct content maps to distinct cache files in practice
    • Encryption: Fernet (AES-128-CBC + HMAC) encryption at rest (always enabled; see Security Notes)
    • Ephemeral encryption keys per process for forward secrecy
    • Cache Backends: Configurable filesystem or in-memory cache with LRU eviction
  3. Sync Algorithm (app/cli/sync.py):

    • Compares local vs S3 using size and modification time
    • For delta files, uses timestamp comparison with 1-second tolerance
    • Supports --delete flag for true mirroring
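
The three checks above are small enough to sketch as pure functions. All function names and signatures here are hypothetical. The ratio rule and the two-level cache layout follow the text, while the "upload when local is newer" direction in the sync check is an assumption.

```python
from pathlib import Path


def keep_delta(delta_size: int, file_size: int, max_ratio: float = 0.5) -> bool:
    """True if the delta is small enough to store instead of the full file."""
    if file_size <= 0:
        return False  # nothing to gain on an empty file
    return delta_size / file_size <= max_ratio


def cas_path(cache_root: Path, sha256_hex: str) -> Path:
    """Map a SHA256 hex digest to the two-level layout ab/cd/abcdef..."""
    return cache_root / sha256_hex[:2] / sha256_hex[2:4] / sha256_hex


def needs_upload(local_size: int, local_mtime: float,
                 remote_size: int, remote_mtime: float,
                 *, is_delta: bool = False, tolerance_s: float = 1.0) -> bool:
    """Sync check (assumed semantics): upload if local differs or is newer."""
    if is_delta:
        # The stored object is a delta, so its size is not comparable;
        # fall back to timestamps with a 1-second tolerance.
        return local_mtime > remote_mtime + tolerance_s
    return local_size != remote_size or local_mtime > remote_mtime
```

For example, `keep_delta(60, 100)` is false under the default ratio, so that file would be stored directly rather than as a delta.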

Testing Strategy

  • Unit Tests (tests/unit/): Test individual adapters and core logic with mocks
  • Integration Tests (tests/integration/): Test CLI commands and workflows
  • E2E Tests (tests/e2e/): Require LocalStack for full S3 simulation

Key test files:

  • test_full_workflow.py: Complete put/get cycle testing
  • test_aws_cli_commands_v2.py: AWS S3 CLI compatibility tests
  • test_xdelta.py: Binary diff engine integration tests

Common Development Tasks

Adding a New CLI Command

  1. Add command function to src/deltaglider/app/cli/main.py
  2. Use @cli.command() decorator and @click.pass_obj for service access
  3. Follow AWS S3 CLI conventions for flags and arguments
  4. Add tests to tests/integration/test_aws_cli_commands_v2.py

Adding a New Port/Adapter Pair

  1. Define protocol in src/deltaglider/ports/
  2. Implement adapter in src/deltaglider/adapters/
  3. Wire adapter in create_service() in app/cli/main.py
  4. Add unit tests in tests/unit/test_adapters.py
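
The four steps above can be sketched end to end with a clock port. ClockPort and UtcClockAdapter are names from the architecture tree. The `now()` method, `FixedClock`, `ServiceSketch`, and `create_service_sketch` are hypothetical stand-ins, since the real `create_service()` wires many more adapters.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Protocol


class ClockPort(Protocol):
    """Step 1: the protocol (method name assumed for illustration)."""

    def now(self) -> datetime: ...


class UtcClockAdapter:
    """Step 2: the concrete adapter."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Step 4: a test double that returns a fixed instant."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant


@dataclass
class ServiceSketch:
    clock: ClockPort

    def stamp(self, key: str) -> Dict[str, str]:
        return {"key": key, "created_at": self.clock.now().isoformat()}


def create_service_sketch(clock: Optional[ClockPort] = None) -> ServiceSketch:
    """Step 3: wire the default adapter; tests inject a fake instead."""
    return ServiceSketch(clock=clock or UtcClockAdapter())
```

A unit test would call `create_service_sketch(FixedClock(...))` and assert on the exact timestamp, which is not possible with the real clock.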

Modifying Delta Logic

Core delta logic is in src/deltaglider/core/service.py:

  • put(): Handles upload with delta compression
  • get(): Handles download with delta reconstruction
  • should_use_delta(): File type discrimination logic

Environment Variables

  • DG_LOG_LEVEL: Logging level (default: "INFO")
  • DG_MAX_RATIO: Maximum acceptable delta/file ratio (default: "0.5", range: "0.0-1.0")
    • See docs/DG_MAX_RATIO.md for complete tuning guide
    • Controls when to use delta vs. direct storage
    • Lower (0.2-0.3) = conservative, only high-quality compression
    • Higher (0.6-0.7) = permissive, accept modest savings
  • DG_CACHE_BACKEND: Cache backend type - "filesystem" (default) or "memory"
  • DG_CACHE_MEMORY_SIZE_MB: Memory cache size limit in MB (default: "100")
  • DG_CACHE_ENCRYPTION_KEY: Optional base64-encoded Fernet key for persistent encryption (ephemeral by default)
  • AWS_ENDPOINT_URL: Override S3 endpoint for MinIO/LocalStack
  • AWS_ACCESS_KEY_ID: AWS credentials
  • AWS_SECRET_ACCESS_KEY: AWS credentials
  • AWS_DEFAULT_REGION: AWS region
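
A minimal sketch of how the DG_* variables and their documented defaults could be read and validated. The `Settings` dataclass and `load_settings` function are hypothetical and only illustrate the defaults and ranges listed above; they are not the project's configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    log_level: str
    max_ratio: float
    cache_backend: str
    cache_memory_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DG_* variables, falling back to the documented defaults."""
    max_ratio = float(env.get("DG_MAX_RATIO", "0.5"))
    if not 0.0 <= max_ratio <= 1.0:
        raise ValueError(f"DG_MAX_RATIO must be within 0.0-1.0, got {max_ratio}")

    backend = env.get("DG_CACHE_BACKEND", "filesystem")
    if backend not in {"filesystem", "memory"}:
        raise ValueError(f"unknown DG_CACHE_BACKEND: {backend!r}")

    return Settings(
        log_level=env.get("DG_LOG_LEVEL", "INFO"),
        max_ratio=max_ratio,
        cache_backend=backend,
        cache_memory_size_mb=int(env.get("DG_CACHE_MEMORY_SIZE_MB", "100")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `load_settings({"DG_MAX_RATIO": "0.3"})`.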

Security Notes:

  • Encryption Always On: Cache data is ALWAYS encrypted (cannot be disabled)
  • Ephemeral Keys: Encryption keys auto-generated per process for maximum security
  • Auto-Cleanup: Corrupted cache files automatically deleted on decryption failures
  • Process Isolation: Each process gets isolated cache in /tmp/deltaglider-*, cleaned up on exit
  • Persistent Keys: Set DG_CACHE_ENCRYPTION_KEY only if you need cross-process cache sharing (e.g., shared filesystems)
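
The process-isolation behaviour can be approximated with the standard library alone. This sketch covers only the per-process directory and its cleanup, assumes the system temp directory is /tmp, and does not show the encryption layer. `make_process_cache_dir` is a hypothetical name.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def make_process_cache_dir() -> Path:
    """Create a private cache directory and remove it at interpreter exit."""
    # mkdtemp creates a uniquely named directory accessible only by the
    # creating user, so concurrent processes never share a cache by accident.
    cache_dir = Path(tempfile.mkdtemp(prefix="deltaglider-"))
    # Best-effort cleanup; ignore_errors avoids noise if it is already gone.
    atexit.register(shutil.rmtree, cache_dir, ignore_errors=True)
    return cache_dir
```

Cleanup registered with `atexit` does not run if the process is killed by a signal it does not handle, so leftover directories are still possible.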

Important Implementation Details

  1. xdelta3 Binary Dependency: The system requires xdelta3 binary installed on the system. The XdeltaAdapter uses subprocess to call it.

  2. Metadata Storage: File metadata is stored in S3 object metadata/tags, not in a separate database. This keeps the system simple and stateless.

  3. SHA256 Verification: Every read and write operation includes SHA256 verification for data integrity.

  4. Atomic Operations: All S3 operations are atomic - no partial states are left if operations fail.

  5. Reference File Updates: Currently, the first file uploaded to a DeltaSpace becomes the permanent reference. Future versions may implement reference rotation.
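
For item 1, the following sketch shows one way to drive the xdelta3 binary from Python. The flags used (-e encode, -d decode, -s source file, -f overwrite output) are standard xdelta3 options, but the helper names and the exact flags the XdeltaAdapter passes are assumptions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def encode_cmd(reference: Path, target: Path, delta: Path) -> List[str]:
    """Command that writes `delta` = difference between reference and target."""
    return ["xdelta3", "-e", "-f", "-s", str(reference), str(target), str(delta)]


def decode_cmd(reference: Path, delta: Path, output: Path) -> List[str]:
    """Command that rebuilds the original file from reference + delta."""
    return ["xdelta3", "-d", "-f", "-s", str(reference), str(delta), str(output)]


def run_xdelta(cmd: List[str]) -> None:
    """Run the command, failing early if the binary is not installed."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("xdelta3 binary not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True, capture_output=True)
```

Building the argument list separately from running it lets the command construction be unit-tested on machines without xdelta3 installed.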

Performance Considerations

  • Content-Addressed Storage: SHA256-based deduplication eliminates redundant storage
  • Cache Backends:
    • Filesystem cache (default): stored on disk under /tmp/deltaglider-*; shareable across processes only when DG_CACHE_ENCRYPTION_KEY is set
    • Memory cache: faster, zero I/O, perfect for ephemeral CI/CD pipelines
  • Encryption Overhead: ~10-15% performance impact, provides security at rest
  • Delta compression is CPU-intensive; consider parallelization for bulk uploads
  • The default max_ratio of 0.5 prevents storing inefficient deltas
  • For files <1MB, delta overhead may exceed benefits
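
One way to parallelize bulk uploads is a thread pool, sketched below. Threads are a reasonable fit if the CPU-heavy work runs in the external xdelta3 process, which is an assumption here. `upload_all` is a hypothetical helper, and `upload_one` stands in for whatever callable performs a single upload.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def upload_all(paths: Iterable[str], upload_one: Callable[[str], str],
               max_workers: int = 4) -> Dict[str, str]:
    """Upload files concurrently and return {path: result}."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; an exception raised by any
        # upload is re-raised here when its result is consumed.
        results = pool.map(upload_one, path_list)
        return dict(zip(path_list, results))
```

When uploading into an empty DeltaSpace, it is probably safer to upload the first file on its own so the reference exists before the parallel batch starts.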

Security Notes

  • Never store AWS credentials in code
  • Use IAM roles when possible
  • All S3 operations respect bucket policies and encryption settings
  • SHA256 checksums detect tampering and corruption
  • Encryption Always On: Cache data is ALWAYS encrypted using Fernet (AES-128-CBC + HMAC) - cannot be disabled
  • Ephemeral Keys: Encryption keys auto-generated per process for forward secrecy and process isolation
  • Auto-Cleanup: Corrupted or tampered cache files automatically deleted on decryption failures
  • Persistent Keys: Set DG_CACHE_ENCRYPTION_KEY only for cross-process cache sharing (use secrets management)
  • Content-Addressed Storage: SHA256-based filenames prevent collision attacks
  • Zero-Trust Cache: All cache operations include cryptographic validation