deltaglider-beshu-tech/CLAUDE.md
Simone Scarduzio 59b15b6384 no more leaves
2025-09-23 14:14:54 +02:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

DeltaGlider is a drop-in S3 replacement that achieves up to 99.9% size reduction for versioned artifacts through binary delta compression with xdelta3. It is designed to store, for example, 4TB of similar files in about 5GB by keeping only the differences between versions.
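
As a rough check of the headline figure, the sketch below computes the space saving implied by storing 4TB in 5GB. It assumes binary units (1TB = 1024GB); the function name is illustrative and not part of the codebase.

```python
def space_saving(original_gb: float, stored_gb: float) -> float:
    """Return the fraction of space saved, e.g. 0.99 for a 99% reduction."""
    return 1.0 - stored_gb / original_gb


# 4 TB of similar files stored in 5 GB (assuming 1 TB = 1024 GB).
saving = space_saving(4 * 1024, 5)
print(f"{saving:.2%}")  # prints 99.88%
```

The result rounds to the quoted 99.9%; actual savings depend on how similar the stored versions are.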

Essential Commands

Development Setup

# Install with development dependencies using uv (preferred)
uv pip install -e ".[dev]"

# Or using pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run unit tests only
uv run pytest tests/unit

# Run integration tests only
uv run pytest tests/integration

# Run a specific test file
uv run pytest tests/integration/test_full_workflow.py

# Run a specific test
uv run pytest tests/integration/test_full_workflow.py::test_full_put_get_workflow

# Run with verbose output
uv run pytest -v

# Run with coverage
uv run pytest --cov=deltaglider

Code Quality

# Run linter (ruff)
uv run ruff check src/

# Fix linting issues automatically
uv run ruff check --fix src/

# Format code
uv run ruff format src/

# Type checking with mypy
uv run mypy src/

# Run all checks (linting + type checking)
uv run ruff check src/ && uv run mypy src/

Local Testing with MinIO

# Start MinIO for local S3 testing
docker run -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

# Test with local MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

# Now you can use deltaglider commands
deltaglider cp test.zip s3://test-bucket/

Architecture

Hexagonal Architecture Pattern

The codebase follows a clean hexagonal (ports and adapters) architecture:

src/deltaglider/
├── core/           # Domain logic (pure Python, no external dependencies)
│   ├── service.py  # Main DeltaService orchestration
│   ├── models.py   # Data models (DeltaSpace, ObjectKey, PutSummary, etc.)
│   └── errors.py   # Domain-specific exceptions
├── ports/          # Abstract interfaces (protocols)
│   ├── storage.py  # StoragePort protocol for S3-like operations
│   ├── diff.py     # DiffPort protocol for delta operations
│   ├── hash.py     # HashPort protocol for integrity checks
│   ├── cache.py    # CachePort protocol for local references
│   ├── clock.py    # ClockPort protocol for time operations
│   ├── logger.py   # LoggerPort protocol for logging
│   └── metrics.py  # MetricsPort protocol for observability
├── adapters/       # Concrete implementations
│   ├── storage_s3.py    # S3StorageAdapter using boto3
│   ├── diff_xdelta.py   # XdeltaAdapter using xdelta3 binary
│   ├── hash_sha256.py   # Sha256Adapter for checksums
│   ├── cache_fs.py      # FsCacheAdapter for file system cache
│   ├── clock_utc.py     # UtcClockAdapter for UTC timestamps
│   ├── logger_std.py    # StdLoggerAdapter for console output
│   └── metrics_noop.py  # NoopMetricsAdapter (placeholder)
└── app/
    └── cli/        # Click-based CLI application
        ├── main.py          # Main CLI entry point with AWS S3 commands
        ├── aws_compat.py    # AWS S3 compatibility helpers
        └── sync.py          # Sync command implementation
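
To illustrate how a port and its adapter relate in this layout, here is a minimal sketch built around the HashPort and Sha256Adapter names from the tree above. The method name `sha256`, its signature, and the `verify` helper are assumptions made for illustration; the real definitions live in ports/hash.py and adapters/hash_sha256.py.

```python
import hashlib
from pathlib import Path
from typing import Protocol


class HashPort(Protocol):
    """Port: what the core needs from a hashing backend (method name assumed)."""

    def sha256(self, path: Path) -> str: ...


class Sha256Adapter:
    """Adapter: satisfies HashPort structurally, using hashlib."""

    def sha256(self, path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large artifacts are not loaded at once.
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()


def verify(hasher: HashPort, path: Path, expected: str) -> bool:
    """Hypothetical core-side helper: depends only on the port."""
    return hasher.sha256(path) == expected
```

Because the core function accepts any object matching the protocol, a test can pass a stub hasher without touching the file system adapter.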

Core Concepts

  1. DeltaSpace: A prefix in S3 where related files are stored for delta compression. Contains a reference.bin file that serves as the base for delta compression.

  2. Delta Compression Flow:

    • First file uploaded to a DeltaSpace becomes the reference (stored as reference.bin)
    • Subsequent files are compared against the reference using xdelta3
    • Only the differences (delta) are stored with .delta suffix
    • Metadata stored with each S3 object (metadata/tags) preserves original file info and delta relationships
  3. File Type Intelligence:

    • Archive files (.zip, .tar, .gz, .jar, etc.) use delta compression
    • Text files, small files, and already-compressed unique files bypass delta
    • Decision made by should_use_delta() in core/service.py
  4. AWS S3 CLI Compatibility:

    • Commands (cp, ls, rm, sync) mirror AWS CLI syntax exactly
    • Located in app/cli/main.py with helpers in aws_compat.py
    • Maintains backward compatibility with original put/get commands
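
The file type decision described in item 3 might look roughly like the sketch below. The extension set, the 1 MB cutoff, and the function signature are assumptions for illustration; the actual rules are in should_use_delta() in core/service.py.

```python
from pathlib import Path

# Assumed values for illustration only; check core/service.py for the real ones.
DELTA_EXTENSIONS = {".zip", ".tar", ".gz", ".jar"}
MIN_DELTA_SIZE = 1024 * 1024  # hypothetical 1 MB cutoff for "small files"


def should_use_delta_sketch(filename: str, size_bytes: int) -> bool:
    """Return True if the file is a candidate for delta compression."""
    if size_bytes < MIN_DELTA_SIZE:
        # Small files bypass delta: the overhead may exceed the benefit.
        return False
    # Only archive-like extensions are delta candidates.
    return Path(filename).suffix.lower() in DELTA_EXTENSIONS
```

For example, a 5 MB `build-1.2.zip` would be a candidate under these assumptions, while `notes.txt` or a 10 KB zip would be stored directly.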

Key Algorithms

  1. Delta Ratio Check (core/service.py):

    • After creating a delta, checks whether delta_size / file_size > max_ratio (default 0.5)
    • If so (with the default, a delta larger than 50% of the original), stores the file directly instead
    • Prevents inefficient compression for dissimilar files
  2. Reference Management (core/service.py):

    • Reference stored at {deltaspace.prefix}/reference.bin
    • SHA256 verification on every read/write
    • Local cache in /tmp/.deltaglider/reference_cache for performance
  3. Sync Algorithm (app/cli/sync.py):

    • Compares local vs S3 using size and modification time
    • For delta files, uses timestamp comparison with 1-second tolerance
    • Supports --delete flag for true mirroring
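
The ratio check and the sync comparison above can be sketched as two small functions. The names, the FileInfo type, the empty-file handling, and the "upload only when the local copy is newer" direction are assumptions for illustration, not the code in core/service.py or app/cli/sync.py.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_RATIO = 0.5  # documented default for max_ratio


def keep_delta(delta_size: int, file_size: int,
               max_ratio: float = DEFAULT_MAX_RATIO) -> bool:
    """Return True if the delta is small enough to store instead of the file."""
    if file_size == 0:
        return False  # assumption: an empty file is stored directly
    return delta_size / file_size <= max_ratio


@dataclass(frozen=True)
class FileInfo:
    size: int      # bytes
    mtime: float   # seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo],
                 is_delta: bool, tolerance: float = 1.0) -> bool:
    """Decide whether a local file should be (re)uploaded during sync."""
    if remote is None:
        return True  # nothing stored remotely yet
    newer = local.mtime - remote.mtime > tolerance
    if is_delta:
        # A stored delta is smaller than the original, so sizes are not
        # comparable; rely on timestamps with the 1-second tolerance.
        return newer
    return local.size != remote.size or newer
```

With the default ratio, a 40 MB delta for a 100 MB file is kept, while a 60 MB delta causes the file to be stored directly.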

Testing Strategy

  • Unit Tests (tests/unit/): Test individual adapters and core logic with mocks
  • Integration Tests (tests/integration/): Test CLI commands and workflows
  • E2E Tests (tests/e2e/): Require LocalStack for full S3 simulation

Key test files:

  • test_full_workflow.py: Complete put/get cycle testing
  • test_aws_cli_commands_v2.py: AWS S3 CLI compatibility tests
  • test_xdelta.py: Binary diff engine integration tests

Common Development Tasks

Adding a New CLI Command

  1. Add command function to src/deltaglider/app/cli/main.py
  2. Use @cli.command() decorator and @click.pass_obj for service access
  3. Follow AWS S3 CLI conventions for flags and arguments
  4. Add tests to tests/integration/test_aws_cli_commands_v2.py

Adding a New Port/Adapter Pair

  1. Define protocol in src/deltaglider/ports/
  2. Implement adapter in src/deltaglider/adapters/
  3. Wire adapter in create_service() in app/cli/main.py
  4. Add unit tests in tests/unit/test_adapters.py

Modifying Delta Logic

Core delta logic is in src/deltaglider/core/service.py:

  • put(): Handles upload with delta compression
  • get(): Handles download with delta reconstruction
  • should_use_delta(): File type discrimination logic

Environment Variables

  • DG_LOG_LEVEL: Logging level (default: "INFO")
  • DG_CACHE_DIR: Local reference cache directory (default: "/tmp/.deltaglider/reference_cache")
  • DG_MAX_RATIO: Maximum acceptable delta/file ratio (default: "0.5")
  • AWS_ENDPOINT_URL: Override S3 endpoint for MinIO/LocalStack
  • AWS_ACCESS_KEY_ID: AWS credentials
  • AWS_SECRET_ACCESS_KEY: AWS credentials
  • AWS_DEFAULT_REGION: AWS region
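
As an illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. The Settings class and load_settings function are hypothetical names; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    log_level: str
    cache_dir: str
    max_ratio: float
    endpoint_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DeltaGlider-related variables, falling back to documented defaults."""
    return Settings(
        log_level=env.get("DG_LOG_LEVEL", "INFO"),
        cache_dir=env.get("DG_CACHE_DIR", "/tmp/.deltaglider/reference_cache"),
        # Environment values are strings, so the ratio is parsed explicitly.
        max_ratio=float(env.get("DG_MAX_RATIO", "0.5")),
        # None means "use the default AWS endpoint".
        endpoint_url=env.get("AWS_ENDPOINT_URL"),
    )
```

Passing the environment as a parameter keeps the function easy to test, for example `load_settings({"DG_MAX_RATIO": "0.3"})`.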

Important Implementation Details

  1. xdelta3 Binary Dependency: The system requires the xdelta3 binary to be installed on the host. The XdeltaAdapter calls it via subprocess.

  2. Metadata Storage: File metadata is stored in S3 object metadata/tags, not in a separate database. This keeps the system simple and stateless.

  3. SHA256 Verification: Every read and write operation includes SHA256 verification for data integrity.

  4. Atomic Operations: All S3 operations are atomic, so no partial state is left behind if an operation fails.

  5. Reference File Updates: Currently, the first file uploaded to a DeltaSpace becomes the permanent reference. Future versions may implement reference rotation.
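
To make item 1 concrete, here is a minimal sketch of invoking xdelta3 through subprocess. The helper names are hypothetical, and the exact flags the XdeltaAdapter passes are not stated here; the sketch uses the standard xdelta3 options -e (encode), -d (decode), -f (overwrite output), and -s (source file).

```python
import subprocess
from pathlib import Path
from typing import List


def encode_cmd(reference: Path, target: Path, delta: Path) -> List[str]:
    """Build the command that writes `delta` = target relative to reference."""
    return ["xdelta3", "-e", "-f", "-s", str(reference), str(target), str(delta)]


def decode_cmd(reference: Path, delta: Path, output: Path) -> List[str]:
    """Build the command that reconstructs `output` from reference + delta."""
    return ["xdelta3", "-d", "-f", "-s", str(reference), str(delta), str(output)]


def run_xdelta(cmd: List[str]) -> None:
    """Run xdelta3; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True, capture_output=True)
```

Keeping command construction separate from execution lets the argument lists be unit-tested on machines where xdelta3 is not installed.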

Performance Considerations

  • Local reference caching dramatically improves performance for repeated operations
  • Delta compression is CPU-intensive; consider parallelization for bulk uploads
  • The default max_ratio of 0.5 prevents storing inefficient deltas
  • For files <1MB, delta overhead may exceed benefits
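
One possible way to parallelize bulk uploads is a thread pool, since the CPU-heavy work happens in the external xdelta3 process rather than in Python itself. This is a sketch only: upload_many is a hypothetical helper, and upload_one stands in for whatever callable performs a single upload.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def upload_many(paths: Iterable[T], upload_one: Callable[[T], R],
                max_workers: int = 4) -> List[R]:
    """Apply upload_one to each path concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises the first
        # exception encountered when its result is consumed.
        return list(pool.map(upload_one, paths))
```

Whether concurrent uploads into the same DeltaSpace are safe depends on how the reference file is created, so this should be validated before use.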

Security Notes

  • Never store AWS credentials in code
  • Use IAM roles when possible
  • All S3 operations respect bucket policies and encryption settings
  • SHA256 checksums detect corruption and tampering