mirror of
https://github.com/beshu-tech/deltaglider.git
synced 2026-03-26 11:01:09 +01:00
Created extensive documentation for the DG_MAX_RATIO parameter, which controls delta compression efficiency thresholds. New Documentation: - docs/DG_MAX_RATIO.md (526 lines) * Complete explanation of how DG_MAX_RATIO works * Real-world scenarios and use cases * Decision trees for choosing optimal values * Industry-specific recommendations * Monitoring and tuning strategies * Advanced usage patterns * Comprehensive FAQ Updates to Existing Documentation: - README.md: Added link to DG_MAX_RATIO guide with tip callout - CLAUDE.md: Added detailed DG_MAX_RATIO explanation and guide link - Dockerfile: Added inline comments explaining DG_MAX_RATIO tuning - docs/sdk/getting-started.md: Added DG_MAX_RATIO guide reference Key Topics Covered: - What DG_MAX_RATIO does and why it exists - How to choose the right value (0.2-0.7 range) - Real-world scenarios (nightly builds, major versions, etc.) - Industry-specific use cases (SaaS, mobile apps, backups, etc.) - Configuration examples (Docker, SDK, CLI) - Monitoring and optimization strategies - Advanced usage patterns (dynamic ratios, A/B testing) - FAQ addressing common questions Examples Included: - Conservative (0.2-0.3): For dissimilar files or expensive storage - Default (0.5): Balanced approach for most use cases - Permissive (0.6-0.7): For very similar files or cheap storage Value Proposition: - Helps users optimize compression for their specific use case - Prevents inefficient delta compression - Provides data-driven tuning methodology - Reduces support questions about compression behavior 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
246 lines
10 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

DeltaGlider is a drop-in S3 replacement that achieves 99.9% compression for versioned artifacts through intelligent binary delta compression using xdelta3. It's designed to store 4TB of similar files in 5GB by storing only the differences between versions.

## Essential Commands

### Development Setup

```bash
# Install with development dependencies using uv (preferred)
uv pip install -e ".[dev]"

# Or using pip
pip install -e ".[dev]"
```

### Testing

```bash
# Run all tests
uv run pytest

# Run unit tests only
uv run pytest tests/unit

# Run integration tests only
uv run pytest tests/integration

# Run a specific test file
uv run pytest tests/integration/test_full_workflow.py

# Run a specific test
uv run pytest tests/integration/test_full_workflow.py::test_full_put_get_workflow

# Run with verbose output
uv run pytest -v

# Run with coverage
uv run pytest --cov=deltaglider
```

### Code Quality

```bash
# Run linter (ruff)
uv run ruff check src/

# Fix linting issues automatically
uv run ruff check --fix src/

# Format code
uv run ruff format src/

# Type checking with mypy
uv run mypy src/

# Run all checks (linting + type checking)
uv run ruff check src/ && uv run mypy src/
```

### Local Testing with MinIO

```bash
# Start MinIO for local S3 testing
docker run -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

# Test with local MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

# Now you can use deltaglider commands
deltaglider cp test.zip s3://test-bucket/
```

## Architecture

### Hexagonal Architecture Pattern

The codebase follows a clean hexagonal (ports and adapters) architecture:

```
src/deltaglider/
├── core/                    # Domain logic (pure Python, no external dependencies)
│   ├── service.py           # Main DeltaService orchestration
│   ├── models.py            # Data models (DeltaSpace, ObjectKey, PutSummary, etc.)
│   └── errors.py            # Domain-specific exceptions
├── ports/                   # Abstract interfaces (protocols)
│   ├── storage.py           # StoragePort protocol for S3-like operations
│   ├── diff.py              # DiffPort protocol for delta operations
│   ├── hash.py              # HashPort protocol for integrity checks
│   ├── cache.py             # CachePort protocol for local references
│   ├── clock.py             # ClockPort protocol for time operations
│   ├── logger.py            # LoggerPort protocol for logging
│   └── metrics.py           # MetricsPort protocol for observability
├── adapters/                # Concrete implementations
│   ├── storage_s3.py        # S3StorageAdapter using boto3
│   ├── diff_xdelta.py       # XdeltaAdapter using xdelta3 binary
│   ├── hash_sha256.py       # Sha256Adapter for checksums
│   ├── cache_cas.py         # ContentAddressedCache (SHA256-based storage)
│   ├── cache_encrypted.py   # EncryptedCache (Fernet encryption wrapper)
│   ├── cache_memory.py      # MemoryCache (LRU in-memory cache)
│   ├── clock_utc.py         # UtcClockAdapter for UTC timestamps
│   ├── logger_std.py        # StdLoggerAdapter for console output
│   └── metrics_noop.py      # NoopMetricsAdapter (placeholder)
└── app/
    └── cli/                 # Click-based CLI application
        ├── main.py          # Main CLI entry point with AWS S3 commands
        ├── aws_compat.py    # AWS S3 compatibility helpers
        └── sync.py          # Sync command implementation
```

### Core Concepts

1. **DeltaSpace**: A prefix in S3 where related files are stored for delta compression. Contains a `reference.bin` file that serves as the base for delta compression.

2. **Delta Compression Flow**:
   - First file uploaded to a DeltaSpace becomes the reference (stored as `reference.bin`)
   - Subsequent files are compared against the reference using xdelta3
   - Only the differences (delta) are stored with `.delta` suffix
   - Metadata in S3 tags preserves original file info and delta relationships

3. **File Type Intelligence**:
   - Archive files (`.zip`, `.tar`, `.gz`, `.jar`, etc.) use delta compression
   - Text files, small files, and already-compressed unique files bypass delta
   - Decision made by `should_use_delta()` in `core/service.py`

4. **AWS S3 CLI Compatibility**:
   - Commands (`cp`, `ls`, `rm`, `sync`) mirror AWS CLI syntax exactly
   - Located in `app/cli/main.py` with helpers in `aws_compat.py`
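
The file-type rule in item 3 can be pictured with a small sketch. This is not the project's `should_use_delta()`; the function name, the extension set, and the 1 MB cut-off below are assumptions chosen for illustration, and the real lists and thresholds live in `core/service.py`.

```python
from pathlib import PurePosixPath

# Hypothetical values for illustration only; check core/service.py for the real ones.
DELTA_EXTENSIONS = {".zip", ".tar", ".gz", ".jar"}
MIN_DELTA_SIZE = 1024 * 1024  # assumed cut-off for "small files"


def should_use_delta_sketch(key: str, size_bytes: int) -> bool:
    """Return True when an object looks like a delta-compression candidate."""
    if size_bytes < MIN_DELTA_SIZE:
        # Small files: delta overhead may exceed the benefit.
        return False
    # Only the final suffix is inspected, so "app.tar.gz" is matched via ".gz".
    suffix = PurePosixPath(key).suffix.lower()
    return suffix in DELTA_EXTENSIONS


if __name__ == "__main__":
    for key, size in [("releases/app-1.0.zip", 5_000_000), ("notes.txt", 5_000_000)]:
        print(key, should_use_delta_sketch(key, size))
```

A plain suffix check keeps the decision cheap because it needs no read of the object body, which is presumably why extension-based discrimination is attractive here.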

### Key Algorithms

1. **Delta Ratio Check** (`core/service.py`):
   - After creating a delta, checks if `delta_size / file_size > max_ratio` (default 0.5)
   - If the delta is too large (more than 50% of the original at the default ratio), stores the file directly instead
   - Prevents inefficient compression for dissimilar files

2. **Reference Management** (`core/service.py`):
   - Reference stored at `{deltaspace.prefix}/reference.bin`
   - SHA256 verification on every read/write
   - **Content-Addressed Storage (CAS)** cache in `/tmp/deltaglider-*` (ephemeral)
   - Cache uses SHA256 as filename with two-level directory structure (ab/cd/abcdef...)
   - Automatic deduplication: same content = same SHA = same cache file
   - Negligible collision risk: finding two different contents with the same SHA256 is computationally infeasible in practice
   - **Encryption**: Fernet (AES-128-CBC + HMAC) encryption at rest, always enabled (see Security Notes)
   - Ephemeral encryption keys per process for forward secrecy
   - **Cache Backends**: Configurable filesystem or in-memory cache with LRU eviction

3. **Sync Algorithm** (`app/cli/sync.py`):
   - Compares local vs S3 using size and modification time
   - For delta files, uses timestamp comparison with 1-second tolerance
   - Supports `--delete` flag for true mirroring
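
As a rough illustration of the first two algorithms, the sketch below shows the ratio comparison and the two-level cache layout. The function names and the handling of empty files are assumptions for this example, not the code in `core/service.py` or `cache_cas.py`.

```python
from pathlib import Path


def delta_is_worthwhile(delta_size: int, file_size: int, max_ratio: float = 0.5) -> bool:
    """Return True when the delta should be stored instead of the full file."""
    if file_size <= 0:
        return False  # assumed fallback: store empty files directly
    # Mirrors the documented check: a ratio above max_ratio means "store directly".
    return delta_size / file_size <= max_ratio


def cas_path(cache_root: Path, sha256_hex: str) -> Path:
    """Map a SHA256 hex digest to a two-level cache path (ab/cd/abcdef...)."""
    return cache_root / sha256_hex[:2] / sha256_hex[2:4] / sha256_hex


if __name__ == "__main__":
    print(delta_is_worthwhile(40, 100))  # ratio 0.4, below the default 0.5
    print(delta_is_worthwhile(60, 100))  # ratio 0.6, above the default 0.5
    print(cas_path(Path("/tmp/deltaglider-example"), "abcdef0123"))
```

Splitting the digest into two directory levels keeps any single directory from accumulating an unmanageable number of entries.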

## Testing Strategy

- **Unit Tests** (`tests/unit/`): Test individual adapters and core logic with mocks
- **Integration Tests** (`tests/integration/`): Test CLI commands and workflows
- **E2E Tests** (`tests/e2e/`): Require LocalStack for full S3 simulation

Key test files:
- `test_full_workflow.py`: Complete put/get cycle testing
- `test_aws_cli_commands_v2.py`: AWS S3 CLI compatibility tests
- `test_xdelta.py`: Binary diff engine integration tests

## Common Development Tasks

### Adding a New CLI Command

1. Add command function to `src/deltaglider/app/cli/main.py`
2. Use `@cli.command()` decorator and `@click.pass_obj` for service access
3. Follow AWS S3 CLI conventions for flags and arguments
4. Add tests to `tests/integration/test_aws_cli_commands_v2.py`

### Adding a New Port/Adapter Pair

1. Define protocol in `src/deltaglider/ports/`
2. Implement adapter in `src/deltaglider/adapters/`
3. Wire adapter in `create_service()` in `app/cli/main.py`
4. Add unit tests in `tests/unit/test_adapters.py`
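
The port/adapter steps above might look roughly like the following. All names here (`ExampleClockPort`, `ExampleUtcClock`, `FixedClock`, `stamp`) are hypothetical and exist only in this sketch; the repository's real protocols and adapters may expose different methods.

```python
from datetime import datetime, timezone
from typing import Protocol


class ExampleClockPort(Protocol):
    """Hypothetical port: the only thing the core needs from a clock."""

    def now(self) -> datetime: ...


class ExampleUtcClock:
    """Hypothetical adapter: satisfies the port structurally, no inheritance needed."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Test double that returns a fixed moment, useful in unit tests."""

    def __init__(self, moment: datetime) -> None:
        self._moment = moment

    def now(self) -> datetime:
        return self._moment


def stamp(clock: ExampleClockPort) -> str:
    """Core-side code depends on the port, never on a concrete adapter."""
    return clock.now().isoformat()


if __name__ == "__main__":
    print(stamp(ExampleUtcClock()))
    print(stamp(FixedClock(datetime(2024, 1, 1, tzinfo=timezone.utc))))
```

Because `typing.Protocol` uses structural typing, a test double can stand in for the adapter without touching the wiring code.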

### Modifying Delta Logic

Core delta logic is in `src/deltaglider/core/service.py`:
- `put()`: Handles upload with delta compression
- `get()`: Handles download with delta reconstruction
- `should_use_delta()`: File type discrimination logic

## Environment Variables

- `DG_LOG_LEVEL`: Logging level (default: "INFO")
- `DG_MAX_RATIO`: Maximum acceptable delta/file ratio (default: "0.5", range: "0.0-1.0")
  - **See [docs/DG_MAX_RATIO.md](docs/DG_MAX_RATIO.md) for complete tuning guide**
  - Controls when to use delta vs. direct storage
  - Lower (0.2-0.3) = conservative, only high-quality compression
  - Higher (0.6-0.7) = permissive, accept modest savings
- `DG_CACHE_BACKEND`: Cache backend type - "filesystem" (default) or "memory"
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size limit in MB (default: "100")
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded Fernet key for persistent encryption (ephemeral by default)
- `AWS_ENDPOINT_URL`: Override S3 endpoint for MinIO/LocalStack
- `AWS_ACCESS_KEY_ID`: AWS credentials
- `AWS_SECRET_ACCESS_KEY`: AWS credentials
- `AWS_DEFAULT_REGION`: AWS region
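
For readers wiring these variables into their own tooling, here is a minimal sketch of reading `DG_MAX_RATIO` with the documented default and range. The helper name `read_max_ratio` and the choice to raise `ValueError` on out-of-range input are assumptions of this example; DeltaGlider's own parsing may behave differently.

```python
import os
from collections.abc import Mapping


def read_max_ratio(env: Mapping[str, str] = os.environ) -> float:
    """Parse DG_MAX_RATIO, defaulting to 0.5 and enforcing the 0.0-1.0 range."""
    raw = env.get("DG_MAX_RATIO", "0.5")
    ratio = float(raw)  # raises ValueError for non-numeric input
    if not 0.0 <= ratio <= 1.0:
        # Assumed behavior for this sketch: reject rather than clamp.
        raise ValueError(f"DG_MAX_RATIO must be within 0.0-1.0, got {raw!r}")
    return ratio


if __name__ == "__main__":
    print(read_max_ratio({}))                       # documented default
    print(read_max_ratio({"DG_MAX_RATIO": "0.3"}))  # conservative setting
```

Passing the environment as a mapping keeps the function easy to unit test without mutating `os.environ`.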

**Security Notes**:
- **Encryption Always On**: Cache data is ALWAYS encrypted (cannot be disabled)
- **Ephemeral Keys**: Encryption keys auto-generated per process for maximum security
- **Auto-Cleanup**: Corrupted cache files automatically deleted on decryption failures
- **Process Isolation**: Each process gets isolated cache in `/tmp/deltaglider-*`, cleaned up on exit
- **Persistent Keys**: Set `DG_CACHE_ENCRYPTION_KEY` only if you need cross-process cache sharing (e.g., shared filesystems)

## Important Implementation Details

1. **xdelta3 Binary Dependency**: The system requires the `xdelta3` binary to be installed on the host. The `XdeltaAdapter` calls it via subprocess.

2. **Metadata Storage**: File metadata is stored in S3 object metadata/tags, not in a separate database. This keeps the system simple and stateless.

3. **SHA256 Verification**: Every read and write operation includes SHA256 verification for data integrity.

4. **Atomic Operations**: All S3 operations are atomic - no partial states are left if operations fail.

5. **Reference File Updates**: Currently, the first file uploaded to a DeltaSpace becomes the permanent reference. Future versions may implement reference rotation.
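
The SHA256 verification in item 3 can be sketched with the standard library alone. The helper names and the use of `ValueError` are assumptions for illustration; the project defines its own domain exceptions in `core/errors.py`, and its `Sha256Adapter` may be structured differently.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a stream in chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(stream: BinaryIO, expected_hex: str) -> None:
    """Raise if the stream's content does not match the expected digest."""
    actual = sha256_of(stream)
    if actual != expected_hex.lower():
        # ValueError is a stand-in for a domain-specific integrity error.
        raise ValueError(f"SHA256 mismatch: expected {expected_hex}, got {actual}")


if __name__ == "__main__":
    payload = b"example artifact bytes"
    expected = hashlib.sha256(payload).hexdigest()
    verify_sha256(io.BytesIO(payload), expected)  # passes silently
    print("checksum ok:", expected[:12])
```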

## Performance Considerations

- **Content-Addressed Storage**: SHA256-based deduplication eliminates redundant storage
- **Cache Backends**:
  - Filesystem cache (default): persistent across processes, good for shared workflows
  - Memory cache: faster, zero I/O, perfect for ephemeral CI/CD pipelines
- **Encryption Overhead**: ~10-15% performance impact, provides security at rest
- Delta compression is CPU-intensive; consider parallelization for bulk uploads
- The default max_ratio of 0.5 prevents storing inefficient deltas
- For files <1MB, delta overhead may exceed benefits
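
One way to parallelize bulk uploads, as suggested above, is a thread pool around whatever function performs a single upload. The sketch assumes a caller-supplied `upload_one` callable (hypothetical; it could wrap a `deltaglider cp` invocation or an SDK call) and makes no claim about thread-safety of any DeltaGlider object.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def upload_all(
    paths: Iterable[str],
    upload_one: Callable[[str], T],
    max_workers: int = 4,
) -> list[T]:
    """Run upload_one over every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs and re-raises
        # the first exception encountered while iterating.
        return list(pool.map(upload_one, paths))


if __name__ == "__main__":
    # Stand-in for a real upload: just report what would be sent.
    results = upload_all(["a.zip", "b.zip"], lambda p: f"uploaded {p}")
    print(results)
```

Threads are a reasonable first choice if the heavy lifting happens in an external `xdelta3` process or in network I/O, since neither holds the Python interpreter lock while it runs; tune `max_workers` to the available CPU cores.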

## Security Notes

- Never store AWS credentials in code
- Use IAM roles when possible
- All S3 operations respect bucket policies and encryption settings
- SHA256 checksums detect tampering and corruption
- **Encryption Always On**: Cache data is ALWAYS encrypted using Fernet (AES-128-CBC + HMAC) - cannot be disabled
- **Ephemeral Keys**: Encryption keys auto-generated per process for forward secrecy and process isolation
- **Auto-Cleanup**: Corrupted or tampered cache files automatically deleted on decryption failures
- **Persistent Keys**: Set `DG_CACHE_ENCRYPTION_KEY` only for cross-process cache sharing (use secrets management)
- **Content-Addressed Storage**: SHA256-based filenames prevent collision attacks
- **Zero-Trust Cache**: All cache operations include cryptographic validation