deltaglider-beshu-tech/CLAUDE.md
Simone Scarduzio 59b15b6384 no more leaves
2025-09-23 14:14:54 +02:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
DeltaGlider is a drop-in S3 replacement that can achieve up to 99.9% storage reduction for versioned artifacts through binary delta compression with xdelta3. It stores only the differences between versions, so a workload of about 4TB of similar files can fit in roughly 5GB.
## Essential Commands
### Development Setup
```bash
# Install with development dependencies using uv (preferred)
uv pip install -e ".[dev]"
# Or using pip
pip install -e ".[dev]"
```
### Testing
```bash
# Run all tests
uv run pytest
# Run unit tests only
uv run pytest tests/unit
# Run integration tests only
uv run pytest tests/integration
# Run a specific test file
uv run pytest tests/integration/test_full_workflow.py
# Run a specific test
uv run pytest tests/integration/test_full_workflow.py::test_full_put_get_workflow
# Run with verbose output
uv run pytest -v
# Run with coverage
uv run pytest --cov=deltaglider
```
### Code Quality
```bash
# Run linter (ruff)
uv run ruff check src/
# Fix linting issues automatically
uv run ruff check --fix src/
# Format code
uv run ruff format src/
# Type checking with mypy
uv run mypy src/
# Run all checks (linting + type checking)
uv run ruff check src/ && uv run mypy src/
```
### Local Testing with MinIO
```bash
# Start MinIO for local S3 testing
docker run -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"
# Test with local MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
# Now you can use deltaglider commands
deltaglider cp test.zip s3://test-bucket/
```
## Architecture
### Hexagonal Architecture Pattern
The codebase follows a clean hexagonal (ports and adapters) architecture:
```
src/deltaglider/
├── core/                   # Domain logic (pure Python, no external dependencies)
│   ├── service.py          # Main DeltaService orchestration
│   ├── models.py           # Data models (DeltaSpace, ObjectKey, PutSummary, etc.)
│   └── errors.py           # Domain-specific exceptions
├── ports/                  # Abstract interfaces (protocols)
│   ├── storage.py          # StoragePort protocol for S3-like operations
│   ├── diff.py             # DiffPort protocol for delta operations
│   ├── hash.py             # HashPort protocol for integrity checks
│   ├── cache.py            # CachePort protocol for local references
│   ├── clock.py            # ClockPort protocol for time operations
│   ├── logger.py           # LoggerPort protocol for logging
│   └── metrics.py          # MetricsPort protocol for observability
├── adapters/               # Concrete implementations
│   ├── storage_s3.py       # S3StorageAdapter using boto3
│   ├── diff_xdelta.py      # XdeltaAdapter using xdelta3 binary
│   ├── hash_sha256.py      # Sha256Adapter for checksums
│   ├── cache_fs.py         # FsCacheAdapter for file system cache
│   ├── clock_utc.py        # UtcClockAdapter for UTC timestamps
│   ├── logger_std.py       # StdLoggerAdapter for console output
│   └── metrics_noop.py     # NoopMetricsAdapter (placeholder)
└── app/
    └── cli/                # Click-based CLI application
        ├── main.py         # Main CLI entry point with AWS S3 commands
        ├── aws_compat.py   # AWS S3 compatibility helpers
        └── sync.py         # Sync command implementation
```
### Core Concepts
1. **DeltaSpace**: A prefix in S3 where related files are stored for delta compression. Contains a `reference.bin` file that serves as the base for delta compression.
2. **Delta Compression Flow**:
   - First file uploaded to a DeltaSpace becomes the reference (stored as `reference.bin`)
   - Subsequent files are compared against the reference using xdelta3
   - Only the differences (delta) are stored with `.delta` suffix
   - Metadata in S3 tags preserves original file info and delta relationships
3. **File Type Intelligence**:
   - Archive files (`.zip`, `.tar`, `.gz`, `.jar`, etc.) use delta compression
   - Text files, small files, and already-compressed unique files bypass delta
   - Decision made by `should_use_delta()` in `core/service.py`
4. **AWS S3 CLI Compatibility**:
   - Commands (`cp`, `ls`, `rm`, `sync`) mirror AWS CLI syntax exactly
   - Located in `app/cli/main.py` with helpers in `aws_compat.py`
   - Maintains backward compatibility with original `put`/`get` commands
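
The file-type decision described in item 3 can be sketched roughly as follows. This is an illustration only, not the actual `should_use_delta()` from `core/service.py`: the function name `looks_delta_friendly`, the extension set, and the 1 MB size threshold are all assumptions chosen for the example.

```python
from pathlib import PurePosixPath

# Hypothetical values for illustration; the real rules live in core/service.py.
DELTA_EXTENSIONS = {".zip", ".tar", ".gz", ".jar"}
MIN_DELTA_SIZE = 1024 * 1024  # assumption: files under 1 MB are stored directly


def looks_delta_friendly(key: str, size_bytes: int) -> bool:
    """Return True when a file is a plausible candidate for delta compression."""
    # Only the last suffix is checked, so "a.tar.gz" is matched via ".gz".
    suffix = PurePosixPath(key).suffix.lower()
    return suffix in DELTA_EXTENSIONS and size_bytes >= MIN_DELTA_SIZE
```

Under these assumptions, a 5 MB `releases/app-1.0.zip` would be delta-compressed, while a text file or a tiny archive would be uploaded as-is.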
### Key Algorithms
1. **Delta Ratio Check** (`core/service.py`):
   - After creating a delta, checks if `delta_size / file_size > max_ratio` (default 0.5)
   - If delta is too large (>50% of original), stores file directly instead
   - Prevents inefficient compression for dissimilar files
2. **Reference Management** (`core/service.py`):
   - Reference stored at `{deltaspace.prefix}/reference.bin`
   - SHA256 verification on every read/write
   - Local cache in `/tmp/.deltaglider/reference_cache` for performance
3. **Sync Algorithm** (`app/cli/sync.py`):
   - Compares local vs S3 using size and modification time
   - For delta files, uses timestamp comparison with 1-second tolerance
   - Supports `--delete` flag for true mirroring
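
As a rough illustration of the ratio check and the sync comparison above, the sketch below shows one way the two decisions could be expressed. The names `delta_is_worthwhile`, `FileInfo`, and `needs_upload` are hypothetical, and the exact comparison rules (for example, treating a newer local file as changed) are assumptions rather than the behavior of `core/service.py` or `app/cli/sync.py`.

```python
from dataclasses import dataclass
from typing import Optional


def delta_is_worthwhile(delta_size: int, file_size: int, max_ratio: float = 0.5) -> bool:
    """Keep the delta unless delta_size / file_size exceeds max_ratio."""
    if file_size <= 0:
        return False  # assumption: nothing to gain on an empty file
    return delta_size / file_size <= max_ratio


@dataclass(frozen=True)
class FileInfo:
    size: int     # bytes
    mtime: float  # seconds since the epoch


def needs_upload(
    local: FileInfo,
    remote: Optional[FileInfo],
    is_delta: bool,
    tolerance: float = 1.0,
) -> bool:
    """Decide whether a local file should be (re)uploaded during sync."""
    if remote is None:
        return True  # not present in S3 yet
    if is_delta:
        # A stored delta is smaller than the original file, so sizes are not
        # comparable; fall back to timestamps with a 1-second tolerance.
        return local.mtime > remote.mtime + tolerance
    # Assumption: for directly stored files, a size mismatch or a newer
    # local modification time means the file changed.
    return local.size != remote.size or local.mtime > remote.mtime
```

With the default `max_ratio`, a 40 MB delta for a 100 MB file would be kept, while a 60 MB delta would be discarded and the file stored directly.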
## Testing Strategy
- **Unit Tests** (`tests/unit/`): Test individual adapters and core logic with mocks
- **Integration Tests** (`tests/integration/`): Test CLI commands and workflows
- **E2E Tests** (`tests/e2e/`): Require LocalStack for full S3 simulation

Key test files:
- `test_full_workflow.py`: Complete put/get cycle testing
- `test_aws_cli_commands_v2.py`: AWS S3 CLI compatibility tests
- `test_xdelta.py`: Binary diff engine integration tests
## Common Development Tasks
### Adding a New CLI Command
1. Add command function to `src/deltaglider/app/cli/main.py`
2. Use `@cli.command()` decorator and `@click.pass_obj` for service access
3. Follow AWS S3 CLI conventions for flags and arguments
4. Add tests to `tests/integration/test_aws_cli_commands_v2.py`
### Adding a New Port/Adapter Pair
1. Define protocol in `src/deltaglider/ports/`
2. Implement adapter in `src/deltaglider/adapters/`
3. Wire adapter in `create_service()` in `app/cli/main.py`
4. Add unit tests in `tests/unit/test_adapters.py`
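
The four steps above can be sketched with a small, self-contained example. It is modeled loosely on the clock port, but the method name `now()`, the `FixedClockAdapter` test double, and the `ExampleService` and `create_example_service()` names are assumptions for illustration; the real protocols and wiring in `ports/`, `adapters/`, and `create_service()` may look different.

```python
from datetime import datetime, timezone
from typing import Protocol


class ClockPort(Protocol):
    """Step 1: the port describes what the core needs, not how it is done."""

    def now(self) -> datetime: ...


class UtcClockAdapter:
    """Step 2: a concrete adapter; it satisfies the protocol structurally."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClockAdapter:
    """Test double for step 4: always returns the same instant."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant


class ExampleService:
    """Stand-in for the core service; it depends only on the port."""

    def __init__(self, clock: ClockPort) -> None:
        self._clock = clock

    def stamp(self, key: str) -> dict[str, str]:
        return {"key": key, "created_at": self._clock.now().isoformat()}


def create_example_service() -> ExampleService:
    # Step 3: wiring happens in one place, at the edge of the application.
    return ExampleService(clock=UtcClockAdapter())
```

Because the service only sees the port, unit tests can pass a `FixedClockAdapter` and assert on exact timestamps without patching anything.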
### Modifying Delta Logic
Core delta logic is in `src/deltaglider/core/service.py`:
- `put()`: Handles upload with delta compression
- `get()`: Handles download with delta reconstruction
- `should_use_delta()`: File type discrimination logic
## Environment Variables
- `DG_LOG_LEVEL`: Logging level (default: "INFO")
- `DG_CACHE_DIR`: Local reference cache directory (default: "/tmp/.deltaglider/reference_cache")
- `DG_MAX_RATIO`: Maximum acceptable delta/file ratio (default: "0.5")
- `AWS_ENDPOINT_URL`: Override S3 endpoint for MinIO/LocalStack
- `AWS_ACCESS_KEY_ID`: AWS credentials
- `AWS_SECRET_ACCESS_KEY`: AWS credentials
- `AWS_DEFAULT_REGION`: AWS region
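
One way these variables and their defaults could be read in Python is sketched below. The `Settings` dataclass and `load_settings()` helper are hypothetical and not part of DeltaGlider; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    log_level: str
    cache_dir: str
    max_ratio: float
    endpoint_url: Optional[str]  # None means the default AWS endpoint


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DeltaGlider-related variables, falling back to the documented defaults."""
    return Settings(
        log_level=env.get("DG_LOG_LEVEL", "INFO"),
        cache_dir=env.get("DG_CACHE_DIR", "/tmp/.deltaglider/reference_cache"),
        max_ratio=float(env.get("DG_MAX_RATIO", "0.5")),
        endpoint_url=env.get("AWS_ENDPOINT_URL"),
    )
```

Passing the environment in as a mapping keeps the helper easy to test, for example `load_settings({"DG_MAX_RATIO": "0.3"})`.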
## Important Implementation Details
1. **xdelta3 Binary Dependency**: The system requires xdelta3 binary installed on the system. The `XdeltaAdapter` uses subprocess to call it.
2. **Metadata Storage**: File metadata is stored in S3 object metadata/tags, not in a separate database. This keeps the system simple and stateless.
3. **SHA256 Verification**: Every read and write operation includes SHA256 verification for data integrity.
4. **Atomic Operations**: All S3 operations are atomic - no partial states are left if operations fail.
5. **Reference File Updates**: Currently, the first file uploaded to a DeltaSpace becomes the permanent reference. Future versions may implement reference rotation.
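
To make the xdelta3 dependency in item 1 more concrete, here is a minimal sketch of how a subprocess-based adapter might invoke the binary. The helper names are hypothetical and this is not the code of `XdeltaAdapter`; the flags used (`-e` encode, `-d` decode, `-s` source file, `-f` overwrite output) are standard xdelta3 options.

```python
import subprocess
from pathlib import Path


def encode_command(reference: Path, target: Path, delta: Path) -> list[str]:
    """Build the command that writes `delta` as the diff of `target` against `reference`."""
    return ["xdelta3", "-e", "-f", "-s", str(reference), str(target), str(delta)]


def decode_command(reference: Path, delta: Path, output: Path) -> list[str]:
    """Build the command that rebuilds the original file from `reference` plus `delta`."""
    return ["xdelta3", "-d", "-f", "-s", str(reference), str(delta), str(output)]


def run_xdelta(command: list[str]) -> None:
    """Run xdelta3; raises FileNotFoundError if the binary is missing and
    subprocess.CalledProcessError if it exits with a non-zero status."""
    subprocess.run(command, check=True, capture_output=True)
```

Separating command construction from execution keeps the argument order testable on machines where xdelta3 is not installed.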
## Performance Considerations
- Local reference caching dramatically improves performance for repeated operations
- Delta compression is CPU-intensive; consider parallelization for bulk uploads
- The default max_ratio of 0.5 prevents storing inefficient deltas
- For files <1MB, delta overhead may exceed benefits
## Security Notes
- Never store AWS credentials in code
- Use IAM roles when possible
- All S3 operations respect bucket policies and encryption settings
- SHA256 checksums detect corruption and unintended modification of stored objects
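
The checksum comparison behind these integrity checks can be sketched as follows. This is an illustrative example with hypothetical helper names (`sha256_file`, `verify_file`), not the project's `Sha256Adapter`; it streams the file in chunks so large artifacts do not need to fit in memory.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected value in constant time."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

A mismatch only shows that the content differs from what was recorded; the check is only as trustworthy as the place where the expected digest is stored.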