diff --git a/README.md b/README.md index 0dd69e6..198c062 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,45 @@ From our [ReadOnlyREST case study](docs/case-study-readonlyrest.md): - **Compression**: 99.9% (not a typo) - **Integration time**: 5 minutes -## How It Works +## Quick Start + +The quickest way to start is using the GUI +* https://github.com/sscarduzio/dg_commander/ + +### CLI Installation + +```bash +# Via pip (Python 3.11+) +pip install deltaglider + +# Via uv (faster) +uv pip install deltaglider + +# Via Docker +docker run -v ~/.aws:/root/.aws deltaglider/deltaglider --help +``` + +### Basic Usage + +```bash +# Upload a file (automatic delta compression) +deltaglider cp my-app-v1.0.0.zip s3://releases/ + +# Download a file (automatic delta reconstruction) +deltaglider cp s3://releases/my-app-v1.0.0.zip ./downloaded.zip + +# List objects +deltaglider ls s3://releases/ + +# Sync directories +deltaglider sync ./dist/ s3://releases/v1.0.0/ +``` + +**That's it!** DeltaGlider automatically detects similar files and applies 99%+ compression. For more commands and options, see [CLI Reference](#cli-reference). + +## Core Concepts + +### How It Works ``` Traditional S3: @@ -42,24 +80,32 @@ With DeltaGlider: v1.0.2.zip (100MB) → S3: 97KB delta (100.3MB total) ``` -## Quick Start +DeltaGlider stores the first file as a reference and subsequent similar files as tiny deltas (differences). When you download, it reconstructs the original file perfectly using the reference + delta. -### Installation +### Intelligent File Type Detection -```bash -# Via pip (Python 3.11+) -pip install deltaglider +DeltaGlider automatically detects file types and applies the optimal strategy: -# Via uv (faster) -uv pip install deltaglider +| File Type | Strategy | Typical Compression | Why It Works | +|-----------|----------|---------------------|--------------| +| `.zip`, `.tar`, `.gz` | Binary delta | 99%+ for similar versions | Archive structure remains consistent between versions | +| `.dmg`, `.deb`, `.rpm` | Binary delta | 95%+ for similar versions | Package formats with predictable structure | +| `.jar`, `.war`, `.ear` | Binary delta | 90%+ for similar builds | Java archives with mostly unchanged classes | +| `.exe`, `.dll`, `.so` | Direct upload | 0% (no delta benefit) | Compiled code changes unpredictably | +| `.txt`, `.json`, `.xml` | Direct upload | 0% (use gzip instead) | Text files benefit more from standard compression | +| `.sha1`, `.sha512`, `.md5` | Direct upload | 0% (already minimal) | Hash files are unique by design | -# Via Docker -docker run -v ~/.aws:/root/.aws deltaglider/deltaglider --help -``` +### Key Features -### AWS S3 Compatible Commands +- **AWS CLI Replacement**: Same commands as `aws s3` with automatic compression +- **boto3-Compatible SDK**: Works with existing boto3 code with minimal changes +- **Zero Configuration**: No databases, no manifest files, no complex setup +- **Data Integrity**: SHA256 verification on every operation +- **S3 Compatible**: Works with AWS S3, MinIO, Cloudflare R2, and any S3-compatible storage -DeltaGlider is a **drop-in replacement** for AWS S3 CLI with automatic delta compression: +## CLI Reference + +### All Commands ```bash # Copy files to/from S3 (automatic delta compression for archives) @@ -91,84 +137,35 @@ deltaglider sync --exclude "*.log" ./src/ s3://backup/ # Exclude patterns deltaglider cp file.zip s3://bucket/ --endpoint-url http://localhost:9000 ``` -## Why xdelta3 Excels at Archive Compression - -Traditional diff algorithms (like `diff` or `git diff`) work line-by-line on text files. Binary diff tools like `bsdiff` or `courgette` are optimized for executables. But **xdelta3** is uniquely suited for compressed archives because: - -1. **Block-level matching**: xdelta3 uses a rolling hash algorithm to find matching byte sequences at any offset, not just line boundaries. This is crucial for archives where small file changes can shift all subsequent byte positions. - -2. **Large window support**: xdelta3 can use reference windows up to 2GB, allowing it to find matches even when content has moved significantly within the archive. Other delta algorithms typically use much smaller windows (64KB-1MB). - -3. **Compression-aware**: When you update one file in a ZIP/TAR archive, the archive format itself remains largely identical - same compression dictionary, same structure. xdelta3 preserves these similarities while other algorithms might miss them. - -4. **Format agnostic**: Unlike specialized tools (e.g., `courgette` for Chrome updates), xdelta3 works on raw bytes without understanding the file format, making it perfect for any archive type. - -### Real-World Example -When you rebuild a JAR file with one class changed: -- **Text diff**: 100% different (it's binary data!) -- **bsdiff**: ~30-40% of original size (optimized for executables, not archives) -- **xdelta3**: ~0.1-1% of original size (finds the unchanged parts regardless of position) - -This is why DeltaGlider achieves 99%+ compression on versioned archives - xdelta3 can identify that 99% of the archive structure and content remains identical between versions. - -## Intelligent File Type Detection - -DeltaGlider automatically detects file types and applies the optimal strategy: - -| File Type | Strategy | Typical Compression | Why It Works | -|-----------|----------|-------------------|--------------| -| `.zip`, `.tar`, `.gz` | Binary delta | 99%+ for similar versions | Archive structure remains consistent between versions | -| `.dmg`, `.deb`, `.rpm` | Binary delta | 95%+ for similar versions | Package formats with predictable structure | -| `.jar`, `.war`, `.ear` | Binary delta | 90%+ for similar builds | Java archives with mostly unchanged classes | -| `.exe`, `.dll`, `.so` | Direct upload | 0% (no delta benefit) | Compiled code changes unpredictably | -| `.txt`, `.json`, `.xml` | Direct upload | 0% (use gzip instead) | Text files benefit more from standard compression | -| `.sha1`, `.sha512`, `.md5` | Direct upload | 0% (already minimal) | Hash files are unique by design | - -## Performance Benchmarks - -Testing with real software releases: - -```python -# 513 Elasticsearch plugin releases (82.5MB each) -Original size: 42.3 GB -DeltaGlider size: 115 MB -Compression: 99.7% -Upload speed: 3-4 files/second -Download speed: <100ms reconstruction -``` - -## Integration Examples - -### Drop-in AWS CLI Replacement +### Command Flags ```bash -# Before (aws-cli) -aws s3 cp release-v2.0.0.zip s3://releases/ -aws s3 cp --recursive ./build/ s3://releases/v2.0.0/ -aws s3 ls s3://releases/ -aws s3 rm s3://releases/old-version.zip +# All standard AWS flags work +deltaglider cp file.zip s3://bucket/ \ + --endpoint-url http://localhost:9000 \ + --profile production \ + --region us-west-2 -# After (deltaglider) - Same commands, 99% less storage! -deltaglider cp release-v2.0.0.zip s3://releases/ -deltaglider cp -r ./build/ s3://releases/v2.0.0/ -deltaglider ls s3://releases/ -deltaglider rm s3://releases/old-version.zip +# DeltaGlider-specific flags +deltaglider cp file.zip s3://bucket/ \ + --no-delta # Disable compression for specific files + --max-ratio 0.8 # Only use delta if compression > 20% ``` -### CI/CD Pipeline (GitHub Actions) +### CI/CD Integration + +#### GitHub Actions ```yaml - name: Upload Release with 99% compression run: | pip install deltaglider - # Use AWS S3 compatible syntax deltaglider cp dist/*.zip s3://releases/${{ github.ref_name }}/ - - # Or use recursive for entire directories + # Or recursive for entire directories deltaglider cp -r dist/ s3://releases/${{ github.ref_name }}/ ``` -### Backup Script +#### Daily Backup Script ```bash #!/bin/bash @@ -177,18 +174,15 @@ tar -czf backup-$(date +%Y%m%d).tar.gz /data deltaglider cp backup-*.tar.gz s3://backups/ # Only changes are stored, not full backup -# List backups with human-readable sizes -deltaglider ls -h s3://backups/ - # Clean up old backups deltaglider rm -r s3://backups/2023/ ``` -### Python SDK - boto3-Compatible API +## Python SDK **[📚 Full SDK Documentation](docs/sdk/README.md)** | **[API Reference](docs/sdk/api.md)** | **[Examples](docs/sdk/examples.md)** | **[boto3 Compatibility Guide](BOTO3_COMPATIBILITY.md)** -#### Quick Start - boto3 Compatible API (Recommended) +### boto3-Compatible API (Recommended) DeltaGlider provides a **boto3-compatible API** for core S3 operations (21 methods covering 80% of use cases): @@ -211,8 +205,7 @@ response = client.get_object(Bucket='releases', Key='v2.0.0/my-app.zip') with open('downloaded.zip', 'wb') as f: f.write(response['Body'].read()) -# Smart list_objects with optimized performance (NEW!) -# Fast listing (default) - no metadata fetching, ~50ms for 1000 objects +# Smart list_objects with optimized performance response = client.list_objects(Bucket='releases', Prefix='v2.0.0/') # Paginated listing for large buckets @@ -224,22 +217,14 @@ while response.is_truncated: ContinuationToken=response.next_continuation_token ) -# Get bucket statistics with smart defaults -stats = client.get_bucket_stats('releases') # Quick stats (50ms) -stats = client.get_bucket_stats('releases', detailed_stats=True) # With compression metrics - +# Delete and inspect objects client.delete_object(Bucket='releases', Key='old-version.zip') client.head_object(Bucket='releases', Key='v2.0.0/my-app.zip') - -# Bucket management - no boto3 needed! -client.create_bucket(Bucket='my-new-bucket') -client.list_buckets() -client.delete_bucket(Bucket='my-new-bucket') ``` -#### Bucket Management (NEW!) +### Bucket Management -**No boto3 required!** DeltaGlider now provides complete bucket management: +**No boto3 required!** DeltaGlider provides complete bucket management: ```python from deltaglider import create_client @@ -264,15 +249,9 @@ for bucket in response['Buckets']: client.delete_bucket(Bucket='my-old-bucket') ``` -**Benefits:** -- ✅ No need to import boto3 separately for bucket operations -- ✅ Consistent API with DeltaGlider object operations -- ✅ Works with AWS S3, MinIO, and S3-compatible storage -- ✅ Idempotent operations (safe to retry) - See [examples/bucket_management.py](examples/bucket_management.py) for complete example. -#### Simple API (Alternative) +### Simple API (Alternative) For simpler use cases, DeltaGlider also provides a streamlined API: @@ -290,15 +269,16 @@ print(f"Saved {summary.savings_percent:.0f}% storage space") client.download("s3://releases/v2.0.0/my-app-v2.0.0.zip", "local-app.zip") ``` -#### Real-World Example: Software Release Storage with boto3 API +### Real-World Examples + +#### Software Release Storage ```python from deltaglider import create_client -# Works exactly like boto3, but with 99% compression! client = create_client() -# Upload multiple versions using boto3-compatible API +# Upload multiple versions versions = ["v1.0.0", "v1.0.1", "v1.0.2", "v1.1.0"] for version in versions: with open(f"dist/my-app-{version}.zip", 'rb') as f: @@ -323,27 +303,19 @@ for version in versions: # v1.0.1: Stored as 0.2MB delta (saved 99.8%) # v1.0.2: Stored as 0.3MB delta (saved 99.7%) # v1.1.0: Stored as 5.2MB delta (saved 94.8%) - -# Download using standard boto3 API -response = client.get_object(Bucket='releases', Key='v1.1.0/my-app-v1.1.0.zip') -with open('my-app-latest.zip', 'wb') as f: - f.write(response['Body'].read()) ``` -#### Advanced Example: Automated Backup with boto3 API +#### Automated Database Backup ```python from datetime import datetime from deltaglider import create_client -# Works with any S3-compatible storage client = create_client(endpoint_url="http://minio.internal:9000") def backup_database(): - """Daily database backup with automatic deduplication using boto3 API.""" + """Daily database backup with automatic deduplication.""" date = datetime.now().strftime("%Y%m%d") - - # Create database dump dump_file = f"backup-{date}.sql.gz" # Upload using boto3-compatible API @@ -356,63 +328,80 @@ def backup_database(): Metadata={'date': date, 'source': 'production'} ) - # Check compression effectiveness (DeltaGlider extension) + # Check compression effectiveness if 'DeltaGliderInfo' in response: info = response['DeltaGliderInfo'] - if info['DeltaRatio'] > 0.1: # If delta is >10% of original + if info['DeltaRatio'] > 0.1: print(f"Warning: Low compression ({info['SavingsPercent']:.0f}%), " "database might have significant changes") print(f"Backup stored: {info['StoredSizeMB']:.1f}MB " f"(compressed from {info['OriginalSizeMB']:.1f}MB)") - # List recent backups using boto3 API - response = client.list_objects( - Bucket='backups', - Prefix='postgres/', - MaxKeys=30 - ) - - # Clean up old backups - for obj in response.get('Contents', []): - # Parse date from key - obj_date = obj['Key'].split('/')[1] - if days_old(obj_date) > 30: - client.delete_object(Bucket='backups', Key=obj['Key']) - -# Run backup backup_database() ``` For more examples and detailed API documentation, see the [SDK Documentation](docs/sdk/README.md). -## Migration from AWS CLI +## Performance & Benchmarks -Migrating from `aws s3` to `deltaglider` is as simple as changing the command name: +### Real-World Results -| AWS CLI | DeltaGlider | Compression Benefit | -|---------|------------|-------------------| -| `aws s3 cp file.zip s3://bucket/` | `deltaglider cp file.zip s3://bucket/` | ✅ 99% for similar files | -| `aws s3 cp -r dir/ s3://bucket/` | `deltaglider cp -r dir/ s3://bucket/` | ✅ 99% for archives | -| `aws s3 ls s3://bucket/` | `deltaglider ls s3://bucket/` | - | -| `aws s3 rm s3://bucket/file` | `deltaglider rm s3://bucket/file` | - | -| `aws s3 sync dir/ s3://bucket/` | `deltaglider sync dir/ s3://bucket/` | ✅ 99% incremental | +Testing with 513 Elasticsearch plugin releases (82.5MB each): -### Compatibility Flags - -```bash -# All standard AWS flags work -deltaglider cp file.zip s3://bucket/ \ - --endpoint-url http://localhost:9000 \ - --profile production \ - --region us-west-2 - -# DeltaGlider-specific flags -deltaglider cp file.zip s3://bucket/ \ - --no-delta # Disable compression for specific files - --max-ratio 0.8 # Only use delta if compression > 20% +``` +Original size: 42.3 GB +DeltaGlider size: 115 MB +Compression: 99.7% +Upload speed: 3-4 files/second +Download speed: <100ms reconstruction ``` -## Architecture +### The Math + +For `N` versions of a `S` MB file with `D%` difference between versions: + +**Traditional S3**: `N × S` MB +**DeltaGlider**: `S + (N-1) × S × D%` MB + +Example: 100 versions of 100MB files with 1% difference: +- **Traditional**: 10,000 MB +- **DeltaGlider**: 199 MB +- **Savings**: 98% + +### Comparison + +| Solution | Compression | Speed | Integration | Cost | +|----------|------------|-------|-------------|------| +| **DeltaGlider** | 99%+ | Fast | Drop-in | Open source | +| S3 Versioning | 0% | Native | Built-in | $$ per version | +| Deduplication | 30-50% | Slow | Complex | Enterprise $$$ | +| Git LFS | Good | Slow | Git-only | $ per GB | +| Restic/Borg | 80-90% | Medium | Backup-only | Open source | + +## Architecture & Technical Deep Dive + +### Why xdelta3 Excels at Archive Compression + +Traditional diff algorithms (like `diff` or `git diff`) work line-by-line on text files. Binary diff tools like `bsdiff` or `courgette` are optimized for executables. But **xdelta3** is uniquely suited for compressed archives because: + +1. **Block-level matching**: xdelta3 uses a rolling hash algorithm to find matching byte sequences at any offset, not just line boundaries. This is crucial for archives where small file changes can shift all subsequent byte positions. + +2. **Large window support**: xdelta3 can use reference windows up to 2GB, allowing it to find matches even when content has moved significantly within the archive. Other delta algorithms typically use much smaller windows (64KB-1MB). + +3. **Compression-aware**: When you update one file in a ZIP/TAR archive, the archive format itself remains largely identical - same compression dictionary, same structure. xdelta3 preserves these similarities while other algorithms might miss them. + +4. **Format agnostic**: Unlike specialized tools (e.g., `courgette` for Chrome updates), xdelta3 works on raw bytes without understanding the file format, making it perfect for any archive type. + +#### Real-World Example + +When you rebuild a JAR file with one class changed: +- **Text diff**: 100% different (it's binary data!) +- **bsdiff**: ~30-40% of original size (optimized for executables, not archives) +- **xdelta3**: ~0.1-1% of original size (finds the unchanged parts regardless of position) + +This is why DeltaGlider achieves 99%+ compression on versioned archives - xdelta3 can identify that 99% of the archive structure and content remains identical between versions. + +### System Architecture DeltaGlider uses a clean hexagonal architecture: @@ -435,7 +424,7 @@ DeltaGlider uses a clean hexagonal architecture: - **Local caching**: Fast repeated operations - **Zero dependencies**: No database, no manifest files -## When to Use DeltaGlider +### When to Use DeltaGlider ✅ **Perfect for:** - Software releases and versioned artifacts @@ -451,15 +440,17 @@ DeltaGlider uses a clean hexagonal architecture: - Frequently changing unstructured data - Files smaller than 1MB -## Comparison +## Migration from AWS CLI -| Solution | Compression | Speed | Integration | Cost | -|----------|------------|-------|-------------|------| -| **DeltaGlider** | 99%+ | Fast | Drop-in | Open source | -| S3 Versioning | 0% | Native | Built-in | $$ per version | -| Deduplication | 30-50% | Slow | Complex | Enterprise $$$ | -| Git LFS | Good | Slow | Git-only | $ per GB | -| Restic/Borg | 80-90% | Medium | Backup-only | Open source | +Migrating from `aws s3` to `deltaglider` is as simple as changing the command name: + +| AWS CLI | DeltaGlider | Compression Benefit | +|---------|------------|---------------------| +| `aws s3 cp file.zip s3://bucket/` | `deltaglider cp file.zip s3://bucket/` | ✅ 99% for similar files | +| `aws s3 cp -r dir/ s3://bucket/` | `deltaglider cp -r dir/ s3://bucket/` | ✅ 99% for archives | +| `aws s3 ls s3://bucket/` | `deltaglider ls s3://bucket/` | - | +| `aws s3 rm s3://bucket/file` | `deltaglider rm s3://bucket/file` | - | +| `aws s3 sync dir/ s3://bucket/` | `deltaglider sync dir/ s3://bucket/` | ✅ 99% incremental | ## Production Ready @@ -506,18 +497,6 @@ A: Zero. Files without similarity are uploaded directly. **Q: Is this compatible with S3 encryption?** A: Yes, DeltaGlider respects all S3 settings including SSE, KMS, and bucket policies. -## The Math - -For `N` versions of a `S` MB file with `D%` difference between versions: - -**Traditional S3**: `N × S` MB -**DeltaGlider**: `S + (N-1) × S × D%` MB - -Example: 100 versions of 100MB files with 1% difference: -- **Traditional**: 10,000 MB -- **DeltaGlider**: 199 MB -- **Savings**: 98% - ## Contributing We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. @@ -554,4 +533,4 @@ deltaglider analyze s3://your-bucket/ # Output: "Potential savings: 95.2% (4.8TB → 237GB)" ``` -Built with ❤️ by engineers who were tired of paying to store the same bytes over and over. \ No newline at end of file +Built with ❤️ by engineers who were tired of paying to store the same bytes over and over.