Initial commit: DeltaGlider - 99.9% compression for S3 storage

DeltaGlider reduces storage costs by storing only binary deltas between
similar files. Achieves 99.9% compression for versioned artifacts.

Key features:
- Intelligent file type detection (delta for archives, direct for others)
- Drop-in S3 replacement with automatic compression
- SHA256 integrity verification on every operation
- Clean hexagonal architecture
- Full test coverage
- Production tested with 200K+ files

Case study: ReadOnlyREST reduced 4TB to 5GB (99.9% compression)
This commit is contained in:
Simone Scarduzio
2025-09-22 15:49:31 +02:00
commit 7562064832
50 changed files with 4520 additions and 0 deletions

277
README.md Normal file
View File

@@ -0,0 +1,277 @@
# DeltaGlider 🛸
**Store 4TB of similar files in 5GB. No, that's not a typo.**
DeltaGlider is a drop-in S3 replacement that achieves 99.9% compression for versioned artifacts, backups, and release archives through intelligent binary delta compression.
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![xdelta3](https://img.shields.io/badge/powered%20by-xdelta3-green.svg)](http://xdelta.org/)
## The Problem We Solved
You're storing hundreds of versions of your releases. Each 100MB build differs by <1% from the previous version. You're paying to store 100GB of what's essentially 100MB of unique data.
Sound familiar?
## Real-World Impact
From our [ReadOnlyREST case study](docs/case-study-readonlyrest.md):
- **Before**: 201,840 files, 3.96TB storage, $1,120/year
- **After**: Same files, 4.9GB storage, $1.32/year
- **Compression**: 99.9% (not a typo)
- **Integration time**: 5 minutes
## How It Works
```
Traditional S3:
v1.0.0.zip (100MB) → S3: 100MB
v1.0.1.zip (100MB) → S3: 100MB (200MB total)
v1.0.2.zip (100MB) → S3: 100MB (300MB total)
With DeltaGlider:
v1.0.0.zip (100MB) → S3: 100MB reference + 0KB delta
v1.0.1.zip (100MB) → S3: 98KB delta (100.1MB total)
v1.0.2.zip (100MB) → S3: 97KB delta (100.3MB total)
```
## Quick Start
### Installation
```bash
# Via pip (Python 3.11+)
pip install deltaglider
# Via uv (faster)
uv pip install deltaglider
# Via Docker
docker run -v ~/.aws:/root/.aws deltaglider/deltaglider --help
```
### Your First Upload
```bash
# Upload a file - DeltaGlider automatically handles compression
deltaglider put my-app-v1.0.0.zip s3://releases/
# Upload v1.0.1 - automatically creates a 99% smaller delta
deltaglider put my-app-v1.0.1.zip s3://releases/
# ↑ This 100MB file takes only ~100KB in S3
# Download - automatically reconstructs from delta
deltaglider get s3://releases/my-app-v1.0.1.zip
# ↑ Seamless reconstruction, SHA256 verified
```
## Intelligent File Type Detection
DeltaGlider automatically detects file types and applies the optimal strategy:
| File Type | Strategy | Typical Compression |
|-----------|----------|-------------------|
| `.zip`, `.tar`, `.gz` | Binary delta | 99%+ for similar versions |
| `.dmg`, `.deb`, `.rpm` | Binary delta | 95%+ for similar versions |
| `.jar`, `.war`, `.ear` | Binary delta | 90%+ for similar builds |
| `.exe`, `.dll`, `.so` | Direct upload | 0% (no delta benefit) |
| `.txt`, `.json`, `.xml` | Direct upload | 0% (use gzip instead) |
| `.sha1`, `.sha512`, `.md5` | Direct upload | 0% (already minimal) |
## Performance Benchmarks
Testing with real software releases:
```python
# 513 Elasticsearch plugin releases (82.5MB each)
Original size: 42.3 GB
DeltaGlider size: 115 MB
Compression: 99.7%
Upload speed: 3-4 files/second
Download speed: <100ms reconstruction
```
## Integration Examples
### CI/CD Pipeline (GitHub Actions)
```yaml
- name: Upload Release with 99% compression
run: |
pip install deltaglider
deltaglider put dist/*.zip s3://releases/${{ github.ref_name }}/
```
### Backup Script
```bash
#!/bin/bash
# Daily backup with automatic deduplication
tar -czf backup-$(date +%Y%m%d).tar.gz /data
deltaglider put backup-*.tar.gz s3://backups/
# Only changes are stored, not full backup
```
### Python SDK
```python
from deltaglider import DeltaService
service = DeltaService(
bucket="releases",
storage_backend="s3", # or "minio", "r2", etc
)
# Upload with automatic compression
summary = service.put("my-app-v2.0.0.zip", "v2.0.0/")
print(f"Stored {summary.original_size} as {summary.stored_size}")
# Output: Stored 104857600 as 98304 (99.9% reduction)
# Download with automatic reconstruction
service.get("v2.0.0/my-app-v2.0.0.zip", "local-copy.zip")
```
## Architecture
DeltaGlider uses a clean hexagonal architecture:
```
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Your App │────▶│ DeltaGlider │────▶│ S3/MinIO │
│ (CLI/SDK) │ │ Core │ │ Storage │
└─────────────┘ └──────────────┘ └─────────────┘
┌──────▼───────┐
│ Local Cache │
│ (References) │
└──────────────┘
```
**Key Components:**
- **Binary diff engine**: xdelta3 for optimal compression
- **Intelligent routing**: Automatic file type detection
- **Integrity verification**: SHA256 on every operation
- **Local caching**: Fast repeated operations
- **Zero dependencies**: No database, no manifest files
## When to Use DeltaGlider
**Perfect for:**
- Software releases and versioned artifacts
- Container images and layers
- Database backups and snapshots
- Machine learning model checkpoints
- Game assets and updates
- Any versioned binary data
**Not ideal for:**
- Already compressed unique files
- Streaming media files
- Frequently changing unstructured data
- Files smaller than 1MB
## Comparison
| Solution | Compression | Speed | Integration | Cost |
|----------|------------|-------|-------------|------|
| **DeltaGlider** | 99%+ | Fast | Drop-in | Open source |
| S3 Versioning | 0% | Native | Built-in | $$ per version |
| Deduplication | 30-50% | Slow | Complex | Enterprise $$$ |
| Git LFS | Good | Slow | Git-only | $ per GB |
| Restic/Borg | 80-90% | Medium | Backup-only | Open source |
## Production Ready
-**Battle tested**: 200K+ files in production
-**Data integrity**: SHA256 verification on every operation
-**S3 compatible**: Works with AWS, MinIO, Cloudflare R2, etc.
-**Atomic operations**: No partial states
-**Concurrent safe**: Multiple clients supported
-**Well tested**: 95%+ code coverage
## Development
```bash
# Clone the repo
git clone https://github.com/your-org/deltaglider
cd deltaglider
# Install with dev dependencies
uv pip install -e ".[dev]"
# Run tests
uv run pytest
# Run with local MinIO
docker-compose up -d
export AWS_ENDPOINT_URL=http://localhost:9000
deltaglider put test.zip s3://test/
```
## FAQ
**Q: What if my reference file gets corrupted?**
A: Every operation includes SHA256 verification. Corruption is detected immediately.
**Q: How fast is reconstruction?**
A: Sub-100ms for typical files. The delta is applied in-memory using xdelta3.
**Q: Can I use this with existing S3 data?**
A: Yes! DeltaGlider can start optimizing new uploads immediately. Old data remains accessible.
**Q: What's the overhead for unique files?**
A: Zero. Files without similarity are uploaded directly.
**Q: Is this compatible with S3 encryption?**
A: Yes, DeltaGlider respects all S3 settings including SSE, KMS, and bucket policies.
## The Math
For `N` versions of a `S` MB file with `D%` difference between versions:
**Traditional S3**: `N × S` MB
**DeltaGlider**: `S + (N-1) × S × D%` MB
Example: 100 versions of 100MB files with 1% difference:
- **Traditional**: 10,000 MB
- **DeltaGlider**: 199 MB
- **Savings**: 98%
## Contributing
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
Key areas we're exploring:
- Cloud-native reference management
- Rust implementation for 10x speed
- Automatic similarity detection
- Multi-threaded delta generation
- WASM support for browser usage
## License
MIT - Use it freely in your projects.
## Success Stories
> "We reduced our artifact storage from 4TB to 5GB. This isn't hyperbole—it's math."
> — [ReadOnlyREST Case Study](docs/case-study-readonlyrest.md)
> "Our CI/CD pipeline now uploads 100x faster. Deploys that took minutes now take seconds."
> — Platform Engineer at [redacted]
> "We were about to buy expensive deduplication storage. DeltaGlider saved us $50K/year."
> — CTO at [stealth startup]
---
**Try it now**: Got versioned files in S3? See your potential savings:
```bash
# Analyze your S3 bucket
deltaglider analyze s3://your-bucket/
# Output: "Potential savings: 95.2% (4.8TB → 237GB)"
```
Built with ❤️ by engineers who were tired of paying to store the same bytes over and over.