# DeltaGlider Architecture
Understanding how DeltaGlider achieves 99.9% compression through intelligent binary delta compression.
## Table of Contents

- [Overview](#overview)
- [Hexagonal Architecture](#hexagonal-architecture)
- [Core Concepts](#core-concepts)
- [Compression Algorithm](#compression-algorithm)
- [Storage Strategy](#storage-strategy)
- [Performance Optimizations](#performance-optimizations)
- [Security & Integrity](#security--integrity)
- [Comparison with Alternatives](#comparison-with-alternatives)
## Overview
DeltaGlider is built on a simple yet powerful idea: most versioned files share 99% of their content. Instead of storing complete files repeatedly, we store one reference file and only the differences (deltas) for similar files.
### High-Level Flow

First Upload (v1.0.0):

```
┌──────────┐        ┌─────────────┐        ┌──────┐
│  100MB   │───────▶│ DeltaGlider │───────▶│  S3  │
│  File    │        │             │        │100MB │
└──────────┘        └─────────────┘        └──────┘
```

Second Upload (v1.0.1):

```
┌──────────┐        ┌─────────────┐        ┌──────┐
│  100MB   │───────▶│ DeltaGlider │───────▶│ S3   │
│  File    │        │  (xdelta3)  │        │ 98KB │
└──────────┘        └─────────────┘        └──────┘
                                              │
                                     Creates 98KB delta
                                     by comparing with
                                     v1.0.0 reference
```
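In code, the decision reduces to: if the prefix has no reference yet, store the file in full as the reference; otherwise diff against it. A minimal sketch against the ports defined below (the key layout and helper names here are illustrative assumptions, not DeltaGlider's actual API):

```python
# Sketch of the core upload decision. `storage` and `diff` are
# implementations of the StoragePort and DiffPort protocols shown
# later; the key layout is an illustrative assumption.
def put_file(storage, diff, bucket: str, prefix: str, name: str, data: bytes) -> None:
    ref_key = f"{prefix}/reference.bin"
    if not storage.object_exists(bucket, ref_key):
        # First upload to this prefix: the file becomes the full reference
        storage.put_object(bucket, ref_key, data, {"original-name": name})
    else:
        # Later uploads: store only the delta against the reference
        reference, _ = storage.get_object(bucket, ref_key)
        delta = diff.create_delta(reference, data)
        storage.put_object(bucket, f"{prefix}/{name}.delta", delta,
                           {"original-name": name})
```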
## Hexagonal Architecture
DeltaGlider follows the hexagonal (ports and adapters) architecture pattern for maximum flexibility and testability.
### Architecture Diagram

```
        ┌─────────────────┐
        │   Application   │
        │   (CLI / SDK)   │
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │                 │
        │  DeltaService   │
        │  (Core Logic)   │
        │                 │
        └────┬─────┬──────┘
             │     │
  ┌──────────▼─┬───▼──────────┐
  │            │              │
  │   Ports    │    Ports     │
  │(Interfaces)│ (Interfaces) │
  │            │              │
  └──────┬─────┴────┬─────────┘
         │          │
┌────────▼─────┐  ┌─▼─────────────┐
│              │  │               │
│   Adapters   │  │   Adapters    │
│              │  │               │
├──────────────┤  ├───────────────┤
│  S3Storage   │  │  XdeltaDiff   │
│  Sha256Hash  │  │  FsCache      │
│  UtcClock    │  │  StdLogger    │
│  NoopMetrics │  │               │
└──────┬───────┘  └───────┬───────┘
       │                  │
┌──────▼─────┐     ┌──────▼─────┐
│    AWS     │     │  xdelta3   │
│     S3     │     │   binary   │
└────────────┘     └────────────┘
```
### Ports (Interfaces)
Ports define contracts that adapters must implement:
```python
from pathlib import Path
from typing import Dict, Optional, Protocol, Tuple

# StoragePort - abstract S3 operations
class StoragePort(Protocol):
    def put_object(self, bucket: str, key: str, data: bytes, metadata: Dict) -> None: ...
    def get_object(self, bucket: str, key: str) -> Tuple[bytes, Dict]: ...
    def object_exists(self, bucket: str, key: str) -> bool: ...
    def delete_object(self, bucket: str, key: str) -> None: ...

# DiffPort - abstract delta operations
class DiffPort(Protocol):
    def create_delta(self, reference: bytes, target: bytes) -> bytes: ...
    def apply_delta(self, reference: bytes, delta: bytes) -> bytes: ...

# HashPort - abstract integrity checks
class HashPort(Protocol):
    def hash(self, data: bytes) -> str: ...
    def hash_file(self, path: Path) -> str: ...

# CachePort - abstract local caching
class CachePort(Protocol):
    def get(self, key: str) -> Optional[Path]: ...
    def put(self, key: str, path: Path) -> None: ...
    def exists(self, key: str) -> bool: ...
```
### Adapters (Implementations)
Adapters provide concrete implementations:
- S3StorageAdapter: Uses boto3 for S3 operations
- XdeltaAdapter: Wraps xdelta3 binary for delta compression
- Sha256Adapter: Provides SHA256 hashing
- FsCacheAdapter: File system based reference cache
- UtcClockAdapter: UTC timestamp provider
- StdLoggerAdapter: Console logging
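As a concrete example, a hashing adapter is little more than a thin wrapper over the standard library. A sketch of what an implementation of `HashPort` can look like (illustrative, not the actual source):

```python
import hashlib
from pathlib import Path

class Sha256Adapter:
    """Satisfies HashPort using hashlib (a sketch, not the real adapter)."""

    def hash(self, data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def hash_file(self, path: Path) -> str:
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            # Read in 1MB chunks so large references never load into memory
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()
```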
### Benefits
- Testability: Mock any adapter for unit testing
- Flexibility: Swap implementations (e.g., different storage backends)
- Separation: Business logic isolated from infrastructure
- Extensibility: Add new adapters without changing core
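The testability benefit is direct: because the core depends only on ports, a unit test can swap in an in-memory storage double. A sketch with hypothetical wiring (the real `DeltaService` constructor signature may differ):

```python
# Hypothetical in-memory double that satisfies StoragePort for tests.
class InMemoryStorage:
    def __init__(self):
        self.objects = {}

    def put_object(self, bucket, key, data, metadata):
        self.objects[(bucket, key)] = (data, metadata)

    def get_object(self, bucket, key):
        return self.objects[(bucket, key)]

    def object_exists(self, bucket, key):
        return (bucket, key) in self.objects

    def delete_object(self, bucket, key):
        self.objects.pop((bucket, key), None)

# Assumed wiring for illustration; check the actual constructor.
service = DeltaService(storage=InMemoryStorage(), diff=XdeltaAdapter(),
                       hasher=Sha256Adapter())
```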
## Core Concepts

### DeltaSpace
A DeltaSpace is an S3 prefix containing related files that share a common reference:
```python
from dataclasses import dataclass

@dataclass
class DeltaSpace:
    bucket: str  # S3 bucket
    prefix: str  # Prefix for related files

# Example:
#   DeltaSpace(bucket="releases", prefix="myapp/v1/")
# contains:
#   - reference.bin (first uploaded file)
#   - file1.zip.delta
#   - file2.zip.delta
```
### Reference File
The first file uploaded to a DeltaSpace becomes the reference:
```
s3://bucket/prefix/reference.bin         # Full file (e.g., 100MB)
s3://bucket/prefix/reference.bin.sha256  # Integrity checksum
```
### Delta Files
Subsequent files are stored as deltas:
```
s3://bucket/prefix/myfile.zip.delta      # Delta file (e.g., 98KB)
```

Metadata (S3 tags):
- `original_name`: myfile.zip
- `original_size`: 104857600
- `original_hash`: abc123...
- `reference_hash`: def456...
- `tool_version`: deltaglider/0.1.0
## Compression Algorithm

### xdelta3: The Secret Sauce
DeltaGlider uses xdelta3, a binary diff algorithm optimized for large files:
#### How xdelta3 Works
- Rolling Hash: Scans reference file with a rolling hash window
- Block Matching: Finds matching byte sequences at any offset
- Instruction Stream: Generates copy/insert instructions
- Compression: Further compresses the instruction stream
```
Original:  ABCDEFGHIJKLMNOP
Modified:  ABCXYZGHIJKLMNOP

Delta instructions:
- COPY 0-2 (ABC)          # Copy bytes 0-2 from reference
- INSERT XYZ              # Insert new bytes
- COPY 6-15 (GHIJKLMNOP)  # Copy bytes 6-15 from reference

Delta size: ~10 bytes instead of 16 bytes
```
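The `XdeltaAdapter` wraps the xdelta3 binary rather than reimplementing this. A minimal sketch of that pattern using the standard CLI flags (`-e` encode, `-d` decode, `-s` source); the real adapter's flag set and temp-file handling may differ:

```python
import subprocess
from pathlib import Path

def create_delta(reference: Path, target: Path, delta_out: Path) -> None:
    # xdelta3 -e encodes: produce a delta that turns reference into target
    subprocess.run(
        ['xdelta3', '-e', '-f', '-s', str(reference), str(target), str(delta_out)],
        check=True,
    )

def apply_delta(reference: Path, delta: Path, target_out: Path) -> None:
    # xdelta3 -d decodes: rebuild the target from reference + delta
    subprocess.run(
        ['xdelta3', '-d', '-f', '-s', str(reference), str(delta), str(target_out)],
        check=True,
    )
```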
### Why xdelta3 Excels at Archives
Archive files (ZIP, TAR, JAR) have predictable structure:
```
ZIP Structure:
┌─────────────┐
│   Headers   │ ← Usually identical between versions
├─────────────┤
│   File 1    │ ← May be unchanged
├─────────────┤
│   File 2    │ ← Small change
├─────────────┤
│   File 3    │ ← May be unchanged
├─────────────┤
│  Directory  │ ← Structure mostly same
└─────────────┘
```
Even when one file changes inside the archive, xdelta3 can:
- Identify unchanged sections (even if byte positions shift)
- Compress repeated patterns efficiently
- Handle binary data optimally
### Intelligent File Type Detection
```python
from pathlib import Path

DELTA_EXTENSIONS = {
    '.zip', '.tar', '.gz', '.tgz', '.bz2',  # Archives
    '.jar', '.war', '.ear',                 # Java
    '.dmg', '.pkg', '.deb', '.rpm',         # Packages
    '.iso', '.img', '.vhd',                 # Disk images
}

DIRECT_EXTENSIONS = {
    '.txt', '.md', '.json', '.xml',         # Text (use gzip)
    '.jpg', '.png', '.mp4',                 # Media (already compressed)
    '.sha1', '.sha256', '.md5',             # Checksums (unique)
}

def should_use_delta(file_path: Path) -> bool:
    """Determine whether a file should use delta compression."""
    # Small files: the delta overhead is not worth it
    if file_path.stat().st_size < 1_000_000:  # < 1MB
        return False

    ext = file_path.suffix.lower()
    if ext in DELTA_EXTENSIONS:
        return True
    elif ext in DIRECT_EXTENSIONS:
        return False
    else:
        # Unknown type: fall back to a content heuristic
        return is_likely_archive(file_path)
```
## Storage Strategy

### S3 Object Layout
```
bucket/
├── releases/
│   ├── v1.0.0/
│   │   ├── reference.bin            # First uploaded file (full)
│   │   ├── reference.bin.sha256     # Checksum
│   │   ├── app-linux.tar.gz.delta   # Delta from reference
│   │   ├── app-mac.dmg.delta        # Delta from reference
│   │   └── app-win.zip.delta        # Delta from reference
│   ├── v1.0.1/
│   │   ├── reference.bin            # New reference for this version
│   │   └── ...
│   └── v1.1.0/
│       └── ...
└── backups/
    └── ...
```
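Because everything is a plain S3 object, ordinary tooling can inspect a DeltaSpace. For instance, a sketch using boto3 (bucket and prefix are placeholders):

```python
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='releases/v1.0.0/')
for obj in resp.get('Contents', []):
    key = obj['Key']
    # Classify each key by the naming convention shown above
    if key.endswith('reference.bin'):
        kind = 'reference'
    elif key.endswith('.delta'):
        kind = 'delta'
    else:
        kind = 'other'
    print(f"{kind:9s} {obj['Size']:>12,d}  {key}")
```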
### Metadata Strategy
DeltaGlider stores metadata in S3 object tags/metadata:
```python
# For delta files
metadata = {
    "x-amz-meta-original-name": "app.zip",
    "x-amz-meta-original-size": "104857600",
    "x-amz-meta-original-hash": "sha256:abc123...",
    "x-amz-meta-reference-hash": "sha256:def456...",
    "x-amz-meta-tool-version": "deltaglider/0.1.0",
    "x-amz-meta-compression-ratio": "0.001",  # 0.1% of original
}
```
Benefits:
- No separate metadata store needed
- Atomic operations (metadata stored with object)
- Works with S3 versioning and lifecycle policies
- Queryable via S3 API
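A sketch of writing and reading that metadata with boto3 (placeholders for bucket, key, and `delta_bytes`). Note that boto3's `Metadata` parameter adds the `x-amz-meta-` prefix on the wire automatically, so keys are supplied without it:

```python
import boto3

s3 = boto3.client('s3')

# Write: boto3 prepends "x-amz-meta-" to each key on the wire
s3.put_object(
    Bucket='my-bucket',
    Key='releases/v1.0.0/app.zip.delta',
    Body=delta_bytes,
    Metadata={
        'original-name': 'app.zip',
        'original-size': '104857600',
        'original-hash': 'sha256:abc123...',
    },
)

# Read the metadata back without downloading the object body
head = s3.head_object(Bucket='my-bucket', Key='releases/v1.0.0/app.zip.delta')
print(head['Metadata']['original-name'])  # -> "app.zip"
```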
### Local Cache Strategy
```
/tmp/.deltaglider/cache/
├── references/
│   ├── sha256_abc123.bin   # Cached reference files
│   ├── sha256_def456.bin
│   └── ...
└── metadata.json           # Cache index
```
Cache benefits:
- Avoid repeated reference downloads
- Speed up delta creation for multiple files
- Reduce S3 API calls and bandwidth
## Performance Optimizations

### 1. Reference Caching
```python
import shutil
from pathlib import Path
from typing import Optional

class FsCacheAdapter:
    def get_reference(self, hash: str) -> Optional[Path]:
        cache_path = self.cache_dir / f"sha256_{hash}.bin"
        if cache_path.exists():
            # Verify integrity before trusting the cached copy
            if self.verify_hash(cache_path, hash):
                return cache_path
        return None

    def put_reference(self, hash: str, path: Path) -> None:
        cache_path = self.cache_dir / f"sha256_{hash}.bin"
        shutil.copy2(path, cache_path)
        # Update cache index
        self.update_index(hash, cache_path)
```
### 2. Streaming Operations
For large files, DeltaGlider uses streaming:
```python
import boto3
from pathlib import Path
from boto3.s3.transfer import TransferConfig

def upload_large_file(file_path: Path, bucket: str, key: str):
    # Stream the file to S3; boto3 switches to multipart upload
    # automatically above the threshold
    s3 = boto3.client('s3')
    with open(file_path, 'rb') as f:
        s3.upload_fileobj(
            f, bucket, key,
            Config=TransferConfig(
                multipart_threshold=25 * 1024 * 1024,  # 25MB
                max_concurrency=10,
                use_threads=True,
            ),
        )
```
### 3. Parallel Processing
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import List

def process_batch(files: List[Path]):
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_file, file) for file in files]
        for future in as_completed(futures):
            result = future.result()
            print(f"Processed: {result}")
```
### 4. Delta Ratio Optimization
```python
from pathlib import Path
from typing import Optional

MAX_RATIO = 0.5  # Store the original if the delta exceeds 50% of its size

def optimize_compression(file: Path, reference: Path) -> Optional[bytes]:
    # Create the delta
    delta = create_delta(reference, file)

    # Check compression effectiveness
    ratio = len(delta) / file.stat().st_size
    if ratio > MAX_RATIO:
        # Delta too large; the caller should store the original instead
        return None
    # Good compression: use the delta
    return delta
```
## Security & Integrity

### SHA256 Verification
Every operation includes checksum verification:
```python
import hashlib

def verify_integrity(data: bytes, expected_hash: str) -> bool:
    actual_hash = hashlib.sha256(data).hexdigest()
    return actual_hash == expected_hash

# Upload flow
file_hash = calculate_hash(file)
upload_to_s3(file, metadata={"hash": file_hash})

# Download flow
data, metadata = download_from_s3(key)
if not verify_integrity(data, metadata["hash"]):
    raise IntegrityError("File corrupted")
```
### Atomic Operations

Individual S3 writes are atomic, and DeltaGlider uploads through a temporary key so a failed transfer never leaves a partial object at the final key:
```python
import uuid
from pathlib import Path

def atomic_upload(file: Path, bucket: str, key: str):
    # Upload to a temporary key first
    temp_key = f"{key}.tmp.{uuid.uuid4()}"
    try:
        s3.upload_file(str(file), bucket, temp_key)
        # Move into place (S3 has no rename, so: copy, then delete)
        s3.copy_object(
            CopySource={'Bucket': bucket, 'Key': temp_key},
            Bucket=bucket,
            Key=key,
        )
        s3.delete_object(Bucket=bucket, Key=temp_key)
    except Exception:
        # Best-effort cleanup on failure
        try:
            s3.delete_object(Bucket=bucket, Key=temp_key)
        except Exception:
            pass
        raise
```
### Encryption Support
DeltaGlider respects S3 encryption settings:
```python
# Server-side encryption with S3-managed keys
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=data,
    ServerSideEncryption='AES256',
)

# Server-side encryption with KMS
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=data,
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId='arn:aws:kms:...',
)
```
## Comparison with Alternatives

### vs. S3 Versioning
| Aspect | DeltaGlider | S3 Versioning |
|---|---|---|
| Storage | Only stores deltas | Stores full copies |
| Compression | 99%+ for similar files | 0% |
| Cost | Minimal | $$ per version |
| Complexity | Transparent | Built-in |
| Recovery | Download + reconstruct | Direct download |
### vs. Git LFS
| Aspect | DeltaGlider | Git LFS |
|---|---|---|
| Use case | Any S3 storage | Git repositories |
| Compression | Binary delta | Deduplication |
| Integration | S3 API | Git workflow |
| Scalability | Unlimited | Repository-bound |
### vs. Deduplication Systems
| Aspect | DeltaGlider | Dedup Systems |
|---|---|---|
| Approach | File-level delta | Block-level dedup |
| Compression | 99%+ for similar | 30-50% typical |
| Complexity | Simple | Complex |
| Cost | Open source | Enterprise $$$ |
### vs. Backup Tools (Restic/Borg)
| Aspect | DeltaGlider | Restic/Borg |
|---|---|---|
| Purpose | S3 optimization | Full backup |
| Storage | S3-native | Custom format |
| Granularity | File-level | Repository |
| Use case | Artifacts/releases | System backups |
## Advanced Topics

### Reference Rotation Strategy
Currently, the first file becomes the permanent reference. Future versions may implement:
```python
from typing import List

class ReferenceRotationStrategy:
    def should_rotate(self, stats: ReferenceStats) -> bool:
        # Rotate if the average delta ratio is too high
        if stats.avg_delta_ratio > 0.4:
            return True
        # Rotate if the reference is too old
        if stats.age_days > 90:
            return True
        # Rotate if a better candidate exists
        if stats.better_candidate_score > 0.8:
            return True
        return False

    def select_new_reference(self, files: List[FileStats]) -> FileStats:
        # Select the file that minimizes total delta size
        best_score = float('inf')
        best_file = None
        for candidate in files:
            total_delta_size = sum(
                compute_delta_size(candidate, other)
                for other in files
                if other != candidate
            )
            if total_delta_size < best_score:
                best_score = total_delta_size
                best_file = candidate
        return best_file
```
### Multi-Reference Support
For diverse file sets, multiple references could be used:
```python
from pathlib import Path
from typing import List

class MultiReferenceStrategy:
    def assign_reference(self, file: Path, references: List[Reference]) -> Reference:
        # Find the best-matching existing reference
        best_reference = None
        best_ratio = float('inf')
        for ref in references:
            delta = create_delta(ref.path, file)
            ratio = len(delta) / file.stat().st_size
            if ratio < best_ratio:
                best_ratio = ratio
                best_reference = ref

        # Create a new reference if no existing one compresses well
        if best_ratio > 0.5:
            return self.create_new_reference(file)
        return best_reference
```
### Incremental Delta Chains
For frequently updated files:
```python
class DeltaChain:
    """
    v1.0.0 (reference) <- v1.0.1 (delta) <- v1.0.2 (delta) <- v1.0.3 (delta)
    """

    def reconstruct(self, version: str) -> bytes:
        # Start with the reference
        data = self.load_reference()
        # Apply deltas in sequence up to the requested version
        for delta in self.get_delta_chain(version):
            data = apply_delta(data, delta)
        return data
```
## Monitoring & Observability

### Metrics to Track
```python
from dataclasses import dataclass

@dataclass
class CompressionMetrics:
    total_uploads: int
    total_original_size: int
    total_stored_size: int
    average_compression_ratio: float
    delta_files_count: int
    reference_files_count: int
    cache_hit_rate: float
    average_upload_time: float
    average_download_time: float
    failed_compressions: int
```
### Health Checks
```python
import shutil
from pathlib import Path

class HealthCheck:
    def check_xdelta3(self) -> bool:
        """Verify the xdelta3 binary is available."""
        return shutil.which('xdelta3') is not None

    def check_s3_access(self) -> bool:
        """Verify S3 credentials and permissions."""
        try:
            s3.list_buckets()
            return True
        except Exception:
            return False

    def check_cache_space(self) -> bool:
        """Verify adequate cache space."""
        cache_dir = Path('/tmp/.deltaglider/cache')
        # Ensure the directory exists before querying disk usage
        cache_dir.mkdir(parents=True, exist_ok=True)
        free_space = shutil.disk_usage(cache_dir).free
        return free_space > 1_000_000_000  # 1GB minimum
```
## Future Enhancements
- Cloud-Native Reference Management: Store references in distributed cache
- Rust Implementation: 10x performance improvement
- Automatic Similarity Detection: ML-based reference selection
- Multi-Threaded Compression: Parallel delta generation
- WASM Support: Browser-based delta compression
- S3 Batch Operations: Bulk compression of existing data
- Compression Prediction: Estimate compression before upload
- Adaptive Strategies: Auto-tune based on workload patterns
## Contributing
See CONTRIBUTING.md for development setup and guidelines.