DeltaGlider Architecture

Understanding how DeltaGlider achieves 99.9% compression through intelligent binary delta compression.

Table of Contents

  1. Overview
  2. Hexagonal Architecture
  3. Core Concepts
  4. Compression Algorithm
  5. Storage Strategy
  6. Performance Optimizations
  7. Security & Integrity
  8. Comparison with Alternatives
  9. Advanced Topics
  10. Monitoring & Observability
  11. Future Enhancements

Overview

DeltaGlider is built on a simple yet powerful idea: most versioned files share 99% of their content. Instead of storing complete files repeatedly, we store one reference file and only the differences (deltas) for similar files.

High-Level Flow

First Upload (v1.0.0):
┌──────────┐        ┌─────────────┐       ┌──────┐
│  100MB   │───────▶│ DeltaGlider │──────▶│  S3  │
│   File   │        │             │       │100MB │
└──────────┘        └─────────────┘       └──────┘

Second Upload (v1.0.1):
┌──────────┐        ┌─────────────┐       ┌──────┐
│  100MB   │───────▶│ DeltaGlider │──────▶│  S3  │
│   File   │        │   (xdelta3) │       │ 98KB │
└──────────┘        └─────────────┘       └──────┘
                           │
                    Creates 98KB delta
                    by comparing with
                    v1.0.0 reference
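
Through the SDK, the same flow looks roughly like the sketch below. create_client() is the package's factory function; the upload method name and argument shape shown here are illustrative assumptions, not a confirmed client surface.

from deltaglider import create_client

client = create_client()  # assumed to use the standard AWS credential chain

# First call stores the full file as the reference;
# the second stores only a small delta against it.
client.upload("app-v1.0.0.zip", "s3://releases/myapp/v1/")  # hypothetical method
client.upload("app-v1.0.1.zip", "s3://releases/myapp/v1/")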

Hexagonal Architecture

DeltaGlider follows the hexagonal (ports and adapters) architecture pattern for maximum flexibility and testability.

Architecture Diagram

                    ┌─────────────────┐
                    │   Application   │
                    │   (CLI / SDK)   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │                 │
                    │   DeltaService  │
                    │   (Core Logic)  │
                    │                 │
                    └────┬─────┬──────┘
                         │     │
              ┌──────────▼─┬───▼──────────┐
              │            │              │
              │   Ports    │    Ports     │
              │(Interfaces)│ (Interfaces) │
              │            │              │
              └──────┬─────┴────┬─────────┘
                     │          │
          ┌──────────▼───┐  ┌───▼───────────┐
          │              │  │               │
          │   Adapters   │  │   Adapters    │
          │              │  │               │
          ├──────────────┤  ├───────────────┤
          │ S3Storage    │  │ XdeltaDiff    │
          │ Sha256Hash   │  │ FsCache       │
          │ UtcClock     │  │ StdLogger     │
          │ NoopMetrics  │  │               │
          └──────────────┘  └───────────────┘
                 │                 │
          ┌──────▼─────┐    ┌──────▼─────┐
          │    AWS     │    │   xdelta3  │
          │     S3     │    │   binary   │
          └────────────┘    └────────────┘

Ports (Interfaces)

Ports define contracts that adapters must implement:

from pathlib import Path
from typing import Dict, Optional, Protocol, Tuple

# StoragePort - Abstract S3 operations
class StoragePort(Protocol):
    def put_object(self, bucket: str, key: str, data: bytes, metadata: Dict) -> None: ...
    def get_object(self, bucket: str, key: str) -> Tuple[bytes, Dict]: ...
    def object_exists(self, bucket: str, key: str) -> bool: ...
    def delete_object(self, bucket: str, key: str) -> None: ...

# DiffPort - Abstract delta operations
class DiffPort(Protocol):
    def create_delta(self, reference: bytes, target: bytes) -> bytes: ...
    def apply_delta(self, reference: bytes, delta: bytes) -> bytes: ...

# HashPort - Abstract integrity checks
class HashPort(Protocol):
    def hash(self, data: bytes) -> str: ...
    def hash_file(self, path: Path) -> str: ...

# CachePort - Abstract local caching
class CachePort(Protocol):
    def get(self, key: str) -> Optional[Path]: ...
    def put(self, key: str, path: Path) -> None: ...
    def exists(self, key: str) -> bool: ...

Adapters (Implementations)

Adapters provide concrete implementations:

  • S3StorageAdapter: Uses boto3 for S3 operations
  • XdeltaAdapter: Wraps xdelta3 binary for delta compression
  • Sha256Adapter: Provides SHA256 hashing
  • FsCacheAdapter: File system based reference cache
  • UtcClockAdapter: UTC timestamp provider
  • StdLoggerAdapter: Console logging

Benefits

  1. Testability: Mock any adapter for unit testing (see the sketch after this list)
  2. Flexibility: Swap implementations (e.g., different storage backends)
  3. Separation: Business logic isolated from infrastructure
  4. Extensibility: Add new adapters without changing core
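
Because each port is a plain Protocol, a test double is just another adapter. Below is a minimal in-memory StoragePort sketch; how DeltaService receives its ports is an assumption here (constructor injection is one common wiring):

from typing import Dict, Tuple

class InMemoryStorageAdapter:
    """Satisfies StoragePort entirely in memory (test-only sketch)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, Dict]] = {}

    def put_object(self, bucket: str, key: str, data: bytes, metadata: Dict) -> None:
        self._objects[f"{bucket}/{key}"] = (data, metadata)

    def get_object(self, bucket: str, key: str) -> Tuple[bytes, Dict]:
        return self._objects[f"{bucket}/{key}"]

    def object_exists(self, bucket: str, key: str) -> bool:
        return f"{bucket}/{key}" in self._objects

    def delete_object(self, bucket: str, key: str) -> None:
        self._objects.pop(f"{bucket}/{key}", None)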

Core Concepts

DeltaSpace

A DeltaSpace is an S3 prefix containing related files that share a common reference:

from dataclasses import dataclass

@dataclass
class DeltaSpace:
    bucket: str  # S3 bucket
    prefix: str  # Prefix for related files

# Example:
# DeltaSpace(bucket="releases", prefix="myapp/v1/")
# Contains:
#   - reference.bin (first uploaded file)
#   - file1.zip.delta
#   - file2.zip.delta

Reference File

The first file uploaded to a DeltaSpace becomes the reference:

s3://bucket/prefix/reference.bin       # Full file (e.g., 100MB)
s3://bucket/prefix/reference.bin.sha256 # Integrity checksum

Delta Files

Subsequent files are stored as deltas:

s3://bucket/prefix/myfile.zip.delta    # Delta file (e.g., 98KB)

Metadata (S3 tags):
  - original_name: myfile.zip
  - original_size: 104857600
  - original_hash: abc123...
  - reference_hash: def456...
  - tool_version: deltaglider/0.1.0
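
Download reverses the process: fetch the reference and the delta, apply one to the other, and verify the result. A sketch in terms of the ports defined above (the key layout mirrors the examples; this helper itself is illustrative, not the SDK's actual code path):

def reconstruct(storage, diff, hasher, bucket: str, prefix: str, name: str) -> bytes:
    """Rebuild an original file from its reference and delta (sketch)."""
    reference, _ = storage.get_object(bucket, f"{prefix}reference.bin")
    delta, meta = storage.get_object(bucket, f"{prefix}{name}.delta")
    original = diff.apply_delta(reference, delta)
    if hasher.hash(original) != meta["original_hash"]:
        raise ValueError("integrity check failed after reconstruction")
    return original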

Compression Algorithm

xdelta3: The Secret Sauce

DeltaGlider uses xdelta3, a binary diff algorithm optimized for large files.

How xdelta3 Works

  1. Rolling Hash: Scans reference file with a rolling hash window
  2. Block Matching: Finds matching byte sequences at any offset
  3. Instruction Stream: Generates copy/insert instructions
  4. Compression: Further compresses the instruction stream

Original: ABCDEFGHIJKLMNOP
Modified: ABCXYZGHIJKLMNOP

Delta instructions:
- COPY 0-2 (ABC)       # Copy bytes 0-2 from reference
- INSERT XYZ           # Insert new bytes
- COPY 6-15 (GHIJKLMNOP) # Copy bytes 6-15 from reference

Delta size: ~10 bytes instead of 16 bytes
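
In practice, creating and applying a delta comes down to two invocations of the xdelta3 binary. The sketch below uses standard xdelta3 flags (-e encode, -d decode, -s reference, -f overwrite); the exact invocation DeltaGlider's XdeltaAdapter uses is an assumption:

import subprocess

def create_delta(reference: str, target: str, delta: str) -> None:
    # Encode: diff target against the reference, writing the delta file
    subprocess.run(["xdelta3", "-e", "-f", "-s", reference, target, delta], check=True)

def apply_delta(reference: str, delta: str, output: str) -> None:
    # Decode: apply the delta to the same reference to rebuild the file
    subprocess.run(["xdelta3", "-d", "-f", "-s", reference, delta, output], check=True)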

Why xdelta3 Excels at Archives

Archive files (ZIP, TAR, JAR) have predictable structure:

ZIP Structure:
┌─────────────┐
│  Headers    │ ← Usually identical between versions
├─────────────┤
│  File 1     │ ← May be unchanged
├─────────────┤
│  File 2     │ ← Small change
├─────────────┤
│  File 3     │ ← May be unchanged
├─────────────┤
│  Directory  │ ← Structure mostly same
└─────────────┘

Even when one file changes inside the archive, xdelta3 can:

  • Identify unchanged sections (even if byte positions shift)
  • Compress repeated patterns efficiently
  • Handle binary data optimally

Intelligent File Type Detection

from pathlib import Path

def should_use_delta(file_path: Path) -> bool:
    """Determine if file should use delta compression."""

    # File size check
    if file_path.stat().st_size < 1_000_000:  # < 1MB
        return False  # Overhead not worth it

    # Extension-based detection
    DELTA_EXTENSIONS = {
        '.zip', '.tar', '.gz', '.tgz', '.bz2',  # Archives
        '.jar', '.war', '.ear',                  # Java
        '.dmg', '.pkg', '.deb', '.rpm',         # Packages
        '.iso', '.img', '.vhd',                 # Disk images
    }

    DIRECT_EXTENSIONS = {
        '.txt', '.md', '.json', '.xml',         # Text (use gzip)
        '.jpg', '.png', '.mp4',                 # Media (already compressed)
        '.sha1', '.sha256', '.md5',             # Checksums (unique)
    }

    ext = file_path.suffix.lower()

    if ext in DELTA_EXTENSIONS:
        return True
    elif ext in DIRECT_EXTENSIONS:
        return False
    else:
        # Unknown type - use heuristic
        return is_likely_archive(file_path)  # content-sniffing heuristic (not shown)

Storage Strategy

S3 Object Layout

bucket/
├── releases/
│   ├── v1.0.0/
│   │   ├── reference.bin          # First uploaded file (full)
│   │   ├── reference.bin.sha256   # Checksum
│   │   ├── app-linux.tar.gz.delta # Delta from reference
│   │   ├── app-mac.dmg.delta      # Delta from reference
│   │   └── app-win.zip.delta      # Delta from reference
│   ├── v1.0.1/
│   │   ├── reference.bin          # New reference for this version
│   │   └── ...
│   └── v1.1.0/
│       └── ...
└── backups/
    └── ...

Metadata Strategy

DeltaGlider stores metadata in S3 object tags/metadata:

# For delta files
metadata = {
    "x-amz-meta-original-name": "app.zip",
    "x-amz-meta-original-size": "104857600",
    "x-amz-meta-original-hash": "sha256:abc123...",
    "x-amz-meta-reference-hash": "sha256:def456...",
    "x-amz-meta-tool-version": "deltaglider/0.1.0",
    "x-amz-meta-compression-ratio": "0.001",  # 0.1% of original
}

Benefits:

  • No separate metadata store needed
  • Atomic operations (metadata stored with object)
  • Works with S3 versioning and lifecycle policies
  • Queryable via S3 API
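
For example, the stored metadata can be inspected without downloading the object body. A short boto3 sketch (bucket and key are placeholders from the layout above; boto3's head_object exposes user metadata with the "x-amz-meta-" prefix stripped):

import boto3

s3 = boto3.client("s3")

head = s3.head_object(Bucket="releases", Key="v1.0.0/app-win.zip.delta")
print(head["Metadata"]["original-size"])       # "104857600"
print(head["Metadata"]["compression-ratio"])   # "0.001"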

Local Cache Strategy

/tmp/.deltaglider/cache/
├── references/
│   ├── sha256_abc123.bin    # Cached reference files
│   ├── sha256_def456.bin
│   └── ...
└── metadata.json             # Cache index

Cache benefits:

  • Avoid repeated reference downloads
  • Speed up delta creation for multiple files
  • Reduce S3 API calls and bandwidth

Performance Optimizations

1. Reference Caching

import shutil
from pathlib import Path
from typing import Optional

class FsCacheAdapter:
    # cache_dir, verify_hash, and update_index are adapter internals (not shown)
    def get_reference(self, hash: str) -> Optional[Path]:
        cache_path = self.cache_dir / f"sha256_{hash}.bin"
        if cache_path.exists():
            # Verify integrity
            if self.verify_hash(cache_path, hash):
                return cache_path
        return None

    def put_reference(self, hash: str, path: Path) -> None:
        cache_path = self.cache_dir / f"sha256_{hash}.bin"
        shutil.copy2(path, cache_path)
        # Update cache index
        self.update_index(hash, cache_path)

2. Streaming Operations

For large files, DeltaGlider uses streaming:

import boto3
from boto3.s3.transfer import TransferConfig
from pathlib import Path

s3 = boto3.client('s3')

def upload_large_file(file_path: Path, s3_url: str):
    # Split "s3://bucket/key" and stream the file via multipart upload
    bucket, key = s3_url.removeprefix("s3://").split("/", 1)
    with open(file_path, 'rb') as f:
        # boto3 automatically switches to multipart above the threshold
        s3.upload_fileobj(f, bucket, key,
                          Config=TransferConfig(
                              multipart_threshold=25 * 1024 * 1024,  # 25MB
                              max_concurrency=10,
                              use_threads=True))

3. Parallel Processing

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import List

def process_batch(files: List[Path]):
    # process_file is the per-file compress-and-upload routine (not shown)
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = []
        for file in files:
            future = executor.submit(process_file, file)
            futures.append(future)

        for future in as_completed(futures):
            result = future.result()
            print(f"Processed: {result}")

4. Delta Ratio Optimization

from pathlib import Path
from typing import Optional

MAX_RATIO = 0.5  # Default: reject deltas larger than 50% of the original

def optimize_compression(file: Path, reference: Path) -> Optional[bytes]:
    # Create delta (returns the delta bytes for `file` against `reference`)
    delta = create_delta(reference, file)

    # Check compression effectiveness
    ratio = len(delta) / file.stat().st_size

    if ratio > MAX_RATIO:
        # Delta too large, store original
        return None
    else:
        # Good compression, use delta
        return delta

Security & Integrity

SHA256 Verification

Every operation includes checksum verification:

import hashlib

def verify_integrity(data: bytes, expected_hash: str) -> bool:
    actual_hash = hashlib.sha256(data).hexdigest()
    return actual_hash == expected_hash

# Upload flow
file_hash = calculate_hash(file)
upload_to_s3(file, metadata={"hash": file_hash})

# Download flow
data, metadata = download_from_s3(key)
if not verify_integrity(data, metadata["hash"]):
    raise IntegrityError("File corrupted")

Atomic Operations

Individual S3 writes are atomic, and DeltaGlider stages each upload through a temporary key so the final key never exposes a partial object:

import uuid
from pathlib import Path

import boto3

s3 = boto3.client('s3')

def atomic_upload(file: Path, bucket: str, key: str):
    # Stage under a unique temporary key so the final key only ever
    # holds a complete object
    temp_key = f"{key}.tmp.{uuid.uuid4()}"
    try:
        s3.upload_file(str(file), bucket, temp_key)

        # "Rename" into place (S3 copy + delete; each step is atomic)
        s3.copy_object(
            CopySource={'Bucket': bucket, 'Key': temp_key},
            Bucket=bucket,
            Key=key
        )
        s3.delete_object(Bucket=bucket, Key=temp_key)

    except Exception:
        # Cleanup on failure
        try:
            s3.delete_object(Bucket=bucket, Key=temp_key)
        except Exception:
            pass
        raise

Encryption Support

DeltaGlider respects S3 encryption settings:

# Server-side encryption with S3-managed keys
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=data,
    ServerSideEncryption='AES256'
)

# Server-side encryption with KMS
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=data,
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId='arn:aws:kms:...'
)

Comparison with Alternatives

vs. S3 Versioning

Aspect        DeltaGlider              S3 Versioning
------        -----------              -------------
Storage       Only stores deltas       Stores full copies
Compression   99%+ for similar files   0%
Cost          Minimal                  $$ per version
Complexity    Transparent              Built-in
Recovery      Download + reconstruct   Direct download

vs. Git LFS

Aspect        DeltaGlider      Git LFS
------        -----------      -------
Use case      Any S3 storage   Git repositories
Compression   Binary delta     Deduplication
Integration   S3 API           Git workflow
Scalability   Unlimited        Repository-bound

vs. Deduplication Systems

Aspect        DeltaGlider        Dedup Systems
------        -----------        -------------
Approach      File-level delta   Block-level dedup
Compression   99%+ for similar   30-50% typical
Complexity    Simple             Complex
Cost          Open source        Enterprise $$$

vs. Backup Tools (Restic/Borg)

Aspect        DeltaGlider          Restic/Borg
------        -----------          -----------
Purpose       S3 optimization      Full backup
Storage       S3-native            Custom format
Granularity   File-level           Repository
Use case      Artifacts/releases   System backups

Advanced Topics

Reference Rotation Strategy

Currently, the first file becomes the permanent reference. Future versions may implement:

class ReferenceRotationStrategy:
    def should_rotate(self, stats: ReferenceStats) -> bool:
        # Rotate if average delta ratio is too high
        if stats.avg_delta_ratio > 0.4:
            return True

        # Rotate if reference is too old
        if stats.age_days > 90:
            return True

        # Rotate if better candidate exists
        if stats.better_candidate_score > 0.8:
            return True

        return False

    def select_new_reference(self, files: List[FileStats]) -> FileStats:
        # Select file that minimizes total delta sizes
        best_score = float('inf')
        best_file = None

        for candidate in files:
            total_delta_size = sum(
                compute_delta_size(candidate, other)
                for other in files
                if other != candidate
            )
            if total_delta_size < best_score:
                best_score = total_delta_size
                best_file = candidate

        return best_file

Multi-Reference Support

For diverse file sets, multiple references could be used:

class MultiReferenceStrategy:
    def assign_reference(self, file: Path, references: List[Reference]) -> Reference:
        # Find best matching reference
        best_reference = None
        best_ratio = float('inf')

        for ref in references:
            delta = create_delta(ref.path, file)
            ratio = len(delta) / file.stat().st_size

            if ratio < best_ratio:
                best_ratio = ratio
                best_reference = ref

        # Create new reference if no good match
        if best_ratio > 0.5:
            return self.create_new_reference(file)

        return best_reference

Incremental Delta Chains

For frequently updated files:

class DeltaChain:
    """
    v1.0.0 (reference) <- v1.0.1 (delta) <- v1.0.2 (delta) <- v1.0.3 (delta)
    """
    def reconstruct(self, version: str) -> bytes:
        # Start with reference
        data = self.load_reference()

        # Apply deltas in sequence
        for delta in self.get_delta_chain(version):
            data = apply_delta(data, delta)

        return data

Monitoring & Observability

Metrics to Track

from dataclasses import dataclass

@dataclass
class CompressionMetrics:
    total_uploads: int
    total_original_size: int
    total_stored_size: int
    average_compression_ratio: float
    delta_files_count: int
    reference_files_count: int
    cache_hit_rate: float
    average_upload_time: float
    average_download_time: float
    failed_compressions: int

Health Checks

import shutil
from pathlib import Path

import boto3

s3 = boto3.client('s3')

class HealthCheck:
    def check_xdelta3(self) -> bool:
        """Verify xdelta3 binary is available."""
        return shutil.which('xdelta3') is not None

    def check_s3_access(self) -> bool:
        """Verify S3 credentials and permissions."""
        try:
            s3.list_buckets()
            return True
        except Exception:
            return False

    def check_cache_space(self) -> bool:
        """Verify adequate cache space."""
        cache_dir = Path('/tmp/.deltaglider/cache')
        free_space = shutil.disk_usage(cache_dir).free
        return free_space > 1_000_000_000  # 1GB minimum

Future Enhancements

  1. Cloud-Native Reference Management: Store references in distributed cache
  2. Rust Implementation: 10x performance improvement
  3. Automatic Similarity Detection: ML-based reference selection
  4. Multi-Threaded Compression: Parallel delta generation
  5. WASM Support: Browser-based delta compression
  6. S3 Batch Operations: Bulk compression of existing data
  7. Compression Prediction: Estimate compression before upload
  8. Adaptive Strategies: Auto-tune based on workload patterns

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Additional Resources