Add simplified SDK client API and comprehensive documentation

- Create DeltaGliderClient with user-friendly interface
- Add create_client() factory function with sensible defaults
- Implement UploadSummary dataclass with helpful properties
- Expose simplified API through main package
- Add comprehensive SDK documentation under docs/sdk/:
  - Getting started guide with installation and examples
  - Complete API reference documentation
  - Real-world usage examples for 8 common scenarios
  - Architecture deep dive explaining how DeltaGlider works
  - Automatic documentation generation scripts
- Update CONTRIBUTING.md with SDK documentation guidelines
- All tests pass and code quality checks succeed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Simone Scarduzio
2025-09-23 13:44:38 +02:00
parent f08960b6c5
commit edcbd2c7d0
10 changed files with 3136 additions and 1 deletion


@@ -102,11 +102,27 @@ uv run pytest -m e2e
- Use type hints for all function signatures
- Write docstrings for all public functions and classes
## Documentation
### SDK Documentation
The SDK documentation is located in `docs/sdk/` and includes:
- Getting Started guide
- API Reference
- Examples and use cases
- Architecture overview
When making changes to the Python SDK:
1. Update relevant documentation in `docs/sdk/`
2. Update docstrings in the code
3. Run `make generate` in `docs/sdk/` to update auto-generated docs
## Pull Request Process
1. Update the README.md with details of changes to the interface, if applicable
2. Update the docs/ with any new functionality
3. Update SDK documentation if you've modified the client API
4. The PR will be merged once you have the sign-off of at least one maintainer
## Performance Considerations

docs/sdk/Makefile Normal file

@@ -0,0 +1,36 @@
# Makefile for DeltaGlider SDK Documentation

.PHONY: all clean generate serve

# Default target
all: generate

# Generate documentation
generate:
	@echo "Generating SDK documentation..."
	python generate_docs.py
	@echo "Documentation generated successfully!"

# Clean generated files
clean:
	@echo "Cleaning generated documentation..."
	rm -f generated_api.md module_index.json
	@echo "Clean complete!"

# Serve documentation locally (requires Python http.server)
serve:
	@echo "Starting documentation server at http://localhost:8000/docs/sdk/"
	cd ../.. && python -m http.server 8000

# Install documentation dependencies
install-deps:
	pip install pdoc3 sphinx sphinx-rtd-theme

# Generate full HTML documentation with pdoc
html:
	pdoc3 --html --output-dir html ../../src/deltaglider

# Generate with Sphinx (future enhancement)
sphinx:
	@echo "Sphinx documentation generation not yet configured"
	@echo "Run 'make install-deps' then 'sphinx-quickstart' to set up"

docs/sdk/README.md Normal file

@@ -0,0 +1,122 @@
# DeltaGlider Python SDK Documentation
The DeltaGlider Python SDK provides a simple, intuitive interface for integrating delta compression into your Python applications. Whether you're managing software releases, database backups, or any versioned binary data, DeltaGlider can reduce your storage costs by up to 99%.
## Quick Links
- [Getting Started](getting-started.md) - Installation and first steps
- [Examples](examples.md) - Real-world usage patterns
- [API Reference](api.md) - Complete API documentation
- [Architecture](architecture.md) - How it works under the hood
## Overview
DeltaGlider provides two ways to interact with your S3 storage:
### 1. CLI (Command Line Interface)
Drop-in replacement for AWS S3 CLI with automatic delta compression:
```bash
deltaglider cp my-app-v1.0.0.zip s3://releases/
deltaglider ls s3://releases/
deltaglider sync ./builds/ s3://releases/
```
### 2. Python SDK
Programmatic interface for Python applications:
```python
from deltaglider import create_client
client = create_client()
summary = client.upload("my-app-v1.0.0.zip", "s3://releases/v1.0.0/")
print(f"Compressed from {summary.original_size_mb:.1f}MB to {summary.stored_size_mb:.1f}MB")
```
## Key Features
- **99%+ Compression**: For versioned artifacts and similar files
- **Drop-in Replacement**: Works with existing AWS S3 workflows
- **Intelligent Detection**: Automatically determines when to use delta compression
- **Data Integrity**: SHA256 verification on every operation
- **S3 Compatible**: Works with AWS, MinIO, Cloudflare R2, and other S3-compatible storage
## When to Use DeltaGlider
### Perfect For
- Software releases and versioned artifacts
- Container images and layers
- Database backups and snapshots
- Machine learning model checkpoints
- Game assets and updates
- Any versioned binary data
### Not Ideal For
- Already compressed unique files
- Streaming media files
- Frequently changing unstructured data
- Files smaller than 1MB
## Installation
```bash
pip install deltaglider
```
For development or testing with MinIO:
```bash
docker run -p 9000:9000 minio/minio server /data
export AWS_ENDPOINT_URL=http://localhost:9000
```
## Basic Usage
### Simple Upload/Download
```python
from deltaglider import create_client
# Create client (uses AWS credentials from environment)
client = create_client()
# Upload a file
summary = client.upload("release-v2.0.0.zip", "s3://releases/v2.0.0/")
print(f"Saved {summary.savings_percent:.0f}% storage space")
# Download a file
client.download("s3://releases/v2.0.0/release-v2.0.0.zip", "local-copy.zip")
```
### With Custom Configuration
```python
from deltaglider import create_client
client = create_client(
    endpoint_url="http://minio.internal:9000",  # Custom S3 endpoint
    log_level="DEBUG",                          # Detailed logging
    cache_dir="/var/cache/deltaglider",         # Custom cache location
)
```
## How It Works
1. **First Upload**: The first file uploaded to a prefix becomes the reference
2. **Delta Compression**: Subsequent similar files are compared using xdelta3
3. **Smart Storage**: Only the differences (deltas) are stored
4. **Transparent Reconstruction**: Files are automatically reconstructed on download (see the sketch below)
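A minimal sketch of that flow using the SDK (bucket name, file names, and the printed values are illustrative):
```python
from deltaglider import create_client

client = create_client()

# First upload to a prefix becomes the full reference object.
v1 = client.upload("builds/app-v1.0.0.zip", "s3://releases/myapp/")
print(v1.operation, v1.is_delta)   # e.g. "PUT", False

# A similar file uploaded to the same prefix is stored as an xdelta3 delta.
v2 = client.upload("builds/app-v1.0.1.zip", "s3://releases/myapp/")
print(v2.operation, v2.is_delta)   # e.g. "PUT_DELTA", True

# Downloads are reconstructed transparently from reference + delta.
client.download("s3://releases/myapp/app-v1.0.1.zip", "app-v1.0.1.zip")
```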
## Performance
Based on real-world usage:
- **Compression**: 99%+ for similar versions
- **Upload Speed**: 3-4 files/second
- **Download Speed**: <100ms reconstruction
- **Storage Savings**: 4TB → 5GB (ReadOnlyREST case study)
## Support
- GitHub Issues: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues)
- Documentation: [github.com/beshu-tech/deltaglider#readme](https://github.com/beshu-tech/deltaglider#readme)
## License
MIT License - See [LICENSE](https://github.com/beshu-tech/deltaglider/blob/main/LICENSE) for details.

docs/sdk/api.md Normal file

@@ -0,0 +1,583 @@
# DeltaGlider API Reference
Complete API documentation for the DeltaGlider Python SDK.
## Table of Contents
- [Client Creation](#client-creation)
- [DeltaGliderClient](#deltagliderclient)
- [UploadSummary](#uploadsummary)
- [DeltaService](#deltaservice)
- [Models](#models)
- [Exceptions](#exceptions)
## Client Creation
### `create_client`
Factory function to create a configured DeltaGlider client with sensible defaults.
```python
def create_client(
    endpoint_url: Optional[str] = None,
    log_level: str = "INFO",
    cache_dir: str = "/tmp/.deltaglider/cache",
    **kwargs
) -> DeltaGliderClient
```
#### Parameters
- **endpoint_url** (`Optional[str]`): S3 endpoint URL for MinIO, R2, or other S3-compatible storage. If None, uses AWS S3.
- **log_level** (`str`): Logging verbosity level. Options: "DEBUG", "INFO", "WARNING", "ERROR". Default: "INFO".
- **cache_dir** (`str`): Directory for local reference cache. Default: "/tmp/.deltaglider/cache".
- **kwargs**: Additional arguments passed to `DeltaService`:
- **tool_version** (`str`): Version string for metadata. Default: "deltaglider/0.1.0"
- **max_ratio** (`float`): Maximum acceptable delta/file ratio. Default: 0.5
#### Returns
`DeltaGliderClient`: Configured client instance ready for use.
#### Examples
```python
# Default AWS S3 configuration
client = create_client()
# Custom endpoint for MinIO
client = create_client(endpoint_url="http://localhost:9000")
# Debug mode with custom cache
client = create_client(
    log_level="DEBUG",
    cache_dir="/var/cache/deltaglider"
)
# Custom delta ratio threshold
client = create_client(max_ratio=0.3) # Only use delta if <30% of original
```
## DeltaGliderClient
Main client class for interacting with DeltaGlider.
### Constructor
```python
class DeltaGliderClient:
    def __init__(
        self,
        service: DeltaService,
        endpoint_url: Optional[str] = None
    )
```
**Note**: Use `create_client()` instead of instantiating directly.
### Methods
#### `upload`
Upload a file to S3 with automatic delta compression.
```python
def upload(
    self,
    file_path: str | Path,
    s3_url: str,
    tags: Optional[Dict[str, str]] = None,
    max_ratio: float = 0.5
) -> UploadSummary
```
##### Parameters
- **file_path** (`str | Path`): Local file path to upload.
- **s3_url** (`str`): S3 destination URL in format `s3://bucket/prefix/`.
- **tags** (`Optional[Dict[str, str]]`): S3 object tags to attach. (Future feature)
- **max_ratio** (`float`): Maximum acceptable delta/file size ratio. Default: 0.5.
##### Returns
`UploadSummary`: Object containing upload statistics and compression details.
##### Raises
- `FileNotFoundError`: If local file doesn't exist.
- `ValueError`: If S3 URL is invalid.
- `PermissionError`: If S3 access is denied.
##### Examples
```python
# Simple upload
summary = client.upload("app.zip", "s3://releases/v1.0.0/")

# With custom compression threshold
summary = client.upload(
    "large-file.tar.gz",
    "s3://backups/",
    max_ratio=0.3  # Only use delta if compression > 70%
)

# Check results
if summary.is_delta:
    print(f"Stored as delta: {summary.stored_size_mb:.1f} MB")
else:
    print(f"Stored as full file: {summary.original_size_mb:.1f} MB")
```
#### `download`
Download and reconstruct a file from S3.
```python
def download(
    self,
    s3_url: str,
    output_path: str | Path
) -> None
```
##### Parameters
- **s3_url** (`str`): S3 source URL in format `s3://bucket/key`.
- **output_path** (`str | Path`): Local destination path.
##### Returns
None. File is written to `output_path`.
##### Raises
- `ValueError`: If S3 URL is invalid or missing key.
- `FileNotFoundError`: If S3 object doesn't exist.
- `PermissionError`: If local path is not writable or S3 access denied.
##### Examples
```python
# Download a file
client.download("s3://releases/v1.0.0/app.zip", "downloaded.zip")
# Auto-detects .delta suffix if needed
client.download("s3://releases/v1.0.0/app.zip", "app.zip")
# Will try app.zip first, then app.zip.delta if not found
# Download to specific directory
from pathlib import Path
output = Path("/tmp/downloads/app.zip")
output.parent.mkdir(parents=True, exist_ok=True)
client.download("s3://releases/v1.0.0/app.zip", output)
```
#### `verify`
Verify the integrity of a stored file using SHA256 checksums.
```python
def verify(
    self,
    s3_url: str
) -> bool
```
##### Parameters
- **s3_url** (`str`): S3 URL of the file to verify.
##### Returns
`bool`: True if verification passed, False if corrupted.
##### Raises
- `ValueError`: If S3 URL is invalid.
- `FileNotFoundError`: If S3 object doesn't exist.
##### Examples
```python
# Verify file integrity
is_valid = client.verify("s3://releases/v1.0.0/app.zip")
if is_valid:
print("✓ File integrity verified")
else:
print("✗ File is corrupted!")
# Re-upload or investigate
```
#### `lifecycle_policy`
Set lifecycle policy for S3 prefix (placeholder for future implementation).
```python
def lifecycle_policy(
    self,
    s3_prefix: str,
    days_before_archive: int = 30,
    days_before_delete: int = 90
) -> None
```
**Note**: This method is a placeholder for future S3 lifecycle policy management.
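A usage sketch (the values are illustrative; the call currently records intent only and performs no S3 changes):
```python
# Placeholder API: accepted today, but no lifecycle rules are applied yet.
client.lifecycle_policy(
    "s3://releases/archive/",
    days_before_archive=60,
    days_before_delete=365,
)
```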
## UploadSummary
Data class containing upload operation results.
```python
@dataclass
class UploadSummary:
    operation: str            # Operation type: "PUT" or "PUT_DELTA"
    bucket: str               # S3 bucket name
    key: str                  # S3 object key
    original_size: int        # Original file size in bytes
    stored_size: int          # Actual stored size in bytes
    is_delta: bool            # Whether delta compression was used
    delta_ratio: float = 0.0  # Ratio of delta size to original
```
### Properties
#### `original_size_mb`
Original file size in megabytes.
```python
@property
def original_size_mb(self) -> float
```
#### `stored_size_mb`
Stored size in megabytes (after compression if applicable).
```python
@property
def stored_size_mb(self) -> float
```
#### `savings_percent`
Percentage saved through compression.
```python
@property
def savings_percent(self) -> float
```
### Example Usage
```python
summary = client.upload("app.zip", "s3://releases/")
print(f"Operation: {summary.operation}")
print(f"Location: s3://{summary.bucket}/{summary.key}")
print(f"Original: {summary.original_size_mb:.1f} MB")
print(f"Stored: {summary.stored_size_mb:.1f} MB")
print(f"Saved: {summary.savings_percent:.0f}%")
print(f"Delta used: {summary.is_delta}")
if summary.is_delta:
    print(f"Delta ratio: {summary.delta_ratio:.2%}")
```
## DeltaService
Core service class handling delta compression logic.
```python
class DeltaService:
    def __init__(
        self,
        storage: StoragePort,
        diff: DiffPort,
        hasher: HashPort,
        cache: CachePort,
        clock: ClockPort,
        logger: LoggerPort,
        metrics: MetricsPort,
        tool_version: str = "deltaglider/0.1.0",
        max_ratio: float = 0.5
    )
```
### Methods
#### `put`
Upload a file with automatic delta compression.
```python
def put(
    self,
    file: Path,
    delta_space: DeltaSpace,
    max_ratio: Optional[float] = None
) -> PutSummary
```
#### `get`
Download and reconstruct a file.
```python
def get(
    self,
    object_key: ObjectKey,
    output_path: Path
) -> GetSummary
```
#### `verify`
Verify file integrity.
```python
def verify(
    self,
    object_key: ObjectKey
) -> VerifyResult
```
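For completeness, a sketch of calling `DeltaService` directly through a client created with `create_client()` (bucket, prefix, and key names are illustrative; most applications should stay on the `DeltaGliderClient` methods above):

```python
from pathlib import Path

from deltaglider import DeltaSpace, ObjectKey, create_client

service = create_client().service  # reuse the fully wired DeltaService

put = service.put(Path("app-v1.0.1.zip"), DeltaSpace(bucket="releases", prefix="myapp"))
print(put.operation, put.delta_size)

get = service.get(ObjectKey(bucket="releases", key="myapp/app-v1.0.1.zip.delta"), Path("restored.zip"))
print(get.operation, get.reconstructed)
```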
## Models
### DeltaSpace
Represents a compression space in S3.
```python
@dataclass(frozen=True)
class DeltaSpace:
    bucket: str  # S3 bucket name
    prefix: str  # S3 prefix for related files
```
### ObjectKey
Represents an S3 object location.
```python
@dataclass(frozen=True)
class ObjectKey:
    bucket: str  # S3 bucket name
    key: str     # S3 object key
```
### PutSummary
Detailed upload operation results.
```python
@dataclass
class PutSummary:
operation: str # "PUT" or "PUT_DELTA"
bucket: str # S3 bucket
key: str # S3 key
file_size: int # Original file size
file_hash: str # SHA256 of original file
delta_size: Optional[int] # Size of delta (if used)
delta_hash: Optional[str] # SHA256 of delta
delta_ratio: Optional[float] # Delta/original ratio
reference_hash: Optional[str] # Reference file hash
```
### GetSummary
Download operation results.
```python
@dataclass
class GetSummary:
operation: str # "GET" or "GET_DELTA"
bucket: str # S3 bucket
key: str # S3 key
size: int # Downloaded size
hash: str # SHA256 hash
reconstructed: bool # Whether reconstruction was needed
```
### VerifyResult
Verification operation results.
```python
@dataclass
class VerifyResult:
    valid: bool                 # Verification result
    operation: str              # "VERIFY" or "VERIFY_DELTA"
    expected_hash: str          # Expected SHA256
    actual_hash: Optional[str]  # Actual SHA256 (if computed)
    details: Optional[str]      # Error details if invalid
```
## Exceptions
DeltaGlider uses standard Python exceptions with descriptive messages:
### Common Exceptions
- **FileNotFoundError**: Local file or S3 object not found
- **PermissionError**: Access denied (S3 or local filesystem)
- **ValueError**: Invalid parameters (malformed URLs, invalid ratios)
- **IOError**: I/O operations failed
- **RuntimeError**: xdelta3 binary not found or failed
### Exception Handling Example
```python
from deltaglider import create_client

client = create_client()

try:
    summary = client.upload("app.zip", "s3://bucket/path/")
except FileNotFoundError as e:
    print(f"File not found: {e}")
except PermissionError as e:
    print(f"Permission denied: {e}")
    print("Check AWS credentials and S3 bucket permissions")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except RuntimeError as e:
    print(f"System error: {e}")
    print("Ensure xdelta3 is installed: apt-get install xdelta3")
except Exception as e:
    print(f"Unexpected error: {e}")
    # Log for investigation
    import traceback
    traceback.print_exc()
```
## Environment Variables
DeltaGlider respects these environment variables:
### AWS Configuration
- **AWS_ACCESS_KEY_ID**: AWS access key
- **AWS_SECRET_ACCESS_KEY**: AWS secret key
- **AWS_DEFAULT_REGION**: AWS region (default: us-east-1)
- **AWS_ENDPOINT_URL**: Custom S3 endpoint (for MinIO/R2)
- **AWS_PROFILE**: AWS profile to use
### DeltaGlider Configuration
- **DG_LOG_LEVEL**: Logging level (DEBUG, INFO, WARNING, ERROR)
- **DG_CACHE_DIR**: Local cache directory
- **DG_MAX_RATIO**: Default maximum delta ratio
### Example
```bash
# Configure for MinIO
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
# Configure DeltaGlider
export DG_LOG_LEVEL=DEBUG
export DG_CACHE_DIR=/var/cache/deltaglider
export DG_MAX_RATIO=0.3
# Now use normally
python my_script.py
```
## Thread Safety
DeltaGlider clients are thread-safe for read operations but should not be shared across threads for write operations. For multi-threaded applications:
```python
import threading
from deltaglider import create_client

# Create separate client per thread
def worker(file_path, s3_url):
    client = create_client()  # Each thread gets its own client
    summary = client.upload(file_path, s3_url)
    print(f"Thread {threading.current_thread().name}: {summary.savings_percent:.0f}%")

# Create threads
threads = []
for i, (file, url) in enumerate(files_to_upload):
    t = threading.Thread(target=worker, args=(file, url), name=f"Worker-{i}")
    threads.append(t)
    t.start()

# Wait for completion
for t in threads:
    t.join()
```
## Performance Considerations
### Upload Performance
- **First file**: No compression overhead (becomes reference)
- **Similar files**: 3-4 files/second with compression
- **Network bound**: Limited by S3 upload speed
- **CPU bound**: xdelta3 compression for large files
### Download Performance
- **Direct files**: Limited by S3 download speed
- **Delta files**: <100ms reconstruction overhead
- **Cache hits**: Near-instant for cached references
### Optimization Tips
1. **Group related files**: Upload similar files to same prefix
2. **Batch operations**: Use concurrent uploads for independent files
3. **Cache management**: Don't clear cache during operations
4. **Compression threshold**: Tune `max_ratio` for your use case
5. **Network optimization**: Use S3 Transfer Acceleration if available
## Logging
DeltaGlider uses Python's standard logging framework:
```python
import logging

# Configure logging before creating client
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('deltaglider.log'),
        logging.StreamHandler()
    ]
)

# Create client (will use configured logging)
client = create_client(log_level="DEBUG")
```
### Log Levels
- **DEBUG**: Detailed operations, xdelta3 commands
- **INFO**: Normal operations, compression statistics
- **WARNING**: Non-critical issues, fallbacks
- **ERROR**: Operation failures, exceptions
## Version Compatibility
- **Python**: 3.11 or higher required
- **boto3**: 1.35.0 or higher
- **xdelta3**: System binary required
- **S3 API**: Compatible with S3 API v4
## Support
- **GitHub Issues**: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues)
- **Documentation**: [github.com/beshu-tech/deltaglider](https://github.com/beshu-tech/deltaglider)
- **PyPI Package**: [pypi.org/project/deltaglider](https://pypi.org/project/deltaglider)

docs/sdk/architecture.md Normal file

@@ -0,0 +1,648 @@
# DeltaGlider Architecture
Understanding how DeltaGlider achieves 99.9% compression through intelligent binary delta compression.
## Table of Contents
1. [Overview](#overview)
2. [Hexagonal Architecture](#hexagonal-architecture)
3. [Core Concepts](#core-concepts)
4. [Compression Algorithm](#compression-algorithm)
5. [Storage Strategy](#storage-strategy)
6. [Performance Optimizations](#performance-optimizations)
7. [Security & Integrity](#security--integrity)
8. [Comparison with Alternatives](#comparison-with-alternatives)
## Overview
DeltaGlider is built on a simple yet powerful idea: **most versioned files share 99% of their content**. Instead of storing complete files repeatedly, we store one reference file and only the differences (deltas) for similar files.
### High-Level Flow
```
First Upload (v1.0.0):
┌──────────┐ ┌─────────────┐ ┌──────┐
│ 100MB │───────▶│ DeltaGlider │──────▶│ S3 │
│ File │ │ │ │100MB │
└──────────┘ └─────────────┘ └──────┘
Second Upload (v1.0.1):
┌──────────┐ ┌─────────────┐ ┌──────┐
│ 100MB │───────▶│ DeltaGlider │──────▶│ S3 │
│ File │ │ (xdelta3) │ │ 98KB │
└──────────┘ └─────────────┘ └──────┘
Creates 98KB delta
by comparing with
v1.0.0 reference
```
## Hexagonal Architecture
DeltaGlider follows the hexagonal (ports and adapters) architecture pattern for maximum flexibility and testability.
### Architecture Diagram
```
┌─────────────────┐
│ Application │
│ (CLI / SDK) │
└────────┬────────┘
┌────────▼────────┐
│ │
│ DeltaService │
│ (Core Logic) │
│ │
└────┬─────┬──────┘
│ │
┌──────────▼─┬───▼──────────┐
│ │ │
│ Ports │ Ports │
│ (Interfaces)│ (Interfaces)│
│ │ │
└──────┬─────┴────┬─────────┘
│ │
┌───────────▼──┐ ┌───▼───────────┐
│ │ │ │
│ Adapters │ │ Adapters │
│ │ │ │
├──────────────┤ ├───────────────┤
│ S3Storage │ │ XdeltaDiff │
│ Sha256Hash │ │ FsCache │
│ UtcClock │ │ StdLogger │
│ NoopMetrics │ │ │
└──────────────┘ └───────────────┘
│ │
┌──────▼─────┐ ┌─────▼──────┐
│ AWS │ │ xdelta3 │
│ S3 │ │ binary │
└────────────┘ └────────────┘
```
### Ports (Interfaces)
Ports define contracts that adapters must implement:
```python
from pathlib import Path
from typing import Dict, Optional, Protocol, Tuple

# StoragePort - Abstract S3 operations
class StoragePort(Protocol):
    def put_object(self, bucket: str, key: str, data: bytes, metadata: Dict) -> None: ...
    def get_object(self, bucket: str, key: str) -> Tuple[bytes, Dict]: ...
    def object_exists(self, bucket: str, key: str) -> bool: ...
    def delete_object(self, bucket: str, key: str) -> None: ...

# DiffPort - Abstract delta operations
class DiffPort(Protocol):
    def create_delta(self, reference: bytes, target: bytes) -> bytes: ...
    def apply_delta(self, reference: bytes, delta: bytes) -> bytes: ...

# HashPort - Abstract integrity checks
class HashPort(Protocol):
    def hash(self, data: bytes) -> str: ...
    def hash_file(self, path: Path) -> str: ...

# CachePort - Abstract local caching
class CachePort(Protocol):
    def get(self, key: str) -> Optional[Path]: ...
    def put(self, key: str, path: Path) -> None: ...
    def exists(self, key: str) -> bool: ...
```
### Adapters (Implementations)
Adapters provide concrete implementations:
- **S3StorageAdapter**: Uses boto3 for S3 operations
- **XdeltaAdapter**: Wraps xdelta3 binary for delta compression
- **Sha256Adapter**: Provides SHA256 hashing
- **FsCacheAdapter**: File system based reference cache
- **UtcClockAdapter**: UTC timestamp provider
- **StdLoggerAdapter**: Console logging
### Benefits
1. **Testability**: Mock any adapter for unit testing (see the sketch after this list)
2. **Flexibility**: Swap implementations (e.g., different storage backends)
3. **Separation**: Business logic isolated from infrastructure
4. **Extensibility**: Add new adapters without changing core
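As a sketch of the testability benefit, the core service can be wired with stand-in ports (the mocks below are illustrative; real tests would use purpose-built fakes):
```python
from unittest.mock import MagicMock

from deltaglider import DeltaService

# Every port is a mock, so no S3 bucket, local cache, or xdelta3 binary is required.
service = DeltaService(
    storage=MagicMock(),
    diff=MagicMock(),
    hasher=MagicMock(),
    cache=MagicMock(),
    clock=MagicMock(),
    logger=MagicMock(),
    metrics=MagicMock(),
)
```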
## Core Concepts
### DeltaSpace
A DeltaSpace is an S3 prefix containing related files that share a common reference:
```python
@dataclass
class DeltaSpace:
bucket: str # S3 bucket
prefix: str # Prefix for related files
# Example:
# DeltaSpace(bucket="releases", prefix="myapp/v1/")
# Contains:
# - reference.bin (first uploaded file)
# - file1.zip.delta
# - file2.zip.delta
```
### Reference File
The first file uploaded to a DeltaSpace becomes the reference:
```
s3://bucket/prefix/reference.bin # Full file (e.g., 100MB)
s3://bucket/prefix/reference.bin.sha256 # Integrity checksum
```
### Delta Files
Subsequent files are stored as deltas:
```
s3://bucket/prefix/myfile.zip.delta # Delta file (e.g., 98KB)
Metadata (S3 tags):
- original_name: myfile.zip
- original_size: 104857600
- original_hash: abc123...
- reference_hash: def456...
- tool_version: deltaglider/0.1.0
```
## Compression Algorithm
### xdelta3: The Secret Sauce
DeltaGlider uses [xdelta3](http://xdelta.org/), a binary diff algorithm optimized for large files:
#### How xdelta3 Works
1. **Rolling Hash**: Scans reference file with a rolling hash window
2. **Block Matching**: Finds matching byte sequences at any offset
3. **Instruction Stream**: Generates copy/insert instructions
4. **Compression**: Further compresses the instruction stream
```
Original: ABCDEFGHIJKLMNOP
Modified: ABCXYZGHIJKLMNOP
Delta instructions:
- COPY 0-2 (ABC) # Copy bytes 0-2 from reference
- INSERT XYZ # Insert new bytes
- COPY 6-15 (GHIJKLMNOP) # Copy bytes 6-15 from reference
Delta size: ~10 bytes instead of 16 bytes
```
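The same round trip can be reproduced by hand with the xdelta3 command line, which is the binary the `XdeltaAdapter` wraps (file names are illustrative, and the exact flags the adapter uses internally are an implementation detail):
```python
import subprocess

# Encode: produce a delta of target.zip against reference.bin (illustrative names).
subprocess.run(
    ["xdelta3", "-e", "-s", "reference.bin", "target.zip", "target.zip.delta"],
    check=True,
)

# Decode: rebuild the original from the reference plus the delta.
subprocess.run(
    ["xdelta3", "-d", "-s", "reference.bin", "target.zip.delta", "rebuilt.zip"],
    check=True,
)
```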
#### Why xdelta3 Excels at Archives
Archive files (ZIP, TAR, JAR) have predictable structure:
```
ZIP Structure:
┌─────────────┐
│ Headers │ ← Usually identical between versions
├─────────────┤
│ File 1 │ ← May be unchanged
├─────────────┤
│ File 2 │ ← Small change
├─────────────┤
│ File 3 │ ← May be unchanged
├─────────────┤
│ Directory │ ← Structure mostly same
└─────────────┘
```
Even when one file changes inside the archive, xdelta3 can:
- Identify unchanged sections (even if byte positions shift)
- Compress repeated patterns efficiently
- Handle binary data optimally
### Intelligent File Type Detection
```python
def should_use_delta(file_path: Path) -> bool:
    """Determine if file should use delta compression."""
    # File size check
    if file_path.stat().st_size < 1_000_000:  # < 1MB
        return False  # Overhead not worth it

    # Extension-based detection
    DELTA_EXTENSIONS = {
        '.zip', '.tar', '.gz', '.tgz', '.bz2',  # Archives
        '.jar', '.war', '.ear',                 # Java
        '.dmg', '.pkg', '.deb', '.rpm',         # Packages
        '.iso', '.img', '.vhd',                 # Disk images
    }
    DIRECT_EXTENSIONS = {
        '.txt', '.md', '.json', '.xml',         # Text (use gzip)
        '.jpg', '.png', '.mp4',                 # Media (already compressed)
        '.sha1', '.sha256', '.md5',             # Checksums (unique)
    }

    ext = file_path.suffix.lower()
    if ext in DELTA_EXTENSIONS:
        return True
    elif ext in DIRECT_EXTENSIONS:
        return False
    else:
        # Unknown type - use heuristic
        return is_likely_archive(file_path)
```
## Storage Strategy
### S3 Object Layout
```
bucket/
├── releases/
│ ├── v1.0.0/
│ │ ├── reference.bin # First uploaded file (full)
│ │ ├── reference.bin.sha256 # Checksum
│ │ ├── app-linux.tar.gz.delta # Delta from reference
│ │ ├── app-mac.dmg.delta # Delta from reference
│ │ └── app-win.zip.delta # Delta from reference
│ ├── v1.0.1/
│ │ ├── reference.bin # New reference for this version
│ │ └── ...
│ └── v1.1.0/
│ └── ...
└── backups/
└── ...
```
### Metadata Strategy
DeltaGlider stores metadata in S3 object tags/metadata:
```python
# For delta files
metadata = {
"x-amz-meta-original-name": "app.zip",
"x-amz-meta-original-size": "104857600",
"x-amz-meta-original-hash": "sha256:abc123...",
"x-amz-meta-reference-hash": "sha256:def456...",
"x-amz-meta-tool-version": "deltaglider/0.1.0",
"x-amz-meta-compression-ratio": "0.001", # 0.1% of original
}
```
Benefits:
- No separate metadata store needed
- Atomic operations (metadata stored with object)
- Works with S3 versioning and lifecycle policies
- Queryable via S3 API (see the sketch below)
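A sketch of reading that metadata back with plain boto3 (bucket, key, and metadata key names follow the illustrative example above):
```python
import boto3

s3 = boto3.client("s3")

# User-defined metadata is returned without the "x-amz-meta-" prefix.
head = s3.head_object(Bucket="releases", Key="myapp/v1/app.zip.delta")
meta = head["Metadata"]

print(meta.get("original-name"), meta.get("compression-ratio"))
```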
### Local Cache Strategy
```
/tmp/.deltaglider/cache/
├── references/
│ ├── sha256_abc123.bin # Cached reference files
│ ├── sha256_def456.bin
│ └── ...
└── metadata.json # Cache index
```
Cache benefits:
- Avoid repeated reference downloads
- Speed up delta creation for multiple files
- Reduce S3 API calls and bandwidth
## Performance Optimizations
### 1. Reference Caching
```python
class FsCacheAdapter:
def get_reference(self, hash: str) -> Optional[Path]:
cache_path = self.cache_dir / f"sha256_{hash}.bin"
if cache_path.exists():
# Verify integrity
if self.verify_hash(cache_path, hash):
return cache_path
return None
def put_reference(self, hash: str, path: Path) -> None:
cache_path = self.cache_dir / f"sha256_{hash}.bin"
shutil.copy2(path, cache_path)
# Update cache index
self.update_index(hash, cache_path)
```
### 2. Streaming Operations
For large files, DeltaGlider uses streaming:
```python
def upload_large_file(file_path: Path, s3_url: str):
# Stream file to S3 using multipart upload
with open(file_path, 'rb') as f:
# boto3 automatically uses multipart for large files
s3.upload_fileobj(f, bucket, key,
Config=TransferConfig(
multipart_threshold=1024 * 25, # 25MB
max_concurrency=10,
use_threads=True))
```
### 3. Parallel Processing
```python
def process_batch(files: List[Path]):
with ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for file in files:
future = executor.submit(process_file, file)
futures.append(future)
for future in as_completed(futures):
result = future.result()
print(f"Processed: {result}")
```
### 4. Delta Ratio Optimization
```python
def optimize_compression(file: Path, reference: Path) -> bytes:
# Create delta
delta = create_delta(reference, file)
# Check compression effectiveness
ratio = len(delta) / file.stat().st_size
if ratio > MAX_RATIO: # Default: 0.5 (50%)
# Delta too large, store original
return None
else:
# Good compression, use delta
return delta
```
## Security & Integrity
### SHA256 Verification
Every operation includes checksum verification:
```python
def verify_integrity(data: bytes, expected_hash: str) -> bool:
actual_hash = hashlib.sha256(data).hexdigest()
return actual_hash == expected_hash
# Upload flow
file_hash = calculate_hash(file)
upload_to_s3(file, metadata={"hash": file_hash})
# Download flow
data, metadata = download_from_s3(key)
if not verify_integrity(data, metadata["hash"]):
raise IntegrityError("File corrupted")
```
### Atomic Operations
All S3 operations are atomic:
```python
def atomic_upload(file: Path, bucket: str, key: str):
try:
# Upload to temporary key
temp_key = f"{key}.tmp.{uuid.uuid4()}"
s3.upload_file(file, bucket, temp_key)
# Atomic rename (S3 copy + delete)
s3.copy_object(
CopySource={'Bucket': bucket, 'Key': temp_key},
Bucket=bucket,
Key=key
)
s3.delete_object(Bucket=bucket, Key=temp_key)
except Exception:
# Cleanup on failure
try:
s3.delete_object(Bucket=bucket, Key=temp_key)
except:
pass
raise
```
### Encryption Support
DeltaGlider respects S3 encryption settings:
```python
# Server-side encryption with S3-managed keys
s3.put_object(
Bucket=bucket,
Key=key,
Body=data,
ServerSideEncryption='AES256'
)
# Server-side encryption with KMS
s3.put_object(
Bucket=bucket,
Key=key,
Body=data,
ServerSideEncryption='aws:kms',
SSEKMSKeyId='arn:aws:kms:...'
)
```
## Comparison with Alternatives
### vs. S3 Versioning
| Aspect | DeltaGlider | S3 Versioning |
|--------|-------------|---------------|
| Storage | Only stores deltas | Stores full copies |
| Compression | 99%+ for similar files | 0% |
| Cost | Minimal | $$ per version |
| Complexity | Transparent | Built-in |
| Recovery | Download + reconstruct | Direct download |
### vs. Git LFS
| Aspect | DeltaGlider | Git LFS |
|--------|-------------|---------|
| Use case | Any S3 storage | Git repositories |
| Compression | Binary delta | Deduplication |
| Integration | S3 API | Git workflow |
| Scalability | Unlimited | Repository-bound |
### vs. Deduplication Systems
| Aspect | DeltaGlider | Dedup Systems |
|--------|-------------|---------------|
| Approach | File-level delta | Block-level dedup |
| Compression | 99%+ for similar | 30-50% typical |
| Complexity | Simple | Complex |
| Cost | Open source | Enterprise $$$ |
### vs. Backup Tools (Restic/Borg)
| Aspect | DeltaGlider | Restic/Borg |
|--------|-------------|-------------|
| Purpose | S3 optimization | Full backup |
| Storage | S3-native | Custom format |
| Granularity | File-level | Repository |
| Use case | Artifacts/releases | System backups |
## Advanced Topics
### Reference Rotation Strategy
Currently, the first file becomes the permanent reference. Future versions may implement:
```python
class ReferenceRotationStrategy:
def should_rotate(self, stats: ReferenceStats) -> bool:
# Rotate if average delta ratio is too high
if stats.avg_delta_ratio > 0.4:
return True
# Rotate if reference is too old
if stats.age_days > 90:
return True
# Rotate if better candidate exists
if stats.better_candidate_score > 0.8:
return True
return False
def select_new_reference(self, files: List[FileStats]) -> Path:
# Select file that minimizes total delta sizes
best_score = float('inf')
best_file = None
for candidate in files:
total_delta_size = sum(
compute_delta_size(candidate, other)
for other in files
if other != candidate
)
if total_delta_size < best_score:
best_score = total_delta_size
best_file = candidate
return best_file
```
### Multi-Reference Support
For diverse file sets, multiple references could be used:
```python
class MultiReferenceStrategy:
def assign_reference(self, file: Path, references: List[Reference]) -> Reference:
# Find best matching reference
best_reference = None
best_ratio = float('inf')
for ref in references:
delta = create_delta(ref.path, file)
ratio = len(delta) / file.stat().st_size
if ratio < best_ratio:
best_ratio = ratio
best_reference = ref
# Create new reference if no good match
if best_ratio > 0.5:
return self.create_new_reference(file)
return best_reference
```
### Incremental Delta Chains
For frequently updated files:
```python
class DeltaChain:
"""
v1.0.0 (reference) <- v1.0.1 (delta) <- v1.0.2 (delta) <- v1.0.3 (delta)
"""
def reconstruct(self, version: str) -> bytes:
# Start with reference
data = self.load_reference()
# Apply deltas in sequence
for delta in self.get_delta_chain(version):
data = apply_delta(data, delta)
return data
```
## Monitoring & Observability
### Metrics to Track
```python
@dataclass
class CompressionMetrics:
total_uploads: int
total_original_size: int
total_stored_size: int
average_compression_ratio: float
delta_files_count: int
reference_files_count: int
cache_hit_rate: float
average_upload_time: float
average_download_time: float
failed_compressions: int
```
### Health Checks
```python
class HealthCheck:
def check_xdelta3(self) -> bool:
"""Verify xdelta3 binary is available."""
return shutil.which('xdelta3') is not None
def check_s3_access(self) -> bool:
"""Verify S3 credentials and permissions."""
try:
s3.list_buckets()
return True
except:
return False
def check_cache_space(self) -> bool:
"""Verify adequate cache space."""
cache_dir = Path('/tmp/.deltaglider/cache')
free_space = shutil.disk_usage(cache_dir).free
return free_space > 1_000_000_000 # 1GB minimum
```
## Future Enhancements
1. **Cloud-Native Reference Management**: Store references in distributed cache
2. **Rust Implementation**: 10x performance improvement
3. **Automatic Similarity Detection**: ML-based reference selection
4. **Multi-Threaded Compression**: Parallel delta generation
5. **WASM Support**: Browser-based delta compression
6. **S3 Batch Operations**: Bulk compression of existing data
7. **Compression Prediction**: Estimate compression before upload
8. **Adaptive Strategies**: Auto-tune based on workload patterns
## Contributing
See [CONTRIBUTING.md](https://github.com/beshu-tech/deltaglider/blob/main/CONTRIBUTING.md) for development setup and guidelines.
## Additional Resources
- [xdelta3 Documentation](http://xdelta.org/)
- [S3 API Reference](https://docs.aws.amazon.com/s3/index.html)
- [Hexagonal Architecture](https://alistair.cockburn.us/hexagonal-architecture/)
- [Binary Diff Algorithms](https://en.wikipedia.org/wiki/Delta_encoding)

docs/sdk/examples.md Normal file

File diff suppressed because it is too large

docs/sdk/generate_docs.py Normal file

@@ -0,0 +1,130 @@
#!/usr/bin/env python3
"""
Generate API documentation for DeltaGlider SDK.
This script generates documentation from Python source code using introspection.
Can be extended to use tools like Sphinx, pdoc, or mkdocs.
"""
import ast
import json
import sys
from pathlib import Path
from typing import Dict, List, Any
def extract_docstrings(file_path: Path) -> Dict[str, Any]:
"""Extract docstrings and signatures from Python file."""
with open(file_path, 'r') as f:
tree = ast.parse(f.read(), filename=str(file_path))
docs = {
"module": ast.get_docstring(tree),
"classes": {},
"functions": {}
}
for node in ast.walk(tree):
if isinstance(node, ast.ClassDef):
class_docs = {
"docstring": ast.get_docstring(node),
"methods": {}
}
for item in node.body:
if isinstance(item, ast.FunctionDef):
method_doc = {
"docstring": ast.get_docstring(item),
"signature": get_function_signature(item)
}
class_docs["methods"][item.name] = method_doc
docs["classes"][node.name] = class_docs
elif isinstance(node, ast.FunctionDef) and node.col_offset == 0:
docs["functions"][node.name] = {
"docstring": ast.get_docstring(node),
"signature": get_function_signature(node)
}
return docs
def get_function_signature(node: ast.FunctionDef) -> str:
"""Extract function signature."""
args = []
for arg in node.args.args:
arg_str = arg.arg
if arg.annotation:
arg_str += f": {ast.unparse(arg.annotation)}"
args.append(arg_str)
defaults = node.args.defaults
if defaults:
for i, default in enumerate(defaults, start=len(args) - len(defaults)):
args[i] += f" = {ast.unparse(default)}"
return f"({', '.join(args)})"
def generate_markdown_docs(docs: Dict[str, Any], module_name: str) -> str:
"""Generate Markdown documentation from extracted docs."""
lines = [f"# {module_name} API Documentation\n"]
if docs["module"]:
lines.append(f"{docs['module']}\n")
if docs["functions"]:
lines.append("## Functions\n")
for name, func in docs["functions"].items():
lines.append(f"### `{name}{func['signature']}`\n")
if func["docstring"]:
lines.append(f"{func['docstring']}\n")
if docs["classes"]:
lines.append("## Classes\n")
for class_name, class_info in docs["classes"].items():
lines.append(f"### {class_name}\n")
if class_info["docstring"]:
lines.append(f"{class_info['docstring']}\n")
if class_info["methods"]:
lines.append("#### Methods\n")
for method_name, method_info in class_info["methods"].items():
lines.append(f"##### `{method_name}{method_info['signature']}`\n")
if method_info["docstring"]:
lines.append(f"{method_info['docstring']}\n")
return "\n".join(lines)
def main():
"""Generate documentation for DeltaGlider SDK."""
src_dir = Path(__file__).parent.parent.parent / "src" / "deltaglider"
# Extract documentation from client.py
client_docs = extract_docstrings(src_dir / "client.py")
# Generate API documentation
api_content = generate_markdown_docs(client_docs, "deltaglider.client")
# Save generated documentation
output_file = Path(__file__).parent / "generated_api.md"
with open(output_file, 'w') as f:
f.write(api_content)
print(f"Documentation generated: {output_file}")
# Generate index of all modules
modules = []
for py_file in src_dir.rglob("*.py"):
if not py_file.name.startswith("_"):
rel_path = py_file.relative_to(src_dir)
module_name = str(rel_path).replace("/", ".").replace(".py", "")
modules.append(module_name)
index_file = Path(__file__).parent / "module_index.json"
with open(index_file, 'w') as f:
json.dump({"modules": sorted(modules)}, f, indent=2)
print(f"Module index generated: {index_file}")
if __name__ == "__main__":
main()

docs/sdk/getting-started.md Normal file

@@ -0,0 +1,238 @@
# Getting Started with DeltaGlider SDK
This guide will help you get up and running with the DeltaGlider Python SDK in minutes.
## Prerequisites
- Python 3.11 or higher
- AWS credentials configured (or access to MinIO/S3-compatible storage)
- xdelta3 installed on your system (e.g., `apt-get install xdelta3`)
## Installation
### Using pip
```bash
pip install deltaglider
```
### Using uv (faster)
```bash
uv pip install deltaglider
```
### Development Installation
```bash
git clone https://github.com/beshu-tech/deltaglider
cd deltaglider
pip install -e ".[dev]"
```
## Configuration
### AWS Credentials
DeltaGlider uses standard AWS credential discovery:
1. **Environment Variables**
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```
2. **AWS Credentials File** (`~/.aws/credentials`)
```ini
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
region = us-west-2
```
3. **IAM Role** (when running on EC2/ECS/Lambda)
Automatically uses instance/task role credentials
### Custom S3 Endpoints
For MinIO, Cloudflare R2, or other S3-compatible storage:
```python
from deltaglider import create_client
client = create_client(endpoint_url="http://minio.local:9000")
```
Or via environment variable:
```bash
export AWS_ENDPOINT_URL=http://minio.local:9000
```
## Your First Upload
### Basic Example
```python
from deltaglider import create_client
# Create a client
client = create_client()
# Upload a file
summary = client.upload(
    file_path="my-app-v1.0.0.zip",
    s3_url="s3://my-bucket/releases/v1.0.0/"
)
# Check the results
print(f"Upload completed!")
print(f"Original size: {summary.original_size_mb:.1f} MB")
print(f"Stored size: {summary.stored_size_mb:.1f} MB")
print(f"Compression: {summary.savings_percent:.0f}%")
print(f"Is delta: {summary.is_delta}")
```
### Understanding the Results
When you upload a file, DeltaGlider returns an `UploadSummary` with the fields below (a short example follows the list):
- `operation`: What was done (`PUT` for new reference, `PUT_DELTA` for delta)
- `original_size_mb`: Original file size in MB
- `stored_size_mb`: Actual size stored in S3
- `savings_percent`: Percentage of storage saved
- `is_delta`: Whether delta compression was used
- `delta_ratio`: Ratio of delta size to original (smaller is better)
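For example, a minimal check on the returned summary (file name, URL, and the 5% threshold are arbitrary illustrations):
```python
summary = client.upload("my-app-v1.0.1.zip", "s3://my-bucket/releases/v1.0.1/")

if summary.operation == "PUT":
    print("Stored as the new reference for this prefix")
elif summary.is_delta and summary.delta_ratio < 0.05:
    print(f"Excellent compression: delta is {summary.delta_ratio:.1%} of the original")
else:
    print(f"Stored {summary.stored_size_mb:.1f} MB ({summary.savings_percent:.0f}% saved)")
```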
## Downloading Files
```python
# Download a file
client.download(
    s3_url="s3://my-bucket/releases/v1.0.0/my-app-v1.0.0.zip",
    output_path="downloaded-app.zip"
)
# The file is automatically reconstructed if it was stored as a delta
```
## Working with Multiple Versions
Here's where DeltaGlider shines - uploading multiple versions:
```python
from deltaglider import create_client
from pathlib import Path

client = create_client()

# Upload multiple versions
versions = ["v1.0.0", "v1.0.1", "v1.0.2", "v1.1.0"]
for version in versions:
    file = f"builds/my-app-{version}.zip"
    summary = client.upload(
        file_path=file,
        s3_url=f"s3://releases/{version}/"
    )
    if summary.is_delta:
        print(f"{version}: Compressed to {summary.stored_size_mb:.1f}MB "
              f"(saved {summary.savings_percent:.0f}%)")
    else:
        print(f"{version}: Stored as reference ({summary.original_size_mb:.1f}MB)")

# Typical output:
# v1.0.0: Stored as reference (100.0MB)
# v1.0.1: Compressed to 0.2MB (saved 99.8%)
# v1.0.2: Compressed to 0.3MB (saved 99.7%)
# v1.1.0: Compressed to 5.2MB (saved 94.8%)
```
## Verification
Verify the integrity of stored files:
```python
# Verify a stored file
is_valid = client.verify("s3://releases/v1.0.0/my-app-v1.0.0.zip")
print(f"File integrity: {'✓ Valid' if is_valid else '✗ Corrupted'}")
```
## Error Handling
```python
from deltaglider import create_client
client = create_client()
try:
    summary = client.upload("app.zip", "s3://bucket/path/")
except FileNotFoundError:
    print("Local file not found")
except PermissionError:
    print("S3 access denied - check credentials")
except Exception as e:
    print(f"Upload failed: {e}")
```
## Logging
Control logging verbosity:
```python
# Debug logging for troubleshooting
client = create_client(log_level="DEBUG")
# Quiet mode
client = create_client(log_level="WARNING")
# Default is INFO
client = create_client() # INFO level
```
## Local Testing with MinIO
For development and testing without AWS:
1. **Start MinIO**
```bash
docker run -p 9000:9000 -p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
minio/minio server /data --console-address ":9001"
```
2. **Create a bucket** (via MinIO Console at http://localhost:9001)
3. **Use DeltaGlider**
```python
from deltaglider import create_client
client = create_client(
endpoint_url="http://localhost:9000"
)
# Set credentials via environment
import os
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
# Now use normally
summary = client.upload("test.zip", "s3://test-bucket/")
```
## Best Practices
1. **Group Similar Files**: Upload related files to the same S3 prefix for optimal compression
2. **Version Naming**: Use consistent naming for versions (e.g., `app-v1.0.0.zip`, `app-v1.0.1.zip`)
3. **Cache Management**: The local reference cache improves performance - don't clear it unnecessarily
4. **Error Recovery**: Always handle exceptions for production code
5. **Monitoring**: Log compression ratios to track effectiveness (see the sketch below)
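A small sketch of practice 5, logging compression results for each upload (the logger name and message format are just one possible choice):
```python
import logging

from deltaglider import create_client

logger = logging.getLogger("release-uploads")
client = create_client()

summary = client.upload("my-app-v1.0.2.zip", "s3://releases/v1.0.2/")
logger.info(
    "uploaded %s: %.1f MB -> %.1f MB (%.0f%% saved, delta=%s)",
    summary.key,
    summary.original_size_mb,
    summary.stored_size_mb,
    summary.savings_percent,
    summary.is_delta,
)
```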
## Next Steps
- [Examples](examples.md) - See real-world usage patterns
- [API Reference](api.md) - Complete API documentation
- [Architecture](architecture.md) - Understand how it works

src/deltaglider/__init__.py

@@ -5,3 +5,16 @@ try:
except ImportError:
    # Package is not installed, so version is not available
    __version__ = "0.0.0+unknown"

# Import simplified client API
from .client import DeltaGliderClient, create_client
from .core import DeltaService, DeltaSpace, ObjectKey

__all__ = [
    "__version__",
    "DeltaGliderClient",
    "create_client",
    "DeltaService",
    "DeltaSpace",
    "ObjectKey",
]

src/deltaglider/client.py Normal file

@@ -0,0 +1,237 @@
"""Simplified client API for DeltaGlider."""
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from .adapters import (
FsCacheAdapter,
NoopMetricsAdapter,
S3StorageAdapter,
Sha256Adapter,
StdLoggerAdapter,
UtcClockAdapter,
XdeltaAdapter,
)
from .core import DeltaService, DeltaSpace, ObjectKey
@dataclass
class UploadSummary:
"""User-friendly upload summary."""
operation: str
bucket: str
key: str
original_size: int
stored_size: int
is_delta: bool
delta_ratio: float = 0.0
@property
def original_size_mb(self) -> float:
"""Original size in MB."""
return self.original_size / (1024 * 1024)
@property
def stored_size_mb(self) -> float:
"""Stored size in MB."""
return self.stored_size / (1024 * 1024)
@property
def savings_percent(self) -> float:
"""Percentage saved through compression."""
if self.original_size == 0:
return 0.0
return ((self.original_size - self.stored_size) / self.original_size) * 100
class DeltaGliderClient:
"""Simplified client for DeltaGlider operations."""
def __init__(self, service: DeltaService, endpoint_url: str | None = None):
"""Initialize client with service."""
self.service = service
self.endpoint_url = endpoint_url
def upload(
self,
file_path: str | Path,
s3_url: str,
tags: dict[str, str] | None = None,
max_ratio: float = 0.5,
) -> UploadSummary:
"""Upload a file to S3 with automatic delta compression.
Args:
file_path: Local file to upload
s3_url: S3 destination URL (s3://bucket/prefix/)
tags: Optional tags to add to the object
max_ratio: Maximum acceptable delta/file ratio (default 0.5)
Returns:
UploadSummary with compression statistics
"""
file_path = Path(file_path)
# Parse S3 URL
if not s3_url.startswith("s3://"):
raise ValueError(f"Invalid S3 URL: {s3_url}")
s3_path = s3_url[5:].rstrip("/")
parts = s3_path.split("/", 1)
bucket = parts[0]
prefix = parts[1] if len(parts) > 1 else ""
# Create delta space and upload
delta_space = DeltaSpace(bucket=bucket, prefix=prefix)
summary = self.service.put(file_path, delta_space, max_ratio)
# TODO: Add tags support when implemented
# Convert to user-friendly summary
is_delta = summary.delta_size is not None
stored_size = summary.delta_size if is_delta else summary.file_size
return UploadSummary(
operation=summary.operation,
bucket=summary.bucket,
key=summary.key,
original_size=summary.file_size,
stored_size=stored_size or summary.file_size, # Ensure stored_size is never None
is_delta=is_delta,
delta_ratio=summary.delta_ratio or 0.0,
)
def download(self, s3_url: str, output_path: str | Path) -> None:
"""Download and reconstruct a file from S3.
Args:
s3_url: S3 source URL (s3://bucket/key)
output_path: Local destination path
"""
output_path = Path(output_path)
# Parse S3 URL
if not s3_url.startswith("s3://"):
raise ValueError(f"Invalid S3 URL: {s3_url}")
s3_path = s3_url[5:]
parts = s3_path.split("/", 1)
if len(parts) < 2:
raise ValueError(f"S3 URL must include key: {s3_url}")
bucket = parts[0]
key = parts[1]
# Auto-append .delta if the file doesn't exist without it
# This allows users to specify the original name and we'll find the delta
obj_key = ObjectKey(bucket=bucket, key=key)
# Try to get metadata first to see if it exists
try:
self.service.get(obj_key, output_path)
except Exception:
# Try with .delta suffix
if not key.endswith(".delta"):
obj_key = ObjectKey(bucket=bucket, key=key + ".delta")
self.service.get(obj_key, output_path)
else:
raise
def verify(self, s3_url: str) -> bool:
"""Verify integrity of a stored file.
Args:
s3_url: S3 URL of the file to verify
Returns:
True if verification passed, False otherwise
"""
# Parse S3 URL
if not s3_url.startswith("s3://"):
raise ValueError(f"Invalid S3 URL: {s3_url}")
s3_path = s3_url[5:]
parts = s3_path.split("/", 1)
if len(parts) < 2:
raise ValueError(f"S3 URL must include key: {s3_url}")
bucket = parts[0]
key = parts[1]
obj_key = ObjectKey(bucket=bucket, key=key)
result = self.service.verify(obj_key)
return result.valid
def lifecycle_policy(
self, s3_prefix: str, days_before_archive: int = 30, days_before_delete: int = 90
) -> None:
"""Set lifecycle policy for a prefix (placeholder for future implementation).
Args:
s3_prefix: S3 prefix to apply policy to
days_before_archive: Days before transitioning to archive storage
days_before_delete: Days before deletion
"""
# TODO: Implement lifecycle policy management
# This would integrate with S3 lifecycle policies
# For now, this is a placeholder for the API
pass
def create_client(
endpoint_url: str | None = None,
log_level: str = "INFO",
cache_dir: str = "/tmp/.deltaglider/cache",
**kwargs: Any,
) -> DeltaGliderClient:
"""Create a DeltaGlider client with sensible defaults.
Args:
endpoint_url: Optional S3 endpoint URL (for MinIO, R2, etc.)
log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
cache_dir: Directory for reference cache
**kwargs: Additional arguments passed to DeltaService
Returns:
Configured DeltaGliderClient instance
Examples:
>>> # Use with AWS S3 (credentials from environment)
>>> client = create_client()
>>> # Use with MinIO
>>> client = create_client(endpoint_url="http://localhost:9000")
>>> # Use with debug logging
>>> client = create_client(log_level="DEBUG")
"""
# Create adapters
hasher = Sha256Adapter()
storage = S3StorageAdapter(endpoint_url=endpoint_url)
diff = XdeltaAdapter()
cache = FsCacheAdapter(Path(cache_dir), hasher)
clock = UtcClockAdapter()
logger = StdLoggerAdapter(level=log_level)
metrics = NoopMetricsAdapter()
# Get default values
tool_version = kwargs.pop("tool_version", "deltaglider/0.1.0")
max_ratio = kwargs.pop("max_ratio", 0.5)
# Create service
service = DeltaService(
storage=storage,
diff=diff,
hasher=hasher,
cache=cache,
clock=clock,
logger=logger,
metrics=metrics,
tool_version=tool_version,
max_ratio=max_ratio,
**kwargs,
)
return DeltaGliderClient(service, endpoint_url)