mirror of
https://github.com/beshu-tech/deltaglider.git
synced 2026-01-11 14:40:26 +01:00
Add simplified SDK client API and comprehensive documentation
- Create DeltaGliderClient with user-friendly interface - Add create_client() factory function with sensible defaults - Implement UploadSummary dataclass with helpful properties - Expose simplified API through main package - Add comprehensive SDK documentation under docs/sdk/: - Getting started guide with installation and examples - Complete API reference documentation - Real-world usage examples for 8 common scenarios - Architecture deep dive explaining how DeltaGlider works - Automatic documentation generation scripts - Update CONTRIBUTING.md with SDK documentation guidelines - All tests pass and code quality checks succeed 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -102,11 +102,27 @@ uv run pytest -m e2e
|
||||
- Use type hints for all function signatures
|
||||
- Write docstrings for all public functions and classes
|
||||
|
||||
## Documentation
|
||||
|
||||
### SDK Documentation
|
||||
|
||||
The SDK documentation is located in `docs/sdk/` and includes:
|
||||
- Getting Started guide
|
||||
- API Reference
|
||||
- Examples and use cases
|
||||
- Architecture overview
|
||||
|
||||
When making changes to the Python SDK:
|
||||
1. Update relevant documentation in `docs/sdk/`
|
||||
2. Update docstrings in the code
|
||||
3. Run `make generate` in `docs/sdk/` to update auto-generated docs
|
||||
|
||||
## Pull Request Process
|
||||
|
||||
1. Update the README.md with details of changes to the interface, if applicable
|
||||
2. Update the docs/ with any new functionality
|
||||
3. The PR will be merged once you have the sign-off of at least one maintainer
|
||||
3. Update SDK documentation if you've modified the client API
|
||||
4. The PR will be merged once you have the sign-off of at least one maintainer
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
|
||||
36
docs/sdk/Makefile
Normal file
36
docs/sdk/Makefile
Normal file
@@ -0,0 +1,36 @@
|
||||
# Makefile for DeltaGlider SDK Documentation
|
||||
|
||||
.PHONY: all clean generate serve
|
||||
|
||||
# Default target
|
||||
all: generate
|
||||
|
||||
# Generate documentation
|
||||
generate:
|
||||
@echo "Generating SDK documentation..."
|
||||
python generate_docs.py
|
||||
@echo "Documentation generated successfully!"
|
||||
|
||||
# Clean generated files
|
||||
clean:
|
||||
@echo "Cleaning generated documentation..."
|
||||
rm -f generated_api.md module_index.json
|
||||
@echo "Clean complete!"
|
||||
|
||||
# Serve documentation locally (requires Python http.server)
|
||||
serve:
|
||||
@echo "Starting documentation server at http://localhost:8000/docs/sdk/"
|
||||
cd ../.. && python -m http.server 8000
|
||||
|
||||
# Install documentation dependencies
|
||||
install-deps:
|
||||
pip install pdoc3 sphinx sphinx-rtd-theme
|
||||
|
||||
# Generate full HTML documentation with pdoc
|
||||
html:
|
||||
pdoc3 --html --output-dir html ../../src/deltaglider
|
||||
|
||||
# Generate with Sphinx (future enhancement)
|
||||
sphinx:
|
||||
@echo "Sphinx documentation generation not yet configured"
|
||||
@echo "Run 'make install-deps' then 'sphinx-quickstart' to set up"
|
||||
122
docs/sdk/README.md
Normal file
122
docs/sdk/README.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# DeltaGlider Python SDK Documentation
|
||||
|
||||
The DeltaGlider Python SDK provides a simple, intuitive interface for integrating delta compression into your Python applications. Whether you're managing software releases, database backups, or any versioned binary data, DeltaGlider can reduce your storage costs by up to 99%.
|
||||
|
||||
## Quick Links
|
||||
|
||||
- [Getting Started](getting-started.md) - Installation and first steps
|
||||
- [Examples](examples.md) - Real-world usage patterns
|
||||
- [API Reference](api.md) - Complete API documentation
|
||||
- [Architecture](architecture.md) - How it works under the hood
|
||||
|
||||
## Overview
|
||||
|
||||
DeltaGlider provides two ways to interact with your S3 storage:
|
||||
|
||||
### 1. CLI (Command Line Interface)
|
||||
Drop-in replacement for AWS S3 CLI with automatic delta compression:
|
||||
```bash
|
||||
deltaglider cp my-app-v1.0.0.zip s3://releases/
|
||||
deltaglider ls s3://releases/
|
||||
deltaglider sync ./builds/ s3://releases/
|
||||
```
|
||||
|
||||
### 2. Python SDK
|
||||
Programmatic interface for Python applications:
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
summary = client.upload("my-app-v1.0.0.zip", "s3://releases/v1.0.0/")
|
||||
print(f"Compressed from {summary.original_size_mb:.1f}MB to {summary.stored_size_mb:.1f}MB")
|
||||
```
|
||||
|
||||
## Key Features
|
||||
|
||||
- **99%+ Compression**: For versioned artifacts and similar files
|
||||
- **Drop-in Replacement**: Works with existing AWS S3 workflows
|
||||
- **Intelligent Detection**: Automatically determines when to use delta compression
|
||||
- **Data Integrity**: SHA256 verification on every operation
|
||||
- **S3 Compatible**: Works with AWS, MinIO, Cloudflare R2, and other S3-compatible storage
|
||||
|
||||
## When to Use DeltaGlider
|
||||
|
||||
### Perfect For
|
||||
- Software releases and versioned artifacts
|
||||
- Container images and layers
|
||||
- Database backups and snapshots
|
||||
- Machine learning model checkpoints
|
||||
- Game assets and updates
|
||||
- Any versioned binary data
|
||||
|
||||
### Not Ideal For
|
||||
- Already compressed unique files
|
||||
- Streaming media files
|
||||
- Frequently changing unstructured data
|
||||
- Files smaller than 1MB
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install deltaglider
|
||||
```
|
||||
|
||||
For development or testing with MinIO:
|
||||
```bash
|
||||
docker run -p 9000:9000 minio/minio server /data
|
||||
export AWS_ENDPOINT_URL=http://localhost:9000
|
||||
```
|
||||
|
||||
## Basic Usage
|
||||
|
||||
### Simple Upload/Download
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
# Create client (uses AWS credentials from environment)
|
||||
client = create_client()
|
||||
|
||||
# Upload a file
|
||||
summary = client.upload("release-v2.0.0.zip", "s3://releases/v2.0.0/")
|
||||
print(f"Saved {summary.savings_percent:.0f}% storage space")
|
||||
|
||||
# Download a file
|
||||
client.download("s3://releases/v2.0.0/release-v2.0.0.zip", "local-copy.zip")
|
||||
```
|
||||
|
||||
### With Custom Configuration
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client(
|
||||
endpoint_url="http://minio.internal:9000", # Custom S3 endpoint
|
||||
log_level="DEBUG", # Detailed logging
|
||||
cache_dir="/var/cache/deltaglider", # Custom cache location
|
||||
)
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **First Upload**: The first file uploaded to a prefix becomes the reference
|
||||
2. **Delta Compression**: Subsequent similar files are compared using xdelta3
|
||||
3. **Smart Storage**: Only the differences (deltas) are stored
|
||||
4. **Transparent Reconstruction**: Files are automatically reconstructed on download
|
||||
|
||||
## Performance
|
||||
|
||||
Based on real-world usage:
|
||||
- **Compression**: 99%+ for similar versions
|
||||
- **Upload Speed**: 3-4 files/second
|
||||
- **Download Speed**: <100ms reconstruction
|
||||
- **Storage Savings**: 4TB → 5GB (ReadOnlyREST case study)
|
||||
|
||||
## Support
|
||||
|
||||
- GitHub Issues: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues)
|
||||
- Documentation: [github.com/beshu-tech/deltaglider#readme](https://github.com/beshu-tech/deltaglider#readme)
|
||||
|
||||
## License
|
||||
|
||||
MIT License - See [LICENSE](https://github.com/beshu-tech/deltaglider/blob/main/LICENSE) for details.
|
||||
583
docs/sdk/api.md
Normal file
583
docs/sdk/api.md
Normal file
@@ -0,0 +1,583 @@
|
||||
# DeltaGlider API Reference
|
||||
|
||||
Complete API documentation for the DeltaGlider Python SDK.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Client Creation](#client-creation)
|
||||
- [DeltaGliderClient](#deltaglidererclient)
|
||||
- [UploadSummary](#uploadsummary)
|
||||
- [DeltaService](#deltaservice)
|
||||
- [Models](#models)
|
||||
- [Exceptions](#exceptions)
|
||||
|
||||
## Client Creation
|
||||
|
||||
### `create_client`
|
||||
|
||||
Factory function to create a configured DeltaGlider client with sensible defaults.
|
||||
|
||||
```python
|
||||
def create_client(
|
||||
endpoint_url: Optional[str] = None,
|
||||
log_level: str = "INFO",
|
||||
cache_dir: str = "/tmp/.deltaglider/cache",
|
||||
**kwargs
|
||||
) -> DeltaGliderClient
|
||||
```
|
||||
|
||||
#### Parameters
|
||||
|
||||
- **endpoint_url** (`Optional[str]`): S3 endpoint URL for MinIO, R2, or other S3-compatible storage. If None, uses AWS S3.
|
||||
- **log_level** (`str`): Logging verbosity level. Options: "DEBUG", "INFO", "WARNING", "ERROR". Default: "INFO".
|
||||
- **cache_dir** (`str`): Directory for local reference cache. Default: "/tmp/.deltaglider/cache".
|
||||
- **kwargs**: Additional arguments passed to `DeltaService`:
|
||||
- **tool_version** (`str`): Version string for metadata. Default: "deltaglider/0.1.0"
|
||||
- **max_ratio** (`float`): Maximum acceptable delta/file ratio. Default: 0.5
|
||||
|
||||
#### Returns
|
||||
|
||||
`DeltaGliderClient`: Configured client instance ready for use.
|
||||
|
||||
#### Examples
|
||||
|
||||
```python
|
||||
# Default AWS S3 configuration
|
||||
client = create_client()
|
||||
|
||||
# Custom endpoint for MinIO
|
||||
client = create_client(endpoint_url="http://localhost:9000")
|
||||
|
||||
# Debug mode with custom cache
|
||||
client = create_client(
|
||||
log_level="DEBUG",
|
||||
cache_dir="/var/cache/deltaglider"
|
||||
)
|
||||
|
||||
# Custom delta ratio threshold
|
||||
client = create_client(max_ratio=0.3) # Only use delta if <30% of original
|
||||
```
|
||||
|
||||
## DeltaGliderClient
|
||||
|
||||
Main client class for interacting with DeltaGlider.
|
||||
|
||||
### Constructor
|
||||
|
||||
```python
|
||||
class DeltaGliderClient:
|
||||
def __init__(
|
||||
self,
|
||||
service: DeltaService,
|
||||
endpoint_url: Optional[str] = None
|
||||
)
|
||||
```
|
||||
|
||||
**Note**: Use `create_client()` instead of instantiating directly.
|
||||
|
||||
### Methods
|
||||
|
||||
#### `upload`
|
||||
|
||||
Upload a file to S3 with automatic delta compression.
|
||||
|
||||
```python
|
||||
def upload(
|
||||
self,
|
||||
file_path: str | Path,
|
||||
s3_url: str,
|
||||
tags: Optional[Dict[str, str]] = None,
|
||||
max_ratio: float = 0.5
|
||||
) -> UploadSummary
|
||||
```
|
||||
|
||||
##### Parameters
|
||||
|
||||
- **file_path** (`str | Path`): Local file path to upload.
|
||||
- **s3_url** (`str`): S3 destination URL in format `s3://bucket/prefix/`.
|
||||
- **tags** (`Optional[Dict[str, str]]`): S3 object tags to attach. (Future feature)
|
||||
- **max_ratio** (`float`): Maximum acceptable delta/file size ratio. Default: 0.5.
|
||||
|
||||
##### Returns
|
||||
|
||||
`UploadSummary`: Object containing upload statistics and compression details.
|
||||
|
||||
##### Raises
|
||||
|
||||
- `FileNotFoundError`: If local file doesn't exist.
|
||||
- `ValueError`: If S3 URL is invalid.
|
||||
- `PermissionError`: If S3 access is denied.
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Simple upload
|
||||
summary = client.upload("app.zip", "s3://releases/v1.0.0/")
|
||||
|
||||
# With custom compression threshold
|
||||
summary = client.upload(
|
||||
"large-file.tar.gz",
|
||||
"s3://backups/",
|
||||
max_ratio=0.3 # Only use delta if compression > 70%
|
||||
)
|
||||
|
||||
# Check results
|
||||
if summary.is_delta:
|
||||
print(f"Stored as delta: {summary.stored_size_mb:.1f} MB")
|
||||
else:
|
||||
print(f"Stored as full file: {summary.original_size_mb:.1f} MB")
|
||||
```
|
||||
|
||||
#### `download`
|
||||
|
||||
Download and reconstruct a file from S3.
|
||||
|
||||
```python
|
||||
def download(
|
||||
self,
|
||||
s3_url: str,
|
||||
output_path: str | Path
|
||||
) -> None
|
||||
```
|
||||
|
||||
##### Parameters
|
||||
|
||||
- **s3_url** (`str`): S3 source URL in format `s3://bucket/key`.
|
||||
- **output_path** (`str | Path`): Local destination path.
|
||||
|
||||
##### Returns
|
||||
|
||||
None. File is written to `output_path`.
|
||||
|
||||
##### Raises
|
||||
|
||||
- `ValueError`: If S3 URL is invalid or missing key.
|
||||
- `FileNotFoundError`: If S3 object doesn't exist.
|
||||
- `PermissionError`: If local path is not writable or S3 access denied.
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Download a file
|
||||
client.download("s3://releases/v1.0.0/app.zip", "downloaded.zip")
|
||||
|
||||
# Auto-detects .delta suffix if needed
|
||||
client.download("s3://releases/v1.0.0/app.zip", "app.zip")
|
||||
# Will try app.zip first, then app.zip.delta if not found
|
||||
|
||||
# Download to specific directory
|
||||
from pathlib import Path
|
||||
output = Path("/tmp/downloads/app.zip")
|
||||
output.parent.mkdir(parents=True, exist_ok=True)
|
||||
client.download("s3://releases/v1.0.0/app.zip", output)
|
||||
```
|
||||
|
||||
#### `verify`
|
||||
|
||||
Verify the integrity of a stored file using SHA256 checksums.
|
||||
|
||||
```python
|
||||
def verify(
|
||||
self,
|
||||
s3_url: str
|
||||
) -> bool
|
||||
```
|
||||
|
||||
##### Parameters
|
||||
|
||||
- **s3_url** (`str`): S3 URL of the file to verify.
|
||||
|
||||
##### Returns
|
||||
|
||||
`bool`: True if verification passed, False if corrupted.
|
||||
|
||||
##### Raises
|
||||
|
||||
- `ValueError`: If S3 URL is invalid.
|
||||
- `FileNotFoundError`: If S3 object doesn't exist.
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Verify file integrity
|
||||
is_valid = client.verify("s3://releases/v1.0.0/app.zip")
|
||||
|
||||
if is_valid:
|
||||
print("✓ File integrity verified")
|
||||
else:
|
||||
print("✗ File is corrupted!")
|
||||
# Re-upload or investigate
|
||||
```
|
||||
|
||||
#### `lifecycle_policy`
|
||||
|
||||
Set lifecycle policy for S3 prefix (placeholder for future implementation).
|
||||
|
||||
```python
|
||||
def lifecycle_policy(
|
||||
self,
|
||||
s3_prefix: str,
|
||||
days_before_archive: int = 30,
|
||||
days_before_delete: int = 90
|
||||
) -> None
|
||||
```
|
||||
|
||||
**Note**: This method is a placeholder for future S3 lifecycle policy management.
|
||||
|
||||
## UploadSummary
|
||||
|
||||
Data class containing upload operation results.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class UploadSummary:
|
||||
operation: str # Operation type: "PUT" or "PUT_DELTA"
|
||||
bucket: str # S3 bucket name
|
||||
key: str # S3 object key
|
||||
original_size: int # Original file size in bytes
|
||||
stored_size: int # Actual stored size in bytes
|
||||
is_delta: bool # Whether delta compression was used
|
||||
delta_ratio: float = 0.0 # Ratio of delta size to original
|
||||
```
|
||||
|
||||
### Properties
|
||||
|
||||
#### `original_size_mb`
|
||||
|
||||
Original file size in megabytes.
|
||||
|
||||
```python
|
||||
@property
|
||||
def original_size_mb(self) -> float
|
||||
```
|
||||
|
||||
#### `stored_size_mb`
|
||||
|
||||
Stored size in megabytes (after compression if applicable).
|
||||
|
||||
```python
|
||||
@property
|
||||
def stored_size_mb(self) -> float
|
||||
```
|
||||
|
||||
#### `savings_percent`
|
||||
|
||||
Percentage saved through compression.
|
||||
|
||||
```python
|
||||
@property
|
||||
def savings_percent(self) -> float
|
||||
```
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
|
||||
summary = client.upload("app.zip", "s3://releases/")
|
||||
|
||||
print(f"Operation: {summary.operation}")
|
||||
print(f"Location: s3://{summary.bucket}/{summary.key}")
|
||||
print(f"Original: {summary.original_size_mb:.1f} MB")
|
||||
print(f"Stored: {summary.stored_size_mb:.1f} MB")
|
||||
print(f"Saved: {summary.savings_percent:.0f}%")
|
||||
print(f"Delta used: {summary.is_delta}")
|
||||
|
||||
if summary.is_delta:
|
||||
print(f"Delta ratio: {summary.delta_ratio:.2%}")
|
||||
```
|
||||
|
||||
## DeltaService
|
||||
|
||||
Core service class handling delta compression logic.
|
||||
|
||||
```python
|
||||
class DeltaService:
|
||||
def __init__(
|
||||
self,
|
||||
storage: StoragePort,
|
||||
diff: DiffPort,
|
||||
hasher: HashPort,
|
||||
cache: CachePort,
|
||||
clock: ClockPort,
|
||||
logger: LoggerPort,
|
||||
metrics: MetricsPort,
|
||||
tool_version: str = "deltaglider/0.1.0",
|
||||
max_ratio: float = 0.5
|
||||
)
|
||||
```
|
||||
|
||||
### Methods
|
||||
|
||||
#### `put`
|
||||
|
||||
Upload a file with automatic delta compression.
|
||||
|
||||
```python
|
||||
def put(
|
||||
self,
|
||||
file: Path,
|
||||
delta_space: DeltaSpace,
|
||||
max_ratio: Optional[float] = None
|
||||
) -> PutSummary
|
||||
```
|
||||
|
||||
#### `get`
|
||||
|
||||
Download and reconstruct a file.
|
||||
|
||||
```python
|
||||
def get(
|
||||
self,
|
||||
object_key: ObjectKey,
|
||||
output_path: Path
|
||||
) -> GetSummary
|
||||
```
|
||||
|
||||
#### `verify`
|
||||
|
||||
Verify file integrity.
|
||||
|
||||
```python
|
||||
def verify(
|
||||
self,
|
||||
object_key: ObjectKey
|
||||
) -> VerifyResult
|
||||
```
|
||||
|
||||
## Models
|
||||
|
||||
### DeltaSpace
|
||||
|
||||
Represents a compression space in S3.
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class DeltaSpace:
|
||||
bucket: str # S3 bucket name
|
||||
prefix: str # S3 prefix for related files
|
||||
```
|
||||
|
||||
### ObjectKey
|
||||
|
||||
Represents an S3 object location.
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class ObjectKey:
|
||||
bucket: str # S3 bucket name
|
||||
key: str # S3 object key
|
||||
```
|
||||
|
||||
### PutSummary
|
||||
|
||||
Detailed upload operation results.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PutSummary:
|
||||
operation: str # "PUT" or "PUT_DELTA"
|
||||
bucket: str # S3 bucket
|
||||
key: str # S3 key
|
||||
file_size: int # Original file size
|
||||
file_hash: str # SHA256 of original file
|
||||
delta_size: Optional[int] # Size of delta (if used)
|
||||
delta_hash: Optional[str] # SHA256 of delta
|
||||
delta_ratio: Optional[float] # Delta/original ratio
|
||||
reference_hash: Optional[str] # Reference file hash
|
||||
```
|
||||
|
||||
### GetSummary
|
||||
|
||||
Download operation results.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class GetSummary:
|
||||
operation: str # "GET" or "GET_DELTA"
|
||||
bucket: str # S3 bucket
|
||||
key: str # S3 key
|
||||
size: int # Downloaded size
|
||||
hash: str # SHA256 hash
|
||||
reconstructed: bool # Whether reconstruction was needed
|
||||
```
|
||||
|
||||
### VerifyResult
|
||||
|
||||
Verification operation results.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class VerifyResult:
|
||||
valid: bool # Verification result
|
||||
operation: str # "VERIFY" or "VERIFY_DELTA"
|
||||
expected_hash: str # Expected SHA256
|
||||
actual_hash: Optional[str] # Actual SHA256 (if computed)
|
||||
details: Optional[str] # Error details if invalid
|
||||
```
|
||||
|
||||
## Exceptions
|
||||
|
||||
DeltaGlider uses standard Python exceptions with descriptive messages:
|
||||
|
||||
### Common Exceptions
|
||||
|
||||
- **FileNotFoundError**: Local file or S3 object not found
|
||||
- **PermissionError**: Access denied (S3 or local filesystem)
|
||||
- **ValueError**: Invalid parameters (malformed URLs, invalid ratios)
|
||||
- **IOError**: I/O operations failed
|
||||
- **RuntimeError**: xdelta3 binary not found or failed
|
||||
|
||||
### Exception Handling Example
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
try:
|
||||
summary = client.upload("app.zip", "s3://bucket/path/")
|
||||
|
||||
except FileNotFoundError as e:
|
||||
print(f"File not found: {e}")
|
||||
|
||||
except PermissionError as e:
|
||||
print(f"Permission denied: {e}")
|
||||
print("Check AWS credentials and S3 bucket permissions")
|
||||
|
||||
except ValueError as e:
|
||||
print(f"Invalid parameters: {e}")
|
||||
|
||||
except RuntimeError as e:
|
||||
print(f"System error: {e}")
|
||||
print("Ensure xdelta3 is installed: apt-get install xdelta3")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Unexpected error: {e}")
|
||||
# Log for investigation
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
DeltaGlider respects these environment variables:
|
||||
|
||||
### AWS Configuration
|
||||
|
||||
- **AWS_ACCESS_KEY_ID**: AWS access key
|
||||
- **AWS_SECRET_ACCESS_KEY**: AWS secret key
|
||||
- **AWS_DEFAULT_REGION**: AWS region (default: us-east-1)
|
||||
- **AWS_ENDPOINT_URL**: Custom S3 endpoint (for MinIO/R2)
|
||||
- **AWS_PROFILE**: AWS profile to use
|
||||
|
||||
### DeltaGlider Configuration
|
||||
|
||||
- **DG_LOG_LEVEL**: Logging level (DEBUG, INFO, WARNING, ERROR)
|
||||
- **DG_CACHE_DIR**: Local cache directory
|
||||
- **DG_MAX_RATIO**: Default maximum delta ratio
|
||||
|
||||
### Example
|
||||
|
||||
```bash
|
||||
# Configure for MinIO
|
||||
export AWS_ENDPOINT_URL=http://localhost:9000
|
||||
export AWS_ACCESS_KEY_ID=minioadmin
|
||||
export AWS_SECRET_ACCESS_KEY=minioadmin
|
||||
|
||||
# Configure DeltaGlider
|
||||
export DG_LOG_LEVEL=DEBUG
|
||||
export DG_CACHE_DIR=/var/cache/deltaglider
|
||||
export DG_MAX_RATIO=0.3
|
||||
|
||||
# Now use normally
|
||||
python my_script.py
|
||||
```
|
||||
|
||||
## Thread Safety
|
||||
|
||||
DeltaGlider clients are thread-safe for read operations but should not be shared across threads for write operations. For multi-threaded applications:
|
||||
|
||||
```python
|
||||
import threading
|
||||
from deltaglider import create_client
|
||||
|
||||
# Create separate client per thread
|
||||
def worker(file_path, s3_url):
|
||||
client = create_client() # Each thread gets its own client
|
||||
summary = client.upload(file_path, s3_url)
|
||||
print(f"Thread {threading.current_thread().name}: {summary.savings_percent:.0f}%")
|
||||
|
||||
# Create threads
|
||||
threads = []
|
||||
for i, (file, url) in enumerate(files_to_upload):
|
||||
t = threading.Thread(target=worker, args=(file, url), name=f"Worker-{i}")
|
||||
threads.append(t)
|
||||
t.start()
|
||||
|
||||
# Wait for completion
|
||||
for t in threads:
|
||||
t.join()
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Upload Performance
|
||||
|
||||
- **First file**: No compression overhead (becomes reference)
|
||||
- **Similar files**: 3-4 files/second with compression
|
||||
- **Network bound**: Limited by S3 upload speed
|
||||
- **CPU bound**: xdelta3 compression for large files
|
||||
|
||||
### Download Performance
|
||||
|
||||
- **Direct files**: Limited by S3 download speed
|
||||
- **Delta files**: <100ms reconstruction overhead
|
||||
- **Cache hits**: Near-instant for cached references
|
||||
|
||||
### Optimization Tips
|
||||
|
||||
1. **Group related files**: Upload similar files to same prefix
|
||||
2. **Batch operations**: Use concurrent uploads for independent files
|
||||
3. **Cache management**: Don't clear cache during operations
|
||||
4. **Compression threshold**: Tune `max_ratio` for your use case
|
||||
5. **Network optimization**: Use S3 Transfer Acceleration if available
|
||||
|
||||
## Logging
|
||||
|
||||
DeltaGlider uses Python's standard logging framework:
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
# Configure logging before creating client
|
||||
logging.basicConfig(
|
||||
level=logging.DEBUG,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('deltaglider.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
|
||||
# Create client (will use configured logging)
|
||||
client = create_client(log_level="DEBUG")
|
||||
```
|
||||
|
||||
### Log Levels
|
||||
|
||||
- **DEBUG**: Detailed operations, xdelta3 commands
|
||||
- **INFO**: Normal operations, compression statistics
|
||||
- **WARNING**: Non-critical issues, fallbacks
|
||||
- **ERROR**: Operation failures, exceptions
|
||||
|
||||
## Version Compatibility
|
||||
|
||||
- **Python**: 3.11 or higher required
|
||||
- **boto3**: 1.35.0 or higher
|
||||
- **xdelta3**: System binary required
|
||||
- **S3 API**: Compatible with S3 API v4
|
||||
|
||||
## Support
|
||||
|
||||
- **GitHub Issues**: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues)
|
||||
- **Documentation**: [github.com/beshu-tech/deltaglider](https://github.com/beshu-tech/deltaglider)
|
||||
- **PyPI Package**: [pypi.org/project/deltaglider](https://pypi.org/project/deltaglider)
|
||||
648
docs/sdk/architecture.md
Normal file
648
docs/sdk/architecture.md
Normal file
@@ -0,0 +1,648 @@
|
||||
# DeltaGlider Architecture
|
||||
|
||||
Understanding how DeltaGlider achieves 99.9% compression through intelligent binary delta compression.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Overview](#overview)
|
||||
2. [Hexagonal Architecture](#hexagonal-architecture)
|
||||
3. [Core Concepts](#core-concepts)
|
||||
4. [Compression Algorithm](#compression-algorithm)
|
||||
5. [Storage Strategy](#storage-strategy)
|
||||
6. [Performance Optimizations](#performance-optimizations)
|
||||
7. [Security & Integrity](#security--integrity)
|
||||
8. [Comparison with Alternatives](#comparison-with-alternatives)
|
||||
|
||||
## Overview
|
||||
|
||||
DeltaGlider is built on a simple yet powerful idea: **most versioned files share 99% of their content**. Instead of storing complete files repeatedly, we store one reference file and only the differences (deltas) for similar files.
|
||||
|
||||
### High-Level Flow
|
||||
|
||||
```
|
||||
First Upload (v1.0.0):
|
||||
┌──────────┐ ┌─────────────┐ ┌──────┐
|
||||
│ 100MB │───────▶│ DeltaGlider │──────▶│ S3 │
|
||||
│ File │ │ │ │100MB │
|
||||
└──────────┘ └─────────────┘ └──────┘
|
||||
|
||||
Second Upload (v1.0.1):
|
||||
┌──────────┐ ┌─────────────┐ ┌──────┐
|
||||
│ 100MB │───────▶│ DeltaGlider │──────▶│ S3 │
|
||||
│ File │ │ (xdelta3) │ │ 98KB │
|
||||
└──────────┘ └─────────────┘ └──────┘
|
||||
│
|
||||
Creates 98KB delta
|
||||
by comparing with
|
||||
v1.0.0 reference
|
||||
```
|
||||
|
||||
## Hexagonal Architecture
|
||||
|
||||
DeltaGlider follows the hexagonal (ports and adapters) architecture pattern for maximum flexibility and testability.
|
||||
|
||||
### Architecture Diagram
|
||||
|
||||
```
|
||||
┌─────────────────┐
|
||||
│ Application │
|
||||
│ (CLI / SDK) │
|
||||
└────────┬────────┘
|
||||
│
|
||||
┌────────▼────────┐
|
||||
│ │
|
||||
│ DeltaService │
|
||||
│ (Core Logic) │
|
||||
│ │
|
||||
└────┬─────┬──────┘
|
||||
│ │
|
||||
┌──────────▼─┬───▼──────────┐
|
||||
│ │ │
|
||||
│ Ports │ Ports │
|
||||
│ (Interfaces)│ (Interfaces)│
|
||||
│ │ │
|
||||
└──────┬─────┴────┬─────────┘
|
||||
│ │
|
||||
┌───────────▼──┐ ┌───▼───────────┐
|
||||
│ │ │ │
|
||||
│ Adapters │ │ Adapters │
|
||||
│ │ │ │
|
||||
├──────────────┤ ├───────────────┤
|
||||
│ S3Storage │ │ XdeltaDiff │
|
||||
│ Sha256Hash │ │ FsCache │
|
||||
│ UtcClock │ │ StdLogger │
|
||||
│ NoopMetrics │ │ │
|
||||
└──────────────┘ └───────────────┘
|
||||
│ │
|
||||
┌──────▼─────┐ ┌─────▼──────┐
|
||||
│ AWS │ │ xdelta3 │
|
||||
│ S3 │ │ binary │
|
||||
└────────────┘ └────────────┘
|
||||
```
|
||||
|
||||
### Ports (Interfaces)
|
||||
|
||||
Ports define contracts that adapters must implement:
|
||||
|
||||
```python
|
||||
# StoragePort - Abstract S3 operations
|
||||
class StoragePort(Protocol):
|
||||
def put_object(self, bucket: str, key: str, data: bytes, metadata: Dict) -> None
|
||||
def get_object(self, bucket: str, key: str) -> Tuple[bytes, Dict]
|
||||
def object_exists(self, bucket: str, key: str) -> bool
|
||||
def delete_object(self, bucket: str, key: str) -> None
|
||||
|
||||
# DiffPort - Abstract delta operations
|
||||
class DiffPort(Protocol):
|
||||
def create_delta(self, reference: bytes, target: bytes) -> bytes
|
||||
def apply_delta(self, reference: bytes, delta: bytes) -> bytes
|
||||
|
||||
# HashPort - Abstract integrity checks
|
||||
class HashPort(Protocol):
|
||||
def hash(self, data: bytes) -> str
|
||||
def hash_file(self, path: Path) -> str
|
||||
|
||||
# CachePort - Abstract local caching
|
||||
class CachePort(Protocol):
|
||||
def get(self, key: str) -> Optional[Path]
|
||||
def put(self, key: str, path: Path) -> None
|
||||
def exists(self, key: str) -> bool
|
||||
```
|
||||
|
||||
### Adapters (Implementations)
|
||||
|
||||
Adapters provide concrete implementations:
|
||||
|
||||
- **S3StorageAdapter**: Uses boto3 for S3 operations
|
||||
- **XdeltaAdapter**: Wraps xdelta3 binary for delta compression
|
||||
- **Sha256Adapter**: Provides SHA256 hashing
|
||||
- **FsCacheAdapter**: File system based reference cache
|
||||
- **UtcClockAdapter**: UTC timestamp provider
|
||||
- **StdLoggerAdapter**: Console logging
|
||||
|
||||
### Benefits
|
||||
|
||||
1. **Testability**: Mock any adapter for unit testing
|
||||
2. **Flexibility**: Swap implementations (e.g., different storage backends)
|
||||
3. **Separation**: Business logic isolated from infrastructure
|
||||
4. **Extensibility**: Add new adapters without changing core
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### DeltaSpace
|
||||
|
||||
A DeltaSpace is an S3 prefix containing related files that share a common reference:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class DeltaSpace:
|
||||
bucket: str # S3 bucket
|
||||
prefix: str # Prefix for related files
|
||||
|
||||
# Example:
|
||||
# DeltaSpace(bucket="releases", prefix="myapp/v1/")
|
||||
# Contains:
|
||||
# - reference.bin (first uploaded file)
|
||||
# - file1.zip.delta
|
||||
# - file2.zip.delta
|
||||
```
|
||||
|
||||
### Reference File
|
||||
|
||||
The first file uploaded to a DeltaSpace becomes the reference:
|
||||
|
||||
```
|
||||
s3://bucket/prefix/reference.bin # Full file (e.g., 100MB)
|
||||
s3://bucket/prefix/reference.bin.sha256 # Integrity checksum
|
||||
```
|
||||
|
||||
### Delta Files
|
||||
|
||||
Subsequent files are stored as deltas:
|
||||
|
||||
```
|
||||
s3://bucket/prefix/myfile.zip.delta # Delta file (e.g., 98KB)
|
||||
|
||||
Metadata (S3 tags):
|
||||
- original_name: myfile.zip
|
||||
- original_size: 104857600
|
||||
- original_hash: abc123...
|
||||
- reference_hash: def456...
|
||||
- tool_version: deltaglider/0.1.0
|
||||
```
|
||||
|
||||
## Compression Algorithm
|
||||
|
||||
### xdelta3: The Secret Sauce
|
||||
|
||||
DeltaGlider uses [xdelta3](http://xdelta.org/), a binary diff algorithm optimized for large files:
|
||||
|
||||
#### How xdelta3 Works
|
||||
|
||||
1. **Rolling Hash**: Scans reference file with a rolling hash window
|
||||
2. **Block Matching**: Finds matching byte sequences at any offset
|
||||
3. **Instruction Stream**: Generates copy/insert instructions
|
||||
4. **Compression**: Further compresses the instruction stream
|
||||
|
||||
```
|
||||
Original: ABCDEFGHIJKLMNOP
|
||||
Modified: ABCXYZGHIJKLMNOP
|
||||
|
||||
Delta instructions:
|
||||
- COPY 0-2 (ABC) # Copy bytes 0-2 from reference
|
||||
- INSERT XYZ # Insert new bytes
|
||||
- COPY 6-15 (GHIJKLMNOP) # Copy bytes 6-15 from reference
|
||||
|
||||
Delta size: ~10 bytes instead of 16 bytes
|
||||
```
|
||||
|
||||
#### Why xdelta3 Excels at Archives
|
||||
|
||||
Archive files (ZIP, TAR, JAR) have predictable structure:
|
||||
|
||||
```
|
||||
ZIP Structure:
|
||||
┌─────────────┐
|
||||
│ Headers │ ← Usually identical between versions
|
||||
├─────────────┤
|
||||
│ File 1 │ ← May be unchanged
|
||||
├─────────────┤
|
||||
│ File 2 │ ← Small change
|
||||
├─────────────┤
|
||||
│ File 3 │ ← May be unchanged
|
||||
├─────────────┤
|
||||
│ Directory │ ← Structure mostly same
|
||||
└─────────────┘
|
||||
```
|
||||
|
||||
Even when one file changes inside the archive, xdelta3 can:
|
||||
- Identify unchanged sections (even if byte positions shift)
|
||||
- Compress repeated patterns efficiently
|
||||
- Handle binary data optimally
|
||||
|
||||
### Intelligent File Type Detection
|
||||
|
||||
```python
|
||||
def should_use_delta(file_path: Path) -> bool:
|
||||
"""Determine if file should use delta compression."""
|
||||
|
||||
# File size check
|
||||
if file_path.stat().st_size < 1_000_000: # < 1MB
|
||||
return False # Overhead not worth it
|
||||
|
||||
# Extension-based detection
|
||||
DELTA_EXTENSIONS = {
|
||||
'.zip', '.tar', '.gz', '.tgz', '.bz2', # Archives
|
||||
'.jar', '.war', '.ear', # Java
|
||||
'.dmg', '.pkg', '.deb', '.rpm', # Packages
|
||||
'.iso', '.img', '.vhd', # Disk images
|
||||
}
|
||||
|
||||
DIRECT_EXTENSIONS = {
|
||||
'.txt', '.md', '.json', '.xml', # Text (use gzip)
|
||||
'.jpg', '.png', '.mp4', # Media (already compressed)
|
||||
'.sha1', '.sha256', '.md5', # Checksums (unique)
|
||||
}
|
||||
|
||||
ext = file_path.suffix.lower()
|
||||
|
||||
if ext in DELTA_EXTENSIONS:
|
||||
return True
|
||||
elif ext in DIRECT_EXTENSIONS:
|
||||
return False
|
||||
else:
|
||||
# Unknown type - use heuristic
|
||||
return is_likely_archive(file_path)
|
||||
```
|
||||
|
||||
## Storage Strategy
|
||||
|
||||
### S3 Object Layout
|
||||
|
||||
```
|
||||
bucket/
|
||||
├── releases/
|
||||
│ ├── v1.0.0/
|
||||
│ │ ├── reference.bin # First uploaded file (full)
|
||||
│ │ ├── reference.bin.sha256 # Checksum
|
||||
│ │ ├── app-linux.tar.gz.delta # Delta from reference
|
||||
│ │ ├── app-mac.dmg.delta # Delta from reference
|
||||
│ │ └── app-win.zip.delta # Delta from reference
|
||||
│ ├── v1.0.1/
|
||||
│ │ ├── reference.bin # New reference for this version
|
||||
│ │ └── ...
|
||||
│ └── v1.1.0/
|
||||
│ └── ...
|
||||
└── backups/
|
||||
└── ...
|
||||
```
|
||||
|
||||
### Metadata Strategy
|
||||
|
||||
DeltaGlider stores metadata in S3 object tags/metadata:
|
||||
|
||||
```python
|
||||
# For delta files
|
||||
metadata = {
|
||||
"x-amz-meta-original-name": "app.zip",
|
||||
"x-amz-meta-original-size": "104857600",
|
||||
"x-amz-meta-original-hash": "sha256:abc123...",
|
||||
"x-amz-meta-reference-hash": "sha256:def456...",
|
||||
"x-amz-meta-tool-version": "deltaglider/0.1.0",
|
||||
"x-amz-meta-compression-ratio": "0.001", # 0.1% of original
|
||||
}
|
||||
```
|
||||
|
||||
Benefits:
|
||||
- No separate metadata store needed
|
||||
- Atomic operations (metadata stored with object)
|
||||
- Works with S3 versioning and lifecycle policies
|
||||
- Queryable via S3 API
|
||||
|
||||
### Local Cache Strategy
|
||||
|
||||
```
|
||||
/tmp/.deltaglider/cache/
|
||||
├── references/
|
||||
│ ├── sha256_abc123.bin # Cached reference files
|
||||
│ ├── sha256_def456.bin
|
||||
│ └── ...
|
||||
└── metadata.json # Cache index
|
||||
```
|
||||
|
||||
Cache benefits:
|
||||
- Avoid repeated reference downloads
|
||||
- Speed up delta creation for multiple files
|
||||
- Reduce S3 API calls and bandwidth
|
||||
|
||||
## Performance Optimizations
|
||||
|
||||
### 1. Reference Caching
|
||||
|
||||
```python
|
||||
class FsCacheAdapter:
|
||||
def get_reference(self, hash: str) -> Optional[Path]:
|
||||
cache_path = self.cache_dir / f"sha256_{hash}.bin"
|
||||
if cache_path.exists():
|
||||
# Verify integrity
|
||||
if self.verify_hash(cache_path, hash):
|
||||
return cache_path
|
||||
return None
|
||||
|
||||
def put_reference(self, hash: str, path: Path) -> None:
|
||||
cache_path = self.cache_dir / f"sha256_{hash}.bin"
|
||||
shutil.copy2(path, cache_path)
|
||||
# Update cache index
|
||||
self.update_index(hash, cache_path)
|
||||
```
|
||||
|
||||
### 2. Streaming Operations
|
||||
|
||||
For large files, DeltaGlider uses streaming:
|
||||
|
||||
```python
|
||||
def upload_large_file(file_path: Path, s3_url: str):
|
||||
# Stream file to S3 using multipart upload
|
||||
with open(file_path, 'rb') as f:
|
||||
# boto3 automatically uses multipart for large files
|
||||
s3.upload_fileobj(f, bucket, key,
|
||||
Config=TransferConfig(
|
||||
multipart_threshold=1024 * 25, # 25MB
|
||||
max_concurrency=10,
|
||||
use_threads=True))
|
||||
```
|
||||
|
||||
### 3. Parallel Processing
|
||||
|
||||
```python
|
||||
def process_batch(files: List[Path]):
|
||||
with ThreadPoolExecutor(max_workers=4) as executor:
|
||||
futures = []
|
||||
for file in files:
|
||||
future = executor.submit(process_file, file)
|
||||
futures.append(future)
|
||||
|
||||
for future in as_completed(futures):
|
||||
result = future.result()
|
||||
print(f"Processed: {result}")
|
||||
```
|
||||
|
||||
### 4. Delta Ratio Optimization
|
||||
|
||||
```python
|
||||
def optimize_compression(file: Path, reference: Path) -> bytes:
|
||||
# Create delta
|
||||
delta = create_delta(reference, file)
|
||||
|
||||
# Check compression effectiveness
|
||||
ratio = len(delta) / file.stat().st_size
|
||||
|
||||
if ratio > MAX_RATIO: # Default: 0.5 (50%)
|
||||
# Delta too large, store original
|
||||
return None
|
||||
else:
|
||||
# Good compression, use delta
|
||||
return delta
|
||||
```
|
||||
|
||||
## Security & Integrity
|
||||
|
||||
### SHA256 Verification
|
||||
|
||||
Every operation includes checksum verification:
|
||||
|
||||
```python
|
||||
def verify_integrity(data: bytes, expected_hash: str) -> bool:
|
||||
actual_hash = hashlib.sha256(data).hexdigest()
|
||||
return actual_hash == expected_hash
|
||||
|
||||
# Upload flow
|
||||
file_hash = calculate_hash(file)
|
||||
upload_to_s3(file, metadata={"hash": file_hash})
|
||||
|
||||
# Download flow
|
||||
data, metadata = download_from_s3(key)
|
||||
if not verify_integrity(data, metadata["hash"]):
|
||||
raise IntegrityError("File corrupted")
|
||||
```
|
||||
|
||||
### Atomic Operations
|
||||
|
||||
All S3 operations are atomic:
|
||||
|
||||
```python
|
||||
def atomic_upload(file: Path, bucket: str, key: str):
|
||||
try:
|
||||
# Upload to temporary key
|
||||
temp_key = f"{key}.tmp.{uuid.uuid4()}"
|
||||
s3.upload_file(file, bucket, temp_key)
|
||||
|
||||
# Atomic rename (S3 copy + delete)
|
||||
s3.copy_object(
|
||||
CopySource={'Bucket': bucket, 'Key': temp_key},
|
||||
Bucket=bucket,
|
||||
Key=key
|
||||
)
|
||||
s3.delete_object(Bucket=bucket, Key=temp_key)
|
||||
|
||||
except Exception:
|
||||
# Cleanup on failure
|
||||
try:
|
||||
s3.delete_object(Bucket=bucket, Key=temp_key)
|
||||
except:
|
||||
pass
|
||||
raise
|
||||
```
|
||||
|
||||
### Encryption Support
|
||||
|
||||
DeltaGlider respects S3 encryption settings:
|
||||
|
||||
```python
|
||||
# Server-side encryption with S3-managed keys
|
||||
s3.put_object(
|
||||
Bucket=bucket,
|
||||
Key=key,
|
||||
Body=data,
|
||||
ServerSideEncryption='AES256'
|
||||
)
|
||||
|
||||
# Server-side encryption with KMS
|
||||
s3.put_object(
|
||||
Bucket=bucket,
|
||||
Key=key,
|
||||
Body=data,
|
||||
ServerSideEncryption='aws:kms',
|
||||
SSEKMSKeyId='arn:aws:kms:...'
|
||||
)
|
||||
```
|
||||
|
||||
## Comparison with Alternatives
|
||||
|
||||
### vs. S3 Versioning
|
||||
|
||||
| Aspect | DeltaGlider | S3 Versioning |
|
||||
|--------|-------------|---------------|
|
||||
| Storage | Only stores deltas | Stores full copies |
|
||||
| Compression | 99%+ for similar files | 0% |
|
||||
| Cost | Minimal | $$ per version |
|
||||
| Complexity | Transparent | Built-in |
|
||||
| Recovery | Download + reconstruct | Direct download |
|
||||
|
||||
### vs. Git LFS
|
||||
|
||||
| Aspect | DeltaGlider | Git LFS |
|
||||
|--------|-------------|---------|
|
||||
| Use case | Any S3 storage | Git repositories |
|
||||
| Compression | Binary delta | Deduplication |
|
||||
| Integration | S3 API | Git workflow |
|
||||
| Scalability | Unlimited | Repository-bound |
|
||||
|
||||
### vs. Deduplication Systems
|
||||
|
||||
| Aspect | DeltaGlider | Dedup Systems |
|
||||
|--------|-------------|---------------|
|
||||
| Approach | File-level delta | Block-level dedup |
|
||||
| Compression | 99%+ for similar | 30-50% typical |
|
||||
| Complexity | Simple | Complex |
|
||||
| Cost | Open source | Enterprise $$$ |
|
||||
|
||||
### vs. Backup Tools (Restic/Borg)
|
||||
|
||||
| Aspect | DeltaGlider | Restic/Borg |
|
||||
|--------|-------------|-------------|
|
||||
| Purpose | S3 optimization | Full backup |
|
||||
| Storage | S3-native | Custom format |
|
||||
| Granularity | File-level | Repository |
|
||||
| Use case | Artifacts/releases | System backups |
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Reference Rotation Strategy
|
||||
|
||||
Currently, the first file becomes the permanent reference. Future versions may implement:
|
||||
|
||||
```python
|
||||
class ReferenceRotationStrategy:
|
||||
def should_rotate(self, stats: ReferenceStats) -> bool:
|
||||
# Rotate if average delta ratio is too high
|
||||
if stats.avg_delta_ratio > 0.4:
|
||||
return True
|
||||
|
||||
# Rotate if reference is too old
|
||||
if stats.age_days > 90:
|
||||
return True
|
||||
|
||||
# Rotate if better candidate exists
|
||||
if stats.better_candidate_score > 0.8:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def select_new_reference(self, files: List[FileStats]) -> Path:
|
||||
# Select file that minimizes total delta sizes
|
||||
best_score = float('inf')
|
||||
best_file = None
|
||||
|
||||
for candidate in files:
|
||||
total_delta_size = sum(
|
||||
compute_delta_size(candidate, other)
|
||||
for other in files
|
||||
if other != candidate
|
||||
)
|
||||
if total_delta_size < best_score:
|
||||
best_score = total_delta_size
|
||||
best_file = candidate
|
||||
|
||||
return best_file
|
||||
```
|
||||
|
||||
### Multi-Reference Support
|
||||
|
||||
For diverse file sets, multiple references could be used:
|
||||
|
||||
```python
|
||||
class MultiReferenceStrategy:
|
||||
def assign_reference(self, file: Path, references: List[Reference]) -> Reference:
|
||||
# Find best matching reference
|
||||
best_reference = None
|
||||
best_ratio = float('inf')
|
||||
|
||||
for ref in references:
|
||||
delta = create_delta(ref.path, file)
|
||||
ratio = len(delta) / file.stat().st_size
|
||||
|
||||
if ratio < best_ratio:
|
||||
best_ratio = ratio
|
||||
best_reference = ref
|
||||
|
||||
# Create new reference if no good match
|
||||
if best_ratio > 0.5:
|
||||
return self.create_new_reference(file)
|
||||
|
||||
return best_reference
|
||||
```
|
||||
|
||||
### Incremental Delta Chains
|
||||
|
||||
For frequently updated files:
|
||||
|
||||
```python
|
||||
class DeltaChain:
|
||||
"""
|
||||
v1.0.0 (reference) <- v1.0.1 (delta) <- v1.0.2 (delta) <- v1.0.3 (delta)
|
||||
"""
|
||||
def reconstruct(self, version: str) -> bytes:
|
||||
# Start with reference
|
||||
data = self.load_reference()
|
||||
|
||||
# Apply deltas in sequence
|
||||
for delta in self.get_delta_chain(version):
|
||||
data = apply_delta(data, delta)
|
||||
|
||||
return data
|
||||
```
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
### Metrics to Track
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class CompressionMetrics:
|
||||
total_uploads: int
|
||||
total_original_size: int
|
||||
total_stored_size: int
|
||||
average_compression_ratio: float
|
||||
delta_files_count: int
|
||||
reference_files_count: int
|
||||
cache_hit_rate: float
|
||||
average_upload_time: float
|
||||
average_download_time: float
|
||||
failed_compressions: int
|
||||
```
|
||||
|
||||
### Health Checks
|
||||
|
||||
```python
|
||||
class HealthCheck:
|
||||
def check_xdelta3(self) -> bool:
|
||||
"""Verify xdelta3 binary is available."""
|
||||
return shutil.which('xdelta3') is not None
|
||||
|
||||
def check_s3_access(self) -> bool:
|
||||
"""Verify S3 credentials and permissions."""
|
||||
try:
|
||||
s3.list_buckets()
|
||||
return True
|
||||
except:
|
||||
return False
|
||||
|
||||
def check_cache_space(self) -> bool:
|
||||
"""Verify adequate cache space."""
|
||||
cache_dir = Path('/tmp/.deltaglider/cache')
|
||||
free_space = shutil.disk_usage(cache_dir).free
|
||||
return free_space > 1_000_000_000 # 1GB minimum
|
||||
```
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
1. **Cloud-Native Reference Management**: Store references in distributed cache
|
||||
2. **Rust Implementation**: 10x performance improvement
|
||||
3. **Automatic Similarity Detection**: ML-based reference selection
|
||||
4. **Multi-Threaded Compression**: Parallel delta generation
|
||||
5. **WASM Support**: Browser-based delta compression
|
||||
6. **S3 Batch Operations**: Bulk compression of existing data
|
||||
7. **Compression Prediction**: Estimate compression before upload
|
||||
8. **Adaptive Strategies**: Auto-tune based on workload patterns
|
||||
|
||||
## Contributing
|
||||
|
||||
See [CONTRIBUTING.md](https://github.com/beshu-tech/deltaglider/blob/main/CONTRIBUTING.md) for development setup and guidelines.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [xdelta3 Documentation](http://xdelta.org/)
|
||||
- [S3 API Reference](https://docs.aws.amazon.com/s3/index.html)
|
||||
- [Hexagonal Architecture](https://alistair.cockburn.us/hexagonal-architecture/)
|
||||
- [Binary Diff Algorithms](https://en.wikipedia.org/wiki/Delta_encoding)
|
||||
1112
docs/sdk/examples.md
Normal file
1112
docs/sdk/examples.md
Normal file
File diff suppressed because it is too large
Load Diff
130
docs/sdk/generate_docs.py
Normal file
130
docs/sdk/generate_docs.py
Normal file
@@ -0,0 +1,130 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Generate API documentation for DeltaGlider SDK.
|
||||
|
||||
This script generates documentation from Python source code using introspection.
|
||||
Can be extended to use tools like Sphinx, pdoc, or mkdocs.
|
||||
"""
|
||||
|
||||
import ast
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any
|
||||
|
||||
def extract_docstrings(file_path: Path) -> Dict[str, Any]:
|
||||
"""Extract docstrings and signatures from Python file."""
|
||||
with open(file_path, 'r') as f:
|
||||
tree = ast.parse(f.read(), filename=str(file_path))
|
||||
|
||||
docs = {
|
||||
"module": ast.get_docstring(tree),
|
||||
"classes": {},
|
||||
"functions": {}
|
||||
}
|
||||
|
||||
for node in ast.walk(tree):
|
||||
if isinstance(node, ast.ClassDef):
|
||||
class_docs = {
|
||||
"docstring": ast.get_docstring(node),
|
||||
"methods": {}
|
||||
}
|
||||
|
||||
for item in node.body:
|
||||
if isinstance(item, ast.FunctionDef):
|
||||
method_doc = {
|
||||
"docstring": ast.get_docstring(item),
|
||||
"signature": get_function_signature(item)
|
||||
}
|
||||
class_docs["methods"][item.name] = method_doc
|
||||
|
||||
docs["classes"][node.name] = class_docs
|
||||
|
||||
elif isinstance(node, ast.FunctionDef) and node.col_offset == 0:
|
||||
docs["functions"][node.name] = {
|
||||
"docstring": ast.get_docstring(node),
|
||||
"signature": get_function_signature(node)
|
||||
}
|
||||
|
||||
return docs
|
||||
|
||||
def get_function_signature(node: ast.FunctionDef) -> str:
|
||||
"""Extract function signature."""
|
||||
args = []
|
||||
|
||||
for arg in node.args.args:
|
||||
arg_str = arg.arg
|
||||
if arg.annotation:
|
||||
arg_str += f": {ast.unparse(arg.annotation)}"
|
||||
args.append(arg_str)
|
||||
|
||||
defaults = node.args.defaults
|
||||
if defaults:
|
||||
for i, default in enumerate(defaults, start=len(args) - len(defaults)):
|
||||
args[i] += f" = {ast.unparse(default)}"
|
||||
|
||||
return f"({', '.join(args)})"
|
||||
|
||||
def generate_markdown_docs(docs: Dict[str, Any], module_name: str) -> str:
|
||||
"""Generate Markdown documentation from extracted docs."""
|
||||
lines = [f"# {module_name} API Documentation\n"]
|
||||
|
||||
if docs["module"]:
|
||||
lines.append(f"{docs['module']}\n")
|
||||
|
||||
if docs["functions"]:
|
||||
lines.append("## Functions\n")
|
||||
for name, func in docs["functions"].items():
|
||||
lines.append(f"### `{name}{func['signature']}`\n")
|
||||
if func["docstring"]:
|
||||
lines.append(f"{func['docstring']}\n")
|
||||
|
||||
if docs["classes"]:
|
||||
lines.append("## Classes\n")
|
||||
for class_name, class_info in docs["classes"].items():
|
||||
lines.append(f"### {class_name}\n")
|
||||
if class_info["docstring"]:
|
||||
lines.append(f"{class_info['docstring']}\n")
|
||||
|
||||
if class_info["methods"]:
|
||||
lines.append("#### Methods\n")
|
||||
for method_name, method_info in class_info["methods"].items():
|
||||
lines.append(f"##### `{method_name}{method_info['signature']}`\n")
|
||||
if method_info["docstring"]:
|
||||
lines.append(f"{method_info['docstring']}\n")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
def main():
|
||||
"""Generate documentation for DeltaGlider SDK."""
|
||||
src_dir = Path(__file__).parent.parent.parent / "src" / "deltaglider"
|
||||
|
||||
# Extract documentation from client.py
|
||||
client_docs = extract_docstrings(src_dir / "client.py")
|
||||
|
||||
# Generate API documentation
|
||||
api_content = generate_markdown_docs(client_docs, "deltaglider.client")
|
||||
|
||||
# Save generated documentation
|
||||
output_file = Path(__file__).parent / "generated_api.md"
|
||||
with open(output_file, 'w') as f:
|
||||
f.write(api_content)
|
||||
|
||||
print(f"Documentation generated: {output_file}")
|
||||
|
||||
# Generate index of all modules
|
||||
modules = []
|
||||
for py_file in src_dir.rglob("*.py"):
|
||||
if not py_file.name.startswith("_"):
|
||||
rel_path = py_file.relative_to(src_dir)
|
||||
module_name = str(rel_path).replace("/", ".").replace(".py", "")
|
||||
modules.append(module_name)
|
||||
|
||||
index_file = Path(__file__).parent / "module_index.json"
|
||||
with open(index_file, 'w') as f:
|
||||
json.dump({"modules": sorted(modules)}, f, indent=2)
|
||||
|
||||
print(f"Module index generated: {index_file}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
238
docs/sdk/getting-started.md
Normal file
238
docs/sdk/getting-started.md
Normal file
@@ -0,0 +1,238 @@
|
||||
# Getting Started with DeltaGlider SDK
|
||||
|
||||
This guide will help you get up and running with the DeltaGlider Python SDK in minutes.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.11 or higher
|
||||
- AWS credentials configured (or access to MinIO/S3-compatible storage)
|
||||
- xdelta3 installed on your system (installed automatically with the package)
|
||||
|
||||
## Installation
|
||||
|
||||
### Using pip
|
||||
|
||||
```bash
|
||||
pip install deltaglider
|
||||
```
|
||||
|
||||
### Using uv (faster)
|
||||
|
||||
```bash
|
||||
uv pip install deltaglider
|
||||
```
|
||||
|
||||
### Development Installation
|
||||
|
||||
```bash
|
||||
git clone https://github.com/beshu-tech/deltaglider
|
||||
cd deltaglider
|
||||
pip install -e ".[dev]"
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### AWS Credentials
|
||||
|
||||
DeltaGlider uses standard AWS credential discovery:
|
||||
|
||||
1. **Environment Variables**
|
||||
```bash
|
||||
export AWS_ACCESS_KEY_ID=your_access_key
|
||||
export AWS_SECRET_ACCESS_KEY=your_secret_key
|
||||
export AWS_DEFAULT_REGION=us-west-2
|
||||
```
|
||||
|
||||
2. **AWS Credentials File** (`~/.aws/credentials`)
|
||||
```ini
|
||||
[default]
|
||||
aws_access_key_id = your_access_key
|
||||
aws_secret_access_key = your_secret_key
|
||||
region = us-west-2
|
||||
```
|
||||
|
||||
3. **IAM Role** (when running on EC2/ECS/Lambda)
|
||||
Automatically uses instance/task role credentials
|
||||
|
||||
### Custom S3 Endpoints
|
||||
|
||||
For MinIO, Cloudflare R2, or other S3-compatible storage:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client(endpoint_url="http://minio.local:9000")
|
||||
```
|
||||
|
||||
Or via environment variable:
|
||||
```bash
|
||||
export AWS_ENDPOINT_URL=http://minio.local:9000
|
||||
```
|
||||
|
||||
## Your First Upload
|
||||
|
||||
### Basic Example
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
# Create a client
|
||||
client = create_client()
|
||||
|
||||
# Upload a file
|
||||
summary = client.upload(
|
||||
file_path="my-app-v1.0.0.zip",
|
||||
s3_url="s3://my-bucket/releases/v1.0.0/"
|
||||
)
|
||||
|
||||
# Check the results
|
||||
print(f"Upload completed!")
|
||||
print(f"Original size: {summary.original_size_mb:.1f} MB")
|
||||
print(f"Stored size: {summary.stored_size_mb:.1f} MB")
|
||||
print(f"Compression: {summary.savings_percent:.0f}%")
|
||||
print(f"Is delta: {summary.is_delta}")
|
||||
```
|
||||
|
||||
### Understanding the Results
|
||||
|
||||
When you upload a file, DeltaGlider returns an `UploadSummary` with:
|
||||
|
||||
- `operation`: What was done (`PUT` for new reference, `PUT_DELTA` for delta)
|
||||
- `original_size_mb`: Original file size in MB
|
||||
- `stored_size_mb`: Actual size stored in S3
|
||||
- `savings_percent`: Percentage of storage saved
|
||||
- `is_delta`: Whether delta compression was used
|
||||
- `delta_ratio`: Ratio of delta size to original (smaller is better)
|
||||
|
||||
## Downloading Files
|
||||
|
||||
```python
|
||||
# Download a file
|
||||
client.download(
|
||||
s3_url="s3://my-bucket/releases/v1.0.0/my-app-v1.0.0.zip",
|
||||
output_path="downloaded-app.zip"
|
||||
)
|
||||
|
||||
# The file is automatically reconstructed if it was stored as a delta
|
||||
```
|
||||
|
||||
## Working with Multiple Versions
|
||||
|
||||
Here's where DeltaGlider shines - uploading multiple versions:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
from pathlib import Path
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Upload multiple versions
|
||||
versions = ["v1.0.0", "v1.0.1", "v1.0.2", "v1.1.0"]
|
||||
|
||||
for version in versions:
|
||||
file = f"builds/my-app-{version}.zip"
|
||||
|
||||
summary = client.upload(
|
||||
file_path=file,
|
||||
s3_url=f"s3://releases/{version}/"
|
||||
)
|
||||
|
||||
if summary.is_delta:
|
||||
print(f"{version}: Compressed to {summary.stored_size_mb:.1f}MB "
|
||||
f"(saved {summary.savings_percent:.0f}%)")
|
||||
else:
|
||||
print(f"{version}: Stored as reference ({summary.original_size_mb:.1f}MB)")
|
||||
|
||||
# Typical output:
|
||||
# v1.0.0: Stored as reference (100.0MB)
|
||||
# v1.0.1: Compressed to 0.2MB (saved 99.8%)
|
||||
# v1.0.2: Compressed to 0.3MB (saved 99.7%)
|
||||
# v1.1.0: Compressed to 5.2MB (saved 94.8%)
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
Verify the integrity of stored files:
|
||||
|
||||
```python
|
||||
# Verify a stored file
|
||||
is_valid = client.verify("s3://releases/v1.0.0/my-app-v1.0.0.zip")
|
||||
print(f"File integrity: {'✓ Valid' if is_valid else '✗ Corrupted'}")
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
try:
|
||||
summary = client.upload("app.zip", "s3://bucket/path/")
|
||||
except FileNotFoundError:
|
||||
print("Local file not found")
|
||||
except PermissionError:
|
||||
print("S3 access denied - check credentials")
|
||||
except Exception as e:
|
||||
print(f"Upload failed: {e}")
|
||||
```
|
||||
|
||||
## Logging
|
||||
|
||||
Control logging verbosity:
|
||||
|
||||
```python
|
||||
# Debug logging for troubleshooting
|
||||
client = create_client(log_level="DEBUG")
|
||||
|
||||
# Quiet mode
|
||||
client = create_client(log_level="WARNING")
|
||||
|
||||
# Default is INFO
|
||||
client = create_client() # INFO level
|
||||
```
|
||||
|
||||
## Local Testing with MinIO
|
||||
|
||||
For development and testing without AWS:
|
||||
|
||||
1. **Start MinIO**
|
||||
```bash
|
||||
docker run -p 9000:9000 -p 9001:9001 \
|
||||
-e MINIO_ROOT_USER=minioadmin \
|
||||
-e MINIO_ROOT_PASSWORD=minioadmin \
|
||||
minio/minio server /data --console-address ":9001"
|
||||
```
|
||||
|
||||
2. **Create a bucket** (via MinIO Console at http://localhost:9001)
|
||||
|
||||
3. **Use DeltaGlider**
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client(
|
||||
endpoint_url="http://localhost:9000"
|
||||
)
|
||||
|
||||
# Set credentials via environment
|
||||
import os
|
||||
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
|
||||
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
|
||||
|
||||
# Now use normally
|
||||
summary = client.upload("test.zip", "s3://test-bucket/")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Group Similar Files**: Upload related files to the same S3 prefix for optimal compression
|
||||
2. **Version Naming**: Use consistent naming for versions (e.g., `app-v1.0.0.zip`, `app-v1.0.1.zip`)
|
||||
3. **Cache Management**: The local reference cache improves performance - don't clear it unnecessarily
|
||||
4. **Error Recovery**: Always handle exceptions for production code
|
||||
5. **Monitoring**: Log compression ratios to track effectiveness
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [Examples](examples.md) - See real-world usage patterns
|
||||
- [API Reference](api.md) - Complete API documentation
|
||||
- [Architecture](architecture.md) - Understand how it works
|
||||
@@ -5,3 +5,16 @@ try:
|
||||
except ImportError:
|
||||
# Package is not installed, so version is not available
|
||||
__version__ = "0.0.0+unknown"
|
||||
|
||||
# Import simplified client API
|
||||
from .client import DeltaGliderClient, create_client
|
||||
from .core import DeltaService, DeltaSpace, ObjectKey
|
||||
|
||||
__all__ = [
|
||||
"__version__",
|
||||
"DeltaGliderClient",
|
||||
"create_client",
|
||||
"DeltaService",
|
||||
"DeltaSpace",
|
||||
"ObjectKey",
|
||||
]
|
||||
|
||||
237
src/deltaglider/client.py
Normal file
237
src/deltaglider/client.py
Normal file
@@ -0,0 +1,237 @@
|
||||
"""Simplified client API for DeltaGlider."""
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .adapters import (
|
||||
FsCacheAdapter,
|
||||
NoopMetricsAdapter,
|
||||
S3StorageAdapter,
|
||||
Sha256Adapter,
|
||||
StdLoggerAdapter,
|
||||
UtcClockAdapter,
|
||||
XdeltaAdapter,
|
||||
)
|
||||
from .core import DeltaService, DeltaSpace, ObjectKey
|
||||
|
||||
|
||||
@dataclass
|
||||
class UploadSummary:
|
||||
"""User-friendly upload summary."""
|
||||
|
||||
operation: str
|
||||
bucket: str
|
||||
key: str
|
||||
original_size: int
|
||||
stored_size: int
|
||||
is_delta: bool
|
||||
delta_ratio: float = 0.0
|
||||
|
||||
@property
|
||||
def original_size_mb(self) -> float:
|
||||
"""Original size in MB."""
|
||||
return self.original_size / (1024 * 1024)
|
||||
|
||||
@property
|
||||
def stored_size_mb(self) -> float:
|
||||
"""Stored size in MB."""
|
||||
return self.stored_size / (1024 * 1024)
|
||||
|
||||
@property
|
||||
def savings_percent(self) -> float:
|
||||
"""Percentage saved through compression."""
|
||||
if self.original_size == 0:
|
||||
return 0.0
|
||||
return ((self.original_size - self.stored_size) / self.original_size) * 100
|
||||
|
||||
|
||||
class DeltaGliderClient:
|
||||
"""Simplified client for DeltaGlider operations."""
|
||||
|
||||
def __init__(self, service: DeltaService, endpoint_url: str | None = None):
|
||||
"""Initialize client with service."""
|
||||
self.service = service
|
||||
self.endpoint_url = endpoint_url
|
||||
|
||||
def upload(
|
||||
self,
|
||||
file_path: str | Path,
|
||||
s3_url: str,
|
||||
tags: dict[str, str] | None = None,
|
||||
max_ratio: float = 0.5,
|
||||
) -> UploadSummary:
|
||||
"""Upload a file to S3 with automatic delta compression.
|
||||
|
||||
Args:
|
||||
file_path: Local file to upload
|
||||
s3_url: S3 destination URL (s3://bucket/prefix/)
|
||||
tags: Optional tags to add to the object
|
||||
max_ratio: Maximum acceptable delta/file ratio (default 0.5)
|
||||
|
||||
Returns:
|
||||
UploadSummary with compression statistics
|
||||
"""
|
||||
file_path = Path(file_path)
|
||||
|
||||
# Parse S3 URL
|
||||
if not s3_url.startswith("s3://"):
|
||||
raise ValueError(f"Invalid S3 URL: {s3_url}")
|
||||
|
||||
s3_path = s3_url[5:].rstrip("/")
|
||||
parts = s3_path.split("/", 1)
|
||||
bucket = parts[0]
|
||||
prefix = parts[1] if len(parts) > 1 else ""
|
||||
|
||||
# Create delta space and upload
|
||||
delta_space = DeltaSpace(bucket=bucket, prefix=prefix)
|
||||
summary = self.service.put(file_path, delta_space, max_ratio)
|
||||
|
||||
# TODO: Add tags support when implemented
|
||||
|
||||
# Convert to user-friendly summary
|
||||
is_delta = summary.delta_size is not None
|
||||
stored_size = summary.delta_size if is_delta else summary.file_size
|
||||
|
||||
return UploadSummary(
|
||||
operation=summary.operation,
|
||||
bucket=summary.bucket,
|
||||
key=summary.key,
|
||||
original_size=summary.file_size,
|
||||
stored_size=stored_size or summary.file_size, # Ensure stored_size is never None
|
||||
is_delta=is_delta,
|
||||
delta_ratio=summary.delta_ratio or 0.0,
|
||||
)
|
||||
|
||||
def download(self, s3_url: str, output_path: str | Path) -> None:
|
||||
"""Download and reconstruct a file from S3.
|
||||
|
||||
Args:
|
||||
s3_url: S3 source URL (s3://bucket/key)
|
||||
output_path: Local destination path
|
||||
"""
|
||||
output_path = Path(output_path)
|
||||
|
||||
# Parse S3 URL
|
||||
if not s3_url.startswith("s3://"):
|
||||
raise ValueError(f"Invalid S3 URL: {s3_url}")
|
||||
|
||||
s3_path = s3_url[5:]
|
||||
parts = s3_path.split("/", 1)
|
||||
if len(parts) < 2:
|
||||
raise ValueError(f"S3 URL must include key: {s3_url}")
|
||||
|
||||
bucket = parts[0]
|
||||
key = parts[1]
|
||||
|
||||
# Auto-append .delta if the file doesn't exist without it
|
||||
# This allows users to specify the original name and we'll find the delta
|
||||
obj_key = ObjectKey(bucket=bucket, key=key)
|
||||
|
||||
# Try to get metadata first to see if it exists
|
||||
try:
|
||||
self.service.get(obj_key, output_path)
|
||||
except Exception:
|
||||
# Try with .delta suffix
|
||||
if not key.endswith(".delta"):
|
||||
obj_key = ObjectKey(bucket=bucket, key=key + ".delta")
|
||||
self.service.get(obj_key, output_path)
|
||||
else:
|
||||
raise
|
||||
|
||||
def verify(self, s3_url: str) -> bool:
|
||||
"""Verify integrity of a stored file.
|
||||
|
||||
Args:
|
||||
s3_url: S3 URL of the file to verify
|
||||
|
||||
Returns:
|
||||
True if verification passed, False otherwise
|
||||
"""
|
||||
# Parse S3 URL
|
||||
if not s3_url.startswith("s3://"):
|
||||
raise ValueError(f"Invalid S3 URL: {s3_url}")
|
||||
|
||||
s3_path = s3_url[5:]
|
||||
parts = s3_path.split("/", 1)
|
||||
if len(parts) < 2:
|
||||
raise ValueError(f"S3 URL must include key: {s3_url}")
|
||||
|
||||
bucket = parts[0]
|
||||
key = parts[1]
|
||||
|
||||
obj_key = ObjectKey(bucket=bucket, key=key)
|
||||
result = self.service.verify(obj_key)
|
||||
return result.valid
|
||||
|
||||
def lifecycle_policy(
|
||||
self, s3_prefix: str, days_before_archive: int = 30, days_before_delete: int = 90
|
||||
) -> None:
|
||||
"""Set lifecycle policy for a prefix (placeholder for future implementation).
|
||||
|
||||
Args:
|
||||
s3_prefix: S3 prefix to apply policy to
|
||||
days_before_archive: Days before transitioning to archive storage
|
||||
days_before_delete: Days before deletion
|
||||
"""
|
||||
# TODO: Implement lifecycle policy management
|
||||
# This would integrate with S3 lifecycle policies
|
||||
# For now, this is a placeholder for the API
|
||||
pass
|
||||
|
||||
|
||||
def create_client(
|
||||
endpoint_url: str | None = None,
|
||||
log_level: str = "INFO",
|
||||
cache_dir: str = "/tmp/.deltaglider/cache",
|
||||
**kwargs: Any,
|
||||
) -> DeltaGliderClient:
|
||||
"""Create a DeltaGlider client with sensible defaults.
|
||||
|
||||
Args:
|
||||
endpoint_url: Optional S3 endpoint URL (for MinIO, R2, etc.)
|
||||
log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
|
||||
cache_dir: Directory for reference cache
|
||||
**kwargs: Additional arguments passed to DeltaService
|
||||
|
||||
Returns:
|
||||
Configured DeltaGliderClient instance
|
||||
|
||||
Examples:
|
||||
>>> # Use with AWS S3 (credentials from environment)
|
||||
>>> client = create_client()
|
||||
|
||||
>>> # Use with MinIO
|
||||
>>> client = create_client(endpoint_url="http://localhost:9000")
|
||||
|
||||
>>> # Use with debug logging
|
||||
>>> client = create_client(log_level="DEBUG")
|
||||
"""
|
||||
# Create adapters
|
||||
hasher = Sha256Adapter()
|
||||
storage = S3StorageAdapter(endpoint_url=endpoint_url)
|
||||
diff = XdeltaAdapter()
|
||||
cache = FsCacheAdapter(Path(cache_dir), hasher)
|
||||
clock = UtcClockAdapter()
|
||||
logger = StdLoggerAdapter(level=log_level)
|
||||
metrics = NoopMetricsAdapter()
|
||||
|
||||
# Get default values
|
||||
tool_version = kwargs.pop("tool_version", "deltaglider/0.1.0")
|
||||
max_ratio = kwargs.pop("max_ratio", 0.5)
|
||||
|
||||
# Create service
|
||||
service = DeltaService(
|
||||
storage=storage,
|
||||
diff=diff,
|
||||
hasher=hasher,
|
||||
cache=cache,
|
||||
clock=clock,
|
||||
logger=logger,
|
||||
metrics=metrics,
|
||||
tool_version=tool_version,
|
||||
max_ratio=max_ratio,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
return DeltaGliderClient(service, endpoint_url)
|
||||
Reference in New Issue
Block a user