diff --git a/README.md b/README.md index b05e10c..fa53258 100644 --- a/README.md +++ b/README.md @@ -6,13 +6,13 @@ [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/) [![xdelta3](https://img.shields.io/badge/powered%20by-xdelta3-green.svg)](https://github.com/jmacd/xdelta) -> 🌟 Star if you like this! 🙏 -> Leave a message in [Issues](https://github.com/beshu-tech/deltaglider/issues) - we are listening! **Store 4TB of similar files in 5GB. No, that's not a typo.** DeltaGlider is a drop-in S3 replacement that may achieve 99.9% size reduction for versioned compressed artifacts, backups, and release archives through intelligent binary delta compression (via xdelta3). +> 🌟 Star if you like this! Or leave a message in [Issues](https://github.com/beshu-tech/deltaglider/issues) - we are listening! + ## The Problem We Solved You're storing hundreds of versions of your software releases. Each 100MB build differs by <1% from the previous version. You're paying to store 100GB of what's essentially 100MB of unique data. @@ -33,7 +33,7 @@ We don't expect significant benefit for multimedia content like videos, but we n ## Quick Start -The quickest way to start is using the GUI +DeltaGlider comes as an SDK and a CLI, but we also have a GUI: * https://github.com/beshu-tech/deltaglider_commander/
@@ -210,6 +210,12 @@ deltaglider stats my-bucket --refresh # Force cache refresh deltaglider stats my-bucket --no-cache # Skip caching entirely deltaglider stats my-bucket --json # JSON output for automation +# Integrity verification & maintenance +deltaglider verify s3://releases/file.zip # Validate stored SHA256 +deltaglider purge my-bucket # Clean expired .deltaglider/tmp files +deltaglider purge my-bucket --dry-run # Preview purge results +deltaglider purge my-bucket --json # Machine-readable purge stats + # Migrate existing S3 buckets to DeltaGlider compression deltaglider migrate s3://old-bucket/ s3://new-bucket/ # Interactive migration deltaglider migrate s3://old-bucket/ s3://new-bucket/ --yes # Skip confirmation diff --git a/docs/sdk/README.md b/docs/sdk/README.md index 965b37a..0c14e30 100644 --- a/docs/sdk/README.md +++ b/docs/sdk/README.md @@ -206,10 +206,17 @@ from deltaglider import create_client client = create_client( endpoint_url="http://minio.internal:9000", # Custom S3 endpoint log_level="DEBUG", # Detailed logging - cache_dir="/var/cache/deltaglider", # Custom cache location + aws_access_key_id="minio", + aws_secret_access_key="minio", + region_name="eu-west-1", + max_ratio=0.3, # Stricter delta acceptance ) ``` +> ℹ️ The SDK now manages an encrypted, process-isolated cache automatically in `/tmp/deltaglider-*`. +> Tune cache behavior via environment variables such as `DG_CACHE_BACKEND`, +> `DG_CACHE_MEMORY_SIZE_MB`, and `DG_CACHE_ENCRYPTION_KEY` instead of passing a `cache_dir` argument. + ## Real-World Example ```python @@ -299,4 +306,4 @@ url = client.generate_presigned_url( ## License -MIT License - See [LICENSE](https://github.com/beshu-tech/deltaglider/blob/main/LICENSE) for details. \ No newline at end of file +MIT License - See [LICENSE](https://github.com/beshu-tech/deltaglider/blob/main/LICENSE) for details. diff --git a/docs/sdk/api.md b/docs/sdk/api.md index 538d3c7..22b953b 100644 --- a/docs/sdk/api.md +++ b/docs/sdk/api.md @@ -156,29 +156,34 @@ for obj in response['Contents']: #### `get_bucket_stats` -Get statistics for a bucket with optional detailed compression metrics. Results are cached per client session for performance. +Get statistics for a bucket with optional detailed compression metrics. Results are cached inside the bucket for performance. ```python def get_bucket_stats( self, bucket: str, - detailed_stats: bool = False + mode: Literal["quick", "sampled", "detailed"] = "quick", + use_cache: bool = True, + refresh_cache: bool = False, ) -> BucketStats ``` ##### Parameters - **bucket** (`str`): S3 bucket name. -- **detailed_stats** (`bool`): If True, fetch accurate compression ratios for delta files. Default: False. - - With `detailed_stats=False`: ~50ms for any bucket size (LIST calls only) - - With `detailed_stats=True`: ~2-3s per 1000 objects (adds HEAD calls for delta files) +- **mode** (`Literal[...]`): Accuracy/cost trade-off: + - `"quick"` (default): LIST-only scan; compression ratios for deltas are estimated. + - `"sampled"`: HEAD one delta per deltaspace and reuse the ratio. + - `"detailed"`: HEAD every delta object; slowest but exact. +- **use_cache** (`bool`): If True, read/write `.deltaglider/stats_{mode}.json` in the bucket for reuse. +- **refresh_cache** (`bool`): Force recomputation even if a cache file is valid.
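+
+For example, the two cache flags differ as follows (a minimal sketch; the bucket name is illustrative):
+
+```python
+# Skip cache validation and recompute immediately, ignoring any existing cache file
+stats = client.get_bucket_stats("releases", mode="quick", refresh_cache=True)
+
+# Compute without reading or writing any .deltaglider/ cache object in the bucket
+stats = client.get_bucket_stats("releases", use_cache=False)
+```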
##### Caching Behavior -- **Session-scoped cache**: Results cached within client instance lifetime -- **Automatic invalidation**: Cache cleared on bucket mutations (put, delete, bucket operations) -- **Intelligent reuse**: Detailed stats can serve quick stat requests -- **Manual cache control**: Use `clear_cache()` to invalidate all cached stats +- Stats are cached per mode directly inside the bucket at `.deltaglider/stats_{mode}.json`. +- Every call validates cache freshness via a quick LIST (object count + compressed size). +- `refresh_cache=True` skips cache validation and recomputes immediately. +- `use_cache=False` bypasses both reading and writing cache artifacts. ##### Returns @@ -195,24 +200,20 @@ def get_bucket_stats( ##### Examples ```python -# Quick stats for dashboard display (cached after first call) +# Quick stats (fast LIST-only) stats = client.get_bucket_stats('releases') print(f"Objects: {stats.object_count}, Size: {stats.total_size}") -# Second call hits cache (instant response) -stats = client.get_bucket_stats('releases') -print(f"Space saved: {stats.space_saved} bytes") +# Sampled/detailed modes for analytics +sampled = client.get_bucket_stats('releases', mode='sampled') +detailed = client.get_bucket_stats('releases', mode='detailed') +print(f"Compression ratio: {detailed.average_compression_ratio:.1%}") -# Detailed stats for analytics (slower but accurate, also cached) -stats = client.get_bucket_stats('releases', detailed_stats=True) -print(f"Compression ratio: {stats.average_compression_ratio:.1%}") +# Force refresh if an external tool modified the bucket +fresh = client.get_bucket_stats('releases', mode='quick', refresh_cache=True) -# Quick call after detailed call reuses detailed cache (more accurate) -quick_stats = client.get_bucket_stats('releases') # Uses detailed cache - -# Clear cache to force refresh -client.clear_cache() -stats = client.get_bucket_stats('releases') # Fresh computation +# Skip cache entirely when running ad-hoc diagnostics +uncached = client.get_bucket_stats('releases', use_cache=False) ``` #### `put_object` @@ -334,7 +335,7 @@ client.delete_bucket(Bucket='old-releases') #### `list_buckets` -List all S3 buckets (boto3-compatible). Includes cached statistics when available. +List all S3 buckets (boto3-compatible). ```python def list_buckets( @@ -345,51 +346,18 @@ def list_buckets( ##### Returns -Dict with list of buckets and owner information (identical to boto3). Each bucket may include optional `DeltaGliderStats` metadata if statistics have been previously cached. - -##### Response Structure - -```python -{ - 'Buckets': [ - { - 'Name': 'bucket-name', - 'CreationDate': datetime(2025, 1, 1), - 'DeltaGliderStats': { # Optional, only if cached - 'Cached': True, - 'Detailed': bool, # Whether detailed stats were fetched - 'ObjectCount': int, - 'TotalSize': int, - 'CompressedSize': int, - 'SpaceSaved': int, - 'AverageCompressionRatio': float, - 'DeltaObjects': int, - 'DirectObjects': int - } - } - ], - 'Owner': {...} -} -``` +Dict with the same structure boto3 returns (`Buckets`, `Owner`, `ResponseMetadata`). DeltaGlider does not inject additional metadata; use `get_bucket_stats()` for compression data. 
##### Examples ```python -# List all buckets response = client.list_buckets() for bucket in response['Buckets']: print(f"{bucket['Name']} - Created: {bucket['CreationDate']}") - # Check if stats are cached - if 'DeltaGliderStats' in bucket: - stats = bucket['DeltaGliderStats'] - print(f" Cached stats: {stats['ObjectCount']} objects, " - f"{stats['AverageCompressionRatio']:.1%} compression") - -# Fetch stats first, then list buckets to see cached data -client.get_bucket_stats('my-bucket', detailed_stats=True) -response = client.list_buckets() -# Now 'my-bucket' will include DeltaGliderStats in response +# Combine with get_bucket_stats for deeper insights +stats = client.get_bucket_stats('releases', mode='detailed') +print(f"releases -> {stats.object_count} objects, {stats.space_saved/(1024**3):.2f} GB saved") ``` ### Simple API Methods @@ -528,13 +496,9 @@ else: ### Cache Management Methods -DeltaGlider maintains two types of caches for performance optimization: -1. **Reference cache**: Binary reference files used for delta reconstruction -2. **Statistics cache**: Bucket statistics (session-scoped) - #### `clear_cache` -Clear all cached data including reference files and bucket statistics. +Clear all locally cached reference files. ```python def clear_cache(self) -> None @@ -542,23 +506,20 @@ def clear_cache(self) -> None ##### Description -Removes all cached reference files from the local filesystem and invalidates all bucket statistics. Useful for: -- Forcing fresh statistics computation +Removes all cached reference files from the local filesystem. Useful for: - Freeing disk space in long-running applications -- Ensuring latest data after external bucket modifications +- Ensuring the next upload/download fetches fresh references from S3 +- Resetting cache after configuration or credential changes - Testing and development workflows -##### Cache Types Cleared +##### Cache Scope -1. **Reference Cache**: Binary reference files stored in `/tmp/deltaglider-*/` - - Encrypted at rest with ephemeral keys - - Content-addressed storage (SHA256-based filenames) - - Automatically cleaned up on process exit - -2. **Statistics Cache**: Bucket statistics cached per client session - - Metadata about compression ratios and object counts - - Session-scoped (not persisted to disk) - - Automatically invalidated on bucket mutations +- **Reference Cache**: Binary reference files stored in `/tmp/deltaglider-*/` + - Encrypted at rest with ephemeral keys + - Content-addressed storage (SHA256-based filenames) + - Automatically cleaned up on process exit +- **Statistics Cache**: Stored inside the bucket as `.deltaglider/stats_{mode}.json`. + - `clear_cache()` does *not* remove these S3 objects; use `refresh_cache=True` or delete the objects manually if needed. ##### Examples @@ -574,71 +535,14 @@ for i in range(1000): if i % 100 == 0: client.clear_cache() -# Force fresh statistics after external changes -stats_before = client.get_bucket_stats('releases') # Cached -# ... external tool modifies bucket ... -client.clear_cache() -stats_after = client.get_bucket_stats('releases') # Fresh data +# Force fresh statistics after external changes (skip cache instead of clearing) +stats_before = client.get_bucket_stats('releases') +stats_after = client.get_bucket_stats('releases', refresh_cache=True) # Development workflow client.clear_cache() # Start with clean state ``` -#### `evict_cache` - -Remove a specific cached reference file from the local cache. 
- -```python -def evict_cache(self, s3_url: str) -> None -``` - -##### Parameters - -- **s3_url** (`str`): S3 URL of the reference file to evict (e.g., `s3://bucket/prefix/reference.bin`) - -##### Description - -Removes a specific reference file from the cache without affecting other cached files or statistics. Useful for: -- Selective cache invalidation when specific references are updated -- Memory management in applications with many delta spaces -- Testing specific delta compression scenarios - -##### Examples - -```python -# Evict specific reference after update -client.upload("new-reference.zip", "s3://releases/v2.0.0/") -client.evict_cache("s3://releases/v2.0.0/reference.bin") - -# Next upload will fetch fresh reference -client.upload("similar-file.zip", "s3://releases/v2.0.0/") - -# Selective eviction for specific delta spaces -delta_spaces = ["v1.0.0", "v1.1.0", "v1.2.0"] -for space in delta_spaces: - client.evict_cache(f"s3://releases/{space}/reference.bin") -``` - -##### See Also - -- [docs/CACHE_MANAGEMENT.md](../../CACHE_MANAGEMENT.md): Complete cache management guide -- `clear_cache()`: Clear all caches - -#### `lifecycle_policy` - -Set lifecycle policy for S3 prefix (placeholder for future implementation). - -```python -def lifecycle_policy( - self, - s3_prefix: str, - days_before_archive: int = 30, - days_before_delete: int = 90 -) -> None -``` - -**Note**: This method is a placeholder for future S3 lifecycle policy management. - ## UploadSummary Data class containing upload operation results. @@ -995,4 +899,4 @@ client = create_client(log_level="DEBUG") - **GitHub Issues**: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues) - **Documentation**: [github.com/beshu-tech/deltaglider](https://github.com/beshu-tech/deltaglider) -- **PyPI Package**: [pypi.org/project/deltaglider](https://pypi.org/project/deltaglider) \ No newline at end of file +- **PyPI Package**: [pypi.org/project/deltaglider](https://pypi.org/project/deltaglider) diff --git a/docs/sdk/examples.md b/docs/sdk/examples.md index 2224f56..9240e10 100644 --- a/docs/sdk/examples.md +++ b/docs/sdk/examples.md @@ -41,19 +41,19 @@ def fast_bucket_listing(bucket: str): # Process objects for display items = [] - for obj in response.contents: + for obj in response['Contents']: + metadata = obj.get("Metadata", {}) items.append({ - "key": obj.key, - "size": obj.size, - "last_modified": obj.last_modified, - "is_delta": obj.is_delta, # Determined from filename - # No compression_ratio - would require HEAD request + "key": obj["Key"], + "size": obj["Size"], + "last_modified": obj["LastModified"], + "is_delta": metadata.get("deltaglider-is-delta") == "true", }) elapsed = time.time() - start print(f"Listed {len(items)} objects in {elapsed*1000:.0f}ms") - return items, response.next_continuation_token + return items, response.get("NextContinuationToken") # Example: List first page items, next_token = fast_bucket_listing('releases') @@ -75,12 +75,12 @@ def paginated_listing(bucket: str, page_size: int = 50): FetchMetadata=False # Keep it fast ) - all_objects.extend(response.contents) + all_objects.extend(response["Contents"]) - if not response.is_truncated: + if not response.get("IsTruncated"): break - continuation_token = response.next_continuation_token + continuation_token = response.get("NextContinuationToken") print(f"Fetched {len(all_objects)} objects so far...") return all_objects @@ -96,8 +96,8 @@ print(f"Total objects: {len(all_objects)}") def dashboard_with_stats(bucket: str): 
"""Dashboard view with optional detailed stats.""" - # Quick overview (fast - no metadata) - stats = client.get_bucket_stats(bucket, detailed_stats=False) + # Quick overview (fast LIST-only) + stats = client.get_bucket_stats(bucket) print(f"Quick Stats for {bucket}:") print(f" Total Objects: {stats.object_count}") @@ -108,7 +108,7 @@ def dashboard_with_stats(bucket: str): # Detailed compression analysis (slower - fetches metadata for deltas only) if stats.delta_objects > 0: - detailed_stats = client.get_bucket_stats(bucket, detailed_stats=True) + detailed_stats = client.get_bucket_stats(bucket, mode='detailed') print(f"\nDetailed Compression Stats:") print(f" Average Compression: {detailed_stats.average_compression_ratio:.1%}") print(f" Space Saved: {detailed_stats.space_saved / (1024**3):.2f} GB") @@ -131,11 +131,25 @@ def compression_analysis(bucket: str, prefix: str = ""): ) # Analyze compression effectiveness - delta_files = [obj for obj in response.contents if obj.is_delta] + delta_files: list[dict[str, float | int | str]] = [] + for obj in response["Contents"]: + metadata = obj.get("Metadata", {}) + if metadata.get("deltaglider-is-delta") != "true": + continue + original_size = int(metadata.get("deltaglider-original-size", obj["Size"])) + compression_ratio = float(metadata.get("deltaglider-compression-ratio", 0.0)) + delta_files.append( + { + "key": obj["Key"], + "original": original_size, + "compressed": obj["Size"], + "ratio": compression_ratio, + } + ) if delta_files: - total_original = sum(obj.original_size for obj in delta_files) - total_compressed = sum(obj.compressed_size for obj in delta_files) + total_original = sum(obj["original"] for obj in delta_files) + total_compressed = sum(obj["compressed"] for obj in delta_files) avg_ratio = (total_original - total_compressed) / total_original print(f"Compression Analysis for {prefix or 'all files'}:") @@ -145,11 +159,11 @@ def compression_analysis(bucket: str, prefix: str = ""): print(f" Average Compression: {avg_ratio:.1%}") # Find best and worst compression - best = max(delta_files, key=lambda x: x.compression_ratio or 0) - worst = min(delta_files, key=lambda x: x.compression_ratio or 1) + best = max(delta_files, key=lambda x: x["ratio"]) + worst = min(delta_files, key=lambda x: x["ratio"]) - print(f" Best Compression: {best.key} ({best.compression_ratio:.1%})") - print(f" Worst Compression: {worst.key} ({worst.compression_ratio:.1%})") + print(f" Best Compression: {best['key']} ({best['ratio']:.1%})") + print(f" Worst Compression: {worst['key']} ({worst['ratio']:.1%})") # Example: Analyze v2.0 releases compression_analysis('releases', 'v2.0/') @@ -180,7 +194,11 @@ def performance_comparison(bucket: str): ) time_detailed = (time.time() - start) * 1000 - delta_count = sum(1 for obj in response_fast.contents if obj.is_delta) + delta_count = sum( + 1 + for obj in response_fast["Contents"] + if obj.get("Metadata", {}).get("deltaglider-is-delta") == "true" + ) print(f"Performance Comparison for {bucket}:") print(f" Fast Listing: {time_fast:.0f}ms (1 API call)") @@ -203,7 +221,7 @@ performance_comparison('releases') ## Bucket Statistics and Monitoring -DeltaGlider provides powerful bucket statistics with session-level caching for performance. +DeltaGlider provides powerful bucket statistics with S3-backed caching for performance. 
### Quick Dashboard Stats (Cached) @@ -244,7 +262,7 @@ def detailed_compression_report(bucket: str): """Generate detailed compression report with accurate ratios.""" # Detailed stats fetch metadata for delta files (slower, accurate) - stats = client.get_bucket_stats(bucket, detailed_stats=True) + stats = client.get_bucket_stats(bucket, mode='detailed') efficiency = (stats.space_saved / stats.total_size * 100) if stats.total_size > 0 else 0 @@ -286,7 +304,7 @@ def list_buckets_with_stats(): # Pre-fetch stats for important buckets important_buckets = ['releases', 'backups'] for bucket_name in important_buckets: - client.get_bucket_stats(bucket_name, detailed_stats=True) + client.get_bucket_stats(bucket_name, mode='detailed') # List all buckets (includes cached stats automatically) response = client.list_buckets() @@ -357,7 +375,7 @@ except KeyboardInterrupt: ## Session-Level Cache Management -DeltaGlider maintains session-level caches for optimal performance in long-running applications. +DeltaGlider maintains an encrypted reference cache for optimal performance in long-running applications. ### Long-Running Application Pattern @@ -410,11 +428,8 @@ def handle_external_bucket_changes(bucket: str): print("External backup tool running...") run_external_backup_tool(bucket) # Your external tool - # Clear cache to get fresh data - client.clear_cache() - - # Get updated stats - stats_after = client.get_bucket_stats(bucket) + # Force a recompute of the cached stats + stats_after = client.get_bucket_stats(bucket, refresh_cache=True) print(f"After: {stats_after.object_count} objects") print(f"Added: {stats_after.object_count - stats_before.object_count} objects") @@ -422,35 +437,6 @@ def handle_external_bucket_changes(bucket: str): handle_external_bucket_changes('backups') ``` -### Selective Cache Eviction - -```python -def selective_cache_management(): - """Manage cache for specific delta spaces.""" - - client = create_client() - - # Upload to multiple delta spaces - versions = ['v1.0.0', 'v1.1.0', 'v1.2.0'] - - for version in versions: - client.upload(f"app-{version}.zip", f"s3://releases/{version}/") - - # Update reference for specific version - print("Updating v1.1.0 reference...") - client.upload("new-reference.zip", "s3://releases/v1.1.0/") - - # Evict only v1.1.0 cache (others remain cached) - client.evict_cache("s3://releases/v1.1.0/reference.bin") - - # Next upload to v1.1.0 fetches fresh reference - # v1.0.0 and v1.2.0 still use cached references - client.upload("similar-file.zip", "s3://releases/v1.1.0/") - -# Example: Selective eviction -selective_cache_management() -``` - ### Testing with Clean Cache ```python @@ -491,19 +477,18 @@ def measure_cache_performance(bucket: str): client = create_client() # Test 1: Cold cache - client.clear_cache() start = time.time() - stats1 = client.get_bucket_stats(bucket, detailed_stats=True) + stats1 = client.get_bucket_stats(bucket, mode='detailed', refresh_cache=True) cold_time = (time.time() - start) * 1000 # Test 2: Warm cache start = time.time() - stats2 = client.get_bucket_stats(bucket, detailed_stats=True) + stats2 = client.get_bucket_stats(bucket, mode='detailed') warm_time = (time.time() - start) * 1000 # Test 3: Quick stats from detailed cache start = time.time() - stats3 = client.get_bucket_stats(bucket, detailed_stats=False) + stats3 = client.get_bucket_stats(bucket, mode='quick') reuse_time = (time.time() - start) * 1000 print(f"Cache Performance for {bucket}:") @@ -1707,4 +1692,4 @@ files_to_upload = [ results = 
uploader.upload_batch(files_to_upload) ``` -These examples demonstrate real-world usage patterns for DeltaGlider across various domains. Each example includes error handling, monitoring, and best practices for production deployments. \ No newline at end of file +These examples demonstrate real-world usage patterns for DeltaGlider across various domains. Each example includes error handling, monitoring, and best practices for production deployments.