updates to docs

This commit is contained in:
Simone Scarduzio
2025-11-11 17:05:50 +01:00
parent 7a4d30a007
commit 284f030fae
4 changed files with 110 additions and 208 deletions

View File

@@ -6,13 +6,13 @@
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![xdelta3](https://img.shields.io/badge/powered%20by-xdelta3-green.svg)](https://github.com/jmacd/xdelta)
> 🌟 Star if you like this! 🙏
> Leave a message in [Issues](https://github.com/beshu-tech/deltaglider/issues) - we are listening!
**Store 4TB of similar files in 5GB. No, that's not a typo.**
DeltaGlider is a drop-in S3 replacement that may achieve 99.9% size reduction for versioned compressed artifacts, backups, and release archives through intelligent binary delta compression (via xdelta3).
> 🌟 Star if you like this! Or leave a message in [Issues](https://github.com/beshu-tech/deltaglider/issues) - we are listening!
## The Problem We Solved
You're storing hundreds of versions of your software releases. Each 100MB build differs by <1% from the previous version. You're paying to store 100GB of what's essentially 100MB of unique data.
@@ -33,7 +33,7 @@ We don't expect significant benefit for multimedia content like videos, but we n
## Quick Start
The quickest way to start is using the GUI
DeltaGlider comes as an SDK and a CLI, but we also have a GUI:
* https://github.com/beshu-tech/deltaglider_commander/
<div align="center">
@@ -210,6 +210,12 @@ deltaglider stats my-bucket --refresh # Force cache refresh
deltaglider stats my-bucket --no-cache # Skip caching entirely
deltaglider stats my-bucket --json # JSON output for automation
# Integrity verification & maintenance
deltaglider verify s3://releases/file.zip # Validate stored SHA256
deltaglider purge my-bucket # Clean expired .deltaglider/tmp files
deltaglider purge my-bucket --dry-run # Preview purge results
deltaglider purge my-bucket --json # Machine-readable purge stats
# Migrate existing S3 buckets to DeltaGlider compression
deltaglider migrate s3://old-bucket/ s3://new-bucket/ # Interactive migration
deltaglider migrate s3://old-bucket/ s3://new-bucket/ --yes # Skip confirmation

View File

@@ -206,10 +206,17 @@ from deltaglider import create_client
client = create_client(
endpoint_url="http://minio.internal:9000", # Custom S3 endpoint
log_level="DEBUG", # Detailed logging
cache_dir="/var/cache/deltaglider", # Custom cache location
aws_access_key_id="minio",
aws_secret_access_key="minio",
region_name="eu-west-1",
max_ratio=0.3, # Stricter delta acceptance
)
```
> The SDK now manages an encrypted, process-isolated cache automatically in `/tmp/deltaglider-*`.
> Tune cache behavior via environment variables such as `DG_CACHE_BACKEND`,
> `DG_CACHE_MEMORY_SIZE_MB`, and `DG_CACHE_ENCRYPTION_KEY` instead of passing a `cache_dir` argument.
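For instance, a minimal sketch of tuning the cache through the environment before creating a client; the variable names come from the note above, while the specific values are illustrative assumptions:

```python
import os

from deltaglider import create_client

# Values below are illustrative assumptions, not documented defaults.
os.environ["DG_CACHE_MEMORY_SIZE_MB"] = "256"    # cap the cache size
os.environ["DG_CACHE_BACKEND"] = "memory"        # hypothetical backend name
# os.environ["DG_CACHE_ENCRYPTION_KEY"] = "..."  # pin a key across processes

client = create_client(endpoint_url="http://minio.internal:9000")
```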
## Real-World Example
```python
@@ -299,4 +306,4 @@ url = client.generate_presigned_url(
## License
MIT License - See [LICENSE](https://github.com/beshu-tech/deltaglider/blob/main/LICENSE) for details.

View File

@@ -156,29 +156,34 @@ for obj in response['Contents']:
#### `get_bucket_stats`
Get statistics for a bucket with optional detailed compression metrics. Results are cached per client session for performance.
Get statistics for a bucket with optional detailed compression metrics. Results are cached inside the bucket itself (as `.deltaglider/stats_{mode}.json`) for performance.
```python
def get_bucket_stats(
self,
bucket: str,
detailed_stats: bool = False
mode: Literal["quick", "sampled", "detailed"] = "quick",
use_cache: bool = True,
refresh_cache: bool = False,
) -> BucketStats
```
##### Parameters
- **bucket** (`str`): S3 bucket name.
- **detailed_stats** (`bool`): If True, fetch accurate compression ratios for delta files. Default: False.
- With `detailed_stats=False`: ~50ms for any bucket size (LIST calls only)
- With `detailed_stats=True`: ~2-3s per 1000 objects (adds HEAD calls for delta files)
- **mode** (`Literal[...]`): Accuracy/cost trade-off:
- `"quick"` (default): LIST-only scan; compression ratios for deltas are estimated.
- `"sampled"`: HEAD one delta per deltaspace and reuse its ratio for the remaining deltas in that deltaspace.
- `"detailed"`: HEAD every delta object; slowest but exact.
- **use_cache** (`bool`): If True, read/write `.deltaglider/stats_{mode}.json` in the bucket for reuse.
- **refresh_cache** (`bool`): Force recomputation even if a cache file is valid.
##### Caching Behavior
- **Session-scoped cache**: Results cached within client instance lifetime
- **Automatic invalidation**: Cache cleared on bucket mutations (put, delete, bucket operations)
- **Intelligent reuse**: Detailed stats can serve quick stat requests
- **Manual cache control**: Use `clear_cache()` to invalidate all cached stats
- Stats are cached per mode directly inside the bucket at `.deltaglider/stats_{mode}.json`.
- Every call validates cache freshness via a quick LIST (object count + compressed size).
- `refresh_cache=True` skips cache validation and recomputes immediately.
- `use_cache=False` bypasses both reading and writing cache artifacts.
##### Returns
@@ -195,24 +200,20 @@ def get_bucket_stats(
##### Examples
```python
# Quick stats for dashboard display (cached after first call)
# Quick stats (fast LIST-only)
stats = client.get_bucket_stats('releases')
print(f"Objects: {stats.object_count}, Size: {stats.total_size}")
# Second call hits cache (instant response)
stats = client.get_bucket_stats('releases')
print(f"Space saved: {stats.space_saved} bytes")
# Sampled/detailed modes for analytics
sampled = client.get_bucket_stats('releases', mode='sampled')
detailed = client.get_bucket_stats('releases', mode='detailed')
print(f"Compression ratio: {detailed.average_compression_ratio:.1%}")
# Detailed stats for analytics (slower but accurate, also cached)
stats = client.get_bucket_stats('releases', detailed_stats=True)
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
# Force refresh if an external tool modified the bucket
fresh = client.get_bucket_stats('releases', mode='quick', refresh_cache=True)
# Quick call after detailed call reuses detailed cache (more accurate)
quick_stats = client.get_bucket_stats('releases') # Uses detailed cache
# Clear cache to force refresh
client.clear_cache()
stats = client.get_bucket_stats('releases') # Fresh computation
# Skip cache entirely when running ad-hoc diagnostics
uncached = client.get_bucket_stats('releases', use_cache=False)
```
#### `put_object`
@@ -334,7 +335,7 @@ client.delete_bucket(Bucket='old-releases')
#### `list_buckets`
List all S3 buckets (boto3-compatible). Includes cached statistics when available.
List all S3 buckets (boto3-compatible).
```python
def list_buckets(
@@ -345,51 +346,18 @@ def list_buckets(
##### Returns
Dict with list of buckets and owner information (identical to boto3). Each bucket may include optional `DeltaGliderStats` metadata if statistics have been previously cached.
##### Response Structure
```python
{
'Buckets': [
{
'Name': 'bucket-name',
'CreationDate': datetime(2025, 1, 1),
'DeltaGliderStats': { # Optional, only if cached
'Cached': True,
'Detailed': bool, # Whether detailed stats were fetched
'ObjectCount': int,
'TotalSize': int,
'CompressedSize': int,
'SpaceSaved': int,
'AverageCompressionRatio': float,
'DeltaObjects': int,
'DirectObjects': int
}
}
],
'Owner': {...}
}
```
Dict with the same structure boto3 returns (`Buckets`, `Owner`, `ResponseMetadata`). DeltaGlider does not inject additional metadata; use `get_bucket_stats()` for compression data.
##### Examples
```python
# List all buckets
response = client.list_buckets()
for bucket in response['Buckets']:
print(f"{bucket['Name']} - Created: {bucket['CreationDate']}")
# Check if stats are cached
if 'DeltaGliderStats' in bucket:
stats = bucket['DeltaGliderStats']
print(f" Cached stats: {stats['ObjectCount']} objects, "
f"{stats['AverageCompressionRatio']:.1%} compression")
# Fetch stats first, then list buckets to see cached data
client.get_bucket_stats('my-bucket', detailed_stats=True)
response = client.list_buckets()
# Now 'my-bucket' will include DeltaGliderStats in response
# Combine with get_bucket_stats for deeper insights
stats = client.get_bucket_stats('releases', mode='detailed')
print(f"releases -> {stats.object_count} objects, {stats.space_saved/(1024**3):.2f} GB saved")
```
### Simple API Methods
@@ -528,13 +496,9 @@ else:
### Cache Management Methods
DeltaGlider maintains two types of caches for performance optimization:
1. **Reference cache**: Binary reference files used for delta reconstruction
2. **Statistics cache**: Bucket statistics (session-scoped)
#### `clear_cache`
Clear all cached data including reference files and bucket statistics.
Clear all locally cached reference files.
```python
def clear_cache(self) -> None
@@ -542,23 +506,20 @@ def clear_cache(self) -> None
##### Description
Removes all cached reference files from the local filesystem and invalidates all bucket statistics. Useful for:
- Forcing fresh statistics computation
Removes all cached reference files from the local filesystem. Useful for:
- Freeing disk space in long-running applications
- Ensuring latest data after external bucket modifications
- Ensuring the next upload/download fetches fresh references from S3
- Resetting cache after configuration or credential changes
- Testing and development workflows
##### Cache Types Cleared
##### Cache Scope
1. **Reference Cache**: Binary reference files stored in `/tmp/deltaglider-*/`
- Encrypted at rest with ephemeral keys
- Content-addressed storage (SHA256-based filenames)
- Automatically cleaned up on process exit
2. **Statistics Cache**: Bucket statistics cached per client session
- Metadata about compression ratios and object counts
- Session-scoped (not persisted to disk)
- Automatically invalidated on bucket mutations
- **Reference Cache**: Binary reference files stored in `/tmp/deltaglider-*/`
- Encrypted at rest with ephemeral keys
- Content-addressed storage (SHA256-based filenames)
- Automatically cleaned up on process exit
- **Statistics Cache**: Stored inside the bucket as `.deltaglider/stats_{mode}.json`.
- `clear_cache()` does *not* remove these S3 objects; use `refresh_cache=True` or delete the objects manually if needed.
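If you do need to drop the S3-backed stats cache entirely, a sketch assuming the boto3-compatible `delete_object` passthrough (the key names come from the docs above):

```python
# Illustrative sketch: remove the per-mode stats cache objects from the bucket.
# delete_object is assumed to behave like the boto3 call of the same name.
for mode in ("quick", "sampled", "detailed"):
    client.delete_object(Bucket="releases", Key=f".deltaglider/stats_{mode}.json")
```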
##### Examples
@@ -574,71 +535,14 @@ for i in range(1000):
if i % 100 == 0:
client.clear_cache()
# Force fresh statistics after external changes
stats_before = client.get_bucket_stats('releases') # Cached
# ... external tool modifies bucket ...
client.clear_cache()
stats_after = client.get_bucket_stats('releases') # Fresh data
# Force fresh statistics after external changes (skip cache instead of clearing)
stats_before = client.get_bucket_stats('releases')
stats_after = client.get_bucket_stats('releases', refresh_cache=True)
# Development workflow
client.clear_cache() # Start with clean state
```
#### `evict_cache`
Remove a specific cached reference file from the local cache.
```python
def evict_cache(self, s3_url: str) -> None
```
##### Parameters
- **s3_url** (`str`): S3 URL of the reference file to evict (e.g., `s3://bucket/prefix/reference.bin`)
##### Description
Removes a specific reference file from the cache without affecting other cached files or statistics. Useful for:
- Selective cache invalidation when specific references are updated
- Memory management in applications with many delta spaces
- Testing specific delta compression scenarios
##### Examples
```python
# Evict specific reference after update
client.upload("new-reference.zip", "s3://releases/v2.0.0/")
client.evict_cache("s3://releases/v2.0.0/reference.bin")
# Next upload will fetch fresh reference
client.upload("similar-file.zip", "s3://releases/v2.0.0/")
# Selective eviction for specific delta spaces
delta_spaces = ["v1.0.0", "v1.1.0", "v1.2.0"]
for space in delta_spaces:
client.evict_cache(f"s3://releases/{space}/reference.bin")
```
##### See Also
- [docs/CACHE_MANAGEMENT.md](../../CACHE_MANAGEMENT.md): Complete cache management guide
- `clear_cache()`: Clear all caches
#### `lifecycle_policy`
Set lifecycle policy for S3 prefix (placeholder for future implementation).
```python
def lifecycle_policy(
self,
s3_prefix: str,
days_before_archive: int = 30,
days_before_delete: int = 90
) -> None
```
**Note**: This method is a placeholder for future S3 lifecycle policy management.
## UploadSummary
Data class containing upload operation results.
@@ -995,4 +899,4 @@ client = create_client(log_level="DEBUG")
- **GitHub Issues**: [github.com/beshu-tech/deltaglider/issues](https://github.com/beshu-tech/deltaglider/issues)
- **Documentation**: [github.com/beshu-tech/deltaglider](https://github.com/beshu-tech/deltaglider)
- **PyPI Package**: [pypi.org/project/deltaglider](https://pypi.org/project/deltaglider)

View File

@@ -41,19 +41,19 @@ def fast_bucket_listing(bucket: str):
# Process objects for display
items = []
for obj in response.contents:
for obj in response['Contents']:
metadata = obj.get("Metadata", {})
items.append({
"key": obj.key,
"size": obj.size,
"last_modified": obj.last_modified,
"is_delta": obj.is_delta, # Determined from filename
# No compression_ratio - would require HEAD request
"key": obj["Key"],
"size": obj["Size"],
"last_modified": obj["LastModified"],
"is_delta": metadata.get("deltaglider-is-delta") == "true",
})
elapsed = time.time() - start
print(f"Listed {len(items)} objects in {elapsed*1000:.0f}ms")
return items, response.next_continuation_token
return items, response.get("NextContinuationToken")
# Example: List first page
items, next_token = fast_bucket_listing('releases')
@@ -75,12 +75,12 @@ def paginated_listing(bucket: str, page_size: int = 50):
FetchMetadata=False # Keep it fast
)
all_objects.extend(response.contents)
all_objects.extend(response["Contents"])
if not response.is_truncated:
if not response.get("IsTruncated"):
break
continuation_token = response.next_continuation_token
continuation_token = response.get("NextContinuationToken")
print(f"Fetched {len(all_objects)} objects so far...")
return all_objects
@@ -96,8 +96,8 @@ print(f"Total objects: {len(all_objects)}")
def dashboard_with_stats(bucket: str):
"""Dashboard view with optional detailed stats."""
# Quick overview (fast - no metadata)
stats = client.get_bucket_stats(bucket, detailed_stats=False)
# Quick overview (fast LIST-only)
stats = client.get_bucket_stats(bucket)
print(f"Quick Stats for {bucket}:")
print(f" Total Objects: {stats.object_count}")
@@ -108,7 +108,7 @@ def dashboard_with_stats(bucket: str):
# Detailed compression analysis (slower - fetches metadata for deltas only)
if stats.delta_objects > 0:
detailed_stats = client.get_bucket_stats(bucket, detailed_stats=True)
detailed_stats = client.get_bucket_stats(bucket, mode='detailed')
print(f"\nDetailed Compression Stats:")
print(f" Average Compression: {detailed_stats.average_compression_ratio:.1%}")
print(f" Space Saved: {detailed_stats.space_saved / (1024**3):.2f} GB")
@@ -131,11 +131,25 @@ def compression_analysis(bucket: str, prefix: str = ""):
)
# Analyze compression effectiveness
delta_files = [obj for obj in response.contents if obj.is_delta]
delta_files: list[dict[str, float | int | str]] = []
for obj in response["Contents"]:
metadata = obj.get("Metadata", {})
if metadata.get("deltaglider-is-delta") != "true":
continue
original_size = int(metadata.get("deltaglider-original-size", obj["Size"]))
compression_ratio = float(metadata.get("deltaglider-compression-ratio", 0.0))
delta_files.append(
{
"key": obj["Key"],
"original": original_size,
"compressed": obj["Size"],
"ratio": compression_ratio,
}
)
if delta_files:
total_original = sum(obj.original_size for obj in delta_files)
total_compressed = sum(obj.compressed_size for obj in delta_files)
total_original = sum(obj["original"] for obj in delta_files)
total_compressed = sum(obj["compressed"] for obj in delta_files)
avg_ratio = (total_original - total_compressed) / total_original
print(f"Compression Analysis for {prefix or 'all files'}:")
@@ -145,11 +159,11 @@ def compression_analysis(bucket: str, prefix: str = ""):
print(f" Average Compression: {avg_ratio:.1%}")
# Find best and worst compression
best = max(delta_files, key=lambda x: x.compression_ratio or 0)
worst = min(delta_files, key=lambda x: x.compression_ratio or 1)
best = max(delta_files, key=lambda x: x["ratio"])
worst = min(delta_files, key=lambda x: x["ratio"])
print(f" Best Compression: {best.key} ({best.compression_ratio:.1%})")
print(f" Worst Compression: {worst.key} ({worst.compression_ratio:.1%})")
print(f" Best Compression: {best['key']} ({best['ratio']:.1%})")
print(f" Worst Compression: {worst['key']} ({worst['ratio']:.1%})")
# Example: Analyze v2.0 releases
compression_analysis('releases', 'v2.0/')
@@ -180,7 +194,11 @@ def performance_comparison(bucket: str):
)
time_detailed = (time.time() - start) * 1000
delta_count = sum(1 for obj in response_fast.contents if obj.is_delta)
delta_count = sum(
1
for obj in response_fast["Contents"]
if obj.get("Metadata", {}).get("deltaglider-is-delta") == "true"
)
print(f"Performance Comparison for {bucket}:")
print(f" Fast Listing: {time_fast:.0f}ms (1 API call)")
@@ -203,7 +221,7 @@ performance_comparison('releases')
## Bucket Statistics and Monitoring
DeltaGlider provides powerful bucket statistics with session-level caching for performance.
DeltaGlider provides powerful bucket statistics with S3-backed caching for performance.
### Quick Dashboard Stats (Cached)
@@ -244,7 +262,7 @@ def detailed_compression_report(bucket: str):
"""Generate detailed compression report with accurate ratios."""
# Detailed stats fetch metadata for delta files (slower, accurate)
stats = client.get_bucket_stats(bucket, detailed_stats=True)
stats = client.get_bucket_stats(bucket, mode='detailed')
efficiency = (stats.space_saved / stats.total_size * 100) if stats.total_size > 0 else 0
@@ -286,7 +304,7 @@ def list_buckets_with_stats():
# Pre-fetch stats for important buckets
important_buckets = ['releases', 'backups']
for bucket_name in important_buckets:
client.get_bucket_stats(bucket_name, detailed_stats=True)
client.get_bucket_stats(bucket_name, mode='detailed')
# List all buckets (includes cached stats automatically)
response = client.list_buckets()
@@ -357,7 +375,7 @@ except KeyboardInterrupt:
## Session-Level Cache Management
DeltaGlider maintains session-level caches for optimal performance in long-running applications.
DeltaGlider maintains an encrypted reference cache for optimal performance in long-running applications.
### Long-Running Application Pattern
@@ -410,11 +428,8 @@ def handle_external_bucket_changes(bucket: str):
print("External backup tool running...")
run_external_backup_tool(bucket) # Your external tool
# Clear cache to get fresh data
client.clear_cache()
# Get updated stats
stats_after = client.get_bucket_stats(bucket)
# Force a recompute of the cached stats
stats_after = client.get_bucket_stats(bucket, refresh_cache=True)
print(f"After: {stats_after.object_count} objects")
print(f"Added: {stats_after.object_count - stats_before.object_count} objects")
@@ -422,35 +437,6 @@ def handle_external_bucket_changes(bucket: str):
handle_external_bucket_changes('backups')
```
### Selective Cache Eviction
```python
def selective_cache_management():
"""Manage cache for specific delta spaces."""
client = create_client()
# Upload to multiple delta spaces
versions = ['v1.0.0', 'v1.1.0', 'v1.2.0']
for version in versions:
client.upload(f"app-{version}.zip", f"s3://releases/{version}/")
# Update reference for specific version
print("Updating v1.1.0 reference...")
client.upload("new-reference.zip", "s3://releases/v1.1.0/")
# Evict only v1.1.0 cache (others remain cached)
client.evict_cache("s3://releases/v1.1.0/reference.bin")
# Next upload to v1.1.0 fetches fresh reference
# v1.0.0 and v1.2.0 still use cached references
client.upload("similar-file.zip", "s3://releases/v1.1.0/")
# Example: Selective eviction
selective_cache_management()
```
### Testing with Clean Cache
```python
@@ -491,19 +477,18 @@ def measure_cache_performance(bucket: str):
client = create_client()
# Test 1: Cold cache
client.clear_cache()
start = time.time()
stats1 = client.get_bucket_stats(bucket, detailed_stats=True)
stats1 = client.get_bucket_stats(bucket, mode='detailed', refresh_cache=True)
cold_time = (time.time() - start) * 1000
# Test 2: Warm cache
start = time.time()
stats2 = client.get_bucket_stats(bucket, detailed_stats=True)
stats2 = client.get_bucket_stats(bucket, mode='detailed')
warm_time = (time.time() - start) * 1000
# Test 3: Quick stats from detailed cache
start = time.time()
stats3 = client.get_bucket_stats(bucket, detailed_stats=False)
stats3 = client.get_bucket_stats(bucket, mode='quick')
reuse_time = (time.time() - start) * 1000
print(f"Cache Performance for {bucket}:")
@@ -1707,4 +1692,4 @@ files_to_upload = [
results = uploader.upload_batch(files_to_upload)
```
These examples demonstrate real-world usage patterns for DeltaGlider across various domains. Each example includes error handling, monitoring, and best practices for production deployments.