- Fix pagination bug using continuation_token instead of start_after - Add stats caching to prevent blocking web apps - Improve code formatting and type checking - Add comprehensive unit tests for new features - Fix test mock usage in object_listing tests
10 KiB
Bucket Statistics Caching
TL;DR: Bucket stats are now cached in S3 with automatic validation. What took 20 minutes now takes ~100ms when the bucket hasn't changed.
Overview
DeltaGlider's get_bucket_stats() operation now includes intelligent S3-based caching that dramatically improves performance for read-heavy workloads while maintaining accuracy through automatic validation.
The Problem
Computing bucket statistics requires:
- LIST operation: Get all objects (~50-100ms per 1000 objects)
- HEAD operations: Fetch metadata for delta files (expensive!)
- For a bucket with 10,000 delta files: 10,000 HEAD calls
- Even with 10 parallel workers: ~1,000 sequential batches
- At ~100ms per batch: 100+ seconds minimum
- With network issues or throttling: 20+ minutes 😱
This made monitoring dashboards and repeated stats checks impractical.
The Solution
S3-Based Cache with Automatic Validation
Statistics are cached in S3 at .deltaglider/stats_{mode}.json (one per mode). On every call:
- Quick LIST operation (~50-100ms) - always performed for validation
- Compare current object_count + compressed_size with cache
- If unchanged → Return cached stats instantly ✅ (~100ms total)
- If changed → Recompute and update cache automatically
Three Stats Modes
# Quick mode (default): Fast listing-only, approximate compression metrics
deltaglider stats my-bucket
# Sampled mode: One HEAD per deltaspace, balanced accuracy/speed
deltaglider stats my-bucket --sampled
# Detailed mode: All HEAD calls, most accurate (slowest)
deltaglider stats my-bucket --detailed
Each mode has its own independent cache file.
Performance
| Scenario | Before | After | Speedup |
|---|---|---|---|
| First run (cold cache) | 20 min | 20 min | 1x (must compute) |
| Bucket unchanged (warm cache) | 20 min | 100ms | 200x ✨ |
| Bucket changed (stale cache) | 20 min | 20 min | 1x (auto-recompute) |
| Dashboard monitoring | 20 min/check | 100ms/check | 200x ✨ |
CLI Usage
Basic Usage
# Use cache (default behavior)
deltaglider stats my-bucket
# Force recomputation even if cache valid
deltaglider stats my-bucket --refresh
# Skip cache entirely (both read and write)
deltaglider stats my-bucket --no-cache
# Different modes with caching
deltaglider stats my-bucket --sampled
deltaglider stats my-bucket --detailed
Cache Control Flags
| Flag | Description | Use Case |
|---|---|---|
| (none) | Use cache if valid | Default - Fast monitoring |
--refresh |
Force recomputation | Updated data needed now |
--no-cache |
Skip caching entirely | Testing, one-off analysis |
--sampled |
Balanced mode | Good accuracy, faster than detailed |
--detailed |
Most accurate mode | Analytics, reports |
Python SDK Usage
from deltaglider import create_client
client = create_client()
# Use cache (fast, ~100ms with cache hit)
stats = client.get_bucket_stats('releases')
# Force refresh (slow, recomputes everything)
stats = client.get_bucket_stats('releases', refresh_cache=True)
# Skip cache entirely
stats = client.get_bucket_stats('releases', use_cache=False)
# Different modes with caching
stats = client.get_bucket_stats('releases', mode='quick') # Fast
stats = client.get_bucket_stats('releases', mode='sampled') # Balanced
stats = client.get_bucket_stats('releases', mode='detailed') # Accurate
Cache Structure
Cache files are stored at .deltaglider/stats_{mode}.json in your bucket:
{
"version": "1.0",
"mode": "quick",
"computed_at": "2025-10-14T10:30:00Z",
"validation": {
"object_count": 1523,
"compressed_size": 1234567890
},
"stats": {
"bucket": "releases",
"object_count": 1523,
"total_size": 50000000000,
"compressed_size": 1234567890,
"space_saved": 48765432110,
"average_compression_ratio": 0.9753,
"delta_objects": 1500,
"direct_objects": 23
}
}
How Validation Works
Smart Staleness Detection:
- Always perform quick LIST operation (required anyway, ~50-100ms)
- Calculate current
object_countandcompressed_sizefrom LIST - Compare with cached values
- If both match → Cache valid, return instantly
- If either differs → Bucket changed, recompute automatically
This catches:
- ✅ Objects added (count increases)
- ✅ Objects removed (count decreases)
- ✅ Objects replaced (size changes)
- ✅ Content modified (size changes)
Edge Case: If only metadata changes (tags, headers) but not content/count/size, cache remains valid. This is acceptable since metadata changes are rare and don't affect core statistics.
Use Cases
✅ Perfect For
-
Monitoring Dashboards
- Check stats every minute
- Bucket rarely changes
- 20 min → 100ms per check ✨
-
CI/CD Status Checks
- Verify upload success
- Check compression effectiveness
- Near-instant feedback
-
Repeated Analysis
- Multiple stats queries during investigation
- Cache persists across sessions
- Huge time savings
⚠️ Less Beneficial For
-
Write-Heavy Buckets
- Bucket changes on every check
- Cache always stale
- No benefit, but no harm either (graceful degradation)
-
One-Off Queries
- Single stats check
- Cache doesn't help (cold cache)
- Still works normally
Cache Management
Automatic Management
- Creation: Automatic on first
get_bucket_stats()call - Validation: Automatic on every call (always current)
- Updates: Automatic when bucket changes
- Cleanup: Not needed (cache files are tiny ~1-10KB)
Manual Management
# View cache files
deltaglider ls s3://my-bucket/.deltaglider/
# Delete cache manually (will be recreated automatically)
deltaglider rm s3://my-bucket/.deltaglider/stats_quick.json
deltaglider rm s3://my-bucket/.deltaglider/stats_sampled.json
deltaglider rm s3://my-bucket/.deltaglider/stats_detailed.json
# Or delete entire .deltaglider prefix
deltaglider rm -r s3://my-bucket/.deltaglider/
Technical Details
Cache Files
- Location:
.deltaglider/prefix in each bucket - Naming:
stats_{mode}.json(quick, sampled, detailed) - Size: ~1-10KB per file
- Format: JSON with version, mode, validation data, and stats
Validation Logic
def is_cache_valid(cached, current):
"""Cache is valid if object count and size unchanged."""
return (
cached['object_count'] == current['object_count'] and
cached['compressed_size'] == current['compressed_size']
)
Error Handling
Cache operations are non-fatal:
- ✅ Cache read fails → Compute normally, log warning
- ✅ Cache write fails → Return computed stats, log warning
- ✅ Corrupted cache → Ignore, recompute, overwrite
- ✅ Version mismatch → Ignore, recompute with new version
- ✅ Permission denied → Log warning, continue without caching
The stats operation never fails due to cache issues.
Future Enhancements
Potential improvements for the future:
- TTL-Based Expiration: Auto-refresh after N hours even if unchanged
- Cache Cleanup Command:
deltaglider cache clearfor manual invalidation - Cache Statistics: Show hit/miss rates, staleness info
- Async Cache Updates: Background refresh for very large buckets
- Cross-Bucket Cache: Share reference data across related buckets
Comparison with Old Implementation
| Aspect | Old (In-Memory) | New (S3-Based) |
|---|---|---|
| Storage | Process memory | S3 bucket |
| Persistence | Lost on restart | Survives restarts |
| Sharing | Per-process | Shared across all clients |
| Validation | None | Automatic on every call |
| Staleness | Always fresh | Automatically detected |
| Use Case | Single session | Monitoring, dashboards |
Examples
Example 1: Monitoring Dashboard
from deltaglider import create_client
import time
client = create_client()
while True:
# Fast stats check (~100ms with cache)
stats = client.get_bucket_stats('releases')
print(f"Objects: {stats.object_count}, "
f"Compression: {stats.average_compression_ratio:.1%}")
time.sleep(60) # Check every minute
# First run: 20 min (computes and caches)
# All subsequent runs: ~100ms (cache hit)
Example 2: CI/CD Pipeline
from deltaglider import create_client
client = create_client()
# Upload new release
client.upload("v2.0.0.zip", "s3://releases/v2.0.0/")
# Quick verification (fast with cache)
stats = client.get_bucket_stats('releases')
if stats.average_compression_ratio < 0.90:
print("Warning: Lower than expected compression")
Example 3: Force Fresh Stats
from deltaglider import create_client
client = create_client()
# Force recomputation for accurate report
stats = client.get_bucket_stats(
'releases',
mode='detailed',
refresh_cache=True
)
print(f"Accurate compression report:")
print(f" Original: {stats.total_size / 1e9:.1f} GB")
print(f" Stored: {stats.compressed_size / 1e9:.1f} GB")
print(f" Saved: {stats.space_saved / 1e9:.1f} GB ({stats.average_compression_ratio:.1%})")
FAQ
Q: Does caching affect accuracy? A: No! Cache is automatically validated on every call. If the bucket changed, stats are recomputed automatically.
Q: What if I need fresh stats immediately?
A: Use --refresh flag (CLI) or refresh_cache=True (SDK) to force recomputation.
Q: Can I disable caching?
A: Yes, use --no-cache flag (CLI) or use_cache=False (SDK).
Q: How much space do cache files use? A: ~1-10KB per mode, negligible for any bucket.
Q: What happens if cache write fails? A: The operation continues normally - computed stats are returned and a warning is logged. Caching is optional and non-fatal.
Q: Do I need to clean up cache files?
A: No, they're tiny and automatically managed. But you can delete .deltaglider/ prefix if desired.
Q: Does cache work across different modes? A: Each mode (quick, sampled, detailed) has its own independent cache file.
Implementation: See [PR #XX] for complete implementation details and test coverage.
Related: SDK Documentation | CLI Reference | Architecture