mirror of
https://github.com/beshu-tech/deltaglider.git
synced 2026-04-30 12:14:32 +02:00
Compare commits
18 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
c3d385bf18 | ||
|
|
aea5cb5d9a | ||
|
|
b2ca59490b | ||
|
|
4f56c4b600 | ||
|
|
14c6af0f35 | ||
|
|
67792b2031 | ||
|
|
a9a1396e6e | ||
|
|
52eb5bba21 | ||
|
|
f75db142e8 | ||
|
|
35d34d4862 | ||
|
|
9230cbd762 | ||
|
|
2eba6e8d38 | ||
|
|
656726b57b | ||
|
|
85dd315424 | ||
|
|
dbd2632cae | ||
|
|
3d04a407c0 | ||
|
|
47f022fffe | ||
|
|
7a2ed16ee7 |
90
CHANGELOG.md
90
CHANGELOG.md
@@ -7,6 +7,94 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
### Added
|
||||
- **EC2 Region Detection & Cost Optimization**
|
||||
- Automatic detection of EC2 instance region using IMDSv2
|
||||
- Warns when EC2 region ≠ S3 client region (potential cross-region charges)
|
||||
- Different warnings for auto-detected vs. explicit `--region` flag mismatches
|
||||
- Green checkmark when regions are aligned (optimal configuration)
|
||||
- Can be disabled with `DG_DISABLE_EC2_DETECTION=true` environment variable
|
||||
- Helps users optimize for cost and performance before migration starts
|
||||
- **New CLI Command**: `deltaglider migrate` for S3-to-S3 bucket migration with compression
|
||||
- Supports resume capability (skips already migrated files)
|
||||
- Real-time progress tracking with file count and statistics
|
||||
- Interactive confirmation prompt (use `--yes` to skip)
|
||||
- Prefix preservation by default (use `--no-preserve-prefix` to disable)
|
||||
- Dry run mode with `--dry-run` flag
|
||||
- Include/exclude pattern filtering
|
||||
- Shows compression statistics after migration
|
||||
- **EC2-aware region logging**: Detects EC2 instance and warns about cross-region charges
|
||||
- **FIXED**: Now correctly preserves original filenames during migration
|
||||
- **S3-to-S3 Recursive Copy**: `deltaglider cp -r s3://source/ s3://dest/` now supported
|
||||
- Automatically uses migration functionality with prefix preservation
|
||||
- Applies delta compression during transfer
|
||||
- Preserves original filenames correctly
|
||||
- **Version Command**: Added `--version` flag to show deltaglider version
|
||||
- Usage: `deltaglider --version`
|
||||
- **DeltaService API Enhancement**: Added `override_name` parameter to `put()` method
|
||||
- Allows specifying destination filename independently of source filesystem path
|
||||
- Enables proper S3-to-S3 transfers without filesystem renaming tricks
|
||||
|
||||
### Fixed
|
||||
- **Critical**: S3-to-S3 migration now preserves original filenames
|
||||
- Previously created files with temp names like `tmp1b9cpdsn.zip`
|
||||
- Now correctly uses original filenames from source S3 keys
|
||||
- Fixed by adding `override_name` parameter to `DeltaService.put()`
|
||||
- **CLI Region Support**: `--region` flag now properly passes region to boto3 client
|
||||
- Previously only set environment variable, relied on boto3 auto-detection
|
||||
- Now explicitly passes `region_name` to `boto3.client()` via `boto3_kwargs`
|
||||
- Ensures consistent behavior with `DeltaGliderClient` SDK
|
||||
|
||||
### Changed
|
||||
- Recursive S3-to-S3 copy operations now preserve source prefix structure by default
|
||||
- Migration operations show formatted output with source and destination paths
|
||||
|
||||
### Documentation
|
||||
- Added comprehensive migration guide in README.md
|
||||
- Updated CLI reference with migrate command examples
|
||||
- Added prefix preservation behavior documentation
|
||||
|
||||
## [5.1.1] - 2025-01-10
|
||||
|
||||
### Fixed
|
||||
- **Stats Command**: Fixed incorrect compression ratio calculations
|
||||
- Now correctly counts ALL files including reference.bin in compressed size
|
||||
- Fixed handling of orphaned reference.bin files (reference files with no delta files)
|
||||
- Added prominent warnings for orphaned reference files with cleanup commands
|
||||
- Fixed stats for buckets with no compression (now shows 0% instead of negative)
|
||||
- SHA1 checksum files are now properly included in calculations
|
||||
|
||||
### Improved
|
||||
- **Stats Performance**: Optimized metadata fetching with parallel requests
|
||||
- 5-10x faster for buckets with many delta files
|
||||
- Uses ThreadPoolExecutor for concurrent HEAD requests
|
||||
- Single-pass calculation algorithm for better efficiency
|
||||
|
||||
## [5.1.0] - 2025-10-10
|
||||
|
||||
### Added
|
||||
- **New CLI Command**: `deltaglider stats <bucket>` for bucket statistics and compression metrics
|
||||
- Supports `--detailed` flag for comprehensive analysis
|
||||
- Supports `--json` flag for machine-readable output
|
||||
- Accepts multiple formats: `s3://bucket/`, `s3://bucket`, `bucket`
|
||||
- **Session-Level Statistics Caching**: Bucket stats now cached per client instance
|
||||
- Automatic cache invalidation on mutations (put, delete, bucket operations)
|
||||
- Intelligent cache reuse (detailed stats serve quick stat requests)
|
||||
- Enhanced `list_buckets()` includes cached stats when available
|
||||
- **Programmatic Cache Management**: Added cache management APIs for long-running applications
|
||||
- `clear_cache()`: Clear all cached references
|
||||
- `evict_cache()`: Remove specific cached reference
|
||||
- Session-scoped cache lifecycle management
|
||||
|
||||
### Changed
|
||||
- Bucket statistics are now cached within client session for performance
|
||||
- `list_buckets()` response includes `DeltaGliderStats` metadata when cached
|
||||
|
||||
### Documentation
|
||||
- Added comprehensive DG_MAX_RATIO tuning guide in docs/
|
||||
- Updated CLI command reference in CLAUDE.md and README.md
|
||||
- Added detailed cache management documentation
|
||||
|
||||
## [5.0.3] - 2025-10-10
|
||||
|
||||
### Security
|
||||
@@ -152,6 +240,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
- Delta compression for versioned artifacts
|
||||
- 99%+ compression for similar files
|
||||
|
||||
[5.1.0]: https://github.com/beshu-tech/deltaglider/compare/v5.0.3...v5.1.0
|
||||
[5.0.3]: https://github.com/beshu-tech/deltaglider/compare/v5.0.1...v5.0.3
|
||||
[5.0.1]: https://github.com/beshu-tech/deltaglider/compare/v5.0.0...v5.0.1
|
||||
[5.0.0]: https://github.com/beshu-tech/deltaglider/compare/v4.2.4...v5.0.0
|
||||
[4.2.4]: https://github.com/beshu-tech/deltaglider/compare/v4.2.3...v4.2.4
|
||||
|
||||
17
CLAUDE.md
17
CLAUDE.md
@@ -74,6 +74,17 @@ export AWS_SECRET_ACCESS_KEY=minioadmin
|
||||
|
||||
# Now you can use deltaglider commands
|
||||
deltaglider cp test.zip s3://test-bucket/
|
||||
deltaglider stats test-bucket # Get bucket statistics
|
||||
```
|
||||
|
||||
### Available CLI Commands
|
||||
```bash
|
||||
cp # Copy files to/from S3 (AWS S3 compatible)
|
||||
ls # List S3 buckets or objects (AWS S3 compatible)
|
||||
rm # Remove S3 objects (AWS S3 compatible)
|
||||
sync # Synchronize directories with S3 (AWS S3 compatible)
|
||||
stats # Get bucket statistics and compression metrics
|
||||
verify # Verify integrity of delta file
|
||||
```
|
||||
|
||||
## Architecture
|
||||
@@ -189,7 +200,11 @@ Core delta logic is in `src/deltaglider/core/service.py`:
|
||||
## Environment Variables
|
||||
|
||||
- `DG_LOG_LEVEL`: Logging level (default: "INFO")
|
||||
- `DG_MAX_RATIO`: Maximum acceptable delta/file ratio (default: "0.5")
|
||||
- `DG_MAX_RATIO`: Maximum acceptable delta/file ratio (default: "0.5", range: "0.0-1.0")
|
||||
- **See [docs/DG_MAX_RATIO.md](docs/DG_MAX_RATIO.md) for complete tuning guide**
|
||||
- Controls when to use delta vs. direct storage
|
||||
- Lower (0.2-0.3) = conservative, only high-quality compression
|
||||
- Higher (0.6-0.7) = permissive, accept modest savings
|
||||
- `DG_CACHE_BACKEND`: Cache backend type - "filesystem" (default) or "memory"
|
||||
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size limit in MB (default: "100")
|
||||
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded Fernet key for persistent encryption (ephemeral by default)
|
||||
|
||||
@@ -71,6 +71,11 @@ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
||||
ENV DG_LOG_LEVEL=INFO
|
||||
|
||||
# Performance & Compression
|
||||
# DG_MAX_RATIO: Maximum delta/file ratio (0.0-1.0)
|
||||
# Default 0.5 means: only use delta if delta_size ≤ 50% of original_size
|
||||
# Lower (0.2-0.3) = more conservative, only high-quality compression
|
||||
# Higher (0.6-0.7) = more permissive, accept modest savings
|
||||
# See docs/DG_MAX_RATIO.md for complete tuning guide
|
||||
ENV DG_MAX_RATIO=0.5
|
||||
|
||||
# Cache Configuration
|
||||
|
||||
68
README.md
68
README.md
@@ -85,15 +85,18 @@ docker run -v /shared-cache:/tmp/.deltaglider \
|
||||
|
||||
**Environment Variables**:
|
||||
- `DG_LOG_LEVEL`: Logging level (default: `INFO`, options: `DEBUG`, `INFO`, `WARNING`, `ERROR`)
|
||||
- `DG_MAX_RATIO`: Maximum delta/file ratio (default: `0.5`, range: `0.0-1.0`)
|
||||
- `DG_MAX_RATIO`: Maximum delta/file ratio (default: `0.5`, range: `0.0-1.0`) - [📖 Complete Guide](docs/DG_MAX_RATIO.md)
|
||||
- `DG_CACHE_BACKEND`: Cache backend (default: `filesystem`, options: `filesystem`, `memory`)
|
||||
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size in MB (default: `100`)
|
||||
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded encryption key for cross-process cache sharing
|
||||
- `DG_DISABLE_EC2_DETECTION`: Disable EC2 instance detection (default: `false`, set to `true` to disable)
|
||||
- `AWS_ENDPOINT_URL`: S3 endpoint URL (default: AWS S3)
|
||||
- `AWS_ACCESS_KEY_ID`: AWS access key
|
||||
- `AWS_SECRET_ACCESS_KEY`: AWS secret key
|
||||
- `AWS_DEFAULT_REGION`: AWS region (default: `us-east-1`)
|
||||
|
||||
> **💡 Tip**: `DG_MAX_RATIO` is a powerful tuning parameter. See the [DG_MAX_RATIO guide](docs/DG_MAX_RATIO.md) to learn how to optimize compression for your use case.
|
||||
|
||||
**Security Notes**:
|
||||
- Encryption is **always enabled** (cannot be disabled)
|
||||
- Each container gets ephemeral encryption keys for maximum security
|
||||
@@ -114,6 +117,9 @@ deltaglider ls s3://releases/
|
||||
|
||||
# Sync directories
|
||||
deltaglider sync ./dist/ s3://releases/v1.0.0/
|
||||
|
||||
# Migrate existing S3 bucket to DeltaGlider-compressed storage
|
||||
deltaglider migrate s3://old-bucket/ s3://new-bucket/
|
||||
```
|
||||
|
||||
**That's it!** DeltaGlider automatically detects similar files and applies 99%+ compression. For more commands and options, see [CLI Reference](#cli-reference).
|
||||
@@ -187,6 +193,19 @@ deltaglider sync s3://releases/ ./local-backup/ # Sync from S3
|
||||
deltaglider sync --delete ./src/ s3://backup/ # Mirror exactly
|
||||
deltaglider sync --exclude "*.log" ./src/ s3://backup/ # Exclude patterns
|
||||
|
||||
# Get bucket statistics (compression metrics)
|
||||
deltaglider stats my-bucket # Quick stats overview
|
||||
deltaglider stats s3://my-bucket # Also accepts s3:// format
|
||||
deltaglider stats s3://my-bucket/ # With or without trailing slash
|
||||
deltaglider stats my-bucket --detailed # Detailed compression metrics (slower)
|
||||
deltaglider stats my-bucket --json # JSON output for automation
|
||||
|
||||
# Migrate existing S3 buckets to DeltaGlider compression
|
||||
deltaglider migrate s3://old-bucket/ s3://new-bucket/ # Interactive migration
|
||||
deltaglider migrate s3://old-bucket/ s3://new-bucket/ --yes # Skip confirmation
|
||||
deltaglider migrate --dry-run s3://old-bucket/ s3://new/ # Preview migration
|
||||
deltaglider migrate s3://bucket/v1/ s3://bucket/v2/ # Migrate prefixes
|
||||
|
||||
# Works with MinIO, R2, and S3-compatible storage
|
||||
deltaglider cp file.zip s3://bucket/ --endpoint-url http://localhost:9000
|
||||
```
|
||||
@@ -510,10 +529,57 @@ Migrating from `aws s3` to `deltaglider` is as simple as changing the command na
|
||||
| `aws s3 rm s3://bucket/file` | `deltaglider rm s3://bucket/file` | - |
|
||||
| `aws s3 sync dir/ s3://bucket/` | `deltaglider sync dir/ s3://bucket/` | ✅ 99% incremental |
|
||||
|
||||
### Migrating Existing S3 Buckets
|
||||
|
||||
DeltaGlider provides a dedicated `migrate` command to compress your existing S3 data:
|
||||
|
||||
```bash
|
||||
# Migrate an entire bucket
|
||||
deltaglider migrate s3://old-bucket/ s3://compressed-bucket/
|
||||
|
||||
# Migrate a prefix (preserves prefix structure by default)
|
||||
deltaglider migrate s3://bucket/releases/ s3://bucket/archive/
|
||||
# Result: s3://bucket/archive/releases/ contains the files
|
||||
|
||||
# Migrate without preserving source prefix
|
||||
deltaglider migrate --no-preserve-prefix s3://bucket/v1/ s3://bucket/archive/
|
||||
# Result: Files go directly into s3://bucket/archive/
|
||||
|
||||
# Preview migration (dry run)
|
||||
deltaglider migrate --dry-run s3://old/ s3://new/
|
||||
|
||||
# Skip confirmation prompt
|
||||
deltaglider migrate --yes s3://old/ s3://new/
|
||||
|
||||
# Exclude certain file patterns
|
||||
deltaglider migrate --exclude "*.log" s3://old/ s3://new/
|
||||
```
|
||||
|
||||
**Key Features:**
|
||||
- **Resume Support**: Migration automatically skips files that already exist in the destination
|
||||
- **Progress Tracking**: Shows real-time migration progress and statistics
|
||||
- **Safety First**: Interactive confirmation shows file count before starting
|
||||
- **EC2 Cost Optimization**: Automatically detects EC2 instance region and warns about cross-region charges
|
||||
- ✅ Green checkmark when regions align (no extra charges)
|
||||
- ℹ️ INFO when auto-detected mismatch (suggests optimal region)
|
||||
- ⚠️ WARNING when user explicitly set wrong `--region` (expect data transfer costs)
|
||||
- Disable with `DG_DISABLE_EC2_DETECTION=true` if needed
|
||||
- **AWS Region Transparency**: Displays the actual AWS region being used
|
||||
- **Prefix Preservation**: By default, source prefix is preserved in destination (use `--no-preserve-prefix` to disable)
|
||||
- **S3-to-S3 Transfer**: Both regular S3 and DeltaGlider buckets supported
|
||||
|
||||
**Prefix Preservation Examples:**
|
||||
- `s3://src/data/` → `s3://dest/` creates `s3://dest/data/`
|
||||
- `s3://src/a/b/c/` → `s3://dest/x/` creates `s3://dest/x/c/`
|
||||
- Use `--no-preserve-prefix` to place files directly in destination without the source prefix
|
||||
|
||||
The migration preserves all file names and structure while applying DeltaGlider's compression transparently.
|
||||
|
||||
## Production Ready
|
||||
|
||||
- ✅ **Battle tested**: 200K+ files in production
|
||||
- ✅ **Data integrity**: SHA256 verification on every operation
|
||||
- ✅ **Cost optimization**: Automatic EC2 region detection warns about cross-region charges - [📖 EC2 Detection Guide](docs/EC2_REGION_DETECTION.md)
|
||||
- ✅ **S3 compatible**: Works with AWS, MinIO, Cloudflare R2, etc.
|
||||
- ✅ **Atomic operations**: No partial states
|
||||
- ✅ **Concurrent safe**: Multiple clients supported
|
||||
|
||||
684
docs/CACHE_MANAGEMENT.md
Normal file
684
docs/CACHE_MANAGEMENT.md
Normal file
@@ -0,0 +1,684 @@
|
||||
# Cache Management for Long-Running Applications
|
||||
|
||||
## Overview
|
||||
|
||||
DeltaGlider uses caching to store reference files locally for efficient delta compression. However, unlike the CLI which automatically cleans up cache on exit, **programmatic SDK usage requires manual cache management** for long-running applications.
|
||||
|
||||
This guide explains how to manage cache in production applications, including:
|
||||
- When and how to clear cache
|
||||
- Encryption key management strategies
|
||||
- Memory vs. filesystem cache trade-offs
|
||||
- Best practices for different application types
|
||||
|
||||
## The Problem
|
||||
|
||||
**CLI (Automatic Cleanup)**:
|
||||
```bash
|
||||
# Cache created in /tmp/deltaglider-xyz123/
|
||||
deltaglider cp file.zip s3://bucket/
|
||||
|
||||
# Process exits → Cache automatically deleted via atexit handler
|
||||
# ✅ No manual cleanup needed
|
||||
```
|
||||
|
||||
**SDK (Manual Cleanup Required)**:
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Long-running application (runs for hours/days)
|
||||
while True:
|
||||
client.put_object(Bucket='releases', Key='file.zip', Body=data)
|
||||
time.sleep(600) # Upload every 10 minutes
|
||||
|
||||
# ❌ Process never exits → Cache never cleaned
|
||||
# ❌ Cache grows indefinitely
|
||||
# ❌ Memory/disk exhaustion after days/weeks
|
||||
```
|
||||
|
||||
## Solution: Manual Cache Management
|
||||
|
||||
### Basic Cache Clearing
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Do some uploads
|
||||
client.put_object(Bucket='releases', Key='file1.zip', Body=data1)
|
||||
client.put_object(Bucket='releases', Key='file2.zip', Body=data2)
|
||||
|
||||
# Clear cache to free resources
|
||||
client.clear_cache()
|
||||
|
||||
# ✅ All cached references removed
|
||||
# ✅ Memory/disk freed
|
||||
# ✅ Next upload will fetch fresh reference from S3
|
||||
```
|
||||
|
||||
### When to Clear Cache
|
||||
|
||||
| Scenario | Frequency | Reason |
|
||||
|----------|-----------|--------|
|
||||
| **Long-running services** | Every 1-4 hours | Prevent memory/disk growth |
|
||||
| **After config changes** | Immediately | Old cache is invalid |
|
||||
| **High memory pressure** | As needed | Free resources |
|
||||
| **Test cleanup** | After each test | Ensure clean state |
|
||||
| **Scheduled jobs** | After job completes | Clean up before next run |
|
||||
| **Key rotation** | After rotation | Old encrypted cache unusable |
|
||||
|
||||
## Cache Strategies by Application Type
|
||||
|
||||
### 1. Long-Running Background Service
|
||||
|
||||
**Scenario**: Continuous upload service running 24/7
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
import schedule
|
||||
import time
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
client = create_client()
|
||||
|
||||
def upload_task():
|
||||
"""Upload latest build."""
|
||||
try:
|
||||
with open('latest-build.zip', 'rb') as f:
|
||||
response = client.put_object(
|
||||
Bucket='releases',
|
||||
Key=f'builds/{datetime.now().isoformat()}.zip',
|
||||
Body=f
|
||||
)
|
||||
logger.info(f"Uploaded: {response['ETag']}")
|
||||
except Exception as e:
|
||||
logger.error(f"Upload failed: {e}")
|
||||
|
||||
def cleanup_task():
|
||||
"""Clear cache to prevent growth."""
|
||||
client.clear_cache()
|
||||
logger.info("Cache cleared - freed resources")
|
||||
|
||||
# Upload every 10 minutes
|
||||
schedule.every(10).minutes.do(upload_task)
|
||||
|
||||
# Clear cache every 2 hours (balance performance vs. memory)
|
||||
schedule.every(2).hours.do(cleanup_task)
|
||||
|
||||
# Run indefinitely
|
||||
while True:
|
||||
schedule.run_pending()
|
||||
time.sleep(60)
|
||||
```
|
||||
|
||||
**Cache Clearing Frequency Guidelines**:
|
||||
- Every 1 hour: High upload frequency (>100/day), memory-constrained
|
||||
- Every 2-4 hours: Moderate upload frequency (10-100/day), normal memory
|
||||
- Every 6-12 hours: Low upload frequency (<10/day), abundant memory
|
||||
|
||||
### 2. Periodic Batch Job
|
||||
|
||||
**Scenario**: Daily backup script
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
import glob
|
||||
from pathlib import Path
|
||||
|
||||
def daily_backup():
|
||||
"""Upload daily backups and clean up."""
|
||||
client = create_client()
|
||||
|
||||
try:
|
||||
# Upload all backup files
|
||||
for backup in glob.glob('/backups/*.zip'):
|
||||
with open(backup, 'rb') as f:
|
||||
client.put_object(
|
||||
Bucket='backups',
|
||||
Key=f'daily/{Path(backup).name}',
|
||||
Body=f
|
||||
)
|
||||
print(f"Backed up: {backup}")
|
||||
|
||||
finally:
|
||||
# ALWAYS clear cache at end of job
|
||||
client.clear_cache()
|
||||
print("Cache cleared - job complete")
|
||||
|
||||
# Run daily via cron/systemd timer
|
||||
if __name__ == '__main__':
|
||||
daily_backup()
|
||||
```
|
||||
|
||||
**Best Practice**: Always clear cache in `finally` block to ensure cleanup even if job fails.
|
||||
|
||||
### 3. Web Application / API Server
|
||||
|
||||
**Scenario**: Flask/FastAPI app with upload endpoints
|
||||
|
||||
```python
|
||||
from fastapi import FastAPI, UploadFile, BackgroundTasks
|
||||
from deltaglider import create_client
|
||||
import asyncio
|
||||
|
||||
app = FastAPI()
|
||||
|
||||
# Create client once at startup
|
||||
client = create_client()
|
||||
|
||||
async def periodic_cache_cleanup():
|
||||
"""Background task to clear cache periodically."""
|
||||
while True:
|
||||
await asyncio.sleep(3600) # Every hour
|
||||
client.clear_cache()
|
||||
print("Cache cleared in background")
|
||||
|
||||
@app.on_event("startup")
|
||||
async def startup_event():
|
||||
"""Start background cache cleanup task."""
|
||||
asyncio.create_task(periodic_cache_cleanup())
|
||||
|
||||
@app.post("/upload")
|
||||
async def upload_file(file: UploadFile):
|
||||
"""Upload endpoint with automatic cache management."""
|
||||
content = await file.read()
|
||||
|
||||
response = client.put_object(
|
||||
Bucket='uploads',
|
||||
Key=f'files/{file.filename}',
|
||||
Body=content
|
||||
)
|
||||
|
||||
return {"message": "Uploaded", "etag": response['ETag']}
|
||||
|
||||
@app.post("/admin/clear-cache")
|
||||
async def admin_clear_cache():
|
||||
"""Manual cache clear endpoint for admin."""
|
||||
client.clear_cache()
|
||||
return {"message": "Cache cleared"}
|
||||
```
|
||||
|
||||
**Best Practice**: Run periodic cache cleanup in background task, provide manual clear endpoint for emergencies.
|
||||
|
||||
### 4. Testing / CI/CD
|
||||
|
||||
**Scenario**: Test suite using DeltaGlider
|
||||
|
||||
```python
|
||||
import pytest
|
||||
from deltaglider import create_client
|
||||
|
||||
@pytest.fixture
|
||||
def deltaglider_client():
|
||||
"""Provide clean client for each test."""
|
||||
client = create_client()
|
||||
yield client
|
||||
# ALWAYS clear cache after test
|
||||
client.clear_cache()
|
||||
|
||||
def test_upload(deltaglider_client):
|
||||
"""Test upload with automatic cleanup."""
|
||||
response = deltaglider_client.put_object(
|
||||
Bucket='test-bucket',
|
||||
Key='test-file.zip',
|
||||
Body=b'test data'
|
||||
)
|
||||
assert response['ETag'] is not None
|
||||
# Cache automatically cleared by fixture
|
||||
|
||||
def test_download(deltaglider_client):
|
||||
"""Test download with clean cache."""
|
||||
# Cache is clean from previous test
|
||||
deltaglider_client.put_object(Bucket='test', Key='file.zip', Body=b'data')
|
||||
response = deltaglider_client.get_object(Bucket='test', Key='file.zip')
|
||||
assert response['Body'].read() == b'data'
|
||||
# Cache automatically cleared by fixture
|
||||
```
|
||||
|
||||
**Best Practice**: Use pytest fixtures to ensure cache is cleared after each test.
|
||||
|
||||
### 5. AWS Lambda / Serverless
|
||||
|
||||
**Scenario**: Lambda function with warm container reuse
|
||||
|
||||
```python
|
||||
import os
|
||||
from deltaglider import create_client
|
||||
|
||||
# Initialize client outside handler (reused across invocations)
|
||||
client = create_client()
|
||||
|
||||
# Track invocation count for cache clearing
|
||||
invocation_count = 0
|
||||
|
||||
def lambda_handler(event, context):
|
||||
"""Lambda handler with periodic cache clearing."""
|
||||
global invocation_count
|
||||
invocation_count += 1
|
||||
|
||||
try:
|
||||
# Upload file
|
||||
response = client.put_object(
|
||||
Bucket='lambda-uploads',
|
||||
Key=event['filename'],
|
||||
Body=event['data']
|
||||
)
|
||||
|
||||
# Clear cache every 50 invocations (warm container optimization)
|
||||
if invocation_count % 50 == 0:
|
||||
client.clear_cache()
|
||||
print(f"Cache cleared after {invocation_count} invocations")
|
||||
|
||||
return {'statusCode': 200, 'etag': response['ETag']}
|
||||
|
||||
except Exception as e:
|
||||
# Clear cache on error to prevent poisoned state
|
||||
client.clear_cache()
|
||||
return {'statusCode': 500, 'error': str(e)}
|
||||
```
|
||||
|
||||
**Best Practice**: Clear cache periodically (every N invocations) and on errors. Lambda warm containers can reuse cache across invocations for performance.
|
||||
|
||||
## Encryption Key Management
|
||||
|
||||
DeltaGlider always encrypts cache data. Understanding key management is critical for programmatic usage.
|
||||
|
||||
### Ephemeral Keys (Default - Recommended)
|
||||
|
||||
**How It Works**:
|
||||
- New encryption key generated per client instance
|
||||
- Cache encrypted with instance-specific key
|
||||
- Key lost when client is garbage collected
|
||||
- **Maximum security** - keys never persist
|
||||
|
||||
**When to Use**:
|
||||
- Single-process applications
|
||||
- Short-lived scripts
|
||||
- CI/CD pipelines
|
||||
- Testing
|
||||
- Maximum security requirements
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
# Create client (generates ephemeral key automatically)
|
||||
client = create_client()
|
||||
|
||||
# Upload file (encrypted with ephemeral key)
|
||||
client.put_object(Bucket='bucket', Key='file.zip', Body=data)
|
||||
|
||||
# Clear cache
|
||||
client.clear_cache()
|
||||
|
||||
# ✅ Encrypted cache cleared
|
||||
# ✅ Key was never persisted
|
||||
# ✅ Perfect forward secrecy
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- ✅ Maximum security (keys never leave process)
|
||||
- ✅ Perfect forward secrecy
|
||||
- ✅ No key management overhead
|
||||
- ❌ Cache not shareable between processes
|
||||
- ❌ Cache not reusable after client recreation
|
||||
|
||||
### Persistent Keys (Advanced - Shared Cache)
|
||||
|
||||
**How It Works**:
|
||||
- Use same encryption key across multiple processes/clients
|
||||
- Key stored in environment variable or secrets manager
|
||||
- All processes can read each other's encrypted cache
|
||||
- **Trade-off**: Convenience vs. security
|
||||
|
||||
**When to Use**:
|
||||
- Multi-process applications (workers, replicas)
|
||||
- Shared cache across containers
|
||||
- Cache persistence across application restarts
|
||||
- Horizontal scaling scenarios
|
||||
|
||||
**Example - Environment Variable**:
|
||||
```python
|
||||
import os
|
||||
from cryptography.fernet import Fernet
|
||||
import base64
|
||||
|
||||
# Generate persistent key (do this ONCE, securely)
|
||||
key = Fernet.generate_key()
|
||||
key_b64 = base64.b64encode(key).decode('utf-8')
|
||||
print(f"DG_CACHE_ENCRYPTION_KEY={key_b64}") # Store in secrets manager!
|
||||
|
||||
# Set in environment (or use secrets manager)
|
||||
os.environ['DG_CACHE_ENCRYPTION_KEY'] = key_b64
|
||||
|
||||
# All client instances use same key
|
||||
client1 = create_client()
|
||||
client2 = create_client()
|
||||
|
||||
# Client1 writes to cache
|
||||
client1.put_object(Bucket='bucket', Key='file.zip', Body=data)
|
||||
|
||||
# Client2 can read same cached data (same key!)
|
||||
client2.get_object(Bucket='bucket', Key='file.zip')
|
||||
|
||||
# ✅ Cache shared between processes
|
||||
# ⚠️ Key must be securely managed
|
||||
```
|
||||
|
||||
**Example - AWS Secrets Manager**:
|
||||
```python
|
||||
import boto3
|
||||
import json
|
||||
from deltaglider import create_client
|
||||
|
||||
def get_encryption_key_from_secrets_manager():
|
||||
"""Retrieve encryption key from AWS Secrets Manager."""
|
||||
secrets = boto3.client('secretsmanager', region_name='us-west-2')
|
||||
response = secrets.get_secret_value(SecretId='deltaglider/cache-encryption-key')
|
||||
secret = json.loads(response['SecretString'])
|
||||
return secret['encryption_key']
|
||||
|
||||
# Retrieve key securely
|
||||
os.environ['DG_CACHE_ENCRYPTION_KEY'] = get_encryption_key_from_secrets_manager()
|
||||
|
||||
# Create client with persistent key
|
||||
client = create_client()
|
||||
```
|
||||
|
||||
**Security Best Practices for Persistent Keys**:
|
||||
1. **Generate once**: Never regenerate unless rotating
|
||||
2. **Store securely**: AWS Secrets Manager, HashiCorp Vault, etc.
|
||||
3. **Rotate regularly**: Follow your key rotation policy
|
||||
4. **Audit access**: Log who accesses encryption keys
|
||||
5. **Principle of least privilege**: Only processes that need shared cache get the key
|
||||
|
||||
### Key Rotation
|
||||
|
||||
**When to Rotate**:
|
||||
- Regular schedule (every 90 days recommended)
|
||||
- After security incident
|
||||
- When team members leave
|
||||
- After key exposure
|
||||
|
||||
**How to Rotate**:
|
||||
```python
|
||||
import os
|
||||
from deltaglider import create_client
|
||||
|
||||
# OLD KEY (existing cache encrypted with this)
|
||||
old_key = os.environ.get('DG_CACHE_ENCRYPTION_KEY')
|
||||
|
||||
# Generate NEW KEY
|
||||
from cryptography.fernet import Fernet
|
||||
new_key = Fernet.generate_key()
|
||||
new_key_b64 = base64.b64encode(new_key).decode('utf-8')
|
||||
|
||||
# Steps for rotation:
|
||||
# 1. Clear old cache (encrypted with old key)
|
||||
client_old = create_client() # Uses old key from env
|
||||
client_old.clear_cache()
|
||||
|
||||
# 2. Update environment with new key
|
||||
os.environ['DG_CACHE_ENCRYPTION_KEY'] = new_key_b64
|
||||
|
||||
# 3. Create new client with new key
|
||||
client_new = create_client() # Uses new key
|
||||
|
||||
# 4. Continue operations
|
||||
client_new.put_object(Bucket='bucket', Key='file.zip', Body=data)
|
||||
|
||||
# ✅ Cache now encrypted with new key
|
||||
# ✅ Old encrypted cache cleared
|
||||
```
|
||||
|
||||
## Memory vs. Filesystem Cache
|
||||
|
||||
### Filesystem Cache (Default)
|
||||
|
||||
**Characteristics**:
|
||||
- Stored in `/tmp/deltaglider-*/`
|
||||
- Persistent across client recreations (within same process)
|
||||
- Can be shared between processes (with persistent encryption key)
|
||||
- Slower than memory cache (disk I/O)
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
import os
|
||||
|
||||
# Explicitly set filesystem cache (this is the default)
|
||||
os.environ['DG_CACHE_BACKEND'] = 'filesystem'
|
||||
|
||||
from deltaglider import create_client
|
||||
client = create_client()
|
||||
```
|
||||
|
||||
**When to Use**:
|
||||
- Default choice for most applications
|
||||
- When cache should persist across client recreations
|
||||
- Multi-process applications (with persistent key)
|
||||
- Memory-constrained environments
|
||||
|
||||
**Cache Clearing**:
|
||||
```python
|
||||
client.clear_cache()
|
||||
# Removes all files from /tmp/deltaglider-*/
|
||||
# Frees disk space
|
||||
```
|
||||
|
||||
### Memory Cache
|
||||
|
||||
**Characteristics**:
|
||||
- Stored in process memory (RAM)
|
||||
- Fast access (no disk I/O)
|
||||
- Automatically freed when process exits
|
||||
- LRU eviction prevents unlimited growth
|
||||
- Not shared between processes
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
import os
|
||||
|
||||
# Enable memory cache
|
||||
os.environ['DG_CACHE_BACKEND'] = 'memory'
|
||||
os.environ['DG_CACHE_MEMORY_SIZE_MB'] = '200' # Default: 100MB
|
||||
|
||||
from deltaglider import create_client
|
||||
client = create_client()
|
||||
```
|
||||
|
||||
**When to Use**:
|
||||
- High-performance requirements
|
||||
- Ephemeral environments (containers, Lambda)
|
||||
- Short-lived applications
|
||||
- CI/CD pipelines
|
||||
- When disk I/O is bottleneck
|
||||
|
||||
**Cache Clearing**:
|
||||
```python
|
||||
client.clear_cache()
|
||||
# Frees memory immediately
|
||||
# No disk I/O
|
||||
```
|
||||
|
||||
**LRU Eviction**:
|
||||
Memory cache automatically evicts least recently used entries when size limit is reached. No manual intervention needed.
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
### ✅ DO
|
||||
|
||||
1. **Clear cache periodically** in long-running applications
|
||||
2. **Clear cache in `finally` blocks** for batch jobs
|
||||
3. **Use fixtures for tests** to ensure clean state
|
||||
4. **Monitor cache size** in production
|
||||
5. **Use ephemeral keys** when possible (maximum security)
|
||||
6. **Store persistent keys securely** (Secrets Manager, Vault)
|
||||
7. **Rotate encryption keys** regularly
|
||||
8. **Use memory cache** for ephemeral environments
|
||||
9. **Clear cache after config changes**
|
||||
10. **Document cache strategy** for your application
|
||||
|
||||
### ❌ DON'T
|
||||
|
||||
1. **Never let cache grow unbounded** in long-running apps
|
||||
2. **Don't share ephemeral encrypted cache** between processes
|
||||
3. **Don't store persistent keys in code** or version control
|
||||
4. **Don't forget to clear cache in tests**
|
||||
5. **Don't assume cache is automatically cleaned** in SDK usage
|
||||
6. **Don't use persistent keys** unless you need cross-process sharing
|
||||
7. **Don't skip key rotation**
|
||||
8. **Don't ignore memory limits** for memory cache
|
||||
|
||||
## Monitoring Cache Health
|
||||
|
||||
### Cache Size Tracking
|
||||
|
||||
```python
|
||||
import os
|
||||
from pathlib import Path
|
||||
from deltaglider import create_client
|
||||
|
||||
def get_cache_size_mb(cache_dir: Path) -> float:
|
||||
"""Calculate total cache size in MB."""
|
||||
total_bytes = sum(f.stat().st_size for f in cache_dir.rglob('*') if f.is_file())
|
||||
return total_bytes / (1024 * 1024)
|
||||
|
||||
# Get cache directory (ephemeral, changes per process)
|
||||
cache_dir = Path(tempfile.gettempdir())
|
||||
deltaglider_caches = list(cache_dir.glob('deltaglider-*'))
|
||||
|
||||
if deltaglider_caches:
|
||||
cache_path = deltaglider_caches[0]
|
||||
cache_size_mb = get_cache_size_mb(cache_path)
|
||||
print(f"Cache size: {cache_size_mb:.2f} MB")
|
||||
|
||||
# Clear if over threshold
|
||||
if cache_size_mb > 500:
|
||||
client = create_client()
|
||||
client.clear_cache()
|
||||
print("Cache cleared - exceeded 500MB")
|
||||
```
|
||||
|
||||
### Memory Cache Monitoring
|
||||
|
||||
```python
|
||||
import os
|
||||
os.environ['DG_CACHE_BACKEND'] = 'memory'
|
||||
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Memory cache auto-evicts at configured limit
|
||||
# Monitor via application memory usage tools (not direct API)
|
||||
|
||||
# Example with memory_profiler
|
||||
from memory_profiler import profile
|
||||
|
||||
@profile
|
||||
def upload_many_files():
|
||||
for i in range(1000):
|
||||
client.put_object(Bucket='test', Key=f'file{i}.zip', Body=b'data' * 1000)
|
||||
# Check memory profile to see cache memory usage
|
||||
|
||||
upload_many_files()
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Problem: Cache Growing Too Large
|
||||
|
||||
**Symptoms**:
|
||||
- Disk space running out (`/tmp` filling up)
|
||||
- High memory usage
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Implement automatic cache clearing
|
||||
import psutil
|
||||
|
||||
client = create_client()
|
||||
|
||||
def smart_cache_clear():
|
||||
"""Clear cache if memory/disk pressure detected."""
|
||||
# Check disk space
|
||||
disk = psutil.disk_usage('/tmp')
|
||||
if disk.percent > 80:
|
||||
client.clear_cache()
|
||||
print("Cache cleared - disk pressure")
|
||||
return
|
||||
|
||||
# Check memory usage
|
||||
memory = psutil.virtual_memory()
|
||||
if memory.percent > 80:
|
||||
client.clear_cache()
|
||||
print("Cache cleared - memory pressure")
|
||||
|
||||
# Call periodically
|
||||
schedule.every(15).minutes.do(smart_cache_clear)
|
||||
```
|
||||
|
||||
### Problem: Decryption Failures After Key Rotation
|
||||
|
||||
**Symptoms**:
|
||||
- `CacheCorruptionError: Decryption failed`
|
||||
- After rotating encryption keys
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Clear cache before using new key
|
||||
old_client = create_client() # Old key
|
||||
old_client.clear_cache()
|
||||
|
||||
# Update encryption key
|
||||
os.environ['DG_CACHE_ENCRYPTION_KEY'] = new_key
|
||||
|
||||
# Create new client
|
||||
new_client = create_client() # New key
|
||||
```
|
||||
|
||||
### Problem: Tests Failing Due to Cached Data
|
||||
|
||||
**Symptoms**:
|
||||
- Tests pass in isolation, fail when run together
|
||||
- Unexpected data in downloads
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Always clear cache in test teardown
|
||||
@pytest.fixture(autouse=True)
|
||||
def clear_cache_after_test():
|
||||
"""Automatically clear cache after every test."""
|
||||
yield
|
||||
# Teardown
|
||||
client = create_client()
|
||||
client.clear_cache()
|
||||
```
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [docs/sdk/getting-started.md](sdk/getting-started.md) - SDK configuration
|
||||
- [README.md](../README.md) - Docker and environment variables
|
||||
- [CLAUDE.md](../CLAUDE.md) - Development guide
|
||||
|
||||
## Quick Reference
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
# Create client
|
||||
client = create_client()
|
||||
|
||||
# Clear all cache
|
||||
client.clear_cache()
|
||||
|
||||
# Use memory cache
|
||||
os.environ['DG_CACHE_BACKEND'] = 'memory'
|
||||
|
||||
# Use persistent encryption key
|
||||
os.environ['DG_CACHE_ENCRYPTION_KEY'] = 'base64-encoded-key'
|
||||
```
|
||||
526
docs/DG_MAX_RATIO.md
Normal file
526
docs/DG_MAX_RATIO.md
Normal file
@@ -0,0 +1,526 @@
|
||||
# DG_MAX_RATIO - Delta Compression Efficiency Guard
|
||||
|
||||
## Overview
|
||||
|
||||
`DG_MAX_RATIO` is a **safety threshold** that prevents inefficient delta compression. It controls the maximum acceptable ratio of `delta_size / original_file_size`.
|
||||
|
||||
**Default**: `0.5` (50%)
|
||||
**Range**: `0.0` to `1.0`
|
||||
**Environment Variable**: `DG_MAX_RATIO`
|
||||
|
||||
## The Problem It Solves
|
||||
|
||||
When files are **too different**, xdelta3 can create a delta file that's **almost as large as the original file** (or even larger!). This defeats the purpose of compression and wastes:
|
||||
- ❌ Storage space (storing a large delta instead of the original)
|
||||
- ❌ CPU time (creating and applying the delta)
|
||||
- ❌ Network bandwidth (downloading delta + reference instead of just the file)
|
||||
|
||||
## How It Works
|
||||
|
||||
```
|
||||
1. Upload file → Create delta using xdelta3
|
||||
2. Calculate ratio = delta_size / original_file_size
|
||||
3. If ratio > DG_MAX_RATIO:
|
||||
❌ Discard delta, store original file directly
|
||||
Else:
|
||||
✅ Keep delta, save storage space
|
||||
```
|
||||
|
||||
### Example Flow
|
||||
|
||||
```
|
||||
Original file: 100MB
|
||||
Reference file: 100MB (uploaded previously)
|
||||
|
||||
xdelta3 creates delta: 60MB
|
||||
Ratio = 60MB / 100MB = 0.6 (60%)
|
||||
|
||||
With DG_MAX_RATIO=0.5:
|
||||
❌ Delta rejected (60% > 50%)
|
||||
Action: Store 100MB original file
|
||||
Reason: Delta not efficient enough
|
||||
|
||||
With DG_MAX_RATIO=0.7:
|
||||
✅ Delta accepted (60% ≤ 70%)
|
||||
Action: Store 60MB delta
|
||||
Savings: 40MB (40%)
|
||||
```
|
||||
|
||||
## Default Value: 0.5 (50%)
|
||||
|
||||
**Meaning**: Only use delta compression if the delta is **≤50% of the original file size**
|
||||
|
||||
This default provides:
|
||||
- ✅ Minimum 50% storage savings when delta compression is used
|
||||
- ✅ Prevents wasting CPU on inefficient compression
|
||||
- ✅ Works well for typical versioned releases (minor updates between versions)
|
||||
- ✅ Balanced approach without manual tuning
|
||||
|
||||
## When to Adjust
|
||||
|
||||
### 🔽 Lower Value (More Conservative)
|
||||
|
||||
**Set `DG_MAX_RATIO=0.2-0.3` when:**
|
||||
- Files change significantly between versions (major updates, refactors)
|
||||
- Storage cost is **very high** (premium S3 tiers, small quotas)
|
||||
- You want to avoid **any** inefficient compression
|
||||
- You need guaranteed high-quality compression (≥70% savings)
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.3
|
||||
|
||||
# Only accept deltas that are ≤30% of original size
|
||||
# More files stored directly, but guaranteed ≥70% savings when using deltas
|
||||
```
|
||||
|
||||
**Trade-off**:
|
||||
- ✅ Higher quality compression when deltas are used
|
||||
- ❌ Fewer files will use delta compression
|
||||
- ❌ More direct uploads (higher storage for dissimilar files)
|
||||
|
||||
### 🔼 Higher Value (More Permissive)
|
||||
|
||||
**Set `DG_MAX_RATIO=0.6-0.8` when:**
|
||||
- Files are very similar (minor patches, nightly builds, incremental changes)
|
||||
- Storage cost is **cheap** (large S3 buckets, unlimited quotas)
|
||||
- CPU time is **expensive** (you want to save re-uploading even with modest compression)
|
||||
- You want to maximize delta compression usage
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.7
|
||||
|
||||
# Accept deltas up to 70% of original size
|
||||
# More files use delta compression, even with modest 30% savings
|
||||
```
|
||||
|
||||
**Trade-off**:
|
||||
- ✅ More files use delta compression
|
||||
- ✅ Save bandwidth even with modest compression
|
||||
- ❌ Some deltas may only save 20-30% space
|
||||
- ❌ More CPU time spent on marginal compressions
|
||||
|
||||
## Real-World Scenarios
|
||||
|
||||
### Scenario 1: Nightly Builds (Minimal Changes) ⭐ IDEAL
|
||||
|
||||
```
|
||||
my-app-v1.0.0.zip → 100MB (reference)
|
||||
my-app-v1.0.1.zip → 100MB (0.1% code change)
|
||||
|
||||
Delta: 200KB (0.2% of original)
|
||||
Ratio: 0.002
|
||||
|
||||
With ANY DG_MAX_RATIO: ✅ Use delta (99.8% savings!)
|
||||
Result: Store 200KB instead of 100MB
|
||||
```
|
||||
|
||||
**This is what DeltaGlider excels at!**
|
||||
|
||||
### Scenario 2: Major Version (Significant Changes)
|
||||
|
||||
```
|
||||
my-app-v1.0.0.zip → 100MB (reference)
|
||||
my-app-v2.0.0.zip → 100MB (complete rewrite, 85% different)
|
||||
|
||||
Delta: 85MB (85% of original)
|
||||
Ratio: 0.85
|
||||
|
||||
With DG_MAX_RATIO=0.5: ❌ Store original (85% > 50%)
|
||||
→ Stores 100MB directly
|
||||
→ No compression benefit, but no CPU waste
|
||||
|
||||
With DG_MAX_RATIO=0.9: ✅ Use delta (85% ≤ 90%)
|
||||
→ Stores 85MB delta
|
||||
→ Only 15% savings, questionable benefit
|
||||
```
|
||||
|
||||
**Recommendation**: For major versions, default `0.5` correctly rejects inefficient compression.
|
||||
|
||||
### Scenario 3: Different File Format (Same Content)
|
||||
|
||||
```
|
||||
my-app-v1.0.0.zip → 100MB (ZIP archive)
|
||||
my-app-v1.0.0.tar → 100MB (TAR archive, same content)
|
||||
|
||||
Delta: 70MB (completely different format structure)
|
||||
Ratio: 0.70
|
||||
|
||||
With DG_MAX_RATIO=0.5: ❌ Store original (70% > 50%)
|
||||
→ Stores 100MB directly
|
||||
→ Correct decision: formats too different
|
||||
|
||||
With DG_MAX_RATIO=0.8: ✅ Use delta (70% ≤ 80%)
|
||||
→ Stores 70MB delta
|
||||
→ 30% savings, but CPU-intensive
|
||||
```
|
||||
|
||||
**Recommendation**: Use consistent file formats for better compression. Default `0.5` correctly rejects cross-format compression.
|
||||
|
||||
### Scenario 4: Incremental Updates (Sweet Spot) ⭐
|
||||
|
||||
```
|
||||
my-app-v1.0.0.zip → 100MB (reference)
|
||||
my-app-v1.0.1.zip → 100MB (minor bugfix, 5% code change)
|
||||
|
||||
Delta: 5MB (5% of original)
|
||||
Ratio: 0.05
|
||||
|
||||
With ANY DG_MAX_RATIO: ✅ Use delta (95% savings!)
|
||||
Result: Store 5MB instead of 100MB
|
||||
```
|
||||
|
||||
**This is the target use case for delta compression!**
|
||||
|
||||
## How to Choose the Right Value
|
||||
|
||||
### Decision Tree
|
||||
|
||||
```
|
||||
Do your files have minimal changes between versions? (< 5% different)
|
||||
├─ YES → Use default 0.5 ✅
|
||||
│ Delta compression will work perfectly
|
||||
│
|
||||
└─ NO → Are your files significantly different? (> 50% different)
|
||||
├─ YES → Lower to 0.2-0.3 🔽
|
||||
│ Avoid wasting time on inefficient compression
|
||||
│
|
||||
└─ NO → Are they moderately different? (20-50% different)
|
||||
├─ Storage is expensive → Lower to 0.3 🔽
|
||||
│ Only high-quality compression
|
||||
│
|
||||
└─ Storage is cheap → Raise to 0.6-0.7 🔼
|
||||
Accept modest savings
|
||||
```
|
||||
|
||||
### Quick Reference Table
|
||||
|
||||
| File Similarity | Recommended DG_MAX_RATIO | Expected Behavior |
|
||||
|----------------|--------------------------|-------------------|
|
||||
| Nearly identical (< 5% change) | **0.5 (default)** | 🟢 95%+ savings |
|
||||
| Minor updates (5-20% change) | **0.5 (default)** | 🟢 80-95% savings |
|
||||
| Moderate changes (20-50% change) | **0.4-0.6** | 🟡 50-80% savings |
|
||||
| Major changes (50-80% change) | **0.3 or lower** | 🔴 Store directly |
|
||||
| Complete rewrites (> 80% change) | **0.3 or lower** | 🔴 Store directly |
|
||||
|
||||
### Use Cases by Industry
|
||||
|
||||
**Software Releases (SaaS)**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.5 # Default
|
||||
# Nightly builds with minor changes compress perfectly
|
||||
```
|
||||
|
||||
**Mobile App Builds**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.4 # Slightly conservative
|
||||
# iOS/Android builds can vary, want quality compression only
|
||||
```
|
||||
|
||||
**Database Backups**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.7 # Permissive
|
||||
# Daily backups are very similar, accept modest savings
|
||||
```
|
||||
|
||||
**Document Archives**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.6 # Moderate
|
||||
# Documents change incrementally, accept good savings
|
||||
```
|
||||
|
||||
**Video/Media Archives**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=0.2 # Very conservative
|
||||
# Media files are unique, only compress if very similar
|
||||
```
|
||||
|
||||
## Configuration Examples
|
||||
|
||||
### Docker
|
||||
|
||||
**Conservative (Premium Storage)**:
|
||||
```bash
|
||||
docker run -e DG_MAX_RATIO=0.3 \
|
||||
-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider cp file.zip s3://releases/
|
||||
```
|
||||
|
||||
**Default (Balanced)**:
|
||||
```bash
|
||||
docker run -e DG_MAX_RATIO=0.5 \
|
||||
-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider cp file.zip s3://releases/
|
||||
```
|
||||
|
||||
**Permissive (Cheap Storage)**:
|
||||
```bash
|
||||
docker run -e DG_MAX_RATIO=0.7 \
|
||||
-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider cp file.zip s3://releases/
|
||||
```
|
||||
|
||||
### Python SDK
|
||||
|
||||
```python
|
||||
import os
|
||||
|
||||
# Conservative (only high-quality compression)
|
||||
os.environ['DG_MAX_RATIO'] = '0.3'
|
||||
|
||||
# Default (balanced)
|
||||
os.environ['DG_MAX_RATIO'] = '0.5'
|
||||
|
||||
# Permissive (accept modest savings)
|
||||
os.environ['DG_MAX_RATIO'] = '0.7'
|
||||
|
||||
from deltaglider import create_client
|
||||
client = create_client()
|
||||
|
||||
summary = client.upload("file.zip", "s3://bucket/")
|
||||
print(f"Delta ratio: {summary.delta_ratio:.2f}")
|
||||
print(f"Used delta: {summary.is_delta}")
|
||||
```
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
|
||||
# Conservative
|
||||
export DG_MAX_RATIO=0.3
|
||||
deltaglider cp my-app-v2.0.0.zip s3://releases/
|
||||
|
||||
# Default
|
||||
export DG_MAX_RATIO=0.5
|
||||
deltaglider cp my-app-v2.0.0.zip s3://releases/
|
||||
|
||||
# Permissive
|
||||
export DG_MAX_RATIO=0.7
|
||||
deltaglider cp my-app-v2.0.0.zip s3://releases/
|
||||
```
|
||||
|
||||
### Override Per-Upload (CLI)
|
||||
|
||||
```bash
|
||||
# Use custom ratio for specific file
|
||||
deltaglider cp large-file.zip s3://releases/ --max-ratio 0.3
|
||||
```
|
||||
|
||||
## Monitoring and Tuning
|
||||
|
||||
### Check Delta Ratios After Upload
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
summary = client.upload("file.zip", "s3://bucket/")
|
||||
|
||||
print(f"Delta ratio: {summary.delta_ratio:.2f}")
|
||||
print(f"Savings: {summary.savings_percent:.0f}%")
|
||||
print(f"Used delta: {summary.is_delta}")
|
||||
|
||||
if summary.is_delta:
|
||||
print(f"✅ Used delta compression (ratio {summary.delta_ratio:.2f})")
|
||||
if summary.delta_ratio > 0.4:
|
||||
print(f"⚠️ Consider lowering DG_MAX_RATIO for better quality")
|
||||
else:
|
||||
print(f"❌ Stored directly (delta would have been ratio ~{summary.delta_ratio:.2f})")
|
||||
if summary.delta_ratio < 0.6:
|
||||
print(f"💡 Consider raising DG_MAX_RATIO to enable compression")
|
||||
```
|
||||
|
||||
### Batch Analysis
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
from pathlib import Path
|
||||
|
||||
client = create_client()
|
||||
ratios = []
|
||||
|
||||
for file in Path("releases").glob("*.zip"):
|
||||
summary = client.upload(str(file), f"s3://bucket/{file.name}")
|
||||
if summary.is_delta:
|
||||
ratios.append(summary.delta_ratio)
|
||||
|
||||
if ratios:
|
||||
avg_ratio = sum(ratios) / len(ratios)
|
||||
max_ratio = max(ratios)
|
||||
|
||||
print(f"Average delta ratio: {avg_ratio:.2f}")
|
||||
print(f"Maximum delta ratio: {max_ratio:.2f}")
|
||||
print(f"Files compressed: {len(ratios)}")
|
||||
|
||||
if avg_ratio < 0.2:
|
||||
print("💡 Consider raising DG_MAX_RATIO - you're getting excellent compression")
|
||||
elif max_ratio > 0.6:
|
||||
print("⚠️ Consider lowering DG_MAX_RATIO - some deltas are inefficient")
|
||||
```
|
||||
|
||||
### Optimization Strategy
|
||||
|
||||
1. **Start with default (0.5)**
|
||||
2. **Monitor delta ratios** for 1 week of uploads
|
||||
3. **Analyze results**:
|
||||
- If most ratios < 0.2: Consider raising to 0.6-0.7
|
||||
- If many ratios > 0.4: Consider lowering to 0.3-0.4
|
||||
- If ratios vary widely: Keep default 0.5
|
||||
4. **Adjust and re-test** for 1 week
|
||||
5. **Repeat until optimal** for your use case
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Dynamic Ratio Based on File Type
|
||||
|
||||
```python
|
||||
import os
|
||||
from pathlib import Path
|
||||
from deltaglider import create_client
|
||||
|
||||
def get_optimal_ratio(file_path: str) -> float:
|
||||
"""Determine optimal ratio based on file type."""
|
||||
suffix = Path(file_path).suffix.lower()
|
||||
|
||||
# Very compressible (source code archives)
|
||||
if suffix in ['.zip', '.tar', '.gz']:
|
||||
return 0.6
|
||||
|
||||
# Moderately compressible (compiled binaries)
|
||||
elif suffix in ['.jar', '.war', '.deb', '.rpm']:
|
||||
return 0.5
|
||||
|
||||
# Rarely compressible (media, already compressed)
|
||||
elif suffix in ['.mp4', '.jpg', '.png']:
|
||||
return 0.2
|
||||
|
||||
# Default
|
||||
return 0.5
|
||||
|
||||
file = "my-app.zip"
|
||||
os.environ['DG_MAX_RATIO'] = str(get_optimal_ratio(file))
|
||||
|
||||
client = create_client()
|
||||
summary = client.upload(file, "s3://bucket/")
|
||||
```
|
||||
|
||||
### A/B Testing Different Ratios
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
import os
|
||||
|
||||
def test_ratios(file_path: str, ratios: list[float]):
|
||||
"""Test different ratios and report results."""
|
||||
results = []
|
||||
|
||||
for ratio in ratios:
|
||||
os.environ['DG_MAX_RATIO'] = str(ratio)
|
||||
client = create_client()
|
||||
|
||||
# Simulate upload (don't actually upload)
|
||||
summary = client.estimate_compression(file_path, "s3://bucket/test/")
|
||||
|
||||
results.append({
|
||||
'ratio_threshold': ratio,
|
||||
'would_use_delta': summary.delta_ratio <= ratio,
|
||||
'delta_ratio': summary.delta_ratio,
|
||||
'savings': summary.savings_percent if summary.delta_ratio <= ratio else 0
|
||||
})
|
||||
|
||||
return results
|
||||
|
||||
# Test different ratios
|
||||
file = "my-app-v2.0.0.zip"
|
||||
test_results = test_ratios(file, [0.3, 0.5, 0.7])
|
||||
|
||||
for result in test_results:
|
||||
print(f"Ratio {result['ratio_threshold']}: "
|
||||
f"Delta={result['would_use_delta']}, "
|
||||
f"Savings={result['savings']:.0f}%")
|
||||
```
|
||||
|
||||
## FAQ
|
||||
|
||||
### Q: What happens if I set DG_MAX_RATIO=1.0?
|
||||
|
||||
**A**: Delta compression will **always** be used, even if the delta is larger than the original file! This is generally a bad idea and defeats the purpose of the threshold.
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
export DG_MAX_RATIO=1.0
|
||||
|
||||
# File: 100MB, Delta: 120MB
|
||||
# Ratio: 1.2
|
||||
# With DG_MAX_RATIO=1.0: ✅ Use delta (1.2 > 1.0 but we accept anything ≤1.0)
|
||||
# Wait, that's wrong! The delta is LARGER than the original!
|
||||
|
||||
# NEVER set DG_MAX_RATIO to 1.0 or higher
|
||||
```
|
||||
|
||||
### Q: What happens if I set DG_MAX_RATIO=0.0?
|
||||
|
||||
**A**: Delta compression will **never** be used. All files will be stored directly. This is equivalent to disabling DeltaGlider's compression entirely.
|
||||
|
||||
### Q: Can I disable the ratio check?
|
||||
|
||||
**A**: No, and you shouldn't want to. The ratio check is a critical safety feature that prevents wasting storage and CPU on inefficient compression.
|
||||
|
||||
### Q: Does DG_MAX_RATIO affect downloading?
|
||||
|
||||
**A**: No, `DG_MAX_RATIO` only affects **uploads**. During download, DeltaGlider automatically detects whether a file is stored as a delta or directly and handles reconstruction accordingly.
|
||||
|
||||
### Q: Can I set different ratios for different buckets?
|
||||
|
||||
**A**: Not directly via environment variables, but you can change `DG_MAX_RATIO` before each upload in your code:
|
||||
|
||||
```python
|
||||
import os
|
||||
from deltaglider import create_client
|
||||
|
||||
# High-quality compression for production releases
|
||||
os.environ['DG_MAX_RATIO'] = '0.3'
|
||||
client = create_client()
|
||||
client.upload("prod-release.zip", "s3://production/")
|
||||
|
||||
# Permissive compression for dev builds
|
||||
os.environ['DG_MAX_RATIO'] = '0.7'
|
||||
client = create_client()
|
||||
client.upload("dev-build.zip", "s3://development/")
|
||||
```
|
||||
|
||||
### Q: How do I know if my DG_MAX_RATIO is set correctly?
|
||||
|
||||
**A**: Monitor your upload summaries. If most deltas have ratios close to your threshold (e.g., 0.45-0.50 with default 0.5), you might want to lower it. If most deltas have very low ratios (e.g., < 0.2), you could raise it.
|
||||
|
||||
**Ideal scenario**: Most successful delta compressions have ratios < 0.3, and inefficient deltas (> 0.5) are correctly rejected.
|
||||
|
||||
## Summary
|
||||
|
||||
**`DG_MAX_RATIO` prevents wasting time and storage on inefficient delta compression.**
|
||||
|
||||
### Quick Takeaways
|
||||
|
||||
✅ **Default 0.5 works for 90% of use cases**
|
||||
✅ **Lower values (0.2-0.3) for dissimilar files or expensive storage**
|
||||
✅ **Higher values (0.6-0.7) for very similar files or cheap storage**
|
||||
✅ **Monitor delta ratios to tune for your use case**
|
||||
✅ **Never set to 1.0 or higher (defeats the purpose)**
|
||||
✅ **Never set to 0.0 (disables delta compression entirely)**
|
||||
|
||||
### Golden Rule
|
||||
|
||||
**If you're not sure, keep the default `0.5`.**
|
||||
|
||||
It's a sensible balance that:
|
||||
- Prevents inefficient compression (no deltas > 50% of original size)
|
||||
- Allows excellent savings on similar files (most deltas are < 20%)
|
||||
- Works well for typical versioned releases
|
||||
- Requires no manual tuning for most use cases
|
||||
|
||||
---
|
||||
|
||||
**Related Documentation**:
|
||||
- [CLAUDE.md](../CLAUDE.md) - Environment variables reference
|
||||
- [README.md](../README.md) - Docker usage and configuration
|
||||
- [docs/sdk/getting-started.md](sdk/getting-started.md) - SDK configuration guide
|
||||
242
docs/EC2_REGION_DETECTION.md
Normal file
242
docs/EC2_REGION_DETECTION.md
Normal file
@@ -0,0 +1,242 @@
|
||||
# EC2 Region Detection & Cost Optimization
|
||||
|
||||
DeltaGlider automatically detects when you're running on an EC2 instance and warns you about potential cross-region data transfer charges.
|
||||
|
||||
## Overview
|
||||
|
||||
When running `deltaglider migrate` on an EC2 instance, DeltaGlider:
|
||||
|
||||
1. **Detects EC2 Environment**: Uses IMDSv2 (Instance Metadata Service v2) to determine if running on EC2
|
||||
2. **Retrieves Instance Region**: Gets the actual AWS region where your EC2 instance is running
|
||||
3. **Compares Regions**: Checks if your EC2 region matches the S3 client region
|
||||
4. **Warns About Costs**: Displays clear warnings when regions don't match
|
||||
|
||||
## Why This Matters
|
||||
|
||||
**AWS Cross-Region Data Transfer Costs**:
|
||||
- **Same region**: No additional charges for data transfer
|
||||
- **Cross-region**: $0.02 per GB transferred (can add up quickly for large migrations)
|
||||
- **NAT Gateway**: Additional charges if going through NAT
|
||||
|
||||
**Example Cost Impact**:
|
||||
- Migrating 1TB from `us-east-1` EC2 → `us-west-2` S3 = ~$20 in data transfer charges
|
||||
- Same migration within same region = $0 in data transfer charges
|
||||
|
||||
## Output Examples
|
||||
|
||||
### Scenario 1: Regions Aligned (Optimal) ✅
|
||||
|
||||
```bash
|
||||
$ deltaglider migrate s3://old-bucket/ s3://new-bucket/
|
||||
EC2 Instance: us-east-1a
|
||||
S3 Client Region: us-east-1
|
||||
✓ Regions aligned - no cross-region charges
|
||||
Migrating from s3://old-bucket/
|
||||
to s3://new-bucket/
|
||||
...
|
||||
```
|
||||
|
||||
**Result**: No warnings, optimal configuration, no extra charges.
|
||||
|
||||
---
|
||||
|
||||
### Scenario 2: Auto-Detected Mismatch (INFO) ℹ️
|
||||
|
||||
```bash
|
||||
$ deltaglider migrate s3://old-bucket/ s3://new-bucket/
|
||||
EC2 Instance: us-west-2a
|
||||
S3 Client Region: us-east-1
|
||||
|
||||
ℹ️ INFO: EC2 region (us-west-2) differs from configured S3 region (us-east-1)
|
||||
Consider using --region us-west-2 to avoid cross-region charges.
|
||||
|
||||
Migrating from s3://old-bucket/
|
||||
to s3://new-bucket/
|
||||
...
|
||||
```
|
||||
|
||||
**Result**: Informational warning, suggests optimal region. User didn't explicitly set wrong region, so it's likely from their AWS config.
|
||||
|
||||
---
|
||||
|
||||
### Scenario 3: Explicit Region Override Mismatch (WARNING) ⚠️
|
||||
|
||||
```bash
|
||||
$ deltaglider migrate --region us-east-1 s3://old-bucket/ s3://new-bucket/
|
||||
EC2 Instance: us-west-2a
|
||||
S3 Client Region: us-east-1
|
||||
|
||||
⚠️ WARNING: EC2 region=us-west-2 != S3 client region=us-east-1
|
||||
Expect cross-region/NAT data charges. Align regions (set client region=us-west-2)
|
||||
before proceeding. Or drop --region for automatic region resolution.
|
||||
|
||||
Migrating from s3://old-bucket/
|
||||
to s3://new-bucket/
|
||||
...
|
||||
```
|
||||
|
||||
**Result**: Strong warning because user explicitly set the wrong region with `--region` flag. They might not realize the cost implications.
|
||||
|
||||
---
|
||||
|
||||
### Scenario 4: Not on EC2
|
||||
|
||||
```bash
|
||||
$ deltaglider migrate s3://old-bucket/ s3://new-bucket/
|
||||
S3 Client Region: us-east-1
|
||||
Migrating from s3://old-bucket/
|
||||
to s3://new-bucket/
|
||||
...
|
||||
```
|
||||
|
||||
**Result**: Simple region display, no EC2 warnings (not applicable).
|
||||
|
||||
## Configuration
|
||||
|
||||
### Disable EC2 Detection
|
||||
|
||||
If you want to disable EC2 detection (e.g., for testing or if it causes issues):
|
||||
|
||||
```bash
|
||||
export DG_DISABLE_EC2_DETECTION=true
|
||||
deltaglider migrate s3://old/ s3://new/
|
||||
```
|
||||
|
||||
Or in your script:
|
||||
|
||||
```python
|
||||
import os
|
||||
os.environ["DG_DISABLE_EC2_DETECTION"] = "true"
|
||||
```
|
||||
|
||||
### How It Works
|
||||
|
||||
DeltaGlider uses **IMDSv2** (Instance Metadata Service v2) for security:
|
||||
|
||||
1. **Token Request** (PUT with TTL):
|
||||
```
|
||||
PUT http://169.254.169.254/latest/api/token
|
||||
X-aws-ec2-metadata-token-ttl-seconds: 21600
|
||||
```
|
||||
|
||||
2. **Metadata Request** (GET with token):
|
||||
```
|
||||
GET http://169.254.169.254/latest/meta-data/placement/region
|
||||
X-aws-ec2-metadata-token: <token>
|
||||
```
|
||||
|
||||
3. **Fast Timeout**: 1 second timeout for non-EC2 environments (no delay if not on EC2)
|
||||
|
||||
### Security Notes
|
||||
|
||||
- **IMDSv2 Only**: DeltaGlider uses the more secure IMDSv2, not the legacy IMDSv1
|
||||
- **No Credentials**: Only reads metadata, never accesses credentials
|
||||
- **Graceful Fallback**: Silently skips detection if IMDS unavailable
|
||||
- **No Network Impact**: Uses local-only IP (169.254.169.254), never leaves the instance
|
||||
|
||||
## Best Practices
|
||||
|
||||
### For Cost Optimization
|
||||
|
||||
1. **Same Region**: Always try to keep EC2 instance and S3 bucket in the same region
|
||||
2. **Check First**: Run with `--dry-run` to verify the setup before actual migration
|
||||
3. **Use Auto-Detection**: Don't specify `--region` unless you have a specific reason
|
||||
4. **Monitor Costs**: Use AWS Cost Explorer to track cross-region data transfer
|
||||
|
||||
### For Terraform/IaC
|
||||
|
||||
```hcl
|
||||
# Good: EC2 and S3 in same region
|
||||
resource "aws_instance" "app" {
|
||||
region = "us-west-2"
|
||||
}
|
||||
|
||||
resource "aws_s3_bucket" "data" {
|
||||
region = "us-west-2" # Same region
|
||||
}
|
||||
```
|
||||
|
||||
### For Multi-Region Setups
|
||||
|
||||
If you MUST do cross-region transfers:
|
||||
|
||||
1. **Use VPC Endpoints**: Reduce NAT Gateway costs
|
||||
2. **Schedule Off-Peak**: AWS charges less during off-peak hours in some regions
|
||||
3. **Consider S3 Transfer Acceleration**: May be cheaper for very large transfers
|
||||
4. **Batch Operations**: Minimize number of API calls
|
||||
|
||||
## Technical Details
|
||||
|
||||
### EC2MetadataAdapter
|
||||
|
||||
Location: `src/deltaglider/adapters/ec2_metadata.py`
|
||||
|
||||
Key methods:
|
||||
- `is_running_on_ec2()`: Detects EC2 environment
|
||||
- `get_region()`: Returns AWS region code (e.g., "us-east-1")
|
||||
- `get_availability_zone()`: Returns AZ (e.g., "us-east-1a")
|
||||
|
||||
### Region Logging
|
||||
|
||||
Location: `src/deltaglider/app/cli/aws_compat.py`
|
||||
|
||||
Function: `log_aws_region(service, region_override=False)`
|
||||
|
||||
Logic:
|
||||
- If not EC2: Show S3 region only
|
||||
- If EC2 + regions match: Green checkmark ✅
|
||||
- If EC2 + auto-detected mismatch: Blue INFO ℹ️
|
||||
- If EC2 + `--region` mismatch: Yellow WARNING ⚠️
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Cannot connect to IMDS"
|
||||
|
||||
**Cause**: Network policy blocks access to 169.254.169.254
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Test IMDS connectivity
|
||||
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
|
||||
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
|
||||
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
|
||||
http://169.254.169.254/latest/meta-data/placement/region
|
||||
|
||||
# If it fails, disable detection
|
||||
export DG_DISABLE_EC2_DETECTION=true
|
||||
```
|
||||
|
||||
### "Wrong region detected"
|
||||
|
||||
**Cause**: Cached metadata or race condition
|
||||
|
||||
**Solution**: DeltaGlider caches metadata for performance. Restart the process to refresh.
|
||||
|
||||
### "Warning appears but I want cross-region"
|
||||
|
||||
**Cause**: You intentionally need cross-region transfer
|
||||
|
||||
**Solution**: This is just a warning, not an error. The migration will proceed. The warning helps you confirm you understand the cost implications.
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q: Does this slow down my migrations?**
|
||||
A: No. EC2 detection happens once before migration starts (< 100ms). It doesn't affect migration performance.
|
||||
|
||||
**Q: What if I'm not on EC2 but the detection is slow?**
|
||||
A: The timeout is 1 second. If IMDS is unreachable, it fails fast. Disable with `DG_DISABLE_EC2_DETECTION=true`.
|
||||
|
||||
**Q: Does this work on Fargate/ECS/Lambda?**
|
||||
A: Yes! All AWS compute services support IMDSv2. The detection works the same way.
|
||||
|
||||
**Q: Can I use this with LocalStack/MinIO?**
|
||||
A: Yes. When using `--endpoint-url`, DeltaGlider skips EC2 detection (not applicable for non-AWS S3).
|
||||
|
||||
**Q: Will this detect VPC endpoints?**
|
||||
A: No. VPC endpoints don't change the "region" from an EC2 perspective. The warning still applies if regions don't match.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [AWS Data Transfer Pricing](https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer)
|
||||
- [AWS IMDSv2 Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html)
|
||||
- [S3 Transfer Costs](https://aws.amazon.com/s3/pricing/)
|
||||
@@ -9,6 +9,8 @@ DeltaGlider provides AWS S3 CLI compatible commands with automatic delta compres
|
||||
- `deltaglider ls [s3_url]` - List buckets and objects
|
||||
- `deltaglider rm <s3_url>` - Remove objects
|
||||
- `deltaglider sync <source> <destination>` - Synchronize directories
|
||||
- `deltaglider migrate <source> <destination>` - Migrate S3 buckets with compression and EC2 cost warnings
|
||||
- `deltaglider stats <bucket>` - Get bucket statistics and compression metrics
|
||||
- `deltaglider verify <s3_url>` - Verify file integrity
|
||||
|
||||
### Current Usage Examples
|
||||
|
||||
174
docs/sdk/api.md
174
docs/sdk/api.md
@@ -156,7 +156,7 @@ for obj in response['Contents']:
|
||||
|
||||
#### `get_bucket_stats`
|
||||
|
||||
Get statistics for a bucket with optional detailed compression metrics.
|
||||
Get statistics for a bucket with optional detailed compression metrics. Results are cached per client session for performance.
|
||||
|
||||
```python
|
||||
def get_bucket_stats(
|
||||
@@ -173,16 +173,46 @@ def get_bucket_stats(
|
||||
- With `detailed_stats=False`: ~50ms for any bucket size (LIST calls only)
|
||||
- With `detailed_stats=True`: ~2-3s per 1000 objects (adds HEAD calls for delta files)
|
||||
|
||||
##### Caching Behavior
|
||||
|
||||
- **Session-scoped cache**: Results cached within client instance lifetime
|
||||
- **Automatic invalidation**: Cache cleared on bucket mutations (put, delete, bucket operations)
|
||||
- **Intelligent reuse**: Detailed stats can serve quick stat requests
|
||||
- **Manual cache control**: Use `clear_cache()` to invalidate all cached stats
|
||||
|
||||
##### Returns
|
||||
|
||||
`BucketStats`: Dataclass containing:
|
||||
- **bucket** (`str`): Bucket name
|
||||
- **object_count** (`int`): Total number of objects
|
||||
- **total_size** (`int`): Original size in bytes (before compression)
|
||||
- **compressed_size** (`int`): Actual stored size in bytes
|
||||
- **space_saved** (`int`): Bytes saved through compression
|
||||
- **average_compression_ratio** (`float`): Average compression ratio (0.0-1.0)
|
||||
- **delta_objects** (`int`): Number of delta-compressed objects
|
||||
- **direct_objects** (`int`): Number of directly stored objects
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Quick stats for dashboard display
|
||||
# Quick stats for dashboard display (cached after first call)
|
||||
stats = client.get_bucket_stats('releases')
|
||||
print(f"Objects: {stats.object_count}, Size: {stats.total_size}")
|
||||
|
||||
# Detailed stats for analytics (slower but accurate)
|
||||
# Second call hits cache (instant response)
|
||||
stats = client.get_bucket_stats('releases')
|
||||
print(f"Space saved: {stats.space_saved} bytes")
|
||||
|
||||
# Detailed stats for analytics (slower but accurate, also cached)
|
||||
stats = client.get_bucket_stats('releases', detailed_stats=True)
|
||||
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
|
||||
|
||||
# Quick call after detailed call reuses detailed cache (more accurate)
|
||||
quick_stats = client.get_bucket_stats('releases') # Uses detailed cache
|
||||
|
||||
# Clear cache to force refresh
|
||||
client.clear_cache()
|
||||
stats = client.get_bucket_stats('releases') # Fresh computation
|
||||
```
|
||||
|
||||
#### `put_object`
|
||||
@@ -304,7 +334,7 @@ client.delete_bucket(Bucket='old-releases')
|
||||
|
||||
#### `list_buckets`
|
||||
|
||||
List all S3 buckets (boto3-compatible).
|
||||
List all S3 buckets (boto3-compatible). Includes cached statistics when available.
|
||||
|
||||
```python
|
||||
def list_buckets(
|
||||
@@ -315,7 +345,32 @@ def list_buckets(
|
||||
|
||||
##### Returns
|
||||
|
||||
Dict with list of buckets and owner information (identical to boto3).
|
||||
Dict with list of buckets and owner information (identical to boto3). Each bucket may include optional `DeltaGliderStats` metadata if statistics have been previously cached.
|
||||
|
||||
##### Response Structure
|
||||
|
||||
```python
|
||||
{
|
||||
'Buckets': [
|
||||
{
|
||||
'Name': 'bucket-name',
|
||||
'CreationDate': datetime(2025, 1, 1),
|
||||
'DeltaGliderStats': { # Optional, only if cached
|
||||
'Cached': True,
|
||||
'Detailed': bool, # Whether detailed stats were fetched
|
||||
'ObjectCount': int,
|
||||
'TotalSize': int,
|
||||
'CompressedSize': int,
|
||||
'SpaceSaved': int,
|
||||
'AverageCompressionRatio': float,
|
||||
'DeltaObjects': int,
|
||||
'DirectObjects': int
|
||||
}
|
||||
}
|
||||
],
|
||||
'Owner': {...}
|
||||
}
|
||||
```
|
||||
|
||||
##### Examples
|
||||
|
||||
@@ -324,6 +379,17 @@ Dict with list of buckets and owner information (identical to boto3).
|
||||
response = client.list_buckets()
|
||||
for bucket in response['Buckets']:
|
||||
print(f"{bucket['Name']} - Created: {bucket['CreationDate']}")
|
||||
|
||||
# Check if stats are cached
|
||||
if 'DeltaGliderStats' in bucket:
|
||||
stats = bucket['DeltaGliderStats']
|
||||
print(f" Cached stats: {stats['ObjectCount']} objects, "
|
||||
f"{stats['AverageCompressionRatio']:.1%} compression")
|
||||
|
||||
# Fetch stats first, then list buckets to see cached data
|
||||
client.get_bucket_stats('my-bucket', detailed_stats=True)
|
||||
response = client.list_buckets()
|
||||
# Now 'my-bucket' will include DeltaGliderStats in response
|
||||
```
|
||||
|
||||
### Simple API Methods
|
||||
@@ -460,6 +526,104 @@ else:
|
||||
# Re-upload or investigate
|
||||
```
|
||||
|
||||
### Cache Management Methods
|
||||
|
||||
DeltaGlider maintains two types of caches for performance optimization:
|
||||
1. **Reference cache**: Binary reference files used for delta reconstruction
|
||||
2. **Statistics cache**: Bucket statistics (session-scoped)
|
||||
|
||||
#### `clear_cache`
|
||||
|
||||
Clear all cached data including reference files and bucket statistics.
|
||||
|
||||
```python
|
||||
def clear_cache(self) -> None
|
||||
```
|
||||
|
||||
##### Description
|
||||
|
||||
Removes all cached reference files from the local filesystem and invalidates all bucket statistics. Useful for:
|
||||
- Forcing fresh statistics computation
|
||||
- Freeing disk space in long-running applications
|
||||
- Ensuring latest data after external bucket modifications
|
||||
- Testing and development workflows
|
||||
|
||||
##### Cache Types Cleared
|
||||
|
||||
1. **Reference Cache**: Binary reference files stored in `/tmp/deltaglider-*/`
|
||||
- Encrypted at rest with ephemeral keys
|
||||
- Content-addressed storage (SHA256-based filenames)
|
||||
- Automatically cleaned up on process exit
|
||||
|
||||
2. **Statistics Cache**: Bucket statistics cached per client session
|
||||
- Metadata about compression ratios and object counts
|
||||
- Session-scoped (not persisted to disk)
|
||||
- Automatically invalidated on bucket mutations
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Long-running application
|
||||
client = create_client()
|
||||
|
||||
# Work with files
|
||||
for i in range(1000):
|
||||
client.upload(f"file_{i}.zip", "s3://bucket/")
|
||||
|
||||
# Periodic cache cleanup to prevent disk buildup
|
||||
if i % 100 == 0:
|
||||
client.clear_cache()
|
||||
|
||||
# Force fresh statistics after external changes
|
||||
stats_before = client.get_bucket_stats('releases') # Cached
|
||||
# ... external tool modifies bucket ...
|
||||
client.clear_cache()
|
||||
stats_after = client.get_bucket_stats('releases') # Fresh data
|
||||
|
||||
# Development workflow
|
||||
client.clear_cache() # Start with clean state
|
||||
```
|
||||
|
||||
#### `evict_cache`
|
||||
|
||||
Remove a specific cached reference file from the local cache.
|
||||
|
||||
```python
|
||||
def evict_cache(self, s3_url: str) -> None
|
||||
```
|
||||
|
||||
##### Parameters
|
||||
|
||||
- **s3_url** (`str`): S3 URL of the reference file to evict (e.g., `s3://bucket/prefix/reference.bin`)
|
||||
|
||||
##### Description
|
||||
|
||||
Removes a specific reference file from the cache without affecting other cached files or statistics. Useful for:
|
||||
- Selective cache invalidation when specific references are updated
|
||||
- Memory management in applications with many delta spaces
|
||||
- Testing specific delta compression scenarios
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Evict specific reference after update
|
||||
client.upload("new-reference.zip", "s3://releases/v2.0.0/")
|
||||
client.evict_cache("s3://releases/v2.0.0/reference.bin")
|
||||
|
||||
# Next upload will fetch fresh reference
|
||||
client.upload("similar-file.zip", "s3://releases/v2.0.0/")
|
||||
|
||||
# Selective eviction for specific delta spaces
|
||||
delta_spaces = ["v1.0.0", "v1.1.0", "v1.2.0"]
|
||||
for space in delta_spaces:
|
||||
client.evict_cache(f"s3://releases/{space}/reference.bin")
|
||||
```
|
||||
|
||||
##### See Also
|
||||
|
||||
- [docs/CACHE_MANAGEMENT.md](../../CACHE_MANAGEMENT.md): Complete cache management guide
|
||||
- `clear_cache()`: Clear all caches
|
||||
|
||||
#### `lifecycle_policy`
|
||||
|
||||
Set lifecycle policy for S3 prefix (placeholder for future implementation).
|
||||
|
||||
@@ -5,15 +5,17 @@ Real-world examples and patterns for using DeltaGlider in production application
|
||||
## Table of Contents
|
||||
|
||||
1. [Performance-Optimized Bucket Listing](#performance-optimized-bucket-listing)
|
||||
2. [Bucket Management](#bucket-management)
|
||||
3. [Software Release Management](#software-release-management)
|
||||
4. [Database Backup System](#database-backup-system)
|
||||
5. [CI/CD Pipeline Integration](#cicd-pipeline-integration)
|
||||
6. [Container Registry Storage](#container-registry-storage)
|
||||
7. [Machine Learning Model Versioning](#machine-learning-model-versioning)
|
||||
8. [Game Asset Distribution](#game-asset-distribution)
|
||||
9. [Log Archive Management](#log-archive-management)
|
||||
10. [Multi-Region Replication](#multi-region-replication)
|
||||
2. [Bucket Statistics and Monitoring](#bucket-statistics-and-monitoring)
|
||||
3. [Session-Level Cache Management](#session-level-cache-management)
|
||||
4. [Bucket Management](#bucket-management)
|
||||
5. [Software Release Management](#software-release-management)
|
||||
6. [Database Backup System](#database-backup-system)
|
||||
7. [CI/CD Pipeline Integration](#cicd-pipeline-integration)
|
||||
8. [Container Registry Storage](#container-registry-storage)
|
||||
9. [Machine Learning Model Versioning](#machine-learning-model-versioning)
|
||||
10. [Game Asset Distribution](#game-asset-distribution)
|
||||
11. [Log Archive Management](#log-archive-management)
|
||||
12. [Multi-Region Replication](#multi-region-replication)
|
||||
|
||||
## Performance-Optimized Bucket Listing
|
||||
|
||||
@@ -199,6 +201,322 @@ performance_comparison('releases')
|
||||
|
||||
2. **Never Fetch for Non-Deltas**: The SDK automatically skips metadata fetching for non-delta files even when `FetchMetadata=True`.
|
||||
|
||||
## Bucket Statistics and Monitoring
|
||||
|
||||
DeltaGlider provides powerful bucket statistics with session-level caching for performance.
|
||||
|
||||
### Quick Dashboard Stats (Cached)
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
def show_bucket_dashboard(bucket: str):
|
||||
"""Display real-time bucket statistics with caching."""
|
||||
|
||||
# First call: computes stats (~50ms)
|
||||
stats = client.get_bucket_stats(bucket)
|
||||
|
||||
# Second call: instant (cached)
|
||||
stats = client.get_bucket_stats(bucket)
|
||||
|
||||
print(f"Dashboard for {stats.bucket}")
|
||||
print(f"=" * 60)
|
||||
print(f"Total Objects: {stats.object_count:,}")
|
||||
print(f" Delta Objects: {stats.delta_objects:,}")
|
||||
print(f" Direct Objects: {stats.direct_objects:,}")
|
||||
print()
|
||||
print(f"Original Size: {stats.total_size / (1024**3):.2f} GB")
|
||||
print(f"Stored Size: {stats.compressed_size / (1024**3):.2f} GB")
|
||||
print(f"Space Saved: {stats.space_saved / (1024**3):.2f} GB")
|
||||
print(f"Compression Ratio: {stats.average_compression_ratio:.1%}")
|
||||
|
||||
# Example: Show stats for multiple buckets (each cached separately)
|
||||
for bucket_name in ['releases', 'backups', 'archives']:
|
||||
show_bucket_dashboard(bucket_name)
|
||||
```
|
||||
|
||||
### Detailed Compression Analysis
|
||||
|
||||
```python
|
||||
def detailed_compression_report(bucket: str):
|
||||
"""Generate detailed compression report with accurate ratios."""
|
||||
|
||||
# Detailed stats fetch metadata for delta files (slower, accurate)
|
||||
stats = client.get_bucket_stats(bucket, detailed_stats=True)
|
||||
|
||||
efficiency = (stats.space_saved / stats.total_size * 100) if stats.total_size > 0 else 0
|
||||
|
||||
print(f"Detailed Compression Report: {stats.bucket}")
|
||||
print(f"=" * 60)
|
||||
print(f"Object Distribution:")
|
||||
print(f" Total: {stats.object_count:,}")
|
||||
print(f" Delta-Compressed: {stats.delta_objects:,} ({stats.delta_objects/stats.object_count*100:.1f}%)")
|
||||
print(f" Direct Storage: {stats.direct_objects:,} ({stats.direct_objects/stats.object_count*100:.1f}%)")
|
||||
print()
|
||||
print(f"Storage Efficiency:")
|
||||
print(f" Original Data: {stats.total_size / (1024**3):.2f} GB")
|
||||
print(f" Actual Storage: {stats.compressed_size / (1024**3):.2f} GB")
|
||||
print(f" Space Saved: {stats.space_saved / (1024**3):.2f} GB")
|
||||
print(f" Efficiency: {efficiency:.1f}%")
|
||||
print(f" Avg Compression: {stats.average_compression_ratio:.2%}")
|
||||
|
||||
# Calculate estimated monthly costs (example: $0.023/GB S3 Standard)
|
||||
cost_without = stats.total_size / (1024**3) * 0.023
|
||||
cost_with = stats.compressed_size / (1024**3) * 0.023
|
||||
monthly_savings = cost_without - cost_with
|
||||
|
||||
print()
|
||||
print(f"Estimated Monthly S3 Costs ($0.023/GB):")
|
||||
print(f" Without DeltaGlider: ${cost_without:.2f}")
|
||||
print(f" With DeltaGlider: ${cost_with:.2f}")
|
||||
print(f" Monthly Savings: ${monthly_savings:.2f}")
|
||||
|
||||
# Example: Detailed report
|
||||
detailed_compression_report('releases')
|
||||
```
|
||||
|
||||
### List Buckets with Cached Stats
|
||||
|
||||
```python
|
||||
def list_buckets_with_stats():
|
||||
"""List all buckets and show cached statistics if available."""
|
||||
|
||||
# Pre-fetch stats for important buckets
|
||||
important_buckets = ['releases', 'backups']
|
||||
for bucket_name in important_buckets:
|
||||
client.get_bucket_stats(bucket_name, detailed_stats=True)
|
||||
|
||||
# List all buckets (includes cached stats automatically)
|
||||
response = client.list_buckets()
|
||||
|
||||
print("All Buckets:")
|
||||
print(f"{'Name':<30} {'Objects':<10} {'Compression':<15} {'Cached'}")
|
||||
print("=" * 70)
|
||||
|
||||
for bucket in response['Buckets']:
|
||||
name = bucket['Name']
|
||||
|
||||
# Check if stats are cached
|
||||
if 'DeltaGliderStats' in bucket:
|
||||
stats = bucket['DeltaGliderStats']
|
||||
obj_count = f"{stats['ObjectCount']:,}"
|
||||
compression = f"{stats['AverageCompressionRatio']:.1%}"
|
||||
cached = "✓ (detailed)" if stats['Detailed'] else "✓ (quick)"
|
||||
else:
|
||||
obj_count = "N/A"
|
||||
compression = "N/A"
|
||||
cached = "✗"
|
||||
|
||||
print(f"{name:<30} {obj_count:<10} {compression:<15} {cached}")
|
||||
|
||||
# Example: List with stats
|
||||
list_buckets_with_stats()
|
||||
```
|
||||
|
||||
### Monitoring Dashboard (Real-Time)
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def monitoring_dashboard(buckets: list[str], refresh_seconds: int = 60):
|
||||
"""Real-time monitoring dashboard with periodic refresh."""
|
||||
|
||||
while True:
|
||||
print("\033[2J\033[H") # Clear screen
|
||||
print(f"DeltaGlider Monitoring Dashboard - {time.strftime('%Y-%m-%d %H:%M:%S')}")
|
||||
print("=" * 80)
|
||||
|
||||
for bucket_name in buckets:
|
||||
# Get cached stats (instant) or compute fresh
|
||||
stats = client.get_bucket_stats(bucket_name)
|
||||
|
||||
print(f"\n{bucket_name}:")
|
||||
print(f" Objects: {stats.object_count:,} | "
|
||||
f"Delta: {stats.delta_objects:,} | "
|
||||
f"Direct: {stats.direct_objects:,}")
|
||||
print(f" Size: {stats.compressed_size/(1024**3):.2f} GB | "
|
||||
f"Saved: {stats.space_saved/(1024**3):.2f} GB | "
|
||||
f"Compression: {stats.average_compression_ratio:.1%}")
|
||||
|
||||
print(f"\n{'=' * 80}")
|
||||
print(f"Refreshing in {refresh_seconds} seconds... (Ctrl+C to exit)")
|
||||
|
||||
time.sleep(refresh_seconds)
|
||||
|
||||
# Clear cache for fresh data on next iteration
|
||||
client.clear_cache()
|
||||
|
||||
# Example: Monitor key buckets
|
||||
try:
|
||||
monitoring_dashboard(['releases', 'backups', 'archives'], refresh_seconds=30)
|
||||
except KeyboardInterrupt:
|
||||
print("\nMonitoring stopped.")
|
||||
```
|
||||
|
||||
## Session-Level Cache Management
|
||||
|
||||
DeltaGlider maintains session-level caches for optimal performance in long-running applications.
|
||||
|
||||
### Long-Running Application Pattern
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
import time
|
||||
|
||||
def long_running_upload_service():
|
||||
"""Upload service with periodic cache cleanup."""
|
||||
|
||||
client = create_client()
|
||||
processed_count = 0
|
||||
|
||||
while True:
|
||||
# Simulate file processing
|
||||
files_to_upload = get_pending_files() # Your file queue
|
||||
|
||||
for file_path in files_to_upload:
|
||||
try:
|
||||
summary = client.upload(file_path, "s3://releases/")
|
||||
processed_count += 1
|
||||
|
||||
print(f"Uploaded {file_path}: {summary.savings_percent:.0f}% saved")
|
||||
|
||||
# Periodic cache cleanup (every 100 files)
|
||||
if processed_count % 100 == 0:
|
||||
client.clear_cache()
|
||||
print(f"Cache cleared after {processed_count} files")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error uploading {file_path}: {e}")
|
||||
|
||||
time.sleep(60) # Check for new files every minute
|
||||
|
||||
# Example: Run upload service
|
||||
# long_running_upload_service()
|
||||
```
|
||||
|
||||
### Cache Invalidation After External Changes
|
||||
|
||||
```python
|
||||
def handle_external_bucket_changes(bucket: str):
|
||||
"""Refresh statistics after external tools modify bucket."""
|
||||
|
||||
# Get initial stats (cached)
|
||||
stats_before = client.get_bucket_stats(bucket)
|
||||
print(f"Before: {stats_before.object_count} objects")
|
||||
|
||||
# External process modifies bucket
|
||||
print("External backup tool running...")
|
||||
run_external_backup_tool(bucket) # Your external tool
|
||||
|
||||
# Clear cache to get fresh data
|
||||
client.clear_cache()
|
||||
|
||||
# Get updated stats
|
||||
stats_after = client.get_bucket_stats(bucket)
|
||||
print(f"After: {stats_after.object_count} objects")
|
||||
print(f"Added: {stats_after.object_count - stats_before.object_count} objects")
|
||||
|
||||
# Example usage
|
||||
handle_external_bucket_changes('backups')
|
||||
```
|
||||
|
||||
### Selective Cache Eviction
|
||||
|
||||
```python
|
||||
def selective_cache_management():
|
||||
"""Manage cache for specific delta spaces."""
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Upload to multiple delta spaces
|
||||
versions = ['v1.0.0', 'v1.1.0', 'v1.2.0']
|
||||
|
||||
for version in versions:
|
||||
client.upload(f"app-{version}.zip", f"s3://releases/{version}/")
|
||||
|
||||
# Update reference for specific version
|
||||
print("Updating v1.1.0 reference...")
|
||||
client.upload("new-reference.zip", "s3://releases/v1.1.0/")
|
||||
|
||||
# Evict only v1.1.0 cache (others remain cached)
|
||||
client.evict_cache("s3://releases/v1.1.0/reference.bin")
|
||||
|
||||
# Next upload to v1.1.0 fetches fresh reference
|
||||
# v1.0.0 and v1.2.0 still use cached references
|
||||
client.upload("similar-file.zip", "s3://releases/v1.1.0/")
|
||||
|
||||
# Example: Selective eviction
|
||||
selective_cache_management()
|
||||
```
|
||||
|
||||
### Testing with Clean Cache
|
||||
|
||||
```python
|
||||
import pytest
|
||||
from deltaglider import create_client
|
||||
|
||||
def test_upload_workflow():
|
||||
"""Test with clean cache state."""
|
||||
|
||||
client = create_client()
|
||||
client.clear_cache() # Start with clean state
|
||||
|
||||
# Test first upload (no reference exists)
|
||||
summary1 = client.upload("file1.zip", "s3://test-bucket/prefix/")
|
||||
assert not summary1.is_delta # First file is reference
|
||||
|
||||
# Test subsequent upload (uses cached reference)
|
||||
summary2 = client.upload("file2.zip", "s3://test-bucket/prefix/")
|
||||
assert summary2.is_delta # Should use delta
|
||||
|
||||
# Clear and test again
|
||||
client.clear_cache()
|
||||
summary3 = client.upload("file3.zip", "s3://test-bucket/prefix/")
|
||||
assert summary3.is_delta # Still delta (reference in S3)
|
||||
|
||||
# Run test
|
||||
# test_upload_workflow()
|
||||
```
|
||||
|
||||
### Cache Performance Monitoring
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def measure_cache_performance(bucket: str):
|
||||
"""Measure performance impact of caching."""
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Test 1: Cold cache
|
||||
client.clear_cache()
|
||||
start = time.time()
|
||||
stats1 = client.get_bucket_stats(bucket, detailed_stats=True)
|
||||
cold_time = (time.time() - start) * 1000
|
||||
|
||||
# Test 2: Warm cache
|
||||
start = time.time()
|
||||
stats2 = client.get_bucket_stats(bucket, detailed_stats=True)
|
||||
warm_time = (time.time() - start) * 1000
|
||||
|
||||
# Test 3: Quick stats from detailed cache
|
||||
start = time.time()
|
||||
stats3 = client.get_bucket_stats(bucket, detailed_stats=False)
|
||||
reuse_time = (time.time() - start) * 1000
|
||||
|
||||
print(f"Cache Performance for {bucket}:")
|
||||
print(f" Cold Cache (detailed): {cold_time:.0f}ms")
|
||||
print(f" Warm Cache (detailed): {warm_time:.0f}ms")
|
||||
print(f" Cache Reuse (quick): {reuse_time:.0f}ms")
|
||||
print(f" Speedup (detailed): {cold_time/warm_time:.1f}x")
|
||||
print(f" Speedup (reuse): {cold_time/reuse_time:.1f}x")
|
||||
|
||||
# Example: Measure cache performance
|
||||
measure_cache_performance('releases')
|
||||
```
|
||||
|
||||
3. **Use Pagination**: For large buckets, use `MaxKeys` and `ContinuationToken` to paginate results.
|
||||
|
||||
4. **Cache Results**: If you need metadata frequently, consider caching the results to avoid repeated HEAD requests.
|
||||
|
||||
@@ -76,6 +76,10 @@ DeltaGlider supports the following environment variables:
|
||||
**Logging & Performance**:
|
||||
- `DG_LOG_LEVEL`: Logging level (default: `INFO`, options: `DEBUG`, `INFO`, `WARNING`, `ERROR`)
|
||||
- `DG_MAX_RATIO`: Maximum delta/file ratio (default: `0.5`, range: `0.0-1.0`)
|
||||
- **See [DG_MAX_RATIO.md](../DG_MAX_RATIO.md) for complete tuning guide**
|
||||
- Controls when to use delta compression vs. direct storage
|
||||
- Lower (0.2-0.3) = conservative, only high-quality compression
|
||||
- Higher (0.6-0.7) = permissive, accept modest savings
|
||||
|
||||
**Cache Configuration**:
|
||||
- `DG_CACHE_BACKEND`: Cache backend type (default: `filesystem`, options: `filesystem`, `memory`)
|
||||
|
||||
@@ -53,6 +53,7 @@ dependencies = [
|
||||
"click>=8.1.0",
|
||||
"cryptography>=42.0.0",
|
||||
"python-dateutil>=2.9.0",
|
||||
"requests>=2.32.0",
|
||||
]
|
||||
|
||||
[project.urls]
|
||||
@@ -109,6 +110,7 @@ dev-dependencies = [
|
||||
"mypy>=1.13.0",
|
||||
"boto3-stubs[s3]>=1.35.0",
|
||||
"types-python-dateutil>=2.9.0",
|
||||
"types-requests>=2.32.0",
|
||||
"setuptools-scm>=8.0.0",
|
||||
]
|
||||
|
||||
|
||||
@@ -6,20 +6,22 @@ from .cache_fs import FsCacheAdapter
|
||||
from .cache_memory import MemoryCache
|
||||
from .clock_utc import UtcClockAdapter
|
||||
from .diff_xdelta import XdeltaAdapter
|
||||
from .ec2_metadata import EC2MetadataAdapter
|
||||
from .hash_sha import Sha256Adapter
|
||||
from .logger_std import StdLoggerAdapter
|
||||
from .metrics_noop import NoopMetricsAdapter
|
||||
from .storage_s3 import S3StorageAdapter
|
||||
|
||||
__all__ = [
|
||||
"S3StorageAdapter",
|
||||
"XdeltaAdapter",
|
||||
"Sha256Adapter",
|
||||
"FsCacheAdapter",
|
||||
"ContentAddressedCache",
|
||||
"EC2MetadataAdapter",
|
||||
"EncryptedCache",
|
||||
"FsCacheAdapter",
|
||||
"MemoryCache",
|
||||
"UtcClockAdapter",
|
||||
"StdLoggerAdapter",
|
||||
"NoopMetricsAdapter",
|
||||
"S3StorageAdapter",
|
||||
"Sha256Adapter",
|
||||
"StdLoggerAdapter",
|
||||
"UtcClockAdapter",
|
||||
"XdeltaAdapter",
|
||||
]
|
||||
|
||||
@@ -242,5 +242,29 @@ class ContentAddressedCache(CachePort):
|
||||
|
||||
# NOTE: We don't delete the actual CAS file because:
|
||||
# - Other deltaspaces may reference the same SHA
|
||||
|
||||
def clear(self) -> None:
|
||||
"""Clear all cached references.
|
||||
|
||||
Removes all cached files and mappings. This is a destructive operation
|
||||
that forcibly removes the entire cache directory.
|
||||
|
||||
Use cases:
|
||||
- Long-running applications that need to free disk space
|
||||
- Manual cache invalidation
|
||||
- Test cleanup
|
||||
- Ensuring fresh data fetch after configuration changes
|
||||
"""
|
||||
import shutil
|
||||
|
||||
# Clear in-memory mapping
|
||||
self._deltaspace_to_sha.clear()
|
||||
|
||||
# Remove all cache files (destructive!)
|
||||
if self.base_dir.exists():
|
||||
shutil.rmtree(self.base_dir, ignore_errors=True)
|
||||
|
||||
# Recreate base directory with secure permissions
|
||||
self.base_dir.mkdir(parents=True, mode=0o700, exist_ok=True)
|
||||
# - The ephemeral cache will be cleaned on process exit anyway
|
||||
# - For persistent cache (future), we'd need reference counting
|
||||
|
||||
@@ -281,3 +281,25 @@ class EncryptedCache(CachePort):
|
||||
|
||||
# Evict from backend
|
||||
self.backend.evict(bucket, prefix)
|
||||
|
||||
def clear(self) -> None:
|
||||
"""Clear all cached references and encryption mappings.
|
||||
|
||||
Removes all cached data and clears encryption key mappings.
|
||||
This is the proper way to forcibly clean up cache in long-running
|
||||
applications.
|
||||
|
||||
Use cases:
|
||||
- Long-running applications needing to free resources
|
||||
- Manual cache invalidation after key rotation
|
||||
- Test cleanup
|
||||
- Memory pressure situations
|
||||
|
||||
Note: After clearing, the cache will use a fresh encryption key
|
||||
(ephemeral mode) or the same persistent key (if DG_CACHE_ENCRYPTION_KEY set).
|
||||
"""
|
||||
# Clear encryption mapping
|
||||
self._plaintext_sha_map.clear()
|
||||
|
||||
# Delegate to backend to clear actual files/memory
|
||||
self.backend.clear()
|
||||
|
||||
126
src/deltaglider/adapters/ec2_metadata.py
Normal file
126
src/deltaglider/adapters/ec2_metadata.py
Normal file
@@ -0,0 +1,126 @@
|
||||
"""EC2 Instance Metadata Service (IMDS) adapter.
|
||||
|
||||
Provides access to EC2 instance metadata using IMDSv2 with token-based authentication.
|
||||
Falls back gracefully when not running on EC2.
|
||||
"""
|
||||
|
||||
import os
|
||||
|
||||
import requests
|
||||
|
||||
|
||||
class EC2MetadataAdapter:
|
||||
"""Adapter for EC2 Instance Metadata Service (IMDSv2)."""
|
||||
|
||||
IMDS_BASE_URL = "http://169.254.169.254/latest"
|
||||
TOKEN_URL = f"{IMDS_BASE_URL}/api/token"
|
||||
TOKEN_TTL_SECONDS = 21600 # 6 hours
|
||||
TOKEN_HEADER = "X-aws-ec2-metadata-token"
|
||||
TIMEOUT_SECONDS = 1 # Fast timeout for non-EC2 environments
|
||||
|
||||
def __init__(self) -> None:
|
||||
"""Initialize EC2 metadata adapter."""
|
||||
self._token: str | None = None
|
||||
self._is_ec2: bool | None = None
|
||||
self._region: str | None = None
|
||||
|
||||
def is_running_on_ec2(self) -> bool:
|
||||
"""Check if running on an EC2 instance.
|
||||
|
||||
Returns:
|
||||
True if running on EC2, False otherwise
|
||||
|
||||
Note:
|
||||
Result is cached after first check for performance.
|
||||
"""
|
||||
if self._is_ec2 is not None:
|
||||
return self._is_ec2
|
||||
|
||||
# Skip check if explicitly disabled
|
||||
if os.environ.get("DG_DISABLE_EC2_DETECTION", "").lower() in ("true", "1", "yes"):
|
||||
self._is_ec2 = False
|
||||
return False
|
||||
|
||||
try:
|
||||
# Try to get IMDSv2 token
|
||||
self._token = self._get_token()
|
||||
self._is_ec2 = self._token is not None
|
||||
except Exception:
|
||||
self._is_ec2 = False
|
||||
|
||||
return self._is_ec2
|
||||
|
||||
def get_region(self) -> str | None:
|
||||
"""Get the EC2 instance's AWS region.
|
||||
|
||||
Returns:
|
||||
AWS region code (e.g., "us-east-1") or None if not on EC2
|
||||
|
||||
Note:
|
||||
Result is cached after first successful fetch.
|
||||
"""
|
||||
if not self.is_running_on_ec2():
|
||||
return None
|
||||
|
||||
if self._region is not None:
|
||||
return self._region
|
||||
|
||||
try:
|
||||
if self._token:
|
||||
response = requests.get(
|
||||
f"{self.IMDS_BASE_URL}/meta-data/placement/region",
|
||||
headers={self.TOKEN_HEADER: self._token},
|
||||
timeout=self.TIMEOUT_SECONDS,
|
||||
)
|
||||
if response.status_code == 200:
|
||||
self._region = response.text.strip()
|
||||
return self._region
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
def get_availability_zone(self) -> str | None:
|
||||
"""Get the EC2 instance's availability zone.
|
||||
|
||||
Returns:
|
||||
Availability zone (e.g., "us-east-1a") or None if not on EC2
|
||||
"""
|
||||
if not self.is_running_on_ec2():
|
||||
return None
|
||||
|
||||
try:
|
||||
if self._token:
|
||||
response = requests.get(
|
||||
f"{self.IMDS_BASE_URL}/meta-data/placement/availability-zone",
|
||||
headers={self.TOKEN_HEADER: self._token},
|
||||
timeout=self.TIMEOUT_SECONDS,
|
||||
)
|
||||
if response.status_code == 200:
|
||||
return str(response.text.strip())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
def _get_token(self) -> str | None:
|
||||
"""Get IMDSv2 token for authenticated metadata requests.
|
||||
|
||||
Returns:
|
||||
IMDSv2 token or None if unable to retrieve
|
||||
|
||||
Note:
|
||||
Uses IMDSv2 for security. IMDSv1 is not supported.
|
||||
"""
|
||||
try:
|
||||
response = requests.put(
|
||||
self.TOKEN_URL,
|
||||
headers={"X-aws-ec2-metadata-token-ttl-seconds": str(self.TOKEN_TTL_SECONDS)},
|
||||
timeout=self.TIMEOUT_SECONDS,
|
||||
)
|
||||
if response.status_code == 200:
|
||||
return response.text.strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return None
|
||||
@@ -1,5 +1,6 @@
|
||||
"""AWS S3 CLI compatible commands."""
|
||||
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
@@ -7,6 +8,95 @@ import click
|
||||
|
||||
from ...core import DeltaService, DeltaSpace, ObjectKey
|
||||
|
||||
__all__ = [
|
||||
"is_s3_path",
|
||||
"parse_s3_url",
|
||||
"determine_operation",
|
||||
"upload_file",
|
||||
"download_file",
|
||||
"copy_s3_to_s3",
|
||||
"migrate_s3_to_s3",
|
||||
"handle_recursive",
|
||||
"log_aws_region",
|
||||
]
|
||||
|
||||
|
||||
def log_aws_region(service: DeltaService, region_override: bool = False) -> None:
|
||||
"""Log the AWS region being used and warn about cross-region charges.
|
||||
|
||||
This function:
|
||||
1. Detects if running on EC2
|
||||
2. Compares EC2 region with S3 client region
|
||||
3. Warns about potential cross-region data transfer charges
|
||||
4. Helps users optimize for cost and performance
|
||||
|
||||
Args:
|
||||
service: DeltaService instance with storage adapter
|
||||
region_override: True if user explicitly specified --region flag
|
||||
"""
|
||||
try:
|
||||
from ...adapters.ec2_metadata import EC2MetadataAdapter
|
||||
from ...adapters.storage_s3 import S3StorageAdapter
|
||||
|
||||
if not isinstance(service.storage, S3StorageAdapter):
|
||||
return # Not using S3 storage, skip
|
||||
|
||||
# Get S3 client region
|
||||
s3_region = service.storage.client.meta.region_name
|
||||
if not s3_region:
|
||||
s3_region = "us-east-1" # boto3 default
|
||||
|
||||
# Check if running on EC2
|
||||
ec2_metadata = EC2MetadataAdapter()
|
||||
if ec2_metadata.is_running_on_ec2():
|
||||
ec2_region = ec2_metadata.get_region()
|
||||
ec2_az = ec2_metadata.get_availability_zone()
|
||||
|
||||
# Log EC2 context
|
||||
click.echo(f"EC2 Instance: {ec2_az or ec2_region or 'unknown'}")
|
||||
click.echo(f"S3 Client Region: {s3_region}")
|
||||
|
||||
# Check for region mismatch
|
||||
if ec2_region and ec2_region != s3_region:
|
||||
if region_override:
|
||||
# User explicitly set --region, warn about costs
|
||||
click.echo("")
|
||||
click.secho(
|
||||
f"⚠️ WARNING: EC2 region={ec2_region} != S3 client region={s3_region}",
|
||||
fg="yellow",
|
||||
bold=True,
|
||||
)
|
||||
click.secho(
|
||||
f" Expect cross-region/NAT data charges. Align regions (set client region={ec2_region})",
|
||||
fg="yellow",
|
||||
)
|
||||
click.secho(
|
||||
" before proceeding. Or drop --region for automatic region resolution.",
|
||||
fg="yellow",
|
||||
)
|
||||
click.echo("")
|
||||
else:
|
||||
# Auto-detected mismatch, but user can still cancel
|
||||
click.echo("")
|
||||
click.secho(
|
||||
f"ℹ️ INFO: EC2 region ({ec2_region}) differs from configured S3 region ({s3_region})",
|
||||
fg="cyan",
|
||||
)
|
||||
click.secho(
|
||||
f" Consider using --region {ec2_region} to avoid cross-region charges.",
|
||||
fg="cyan",
|
||||
)
|
||||
click.echo("")
|
||||
elif ec2_region and ec2_region == s3_region:
|
||||
# Regions match - optimal configuration
|
||||
click.secho("✓ Regions aligned - no cross-region charges", fg="green")
|
||||
else:
|
||||
# Not on EC2, just show S3 region
|
||||
click.echo(f"S3 Client Region: {s3_region}")
|
||||
|
||||
except Exception:
|
||||
pass # Silently ignore errors getting region info
|
||||
|
||||
|
||||
def is_s3_path(path: str) -> bool:
|
||||
"""Check if path is an S3 URL."""
|
||||
@@ -149,31 +239,304 @@ def copy_s3_to_s3(
|
||||
source_url: str,
|
||||
dest_url: str,
|
||||
quiet: bool = False,
|
||||
max_ratio: float | None = None,
|
||||
no_delta: bool = False,
|
||||
) -> None:
|
||||
"""Copy object between S3 locations."""
|
||||
# For now, implement as download + upload
|
||||
# TODO: Optimize with server-side copy when possible
|
||||
"""Copy object between S3 locations with optional delta compression.
|
||||
|
||||
This performs a direct S3-to-S3 transfer using streaming to preserve
|
||||
the original file content and apply delta compression at the destination.
|
||||
"""
|
||||
source_bucket, source_key = parse_s3_url(source_url)
|
||||
dest_bucket, dest_key = parse_s3_url(dest_url)
|
||||
|
||||
if not quiet:
|
||||
click.echo(f"copy: 's3://{source_bucket}/{source_key}' to 's3://{dest_bucket}/{dest_key}'")
|
||||
|
||||
# Use temporary file
|
||||
import tempfile
|
||||
try:
|
||||
# Get the source object as a stream
|
||||
source_stream = service.storage.get(f"{source_bucket}/{source_key}")
|
||||
|
||||
with tempfile.NamedTemporaryFile(suffix=Path(source_key).suffix) as tmp:
|
||||
tmp_path = Path(tmp.name)
|
||||
# Determine the destination deltaspace
|
||||
dest_key_parts = dest_key.split("/")
|
||||
if len(dest_key_parts) > 1:
|
||||
dest_prefix = "/".join(dest_key_parts[:-1])
|
||||
else:
|
||||
dest_prefix = ""
|
||||
|
||||
# Download from source
|
||||
download_file(service, source_url, tmp_path, quiet=True)
|
||||
dest_deltaspace = DeltaSpace(bucket=dest_bucket, prefix=dest_prefix)
|
||||
|
||||
# Upload to destination
|
||||
upload_file(service, tmp_path, dest_url, quiet=True)
|
||||
# If delta is disabled or max_ratio specified, use direct put
|
||||
if no_delta:
|
||||
# Direct storage put without delta compression
|
||||
service.storage.put(f"{dest_bucket}/{dest_key}", source_stream, {})
|
||||
if not quiet:
|
||||
click.echo("Copy completed (no delta compression)")
|
||||
else:
|
||||
# Write to a temporary file and use override_name to preserve original filename
|
||||
import tempfile
|
||||
|
||||
# Extract original filename from source
|
||||
original_filename = Path(source_key).name
|
||||
|
||||
with tempfile.NamedTemporaryFile(delete=False, suffix=Path(source_key).suffix) as tmp:
|
||||
tmp_path = Path(tmp.name)
|
||||
|
||||
# Write stream to temp file
|
||||
with open(tmp_path, "wb") as f:
|
||||
shutil.copyfileobj(source_stream, f)
|
||||
|
||||
try:
|
||||
# Use DeltaService.put() with override_name to preserve original filename
|
||||
summary = service.put(
|
||||
tmp_path, dest_deltaspace, max_ratio, override_name=original_filename
|
||||
)
|
||||
|
||||
if not quiet:
|
||||
if summary.delta_size:
|
||||
ratio = round((summary.delta_size / summary.file_size) * 100, 1)
|
||||
click.echo(f"Copy completed with delta compression ({ratio}% of original)")
|
||||
else:
|
||||
click.echo("Copy completed (stored as reference)")
|
||||
finally:
|
||||
# Clean up temp file
|
||||
tmp_path.unlink(missing_ok=True)
|
||||
|
||||
except Exception as e:
|
||||
click.echo(f"S3-to-S3 copy failed: {e}", err=True)
|
||||
raise
|
||||
|
||||
|
||||
def migrate_s3_to_s3(
|
||||
service: DeltaService,
|
||||
source_url: str,
|
||||
dest_url: str,
|
||||
exclude: str | None = None,
|
||||
include: str | None = None,
|
||||
quiet: bool = False,
|
||||
no_delta: bool = False,
|
||||
max_ratio: float | None = None,
|
||||
dry_run: bool = False,
|
||||
skip_confirm: bool = False,
|
||||
preserve_prefix: bool = False,
|
||||
region_override: bool = False,
|
||||
) -> None:
|
||||
"""Migrate objects from one S3 location to another with delta compression.
|
||||
|
||||
Features:
|
||||
- Resume support: Only copies files that don't exist in destination
|
||||
- Progress tracking: Shows migration progress
|
||||
- Confirmation prompt: Shows file count before starting
|
||||
- Prefix preservation: Optionally preserves source prefix structure in destination
|
||||
- EC2 region detection: Warns about cross-region data transfer charges
|
||||
|
||||
Args:
|
||||
service: DeltaService instance
|
||||
source_url: Source S3 URL
|
||||
dest_url: Destination S3 URL
|
||||
exclude: Pattern to exclude files
|
||||
include: Pattern to include files
|
||||
quiet: Suppress output
|
||||
no_delta: Disable delta compression
|
||||
max_ratio: Maximum delta/file ratio
|
||||
dry_run: Show what would be migrated without migrating
|
||||
skip_confirm: Skip confirmation prompt
|
||||
preserve_prefix: Preserve source prefix in destination
|
||||
region_override: True if user explicitly specified --region flag
|
||||
"""
|
||||
import fnmatch
|
||||
|
||||
source_bucket, source_prefix = parse_s3_url(source_url)
|
||||
dest_bucket, dest_prefix = parse_s3_url(dest_url)
|
||||
|
||||
# Ensure prefixes end with / if they exist
|
||||
if source_prefix and not source_prefix.endswith("/"):
|
||||
source_prefix += "/"
|
||||
if dest_prefix and not dest_prefix.endswith("/"):
|
||||
dest_prefix += "/"
|
||||
|
||||
# Determine the effective destination prefix based on preserve_prefix setting
|
||||
effective_dest_prefix = dest_prefix
|
||||
if preserve_prefix and source_prefix:
|
||||
# Extract the last component of the source prefix (e.g., "prefix1/" from "path/to/prefix1/")
|
||||
source_prefix_name = source_prefix.rstrip("/").split("/")[-1]
|
||||
if source_prefix_name:
|
||||
# Append source prefix name to destination
|
||||
effective_dest_prefix = (dest_prefix or "") + source_prefix_name + "/"
|
||||
|
||||
if not quiet:
|
||||
# Log AWS region being used (helps users verify their configuration)
|
||||
# Pass region_override to warn about cross-region charges if user explicitly set --region
|
||||
log_aws_region(service, region_override=region_override)
|
||||
|
||||
if preserve_prefix and source_prefix:
|
||||
click.echo(f"Migrating from s3://{source_bucket}/{source_prefix}")
|
||||
click.echo(f" to s3://{dest_bucket}/{effective_dest_prefix}")
|
||||
else:
|
||||
click.echo(
|
||||
f"Migrating from s3://{source_bucket}/{source_prefix} to s3://{dest_bucket}/{dest_prefix}"
|
||||
)
|
||||
click.echo("Scanning source and destination buckets...")
|
||||
|
||||
# List source objects
|
||||
source_list_prefix = f"{source_bucket}/{source_prefix}" if source_prefix else source_bucket
|
||||
source_objects = []
|
||||
|
||||
for obj in service.storage.list(source_list_prefix):
|
||||
# Skip reference.bin files (internal delta reference)
|
||||
if obj.key.endswith("/reference.bin"):
|
||||
continue
|
||||
# Skip .delta files in source (we'll handle the original files)
|
||||
if obj.key.endswith(".delta"):
|
||||
continue
|
||||
|
||||
# Apply include/exclude filters
|
||||
rel_key = obj.key.removeprefix(source_prefix) if source_prefix else obj.key
|
||||
if exclude and fnmatch.fnmatch(rel_key, exclude):
|
||||
continue
|
||||
if include and not fnmatch.fnmatch(rel_key, include):
|
||||
continue
|
||||
|
||||
source_objects.append(obj)
|
||||
|
||||
# List destination objects to detect what needs copying
|
||||
dest_list_prefix = (
|
||||
f"{dest_bucket}/{effective_dest_prefix}" if effective_dest_prefix else dest_bucket
|
||||
)
|
||||
dest_keys = set()
|
||||
|
||||
for obj in service.storage.list(dest_list_prefix):
|
||||
# Get the relative key in destination
|
||||
rel_key = obj.key.removeprefix(effective_dest_prefix) if effective_dest_prefix else obj.key
|
||||
# Remove .delta suffix for comparison
|
||||
if rel_key.endswith(".delta"):
|
||||
rel_key = rel_key[:-6]
|
||||
# Skip reference.bin
|
||||
if not rel_key.endswith("/reference.bin"):
|
||||
dest_keys.add(rel_key)
|
||||
|
||||
# Determine files to migrate (not in destination)
|
||||
files_to_migrate = []
|
||||
total_size = 0
|
||||
|
||||
for source_obj in source_objects:
|
||||
# Get relative path from source prefix
|
||||
rel_key = source_obj.key.removeprefix(source_prefix) if source_prefix else source_obj.key
|
||||
|
||||
# Check if already exists in destination
|
||||
if rel_key not in dest_keys:
|
||||
files_to_migrate.append((source_obj, rel_key))
|
||||
total_size += source_obj.size
|
||||
|
||||
# Show summary and ask for confirmation
|
||||
if not files_to_migrate:
|
||||
if not quiet:
|
||||
click.echo("Copy completed")
|
||||
click.echo("All files are already migrated. Nothing to do.")
|
||||
return
|
||||
|
||||
if not quiet:
|
||||
|
||||
def format_bytes(size: int) -> str:
|
||||
size_float = float(size)
|
||||
for unit in ["B", "KB", "MB", "GB", "TB"]:
|
||||
if size_float < 1024.0:
|
||||
return f"{size_float:.2f} {unit}"
|
||||
size_float /= 1024.0
|
||||
return f"{size_float:.2f} PB"
|
||||
|
||||
click.echo("")
|
||||
click.echo(f"Files to migrate: {len(files_to_migrate)}")
|
||||
click.echo(f"Total size: {format_bytes(total_size)}")
|
||||
if len(dest_keys) > 0:
|
||||
click.echo(f"Already migrated: {len(dest_keys)} files (will be skipped)")
|
||||
|
||||
# Handle dry run mode early (before confirmation prompt)
|
||||
if dry_run:
|
||||
if not quiet:
|
||||
click.echo("\n--- DRY RUN MODE ---")
|
||||
for _obj, rel_key in files_to_migrate[:10]: # Show first 10 files
|
||||
click.echo(f" Would migrate: {rel_key}")
|
||||
if len(files_to_migrate) > 10:
|
||||
click.echo(f" ... and {len(files_to_migrate) - 10} more files")
|
||||
return
|
||||
|
||||
# Ask for confirmation before proceeding with actual migration
|
||||
if not quiet and not skip_confirm:
|
||||
click.echo("")
|
||||
if not click.confirm("Do you want to proceed with the migration?"):
|
||||
click.echo("Migration cancelled.")
|
||||
return
|
||||
|
||||
# Perform migration
|
||||
if not quiet:
|
||||
click.echo(f"\nStarting migration of {len(files_to_migrate)} files...")
|
||||
|
||||
successful = 0
|
||||
failed = 0
|
||||
failed_files = []
|
||||
|
||||
for i, (source_obj, rel_key) in enumerate(files_to_migrate, 1):
|
||||
source_s3_url = f"s3://{source_bucket}/{source_obj.key}"
|
||||
|
||||
# Construct destination URL using effective prefix
|
||||
if effective_dest_prefix:
|
||||
dest_key = effective_dest_prefix + rel_key
|
||||
else:
|
||||
dest_key = rel_key
|
||||
dest_s3_url = f"s3://{dest_bucket}/{dest_key}"
|
||||
|
||||
try:
|
||||
if not quiet:
|
||||
progress = f"[{i}/{len(files_to_migrate)}]"
|
||||
click.echo(f"{progress} Migrating {rel_key}...", nl=False)
|
||||
|
||||
# Copy with delta compression
|
||||
copy_s3_to_s3(
|
||||
service,
|
||||
source_s3_url,
|
||||
dest_s3_url,
|
||||
quiet=True,
|
||||
max_ratio=max_ratio,
|
||||
no_delta=no_delta,
|
||||
)
|
||||
|
||||
successful += 1
|
||||
if not quiet:
|
||||
click.echo(" ✓")
|
||||
|
||||
except Exception as e:
|
||||
failed += 1
|
||||
failed_files.append((rel_key, str(e)))
|
||||
if not quiet:
|
||||
click.echo(f" ✗ ({e})")
|
||||
|
||||
# Show final summary
|
||||
if not quiet:
|
||||
click.echo("")
|
||||
click.echo("Migration Summary:")
|
||||
click.echo(f" Successfully migrated: {successful} files")
|
||||
if failed > 0:
|
||||
click.echo(f" Failed: {failed} files")
|
||||
click.echo("\nFailed files:")
|
||||
for file, error in failed_files[:10]: # Show first 10 failures
|
||||
click.echo(f" {file}: {error}")
|
||||
if len(failed_files) > 10:
|
||||
click.echo(f" ... and {len(failed_files) - 10} more failures")
|
||||
|
||||
# Show compression statistics if available and delta was used
|
||||
if successful > 0 and not no_delta:
|
||||
try:
|
||||
from ...client import DeltaGliderClient
|
||||
|
||||
client = DeltaGliderClient(service)
|
||||
dest_stats = client.get_bucket_stats(dest_bucket, detailed_stats=False)
|
||||
if dest_stats.delta_objects > 0:
|
||||
click.echo(
|
||||
f"\nCompression achieved: {dest_stats.average_compression_ratio:.1%}"
|
||||
)
|
||||
click.echo(f"Space saved: {format_bytes(dest_stats.space_saved)}")
|
||||
except Exception:
|
||||
pass # Ignore stats errors
|
||||
|
||||
|
||||
def handle_recursive(
|
||||
@@ -264,6 +627,19 @@ def handle_recursive(
|
||||
s3_url = f"s3://{bucket}/{obj.key}"
|
||||
download_file(service, s3_url, local_path, quiet)
|
||||
|
||||
else:
|
||||
click.echo("S3-to-S3 recursive copy not yet implemented", err=True)
|
||||
sys.exit(1)
|
||||
elif operation == "copy":
|
||||
# S3-to-S3 recursive copy with migration support
|
||||
migrate_s3_to_s3(
|
||||
service,
|
||||
source,
|
||||
dest,
|
||||
exclude=exclude,
|
||||
include=include,
|
||||
quiet=quiet,
|
||||
no_delta=no_delta,
|
||||
max_ratio=max_ratio,
|
||||
dry_run=False,
|
||||
skip_confirm=True, # Don't prompt for cp command
|
||||
preserve_prefix=True, # Always preserve prefix for cp -r
|
||||
region_override=False, # cp command doesn't track region override explicitly
|
||||
)
|
||||
|
||||
@@ -7,9 +7,11 @@ import shutil
|
||||
import sys
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import click
|
||||
|
||||
from ... import __version__
|
||||
from ...adapters import (
|
||||
NoopMetricsAdapter,
|
||||
S3StorageAdapter,
|
||||
@@ -49,7 +51,7 @@ def create_service(
|
||||
# Register cleanup handler to remove cache on exit
|
||||
atexit.register(lambda: shutil.rmtree(cache_dir, ignore_errors=True))
|
||||
|
||||
# Set AWS environment variables if provided
|
||||
# Set AWS environment variables if provided (for compatibility with other AWS tools)
|
||||
if endpoint_url:
|
||||
os.environ["AWS_ENDPOINT_URL"] = endpoint_url
|
||||
if region:
|
||||
@@ -57,9 +59,14 @@ def create_service(
|
||||
if profile:
|
||||
os.environ["AWS_PROFILE"] = profile
|
||||
|
||||
# Build boto3_kwargs for explicit parameter passing (preferred over env vars)
|
||||
boto3_kwargs: dict[str, Any] = {}
|
||||
if region:
|
||||
boto3_kwargs["region_name"] = region
|
||||
|
||||
# Create adapters
|
||||
hasher = Sha256Adapter()
|
||||
storage = S3StorageAdapter(endpoint_url=endpoint_url)
|
||||
storage = S3StorageAdapter(endpoint_url=endpoint_url, boto3_kwargs=boto3_kwargs)
|
||||
diff = XdeltaAdapter()
|
||||
|
||||
# SECURITY: Configurable cache with encryption and backend selection
|
||||
@@ -113,8 +120,23 @@ def create_service(
|
||||
)
|
||||
|
||||
|
||||
def _version_callback(ctx: click.Context, param: click.Parameter, value: bool) -> None:
|
||||
"""Callback for --version option."""
|
||||
if value:
|
||||
click.echo(f"deltaglider {__version__}")
|
||||
ctx.exit(0)
|
||||
|
||||
|
||||
@click.group()
|
||||
@click.option("--debug", is_flag=True, help="Enable debug logging")
|
||||
@click.option(
|
||||
"--version",
|
||||
is_flag=True,
|
||||
is_eager=True,
|
||||
expose_value=False,
|
||||
callback=_version_callback,
|
||||
help="Show version and exit",
|
||||
)
|
||||
@click.pass_context
|
||||
def cli(ctx: click.Context, debug: bool) -> None:
|
||||
"""DeltaGlider - Delta-aware S3 file storage wrapper."""
|
||||
@@ -172,9 +194,6 @@ def cp(
|
||||
|
||||
# Handle recursive operations for directories
|
||||
if recursive:
|
||||
if operation == "copy":
|
||||
click.echo("S3-to-S3 recursive copy not yet implemented", err=True)
|
||||
sys.exit(1)
|
||||
handle_recursive(
|
||||
service, source, dest, recursive, exclude, include, quiet, no_delta, max_ratio
|
||||
)
|
||||
@@ -196,7 +215,7 @@ def cp(
|
||||
download_file(service, source, local_path, quiet)
|
||||
|
||||
elif operation == "copy":
|
||||
copy_s3_to_s3(service, source, dest, quiet)
|
||||
copy_s3_to_s3(service, source, dest, quiet, max_ratio, no_delta)
|
||||
|
||||
except ValueError as e:
|
||||
click.echo(f"Error: {e}", err=True)
|
||||
@@ -640,6 +659,178 @@ def verify(service: DeltaService, s3_url: str) -> None:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
@cli.command()
|
||||
@click.argument("source")
|
||||
@click.argument("dest")
|
||||
@click.option("--exclude", help="Exclude files matching pattern")
|
||||
@click.option("--include", help="Include only files matching pattern")
|
||||
@click.option("--quiet", "-q", is_flag=True, help="Suppress output")
|
||||
@click.option("--no-delta", is_flag=True, help="Disable delta compression")
|
||||
@click.option("--max-ratio", type=float, help="Max delta/file ratio (default: 0.5)")
|
||||
@click.option("--dry-run", is_flag=True, help="Show what would be migrated without migrating")
|
||||
@click.option("--yes", "-y", is_flag=True, help="Skip confirmation prompt")
|
||||
@click.option(
|
||||
"--no-preserve-prefix", is_flag=True, help="Don't preserve source prefix in destination"
|
||||
)
|
||||
@click.option("--endpoint-url", help="Override S3 endpoint URL")
|
||||
@click.option("--region", help="AWS region")
|
||||
@click.option("--profile", help="AWS profile to use")
|
||||
@click.pass_obj
|
||||
def migrate(
|
||||
service: DeltaService,
|
||||
source: str,
|
||||
dest: str,
|
||||
exclude: str | None,
|
||||
include: str | None,
|
||||
quiet: bool,
|
||||
no_delta: bool,
|
||||
max_ratio: float | None,
|
||||
dry_run: bool,
|
||||
yes: bool,
|
||||
no_preserve_prefix: bool,
|
||||
endpoint_url: str | None,
|
||||
region: str | None,
|
||||
profile: str | None,
|
||||
) -> None:
|
||||
"""Migrate S3 bucket/prefix to DeltaGlider-compressed storage.
|
||||
|
||||
This command facilitates the migration of existing S3 objects to another bucket
|
||||
with DeltaGlider compression. It supports:
|
||||
- Resume capability: Only copies files that don't exist in destination
|
||||
- Progress tracking: Shows migration progress
|
||||
- Confirmation prompt: Shows file count before starting (use --yes to skip)
|
||||
- Prefix preservation: By default, source prefix is preserved in destination
|
||||
|
||||
When migrating a prefix, the source prefix name is preserved by default:
|
||||
s3://src/prefix1/ → s3://dest/ creates s3://dest/prefix1/
|
||||
s3://src/a/b/c/ → s3://dest/x/ creates s3://dest/x/c/
|
||||
|
||||
Use --no-preserve-prefix to disable this behavior:
|
||||
s3://src/prefix1/ → s3://dest/ creates s3://dest/ (files at root)
|
||||
|
||||
Examples:
|
||||
deltaglider migrate s3://old-bucket/ s3://new-bucket/
|
||||
deltaglider migrate s3://old-bucket/data/ s3://new-bucket/
|
||||
deltaglider migrate --no-preserve-prefix s3://src/v1/ s3://dest/
|
||||
deltaglider migrate --dry-run s3://old-bucket/ s3://new-bucket/
|
||||
deltaglider migrate --yes --quiet s3://old-bucket/ s3://new-bucket/
|
||||
"""
|
||||
from .aws_compat import is_s3_path, migrate_s3_to_s3
|
||||
|
||||
# Recreate service with AWS parameters if provided
|
||||
if endpoint_url or region or profile:
|
||||
service = create_service(
|
||||
log_level=os.environ.get("DG_LOG_LEVEL", "INFO"),
|
||||
endpoint_url=endpoint_url,
|
||||
region=region,
|
||||
profile=profile,
|
||||
)
|
||||
|
||||
try:
|
||||
# Validate both paths are S3
|
||||
if not is_s3_path(source) or not is_s3_path(dest):
|
||||
click.echo("Error: Both source and destination must be S3 paths", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
# Perform migration
|
||||
migrate_s3_to_s3(
|
||||
service,
|
||||
source,
|
||||
dest,
|
||||
exclude=exclude,
|
||||
include=include,
|
||||
quiet=quiet,
|
||||
no_delta=no_delta,
|
||||
max_ratio=max_ratio,
|
||||
dry_run=dry_run,
|
||||
skip_confirm=yes,
|
||||
preserve_prefix=not no_preserve_prefix,
|
||||
region_override=region is not None, # True if user explicitly specified --region
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
click.echo(f"Migration failed: {e}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
@cli.command()
|
||||
@click.argument("bucket")
|
||||
@click.option("--detailed", is_flag=True, help="Fetch detailed compression metrics (slower)")
|
||||
@click.option("--json", "output_json", is_flag=True, help="Output in JSON format")
|
||||
@click.pass_obj
|
||||
def stats(service: DeltaService, bucket: str, detailed: bool, output_json: bool) -> None:
|
||||
"""Get bucket statistics and compression metrics.
|
||||
|
||||
BUCKET can be specified as:
|
||||
- s3://bucket-name/
|
||||
- s3://bucket-name
|
||||
- bucket-name
|
||||
"""
|
||||
from ...client import DeltaGliderClient
|
||||
|
||||
try:
|
||||
# Parse bucket from S3 URL if needed
|
||||
if bucket.startswith("s3://"):
|
||||
# Remove s3:// prefix and any trailing slashes
|
||||
bucket = bucket[5:].rstrip("/")
|
||||
# Extract just the bucket name (first path component)
|
||||
bucket = bucket.split("/")[0] if "/" in bucket else bucket
|
||||
|
||||
if not bucket:
|
||||
click.echo("Error: Invalid bucket name", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
# Create client from service
|
||||
client = DeltaGliderClient(service=service)
|
||||
|
||||
# Get bucket stats
|
||||
bucket_stats = client.get_bucket_stats(bucket, detailed_stats=detailed)
|
||||
|
||||
if output_json:
|
||||
# JSON output
|
||||
output = {
|
||||
"bucket": bucket_stats.bucket,
|
||||
"object_count": bucket_stats.object_count,
|
||||
"total_size": bucket_stats.total_size,
|
||||
"compressed_size": bucket_stats.compressed_size,
|
||||
"space_saved": bucket_stats.space_saved,
|
||||
"average_compression_ratio": bucket_stats.average_compression_ratio,
|
||||
"delta_objects": bucket_stats.delta_objects,
|
||||
"direct_objects": bucket_stats.direct_objects,
|
||||
}
|
||||
click.echo(json.dumps(output, indent=2))
|
||||
else:
|
||||
# Human-readable output
|
||||
def format_bytes(size: float) -> str:
|
||||
"""Format bytes to human-readable size."""
|
||||
for unit in ["B", "KB", "MB", "GB", "TB"]:
|
||||
if size < 1024.0:
|
||||
return f"{size:.2f} {unit}"
|
||||
size /= 1024.0
|
||||
return f"{size:.2f} PB"
|
||||
|
||||
click.echo(f"Bucket Statistics: {bucket_stats.bucket}")
|
||||
click.echo(f"{'=' * 60}")
|
||||
click.echo(f"Total Objects: {bucket_stats.object_count:,}")
|
||||
click.echo(f" Delta Objects: {bucket_stats.delta_objects:,}")
|
||||
click.echo(f" Direct Objects: {bucket_stats.direct_objects:,}")
|
||||
click.echo("")
|
||||
click.echo(
|
||||
f"Original Size: {format_bytes(bucket_stats.total_size)} ({bucket_stats.total_size:,} bytes)"
|
||||
)
|
||||
click.echo(
|
||||
f"Compressed Size: {format_bytes(bucket_stats.compressed_size)} ({bucket_stats.compressed_size:,} bytes)"
|
||||
)
|
||||
click.echo(
|
||||
f"Space Saved: {format_bytes(bucket_stats.space_saved)} ({bucket_stats.space_saved:,} bytes)"
|
||||
)
|
||||
click.echo(f"Compression Ratio: {bucket_stats.average_compression_ratio:.1%}")
|
||||
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {e}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
"""Main entry point."""
|
||||
cli()
|
||||
|
||||
@@ -63,6 +63,52 @@ class DeltaGliderClient:
|
||||
self.service = service
|
||||
self.endpoint_url = endpoint_url
|
||||
self._multipart_uploads: dict[str, Any] = {} # Track multipart uploads
|
||||
# Session-scoped bucket statistics cache (cleared with the client lifecycle)
|
||||
self._bucket_stats_cache: dict[str, dict[bool, BucketStats]] = {}
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Internal helpers
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def _invalidate_bucket_stats_cache(self, bucket: str | None = None) -> None:
|
||||
"""Invalidate cached bucket statistics."""
|
||||
if bucket is None:
|
||||
self._bucket_stats_cache.clear()
|
||||
else:
|
||||
self._bucket_stats_cache.pop(bucket, None)
|
||||
|
||||
def _store_bucket_stats_cache(
|
||||
self,
|
||||
bucket: str,
|
||||
detailed_stats: bool,
|
||||
stats: BucketStats,
|
||||
) -> None:
|
||||
"""Store bucket statistics in the session cache."""
|
||||
bucket_cache = self._bucket_stats_cache.setdefault(bucket, {})
|
||||
bucket_cache[detailed_stats] = stats
|
||||
# Detailed stats are a superset of quick stats; reuse them for quick calls.
|
||||
if detailed_stats:
|
||||
bucket_cache[False] = stats
|
||||
|
||||
def _get_cached_bucket_stats(self, bucket: str, detailed_stats: bool) -> BucketStats | None:
|
||||
"""Retrieve cached stats for a bucket, preferring detailed metrics when available."""
|
||||
bucket_cache = self._bucket_stats_cache.get(bucket)
|
||||
if not bucket_cache:
|
||||
return None
|
||||
if detailed_stats:
|
||||
return bucket_cache.get(True)
|
||||
return bucket_cache.get(False) or bucket_cache.get(True)
|
||||
|
||||
def _get_cached_bucket_stats_for_listing(self, bucket: str) -> tuple[BucketStats | None, bool]:
|
||||
"""Return best cached stats for bucket listings."""
|
||||
bucket_cache = self._bucket_stats_cache.get(bucket)
|
||||
if not bucket_cache:
|
||||
return (None, False)
|
||||
if True in bucket_cache:
|
||||
return (bucket_cache[True], True)
|
||||
if False in bucket_cache:
|
||||
return (bucket_cache[False], False)
|
||||
return (None, False)
|
||||
|
||||
# ============================================================================
|
||||
# Boto3-compatible APIs (matches S3 client interface)
|
||||
@@ -171,13 +217,15 @@ class DeltaGliderClient:
|
||||
}
|
||||
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
response = cast(
|
||||
dict[str, Any],
|
||||
build_put_response(
|
||||
etag=f'"{sha256_hash}"',
|
||||
deltaglider_info=deltaglider_info,
|
||||
),
|
||||
)
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
finally:
|
||||
# Clean up temp file
|
||||
if tmp_path.exists():
|
||||
@@ -418,7 +466,7 @@ class DeltaGliderClient:
|
||||
deltaglider_info["DependentDeltas"] = dependent_deltas
|
||||
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
response = cast(
|
||||
dict[str, Any],
|
||||
build_delete_response(
|
||||
delete_marker=False,
|
||||
@@ -426,6 +474,8 @@ class DeltaGliderClient:
|
||||
deltaglider_info=deltaglider_info,
|
||||
),
|
||||
)
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
|
||||
def delete_objects(
|
||||
self,
|
||||
@@ -502,6 +552,7 @@ class DeltaGliderClient:
|
||||
}
|
||||
|
||||
response["ResponseMetadata"] = {"HTTPStatusCode": 200}
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
|
||||
def delete_objects_recursive(
|
||||
@@ -627,6 +678,7 @@ class DeltaGliderClient:
|
||||
if single_details:
|
||||
response["DeltaGliderInfo"]["SingleDeletes"] = single_details # type: ignore[index]
|
||||
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
|
||||
def head_object(
|
||||
@@ -703,7 +755,7 @@ class DeltaGliderClient:
|
||||
is_delta = summary.delta_size is not None
|
||||
stored_size = summary.delta_size if is_delta else summary.file_size
|
||||
|
||||
return UploadSummary(
|
||||
upload_summary = UploadSummary(
|
||||
operation=summary.operation,
|
||||
bucket=summary.bucket,
|
||||
key=summary.key,
|
||||
@@ -712,6 +764,8 @@ class DeltaGliderClient:
|
||||
is_delta=is_delta,
|
||||
delta_ratio=summary.delta_ratio or 0.0,
|
||||
)
|
||||
self._invalidate_bucket_stats_cache(bucket)
|
||||
return upload_summary
|
||||
|
||||
def download(self, s3_url: str, output_path: str | Path) -> None:
|
||||
"""Download and reconstruct a file from S3.
|
||||
@@ -938,7 +992,12 @@ class DeltaGliderClient:
|
||||
stats = client.get_bucket_stats('releases', detailed_stats=True)
|
||||
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
|
||||
"""
|
||||
cached = self._get_cached_bucket_stats(bucket, detailed_stats)
|
||||
if cached:
|
||||
return cached
|
||||
|
||||
result: BucketStats = _get_bucket_stats(self, bucket, detailed_stats)
|
||||
self._store_bucket_stats_cache(bucket, detailed_stats, result)
|
||||
return result
|
||||
|
||||
def generate_presigned_url(
|
||||
@@ -1010,7 +1069,9 @@ class DeltaGliderClient:
|
||||
... CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
|
||||
... )
|
||||
"""
|
||||
return _create_bucket(self, Bucket, CreateBucketConfiguration, **kwargs)
|
||||
response = _create_bucket(self, Bucket, CreateBucketConfiguration, **kwargs)
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
|
||||
def delete_bucket(
|
||||
self,
|
||||
@@ -1032,7 +1093,9 @@ class DeltaGliderClient:
|
||||
>>> client = create_client()
|
||||
>>> client.delete_bucket(Bucket='my-bucket')
|
||||
"""
|
||||
return _delete_bucket(self, Bucket, **kwargs)
|
||||
response = _delete_bucket(self, Bucket, **kwargs)
|
||||
self._invalidate_bucket_stats_cache(Bucket)
|
||||
return response
|
||||
|
||||
def list_buckets(self, **kwargs: Any) -> dict[str, Any]:
|
||||
"""List all S3 buckets (boto3-compatible).
|
||||
@@ -1061,6 +1124,87 @@ class DeltaGliderClient:
|
||||
tags[key] = value
|
||||
return tags
|
||||
|
||||
# ============================================================================
|
||||
# Cache Management APIs (DeltaGlider Extensions)
|
||||
# ============================================================================
|
||||
|
||||
def clear_cache(self) -> None:
|
||||
"""Clear all cached reference files.
|
||||
|
||||
Forcibly removes all cached data from memory or disk. This is essential for
|
||||
long-running applications that need to:
|
||||
- Free memory/disk space
|
||||
- Invalidate cache after configuration changes
|
||||
- Ensure fresh data fetch from S3
|
||||
- Clean up after tests
|
||||
|
||||
**Important for Long-Running Applications**:
|
||||
Unlike the CLI which cleans up cache on exit, programmatic SDK usage
|
||||
requires manual cache management. Call this method periodically or when:
|
||||
- Application runs for extended periods (hours/days)
|
||||
- Memory usage is high
|
||||
- Configuration changes (endpoint, credentials, encryption key)
|
||||
- Testing scenarios requiring clean state
|
||||
|
||||
**Effects**:
|
||||
- Filesystem cache: Removes all files from cache directory
|
||||
- Memory cache: Clears all in-memory data
|
||||
- Encrypted cache: Clears encryption key mappings
|
||||
- Next upload will re-fetch reference from S3
|
||||
|
||||
**Example - Long-Running Service**:
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
import schedule
|
||||
import time
|
||||
|
||||
client = create_client()
|
||||
|
||||
def upload_task():
|
||||
client.put_object(Bucket='releases', Key='app.zip', Body=open('app.zip', 'rb'))
|
||||
|
||||
def cleanup_task():
|
||||
client.clear_cache() # Free resources every hour
|
||||
print("Cache cleared")
|
||||
|
||||
# Upload every 10 minutes
|
||||
schedule.every(10).minutes.do(upload_task)
|
||||
|
||||
# Clear cache every hour
|
||||
schedule.every().hour.do(cleanup_task)
|
||||
|
||||
while True:
|
||||
schedule.run_pending()
|
||||
time.sleep(1)
|
||||
```
|
||||
|
||||
**Example - Test Cleanup**:
|
||||
```python
|
||||
def test_upload():
|
||||
client = create_client()
|
||||
try:
|
||||
client.put_object(Bucket='test', Key='file.zip', Body=b'data')
|
||||
finally:
|
||||
client.clear_cache() # Ensure clean state for next test
|
||||
```
|
||||
|
||||
**Example - After Configuration Change**:
|
||||
```python
|
||||
client = create_client(endpoint_url='http://minio1:9000')
|
||||
client.put_object(Bucket='bucket', Key='file.zip', Body=b'data')
|
||||
|
||||
# Switch to different endpoint
|
||||
client.clear_cache() # Clear cache from old endpoint
|
||||
client = create_client(endpoint_url='http://minio2:9000')
|
||||
```
|
||||
|
||||
See Also:
|
||||
- `evict_cache()`: Remove specific cached reference
|
||||
- docs/CACHE_MANAGEMENT.md: Complete cache management guide
|
||||
"""
|
||||
self._invalidate_bucket_stats_cache()
|
||||
self.service.cache.clear()
|
||||
|
||||
|
||||
def create_client(
|
||||
endpoint_url: str | None = None,
|
||||
|
||||
@@ -138,10 +138,32 @@ def list_buckets(
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
response = storage_adapter.client.list_buckets()
|
||||
raw_response = storage_adapter.client.list_buckets()
|
||||
|
||||
buckets: list[dict[str, Any]] = []
|
||||
for bucket_entry in raw_response.get("Buckets", []):
|
||||
bucket_data = dict(bucket_entry)
|
||||
name = bucket_data.get("Name")
|
||||
if isinstance(name, str) and name:
|
||||
cached_stats, detailed = client._get_cached_bucket_stats_for_listing(name)
|
||||
if cached_stats is not None:
|
||||
bucket_data["DeltaGliderStats"] = {
|
||||
"Cached": True,
|
||||
"Detailed": detailed,
|
||||
"ObjectCount": cached_stats.object_count,
|
||||
"TotalSize": cached_stats.total_size,
|
||||
"CompressedSize": cached_stats.compressed_size,
|
||||
"SpaceSaved": cached_stats.space_saved,
|
||||
"AverageCompressionRatio": cached_stats.average_compression_ratio,
|
||||
"DeltaObjects": cached_stats.delta_objects,
|
||||
"DirectObjects": cached_stats.direct_objects,
|
||||
}
|
||||
|
||||
buckets.append(bucket_data)
|
||||
|
||||
return {
|
||||
"Buckets": response.get("Buckets", []),
|
||||
"Owner": response.get("Owner", {}),
|
||||
"Buckets": buckets,
|
||||
"Owner": raw_response.get("Owner", {}),
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
|
||||
@@ -89,82 +89,188 @@ def get_bucket_stats(
|
||||
stats = client.get_bucket_stats('releases', detailed_stats=True)
|
||||
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
|
||||
"""
|
||||
# List all objects with smart metadata fetching
|
||||
# List all objects DIRECTLY from storage adapter to see reference.bin files
|
||||
# (client.list_objects filters them out for user-facing operations)
|
||||
all_objects = []
|
||||
continuation_token = None
|
||||
start_after = None
|
||||
|
||||
import concurrent.futures
|
||||
|
||||
# Phase 1: Collect all objects and identify delta files
|
||||
raw_objects = []
|
||||
delta_keys = []
|
||||
|
||||
while True:
|
||||
response = client.list_objects(
|
||||
Bucket=bucket,
|
||||
MaxKeys=1000,
|
||||
ContinuationToken=continuation_token,
|
||||
FetchMetadata=detailed_stats, # Only fetch metadata if detailed stats requested
|
||||
# Call storage adapter directly to see ALL files including reference.bin
|
||||
response = client.service.storage.list_objects(
|
||||
bucket=bucket,
|
||||
prefix="",
|
||||
max_keys=1000,
|
||||
start_after=start_after,
|
||||
)
|
||||
|
||||
# Extract S3Objects from response (with Metadata containing DeltaGlider info)
|
||||
for obj_dict in response["Contents"]:
|
||||
# Convert dict back to ObjectInfo for backward compatibility with stats calculation
|
||||
metadata = obj_dict.get("Metadata", {})
|
||||
# Parse compression ratio safely (handle "unknown" value)
|
||||
compression_ratio_str = metadata.get("deltaglider-compression-ratio", "0.0")
|
||||
try:
|
||||
compression_ratio = (
|
||||
float(compression_ratio_str) if compression_ratio_str != "unknown" else 0.0
|
||||
)
|
||||
except ValueError:
|
||||
compression_ratio = 0.0
|
||||
# Collect objects and identify delta files
|
||||
for obj_dict in response.get("objects", []):
|
||||
raw_objects.append(obj_dict)
|
||||
if obj_dict["key"].endswith(".delta"):
|
||||
delta_keys.append(obj_dict["key"])
|
||||
|
||||
all_objects.append(
|
||||
ObjectInfo(
|
||||
key=obj_dict["Key"],
|
||||
size=obj_dict["Size"],
|
||||
last_modified=obj_dict.get("LastModified", ""),
|
||||
etag=obj_dict.get("ETag"),
|
||||
storage_class=obj_dict.get("StorageClass", "STANDARD"),
|
||||
original_size=int(metadata.get("deltaglider-original-size", obj_dict["Size"])),
|
||||
compressed_size=obj_dict["Size"],
|
||||
is_delta=metadata.get("deltaglider-is-delta", "false") == "true",
|
||||
compression_ratio=compression_ratio,
|
||||
reference_key=metadata.get("deltaglider-reference-key"),
|
||||
)
|
||||
)
|
||||
|
||||
if not response.get("IsTruncated"):
|
||||
if not response.get("is_truncated"):
|
||||
break
|
||||
|
||||
continuation_token = response.get("NextContinuationToken")
|
||||
start_after = response.get("next_continuation_token")
|
||||
|
||||
# Calculate statistics
|
||||
total_size = 0
|
||||
compressed_size = 0
|
||||
# Phase 2: Fetch metadata for delta files in parallel (10x faster)
|
||||
metadata_map = {}
|
||||
if delta_keys:
|
||||
client.service.logger.info(
|
||||
f"Fetching metadata for {len(delta_keys)} delta files in parallel..."
|
||||
)
|
||||
|
||||
def fetch_metadata(key: str) -> tuple[str, dict[str, Any] | None]:
|
||||
try:
|
||||
obj_head = client.service.storage.head(f"{bucket}/{key}")
|
||||
if obj_head and obj_head.metadata:
|
||||
return key, obj_head.metadata
|
||||
except Exception as e:
|
||||
client.service.logger.debug(f"Failed to fetch metadata for {key}: {e}")
|
||||
return key, None
|
||||
|
||||
with concurrent.futures.ThreadPoolExecutor(
|
||||
max_workers=min(10, len(delta_keys))
|
||||
) as executor:
|
||||
futures = [executor.submit(fetch_metadata, key) for key in delta_keys]
|
||||
for future in concurrent.futures.as_completed(futures):
|
||||
key, metadata = future.result()
|
||||
if metadata:
|
||||
metadata_map[key] = metadata
|
||||
|
||||
# Phase 3: Build ObjectInfo list with metadata
|
||||
for obj_dict in raw_objects:
|
||||
key = obj_dict["key"]
|
||||
size = obj_dict["size"]
|
||||
is_delta = key.endswith(".delta")
|
||||
|
||||
# Get metadata from our parallel fetch
|
||||
metadata = metadata_map.get(key, {})
|
||||
|
||||
# Parse compression ratio and original size
|
||||
compression_ratio = 0.0
|
||||
original_size = size
|
||||
if is_delta and metadata:
|
||||
try:
|
||||
ratio_str = metadata.get("compression_ratio", "0.0")
|
||||
compression_ratio = float(ratio_str) if ratio_str != "unknown" else 0.0
|
||||
except (ValueError, TypeError):
|
||||
compression_ratio = 0.0
|
||||
try:
|
||||
original_size = int(metadata.get("file_size", size))
|
||||
client.service.logger.debug(f"Delta {key}: using original_size={original_size}")
|
||||
except (ValueError, TypeError):
|
||||
original_size = size
|
||||
|
||||
all_objects.append(
|
||||
ObjectInfo(
|
||||
key=key,
|
||||
size=size,
|
||||
last_modified=obj_dict.get("last_modified", ""),
|
||||
etag=obj_dict.get("etag"),
|
||||
storage_class=obj_dict.get("storage_class", "STANDARD"),
|
||||
original_size=original_size,
|
||||
compressed_size=size,
|
||||
is_delta=is_delta,
|
||||
compression_ratio=compression_ratio,
|
||||
reference_key=metadata.get("ref_key") if metadata else None,
|
||||
)
|
||||
)
|
||||
|
||||
# Calculate statistics - COUNT ALL FILES
|
||||
total_original_size = 0
|
||||
total_compressed_size = 0
|
||||
delta_count = 0
|
||||
direct_count = 0
|
||||
reference_files = {} # Track all reference.bin files and their deltaspaces
|
||||
|
||||
# First pass: identify what we have
|
||||
for obj in all_objects:
|
||||
# Skip reference.bin files - they are internal implementation details
|
||||
# and their size is already accounted for in delta metadata
|
||||
if obj.key.endswith("/reference.bin") or obj.key == "reference.bin":
|
||||
# Extract deltaspace prefix
|
||||
if "/" in obj.key:
|
||||
deltaspace = obj.key.rsplit("/reference.bin", 1)[0]
|
||||
else:
|
||||
deltaspace = "" # Root level reference.bin
|
||||
reference_files[deltaspace] = obj.size
|
||||
elif obj.is_delta:
|
||||
delta_count += 1
|
||||
else:
|
||||
direct_count += 1
|
||||
|
||||
# Second pass: calculate sizes
|
||||
for obj in all_objects:
|
||||
# Skip reference.bin in this pass (we'll handle it separately)
|
||||
if obj.key.endswith("/reference.bin") or obj.key == "reference.bin":
|
||||
continue
|
||||
|
||||
compressed_size += obj.size
|
||||
|
||||
if obj.is_delta:
|
||||
delta_count += 1
|
||||
# Use actual original size if we have it, otherwise estimate
|
||||
total_size += obj.original_size or obj.size
|
||||
# Delta file: original from metadata, compressed = delta size
|
||||
if obj.original_size and obj.original_size != obj.size:
|
||||
client.service.logger.debug(
|
||||
f"Delta {obj.key}: using original_size={obj.original_size}"
|
||||
)
|
||||
total_original_size += obj.original_size
|
||||
else:
|
||||
client.service.logger.warning(
|
||||
f"Delta {obj.key}: no original_size, using compressed size={obj.size}"
|
||||
)
|
||||
total_original_size += obj.size
|
||||
total_compressed_size += obj.size
|
||||
else:
|
||||
direct_count += 1
|
||||
# For non-delta files, original equals compressed
|
||||
total_size += obj.size
|
||||
# Direct files: original = compressed = actual size
|
||||
total_original_size += obj.size
|
||||
total_compressed_size += obj.size
|
||||
|
||||
space_saved = total_size - compressed_size
|
||||
avg_ratio = (space_saved / total_size) if total_size > 0 else 0.0
|
||||
# Handle reference.bin files
|
||||
total_reference_size = sum(reference_files.values())
|
||||
|
||||
if delta_count > 0 and total_reference_size > 0:
|
||||
# Add all reference.bin files to compressed size
|
||||
total_compressed_size += total_reference_size
|
||||
client.service.logger.info(
|
||||
f"Including {len(reference_files)} reference.bin file(s) ({total_reference_size:,} bytes) in compressed size"
|
||||
)
|
||||
elif delta_count == 0 and total_reference_size > 0:
|
||||
# ORPHANED REFERENCE WARNING
|
||||
waste_mb = total_reference_size / 1024 / 1024
|
||||
client.service.logger.warning(
|
||||
f"\n{'=' * 60}\n"
|
||||
f"WARNING: ORPHANED REFERENCE FILE(S) DETECTED!\n"
|
||||
f"{'=' * 60}\n"
|
||||
f"Found {len(reference_files)} reference.bin file(s) totaling {total_reference_size:,} bytes ({waste_mb:.2f} MB)\n"
|
||||
f"but NO delta files are using them.\n"
|
||||
f"\n"
|
||||
f"This wastes {waste_mb:.2f} MB of storage!\n"
|
||||
f"\n"
|
||||
f"Orphaned reference files:\n"
|
||||
)
|
||||
for deltaspace, size in reference_files.items():
|
||||
path = f"{deltaspace}/reference.bin" if deltaspace else "reference.bin"
|
||||
client.service.logger.warning(f" - s3://{bucket}/{path} ({size:,} bytes)")
|
||||
|
||||
client.service.logger.warning("\nConsider removing these orphaned files:\n")
|
||||
for deltaspace in reference_files:
|
||||
path = f"{deltaspace}/reference.bin" if deltaspace else "reference.bin"
|
||||
client.service.logger.warning(f" aws s3 rm s3://{bucket}/{path}")
|
||||
|
||||
client.service.logger.warning(f"{'=' * 60}")
|
||||
|
||||
space_saved = total_original_size - total_compressed_size
|
||||
avg_ratio = (space_saved / total_original_size) if total_original_size > 0 else 0.0
|
||||
|
||||
return BucketStats(
|
||||
bucket=bucket,
|
||||
object_count=len(all_objects),
|
||||
total_size=total_size,
|
||||
compressed_size=compressed_size,
|
||||
object_count=delta_count + direct_count, # Only count user files, not reference.bin
|
||||
total_size=total_original_size,
|
||||
compressed_size=total_compressed_size,
|
||||
space_saved=space_saved,
|
||||
average_compression_ratio=avg_ratio,
|
||||
delta_objects=delta_count,
|
||||
|
||||
@@ -93,16 +93,27 @@ class DeltaService:
|
||||
return any(name_lower.endswith(ext) for ext in self.delta_extensions)
|
||||
|
||||
def put(
|
||||
self, local_file: Path, delta_space: DeltaSpace, max_ratio: float | None = None
|
||||
self,
|
||||
local_file: Path,
|
||||
delta_space: DeltaSpace,
|
||||
max_ratio: float | None = None,
|
||||
override_name: str | None = None,
|
||||
) -> PutSummary:
|
||||
"""Upload file as reference or delta (for archive files) or directly (for other files)."""
|
||||
"""Upload file as reference or delta (for archive files) or directly (for other files).
|
||||
|
||||
Args:
|
||||
local_file: Path to the local file to upload
|
||||
delta_space: DeltaSpace (bucket + prefix) for the upload
|
||||
max_ratio: Maximum acceptable delta/file ratio (default: service max_ratio)
|
||||
override_name: Optional name to use instead of local_file.name (useful for S3-to-S3 copies)
|
||||
"""
|
||||
if max_ratio is None:
|
||||
max_ratio = self.max_ratio
|
||||
|
||||
start_time = self.clock.now()
|
||||
file_size = local_file.stat().st_size
|
||||
file_sha256 = self.hasher.sha256(local_file)
|
||||
original_name = local_file.name
|
||||
original_name = override_name if override_name else local_file.name
|
||||
|
||||
self.logger.info(
|
||||
"Starting put operation",
|
||||
|
||||
@@ -42,3 +42,18 @@ class CachePort(Protocol):
|
||||
def evict(self, bucket: str, prefix: str) -> None:
|
||||
"""Remove cached reference."""
|
||||
...
|
||||
|
||||
def clear(self) -> None:
|
||||
"""Clear all cached references.
|
||||
|
||||
This method forcibly removes all cached data, useful for:
|
||||
- Long-running applications that need to free memory
|
||||
- Test cleanup
|
||||
- Manual cache invalidation
|
||||
- Ensuring fresh data fetch
|
||||
|
||||
Note: For filesystem caches, this removes all files in the cache directory.
|
||||
For memory caches, this clears all in-memory data.
|
||||
For encrypted caches, this also clears encryption key mappings.
|
||||
"""
|
||||
...
|
||||
|
||||
@@ -1,11 +1,13 @@
|
||||
"""Tests for bucket management APIs."""
|
||||
|
||||
from typing import Any
|
||||
from unittest.mock import Mock
|
||||
|
||||
import pytest
|
||||
|
||||
from deltaglider.app.cli.main import create_service
|
||||
from deltaglider.client import DeltaGliderClient
|
||||
from deltaglider.client_models import BucketStats
|
||||
|
||||
|
||||
class TestBucketManagement:
|
||||
@@ -123,6 +125,47 @@ class TestBucketManagement:
|
||||
assert response["Buckets"] == []
|
||||
assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
|
||||
|
||||
def test_list_buckets_includes_cached_stats(self):
|
||||
"""Bucket list should merge cached stats when available."""
|
||||
service = create_service()
|
||||
mock_storage = Mock()
|
||||
service.storage = mock_storage
|
||||
|
||||
mock_boto3_client = Mock()
|
||||
mock_boto3_client.list_buckets.return_value = {
|
||||
"Buckets": [
|
||||
{"Name": "bucket1", "CreationDate": "2025-01-01T00:00:00Z"},
|
||||
{"Name": "bucket2", "CreationDate": "2025-01-02T00:00:00Z"},
|
||||
],
|
||||
"Owner": {"DisplayName": "test-user", "ID": "12345"},
|
||||
}
|
||||
mock_storage.client = mock_boto3_client
|
||||
|
||||
client = DeltaGliderClient(service)
|
||||
|
||||
cached_stats = BucketStats(
|
||||
bucket="bucket1",
|
||||
object_count=10,
|
||||
total_size=1000,
|
||||
compressed_size=600,
|
||||
space_saved=400,
|
||||
average_compression_ratio=0.4,
|
||||
delta_objects=6,
|
||||
direct_objects=4,
|
||||
)
|
||||
client._store_bucket_stats_cache("bucket1", detailed_stats=True, stats=cached_stats)
|
||||
|
||||
response = client.list_buckets()
|
||||
|
||||
bucket1 = next(bucket for bucket in response["Buckets"] if bucket["Name"] == "bucket1")
|
||||
assert bucket1["DeltaGliderStats"]["Cached"] is True
|
||||
assert bucket1["DeltaGliderStats"]["Detailed"] is True
|
||||
assert bucket1["DeltaGliderStats"]["ObjectCount"] == cached_stats.object_count
|
||||
assert bucket1["DeltaGliderStats"]["TotalSize"] == cached_stats.total_size
|
||||
|
||||
bucket2 = next(bucket for bucket in response["Buckets"] if bucket["Name"] == "bucket2")
|
||||
assert "DeltaGliderStats" not in bucket2
|
||||
|
||||
def test_delete_bucket_success(self):
|
||||
"""Test deleting a bucket successfully."""
|
||||
service = create_service()
|
||||
@@ -178,6 +221,69 @@ class TestBucketManagement:
|
||||
with pytest.raises(RuntimeError, match="Failed to delete bucket"):
|
||||
client.delete_bucket(Bucket="full-bucket")
|
||||
|
||||
def test_get_bucket_stats_caches_per_session(self, monkeypatch):
|
||||
"""Verify bucket stats are cached within the client session."""
|
||||
service = create_service()
|
||||
mock_storage = Mock()
|
||||
service.storage = mock_storage
|
||||
|
||||
mock_storage.client = Mock()
|
||||
|
||||
client = DeltaGliderClient(service)
|
||||
|
||||
quick_stats = BucketStats(
|
||||
bucket="bucket1",
|
||||
object_count=5,
|
||||
total_size=500,
|
||||
compressed_size=300,
|
||||
space_saved=200,
|
||||
average_compression_ratio=0.4,
|
||||
delta_objects=3,
|
||||
direct_objects=2,
|
||||
)
|
||||
detailed_stats = BucketStats(
|
||||
bucket="bucket1",
|
||||
object_count=5,
|
||||
total_size=520,
|
||||
compressed_size=300,
|
||||
space_saved=220,
|
||||
average_compression_ratio=0.423,
|
||||
delta_objects=3,
|
||||
direct_objects=2,
|
||||
)
|
||||
|
||||
call_count = {"value": 0}
|
||||
|
||||
def fake_get_bucket_stats(_: Any, bucket: str, detailed_stats_flag: bool) -> BucketStats:
|
||||
call_count["value"] += 1
|
||||
assert bucket == "bucket1"
|
||||
return detailed_stats if detailed_stats_flag else quick_stats
|
||||
|
||||
monkeypatch.setattr("deltaglider.client._get_bucket_stats", fake_get_bucket_stats)
|
||||
|
||||
# First call should invoke underlying function
|
||||
result_quick = client.get_bucket_stats("bucket1")
|
||||
assert result_quick is quick_stats
|
||||
assert call_count["value"] == 1
|
||||
|
||||
# Second quick call should hit cache
|
||||
assert client.get_bucket_stats("bucket1") is quick_stats
|
||||
assert call_count["value"] == 1
|
||||
|
||||
# Detailed call triggers new computation
|
||||
result_detailed = client.get_bucket_stats("bucket1", detailed_stats=True)
|
||||
assert result_detailed is detailed_stats
|
||||
assert call_count["value"] == 2
|
||||
|
||||
# Quick call after detailed uses detailed cached value (more accurate)
|
||||
assert client.get_bucket_stats("bucket1") is detailed_stats
|
||||
assert call_count["value"] == 2
|
||||
|
||||
# Clearing the cache should force recomputation
|
||||
client.clear_cache()
|
||||
assert client.get_bucket_stats("bucket1") is quick_stats
|
||||
assert call_count["value"] == 3
|
||||
|
||||
def test_bucket_methods_without_boto3_client(self):
|
||||
"""Test that bucket methods raise NotImplementedError when storage doesn't support it."""
|
||||
service = create_service()
|
||||
|
||||
271
tests/integration/test_s3_migration.py
Normal file
271
tests/integration/test_s3_migration.py
Normal file
@@ -0,0 +1,271 @@
|
||||
"""Test S3-to-S3 migration functionality."""
|
||||
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from deltaglider.app.cli.aws_compat import migrate_s3_to_s3
|
||||
from deltaglider.core import DeltaService
|
||||
from deltaglider.ports import ObjectHead
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_service():
|
||||
"""Create a mock DeltaService."""
|
||||
service = MagicMock(spec=DeltaService)
|
||||
service.storage = MagicMock()
|
||||
return service
|
||||
|
||||
|
||||
def test_migrate_s3_to_s3_with_resume(mock_service):
|
||||
"""Test migration with resume support (skips existing files)."""
|
||||
# Setup mock storage with source files
|
||||
source_objects = [
|
||||
ObjectHead(
|
||||
key="file1.zip",
|
||||
size=1024,
|
||||
etag="abc123",
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
metadata={},
|
||||
),
|
||||
ObjectHead(
|
||||
key="file2.zip",
|
||||
size=2048,
|
||||
etag="def456",
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
metadata={},
|
||||
),
|
||||
ObjectHead(
|
||||
key="subdir/file3.zip",
|
||||
size=512,
|
||||
etag="ghi789",
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
# Destination already has file1.zip (as .delta)
|
||||
dest_objects = [
|
||||
ObjectHead(
|
||||
key="file1.zip.delta",
|
||||
size=100,
|
||||
last_modified="2024-01-02T00:00:00Z",
|
||||
etag="delta123",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
# Configure mock to return appropriate objects
|
||||
def list_side_effect(prefix):
|
||||
if "source-bucket" in prefix:
|
||||
return iter(source_objects)
|
||||
elif "dest-bucket" in prefix:
|
||||
return iter(dest_objects)
|
||||
return iter([])
|
||||
|
||||
mock_service.storage.list.side_effect = list_side_effect
|
||||
|
||||
# Mock the copy operation and click functions
|
||||
# Use quiet=True to skip EC2 detection logging
|
||||
with patch("deltaglider.app.cli.aws_compat.copy_s3_to_s3") as mock_copy:
|
||||
with patch("deltaglider.app.cli.aws_compat.click.confirm", return_value=True):
|
||||
migrate_s3_to_s3(
|
||||
mock_service,
|
||||
"s3://source-bucket/",
|
||||
"s3://dest-bucket/",
|
||||
exclude=None,
|
||||
include=None,
|
||||
quiet=True, # Skip EC2 detection and logging
|
||||
no_delta=False,
|
||||
max_ratio=None,
|
||||
dry_run=False,
|
||||
skip_confirm=False,
|
||||
)
|
||||
|
||||
# Should copy only file2.zip and subdir/file3.zip (file1 already exists)
|
||||
assert mock_copy.call_count == 2
|
||||
|
||||
# Verify the files being migrated
|
||||
call_args = [call[0] for call in mock_copy.call_args_list]
|
||||
migrated_files = [(args[1], args[2]) for args in call_args]
|
||||
|
||||
assert ("s3://source-bucket/file2.zip", "s3://dest-bucket/file2.zip") in migrated_files
|
||||
assert (
|
||||
"s3://source-bucket/subdir/file3.zip",
|
||||
"s3://dest-bucket/subdir/file3.zip",
|
||||
) in migrated_files
|
||||
|
||||
|
||||
def test_migrate_s3_to_s3_dry_run(mock_service):
|
||||
"""Test dry run mode shows what would be migrated without actually migrating."""
|
||||
source_objects = [
|
||||
ObjectHead(
|
||||
key="file1.zip",
|
||||
size=1024,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="abc123",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
mock_service.storage.list.return_value = iter(source_objects)
|
||||
|
||||
# Mock the copy operation and EC2 detection
|
||||
with patch("deltaglider.app.cli.aws_compat.copy_s3_to_s3") as mock_copy:
|
||||
with patch("deltaglider.app.cli.aws_compat.click.echo") as mock_echo:
|
||||
with patch("deltaglider.app.cli.aws_compat.log_aws_region"):
|
||||
migrate_s3_to_s3(
|
||||
mock_service,
|
||||
"s3://source-bucket/",
|
||||
"s3://dest-bucket/",
|
||||
exclude=None,
|
||||
include=None,
|
||||
quiet=False, # Allow output to test dry run messages
|
||||
no_delta=False,
|
||||
max_ratio=None,
|
||||
dry_run=True,
|
||||
skip_confirm=False,
|
||||
)
|
||||
|
||||
# Should not actually copy anything in dry run mode
|
||||
mock_copy.assert_not_called()
|
||||
|
||||
# Should show dry run message
|
||||
echo_calls = [str(call[0][0]) for call in mock_echo.call_args_list if call[0]]
|
||||
assert any("DRY RUN MODE" in msg for msg in echo_calls)
|
||||
|
||||
|
||||
def test_migrate_s3_to_s3_with_filters(mock_service):
|
||||
"""Test migration with include/exclude filters."""
|
||||
source_objects = [
|
||||
ObjectHead(
|
||||
key="file1.zip",
|
||||
size=1024,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="abc123",
|
||||
metadata={},
|
||||
),
|
||||
ObjectHead(
|
||||
key="file2.log",
|
||||
size=256,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="def456",
|
||||
metadata={},
|
||||
),
|
||||
ObjectHead(
|
||||
key="file3.tar",
|
||||
size=512,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="ghi789",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
mock_service.storage.list.return_value = iter(source_objects)
|
||||
|
||||
# Mock the copy operation
|
||||
with patch("deltaglider.app.cli.aws_compat.copy_s3_to_s3") as mock_copy:
|
||||
with patch("click.echo"):
|
||||
with patch("deltaglider.app.cli.aws_compat.click.confirm", return_value=True):
|
||||
# Exclude .log files
|
||||
migrate_s3_to_s3(
|
||||
mock_service,
|
||||
"s3://source-bucket/",
|
||||
"s3://dest-bucket/",
|
||||
exclude="*.log",
|
||||
include=None,
|
||||
quiet=True, # Skip EC2 detection
|
||||
no_delta=False,
|
||||
max_ratio=None,
|
||||
dry_run=False,
|
||||
skip_confirm=False,
|
||||
)
|
||||
|
||||
# Should copy file1.zip and file3.tar, but not file2.log
|
||||
assert mock_copy.call_count == 2
|
||||
|
||||
call_args = [call[0] for call in mock_copy.call_args_list]
|
||||
migrated_sources = [args[1] for args in call_args]
|
||||
|
||||
assert "s3://source-bucket/file1.zip" in migrated_sources
|
||||
assert "s3://source-bucket/file3.tar" in migrated_sources
|
||||
assert "s3://source-bucket/file2.log" not in migrated_sources
|
||||
|
||||
|
||||
def test_migrate_s3_to_s3_skip_confirm(mock_service):
|
||||
"""Test skipping confirmation prompt with skip_confirm=True."""
|
||||
source_objects = [
|
||||
ObjectHead(
|
||||
key="file1.zip",
|
||||
size=1024,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="abc123",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
mock_service.storage.list.return_value = iter(source_objects)
|
||||
|
||||
with patch("deltaglider.app.cli.aws_compat.copy_s3_to_s3") as mock_copy:
|
||||
with patch("click.echo"):
|
||||
with patch("deltaglider.app.cli.aws_compat.click.confirm") as mock_confirm:
|
||||
migrate_s3_to_s3(
|
||||
mock_service,
|
||||
"s3://source-bucket/",
|
||||
"s3://dest-bucket/",
|
||||
exclude=None,
|
||||
include=None,
|
||||
quiet=True, # Skip EC2 detection
|
||||
no_delta=False,
|
||||
max_ratio=None,
|
||||
dry_run=False,
|
||||
skip_confirm=True, # Skip confirmation
|
||||
)
|
||||
|
||||
# Should not ask for confirmation
|
||||
mock_confirm.assert_not_called()
|
||||
|
||||
# Should still perform the copy
|
||||
mock_copy.assert_called_once()
|
||||
|
||||
|
||||
def test_migrate_s3_to_s3_with_prefix(mock_service):
|
||||
"""Test migration with source and destination prefixes."""
|
||||
source_objects = [
|
||||
ObjectHead(
|
||||
key="data/file1.zip",
|
||||
size=1024,
|
||||
last_modified="2024-01-01T00:00:00Z",
|
||||
etag="abc123",
|
||||
metadata={},
|
||||
),
|
||||
]
|
||||
|
||||
def list_side_effect(prefix):
|
||||
if "source-bucket/data" in prefix:
|
||||
return iter(source_objects)
|
||||
return iter([])
|
||||
|
||||
mock_service.storage.list.side_effect = list_side_effect
|
||||
|
||||
with patch("deltaglider.app.cli.aws_compat.copy_s3_to_s3") as mock_copy:
|
||||
with patch("click.echo"):
|
||||
with patch("deltaglider.app.cli.aws_compat.click.confirm", return_value=True):
|
||||
migrate_s3_to_s3(
|
||||
mock_service,
|
||||
"s3://source-bucket/data/",
|
||||
"s3://dest-bucket/archive/",
|
||||
exclude=None,
|
||||
include=None,
|
||||
quiet=True, # Skip EC2 detection
|
||||
no_delta=False,
|
||||
max_ratio=None,
|
||||
dry_run=False,
|
||||
skip_confirm=False,
|
||||
)
|
||||
|
||||
# Verify the correct destination path is used
|
||||
mock_copy.assert_called_once()
|
||||
call_args = mock_copy.call_args[0]
|
||||
assert call_args[1] == "s3://source-bucket/data/file1.zip"
|
||||
assert call_args[2] == "s3://dest-bucket/archive/file1.zip"
|
||||
214
tests/integration/test_stats_command.py
Normal file
214
tests/integration/test_stats_command.py
Normal file
@@ -0,0 +1,214 @@
|
||||
"""Integration tests for stats CLI command."""
|
||||
|
||||
import json
|
||||
from unittest.mock import Mock, patch
|
||||
|
||||
from click.testing import CliRunner
|
||||
|
||||
from deltaglider.app.cli.main import cli
|
||||
from deltaglider.client_models import BucketStats
|
||||
|
||||
|
||||
class TestStatsCommand:
|
||||
"""Test stats CLI command."""
|
||||
|
||||
def test_stats_json_output(self):
|
||||
"""Test stats command with JSON output."""
|
||||
# Create mock bucket stats
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=10,
|
||||
total_size=1000000,
|
||||
compressed_size=500000,
|
||||
space_saved=500000,
|
||||
average_compression_ratio=0.5,
|
||||
delta_objects=7,
|
||||
direct_objects=3,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
# Setup mock client
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
# Run command
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "test-bucket", "--json"])
|
||||
|
||||
# Verify
|
||||
assert result.exit_code == 0
|
||||
output = json.loads(result.output)
|
||||
assert output["bucket"] == "test-bucket"
|
||||
assert output["object_count"] == 10
|
||||
assert output["total_size"] == 1000000
|
||||
assert output["compressed_size"] == 500000
|
||||
assert output["space_saved"] == 500000
|
||||
assert output["average_compression_ratio"] == 0.5
|
||||
assert output["delta_objects"] == 7
|
||||
assert output["direct_objects"] == 3
|
||||
|
||||
# Verify client was called correctly
|
||||
mock_client.get_bucket_stats.assert_called_once_with(
|
||||
"test-bucket", detailed_stats=False
|
||||
)
|
||||
|
||||
def test_stats_json_output_detailed(self):
|
||||
"""Test stats command with detailed JSON output."""
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=5,
|
||||
total_size=2000000,
|
||||
compressed_size=100000,
|
||||
space_saved=1900000,
|
||||
average_compression_ratio=0.95,
|
||||
delta_objects=5,
|
||||
direct_objects=0,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "test-bucket", "--detailed", "--json"])
|
||||
|
||||
assert result.exit_code == 0
|
||||
output = json.loads(result.output)
|
||||
assert output["average_compression_ratio"] == 0.95
|
||||
|
||||
# Verify detailed flag was passed
|
||||
mock_client.get_bucket_stats.assert_called_once_with("test-bucket", detailed_stats=True)
|
||||
|
||||
def test_stats_human_readable_output(self):
|
||||
"""Test stats command with human-readable output."""
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=10,
|
||||
total_size=1500000, # ~1.43 MB
|
||||
compressed_size=300000, # ~293 KB
|
||||
space_saved=1200000, # ~1.14 MB
|
||||
average_compression_ratio=0.8,
|
||||
delta_objects=7,
|
||||
direct_objects=3,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "test-bucket"])
|
||||
|
||||
assert result.exit_code == 0
|
||||
output = result.output
|
||||
|
||||
# Verify human-readable format
|
||||
assert "Bucket Statistics: test-bucket" in output
|
||||
assert "Total Objects:" in output
|
||||
assert "10" in output
|
||||
assert "Delta Objects:" in output
|
||||
assert "7" in output
|
||||
assert "Direct Objects:" in output
|
||||
assert "3" in output
|
||||
assert "Original Size:" in output
|
||||
assert "Compressed Size:" in output
|
||||
assert "Space Saved:" in output
|
||||
assert "Compression Ratio:" in output
|
||||
assert "80.0%" in output # 0.8 = 80%
|
||||
|
||||
def test_stats_error_handling(self):
|
||||
"""Test stats command error handling."""
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.side_effect = Exception("Bucket not found")
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "nonexistent-bucket"])
|
||||
|
||||
assert result.exit_code == 1
|
||||
assert "Error: Bucket not found" in result.output
|
||||
|
||||
def test_stats_with_s3_url(self):
|
||||
"""Test stats command with s3:// URL format."""
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=5,
|
||||
total_size=1000000,
|
||||
compressed_size=500000,
|
||||
space_saved=500000,
|
||||
average_compression_ratio=0.5,
|
||||
delta_objects=3,
|
||||
direct_objects=2,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "s3://test-bucket", "--json"])
|
||||
|
||||
assert result.exit_code == 0
|
||||
# Verify bucket name was parsed correctly from S3 URL
|
||||
mock_client.get_bucket_stats.assert_called_once_with(
|
||||
"test-bucket", detailed_stats=False
|
||||
)
|
||||
|
||||
def test_stats_with_s3_url_trailing_slash(self):
|
||||
"""Test stats command with s3:// URL format with trailing slash."""
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=5,
|
||||
total_size=1000000,
|
||||
compressed_size=500000,
|
||||
space_saved=500000,
|
||||
average_compression_ratio=0.5,
|
||||
delta_objects=3,
|
||||
direct_objects=2,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "s3://test-bucket/", "--json"])
|
||||
|
||||
assert result.exit_code == 0
|
||||
# Verify bucket name was parsed correctly from S3 URL with trailing slash
|
||||
mock_client.get_bucket_stats.assert_called_once_with(
|
||||
"test-bucket", detailed_stats=False
|
||||
)
|
||||
|
||||
def test_stats_with_s3_url_with_prefix(self):
|
||||
"""Test stats command with s3:// URL format with prefix (should ignore prefix)."""
|
||||
mock_stats = BucketStats(
|
||||
bucket="test-bucket",
|
||||
object_count=5,
|
||||
total_size=1000000,
|
||||
compressed_size=500000,
|
||||
space_saved=500000,
|
||||
average_compression_ratio=0.5,
|
||||
delta_objects=3,
|
||||
direct_objects=2,
|
||||
)
|
||||
|
||||
with patch("deltaglider.client.DeltaGliderClient") as mock_client_class:
|
||||
mock_client = Mock()
|
||||
mock_client.get_bucket_stats.return_value = mock_stats
|
||||
mock_client_class.return_value = mock_client
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(cli, ["stats", "s3://test-bucket/some/prefix/", "--json"])
|
||||
|
||||
assert result.exit_code == 0
|
||||
# Verify only bucket name was extracted, prefix ignored
|
||||
mock_client.get_bucket_stats.assert_called_once_with(
|
||||
"test-bucket", detailed_stats=False
|
||||
)
|
||||
454
tests/unit/test_stats_algorithm.py
Normal file
454
tests/unit/test_stats_algorithm.py
Normal file
@@ -0,0 +1,454 @@
|
||||
"""Exhaustive tests for the bucket statistics algorithm."""
|
||||
|
||||
from unittest.mock import MagicMock, Mock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from deltaglider.client_operations.stats import get_bucket_stats
|
||||
|
||||
|
||||
class TestBucketStatsAlgorithm:
|
||||
"""Test suite for get_bucket_stats algorithm."""
|
||||
|
||||
@pytest.fixture
|
||||
def mock_client(self):
|
||||
"""Create a mock DeltaGliderClient."""
|
||||
client = Mock()
|
||||
client.service = Mock()
|
||||
client.service.storage = Mock()
|
||||
client.service.logger = Mock()
|
||||
return client
|
||||
|
||||
def test_empty_bucket(self, mock_client):
|
||||
"""Test statistics for an empty bucket."""
|
||||
# Setup: Empty bucket
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "empty-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.bucket == "empty-bucket"
|
||||
assert stats.object_count == 0
|
||||
assert stats.total_size == 0
|
||||
assert stats.compressed_size == 0
|
||||
assert stats.space_saved == 0
|
||||
assert stats.average_compression_ratio == 0.0
|
||||
assert stats.delta_objects == 0
|
||||
assert stats.direct_objects == 0
|
||||
|
||||
def test_bucket_with_only_direct_files(self, mock_client):
|
||||
"""Test bucket with only direct files (no compression)."""
|
||||
# Setup: Bucket with 3 direct files
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "file1.pdf", "size": 1000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file2.html", "size": 500000, "last_modified": "2024-01-02"},
|
||||
{"key": "file3.txt", "size": 250000, "last_modified": "2024-01-03"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "direct-only-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.object_count == 3
|
||||
assert stats.total_size == 1750000 # Sum of all files
|
||||
assert stats.compressed_size == 1750000 # Same as total (no compression)
|
||||
assert stats.space_saved == 0
|
||||
assert stats.average_compression_ratio == 0.0
|
||||
assert stats.delta_objects == 0
|
||||
assert stats.direct_objects == 3
|
||||
|
||||
def test_bucket_with_delta_compression(self, mock_client):
|
||||
"""Test bucket with delta-compressed files."""
|
||||
# Setup: Bucket with reference.bin and 2 delta files
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file1.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
{"key": "file2.zip.delta", "size": 60000, "last_modified": "2024-01-03"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Mock metadata for delta files
|
||||
def mock_head(path):
|
||||
if "file1.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19500000", "compression_ratio": "0.997"}
|
||||
return head
|
||||
elif "file2.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19600000", "compression_ratio": "0.997"}
|
||||
return head
|
||||
return None
|
||||
|
||||
mock_client.service.storage.head.side_effect = mock_head
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "compressed-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.object_count == 2 # Only delta files counted (not reference.bin)
|
||||
assert stats.total_size == 39100000 # 19.5M + 19.6M
|
||||
assert stats.compressed_size == 20110000 # reference (20M) + deltas (50K + 60K)
|
||||
assert stats.space_saved == 18990000 # ~19MB saved
|
||||
assert stats.average_compression_ratio > 0.48 # ~48.6% compression
|
||||
assert stats.delta_objects == 2
|
||||
assert stats.direct_objects == 0
|
||||
|
||||
def test_orphaned_reference_bin_detection(self, mock_client):
|
||||
"""Test detection of orphaned reference.bin files."""
|
||||
# Setup: Bucket with reference.bin but no delta files
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "regular.pdf", "size": 1000000, "last_modified": "2024-01-02"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "orphaned-ref-bucket")
|
||||
|
||||
# Verify stats
|
||||
assert stats.object_count == 1 # Only regular.pdf
|
||||
assert stats.total_size == 1000000 # Only regular.pdf size
|
||||
assert stats.compressed_size == 1000000 # reference.bin NOT included
|
||||
assert stats.space_saved == 0
|
||||
assert stats.delta_objects == 0
|
||||
assert stats.direct_objects == 1
|
||||
|
||||
# Verify warning was logged
|
||||
warning_calls = mock_client.service.logger.warning.call_args_list
|
||||
assert any("ORPHANED REFERENCE FILE" in str(call) for call in warning_calls)
|
||||
assert any("20,000,000 bytes" in str(call) for call in warning_calls)
|
||||
assert any(
|
||||
"aws s3 rm s3://orphaned-ref-bucket/reference.bin" in str(call)
|
||||
for call in warning_calls
|
||||
)
|
||||
|
||||
def test_mixed_bucket(self, mock_client):
|
||||
"""Test bucket with both delta and direct files."""
|
||||
# Setup: Mixed bucket
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "pro/reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "pro/v1.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
{"key": "pro/v2.zip.delta", "size": 60000, "last_modified": "2024-01-03"},
|
||||
{"key": "docs/readme.pdf", "size": 500000, "last_modified": "2024-01-04"},
|
||||
{"key": "docs/manual.html", "size": 300000, "last_modified": "2024-01-05"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Mock metadata for delta files
|
||||
def mock_head(path):
|
||||
if "v1.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19500000"}
|
||||
return head
|
||||
elif "v2.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19600000"}
|
||||
return head
|
||||
return None
|
||||
|
||||
mock_client.service.storage.head.side_effect = mock_head
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "mixed-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.object_count == 4 # 2 delta + 2 direct files
|
||||
assert stats.total_size == 39900000 # 19.5M + 19.6M + 0.5M + 0.3M
|
||||
assert stats.compressed_size == 20910000 # ref (20M) + deltas (110K) + direct (800K)
|
||||
assert stats.space_saved == 18990000
|
||||
assert stats.delta_objects == 2
|
||||
assert stats.direct_objects == 2
|
||||
|
||||
def test_sha1_files_included(self, mock_client):
|
||||
"""Test that .sha1 checksum files are counted properly."""
|
||||
# Setup: Bucket with .sha1 files
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "file1.zip", "size": 1000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file1.zip.sha1", "size": 41, "last_modified": "2024-01-01"},
|
||||
{"key": "file2.tar", "size": 2000000, "last_modified": "2024-01-02"},
|
||||
{"key": "file2.tar.sha1", "size": 41, "last_modified": "2024-01-02"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "sha1-bucket")
|
||||
|
||||
# Verify - .sha1 files ARE counted
|
||||
assert stats.object_count == 4
|
||||
assert stats.total_size == 3000082 # All files including .sha1
|
||||
assert stats.compressed_size == 3000082
|
||||
assert stats.direct_objects == 4
|
||||
|
||||
def test_multiple_deltaspaces(self, mock_client):
|
||||
"""Test bucket with multiple deltaspaces (different prefixes)."""
|
||||
# Setup: Multiple deltaspaces
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "pro/reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "pro/v1.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
{
|
||||
"key": "enterprise/reference.bin",
|
||||
"size": 25000000,
|
||||
"last_modified": "2024-01-03",
|
||||
},
|
||||
{"key": "enterprise/v1.zip.delta", "size": 70000, "last_modified": "2024-01-04"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Mock metadata
|
||||
def mock_head(path):
|
||||
if "pro/v1.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19500000"}
|
||||
return head
|
||||
elif "enterprise/v1.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "24500000"}
|
||||
return head
|
||||
return None
|
||||
|
||||
mock_client.service.storage.head.side_effect = mock_head
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "multi-deltaspace-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.object_count == 2 # Only delta files
|
||||
assert stats.total_size == 44000000 # 19.5M + 24.5M
|
||||
assert stats.compressed_size == 45120000 # Both references + both deltas
|
||||
assert stats.delta_objects == 2
|
||||
assert stats.direct_objects == 0
|
||||
|
||||
def test_pagination_handling(self, mock_client):
|
||||
"""Test handling of paginated results."""
|
||||
# Setup: Paginated responses
|
||||
mock_client.service.storage.list_objects.side_effect = [
|
||||
{
|
||||
"objects": [
|
||||
{"key": f"file{i}.txt", "size": 1000, "last_modified": "2024-01-01"}
|
||||
for i in range(1000)
|
||||
],
|
||||
"is_truncated": True,
|
||||
"next_continuation_token": "token1",
|
||||
},
|
||||
{
|
||||
"objects": [
|
||||
{"key": f"file{i}.txt", "size": 1000, "last_modified": "2024-01-01"}
|
||||
for i in range(1000, 1500)
|
||||
],
|
||||
"is_truncated": False,
|
||||
},
|
||||
]
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "paginated-bucket")
|
||||
|
||||
# Verify
|
||||
assert stats.object_count == 1500
|
||||
assert stats.total_size == 1500000
|
||||
assert stats.compressed_size == 1500000
|
||||
assert stats.direct_objects == 1500
|
||||
|
||||
# Verify pagination was handled
|
||||
assert mock_client.service.storage.list_objects.call_count == 2
|
||||
|
||||
def test_delta_file_without_metadata(self, mock_client):
|
||||
"""Test handling of delta files with missing metadata."""
|
||||
# Setup: Delta file without metadata
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# No metadata available
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "no-metadata-bucket")
|
||||
|
||||
# Verify - falls back to using delta size as original size
|
||||
assert stats.object_count == 1
|
||||
assert stats.total_size == 50000 # Falls back to delta size
|
||||
assert stats.compressed_size == 20050000 # reference + delta
|
||||
assert stats.delta_objects == 1
|
||||
|
||||
# Verify warning was logged
|
||||
warning_calls = mock_client.service.logger.warning.call_args_list
|
||||
assert any("no original_size" in str(call) for call in warning_calls)
|
||||
|
||||
def test_parallel_metadata_fetching(self, mock_client):
|
||||
"""Test that metadata is fetched in parallel for performance."""
|
||||
# Setup: Many delta files
|
||||
num_deltas = 50
|
||||
objects = [{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"}]
|
||||
objects.extend(
|
||||
[
|
||||
{
|
||||
"key": f"file{i}.zip.delta",
|
||||
"size": 50000 + i,
|
||||
"last_modified": f"2024-01-{i + 2:02d}",
|
||||
}
|
||||
for i in range(num_deltas)
|
||||
]
|
||||
)
|
||||
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": objects,
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Mock metadata
|
||||
def mock_head(path):
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19500000"}
|
||||
return head
|
||||
|
||||
mock_client.service.storage.head.side_effect = mock_head
|
||||
|
||||
# Execute with mocked ThreadPoolExecutor
|
||||
with patch("concurrent.futures.ThreadPoolExecutor") as mock_executor:
|
||||
mock_pool = MagicMock()
|
||||
mock_executor.return_value.__enter__.return_value = mock_pool
|
||||
|
||||
# Simulate parallel execution
|
||||
futures = []
|
||||
for i in range(num_deltas):
|
||||
future = Mock()
|
||||
future.result.return_value = (f"file{i}.zip.delta", {"file_size": "19500000"})
|
||||
futures.append(future)
|
||||
|
||||
mock_pool.submit.side_effect = futures
|
||||
patch_as_completed = patch(
|
||||
"concurrent.futures.as_completed",
|
||||
return_value=futures,
|
||||
)
|
||||
|
||||
with patch_as_completed:
|
||||
_ = get_bucket_stats(mock_client, "parallel-bucket")
|
||||
|
||||
# Verify ThreadPoolExecutor was used with correct max_workers
|
||||
mock_executor.assert_called_once_with(max_workers=10) # min(10, 50) = 10
|
||||
|
||||
def test_detailed_stats_flag(self, mock_client):
|
||||
"""Test that detailed_stats flag controls metadata fetching."""
|
||||
# Setup
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Test with detailed_stats=False (default)
|
||||
# NOTE: Currently, the implementation always fetches metadata regardless of the flag
|
||||
# This test documents the current behavior
|
||||
_ = get_bucket_stats(mock_client, "test-bucket", detailed_stats=False)
|
||||
|
||||
# Currently metadata is always fetched for delta files
|
||||
assert mock_client.service.storage.head.called
|
||||
|
||||
# Reset mock
|
||||
mock_client.service.storage.head.reset_mock()
|
||||
|
||||
# Test with detailed_stats=True
|
||||
mock_client.service.storage.head.return_value = Mock(metadata={"file_size": "19500000"})
|
||||
|
||||
_ = get_bucket_stats(mock_client, "test-bucket", detailed_stats=True)
|
||||
|
||||
# Should fetch metadata
|
||||
assert mock_client.service.storage.head.called
|
||||
|
||||
def test_error_handling_in_metadata_fetch(self, mock_client):
|
||||
"""Test graceful handling of errors during metadata fetch."""
|
||||
# Setup
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{"key": "file1.zip.delta", "size": 50000, "last_modified": "2024-01-02"},
|
||||
{"key": "file2.zip.delta", "size": 60000, "last_modified": "2024-01-03"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Mock metadata fetch to fail for one file
|
||||
def mock_head(path):
|
||||
if "file1.zip.delta" in path:
|
||||
raise Exception("S3 error")
|
||||
elif "file2.zip.delta" in path:
|
||||
head = Mock()
|
||||
head.metadata = {"file_size": "19600000"}
|
||||
return head
|
||||
return None
|
||||
|
||||
mock_client.service.storage.head.side_effect = mock_head
|
||||
|
||||
# Execute - should handle error gracefully
|
||||
stats = get_bucket_stats(mock_client, "error-bucket", detailed_stats=True)
|
||||
|
||||
# Verify - file1 uses fallback, file2 uses metadata
|
||||
assert stats.object_count == 2
|
||||
assert stats.delta_objects == 2
|
||||
# file1 falls back to delta size (50000), file2 uses metadata (19600000)
|
||||
assert stats.total_size == 50000 + 19600000
|
||||
|
||||
def test_multiple_orphaned_references(self, mock_client):
|
||||
"""Test detection of multiple orphaned reference.bin files."""
|
||||
# Setup: Multiple orphaned references
|
||||
mock_client.service.storage.list_objects.return_value = {
|
||||
"objects": [
|
||||
{"key": "pro/reference.bin", "size": 20000000, "last_modified": "2024-01-01"},
|
||||
{
|
||||
"key": "enterprise/reference.bin",
|
||||
"size": 25000000,
|
||||
"last_modified": "2024-01-02",
|
||||
},
|
||||
{"key": "community/reference.bin", "size": 15000000, "last_modified": "2024-01-03"},
|
||||
{"key": "regular.pdf", "size": 1000000, "last_modified": "2024-01-04"},
|
||||
],
|
||||
"is_truncated": False,
|
||||
}
|
||||
mock_client.service.storage.head.return_value = None
|
||||
|
||||
# Execute
|
||||
stats = get_bucket_stats(mock_client, "multi-orphaned-bucket")
|
||||
|
||||
# Verify stats
|
||||
assert stats.object_count == 1 # Only regular.pdf
|
||||
assert stats.total_size == 1000000
|
||||
assert stats.compressed_size == 1000000 # No references included
|
||||
assert stats.space_saved == 0
|
||||
|
||||
# Verify warnings for all orphaned references
|
||||
warning_calls = [str(call) for call in mock_client.service.logger.warning.call_args_list]
|
||||
warning_text = " ".join(warning_calls)
|
||||
|
||||
assert "ORPHANED REFERENCE FILE" in warning_text
|
||||
assert "3 reference.bin file(s)" in warning_text
|
||||
assert "60,000,000 bytes" in warning_text # Total of all references
|
||||
assert "s3://multi-orphaned-bucket/pro/reference.bin" in warning_text
|
||||
assert "s3://multi-orphaned-bucket/enterprise/reference.bin" in warning_text
|
||||
assert "s3://multi-orphaned-bucket/community/reference.bin" in warning_text
|
||||
Reference in New Issue
Block a user