mirror of
https://github.com/beshu-tech/deltaglider.git
synced 2026-04-30 12:14:32 +02:00
Compare commits
22 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
5e333254ba | ||
|
|
04cc984d4a | ||
|
|
ac7d4e067f | ||
|
|
e8fb926fd6 | ||
|
|
626e28eaf6 | ||
|
|
90a342dc33 | ||
|
|
f9f2b036e3 | ||
|
|
778d7f0148 | ||
|
|
37ea2f138c | ||
|
|
5e3b76791e | ||
|
|
fb2877bfd3 | ||
|
|
88fd1f51cd | ||
|
|
0857e02edd | ||
|
|
689cf00d02 | ||
|
|
743d52e783 | ||
|
|
8bc0a0eaf3 | ||
|
|
4cf25e4681 | ||
|
|
69ed9056d2 | ||
|
|
38134f28f5 | ||
|
|
fa1f8b85a9 | ||
|
|
a06cc2939c | ||
|
|
5b8477ed61 |
96
CHANGELOG.md
96
CHANGELOG.md
@@ -5,6 +5,99 @@ All notable changes to this project will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
## [5.0.3] - 2025-10-10
|
||||
|
||||
### Security
|
||||
- **BREAKING**: Removed all legacy shared cache code for security
|
||||
- **BREAKING**: Encryption is now ALWAYS ON (cannot be disabled)
|
||||
- Ephemeral process-isolated cache is now the ONLY mode (no opt-out)
|
||||
- **Content-Addressed Storage (CAS)**: Implemented SHA256-based cache storage
|
||||
- Zero collision risk (SHA256 namespace guarantees uniqueness)
|
||||
- Automatic deduplication (same content = same filename)
|
||||
- Tampering protection (changing content changes SHA, breaks lookup)
|
||||
- Two-level directory structure for filesystem optimization
|
||||
- **Encrypted Cache**: All cache data encrypted at rest using Fernet (AES-128-CBC + HMAC)
|
||||
- Ephemeral encryption keys per process (forward secrecy)
|
||||
- Optional persistent keys via `DG_CACHE_ENCRYPTION_KEY` for shared filesystems
|
||||
- Automatic cleanup of corrupted cache files on decryption failures
|
||||
- Fixed TOCTOU vulnerabilities with atomic SHA validation at use-time
|
||||
- Added `get_validated_ref()` method to prevent cache poisoning
|
||||
- Eliminated multi-user data exposure through mandatory cache isolation
|
||||
|
||||
### Removed
|
||||
- **BREAKING**: Removed `DG_UNSAFE_SHARED_CACHE` environment variable
|
||||
- **BREAKING**: Removed `DG_CACHE_DIR` environment variable
|
||||
- **BREAKING**: Removed `DG_CACHE_ENCRYPTION` environment variable (encryption always on)
|
||||
- **BREAKING**: Removed `cache_dir` parameter from `create_client()`
|
||||
|
||||
### Changed
|
||||
- Cache is now auto-created in `/tmp/deltaglider-*` and cleaned on exit
|
||||
- All cache operations use file locking (Unix) and SHA validation
|
||||
- Added `CacheMissError` and `CacheCorruptionError` exceptions
|
||||
|
||||
### Added
|
||||
- New `ContentAddressedCache` adapter in `adapters/cache_cas.py`
|
||||
- New `EncryptedCache` wrapper in `adapters/cache_encrypted.py`
|
||||
- New `MemoryCache` adapter in `adapters/cache_memory.py` with LRU eviction
|
||||
- Self-describing cache structure with SHA256-based filenames
|
||||
- Configurable cache backends via `DG_CACHE_BACKEND` (filesystem or memory)
|
||||
- Memory cache size limit via `DG_CACHE_MEMORY_SIZE_MB` (default: 100MB)
|
||||
|
||||
### Internal
|
||||
- Updated all tests to use Content-Addressed Storage and encryption
|
||||
- All 119 tests passing with zero errors (99 original + 20 new cache tests)
|
||||
- Type checking: 0 errors (mypy)
|
||||
- Linting: All checks passed (ruff)
|
||||
- Completed Phase 1, 2, and 7 of SECURITY_FIX_ROADMAP.md
|
||||
- Added comprehensive test suites for encryption (13 tests) and memory cache (10 tests)
|
||||
|
||||
## [5.0.1] - 2025-01-10
|
||||
|
||||
### Changed
|
||||
- **Code Organization**: Refactored client.py from 1560 to 1154 lines (26% reduction)
|
||||
- Extracted client operations into modular `client_operations/` package:
|
||||
- `bucket.py` - S3 bucket management operations
|
||||
- `presigned.py` - Presigned URL generation
|
||||
- `batch.py` - Batch upload/download operations
|
||||
- `stats.py` - Analytics and statistics operations
|
||||
- Improved code maintainability with logical separation of concerns
|
||||
- Better developer experience with cleaner module structure
|
||||
|
||||
### Internal
|
||||
- Full type safety maintained with mypy (0 errors)
|
||||
- All 99 tests passing
|
||||
- Code quality checks passing (ruff)
|
||||
- No breaking changes - all public APIs remain unchanged
|
||||
|
||||
## [5.0.0] - 2025-01-10
|
||||
|
||||
### Added
|
||||
- boto3-compatible TypedDict types for S3 responses (no boto3 import needed)
|
||||
- Complete boto3 compatibility vision document
|
||||
- Type-safe response builders using TypedDict patterns
|
||||
|
||||
### Changed
|
||||
- **BREAKING**: `list_objects()` now returns boto3-compatible dict instead of custom dataclass
|
||||
- Use `response['Contents']` instead of `response.contents`
|
||||
- Use `response.get('IsTruncated')` instead of `response.is_truncated`
|
||||
- Use `response.get('NextContinuationToken')` instead of `response.next_continuation_token`
|
||||
- DeltaGlider metadata now in `Metadata` field of each object
|
||||
- Internal response building now uses TypedDict for compile-time type safety
|
||||
- All S3 responses are dicts at runtime (TypedDict is a dict!)
|
||||
|
||||
### Fixed
|
||||
- Updated all documentation examples to use dict-based responses
|
||||
- Fixed pagination examples in README and API docs
|
||||
- Corrected SDK documentation with accurate method signatures
|
||||
|
||||
## [4.2.4] - 2025-01-10
|
||||
|
||||
### Fixed
|
||||
- Show only filename in `ls` output instead of full path for cleaner display
|
||||
- Correct `ls` command path handling and prefix display logic
|
||||
|
||||
## [4.2.3] - 2025-01-07
|
||||
|
||||
### Added
|
||||
@@ -59,6 +152,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
- Delta compression for versioned artifacts
|
||||
- 99%+ compression for similar files
|
||||
|
||||
[5.0.1]: https://github.com/beshu-tech/deltaglider/compare/v5.0.0...v5.0.1
|
||||
[5.0.0]: https://github.com/beshu-tech/deltaglider/compare/v4.2.4...v5.0.0
|
||||
[4.2.4]: https://github.com/beshu-tech/deltaglider/compare/v4.2.3...v4.2.4
|
||||
[4.2.3]: https://github.com/beshu-tech/deltaglider/compare/v4.2.2...v4.2.3
|
||||
[4.2.2]: https://github.com/beshu-tech/deltaglider/compare/v4.2.1...v4.2.2
|
||||
[4.2.1]: https://github.com/beshu-tech/deltaglider/compare/v4.2.0...v4.2.1
|
||||
|
||||
49
CLAUDE.md
49
CLAUDE.md
@@ -97,13 +97,15 @@ src/deltaglider/
|
||||
│ ├── logger.py # LoggerPort protocol for logging
|
||||
│ └── metrics.py # MetricsPort protocol for observability
|
||||
├── adapters/ # Concrete implementations
|
||||
│ ├── storage_s3.py # S3StorageAdapter using boto3
|
||||
│ ├── diff_xdelta.py # XdeltaAdapter using xdelta3 binary
|
||||
│ ├── hash_sha256.py # Sha256Adapter for checksums
|
||||
│ ├── cache_fs.py # FsCacheAdapter for file system cache
|
||||
│ ├── clock_utc.py # UtcClockAdapter for UTC timestamps
|
||||
│ ├── logger_std.py # StdLoggerAdapter for console output
|
||||
│ └── metrics_noop.py # NoopMetricsAdapter (placeholder)
|
||||
│ ├── storage_s3.py # S3StorageAdapter using boto3
|
||||
│ ├── diff_xdelta.py # XdeltaAdapter using xdelta3 binary
|
||||
│ ├── hash_sha256.py # Sha256Adapter for checksums
|
||||
│ ├── cache_cas.py # ContentAddressedCache (SHA256-based storage)
|
||||
│ ├── cache_encrypted.py # EncryptedCache (Fernet encryption wrapper)
|
||||
│ ├── cache_memory.py # MemoryCache (LRU in-memory cache)
|
||||
│ ├── clock_utc.py # UtcClockAdapter for UTC timestamps
|
||||
│ ├── logger_std.py # StdLoggerAdapter for console output
|
||||
│ └── metrics_noop.py # NoopMetricsAdapter (placeholder)
|
||||
└── app/
|
||||
└── cli/ # Click-based CLI application
|
||||
├── main.py # Main CLI entry point with AWS S3 commands
|
||||
@@ -140,7 +142,13 @@ src/deltaglider/
|
||||
2. **Reference Management** (`core/service.py`):
|
||||
- Reference stored at `{deltaspace.prefix}/reference.bin`
|
||||
- SHA256 verification on every read/write
|
||||
- Local cache in `/tmp/.deltaglider/reference_cache` for performance
|
||||
- **Content-Addressed Storage (CAS)** cache in `/tmp/deltaglider-*` (ephemeral)
|
||||
- Cache uses SHA256 as filename with two-level directory structure (ab/cd/abcdef...)
|
||||
- Automatic deduplication: same content = same SHA = same cache file
|
||||
- Zero collision risk: SHA256 namespace guarantees uniqueness
|
||||
- **Encryption**: Optional Fernet (AES-128-CBC + HMAC) encryption at rest (enabled by default)
|
||||
- Ephemeral encryption keys per process for forward secrecy
|
||||
- **Cache Backends**: Configurable filesystem or in-memory cache with LRU eviction
|
||||
|
||||
3. **Sync Algorithm** (`app/cli/sync.py`):
|
||||
- Compares local vs S3 using size and modification time
|
||||
@@ -181,13 +189,22 @@ Core delta logic is in `src/deltaglider/core/service.py`:
|
||||
## Environment Variables
|
||||
|
||||
- `DG_LOG_LEVEL`: Logging level (default: "INFO")
|
||||
- `DG_CACHE_DIR`: Local reference cache directory (default: "/tmp/.deltaglider/reference_cache")
|
||||
- `DG_MAX_RATIO`: Maximum acceptable delta/file ratio (default: "0.5")
|
||||
- `DG_CACHE_BACKEND`: Cache backend type - "filesystem" (default) or "memory"
|
||||
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size limit in MB (default: "100")
|
||||
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded Fernet key for persistent encryption (ephemeral by default)
|
||||
- `AWS_ENDPOINT_URL`: Override S3 endpoint for MinIO/LocalStack
|
||||
- `AWS_ACCESS_KEY_ID`: AWS credentials
|
||||
- `AWS_SECRET_ACCESS_KEY`: AWS credentials
|
||||
- `AWS_DEFAULT_REGION`: AWS region
|
||||
|
||||
**Security Notes**:
|
||||
- **Encryption Always On**: Cache data is ALWAYS encrypted (cannot be disabled)
|
||||
- **Ephemeral Keys**: Encryption keys auto-generated per process for maximum security
|
||||
- **Auto-Cleanup**: Corrupted cache files automatically deleted on decryption failures
|
||||
- **Process Isolation**: Each process gets isolated cache in `/tmp/deltaglider-*`, cleaned up on exit
|
||||
- **Persistent Keys**: Set `DG_CACHE_ENCRYPTION_KEY` only if you need cross-process cache sharing (e.g., shared filesystems)
|
||||
|
||||
## Important Implementation Details
|
||||
|
||||
1. **xdelta3 Binary Dependency**: The system requires xdelta3 binary installed on the system. The `XdeltaAdapter` uses subprocess to call it.
|
||||
@@ -202,7 +219,11 @@ Core delta logic is in `src/deltaglider/core/service.py`:
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- Local reference caching dramatically improves performance for repeated operations
|
||||
- **Content-Addressed Storage**: SHA256-based deduplication eliminates redundant storage
|
||||
- **Cache Backends**:
|
||||
- Filesystem cache (default): persistent across processes, good for shared workflows
|
||||
- Memory cache: faster, zero I/O, perfect for ephemeral CI/CD pipelines
|
||||
- **Encryption Overhead**: ~10-15% performance impact, provides security at rest
|
||||
- Delta compression is CPU-intensive; consider parallelization for bulk uploads
|
||||
- The default max_ratio of 0.5 prevents storing inefficient deltas
|
||||
- For files <1MB, delta overhead may exceed benefits
|
||||
@@ -212,4 +233,10 @@ Core delta logic is in `src/deltaglider/core/service.py`:
|
||||
- Never store AWS credentials in code
|
||||
- Use IAM roles when possible
|
||||
- All S3 operations respect bucket policies and encryption settings
|
||||
- SHA256 checksums prevent tampering and corruption
|
||||
- SHA256 checksums prevent tampering and corruption
|
||||
- **Encryption Always On**: Cache data is ALWAYS encrypted using Fernet (AES-128-CBC + HMAC) - cannot be disabled
|
||||
- **Ephemeral Keys**: Encryption keys auto-generated per process for forward secrecy and process isolation
|
||||
- **Auto-Cleanup**: Corrupted or tampered cache files automatically deleted on decryption failures
|
||||
- **Persistent Keys**: Set `DG_CACHE_ENCRYPTION_KEY` only for cross-process cache sharing (use secrets management)
|
||||
- **Content-Addressed Storage**: SHA256-based filenames prevent collision attacks
|
||||
- **Zero-Trust Cache**: All cache operations include cryptographic validation
|
||||
33
Dockerfile
33
Dockerfile
@@ -30,7 +30,16 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
||||
# Runtime stage - minimal image
|
||||
FROM python:${PYTHON_VERSION}
|
||||
|
||||
# Install xdelta3
|
||||
# Skip man pages and docs to speed up builds
|
||||
RUN mkdir -p /etc/dpkg/dpkg.cfg.d && \
|
||||
echo 'path-exclude /usr/share/doc/*' > /etc/dpkg/dpkg.cfg.d/01_nodoc && \
|
||||
echo 'path-exclude /usr/share/man/*' >> /etc/dpkg/dpkg.cfg.d/01_nodoc && \
|
||||
echo 'path-exclude /usr/share/groff/*' >> /etc/dpkg/dpkg.cfg.d/01_nodoc && \
|
||||
echo 'path-exclude /usr/share/info/*' >> /etc/dpkg/dpkg.cfg.d/01_nodoc && \
|
||||
echo 'path-exclude /usr/share/lintian/*' >> /etc/dpkg/dpkg.cfg.d/01_nodoc && \
|
||||
echo 'path-exclude /usr/share/linda/*' >> /etc/dpkg/dpkg.cfg.d/01_nodoc
|
||||
|
||||
# Install xdelta3 (now much faster without man pages)
|
||||
RUN apt-get update && \
|
||||
apt-get install -y --no-install-recommends xdelta3 && \
|
||||
apt-get clean && \
|
||||
@@ -57,10 +66,28 @@ USER deltaglider
|
||||
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
||||
CMD deltaglider --help || exit 1
|
||||
|
||||
# Environment variables (all optional, can be overridden at runtime)
|
||||
# Logging
|
||||
ENV DG_LOG_LEVEL=INFO
|
||||
|
||||
# Performance & Compression
|
||||
ENV DG_MAX_RATIO=0.5
|
||||
|
||||
# Cache Configuration
|
||||
ENV DG_CACHE_BACKEND=filesystem
|
||||
ENV DG_CACHE_MEMORY_SIZE_MB=100
|
||||
# ENV DG_CACHE_ENCRYPTION_KEY=<base64-key> # Optional: Set for cross-process cache sharing
|
||||
|
||||
# AWS Configuration (override at runtime)
|
||||
# ENV AWS_ENDPOINT_URL=https://s3.amazonaws.com
|
||||
# ENV AWS_ACCESS_KEY_ID=<your-key>
|
||||
# ENV AWS_SECRET_ACCESS_KEY=<your-secret>
|
||||
# ENV AWS_DEFAULT_REGION=us-east-1
|
||||
|
||||
# Labels
|
||||
LABEL org.opencontainers.image.title="DeltaGlider" \
|
||||
org.opencontainers.image.description="Delta-aware S3 file storage wrapper" \
|
||||
org.opencontainers.image.version="0.1.0" \
|
||||
org.opencontainers.image.description="Delta-aware S3 file storage wrapper with encryption" \
|
||||
org.opencontainers.image.version="5.0.3" \
|
||||
org.opencontainers.image.authors="Beshu Limited" \
|
||||
org.opencontainers.image.source="https://github.com/beshu-tech/deltaglider"
|
||||
|
||||
|
||||
62
README.md
62
README.md
@@ -46,6 +46,60 @@ uv pip install deltaglider
|
||||
docker run -v ~/.aws:/root/.aws deltaglider/deltaglider --help
|
||||
```
|
||||
|
||||
### Docker Usage
|
||||
|
||||
DeltaGlider provides a secure, production-ready Docker image with encryption always enabled:
|
||||
|
||||
```bash
|
||||
# Basic usage with AWS credentials from environment
|
||||
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider ls s3://my-bucket/
|
||||
|
||||
# Mount AWS credentials
|
||||
docker run -v ~/.aws:/root/.aws:ro \
|
||||
deltaglider/deltaglider cp file.zip s3://releases/
|
||||
|
||||
# Use memory cache for ephemeral CI/CD pipelines (faster)
|
||||
docker run -e DG_CACHE_BACKEND=memory \
|
||||
-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider sync ./dist/ s3://releases/v1.0.0/
|
||||
|
||||
# Configure memory cache size (default: 100MB)
|
||||
docker run -e DG_CACHE_BACKEND=memory \
|
||||
-e DG_CACHE_MEMORY_SIZE_MB=500 \
|
||||
-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
|
||||
deltaglider/deltaglider cp large-file.zip s3://releases/
|
||||
|
||||
# Use MinIO or custom S3 endpoint
|
||||
docker run -e AWS_ENDPOINT_URL=http://minio:9000 \
|
||||
-e AWS_ACCESS_KEY_ID=minioadmin \
|
||||
-e AWS_SECRET_ACCESS_KEY=minioadmin \
|
||||
deltaglider/deltaglider ls s3://test-bucket/
|
||||
|
||||
# Persistent encryption key for cross-container cache sharing
|
||||
# (Only needed if sharing cache across containers via volume mount)
|
||||
docker run -v /shared-cache:/tmp/.deltaglider \
|
||||
-e DG_CACHE_ENCRYPTION_KEY=$(openssl rand -base64 32) \
|
||||
deltaglider/deltaglider cp file.zip s3://releases/
|
||||
```
|
||||
|
||||
**Environment Variables**:
|
||||
- `DG_LOG_LEVEL`: Logging level (default: `INFO`, options: `DEBUG`, `INFO`, `WARNING`, `ERROR`)
|
||||
- `DG_MAX_RATIO`: Maximum delta/file ratio (default: `0.5`, range: `0.0-1.0`)
|
||||
- `DG_CACHE_BACKEND`: Cache backend (default: `filesystem`, options: `filesystem`, `memory`)
|
||||
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size in MB (default: `100`)
|
||||
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded encryption key for cross-process cache sharing
|
||||
- `AWS_ENDPOINT_URL`: S3 endpoint URL (default: AWS S3)
|
||||
- `AWS_ACCESS_KEY_ID`: AWS access key
|
||||
- `AWS_SECRET_ACCESS_KEY`: AWS secret key
|
||||
- `AWS_DEFAULT_REGION`: AWS region (default: `us-east-1`)
|
||||
|
||||
**Security Notes**:
|
||||
- Encryption is **always enabled** (cannot be disabled)
|
||||
- Each container gets ephemeral encryption keys for maximum security
|
||||
- Corrupted cache files are automatically deleted
|
||||
- Use `DG_CACHE_ENCRYPTION_KEY` only for persistent cache sharing (store securely)
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```bash
|
||||
@@ -207,14 +261,18 @@ with open('downloaded.zip', 'wb') as f:
|
||||
|
||||
# Smart list_objects with optimized performance
|
||||
response = client.list_objects(Bucket='releases', Prefix='v2.0.0/')
|
||||
for obj in response['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']} bytes")
|
||||
|
||||
# Paginated listing for large buckets
|
||||
response = client.list_objects(Bucket='releases', MaxKeys=100)
|
||||
while response.is_truncated:
|
||||
while response.get('IsTruncated'):
|
||||
for obj in response['Contents']:
|
||||
print(obj['Key'])
|
||||
response = client.list_objects(
|
||||
Bucket='releases',
|
||||
MaxKeys=100,
|
||||
ContinuationToken=response.next_continuation_token
|
||||
ContinuationToken=response.get('NextContinuationToken')
|
||||
)
|
||||
|
||||
# Delete and inspect objects
|
||||
|
||||
630
SECURITY_FIX_ROADMAP.md
Normal file
630
SECURITY_FIX_ROADMAP.md
Normal file
@@ -0,0 +1,630 @@
|
||||
# 🛡️ DeltaGlider Security Fix Roadmap
|
||||
|
||||
## Executive Summary
|
||||
Critical security vulnerabilities have been identified in DeltaGlider's cache system that enable multi-user attacks, data exposure, and cache poisoning. This document provides a **chronological, actionable roadmap** to eliminate these threats through bold architectural changes.
|
||||
|
||||
**Key Innovation**: Instead of patching individual issues, we propose a **"Zero-Trust Cache Architecture"** that eliminates entire classes of vulnerabilities.
|
||||
|
||||
---
|
||||
|
||||
## 🚀 The Bold Solution: Ephemeral Signed Cache
|
||||
|
||||
### Core Concept
|
||||
Replace filesystem cache with **ephemeral, cryptographically-signed, user-isolated cache** that eliminates:
|
||||
- TOCTOU vulnerabilities (no shared filesystem)
|
||||
- Multi-user interference (process isolation)
|
||||
- Cache poisoning (cryptographic signatures)
|
||||
- Information disclosure (encrypted metadata)
|
||||
- Cross-endpoint collision (content-addressed storage)
|
||||
|
||||
**Note**: DeltaGlider is designed as a standalone CLI/SDK application. All solutions maintain this architecture without requiring external services.
|
||||
|
||||
---
|
||||
|
||||
## 📋 Implementation Roadmap
|
||||
|
||||
### **DAY 1-2: Emergency Hotfix** (v5.0.3) ✅ COMPLETED
|
||||
*Stop the bleeding - minimal changes for immediate deployment*
|
||||
|
||||
#### 1. **Ephemeral Process-Isolated Cache** (2 hours) ✅ COMPLETED
|
||||
```python
|
||||
# src/deltaglider/app/cli/main.py
|
||||
import tempfile
|
||||
import atexit
|
||||
|
||||
# SECURITY: Always use ephemeral process-isolated cache
|
||||
cache_dir = Path(tempfile.mkdtemp(prefix="deltaglider-", dir="/tmp"))
|
||||
atexit.register(lambda: shutil.rmtree(cache_dir, ignore_errors=True))
|
||||
```
|
||||
|
||||
**Impact**: Each process gets isolated cache, auto-cleaned on exit. Eliminates multi-user attacks.
|
||||
**Implementation**: All legacy shared cache code removed. Ephemeral cache is now the ONLY mode.
|
||||
|
||||
#### 2. **Add SHA Validation at Use-Time** (2 hours) ✅ COMPLETED
|
||||
```python
|
||||
# src/deltaglider/ports/cache.py
|
||||
class CachePort(Protocol):
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get reference with atomic SHA validation - MUST use this for all operations."""
|
||||
...
|
||||
|
||||
# src/deltaglider/adapters/cache_fs.py
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
path = self.ref_path(bucket, prefix)
|
||||
if not path.exists():
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
# Lock file for atomic read (Unix only)
|
||||
with open(path, 'rb') as f:
|
||||
if sys.platform != "win32":
|
||||
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
|
||||
content = f.read()
|
||||
actual_sha = hashlib.sha256(content).hexdigest()
|
||||
|
||||
if actual_sha != expected_sha:
|
||||
path.unlink() # Remove corrupted cache
|
||||
raise CacheCorruptionError(f"SHA mismatch: cache corrupted")
|
||||
|
||||
return path
|
||||
```
|
||||
|
||||
#### 3. **Update All Usage Points** (1 hour) ✅ COMPLETED
|
||||
```python
|
||||
# src/deltaglider/core/service.py
|
||||
# Replaced ALL instances in two locations:
|
||||
# - Line 234 (get method for decoding)
|
||||
# - Line 415 (_create_delta method for encoding)
|
||||
|
||||
ref_path = self.cache.get_validated_ref(
|
||||
delta_space.bucket,
|
||||
delta_space.prefix,
|
||||
ref_sha256 # Pass expected SHA
|
||||
)
|
||||
```
|
||||
|
||||
**Test & Deploy**: ✅ All 99 tests passing + ready for release
|
||||
|
||||
---
|
||||
|
||||
### **DAY 3-5: Quick Wins** (v5.0.3) ✅ COMPLETED
|
||||
*Low-risk improvements with high security impact*
|
||||
|
||||
#### 4. **Implement Content-Addressed Storage** (4 hours) ✅ COMPLETED
|
||||
```python
|
||||
# src/deltaglider/adapters/cache_cas.py
|
||||
class ContentAddressedCache(CachePort):
|
||||
"""Cache using SHA as filename - eliminates collisions"""
|
||||
|
||||
def ref_path(self, bucket: str, prefix: str, sha256: str) -> Path:
|
||||
# Use SHA as filename - guaranteed unique
|
||||
return self.base_dir / sha256[:2] / sha256[2:4] / sha256
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path, sha256: str) -> Path:
|
||||
path = self.ref_path(bucket, prefix, sha256)
|
||||
|
||||
# If file with this SHA exists, we're done (deduplication!)
|
||||
if path.exists():
|
||||
return path
|
||||
|
||||
# Atomic write
|
||||
path.parent.mkdir(parents=True, mode=0o700, exist_ok=True)
|
||||
tmp = path.with_suffix('.tmp')
|
||||
shutil.copy2(src, tmp)
|
||||
os.chmod(tmp, 0o600)
|
||||
|
||||
# Verify content before committing
|
||||
actual_sha = self.hasher.sha256(tmp)
|
||||
if actual_sha != sha256:
|
||||
tmp.unlink()
|
||||
raise ValueError("File corruption during cache write")
|
||||
|
||||
os.replace(tmp, path) # Atomic
|
||||
return path
|
||||
```
|
||||
|
||||
**Benefits**: ✅ ACHIEVED
|
||||
- Same file cached once regardless of bucket/prefix (automatic deduplication)
|
||||
- No collision possible (SHA256 uniqueness guarantees)
|
||||
- Natural cache validation (filename IS the checksum)
|
||||
- Two-level directory structure (ab/cd/abcdef...) for filesystem optimization
|
||||
|
||||
**Implementation**: Complete in `src/deltaglider/adapters/cache_cas.py` with:
|
||||
- `_cas_path()` method for SHA256-based path computation
|
||||
- `get_validated_ref()` with atomic validation and locking
|
||||
- `write_ref()` with atomic temp-file + rename pattern
|
||||
- Ephemeral deltaspace-to-SHA mapping for compatibility
|
||||
|
||||
#### 5. **Add Secure Directory Creation** (2 hours)
|
||||
```python
|
||||
# src/deltaglider/utils/secure_fs.py
|
||||
import os
|
||||
import stat
|
||||
|
||||
def secure_makedirs(path: Path, mode: int = 0o700) -> None:
|
||||
"""Create directory with secure permissions atomically."""
|
||||
try:
|
||||
path.mkdir(parents=True, mode=mode, exist_ok=False)
|
||||
except FileExistsError:
|
||||
# Verify it's ours and has correct permissions
|
||||
st = path.stat()
|
||||
if st.st_uid != os.getuid():
|
||||
raise SecurityError(f"Directory {path} owned by different user")
|
||||
if stat.S_IMODE(st.st_mode) != mode:
|
||||
os.chmod(path, mode) # Fix permissions
|
||||
```
|
||||
|
||||
#### 6. **Unify Cache Configuration** (1 hour)
|
||||
```python
|
||||
# src/deltaglider/config.py
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
def get_cache_dir() -> Path:
|
||||
"""Single source of truth for cache directory."""
|
||||
if os.environ.get("DG_NO_CACHE") == "true":
|
||||
return None # Feature flag to disable cache
|
||||
|
||||
if os.environ.get("DG_EPHEMERAL_CACHE") == "true":
|
||||
return Path(tempfile.mkdtemp(prefix="dg-cache-"))
|
||||
|
||||
# User-specific cache by default
|
||||
cache_base = os.environ.get("DG_CACHE_DIR",
|
||||
os.path.expanduser("~/.cache/deltaglider"))
|
||||
return Path(cache_base) / "v2" # Version cache format
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### **DAY 6-10: Architecture Redesign** (v5.0.3) ✅ COMPLETED
|
||||
*The bold solution that eliminates entire vulnerability classes*
|
||||
|
||||
#### 7. **Implement Memory Cache with Encryption** (8 hours) ✅ COMPLETED
|
||||
```python
|
||||
# src/deltaglider/adapters/cache_memory.py
|
||||
class MemoryCache(CachePort):
|
||||
"""In-memory cache with LRU eviction and configurable size limits."""
|
||||
|
||||
def __init__(self, hasher: HashPort, max_size_mb: int = 100, temp_dir: Path | None = None):
|
||||
self.hasher = hasher
|
||||
self.max_size_bytes = max_size_mb * 1024 * 1024
|
||||
self._current_size = 0
|
||||
self._cache: dict[tuple[str, str], tuple[bytes, str]] = {} # (bucket, prefix) -> (content, SHA)
|
||||
self._access_order: list[tuple[str, str]] = [] # LRU tracking
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Write reference to in-memory cache with LRU eviction."""
|
||||
# Read content and compute SHA
|
||||
content = src.read_bytes()
|
||||
sha256 = self.hasher.sha256_bytes(content)
|
||||
|
||||
# Check if file fits in cache
|
||||
needed_bytes = len(content)
|
||||
if needed_bytes > self.max_size_bytes:
|
||||
raise CacheCorruptionError(f"File too large for cache: {needed_bytes} > {self.max_size_bytes}")
|
||||
|
||||
# Evict LRU if needed
|
||||
self._evict_lru(needed_bytes)
|
||||
|
||||
# Store in memory
|
||||
key = (bucket, prefix)
|
||||
self._cache[key] = (content, sha256)
|
||||
self._current_size += needed_bytes
|
||||
self._access_order.append(key)
|
||||
|
||||
return src # Return original path for compatibility
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with validation."""
|
||||
key = (bucket, prefix)
|
||||
if key not in self._cache:
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
content, stored_sha = self._cache[key]
|
||||
|
||||
# Validate SHA matches
|
||||
if stored_sha != expected_sha:
|
||||
raise CacheCorruptionError(f"SHA mismatch for {bucket}/{prefix}")
|
||||
|
||||
# Update LRU order
|
||||
self._access_order.remove(key)
|
||||
self._access_order.append(key)
|
||||
|
||||
# Write to temp file for compatibility
|
||||
temp_path = self.temp_dir / f"{expected_sha}.bin"
|
||||
temp_path.write_bytes(content)
|
||||
return temp_path
|
||||
```
|
||||
|
||||
# src/deltaglider/adapters/cache_encrypted.py
|
||||
class EncryptedCache(CachePort):
|
||||
"""Encrypted cache wrapper using Fernet symmetric encryption."""
|
||||
|
||||
def __init__(self, backend: CachePort, encryption_key: bytes | None = None):
|
||||
self.backend = backend
|
||||
|
||||
# Key management: ephemeral (default) or provided
|
||||
if encryption_key is None:
|
||||
self._key = Fernet.generate_key() # Ephemeral per process
|
||||
self._ephemeral = True
|
||||
else:
|
||||
self._key = encryption_key
|
||||
self._ephemeral = False
|
||||
|
||||
self._cipher = Fernet(self._key)
|
||||
# Track plaintext SHA since encrypted content has different SHA
|
||||
self._plaintext_sha_map: dict[tuple[str, str], str] = {}
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Encrypt and cache reference file."""
|
||||
# Read plaintext and compute SHA
|
||||
plaintext_data = src.read_bytes()
|
||||
plaintext_sha = hashlib.sha256(plaintext_data).hexdigest()
|
||||
|
||||
# Encrypt data
|
||||
encrypted_data = self._cipher.encrypt(plaintext_data)
|
||||
|
||||
# Write encrypted data to temp file
|
||||
temp_encrypted = src.with_suffix(".encrypted.tmp")
|
||||
temp_encrypted.write_bytes(encrypted_data)
|
||||
|
||||
try:
|
||||
# Store encrypted file via backend
|
||||
result_path = self.backend.write_ref(bucket, prefix, temp_encrypted)
|
||||
|
||||
# Store plaintext SHA mapping
|
||||
key = (bucket, prefix)
|
||||
self._plaintext_sha_map[key] = plaintext_sha
|
||||
|
||||
return result_path
|
||||
finally:
|
||||
temp_encrypted.unlink(missing_ok=True)
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with decryption and validation."""
|
||||
# Verify we have the plaintext SHA mapped
|
||||
key = (bucket, prefix)
|
||||
if key not in self._plaintext_sha_map:
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
if self._plaintext_sha_map[key] != expected_sha:
|
||||
raise CacheCorruptionError(f"SHA mismatch for {bucket}/{prefix}")
|
||||
|
||||
# Get encrypted file from backend
|
||||
encrypted_path = self.backend.ref_path(bucket, prefix)
|
||||
if not encrypted_path.exists():
|
||||
raise CacheMissError(f"Encrypted cache file not found")
|
||||
|
||||
# Decrypt content
|
||||
encrypted_data = encrypted_path.read_bytes()
|
||||
try:
|
||||
decrypted_data = self._cipher.decrypt(encrypted_data)
|
||||
except Exception as e:
|
||||
raise CacheCorruptionError(f"Decryption failed: {e}") from e
|
||||
|
||||
# Validate plaintext SHA
|
||||
actual_sha = hashlib.sha256(decrypted_data).hexdigest()
|
||||
if actual_sha != expected_sha:
|
||||
raise CacheCorruptionError(f"Decrypted content SHA mismatch")
|
||||
|
||||
# Write decrypted content to temp file
|
||||
decrypted_path = encrypted_path.with_suffix(".decrypted")
|
||||
decrypted_path.write_bytes(decrypted_data)
|
||||
return decrypted_path
|
||||
```
|
||||
|
||||
**Implementation**: ✅ COMPLETED
|
||||
- **MemoryCache**: In-memory cache with LRU eviction, configurable size limits, zero filesystem I/O
|
||||
- **EncryptedCache**: Fernet (AES-128-CBC + HMAC) encryption wrapper, ephemeral keys by default
|
||||
- **Configuration**: `DG_CACHE_BACKEND` (filesystem/memory), `DG_CACHE_ENCRYPTION` (true/false)
|
||||
- **Environment Variables**: `DG_CACHE_MEMORY_SIZE_MB`, `DG_CACHE_ENCRYPTION_KEY`
|
||||
|
||||
**Benefits**: ✅ ACHIEVED
|
||||
- No filesystem access for memory cache = no permission issues
|
||||
- Encrypted at rest = secure cache storage
|
||||
- Per-process ephemeral keys = forward secrecy and process isolation
|
||||
- LRU eviction = prevents memory exhaustion
|
||||
- Zero TOCTOU window = memory operations are atomic
|
||||
- Configurable backends = flexibility for different use cases
|
||||
|
||||
#### 8. **Implement Signed Cache Entries** (6 hours)
|
||||
```python
|
||||
# src/deltaglider/adapters/cache_signed.py
|
||||
import hmac
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
class SignedCache(CachePort):
|
||||
"""Cache with cryptographic signatures and expiry."""
|
||||
|
||||
def __init__(self, base_dir: Path, secret_key: bytes = None):
|
||||
self.base_dir = base_dir
|
||||
# Per-session key if not provided
|
||||
self.secret = secret_key or os.urandom(32)
|
||||
|
||||
def _sign_metadata(self, metadata: dict) -> str:
|
||||
"""Create HMAC signature for metadata."""
|
||||
json_meta = json.dumps(metadata, sort_keys=True)
|
||||
signature = hmac.new(
|
||||
self.secret,
|
||||
json_meta.encode(),
|
||||
hashlib.sha256
|
||||
).hexdigest()
|
||||
return signature
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path, sha256: str) -> Path:
|
||||
# Create signed metadata
|
||||
metadata = {
|
||||
"sha256": sha256,
|
||||
"bucket": bucket,
|
||||
"prefix": prefix,
|
||||
"timestamp": datetime.utcnow().isoformat(),
|
||||
"expires": (datetime.utcnow() + timedelta(hours=24)).isoformat(),
|
||||
"pid": os.getpid(),
|
||||
"uid": os.getuid(),
|
||||
}
|
||||
signature = self._sign_metadata(metadata)
|
||||
|
||||
# Store data + metadata
|
||||
cache_dir = self.base_dir / signature[:8] # Use signature prefix as namespace
|
||||
cache_dir.mkdir(parents=True, mode=0o700, exist_ok=True)
|
||||
|
||||
data_path = cache_dir / f"{sha256}.bin"
|
||||
meta_path = cache_dir / f"{sha256}.meta"
|
||||
|
||||
# Atomic writes
|
||||
shutil.copy2(src, data_path)
|
||||
os.chmod(data_path, 0o600)
|
||||
|
||||
with open(meta_path, 'w') as f:
|
||||
json.dump({"metadata": metadata, "signature": signature}, f)
|
||||
os.chmod(meta_path, 0o600)
|
||||
|
||||
return data_path
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, sha256: str) -> Path:
|
||||
# Find and validate signed entry
|
||||
pattern = self.base_dir / "*" / f"{sha256}.meta"
|
||||
matches = list(Path(self.base_dir).glob(f"*/{sha256}.meta"))
|
||||
|
||||
for meta_path in matches:
|
||||
with open(meta_path) as f:
|
||||
entry = json.load(f)
|
||||
|
||||
# Verify signature
|
||||
expected_sig = self._sign_metadata(entry["metadata"])
|
||||
if not hmac.compare_digest(entry["signature"], expected_sig):
|
||||
meta_path.unlink() # Remove tampered entry
|
||||
continue
|
||||
|
||||
# Check expiry
|
||||
expires = datetime.fromisoformat(entry["metadata"]["expires"])
|
||||
if datetime.utcnow() > expires:
|
||||
meta_path.unlink()
|
||||
continue
|
||||
|
||||
# Validate data integrity
|
||||
data_path = meta_path.with_suffix('.bin')
|
||||
actual_sha = self.hasher.sha256(data_path)
|
||||
if actual_sha != sha256:
|
||||
data_path.unlink()
|
||||
meta_path.unlink()
|
||||
continue
|
||||
|
||||
return data_path
|
||||
|
||||
raise CacheMissError(f"No valid cache entry for {sha256}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### **DAY 11-15: Advanced Security** (v6.0.0)
|
||||
*Next-generation features for standalone security*
|
||||
|
||||
#### 9. **Add Integrity Monitoring** (4 hours)
|
||||
```python
|
||||
# src/deltaglider/security/monitor.py
|
||||
import inotify
|
||||
import logging
|
||||
|
||||
class CacheIntegrityMonitor:
|
||||
"""Detect and alert on cache tampering attempts."""
|
||||
|
||||
def __init__(self, cache_dir: Path):
|
||||
self.cache_dir = cache_dir
|
||||
self.notifier = inotify.INotify()
|
||||
self.watch_desc = self.notifier.add_watch(
|
||||
str(cache_dir),
|
||||
inotify.IN_MODIFY | inotify.IN_DELETE | inotify.IN_ATTRIB
|
||||
)
|
||||
self.logger = logging.getLogger("security")
|
||||
|
||||
async def monitor(self):
|
||||
"""Monitor for unauthorized cache modifications."""
|
||||
async for event in self.notifier:
|
||||
if event.mask & inotify.IN_MODIFY:
|
||||
# File modified - verify it was by our process
|
||||
if not self._is_our_modification(event):
|
||||
self.logger.critical(
|
||||
f"SECURITY: Unauthorized cache modification detected: {event.path}"
|
||||
)
|
||||
# Immediately invalidate affected cache
|
||||
Path(event.path).unlink(missing_ok=True)
|
||||
|
||||
elif event.mask & inotify.IN_ATTRIB:
|
||||
# Permission change - always suspicious
|
||||
self.logger.warning(
|
||||
f"SECURITY: Cache permission change: {event.path}"
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### **DAY 16-20: Testing & Rollout** (v6.0.0 release)
|
||||
|
||||
#### 10. **Security Test Suite** (8 hours)
|
||||
```python
|
||||
# tests/security/test_cache_attacks.py
|
||||
import pytest
|
||||
import os
|
||||
import threading
|
||||
import time
|
||||
|
||||
class TestCacheSecurity:
|
||||
"""Test all known attack vectors."""
|
||||
|
||||
def test_toctou_attack_prevented(self, cache):
|
||||
"""Verify TOCTOU window is eliminated."""
|
||||
sha = "abc123"
|
||||
cache.write_ref("bucket", "prefix", test_file, sha)
|
||||
|
||||
# Attacker thread tries to replace file during read
|
||||
def attacker():
|
||||
time.sleep(0.0001) # Try to hit the TOCTOU window
|
||||
cache_path = cache.ref_path("bucket", "prefix", sha)
|
||||
cache_path.write_bytes(b"malicious")
|
||||
|
||||
thread = threading.Thread(target=attacker)
|
||||
thread.start()
|
||||
|
||||
# Should detect tampering
|
||||
with pytest.raises(CacheCorruptionError):
|
||||
cache.get_validated_ref("bucket", "prefix", sha)
|
||||
|
||||
def test_multi_user_isolation(self, cache):
|
||||
"""Verify users can't access each other's cache."""
|
||||
# Create cache as user A
|
||||
cache_a = SignedCache(Path("/tmp/cache"), secret=b"key_a")
|
||||
cache_a.write_ref("bucket", "prefix", test_file, "sha_a")
|
||||
|
||||
# Try to read as user B with different key
|
||||
cache_b = SignedCache(Path("/tmp/cache"), secret=b"key_b")
|
||||
|
||||
with pytest.raises(CacheMissError):
|
||||
cache_b.get_validated_ref("bucket", "prefix", "sha_a")
|
||||
|
||||
def test_cache_poisoning_prevented(self, cache):
|
||||
"""Verify corrupted cache is detected."""
|
||||
sha = "abc123"
|
||||
cache.write_ref("bucket", "prefix", test_file, sha)
|
||||
|
||||
# Corrupt the cache file
|
||||
cache_path = cache.ref_path("bucket", "prefix", sha)
|
||||
with open(cache_path, 'ab') as f:
|
||||
f.write(b"corrupted")
|
||||
|
||||
# Should detect corruption
|
||||
with pytest.raises(CacheCorruptionError):
|
||||
cache.get_validated_ref("bucket", "prefix", sha)
|
||||
```
|
||||
|
||||
#### 11. **Migration Guide** (4 hours)
|
||||
```python
|
||||
# src/deltaglider/migration/v5_to_v6.py
|
||||
def migrate_cache():
|
||||
"""Migrate from v5 shared cache to v6 secure cache."""
|
||||
old_cache = Path("/tmp/.deltaglider/cache")
|
||||
|
||||
if old_cache.exists():
|
||||
print("WARNING: Old insecure cache detected at", old_cache)
|
||||
print("This cache had security vulnerabilities and will not be migrated.")
|
||||
|
||||
response = input("Delete old cache? [y/N]: ")
|
||||
if response.lower() == 'y':
|
||||
shutil.rmtree(old_cache)
|
||||
print("Old cache deleted. New secure cache will be created on demand.")
|
||||
else:
|
||||
print("Old cache retained at", old_cache)
|
||||
print("Set DG_CACHE_DIR to use a different location.")
|
||||
```
|
||||
|
||||
#### 12. **Performance Benchmarks** (4 hours)
|
||||
```python
|
||||
# benchmarks/cache_performance.py
|
||||
def benchmark_cache_implementations():
|
||||
"""Compare performance of cache implementations."""
|
||||
|
||||
implementations = [
|
||||
("Filesystem (v5)", FsCacheAdapter),
|
||||
("Content-Addressed", ContentAddressedCache),
|
||||
("Memory", MemoryCache),
|
||||
("Signed", SignedCache),
|
||||
]
|
||||
|
||||
for name, cache_class in implementations:
|
||||
cache = cache_class(test_dir)
|
||||
|
||||
# Measure write performance
|
||||
start = time.perf_counter()
|
||||
for i in range(1000):
|
||||
cache.write_ref("bucket", f"prefix{i}", test_file, f"sha{i}")
|
||||
write_time = time.perf_counter() - start
|
||||
|
||||
# Measure read performance
|
||||
start = time.perf_counter()
|
||||
for i in range(1000):
|
||||
cache.get_validated_ref("bucket", f"prefix{i}", f"sha{i}")
|
||||
read_time = time.perf_counter() - start
|
||||
|
||||
print(f"{name}: Write={write_time:.3f}s Read={read_time:.3f}s")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Decision Matrix
|
||||
|
||||
| Solution | Security | Performance | Complexity | Breaking Change |
|
||||
|----------|----------|-------------|------------|-----------------|
|
||||
| Hotfix (Day 1-2) | ⭐⭐⭐ | ⭐⭐ | ⭐ | No |
|
||||
| Content-Addressed | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | No |
|
||||
| Memory Cache | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | No |
|
||||
| Signed Cache | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | No |
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Recommended Approach
|
||||
|
||||
### For Immediate Production (Next 48 hours)
|
||||
Deploy **Hotfix v5.0.3** with ephemeral cache + SHA validation
|
||||
|
||||
### For Next Release (1 week)
|
||||
Implement **Content-Addressed Storage** (v5.1.0) - best balance of security and simplicity
|
||||
|
||||
### For Enterprise (1 month)
|
||||
Deploy **Signed Cache** (v6.0.0) for maximum security with built-in TTL and integrity
|
||||
|
||||
---
|
||||
|
||||
## 🚦 Success Metrics
|
||||
|
||||
After implementation, verify:
|
||||
|
||||
1. **Security Tests Pass**: All attack vectors prevented
|
||||
2. **Performance Maintained**: <10% degradation vs v5
|
||||
3. **Zero CVEs**: No security vulnerabilities in cache
|
||||
4. **User Isolation**: Multi-user systems work safely
|
||||
5. **Backward Compatible**: Existing workflows unaffected
|
||||
|
||||
---
|
||||
|
||||
## 📞 Support
|
||||
|
||||
For questions or security concerns:
|
||||
- Security Team: security@deltaglider.io
|
||||
- Lead Developer: @architect
|
||||
- Immediate Issues: Create SECURITY labeled issue
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Disclosure Timeline
|
||||
|
||||
- **Day 0**: Vulnerabilities discovered
|
||||
- **Day 1**: Hotfix released (v5.0.3)
|
||||
- **Day 7**: Improved version released (v5.1.0)
|
||||
- **Day 30**: Full disclosure published
|
||||
- **Day 45**: v6.0.0 with complete redesign
|
||||
|
||||
---
|
||||
|
||||
*Document Version: 1.0*
|
||||
*Classification: SENSITIVE - INTERNAL USE ONLY*
|
||||
*Last Updated: 2024-10-09*
|
||||
316
docs/BOTO3_COMPATIBILITY_VISION.md
Normal file
316
docs/BOTO3_COMPATIBILITY_VISION.md
Normal file
@@ -0,0 +1,316 @@
|
||||
# boto3 Compatibility Vision
|
||||
|
||||
## Current State (v4.2.3)
|
||||
|
||||
DeltaGlider currently uses custom dataclasses for responses:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client, ListObjectsResponse, ObjectInfo
|
||||
|
||||
client = create_client()
|
||||
response: ListObjectsResponse = client.list_objects(Bucket='my-bucket')
|
||||
|
||||
for obj in response.contents: # Custom field name
|
||||
print(f"{obj.key}: {obj.size}") # Custom ObjectInfo dataclass
|
||||
```
|
||||
|
||||
**Problems:**
|
||||
- ❌ Not a true drop-in replacement for boto3
|
||||
- ❌ Users need to learn DeltaGlider-specific types
|
||||
- ❌ Can't use with tools expecting boto3 responses
|
||||
- ❌ Different API surface (`.contents` vs `['Contents']`)
|
||||
|
||||
## Target State (v5.0.0)
|
||||
|
||||
DeltaGlider should return native boto3-compatible dicts with TypedDict type hints:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client, ListObjectsV2Response
|
||||
|
||||
client = create_client()
|
||||
response: ListObjectsV2Response = client.list_objects(Bucket='my-bucket')
|
||||
|
||||
for obj in response['Contents']: # boto3-compatible!
|
||||
print(f"{obj['Key']}: {obj['Size']}") # Works exactly like boto3
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- ✅ **True drop-in replacement** - swap `boto3.client('s3')` with `create_client()`
|
||||
- ✅ **No learning curve** - if you know boto3, you know DeltaGlider
|
||||
- ✅ **Tool compatibility** - works with any library expecting boto3 types
|
||||
- ✅ **Type safety** - TypedDict provides IDE autocomplete without boto3 import
|
||||
- ✅ **Zero runtime overhead** - TypedDict compiles to plain dict
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Type Definitions ✅ (DONE)
|
||||
|
||||
Created `deltaglider/types.py` with comprehensive TypedDict definitions:
|
||||
|
||||
```python
|
||||
from typing import TypedDict, NotRequired
|
||||
from datetime import datetime
|
||||
|
||||
class S3Object(TypedDict):
|
||||
Key: str
|
||||
Size: int
|
||||
LastModified: datetime
|
||||
ETag: NotRequired[str]
|
||||
StorageClass: NotRequired[str]
|
||||
|
||||
class ListObjectsV2Response(TypedDict):
|
||||
Contents: list[S3Object]
|
||||
CommonPrefixes: NotRequired[list[dict[str, str]]]
|
||||
IsTruncated: NotRequired[bool]
|
||||
NextContinuationToken: NotRequired[str]
|
||||
```
|
||||
|
||||
**Key insight:** TypedDict provides type safety at development time but compiles to plain `dict` at runtime!
|
||||
|
||||
### Phase 2: Refactor Client Methods (TODO)
|
||||
|
||||
Update all client methods to return boto3-compatible dicts:
|
||||
|
||||
#### `list_objects()`
|
||||
|
||||
**Before:**
|
||||
```python
|
||||
def list_objects(...) -> ListObjectsResponse: # Custom dataclass
|
||||
return ListObjectsResponse(
|
||||
name=bucket,
|
||||
contents=[ObjectInfo(...), ...] # Custom dataclass
|
||||
)
|
||||
```
|
||||
|
||||
**After:**
|
||||
```python
|
||||
def list_objects(...) -> ListObjectsV2Response: # TypedDict
|
||||
return {
|
||||
'Contents': [
|
||||
{
|
||||
'Key': 'file.zip', # .delta suffix already stripped
|
||||
'Size': 1024,
|
||||
'LastModified': datetime(...),
|
||||
'ETag': '"abc123"',
|
||||
}
|
||||
],
|
||||
'CommonPrefixes': [{'Prefix': 'dir/'}],
|
||||
'IsTruncated': False,
|
||||
}
|
||||
```
|
||||
|
||||
**Key changes:**
|
||||
1. Return plain dict instead of custom dataclass
|
||||
2. Use boto3 field names: `Contents` not `contents`, `Key` not `key`
|
||||
3. Strip `.delta` suffix transparently (already done)
|
||||
4. Hide `reference.bin` files (already done)
|
||||
|
||||
#### `put_object()`
|
||||
|
||||
**Before:**
|
||||
```python
|
||||
def put_object(...) -> dict[str, Any]:
|
||||
return {
|
||||
"ETag": etag,
|
||||
"VersionId": None,
|
||||
"DeltaGliderInfo": {...} # Custom field
|
||||
}
|
||||
```
|
||||
|
||||
**After:**
|
||||
```python
|
||||
def put_object(...) -> PutObjectResponse: # TypedDict
|
||||
return {
|
||||
'ETag': etag,
|
||||
'ResponseMetadata': {'HTTPStatusCode': 200},
|
||||
# DeltaGlider metadata goes in Metadata field
|
||||
'Metadata': {
|
||||
'deltaglider-is-delta': 'true',
|
||||
'deltaglider-compression-ratio': '0.99'
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `get_object()`
|
||||
|
||||
**Before:**
|
||||
```python
|
||||
def get_object(...) -> dict[str, Any]:
|
||||
return {
|
||||
"Body": data,
|
||||
"ContentLength": len(data),
|
||||
"DeltaGliderInfo": {...} # Custom field
|
||||
}
|
||||
```
|
||||
|
||||
**After:**
|
||||
```python
|
||||
def get_object(...) -> GetObjectResponse: # TypedDict
|
||||
return {
|
||||
'Body': data, # bytes, not StreamingBody (simpler!)
|
||||
'ContentLength': len(data),
|
||||
'LastModified': datetime(...),
|
||||
'ETag': '"abc123"',
|
||||
'Metadata': { # DeltaGlider metadata here
|
||||
'deltaglider-is-delta': 'true'
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `delete_object()`, `delete_objects()`, `head_object()`, etc.
|
||||
|
||||
All follow the same pattern: return boto3-compatible dicts with TypedDict hints.
|
||||
|
||||
### Phase 3: Backward Compatibility (TODO)
|
||||
|
||||
Keep old dataclasses for 1-2 versions with deprecation warnings:
|
||||
|
||||
```python
|
||||
class ListObjectsResponse:
|
||||
"""DEPRECATED: Use dict responses with ListObjectsV2Response type hint.
|
||||
|
||||
This will be removed in v6.0.0. Update your code:
|
||||
|
||||
Before:
|
||||
response.contents[0].key
|
||||
|
||||
After:
|
||||
response['Contents'][0]['Key']
|
||||
"""
|
||||
def __init__(self, data: dict):
|
||||
warnings.warn(
|
||||
"ListObjectsResponse dataclass is deprecated. "
|
||||
"Use dict responses with ListObjectsV2Response type hint.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2
|
||||
)
|
||||
self._data = data
|
||||
|
||||
@property
|
||||
def contents(self):
|
||||
return [ObjectInfo(obj) for obj in self._data.get('Contents', [])]
|
||||
```
|
||||
|
||||
### Phase 4: Update Documentation (TODO)
|
||||
|
||||
1. Update all examples to use dict responses
|
||||
2. Add migration guide from v4.x to v5.0
|
||||
3. Update BOTO3_COMPATIBILITY.md
|
||||
4. Add "Drop-in Replacement" marketing language
|
||||
|
||||
### Phase 5: Update Tests (TODO)
|
||||
|
||||
Convert all tests from:
|
||||
```python
|
||||
assert response.contents[0].key == "file.zip"
|
||||
```
|
||||
|
||||
To:
|
||||
```python
|
||||
assert response['Contents'][0]['Key'] == "file.zip"
|
||||
```
|
||||
|
||||
## Migration Guide (for users)
|
||||
|
||||
### v4.x → v5.0
|
||||
|
||||
**Old code (v4.x):**
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
response = client.list_objects(Bucket='my-bucket')
|
||||
|
||||
for obj in response.contents: # Dataclass attribute
|
||||
print(f"{obj.key}: {obj.size}") # Dataclass attributes
|
||||
```
|
||||
|
||||
**New code (v5.0):**
|
||||
```python
|
||||
from deltaglider import create_client, ListObjectsV2Response
|
||||
|
||||
client = create_client()
|
||||
response: ListObjectsV2Response = client.list_objects(Bucket='my-bucket')
|
||||
|
||||
for obj in response['Contents']: # Dict key (boto3-compatible)
|
||||
print(f"{obj['Key']}: {obj['Size']}") # Dict keys (boto3-compatible)
|
||||
```
|
||||
|
||||
**Or even simpler - no type hint needed:**
|
||||
```python
|
||||
client = create_client()
|
||||
response = client.list_objects(Bucket='my-bucket')
|
||||
|
||||
for obj in response['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']}")
|
||||
```
|
||||
|
||||
## Benefits Summary
|
||||
|
||||
### For Users
|
||||
- **Zero learning curve** - if you know boto3, you're done
|
||||
- **Drop-in replacement** - literally change one line (client creation)
|
||||
- **Type safety** - TypedDict provides autocomplete without boto3 dependency
|
||||
- **Tool compatibility** - works with all boto3-compatible libraries
|
||||
|
||||
### For DeltaGlider
|
||||
- **Simpler codebase** - no custom dataclasses to maintain
|
||||
- **Better marketing** - true "drop-in replacement" claim
|
||||
- **Easier testing** - test against boto3 behavior directly
|
||||
- **Future-proof** - if boto3 adds fields, users can access them immediately
|
||||
|
||||
## Technical Details
|
||||
|
||||
### How TypedDict Works
|
||||
|
||||
```python
|
||||
from typing import TypedDict
|
||||
|
||||
class MyResponse(TypedDict):
|
||||
Key: str
|
||||
Size: int
|
||||
|
||||
# At runtime, this is just a dict!
|
||||
response: MyResponse = {'Key': 'file.zip', 'Size': 1024}
|
||||
print(type(response)) # <class 'dict'>
|
||||
|
||||
# But mypy and IDEs understand the structure
|
||||
response['Key'] # ✅ Autocomplete works!
|
||||
response['Nonexistent'] # ❌ Mypy error: Key 'Nonexistent' not found
|
||||
```
|
||||
|
||||
### DeltaGlider-Specific Metadata
|
||||
|
||||
Store in standard boto3 `Metadata` field:
|
||||
|
||||
```python
|
||||
{
|
||||
'Key': 'file.zip',
|
||||
'Size': 1024,
|
||||
'Metadata': {
|
||||
# DeltaGlider-specific fields (prefixed for safety)
|
||||
'deltaglider-is-delta': 'true',
|
||||
'deltaglider-compression-ratio': '0.99',
|
||||
'deltaglider-original-size': '100000',
|
||||
'deltaglider-reference-key': 'releases/v1.0.0/reference.bin',
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This is:
|
||||
- ✅ boto3-compatible (Metadata is a standard field)
|
||||
- ✅ Namespaced (deltaglider- prefix prevents conflicts)
|
||||
- ✅ Optional (tools can ignore it)
|
||||
- ✅ Type-safe (Metadata: NotRequired[dict[str, str]])
|
||||
|
||||
## Status
|
||||
|
||||
- ✅ **Phase 1:** TypedDict definitions created
|
||||
- ✅ **Phase 2:** `list_objects()` refactored to return boto3-compatible dict
|
||||
- ⏳ **Phase 3:** Refactor remaining methods (`put_object`, `get_object`, etc.) (TODO)
|
||||
- ⏳ **Phase 4:** Backward compatibility with deprecation warnings (TODO)
|
||||
- ⏳ **Phase 5:** Documentation updates (TODO)
|
||||
- ⏳ **Phase 6:** Full test coverage updates (PARTIAL - list_objects tests done)
|
||||
|
||||
**Current:** v4.2.3+ (Phase 2 complete - `list_objects()` boto3-compatible)
|
||||
**Target:** v5.0.0 release (all phases complete)
|
||||
@@ -38,10 +38,21 @@ response = client.get_object(Bucket='releases', Key='v1.0.0/app.zip')
|
||||
# Optimized list_objects with smart performance defaults (NEW!)
|
||||
# Fast by default - no unnecessary metadata fetching
|
||||
response = client.list_objects(Bucket='releases', Prefix='v1.0.0/')
|
||||
for obj in response['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']} bytes")
|
||||
|
||||
# Pagination for large buckets
|
||||
response = client.list_objects(Bucket='releases', MaxKeys=100,
|
||||
ContinuationToken=response.next_continuation_token)
|
||||
response = client.list_objects(Bucket='releases', MaxKeys=100)
|
||||
while response.get('IsTruncated'):
|
||||
# Process current page
|
||||
for obj in response['Contents']:
|
||||
print(obj['Key'])
|
||||
# Get next page
|
||||
response = client.list_objects(
|
||||
Bucket='releases',
|
||||
MaxKeys=100,
|
||||
ContinuationToken=response.get('NextContinuationToken')
|
||||
)
|
||||
|
||||
# Get detailed compression stats only when needed
|
||||
response = client.list_objects(Bucket='releases', FetchMetadata=True) # Slower but detailed
|
||||
|
||||
@@ -21,7 +21,6 @@ Factory function to create a configured DeltaGlider client with sensible default
|
||||
def create_client(
|
||||
endpoint_url: Optional[str] = None,
|
||||
log_level: str = "INFO",
|
||||
cache_dir: str = "/tmp/.deltaglider/cache",
|
||||
**kwargs
|
||||
) -> DeltaGliderClient
|
||||
```
|
||||
@@ -30,11 +29,12 @@ def create_client(
|
||||
|
||||
- **endpoint_url** (`Optional[str]`): S3 endpoint URL for MinIO, R2, or other S3-compatible storage. If None, uses AWS S3.
|
||||
- **log_level** (`str`): Logging verbosity level. Options: "DEBUG", "INFO", "WARNING", "ERROR". Default: "INFO".
|
||||
- **cache_dir** (`str`): Directory for local reference cache. Default: "/tmp/.deltaglider/cache".
|
||||
- **kwargs**: Additional arguments passed to `DeltaService`:
|
||||
- **tool_version** (`str`): Version string for metadata. Default: "deltaglider/0.1.0"
|
||||
- **max_ratio** (`float`): Maximum acceptable delta/file ratio. Default: 0.5
|
||||
|
||||
**Security Note**: DeltaGlider automatically uses ephemeral, process-isolated cache (`/tmp/deltaglider-*`) that is cleaned up on exit. No configuration needed.
|
||||
|
||||
#### Returns
|
||||
|
||||
`DeltaGliderClient`: Configured client instance ready for use.
|
||||
@@ -48,11 +48,8 @@ client = create_client()
|
||||
# Custom endpoint for MinIO
|
||||
client = create_client(endpoint_url="http://localhost:9000")
|
||||
|
||||
# Debug mode with custom cache
|
||||
client = create_client(
|
||||
log_level="DEBUG",
|
||||
cache_dir="/var/cache/deltaglider"
|
||||
)
|
||||
# Debug mode
|
||||
client = create_client(log_level="DEBUG")
|
||||
|
||||
# Custom delta ratio threshold
|
||||
client = create_client(max_ratio=0.3) # Only use delta if <30% of original
|
||||
@@ -94,7 +91,7 @@ def list_objects(
|
||||
StartAfter: Optional[str] = None,
|
||||
FetchMetadata: bool = False,
|
||||
**kwargs
|
||||
) -> ListObjectsResponse
|
||||
) -> dict[str, Any]
|
||||
```
|
||||
|
||||
##### Parameters
|
||||
@@ -117,19 +114,32 @@ The method intelligently optimizes performance by:
|
||||
2. Only fetching metadata for delta files when explicitly requested
|
||||
3. Supporting efficient pagination for large buckets
|
||||
|
||||
##### Returns
|
||||
|
||||
boto3-compatible dict with:
|
||||
- **Contents** (`list[dict]`): List of S3Object dicts with Key, Size, LastModified, Metadata
|
||||
- **CommonPrefixes** (`list[dict]`): Optional list of common prefixes (folders)
|
||||
- **IsTruncated** (`bool`): Whether more results are available
|
||||
- **NextContinuationToken** (`str`): Token for next page
|
||||
- **KeyCount** (`int`): Number of keys returned
|
||||
|
||||
##### Examples
|
||||
|
||||
```python
|
||||
# Fast listing for UI display (no metadata fetching)
|
||||
response = client.list_objects(Bucket='releases')
|
||||
for obj in response['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']} bytes")
|
||||
|
||||
# Paginated listing for large buckets
|
||||
response = client.list_objects(Bucket='releases', MaxKeys=100)
|
||||
while response.is_truncated:
|
||||
while response.get('IsTruncated'):
|
||||
for obj in response['Contents']:
|
||||
print(obj['Key'])
|
||||
response = client.list_objects(
|
||||
Bucket='releases',
|
||||
MaxKeys=100,
|
||||
ContinuationToken=response.next_continuation_token
|
||||
ContinuationToken=response.get('NextContinuationToken')
|
||||
)
|
||||
|
||||
# Get detailed compression stats (slower, only for analytics)
|
||||
@@ -137,6 +147,11 @@ response = client.list_objects(
|
||||
Bucket='releases',
|
||||
FetchMetadata=True # Only fetches for delta files
|
||||
)
|
||||
for obj in response['Contents']:
|
||||
metadata = obj.get('Metadata', {})
|
||||
if metadata.get('deltaglider-is-delta') == 'true':
|
||||
compression = metadata.get('deltaglider-compression-ratio', 'unknown')
|
||||
print(f"{obj['Key']}: {compression} compression")
|
||||
```
|
||||
|
||||
#### `get_bucket_stats`
|
||||
@@ -708,9 +723,10 @@ DeltaGlider respects these environment variables:
|
||||
### DeltaGlider Configuration
|
||||
|
||||
- **DG_LOG_LEVEL**: Logging level (DEBUG, INFO, WARNING, ERROR)
|
||||
- **DG_CACHE_DIR**: Local cache directory
|
||||
- **DG_MAX_RATIO**: Default maximum delta ratio
|
||||
|
||||
**Note**: Cache is automatically managed (ephemeral, process-isolated) and requires no configuration.
|
||||
|
||||
### Example
|
||||
|
||||
```bash
|
||||
@@ -721,10 +737,9 @@ export AWS_SECRET_ACCESS_KEY=minioadmin
|
||||
|
||||
# Configure DeltaGlider
|
||||
export DG_LOG_LEVEL=DEBUG
|
||||
export DG_CACHE_DIR=/var/cache/deltaglider
|
||||
export DG_MAX_RATIO=0.3
|
||||
|
||||
# Now use normally
|
||||
# Now use normally (cache managed automatically)
|
||||
python my_script.py
|
||||
```
|
||||
|
||||
|
||||
@@ -69,6 +69,38 @@ Or via environment variable:
|
||||
export AWS_ENDPOINT_URL=http://minio.local:9000
|
||||
```
|
||||
|
||||
### DeltaGlider Configuration
|
||||
|
||||
DeltaGlider supports the following environment variables:
|
||||
|
||||
**Logging & Performance**:
|
||||
- `DG_LOG_LEVEL`: Logging level (default: `INFO`, options: `DEBUG`, `INFO`, `WARNING`, `ERROR`)
|
||||
- `DG_MAX_RATIO`: Maximum delta/file ratio (default: `0.5`, range: `0.0-1.0`)
|
||||
|
||||
**Cache Configuration**:
|
||||
- `DG_CACHE_BACKEND`: Cache backend type (default: `filesystem`, options: `filesystem`, `memory`)
|
||||
- `DG_CACHE_MEMORY_SIZE_MB`: Memory cache size in MB (default: `100`)
|
||||
- `DG_CACHE_ENCRYPTION_KEY`: Optional base64-encoded Fernet key for persistent encryption
|
||||
|
||||
**Security**:
|
||||
- Encryption is **always enabled** (cannot be disabled)
|
||||
- Ephemeral encryption keys per process (forward secrecy)
|
||||
- Corrupted cache files automatically deleted
|
||||
- Set `DG_CACHE_ENCRYPTION_KEY` only for cross-process cache sharing
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
# Use memory cache for faster performance in CI/CD
|
||||
export DG_CACHE_BACKEND=memory
|
||||
export DG_CACHE_MEMORY_SIZE_MB=500
|
||||
|
||||
# Enable debug logging
|
||||
export DG_LOG_LEVEL=DEBUG
|
||||
|
||||
# Adjust delta compression threshold
|
||||
export DG_MAX_RATIO=0.3 # More aggressive compression
|
||||
```
|
||||
|
||||
## Your First Upload
|
||||
|
||||
### Basic Example
|
||||
|
||||
64
examples/boto3_compatible_types.py
Normal file
64
examples/boto3_compatible_types.py
Normal file
@@ -0,0 +1,64 @@
|
||||
"""Example: Using boto3-compatible responses without importing boto3.
|
||||
|
||||
This demonstrates how DeltaGlider provides full type safety and boto3 compatibility
|
||||
without requiring boto3 imports in user code.
|
||||
|
||||
As of v5.0.0, DeltaGlider returns plain dicts (not custom dataclasses) that are
|
||||
100% compatible with boto3 S3 responses. You get IDE autocomplete through TypedDict
|
||||
type hints without any runtime overhead.
|
||||
"""
|
||||
|
||||
from deltaglider import ListObjectsV2Response, S3Object, create_client
|
||||
|
||||
# Create client (no boto3 import needed!)
|
||||
client = create_client()
|
||||
|
||||
# Type hints work perfectly without boto3
|
||||
def process_files(bucket: str, prefix: str) -> None:
|
||||
"""Process files in S3 with full type safety."""
|
||||
# Return type is fully typed - IDE autocomplete works!
|
||||
response: ListObjectsV2Response = client.list_objects(
|
||||
Bucket=bucket, Prefix=prefix, Delimiter="/"
|
||||
)
|
||||
|
||||
# Response is a plain dict - 100% boto3-compatible
|
||||
# TypedDict provides autocomplete and type checking
|
||||
for obj in response["Contents"]:
|
||||
# obj is typed as S3Object - all fields have autocomplete!
|
||||
key: str = obj["Key"] # ✅ IDE knows this is str
|
||||
size: int = obj["Size"] # ✅ IDE knows this is int
|
||||
print(f"{key}: {size} bytes")
|
||||
|
||||
# DeltaGlider metadata is in the standard Metadata field
|
||||
metadata = obj.get("Metadata", {})
|
||||
if metadata.get("deltaglider-is-delta") == "true":
|
||||
compression = metadata.get("deltaglider-compression-ratio", "unknown")
|
||||
print(f" └─ Delta file (compression: {compression})")
|
||||
|
||||
# Optional fields work too
|
||||
for prefix_dict in response.get("CommonPrefixes", []):
|
||||
print(f"Directory: {prefix_dict['Prefix']}")
|
||||
|
||||
# Pagination info
|
||||
if response.get("IsTruncated"):
|
||||
next_token = response.get("NextContinuationToken")
|
||||
print(f"More results available, token: {next_token}")
|
||||
|
||||
|
||||
# This is 100% compatible with boto3 code!
|
||||
def works_with_boto3_or_deltaglider(s3_client) -> None:
|
||||
"""This function works with EITHER boto3 or DeltaGlider client."""
|
||||
# Because the response structure is identical!
|
||||
response = s3_client.list_objects(Bucket="my-bucket")
|
||||
|
||||
for obj in response["Contents"]:
|
||||
print(obj["Key"])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example usage
|
||||
print("✅ Full type safety without boto3 imports!")
|
||||
print("✅ 100% compatible with boto3")
|
||||
print("✅ Drop-in replacement")
|
||||
print("✅ Plain dict responses (not custom dataclasses)")
|
||||
print("✅ DeltaGlider metadata in standard Metadata field")
|
||||
@@ -51,6 +51,7 @@ classifiers = [
|
||||
dependencies = [
|
||||
"boto3>=1.35.0",
|
||||
"click>=8.1.0",
|
||||
"cryptography>=42.0.0",
|
||||
"python-dateutil>=2.9.0",
|
||||
]
|
||||
|
||||
|
||||
@@ -17,12 +17,26 @@ from .client_models import (
|
||||
)
|
||||
from .core import DeltaService, DeltaSpace, ObjectKey
|
||||
|
||||
# Import boto3-compatible type aliases (no boto3 import required!)
|
||||
from .types import (
|
||||
CopyObjectResponse,
|
||||
CreateBucketResponse,
|
||||
DeleteObjectResponse,
|
||||
DeleteObjectsResponse,
|
||||
GetObjectResponse,
|
||||
HeadObjectResponse,
|
||||
ListBucketsResponse,
|
||||
ListObjectsV2Response,
|
||||
PutObjectResponse,
|
||||
S3Object,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"__version__",
|
||||
# Client
|
||||
"DeltaGliderClient",
|
||||
"create_client",
|
||||
# Data classes
|
||||
# Data classes (legacy - will be deprecated in favor of TypedDict)
|
||||
"UploadSummary",
|
||||
"CompressionEstimate",
|
||||
"ObjectInfo",
|
||||
@@ -32,4 +46,15 @@ __all__ = [
|
||||
"DeltaService",
|
||||
"DeltaSpace",
|
||||
"ObjectKey",
|
||||
# boto3-compatible types (no boto3 import needed!)
|
||||
"ListObjectsV2Response",
|
||||
"PutObjectResponse",
|
||||
"GetObjectResponse",
|
||||
"DeleteObjectResponse",
|
||||
"DeleteObjectsResponse",
|
||||
"HeadObjectResponse",
|
||||
"ListBucketsResponse",
|
||||
"CreateBucketResponse",
|
||||
"CopyObjectResponse",
|
||||
"S3Object",
|
||||
]
|
||||
|
||||
@@ -1,6 +1,9 @@
|
||||
"""Adapters for DeltaGlider."""
|
||||
|
||||
from .cache_cas import ContentAddressedCache
|
||||
from .cache_encrypted import EncryptedCache
|
||||
from .cache_fs import FsCacheAdapter
|
||||
from .cache_memory import MemoryCache
|
||||
from .clock_utc import UtcClockAdapter
|
||||
from .diff_xdelta import XdeltaAdapter
|
||||
from .hash_sha import Sha256Adapter
|
||||
@@ -13,6 +16,9 @@ __all__ = [
|
||||
"XdeltaAdapter",
|
||||
"Sha256Adapter",
|
||||
"FsCacheAdapter",
|
||||
"ContentAddressedCache",
|
||||
"EncryptedCache",
|
||||
"MemoryCache",
|
||||
"UtcClockAdapter",
|
||||
"StdLoggerAdapter",
|
||||
"NoopMetricsAdapter",
|
||||
|
||||
246
src/deltaglider/adapters/cache_cas.py
Normal file
246
src/deltaglider/adapters/cache_cas.py
Normal file
@@ -0,0 +1,246 @@
|
||||
"""Content-Addressed Storage (CAS) cache adapter.
|
||||
|
||||
This adapter stores cached references using their SHA256 hash as the filename,
|
||||
eliminating collision risks and enabling automatic deduplication.
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Unix-only imports for file locking
|
||||
if sys.platform != "win32":
|
||||
import fcntl
|
||||
|
||||
from ..core.errors import CacheCorruptionError, CacheMissError
|
||||
from ..ports.cache import CachePort
|
||||
from ..ports.hash import HashPort
|
||||
|
||||
|
||||
class ContentAddressedCache(CachePort):
|
||||
"""Content-addressed storage cache using SHA256 as filename.
|
||||
|
||||
Key Features:
|
||||
- Zero collision risk (SHA256 namespace is the filename)
|
||||
- Automatic deduplication (same content = same filename)
|
||||
- No metadata tracking needed (self-describing)
|
||||
- Secure by design (tampering changes SHA, breaks lookup)
|
||||
|
||||
Storage Layout:
|
||||
- base_dir/
|
||||
- ab/
|
||||
- cd/
|
||||
- abcdef123456... (full SHA256 as filename)
|
||||
|
||||
The two-level directory structure (first 2 chars, next 2 chars) prevents
|
||||
filesystem performance degradation from too many files in one directory.
|
||||
"""
|
||||
|
||||
def __init__(self, base_dir: Path, hasher: HashPort):
|
||||
"""Initialize content-addressed cache.
|
||||
|
||||
Args:
|
||||
base_dir: Root directory for cache storage
|
||||
hasher: Hash adapter for SHA256 computation
|
||||
"""
|
||||
self.base_dir = base_dir
|
||||
self.hasher = hasher
|
||||
# Mapping of (bucket, prefix) -> sha256 for compatibility
|
||||
# This is ephemeral and only used within a single process
|
||||
self._deltaspace_to_sha: dict[tuple[str, str], str] = {}
|
||||
|
||||
def _cas_path(self, sha256: str) -> Path:
|
||||
"""Get content-addressed path from SHA256 hash.
|
||||
|
||||
Uses two-level directory structure for filesystem optimization:
|
||||
- First 2 hex chars as L1 directory (256 buckets)
|
||||
- Next 2 hex chars as L2 directory (256 buckets per L1)
|
||||
- Full SHA as filename
|
||||
|
||||
Example: abcdef1234... -> ab/cd/abcdef1234...
|
||||
|
||||
Args:
|
||||
sha256: Full SHA256 hash (64 hex chars)
|
||||
|
||||
Returns:
|
||||
Path to file in content-addressed storage
|
||||
"""
|
||||
if len(sha256) < 4:
|
||||
raise ValueError(f"Invalid SHA256: {sha256}")
|
||||
|
||||
# Two-level directory structure
|
||||
l1_dir = sha256[:2] # First 2 chars
|
||||
l2_dir = sha256[2:4] # Next 2 chars
|
||||
|
||||
return self.base_dir / l1_dir / l2_dir / sha256
|
||||
|
||||
def ref_path(self, bucket: str, prefix: str) -> Path:
|
||||
"""Get path where reference should be cached.
|
||||
|
||||
For CAS, we need the SHA to compute the path. This method looks up
|
||||
the SHA from the ephemeral mapping. If not found, it returns a
|
||||
placeholder path (backward compatibility with has_ref checks).
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
|
||||
Returns:
|
||||
Path to cached reference (may not exist)
|
||||
"""
|
||||
key = (bucket, prefix)
|
||||
|
||||
# If we have the SHA mapping, use CAS path
|
||||
if key in self._deltaspace_to_sha:
|
||||
sha = self._deltaspace_to_sha[key]
|
||||
return self._cas_path(sha)
|
||||
|
||||
# Fallback: return a non-existent placeholder
|
||||
# This enables has_ref to return False for unmapped deltaspaces
|
||||
return self.base_dir / "_unmapped" / bucket / prefix / "reference.bin"
|
||||
|
||||
def has_ref(self, bucket: str, prefix: str, sha: str) -> bool:
|
||||
"""Check if reference exists with given SHA.
|
||||
|
||||
In CAS, existence check is simple: if file exists at SHA path,
|
||||
it MUST have that SHA (content-addressed guarantee).
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
sha: Expected SHA256 hash
|
||||
|
||||
Returns:
|
||||
True if reference exists with this SHA
|
||||
"""
|
||||
path = self._cas_path(sha)
|
||||
return path.exists()
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with atomic SHA validation.
|
||||
|
||||
In CAS, the SHA IS the filename, so if the file exists, it's already
|
||||
validated by definition. We still perform an integrity check to detect
|
||||
filesystem corruption.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
expected_sha: Expected SHA256 hash
|
||||
|
||||
Returns:
|
||||
Path to validated cached file
|
||||
|
||||
Raises:
|
||||
CacheMissError: File not found in cache
|
||||
CacheCorruptionError: SHA mismatch (filesystem corruption)
|
||||
"""
|
||||
path = self._cas_path(expected_sha)
|
||||
|
||||
if not path.exists():
|
||||
raise CacheMissError(f"Cache miss for SHA {expected_sha[:8]}...")
|
||||
|
||||
# Lock file and validate content atomically
|
||||
try:
|
||||
with open(path, "rb") as f:
|
||||
# Acquire shared lock (Unix only)
|
||||
if sys.platform != "win32":
|
||||
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
|
||||
|
||||
# Read and hash content
|
||||
content = f.read()
|
||||
actual_sha = hashlib.sha256(content).hexdigest()
|
||||
|
||||
# Release lock automatically when exiting context
|
||||
|
||||
# Validate SHA (should never fail in CAS unless filesystem corruption)
|
||||
if actual_sha != expected_sha:
|
||||
# Filesystem corruption detected
|
||||
try:
|
||||
path.unlink()
|
||||
except OSError:
|
||||
pass # Best effort cleanup
|
||||
|
||||
raise CacheCorruptionError(
|
||||
f"Filesystem corruption detected: file {path.name} has wrong content. "
|
||||
f"Expected SHA {expected_sha}, got {actual_sha}"
|
||||
)
|
||||
|
||||
# Update mapping for ref_path compatibility
|
||||
self._deltaspace_to_sha[(bucket, prefix)] = expected_sha
|
||||
|
||||
return path
|
||||
|
||||
except OSError as e:
|
||||
raise CacheMissError(f"Cache read error for SHA {expected_sha[:8]}...: {e}") from e
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Cache reference file using content-addressed storage.
|
||||
|
||||
The file is stored at a path determined by its SHA256 hash.
|
||||
If a file with the same content already exists, it's reused
|
||||
(automatic deduplication).
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
src: Source file to cache
|
||||
|
||||
Returns:
|
||||
Path to cached file (content-addressed)
|
||||
"""
|
||||
# Compute SHA of source file
|
||||
sha = self.hasher.sha256(src)
|
||||
path = self._cas_path(sha)
|
||||
|
||||
# If file already exists, we're done (deduplication)
|
||||
if path.exists():
|
||||
# Update mapping
|
||||
self._deltaspace_to_sha[(bucket, prefix)] = sha
|
||||
return path
|
||||
|
||||
# Create directory structure with secure permissions
|
||||
path.parent.mkdir(parents=True, mode=0o700, exist_ok=True)
|
||||
|
||||
# Atomic write using temp file + rename
|
||||
temp_path = path.parent / f".tmp.{sha}"
|
||||
try:
|
||||
shutil.copy2(src, temp_path)
|
||||
# Atomic rename (POSIX guarantee)
|
||||
temp_path.rename(path)
|
||||
except Exception:
|
||||
# Cleanup on failure
|
||||
if temp_path.exists():
|
||||
temp_path.unlink()
|
||||
raise
|
||||
|
||||
# Update mapping
|
||||
self._deltaspace_to_sha[(bucket, prefix)] = sha
|
||||
|
||||
return path
|
||||
|
||||
def evict(self, bucket: str, prefix: str) -> None:
|
||||
"""Remove cached reference for given deltaspace.
|
||||
|
||||
In CAS, eviction is more complex because:
|
||||
1. Multiple deltaspaces may reference the same SHA (deduplication)
|
||||
2. We can't delete the file unless we know no other deltaspace uses it
|
||||
|
||||
For safety, we only remove the mapping, not the actual file.
|
||||
Orphaned files will be cleaned up by cache expiry (future feature).
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
"""
|
||||
key = (bucket, prefix)
|
||||
|
||||
# Remove mapping (safe operation)
|
||||
if key in self._deltaspace_to_sha:
|
||||
del self._deltaspace_to_sha[key]
|
||||
|
||||
# NOTE: We don't delete the actual CAS file because:
|
||||
# - Other deltaspaces may reference the same SHA
|
||||
# - The ephemeral cache will be cleaned on process exit anyway
|
||||
# - For persistent cache (future), we'd need reference counting
|
||||
283
src/deltaglider/adapters/cache_encrypted.py
Normal file
283
src/deltaglider/adapters/cache_encrypted.py
Normal file
@@ -0,0 +1,283 @@
|
||||
"""Encrypted cache wrapper using Fernet symmetric encryption.
|
||||
|
||||
This adapter wraps any CachePort implementation and adds transparent encryption/decryption.
|
||||
It uses Fernet (symmetric encryption based on AES-128-CBC with HMAC authentication).
|
||||
"""
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
from cryptography.fernet import Fernet
|
||||
|
||||
from ..core.errors import CacheCorruptionError, CacheMissError
|
||||
from ..ports.cache import CachePort
|
||||
|
||||
|
||||
class EncryptedCache(CachePort):
|
||||
"""Encrypted cache wrapper using Fernet symmetric encryption.
|
||||
|
||||
Wraps any CachePort implementation and transparently encrypts data at rest.
|
||||
Uses Fernet which provides:
|
||||
- AES-128-CBC encryption
|
||||
- HMAC authentication (prevents tampering)
|
||||
- Automatic key rotation support
|
||||
- Safe for ephemeral process-isolated caches
|
||||
|
||||
Key Management:
|
||||
- Ephemeral key generated per process (default, most secure)
|
||||
- Or use DG_CACHE_ENCRYPTION_KEY env var (base64-encoded Fernet key)
|
||||
- For production: use secrets management system (AWS KMS, HashiCorp Vault, etc.)
|
||||
|
||||
Security Properties:
|
||||
- Confidentiality: Data encrypted at rest
|
||||
- Integrity: HMAC prevents tampering
|
||||
- Authenticity: Only valid keys can decrypt
|
||||
- Forward Secrecy: Ephemeral keys destroyed on process exit
|
||||
"""
|
||||
|
||||
def __init__(self, backend: CachePort, encryption_key: bytes | None = None):
|
||||
"""Initialize encrypted cache wrapper.
|
||||
|
||||
Args:
|
||||
backend: Underlying cache implementation (CAS, filesystem, memory, etc.)
|
||||
encryption_key: Optional Fernet key (32 bytes base64-encoded).
|
||||
If None, generates ephemeral key for this process.
|
||||
"""
|
||||
self.backend = backend
|
||||
|
||||
# Key management: ephemeral (default) or provided
|
||||
if encryption_key is None:
|
||||
# Generate ephemeral key for this process (most secure)
|
||||
self._key = Fernet.generate_key()
|
||||
self._ephemeral = True
|
||||
else:
|
||||
# Use provided key (for persistent cache scenarios)
|
||||
self._key = encryption_key
|
||||
self._ephemeral = False
|
||||
|
||||
self._cipher = Fernet(self._key)
|
||||
|
||||
# Mapping: (bucket, prefix) -> plaintext_sha256
|
||||
# Needed because backend uses SHA for storage, but encrypted content has different SHA
|
||||
self._plaintext_sha_map: dict[tuple[str, str], str] = {}
|
||||
|
||||
@classmethod
|
||||
def from_env(cls, backend: CachePort) -> "EncryptedCache":
|
||||
"""Create encrypted cache with key from environment.
|
||||
|
||||
Looks for DG_CACHE_ENCRYPTION_KEY environment variable.
|
||||
If not found, generates ephemeral key.
|
||||
|
||||
Args:
|
||||
backend: Underlying cache implementation
|
||||
|
||||
Returns:
|
||||
EncryptedCache instance
|
||||
"""
|
||||
key_str = os.environ.get("DG_CACHE_ENCRYPTION_KEY")
|
||||
if key_str:
|
||||
# Decode base64-encoded key
|
||||
encryption_key = key_str.encode("utf-8")
|
||||
else:
|
||||
# Use ephemeral key
|
||||
encryption_key = None
|
||||
|
||||
return cls(backend, encryption_key)
|
||||
|
||||
def ref_path(self, bucket: str, prefix: str) -> Path:
|
||||
"""Get path where reference should be cached.
|
||||
|
||||
Delegates to backend. Path structure determined by backend
|
||||
(e.g., CAS uses SHA256-based paths).
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
|
||||
Returns:
|
||||
Path from backend
|
||||
"""
|
||||
return self.backend.ref_path(bucket, prefix)
|
||||
|
||||
def has_ref(self, bucket: str, prefix: str, sha: str) -> bool:
|
||||
"""Check if reference exists with given SHA.
|
||||
|
||||
Note: SHA is of the *unencrypted* content. The backend may store
|
||||
encrypted data, but we verify against original content hash.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
sha: SHA256 of unencrypted content
|
||||
|
||||
Returns:
|
||||
True if encrypted reference exists with this SHA
|
||||
"""
|
||||
# Delegate to backend
|
||||
# Backend may use SHA for content-addressed storage of encrypted data
|
||||
return self.backend.has_ref(bucket, prefix, sha)
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with decryption and validation.
|
||||
|
||||
Retrieves encrypted data from backend, decrypts it, validates SHA,
|
||||
and returns path to decrypted temporary file.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
expected_sha: Expected SHA256 of *decrypted* content
|
||||
|
||||
Returns:
|
||||
Path to decrypted validated file (temporary)
|
||||
|
||||
Raises:
|
||||
CacheMissError: File not in cache
|
||||
CacheCorruptionError: Decryption failed or SHA mismatch
|
||||
"""
|
||||
# Check if we have this plaintext SHA mapped
|
||||
key = (bucket, prefix)
|
||||
if key not in self._plaintext_sha_map:
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
# Verify the requested SHA matches our mapping
|
||||
if self._plaintext_sha_map[key] != expected_sha:
|
||||
raise CacheCorruptionError(
|
||||
f"SHA mismatch for {bucket}/{prefix}: "
|
||||
f"expected {expected_sha}, have {self._plaintext_sha_map[key]}"
|
||||
)
|
||||
|
||||
# Get encrypted file from backend using ref_path (not validated, we validate plaintext)
|
||||
encrypted_path = self.backend.ref_path(bucket, prefix)
|
||||
if not encrypted_path.exists():
|
||||
raise CacheMissError(f"Encrypted cache file not found for {bucket}/{prefix}")
|
||||
|
||||
# Read encrypted content
|
||||
try:
|
||||
with open(encrypted_path, "rb") as f:
|
||||
encrypted_data = f.read()
|
||||
except OSError as e:
|
||||
raise CacheMissError(f"Cannot read encrypted cache: {e}") from e
|
||||
|
||||
# Decrypt
|
||||
try:
|
||||
decrypted_data = self._cipher.decrypt(encrypted_data)
|
||||
except Exception as e:
|
||||
# Fernet raises InvalidToken for tampering/wrong key
|
||||
# SECURITY: Auto-delete corrupted cache files
|
||||
try:
|
||||
encrypted_path.unlink(missing_ok=True)
|
||||
# Clean up mapping
|
||||
if key in self._plaintext_sha_map:
|
||||
del self._plaintext_sha_map[key]
|
||||
except Exception:
|
||||
pass # Best effort cleanup
|
||||
raise CacheCorruptionError(
|
||||
f"Decryption failed for {bucket}/{prefix}: {e}. "
|
||||
f"Corrupted cache deleted automatically."
|
||||
) from e
|
||||
|
||||
# Validate SHA of decrypted content
|
||||
import hashlib
|
||||
|
||||
actual_sha = hashlib.sha256(decrypted_data).hexdigest()
|
||||
if actual_sha != expected_sha:
|
||||
# SECURITY: Auto-delete corrupted cache files
|
||||
try:
|
||||
encrypted_path.unlink(missing_ok=True)
|
||||
# Clean up mapping
|
||||
if key in self._plaintext_sha_map:
|
||||
del self._plaintext_sha_map[key]
|
||||
except Exception:
|
||||
pass # Best effort cleanup
|
||||
raise CacheCorruptionError(
|
||||
f"Decrypted content SHA mismatch for {bucket}/{prefix}: "
|
||||
f"expected {expected_sha}, got {actual_sha}. "
|
||||
f"Corrupted cache deleted automatically."
|
||||
)
|
||||
|
||||
# Write decrypted content to temporary file
|
||||
# Use same path as encrypted file but with .decrypted suffix
|
||||
decrypted_path = encrypted_path.with_suffix(".decrypted")
|
||||
try:
|
||||
with open(decrypted_path, "wb") as f:
|
||||
f.write(decrypted_data)
|
||||
except OSError as e:
|
||||
raise CacheCorruptionError(f"Cannot write decrypted cache: {e}") from e
|
||||
|
||||
return decrypted_path
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Encrypt and cache reference file.
|
||||
|
||||
Reads source file, encrypts it, and stores encrypted version via backend.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
src: Source file to encrypt and cache
|
||||
|
||||
Returns:
|
||||
Path to encrypted cached file (from backend)
|
||||
"""
|
||||
# Read source file
|
||||
try:
|
||||
with open(src, "rb") as f:
|
||||
plaintext_data = f.read()
|
||||
except OSError as e:
|
||||
raise CacheCorruptionError(f"Cannot read source file {src}: {e}") from e
|
||||
|
||||
# Compute plaintext SHA for mapping
|
||||
import hashlib
|
||||
|
||||
plaintext_sha = hashlib.sha256(plaintext_data).hexdigest()
|
||||
|
||||
# Encrypt
|
||||
encrypted_data = self._cipher.encrypt(plaintext_data)
|
||||
|
||||
# Write encrypted data to temporary file
|
||||
temp_encrypted = src.with_suffix(".encrypted.tmp")
|
||||
try:
|
||||
with open(temp_encrypted, "wb") as f:
|
||||
f.write(encrypted_data)
|
||||
|
||||
# Store encrypted file via backend
|
||||
result_path = self.backend.write_ref(bucket, prefix, temp_encrypted)
|
||||
|
||||
# Store mapping of plaintext SHA
|
||||
key = (bucket, prefix)
|
||||
self._plaintext_sha_map[key] = plaintext_sha
|
||||
|
||||
return result_path
|
||||
|
||||
finally:
|
||||
# Cleanup temporary file
|
||||
if temp_encrypted.exists():
|
||||
temp_encrypted.unlink()
|
||||
|
||||
def evict(self, bucket: str, prefix: str) -> None:
|
||||
"""Remove cached reference (encrypted version).
|
||||
|
||||
Delegates to backend. Also cleans up any .decrypted temporary files and mappings.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
"""
|
||||
# Remove from plaintext SHA mapping
|
||||
key = (bucket, prefix)
|
||||
if key in self._plaintext_sha_map:
|
||||
del self._plaintext_sha_map[key]
|
||||
|
||||
# Get path to potentially clean up .decrypted files
|
||||
try:
|
||||
path = self.backend.ref_path(bucket, prefix)
|
||||
decrypted_path = path.with_suffix(".decrypted")
|
||||
if decrypted_path.exists():
|
||||
decrypted_path.unlink()
|
||||
except Exception:
|
||||
# Best effort cleanup
|
||||
pass
|
||||
|
||||
# Evict from backend
|
||||
self.backend.evict(bucket, prefix)
|
||||
@@ -1,8 +1,15 @@
|
||||
"""Filesystem cache adapter."""
|
||||
|
||||
import hashlib
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Unix-only imports for file locking
|
||||
if sys.platform != "win32":
|
||||
import fcntl
|
||||
|
||||
from ..core.errors import CacheCorruptionError, CacheMissError
|
||||
from ..ports.cache import CachePort
|
||||
from ..ports.hash import HashPort
|
||||
|
||||
@@ -29,6 +36,60 @@ class FsCacheAdapter(CachePort):
|
||||
actual_sha = self.hasher.sha256(path)
|
||||
return actual_sha == sha
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with atomic SHA validation.
|
||||
|
||||
This method prevents TOCTOU attacks by validating the SHA at use-time,
|
||||
not just at check-time.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Prefix/deltaspace within bucket
|
||||
expected_sha: Expected SHA256 hash
|
||||
|
||||
Returns:
|
||||
Path to validated cached file
|
||||
|
||||
Raises:
|
||||
CacheMissError: File not found in cache
|
||||
CacheCorruptionError: SHA mismatch detected
|
||||
"""
|
||||
path = self.ref_path(bucket, prefix)
|
||||
|
||||
if not path.exists():
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
# Lock file and validate content atomically
|
||||
try:
|
||||
with open(path, "rb") as f:
|
||||
# Acquire shared lock (Unix only)
|
||||
if sys.platform != "win32":
|
||||
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
|
||||
|
||||
# Read and hash content
|
||||
content = f.read()
|
||||
actual_sha = hashlib.sha256(content).hexdigest()
|
||||
|
||||
# Release lock automatically when exiting context
|
||||
|
||||
# Validate SHA
|
||||
if actual_sha != expected_sha:
|
||||
# File corrupted or tampered - remove it
|
||||
try:
|
||||
path.unlink()
|
||||
except OSError:
|
||||
pass # Best effort cleanup
|
||||
|
||||
raise CacheCorruptionError(
|
||||
f"Cache corruption detected for {bucket}/{prefix}: "
|
||||
f"expected {expected_sha}, got {actual_sha}"
|
||||
)
|
||||
|
||||
return path
|
||||
|
||||
except OSError as e:
|
||||
raise CacheMissError(f"Cache read error for {bucket}/{prefix}: {e}") from e
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Cache reference file."""
|
||||
path = self.ref_path(bucket, prefix)
|
||||
|
||||
279
src/deltaglider/adapters/cache_memory.py
Normal file
279
src/deltaglider/adapters/cache_memory.py
Normal file
@@ -0,0 +1,279 @@
|
||||
"""In-memory cache implementation with optional size limits.
|
||||
|
||||
This adapter stores cached references entirely in memory, avoiding filesystem I/O.
|
||||
Useful for:
|
||||
- High-performance scenarios where memory is abundant
|
||||
- Containerized environments with limited filesystem access
|
||||
- Testing and development
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Unix-only imports for compatibility
|
||||
if sys.platform != "win32":
|
||||
import fcntl # noqa: F401
|
||||
|
||||
from ..core.errors import CacheCorruptionError, CacheMissError
|
||||
from ..ports.cache import CachePort
|
||||
from ..ports.hash import HashPort
|
||||
|
||||
|
||||
class MemoryCache(CachePort):
|
||||
"""In-memory cache implementation with LRU eviction.
|
||||
|
||||
Stores cached references in memory as bytes. Useful for high-performance
|
||||
scenarios or when filesystem access is limited.
|
||||
|
||||
Features:
|
||||
- Zero filesystem I/O (everything in RAM)
|
||||
- Optional size limits with LRU eviction
|
||||
- Thread-safe operations
|
||||
- Temporary file creation for compatibility with file-based APIs
|
||||
|
||||
Limitations:
|
||||
- Data lost on process exit (ephemeral only)
|
||||
- Memory usage proportional to cache size
|
||||
- Not suitable for very large reference files
|
||||
|
||||
Storage Layout:
|
||||
- Key: (bucket, prefix) tuple
|
||||
- Value: (content_bytes, sha256) tuple
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
hasher: HashPort,
|
||||
max_size_mb: int = 100,
|
||||
temp_dir: Path | None = None,
|
||||
):
|
||||
"""Initialize in-memory cache.
|
||||
|
||||
Args:
|
||||
hasher: Hash adapter for SHA256 computation
|
||||
max_size_mb: Maximum cache size in megabytes (default 100MB)
|
||||
temp_dir: Directory for temporary files (default: system temp)
|
||||
"""
|
||||
self.hasher = hasher
|
||||
self.max_size_bytes = max_size_mb * 1024 * 1024
|
||||
|
||||
# Storage: (bucket, prefix) -> (content_bytes, sha256)
|
||||
self._cache: dict[tuple[str, str], tuple[bytes, str]] = {}
|
||||
|
||||
# Size tracking
|
||||
self._current_size = 0
|
||||
|
||||
# Access order for LRU eviction: (bucket, prefix) list
|
||||
self._access_order: list[tuple[str, str]] = []
|
||||
|
||||
# Temp directory for file-based API compatibility
|
||||
if temp_dir is None:
|
||||
import tempfile
|
||||
|
||||
self.temp_dir = Path(tempfile.gettempdir()) / "deltaglider-mem-cache"
|
||||
else:
|
||||
self.temp_dir = temp_dir
|
||||
|
||||
self.temp_dir.mkdir(parents=True, exist_ok=True, mode=0o700)
|
||||
|
||||
def _update_access(self, key: tuple[str, str]) -> None:
|
||||
"""Update LRU access order.
|
||||
|
||||
Args:
|
||||
key: Cache key (bucket, prefix)
|
||||
"""
|
||||
# Remove old position if exists
|
||||
if key in self._access_order:
|
||||
self._access_order.remove(key)
|
||||
|
||||
# Add to end (most recently used)
|
||||
self._access_order.append(key)
|
||||
|
||||
def _evict_lru(self, needed_bytes: int) -> None:
|
||||
"""Evict least recently used entries to free space.
|
||||
|
||||
Args:
|
||||
needed_bytes: Bytes needed for new entry
|
||||
"""
|
||||
while self._current_size + needed_bytes > self.max_size_bytes and self._access_order:
|
||||
# Evict least recently used
|
||||
lru_key = self._access_order[0]
|
||||
bucket, prefix = lru_key
|
||||
|
||||
# Remove from cache
|
||||
if lru_key in self._cache:
|
||||
content, _ = self._cache[lru_key]
|
||||
self._current_size -= len(content)
|
||||
del self._cache[lru_key]
|
||||
|
||||
# Remove from access order
|
||||
self._access_order.remove(lru_key)
|
||||
|
||||
def ref_path(self, bucket: str, prefix: str) -> Path:
|
||||
"""Get placeholder path for in-memory reference.
|
||||
|
||||
Returns a virtual path that doesn't actually exist on filesystem.
|
||||
Used for API compatibility.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
|
||||
Returns:
|
||||
Virtual path (may not exist on filesystem)
|
||||
"""
|
||||
# Return virtual path for compatibility
|
||||
# Actual data is in memory, but we need Path for API
|
||||
safe_bucket = bucket.replace("/", "_")
|
||||
safe_prefix = prefix.replace("/", "_")
|
||||
return self.temp_dir / safe_bucket / safe_prefix / "reference.bin"
|
||||
|
||||
def has_ref(self, bucket: str, prefix: str, sha: str) -> bool:
|
||||
"""Check if reference exists in memory with given SHA.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
sha: Expected SHA256 hash
|
||||
|
||||
Returns:
|
||||
True if reference exists with this SHA
|
||||
"""
|
||||
key = (bucket, prefix)
|
||||
if key not in self._cache:
|
||||
return False
|
||||
|
||||
_, cached_sha = self._cache[key]
|
||||
return cached_sha == sha
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference from memory with validation.
|
||||
|
||||
Retrieves content from memory, validates SHA, and writes to
|
||||
temporary file for compatibility with file-based APIs.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
expected_sha: Expected SHA256 hash
|
||||
|
||||
Returns:
|
||||
Path to temporary file containing content
|
||||
|
||||
Raises:
|
||||
CacheMissError: Content not in cache
|
||||
CacheCorruptionError: SHA mismatch
|
||||
"""
|
||||
key = (bucket, prefix)
|
||||
|
||||
# Check if in cache
|
||||
if key not in self._cache:
|
||||
raise CacheMissError(f"Cache miss for {bucket}/{prefix}")
|
||||
|
||||
# Get content and validate
|
||||
content, cached_sha = self._cache[key]
|
||||
|
||||
# Update LRU
|
||||
self._update_access(key)
|
||||
|
||||
# Validate SHA
|
||||
if cached_sha != expected_sha:
|
||||
# SHA mismatch - possible corruption
|
||||
raise CacheCorruptionError(
|
||||
f"Memory cache SHA mismatch for {bucket}/{prefix}: "
|
||||
f"expected {expected_sha}, got {cached_sha}"
|
||||
)
|
||||
|
||||
# Write to temporary file for API compatibility
|
||||
temp_path = self.ref_path(bucket, prefix)
|
||||
temp_path.parent.mkdir(parents=True, exist_ok=True, mode=0o700)
|
||||
|
||||
try:
|
||||
with open(temp_path, "wb") as f:
|
||||
f.write(content)
|
||||
except OSError as e:
|
||||
raise CacheMissError(f"Cannot write temp file: {e}") from e
|
||||
|
||||
return temp_path
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Store reference file in memory.
|
||||
|
||||
Reads file content and stores in memory with SHA hash.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
src: Source file to cache
|
||||
|
||||
Returns:
|
||||
Virtual path (content is in memory)
|
||||
"""
|
||||
# Read source file
|
||||
try:
|
||||
with open(src, "rb") as f:
|
||||
content = f.read()
|
||||
except OSError as e:
|
||||
raise CacheCorruptionError(f"Cannot read source file {src}: {e}") from e
|
||||
|
||||
# Compute SHA
|
||||
sha = hashlib.sha256(content).hexdigest()
|
||||
|
||||
# Check if we need to evict
|
||||
content_size = len(content)
|
||||
if content_size > self.max_size_bytes:
|
||||
raise CacheCorruptionError(
|
||||
f"File too large for memory cache: {content_size} bytes "
|
||||
f"(limit: {self.max_size_bytes} bytes)"
|
||||
)
|
||||
|
||||
# Evict LRU entries if needed
|
||||
self._evict_lru(content_size)
|
||||
|
||||
# Store in memory
|
||||
key = (bucket, prefix)
|
||||
self._cache[key] = (content, sha)
|
||||
self._current_size += content_size
|
||||
|
||||
# Update LRU
|
||||
self._update_access(key)
|
||||
|
||||
# Return virtual path
|
||||
return self.ref_path(bucket, prefix)
|
||||
|
||||
def evict(self, bucket: str, prefix: str) -> None:
|
||||
"""Remove cached reference from memory.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Deltaspace prefix
|
||||
"""
|
||||
key = (bucket, prefix)
|
||||
|
||||
# Remove from cache
|
||||
if key in self._cache:
|
||||
content, _ = self._cache[key]
|
||||
self._current_size -= len(content)
|
||||
del self._cache[key]
|
||||
|
||||
# Remove from LRU tracking
|
||||
if key in self._access_order:
|
||||
self._access_order.remove(key)
|
||||
|
||||
# Clean up temp file if exists
|
||||
temp_path = self.ref_path(bucket, prefix)
|
||||
if temp_path.exists():
|
||||
try:
|
||||
temp_path.unlink()
|
||||
except OSError:
|
||||
pass # Best effort
|
||||
|
||||
def clear(self) -> None:
|
||||
"""Clear all cached content from memory.
|
||||
|
||||
Useful for testing and cleanup.
|
||||
"""
|
||||
self._cache.clear()
|
||||
self._access_order.clear()
|
||||
self._current_size = 0
|
||||
@@ -1,14 +1,16 @@
|
||||
"""CLI main entry point."""
|
||||
|
||||
import atexit
|
||||
import json
|
||||
import os
|
||||
import shutil
|
||||
import sys
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
import click
|
||||
|
||||
from ...adapters import (
|
||||
FsCacheAdapter,
|
||||
NoopMetricsAdapter,
|
||||
S3StorageAdapter,
|
||||
Sha256Adapter,
|
||||
@@ -18,6 +20,7 @@ from ...adapters import (
|
||||
)
|
||||
from ...core import DeltaService, ObjectKey
|
||||
from ...ports import MetricsPort
|
||||
from ...ports.cache import CachePort
|
||||
from .aws_compat import (
|
||||
copy_s3_to_s3,
|
||||
determine_operation,
|
||||
@@ -38,10 +41,14 @@ def create_service(
|
||||
) -> DeltaService:
|
||||
"""Create service with wired adapters."""
|
||||
# Get config from environment
|
||||
cache_dir = Path(os.environ.get("DG_CACHE_DIR", "/tmp/.deltaglider/reference_cache"))
|
||||
max_ratio = float(os.environ.get("DG_MAX_RATIO", "0.5"))
|
||||
metrics_type = os.environ.get("DG_METRICS", "logging") # Options: noop, logging, cloudwatch
|
||||
|
||||
# SECURITY: Always use ephemeral process-isolated cache
|
||||
cache_dir = Path(tempfile.mkdtemp(prefix="deltaglider-", dir="/tmp"))
|
||||
# Register cleanup handler to remove cache on exit
|
||||
atexit.register(lambda: shutil.rmtree(cache_dir, ignore_errors=True))
|
||||
|
||||
# Set AWS environment variables if provided
|
||||
if endpoint_url:
|
||||
os.environ["AWS_ENDPOINT_URL"] = endpoint_url
|
||||
@@ -54,7 +61,24 @@ def create_service(
|
||||
hasher = Sha256Adapter()
|
||||
storage = S3StorageAdapter(endpoint_url=endpoint_url)
|
||||
diff = XdeltaAdapter()
|
||||
cache = FsCacheAdapter(cache_dir, hasher)
|
||||
|
||||
# SECURITY: Configurable cache with encryption and backend selection
|
||||
from deltaglider.adapters import ContentAddressedCache, EncryptedCache, MemoryCache
|
||||
|
||||
# Select backend: memory or filesystem
|
||||
cache_backend = os.environ.get("DG_CACHE_BACKEND", "filesystem") # Options: filesystem, memory
|
||||
base_cache: CachePort
|
||||
if cache_backend == "memory":
|
||||
max_size_mb = int(os.environ.get("DG_CACHE_MEMORY_SIZE_MB", "100"))
|
||||
base_cache = MemoryCache(hasher, max_size_mb=max_size_mb, temp_dir=cache_dir)
|
||||
else:
|
||||
# Filesystem-backed with Content-Addressed Storage
|
||||
base_cache = ContentAddressedCache(cache_dir, hasher)
|
||||
|
||||
# Always apply encryption with ephemeral keys (security hardening)
|
||||
# Encryption key is optional via DG_CACHE_ENCRYPTION_KEY (ephemeral if not set)
|
||||
cache: CachePort = EncryptedCache.from_env(base_cache)
|
||||
|
||||
clock = UtcClockAdapter()
|
||||
logger = StdLoggerAdapter(level=log_level)
|
||||
|
||||
@@ -240,6 +264,13 @@ def ls(
|
||||
prefix_str: str
|
||||
bucket_name, prefix_str = parse_s3_url(s3_url)
|
||||
|
||||
# Ensure prefix ends with / if it's meant to be a directory
|
||||
# This helps with proper path handling
|
||||
if prefix_str and not prefix_str.endswith("/"):
|
||||
# Check if this is a file or directory by listing
|
||||
# For now, assume it's a directory prefix
|
||||
prefix_str = prefix_str + "/"
|
||||
|
||||
# Format bytes to human readable
|
||||
def format_bytes(size: int) -> str:
|
||||
if not human_readable:
|
||||
@@ -252,33 +283,38 @@ def ls(
|
||||
return f"{size_float:.1f}P"
|
||||
|
||||
# List objects using SDK (automatically filters .delta and reference.bin)
|
||||
from deltaglider.client import DeltaGliderClient, ListObjectsResponse
|
||||
from deltaglider.client import DeltaGliderClient
|
||||
|
||||
client = DeltaGliderClient(service)
|
||||
dg_response: ListObjectsResponse = client.list_objects(
|
||||
Bucket=bucket_name, Prefix=prefix_str, MaxKeys=10000
|
||||
dg_response = client.list_objects(
|
||||
Bucket=bucket_name,
|
||||
Prefix=prefix_str,
|
||||
MaxKeys=10000,
|
||||
Delimiter="/" if not recursive else "",
|
||||
)
|
||||
objects = dg_response.contents
|
||||
objects = dg_response["Contents"]
|
||||
|
||||
# Filter by recursive flag
|
||||
if not recursive:
|
||||
# Only show direct children
|
||||
seen_prefixes = set()
|
||||
# Show common prefixes (subdirectories) from S3 response
|
||||
for common_prefix in dg_response.get("CommonPrefixes", []):
|
||||
prefix_path = common_prefix.get("Prefix", "")
|
||||
# Show only the directory name, not the full path
|
||||
if prefix_str:
|
||||
# Strip the current prefix to show only the subdirectory
|
||||
display_name = prefix_path[len(prefix_str) :]
|
||||
else:
|
||||
display_name = prefix_path
|
||||
click.echo(f" PRE {display_name}")
|
||||
|
||||
# Only show files at current level (not in subdirectories)
|
||||
filtered_objects = []
|
||||
for obj in objects:
|
||||
rel_path = obj.key[len(prefix_str) :] if prefix_str else obj.key
|
||||
if "/" in rel_path:
|
||||
# It's in a subdirectory
|
||||
subdir = rel_path.split("/")[0] + "/"
|
||||
if subdir not in seen_prefixes:
|
||||
seen_prefixes.add(subdir)
|
||||
# Show as directory
|
||||
full_prefix = f"{prefix_str}{subdir}" if prefix_str else subdir
|
||||
click.echo(f" PRE {full_prefix}")
|
||||
else:
|
||||
# Direct file
|
||||
if rel_path: # Only add if there's actually a file at this level
|
||||
filtered_objects.append(obj)
|
||||
obj_key = obj["Key"]
|
||||
rel_path = obj_key[len(prefix_str) :] if prefix_str else obj_key
|
||||
# Only include if it's a direct child (no / in relative path)
|
||||
if "/" not in rel_path and rel_path:
|
||||
filtered_objects.append(obj)
|
||||
objects = filtered_objects
|
||||
|
||||
# Display objects (SDK already filters reference.bin and strips .delta)
|
||||
@@ -286,19 +322,26 @@ def ls(
|
||||
total_count = 0
|
||||
|
||||
for obj in objects:
|
||||
total_size += obj.size
|
||||
total_size += obj["Size"]
|
||||
total_count += 1
|
||||
|
||||
# Format the display
|
||||
size_str = format_bytes(obj.size)
|
||||
size_str = format_bytes(obj["Size"])
|
||||
# last_modified is a string from SDK, parse it if needed
|
||||
if isinstance(obj.last_modified, str):
|
||||
last_modified = obj.get("LastModified", "")
|
||||
if isinstance(last_modified, str):
|
||||
# Already a string, extract date portion
|
||||
date_str = obj.last_modified[:19].replace("T", " ")
|
||||
date_str = last_modified[:19].replace("T", " ")
|
||||
else:
|
||||
date_str = obj.last_modified.strftime("%Y-%m-%d %H:%M:%S")
|
||||
date_str = last_modified.strftime("%Y-%m-%d %H:%M:%S")
|
||||
|
||||
click.echo(f"{date_str} {size_str:>10} s3://{bucket_name}/{obj.key}")
|
||||
# Show only the filename relative to current prefix (like AWS CLI)
|
||||
if prefix_str:
|
||||
display_key = obj["Key"][len(prefix_str) :]
|
||||
else:
|
||||
display_key = obj["Key"]
|
||||
|
||||
click.echo(f"{date_str} {size_str:>10} {display_key}")
|
||||
|
||||
# Show summary if requested
|
||||
if summarize:
|
||||
|
||||
@@ -1,5 +1,9 @@
|
||||
"""DeltaGlider client with boto3-compatible APIs and advanced features."""
|
||||
|
||||
# ruff: noqa: I001
|
||||
import atexit
|
||||
import os
|
||||
import shutil
|
||||
import tempfile
|
||||
from collections.abc import Callable
|
||||
from pathlib import Path
|
||||
@@ -10,12 +14,36 @@ from .client_delete_helpers import delete_with_delta_suffix
|
||||
from .client_models import (
|
||||
BucketStats,
|
||||
CompressionEstimate,
|
||||
ListObjectsResponse,
|
||||
ObjectInfo,
|
||||
UploadSummary,
|
||||
)
|
||||
|
||||
# fmt: off - Keep all client_operations imports together
|
||||
from .client_operations import (
|
||||
create_bucket as _create_bucket,
|
||||
delete_bucket as _delete_bucket,
|
||||
download_batch as _download_batch,
|
||||
estimate_compression as _estimate_compression,
|
||||
find_similar_files as _find_similar_files,
|
||||
generate_presigned_post as _generate_presigned_post,
|
||||
generate_presigned_url as _generate_presigned_url,
|
||||
get_bucket_stats as _get_bucket_stats,
|
||||
get_object_info as _get_object_info,
|
||||
list_buckets as _list_buckets,
|
||||
upload_batch as _upload_batch,
|
||||
upload_chunked as _upload_chunked,
|
||||
)
|
||||
# fmt: on
|
||||
|
||||
from .core import DeltaService, DeltaSpace, ObjectKey
|
||||
from .core.errors import NotFoundError
|
||||
from .response_builders import (
|
||||
build_delete_response,
|
||||
build_get_response,
|
||||
build_list_objects_response,
|
||||
build_put_response,
|
||||
)
|
||||
from .types import CommonPrefix, S3Object
|
||||
|
||||
|
||||
class DeltaGliderClient:
|
||||
@@ -123,21 +151,33 @@ class DeltaGliderClient:
|
||||
# Calculate ETag from file content
|
||||
sha256_hash = self.service.hasher.sha256(tmp_path)
|
||||
|
||||
# Return boto3-compatible response with delta info
|
||||
return {
|
||||
"ETag": f'"{sha256_hash}"',
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
"DeltaGlider": {
|
||||
"original_size": summary.file_size,
|
||||
"stored_size": summary.delta_size or summary.file_size,
|
||||
"is_delta": summary.delta_size is not None,
|
||||
"compression_ratio": summary.delta_ratio or 1.0,
|
||||
"stored_as": summary.key,
|
||||
"operation": summary.operation,
|
||||
},
|
||||
# Build DeltaGlider compression info
|
||||
deltaglider_info: dict[str, Any] = {
|
||||
"OriginalSizeMB": summary.file_size / (1024 * 1024),
|
||||
"StoredSizeMB": (summary.delta_size or summary.file_size) / (1024 * 1024),
|
||||
"IsDelta": summary.delta_size is not None,
|
||||
"CompressionRatio": summary.delta_ratio or 1.0,
|
||||
"SavingsPercent": (
|
||||
(
|
||||
(summary.file_size - (summary.delta_size or summary.file_size))
|
||||
/ summary.file_size
|
||||
* 100
|
||||
)
|
||||
if summary.file_size > 0
|
||||
else 0.0
|
||||
),
|
||||
"StoredAs": summary.key,
|
||||
"Operation": summary.operation,
|
||||
}
|
||||
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
dict[str, Any],
|
||||
build_put_response(
|
||||
etag=f'"{sha256_hash}"',
|
||||
deltaglider_info=deltaglider_info,
|
||||
),
|
||||
)
|
||||
finally:
|
||||
# Clean up temp file
|
||||
if tmp_path.exists():
|
||||
@@ -173,19 +213,19 @@ class DeltaGliderClient:
|
||||
|
||||
# Get metadata
|
||||
obj_head = self.service.storage.head(f"{Bucket}/{Key}")
|
||||
file_size = tmp_path.stat().st_size
|
||||
etag = f'"{self.service.hasher.sha256(tmp_path)}"'
|
||||
|
||||
return {
|
||||
"Body": body, # File-like object
|
||||
"ContentLength": tmp_path.stat().st_size,
|
||||
"ContentType": obj_head.metadata.get("content_type", "binary/octet-stream")
|
||||
if obj_head
|
||||
else "binary/octet-stream",
|
||||
"ETag": f'"{self.service.hasher.sha256(tmp_path)}"',
|
||||
"Metadata": obj_head.metadata if obj_head else {},
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
dict[str, Any],
|
||||
build_get_response(
|
||||
body=body, # type: ignore[arg-type] # File object is compatible with bytes
|
||||
content_length=file_size,
|
||||
etag=etag,
|
||||
metadata=obj_head.metadata if obj_head else {},
|
||||
),
|
||||
)
|
||||
|
||||
def list_objects(
|
||||
self,
|
||||
@@ -197,7 +237,7 @@ class DeltaGliderClient:
|
||||
StartAfter: str | None = None,
|
||||
FetchMetadata: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> ListObjectsResponse:
|
||||
) -> dict[str, Any]:
|
||||
"""List objects in bucket with smart metadata fetching.
|
||||
|
||||
This method optimizes performance by:
|
||||
@@ -227,11 +267,11 @@ class DeltaGliderClient:
|
||||
# Fast listing for UI display (no metadata)
|
||||
response = client.list_objects(Bucket='releases', MaxKeys=100)
|
||||
|
||||
# Paginated listing
|
||||
# Paginated listing (boto3-compatible dict response)
|
||||
response = client.list_objects(
|
||||
Bucket='releases',
|
||||
MaxKeys=50,
|
||||
ContinuationToken=response.next_continuation_token
|
||||
ContinuationToken=response.get('NextContinuationToken')
|
||||
)
|
||||
|
||||
# Detailed listing with compression stats (slower, only for analytics)
|
||||
@@ -265,8 +305,8 @@ class DeltaGliderClient:
|
||||
"is_truncated": False,
|
||||
}
|
||||
|
||||
# Convert to ObjectInfo objects with smart metadata fetching
|
||||
contents = []
|
||||
# Convert to boto3-compatible S3Object TypedDicts (type-safe!)
|
||||
contents: list[S3Object] = []
|
||||
for obj in result.get("objects", []):
|
||||
# Skip reference.bin files (internal files, never exposed to users)
|
||||
if obj["key"].endswith("/reference.bin") or obj["key"] == "reference.bin":
|
||||
@@ -280,20 +320,12 @@ class DeltaGliderClient:
|
||||
if is_delta:
|
||||
display_key = display_key[:-6] # Remove .delta suffix
|
||||
|
||||
# Create object info with basic data (no HEAD request)
|
||||
info = ObjectInfo(
|
||||
key=display_key, # Use cleaned key without .delta
|
||||
size=obj["size"],
|
||||
last_modified=obj.get("last_modified", ""),
|
||||
etag=obj.get("etag"),
|
||||
storage_class=obj.get("storage_class", "STANDARD"),
|
||||
# DeltaGlider fields
|
||||
original_size=obj["size"], # For non-delta, original = stored
|
||||
compressed_size=obj["size"],
|
||||
is_delta=is_delta,
|
||||
compression_ratio=0.0 if not is_delta else None,
|
||||
reference_key=None,
|
||||
)
|
||||
# Build DeltaGlider metadata
|
||||
deltaglider_metadata: dict[str, str] = {
|
||||
"deltaglider-is-delta": str(is_delta).lower(),
|
||||
"deltaglider-original-size": str(obj["size"]),
|
||||
"deltaglider-compression-ratio": "0.0" if not is_delta else "unknown",
|
||||
}
|
||||
|
||||
# SMART METADATA FETCHING:
|
||||
# 1. NEVER fetch metadata for non-delta files (no point)
|
||||
@@ -304,30 +336,52 @@ class DeltaGliderClient:
|
||||
if obj_head and obj_head.metadata:
|
||||
metadata = obj_head.metadata
|
||||
# Update with actual compression stats
|
||||
info.original_size = int(metadata.get("file_size", obj["size"]))
|
||||
info.compression_ratio = float(metadata.get("compression_ratio", 0.0))
|
||||
info.reference_key = metadata.get("ref_key")
|
||||
original_size = int(metadata.get("file_size", obj["size"]))
|
||||
compression_ratio = float(metadata.get("compression_ratio", 0.0))
|
||||
reference_key = metadata.get("ref_key")
|
||||
|
||||
deltaglider_metadata["deltaglider-original-size"] = str(original_size)
|
||||
deltaglider_metadata["deltaglider-compression-ratio"] = str(
|
||||
compression_ratio
|
||||
)
|
||||
if reference_key:
|
||||
deltaglider_metadata["deltaglider-reference-key"] = reference_key
|
||||
except Exception as e:
|
||||
# Log but don't fail the listing
|
||||
self.service.logger.debug(f"Failed to fetch metadata for {obj['key']}: {e}")
|
||||
|
||||
contents.append(info)
|
||||
# Create boto3-compatible S3Object TypedDict - mypy validates structure!
|
||||
s3_obj: S3Object = {
|
||||
"Key": display_key, # Use cleaned key without .delta
|
||||
"Size": obj["size"],
|
||||
"LastModified": obj.get("last_modified", ""),
|
||||
"ETag": obj.get("etag"),
|
||||
"StorageClass": obj.get("storage_class", "STANDARD"),
|
||||
"Metadata": deltaglider_metadata,
|
||||
}
|
||||
contents.append(s3_obj)
|
||||
|
||||
# Build response with pagination support
|
||||
response = ListObjectsResponse(
|
||||
name=Bucket,
|
||||
prefix=Prefix,
|
||||
delimiter=Delimiter,
|
||||
max_keys=MaxKeys,
|
||||
contents=contents,
|
||||
common_prefixes=[{"Prefix": p} for p in result.get("common_prefixes", [])],
|
||||
is_truncated=result.get("is_truncated", False),
|
||||
next_continuation_token=result.get("next_continuation_token"),
|
||||
continuation_token=ContinuationToken,
|
||||
key_count=len(contents),
|
||||
# Build type-safe boto3-compatible CommonPrefix TypedDicts
|
||||
common_prefixes = result.get("common_prefixes", [])
|
||||
common_prefix_dicts: list[CommonPrefix] | None = (
|
||||
[CommonPrefix(Prefix=p) for p in common_prefixes] if common_prefixes else None
|
||||
)
|
||||
|
||||
return response
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
dict[str, Any],
|
||||
build_list_objects_response(
|
||||
bucket=Bucket,
|
||||
prefix=Prefix,
|
||||
delimiter=Delimiter,
|
||||
max_keys=MaxKeys,
|
||||
contents=contents,
|
||||
common_prefixes=common_prefix_dicts,
|
||||
is_truncated=result.get("is_truncated", False),
|
||||
next_continuation_token=result.get("next_continuation_token"),
|
||||
continuation_token=ContinuationToken,
|
||||
),
|
||||
)
|
||||
|
||||
def delete_object(
|
||||
self,
|
||||
@@ -347,32 +401,31 @@ class DeltaGliderClient:
|
||||
"""
|
||||
_, delete_result = delete_with_delta_suffix(self.service, Bucket, Key)
|
||||
|
||||
response = {
|
||||
"DeleteMarker": False,
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 204,
|
||||
},
|
||||
"DeltaGliderInfo": {
|
||||
"Type": delete_result.get("type"),
|
||||
"Deleted": delete_result.get("deleted", False),
|
||||
},
|
||||
# Build DeltaGlider-specific info
|
||||
deltaglider_info: dict[str, Any] = {
|
||||
"Type": delete_result.get("type"),
|
||||
"Deleted": delete_result.get("deleted", False),
|
||||
}
|
||||
|
||||
# Add warnings if any
|
||||
warnings = delete_result.get("warnings")
|
||||
if warnings:
|
||||
delta_info = response.get("DeltaGliderInfo")
|
||||
if delta_info and isinstance(delta_info, dict):
|
||||
delta_info["Warnings"] = warnings
|
||||
deltaglider_info["Warnings"] = warnings
|
||||
|
||||
# Add dependent delta count for references
|
||||
dependent_deltas = delete_result.get("dependent_deltas")
|
||||
if dependent_deltas:
|
||||
delta_info = response.get("DeltaGliderInfo")
|
||||
if delta_info and isinstance(delta_info, dict):
|
||||
delta_info["DependentDeltas"] = dependent_deltas
|
||||
deltaglider_info["DependentDeltas"] = dependent_deltas
|
||||
|
||||
return response
|
||||
# Return as dict[str, Any] for public API (TypedDict is a dict at runtime!)
|
||||
return cast(
|
||||
dict[str, Any],
|
||||
build_delete_response(
|
||||
delete_marker=False,
|
||||
status_code=204,
|
||||
deltaglider_info=deltaglider_info,
|
||||
),
|
||||
)
|
||||
|
||||
def delete_objects(
|
||||
self,
|
||||
@@ -760,40 +813,9 @@ class DeltaGliderClient:
|
||||
progress_callback=on_progress
|
||||
)
|
||||
"""
|
||||
file_path = Path(file_path)
|
||||
file_size = file_path.stat().st_size
|
||||
|
||||
# For small files, just use regular upload
|
||||
if file_size <= chunk_size:
|
||||
if progress_callback:
|
||||
progress_callback(1, 1, file_size, file_size)
|
||||
return self.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
|
||||
# Calculate chunks
|
||||
total_chunks = (file_size + chunk_size - 1) // chunk_size
|
||||
|
||||
# Create a temporary file for chunked processing
|
||||
# For now, we read the entire file but report progress in chunks
|
||||
# Future enhancement: implement true streaming upload in storage adapter
|
||||
bytes_read = 0
|
||||
|
||||
with open(file_path, "rb") as f:
|
||||
for chunk_num in range(1, total_chunks + 1):
|
||||
# Read chunk (simulated for progress reporting)
|
||||
chunk_data = f.read(chunk_size)
|
||||
bytes_read += len(chunk_data)
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(chunk_num, total_chunks, bytes_read, file_size)
|
||||
|
||||
# Perform the actual upload
|
||||
# TODO: When storage adapter supports streaming, pass chunks directly
|
||||
result = self.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
|
||||
# Final progress callback
|
||||
if progress_callback:
|
||||
progress_callback(total_chunks, total_chunks, file_size, file_size)
|
||||
|
||||
result: UploadSummary = _upload_chunked(
|
||||
self, file_path, s3_url, chunk_size, progress_callback, max_ratio
|
||||
)
|
||||
return result
|
||||
|
||||
def upload_batch(
|
||||
@@ -814,20 +836,7 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
List of UploadSummary objects
|
||||
"""
|
||||
results = []
|
||||
|
||||
for i, file_path in enumerate(files):
|
||||
file_path = Path(file_path)
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(file_path.name, i + 1, len(files))
|
||||
|
||||
# Upload each file
|
||||
s3_url = f"{s3_prefix.rstrip('/')}/{file_path.name}"
|
||||
summary = self.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
results.append(summary)
|
||||
|
||||
return results
|
||||
return _upload_batch(self, files, s3_prefix, max_ratio, progress_callback)
|
||||
|
||||
def download_batch(
|
||||
self,
|
||||
@@ -845,24 +854,7 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
List of downloaded file paths
|
||||
"""
|
||||
output_dir = Path(output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
results = []
|
||||
|
||||
for i, s3_url in enumerate(s3_urls):
|
||||
# Extract filename from URL
|
||||
filename = s3_url.split("/")[-1]
|
||||
if filename.endswith(".delta"):
|
||||
filename = filename[:-6] # Remove .delta suffix
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(filename, i + 1, len(s3_urls))
|
||||
|
||||
output_path = output_dir / filename
|
||||
self.download(s3_url, output_path)
|
||||
results.append(output_path)
|
||||
|
||||
return results
|
||||
return _download_batch(self, s3_urls, output_dir, progress_callback)
|
||||
|
||||
def estimate_compression(
|
||||
self,
|
||||
@@ -882,80 +874,10 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
CompressionEstimate with predicted compression
|
||||
"""
|
||||
file_path = Path(file_path)
|
||||
file_size = file_path.stat().st_size
|
||||
|
||||
# Check file extension
|
||||
ext = file_path.suffix.lower()
|
||||
delta_extensions = {
|
||||
".zip",
|
||||
".tar",
|
||||
".gz",
|
||||
".tar.gz",
|
||||
".tgz",
|
||||
".bz2",
|
||||
".tar.bz2",
|
||||
".xz",
|
||||
".tar.xz",
|
||||
".7z",
|
||||
".rar",
|
||||
".dmg",
|
||||
".iso",
|
||||
".pkg",
|
||||
".deb",
|
||||
".rpm",
|
||||
".apk",
|
||||
".jar",
|
||||
".war",
|
||||
".ear",
|
||||
}
|
||||
|
||||
# Already compressed formats that won't benefit from delta
|
||||
incompressible = {".jpg", ".jpeg", ".png", ".mp4", ".mp3", ".avi", ".mov"}
|
||||
|
||||
if ext in incompressible:
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=file_size,
|
||||
estimated_ratio=0.0,
|
||||
confidence=0.95,
|
||||
should_use_delta=False,
|
||||
)
|
||||
|
||||
if ext not in delta_extensions:
|
||||
# Unknown type, conservative estimate
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=file_size,
|
||||
estimated_ratio=0.0,
|
||||
confidence=0.5,
|
||||
should_use_delta=file_size > 1024 * 1024, # Only for files > 1MB
|
||||
)
|
||||
|
||||
# Look for similar files in the target location
|
||||
similar_files = self.find_similar_files(bucket, prefix, file_path.name)
|
||||
|
||||
if similar_files:
|
||||
# If we have similar files, estimate high compression
|
||||
estimated_ratio = 0.99 # 99% compression typical for similar versions
|
||||
confidence = 0.9
|
||||
recommended_ref = similar_files[0]["Key"] if similar_files else None
|
||||
else:
|
||||
# First file of its type
|
||||
estimated_ratio = 0.0
|
||||
confidence = 0.7
|
||||
recommended_ref = None
|
||||
|
||||
estimated_size = int(file_size * (1 - estimated_ratio))
|
||||
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=estimated_size,
|
||||
estimated_ratio=estimated_ratio,
|
||||
confidence=confidence,
|
||||
recommended_reference=recommended_ref,
|
||||
should_use_delta=True,
|
||||
result: CompressionEstimate = _estimate_compression(
|
||||
self, file_path, bucket, prefix, sample_size
|
||||
)
|
||||
return result
|
||||
|
||||
def find_similar_files(
|
||||
self,
|
||||
@@ -975,56 +897,7 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
List of similar files with scores
|
||||
"""
|
||||
# List objects in the prefix (no metadata needed for similarity check)
|
||||
response = self.list_objects(
|
||||
Bucket=bucket,
|
||||
Prefix=prefix,
|
||||
MaxKeys=1000,
|
||||
FetchMetadata=False, # Don't need metadata for similarity
|
||||
)
|
||||
|
||||
similar: list[dict[str, Any]] = []
|
||||
base_name = Path(filename).stem
|
||||
ext = Path(filename).suffix
|
||||
|
||||
for obj in response.contents:
|
||||
obj_base = Path(obj.key).stem
|
||||
obj_ext = Path(obj.key).suffix
|
||||
|
||||
# Skip delta files and references
|
||||
if obj.key.endswith(".delta") or obj.key.endswith("reference.bin"):
|
||||
continue
|
||||
|
||||
score = 0.0
|
||||
|
||||
# Extension match
|
||||
if ext == obj_ext:
|
||||
score += 0.5
|
||||
|
||||
# Base name similarity
|
||||
if base_name in obj_base or obj_base in base_name:
|
||||
score += 0.3
|
||||
|
||||
# Version pattern match
|
||||
import re
|
||||
|
||||
if re.search(r"v?\d+[\.\d]*", base_name) and re.search(r"v?\d+[\.\d]*", obj_base):
|
||||
score += 0.2
|
||||
|
||||
if score > 0.5:
|
||||
similar.append(
|
||||
{
|
||||
"Key": obj.key,
|
||||
"Size": obj.size,
|
||||
"Similarity": score,
|
||||
"LastModified": obj.last_modified,
|
||||
}
|
||||
)
|
||||
|
||||
# Sort by similarity
|
||||
similar.sort(key=lambda x: x["Similarity"], reverse=True) # type: ignore
|
||||
|
||||
return similar[:limit]
|
||||
return _find_similar_files(self, bucket, prefix, filename, limit)
|
||||
|
||||
def get_object_info(self, s3_url: str) -> ObjectInfo:
|
||||
"""Get detailed object information including compression stats.
|
||||
@@ -1035,34 +908,8 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
ObjectInfo with detailed metadata
|
||||
"""
|
||||
# Parse URL
|
||||
if not s3_url.startswith("s3://"):
|
||||
raise ValueError(f"Invalid S3 URL: {s3_url}")
|
||||
|
||||
s3_path = s3_url[5:]
|
||||
parts = s3_path.split("/", 1)
|
||||
bucket = parts[0]
|
||||
key = parts[1] if len(parts) > 1 else ""
|
||||
|
||||
# Get object metadata
|
||||
obj_head = self.service.storage.head(f"{bucket}/{key}")
|
||||
if not obj_head:
|
||||
raise FileNotFoundError(f"Object not found: {s3_url}")
|
||||
|
||||
metadata = obj_head.metadata
|
||||
is_delta = key.endswith(".delta")
|
||||
|
||||
return ObjectInfo(
|
||||
key=key,
|
||||
size=obj_head.size,
|
||||
last_modified=metadata.get("last_modified", ""),
|
||||
etag=metadata.get("etag"),
|
||||
original_size=int(metadata.get("file_size", obj_head.size)),
|
||||
compressed_size=obj_head.size,
|
||||
compression_ratio=float(metadata.get("compression_ratio", 0.0)),
|
||||
is_delta=is_delta,
|
||||
reference_key=metadata.get("ref_key"),
|
||||
)
|
||||
result: ObjectInfo = _get_object_info(self, s3_url)
|
||||
return result
|
||||
|
||||
def get_bucket_stats(self, bucket: str, detailed_stats: bool = False) -> BucketStats:
|
||||
"""Get statistics for a bucket with optional detailed compression metrics.
|
||||
@@ -1091,76 +938,8 @@ class DeltaGliderClient:
|
||||
stats = client.get_bucket_stats('releases', detailed_stats=True)
|
||||
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
|
||||
"""
|
||||
# List all objects with smart metadata fetching
|
||||
all_objects = []
|
||||
continuation_token = None
|
||||
|
||||
while True:
|
||||
response = self.list_objects(
|
||||
Bucket=bucket,
|
||||
MaxKeys=1000,
|
||||
ContinuationToken=continuation_token,
|
||||
FetchMetadata=detailed_stats, # Only fetch metadata if detailed stats requested
|
||||
)
|
||||
|
||||
all_objects.extend(response.contents)
|
||||
|
||||
if not response.is_truncated:
|
||||
break
|
||||
|
||||
continuation_token = response.next_continuation_token
|
||||
|
||||
# Calculate statistics
|
||||
total_size = 0
|
||||
compressed_size = 0
|
||||
delta_count = 0
|
||||
direct_count = 0
|
||||
|
||||
for obj in all_objects:
|
||||
compressed_size += obj.size
|
||||
|
||||
if obj.is_delta:
|
||||
delta_count += 1
|
||||
# Use actual original size if we have it, otherwise estimate
|
||||
total_size += obj.original_size or obj.size
|
||||
else:
|
||||
direct_count += 1
|
||||
# For non-delta files, original equals compressed
|
||||
total_size += obj.size
|
||||
|
||||
space_saved = total_size - compressed_size
|
||||
avg_ratio = (space_saved / total_size) if total_size > 0 else 0.0
|
||||
|
||||
return BucketStats(
|
||||
bucket=bucket,
|
||||
object_count=len(all_objects),
|
||||
total_size=total_size,
|
||||
compressed_size=compressed_size,
|
||||
space_saved=space_saved,
|
||||
average_compression_ratio=avg_ratio,
|
||||
delta_objects=delta_count,
|
||||
direct_objects=direct_count,
|
||||
)
|
||||
|
||||
def _try_boto3_presigned_operation(self, operation: str, **kwargs: Any) -> Any | None:
|
||||
"""Try to generate presigned operation using boto3 client, return None if not available."""
|
||||
storage_adapter = self.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
if operation == "url":
|
||||
return str(storage_adapter.client.generate_presigned_url(**kwargs))
|
||||
elif operation == "post":
|
||||
return dict(storage_adapter.client.generate_presigned_post(**kwargs))
|
||||
except AttributeError:
|
||||
# storage_adapter does not have a 'client' attribute
|
||||
pass
|
||||
except Exception as e:
|
||||
# Fall back to manual construction if needed
|
||||
self.service.logger.warning(f"Failed to generate presigned {operation}: {e}")
|
||||
|
||||
return None
|
||||
result: BucketStats = _get_bucket_stats(self, bucket, detailed_stats)
|
||||
return result
|
||||
|
||||
def generate_presigned_url(
|
||||
self,
|
||||
@@ -1178,28 +957,7 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
Presigned URL string
|
||||
"""
|
||||
# Try boto3 first, fallback to manual construction
|
||||
url = self._try_boto3_presigned_operation(
|
||||
"url",
|
||||
ClientMethod=ClientMethod,
|
||||
Params=Params,
|
||||
ExpiresIn=ExpiresIn,
|
||||
)
|
||||
if url is not None:
|
||||
return str(url)
|
||||
|
||||
# Fallback: construct URL manually (less secure, for dev/testing only)
|
||||
bucket = Params.get("Bucket", "")
|
||||
key = Params.get("Key", "")
|
||||
|
||||
if self.endpoint_url:
|
||||
base_url = self.endpoint_url
|
||||
else:
|
||||
base_url = f"https://{bucket}.s3.amazonaws.com"
|
||||
|
||||
# Warning: This is not a real presigned URL, just a placeholder
|
||||
self.service.logger.warning("Using placeholder presigned URL - not suitable for production")
|
||||
return f"{base_url}/{key}?expires={ExpiresIn}"
|
||||
return _generate_presigned_url(self, ClientMethod, Params, ExpiresIn)
|
||||
|
||||
def generate_presigned_post(
|
||||
self,
|
||||
@@ -1221,31 +979,7 @@ class DeltaGliderClient:
|
||||
Returns:
|
||||
Dict with 'url' and 'fields' for form submission
|
||||
"""
|
||||
# Try boto3 first, fallback to manual construction
|
||||
response = self._try_boto3_presigned_operation(
|
||||
"post",
|
||||
Bucket=Bucket,
|
||||
Key=Key,
|
||||
Fields=Fields,
|
||||
Conditions=Conditions,
|
||||
ExpiresIn=ExpiresIn,
|
||||
)
|
||||
if response is not None:
|
||||
return dict(response)
|
||||
|
||||
# Fallback: return minimal structure for compatibility
|
||||
if self.endpoint_url:
|
||||
url = f"{self.endpoint_url}/{Bucket}"
|
||||
else:
|
||||
url = f"https://{Bucket}.s3.amazonaws.com"
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"fields": {
|
||||
"key": Key,
|
||||
**(Fields or {}),
|
||||
},
|
||||
}
|
||||
return _generate_presigned_post(self, Bucket, Key, Fields, Conditions, ExpiresIn)
|
||||
|
||||
# ============================================================================
|
||||
# Bucket Management APIs (boto3-compatible)
|
||||
@@ -1276,36 +1010,7 @@ class DeltaGliderClient:
|
||||
... CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
|
||||
... )
|
||||
"""
|
||||
storage_adapter = self.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
params: dict[str, Any] = {"Bucket": Bucket}
|
||||
if CreateBucketConfiguration:
|
||||
params["CreateBucketConfiguration"] = CreateBucketConfiguration
|
||||
|
||||
response = storage_adapter.client.create_bucket(**params)
|
||||
return {
|
||||
"Location": response.get("Location", f"/{Bucket}"),
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
if "BucketAlreadyExists" in error_msg or "BucketAlreadyOwnedByYou" in error_msg:
|
||||
# Bucket already exists - return success
|
||||
self.service.logger.debug(f"Bucket {Bucket} already exists")
|
||||
return {
|
||||
"Location": f"/{Bucket}",
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
raise RuntimeError(f"Failed to create bucket: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket creation")
|
||||
return _create_bucket(self, Bucket, CreateBucketConfiguration, **kwargs)
|
||||
|
||||
def delete_bucket(
|
||||
self,
|
||||
@@ -1327,30 +1032,7 @@ class DeltaGliderClient:
|
||||
>>> client = create_client()
|
||||
>>> client.delete_bucket(Bucket='my-bucket')
|
||||
"""
|
||||
storage_adapter = self.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
storage_adapter.client.delete_bucket(Bucket=Bucket)
|
||||
return {
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 204,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
if "NoSuchBucket" in error_msg:
|
||||
# Bucket doesn't exist - return success
|
||||
self.service.logger.debug(f"Bucket {Bucket} does not exist")
|
||||
return {
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 204,
|
||||
},
|
||||
}
|
||||
raise RuntimeError(f"Failed to delete bucket: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket deletion")
|
||||
return _delete_bucket(self, Bucket, **kwargs)
|
||||
|
||||
def list_buckets(self, **kwargs: Any) -> dict[str, Any]:
|
||||
"""List all S3 buckets (boto3-compatible).
|
||||
@@ -1367,23 +1049,7 @@ class DeltaGliderClient:
|
||||
>>> for bucket in response['Buckets']:
|
||||
... print(bucket['Name'])
|
||||
"""
|
||||
storage_adapter = self.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
response = storage_adapter.client.list_buckets()
|
||||
return {
|
||||
"Buckets": response.get("Buckets", []),
|
||||
"Owner": response.get("Owner", {}),
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Failed to list buckets: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket listing")
|
||||
return _list_buckets(self, **kwargs)
|
||||
|
||||
def _parse_tagging(self, tagging: str) -> dict[str, str]:
|
||||
"""Parse URL-encoded tagging string to dict."""
|
||||
@@ -1399,7 +1065,6 @@ class DeltaGliderClient:
|
||||
def create_client(
|
||||
endpoint_url: str | None = None,
|
||||
log_level: str = "INFO",
|
||||
cache_dir: str = "/tmp/.deltaglider/cache",
|
||||
aws_access_key_id: str | None = None,
|
||||
aws_secret_access_key: str | None = None,
|
||||
aws_session_token: str | None = None,
|
||||
@@ -1414,11 +1079,11 @@ def create_client(
|
||||
- Compression estimation
|
||||
- Progress callbacks for large uploads
|
||||
- Detailed object and bucket statistics
|
||||
- Secure ephemeral cache (process-isolated, auto-cleanup)
|
||||
|
||||
Args:
|
||||
endpoint_url: Optional S3 endpoint URL (for MinIO, R2, etc.)
|
||||
log_level: Logging level
|
||||
cache_dir: Directory for reference cache
|
||||
aws_access_key_id: AWS access key ID (None to use environment/IAM)
|
||||
aws_secret_access_key: AWS secret access key (None to use environment/IAM)
|
||||
aws_session_token: AWS session token for temporary credentials (None if not using)
|
||||
@@ -1450,7 +1115,9 @@ def create_client(
|
||||
"""
|
||||
# Import here to avoid circular dependency
|
||||
from .adapters import (
|
||||
FsCacheAdapter,
|
||||
ContentAddressedCache,
|
||||
EncryptedCache,
|
||||
MemoryCache,
|
||||
NoopMetricsAdapter,
|
||||
S3StorageAdapter,
|
||||
Sha256Adapter,
|
||||
@@ -1459,6 +1126,11 @@ def create_client(
|
||||
XdeltaAdapter,
|
||||
)
|
||||
|
||||
# SECURITY: Always use ephemeral process-isolated cache
|
||||
cache_dir = Path(tempfile.mkdtemp(prefix="deltaglider-", dir="/tmp"))
|
||||
# Register cleanup handler to remove cache on exit
|
||||
atexit.register(lambda: shutil.rmtree(cache_dir, ignore_errors=True))
|
||||
|
||||
# Build boto3 client kwargs
|
||||
boto3_kwargs = {}
|
||||
if aws_access_key_id is not None:
|
||||
@@ -1474,13 +1146,29 @@ def create_client(
|
||||
hasher = Sha256Adapter()
|
||||
storage = S3StorageAdapter(endpoint_url=endpoint_url, boto3_kwargs=boto3_kwargs)
|
||||
diff = XdeltaAdapter()
|
||||
cache = FsCacheAdapter(Path(cache_dir), hasher)
|
||||
|
||||
# SECURITY: Configurable cache with encryption and backend selection
|
||||
from .ports.cache import CachePort
|
||||
|
||||
cache_backend = os.environ.get("DG_CACHE_BACKEND", "filesystem") # Options: filesystem, memory
|
||||
base_cache: CachePort
|
||||
if cache_backend == "memory":
|
||||
max_size_mb = int(os.environ.get("DG_CACHE_MEMORY_SIZE_MB", "100"))
|
||||
base_cache = MemoryCache(hasher, max_size_mb=max_size_mb, temp_dir=cache_dir)
|
||||
else:
|
||||
# Filesystem-backed with Content-Addressed Storage
|
||||
base_cache = ContentAddressedCache(cache_dir, hasher)
|
||||
|
||||
# Always apply encryption with ephemeral keys (security hardening)
|
||||
# Encryption key is optional via DG_CACHE_ENCRYPTION_KEY (ephemeral if not set)
|
||||
cache: CachePort = EncryptedCache.from_env(base_cache)
|
||||
|
||||
clock = UtcClockAdapter()
|
||||
logger = StdLoggerAdapter(level=log_level)
|
||||
metrics = NoopMetricsAdapter()
|
||||
|
||||
# Get default values
|
||||
tool_version = kwargs.pop("tool_version", "deltaglider/0.2.0")
|
||||
tool_version = kwargs.pop("tool_version", "deltaglider/5.0.0")
|
||||
max_ratio = kwargs.pop("max_ratio", 0.5)
|
||||
|
||||
# Create service
|
||||
|
||||
37
src/deltaglider/client_operations/__init__.py
Normal file
37
src/deltaglider/client_operations/__init__.py
Normal file
@@ -0,0 +1,37 @@
|
||||
"""Client operation modules for DeltaGliderClient.
|
||||
|
||||
This package contains modular operation implementations:
|
||||
- bucket: S3 bucket management (create, delete, list)
|
||||
- presigned: Presigned URL generation for temporary access
|
||||
- batch: Batch upload/download operations
|
||||
- stats: Statistics and analytics operations
|
||||
"""
|
||||
|
||||
from .batch import download_batch, upload_batch, upload_chunked
|
||||
from .bucket import create_bucket, delete_bucket, list_buckets
|
||||
from .presigned import generate_presigned_post, generate_presigned_url
|
||||
from .stats import (
|
||||
estimate_compression,
|
||||
find_similar_files,
|
||||
get_bucket_stats,
|
||||
get_object_info,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Bucket operations
|
||||
"create_bucket",
|
||||
"delete_bucket",
|
||||
"list_buckets",
|
||||
# Presigned operations
|
||||
"generate_presigned_url",
|
||||
"generate_presigned_post",
|
||||
# Batch operations
|
||||
"upload_chunked",
|
||||
"upload_batch",
|
||||
"download_batch",
|
||||
# Stats operations
|
||||
"get_bucket_stats",
|
||||
"get_object_info",
|
||||
"estimate_compression",
|
||||
"find_similar_files",
|
||||
]
|
||||
159
src/deltaglider/client_operations/batch.py
Normal file
159
src/deltaglider/client_operations/batch.py
Normal file
@@ -0,0 +1,159 @@
|
||||
"""Batch upload/download operations for DeltaGlider client.
|
||||
|
||||
This module contains DeltaGlider-specific batch operations:
|
||||
- upload_batch
|
||||
- download_batch
|
||||
- upload_chunked
|
||||
"""
|
||||
|
||||
from collections.abc import Callable
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from ..client_models import UploadSummary
|
||||
|
||||
|
||||
def upload_chunked(
|
||||
client: Any, # DeltaGliderClient
|
||||
file_path: str | Path,
|
||||
s3_url: str,
|
||||
chunk_size: int = 5 * 1024 * 1024,
|
||||
progress_callback: Callable[[int, int, int, int], None] | None = None,
|
||||
max_ratio: float = 0.5,
|
||||
) -> UploadSummary:
|
||||
"""Upload a file in chunks with progress callback.
|
||||
|
||||
This method reads the file in chunks to avoid loading large files entirely into memory,
|
||||
making it suitable for uploading very large files. Progress is reported after each chunk.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
file_path: Local file to upload
|
||||
s3_url: S3 destination URL (s3://bucket/path/filename)
|
||||
chunk_size: Size of each chunk in bytes (default 5MB)
|
||||
progress_callback: Callback(chunk_number, total_chunks, bytes_sent, total_bytes)
|
||||
max_ratio: Maximum acceptable delta/file ratio for compression
|
||||
|
||||
Returns:
|
||||
UploadSummary with compression statistics
|
||||
|
||||
Example:
|
||||
def on_progress(chunk_num, total_chunks, bytes_sent, total_bytes):
|
||||
percent = (bytes_sent / total_bytes) * 100
|
||||
print(f"Upload progress: {percent:.1f}%")
|
||||
|
||||
client.upload_chunked(
|
||||
"large_file.zip",
|
||||
"s3://bucket/releases/large_file.zip",
|
||||
chunk_size=10 * 1024 * 1024, # 10MB chunks
|
||||
progress_callback=on_progress
|
||||
)
|
||||
"""
|
||||
file_path = Path(file_path)
|
||||
file_size = file_path.stat().st_size
|
||||
|
||||
# For small files, just use regular upload
|
||||
if file_size <= chunk_size:
|
||||
if progress_callback:
|
||||
progress_callback(1, 1, file_size, file_size)
|
||||
result: UploadSummary = client.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
return result
|
||||
|
||||
# Calculate chunks
|
||||
total_chunks = (file_size + chunk_size - 1) // chunk_size
|
||||
|
||||
# Create a temporary file for chunked processing
|
||||
# For now, we read the entire file but report progress in chunks
|
||||
# Future enhancement: implement true streaming upload in storage adapter
|
||||
bytes_read = 0
|
||||
|
||||
with open(file_path, "rb") as f:
|
||||
for chunk_num in range(1, total_chunks + 1):
|
||||
# Read chunk (simulated for progress reporting)
|
||||
chunk_data = f.read(chunk_size)
|
||||
bytes_read += len(chunk_data)
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(chunk_num, total_chunks, bytes_read, file_size)
|
||||
|
||||
# Perform the actual upload
|
||||
# TODO: When storage adapter supports streaming, pass chunks directly
|
||||
upload_result: UploadSummary = client.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
|
||||
# Final progress callback
|
||||
if progress_callback:
|
||||
progress_callback(total_chunks, total_chunks, file_size, file_size)
|
||||
|
||||
return upload_result
|
||||
|
||||
|
||||
def upload_batch(
|
||||
client: Any, # DeltaGliderClient
|
||||
files: list[str | Path],
|
||||
s3_prefix: str,
|
||||
max_ratio: float = 0.5,
|
||||
progress_callback: Callable[[str, int, int], None] | None = None,
|
||||
) -> list[UploadSummary]:
|
||||
"""Upload multiple files in batch.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
files: List of local file paths
|
||||
s3_prefix: S3 destination prefix (s3://bucket/prefix/)
|
||||
max_ratio: Maximum acceptable delta/file ratio
|
||||
progress_callback: Callback(filename, current_file_index, total_files)
|
||||
|
||||
Returns:
|
||||
List of UploadSummary objects
|
||||
"""
|
||||
results = []
|
||||
|
||||
for i, file_path in enumerate(files):
|
||||
file_path = Path(file_path)
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(file_path.name, i + 1, len(files))
|
||||
|
||||
# Upload each file
|
||||
s3_url = f"{s3_prefix.rstrip('/')}/{file_path.name}"
|
||||
summary = client.upload(file_path, s3_url, max_ratio=max_ratio)
|
||||
results.append(summary)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def download_batch(
|
||||
client: Any, # DeltaGliderClient
|
||||
s3_urls: list[str],
|
||||
output_dir: str | Path,
|
||||
progress_callback: Callable[[str, int, int], None] | None = None,
|
||||
) -> list[Path]:
|
||||
"""Download multiple files in batch.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
s3_urls: List of S3 URLs to download
|
||||
output_dir: Local directory to save files
|
||||
progress_callback: Callback(filename, current_file_index, total_files)
|
||||
|
||||
Returns:
|
||||
List of downloaded file paths
|
||||
"""
|
||||
output_dir = Path(output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
results = []
|
||||
|
||||
for i, s3_url in enumerate(s3_urls):
|
||||
# Extract filename from URL
|
||||
filename = s3_url.split("/")[-1]
|
||||
if filename.endswith(".delta"):
|
||||
filename = filename[:-6] # Remove .delta suffix
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(filename, i + 1, len(s3_urls))
|
||||
|
||||
output_path = output_dir / filename
|
||||
client.download(s3_url, output_path)
|
||||
results.append(output_path)
|
||||
|
||||
return results
|
||||
152
src/deltaglider/client_operations/bucket.py
Normal file
152
src/deltaglider/client_operations/bucket.py
Normal file
@@ -0,0 +1,152 @@
|
||||
"""Bucket management operations for DeltaGlider client.
|
||||
|
||||
This module contains boto3-compatible bucket operations:
|
||||
- create_bucket
|
||||
- delete_bucket
|
||||
- list_buckets
|
||||
"""
|
||||
|
||||
from typing import Any
|
||||
|
||||
|
||||
def create_bucket(
|
||||
client: Any, # DeltaGliderClient (avoiding circular import)
|
||||
Bucket: str,
|
||||
CreateBucketConfiguration: dict[str, str] | None = None,
|
||||
**kwargs: Any,
|
||||
) -> dict[str, Any]:
|
||||
"""Create an S3 bucket (boto3-compatible).
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
Bucket: Bucket name to create
|
||||
CreateBucketConfiguration: Optional bucket configuration (e.g., LocationConstraint)
|
||||
**kwargs: Additional S3 parameters (for compatibility)
|
||||
|
||||
Returns:
|
||||
Response dict with bucket location
|
||||
|
||||
Example:
|
||||
>>> client = create_client()
|
||||
>>> client.create_bucket(Bucket='my-bucket')
|
||||
>>> # With region
|
||||
>>> client.create_bucket(
|
||||
... Bucket='my-bucket',
|
||||
... CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
|
||||
... )
|
||||
"""
|
||||
storage_adapter = client.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
params: dict[str, Any] = {"Bucket": Bucket}
|
||||
if CreateBucketConfiguration:
|
||||
params["CreateBucketConfiguration"] = CreateBucketConfiguration
|
||||
|
||||
response = storage_adapter.client.create_bucket(**params)
|
||||
return {
|
||||
"Location": response.get("Location", f"/{Bucket}"),
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
if "BucketAlreadyExists" in error_msg or "BucketAlreadyOwnedByYou" in error_msg:
|
||||
# Bucket already exists - return success
|
||||
client.service.logger.debug(f"Bucket {Bucket} already exists")
|
||||
return {
|
||||
"Location": f"/{Bucket}",
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
raise RuntimeError(f"Failed to create bucket: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket creation")
|
||||
|
||||
|
||||
def delete_bucket(
|
||||
client: Any, # DeltaGliderClient
|
||||
Bucket: str,
|
||||
**kwargs: Any,
|
||||
) -> dict[str, Any]:
|
||||
"""Delete an S3 bucket (boto3-compatible).
|
||||
|
||||
Note: Bucket must be empty before deletion.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
Bucket: Bucket name to delete
|
||||
**kwargs: Additional S3 parameters (for compatibility)
|
||||
|
||||
Returns:
|
||||
Response dict with deletion status
|
||||
|
||||
Example:
|
||||
>>> client = create_client()
|
||||
>>> client.delete_bucket(Bucket='my-bucket')
|
||||
"""
|
||||
storage_adapter = client.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
storage_adapter.client.delete_bucket(Bucket=Bucket)
|
||||
return {
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 204,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
if "NoSuchBucket" in error_msg:
|
||||
# Bucket doesn't exist - return success
|
||||
client.service.logger.debug(f"Bucket {Bucket} does not exist")
|
||||
return {
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 204,
|
||||
},
|
||||
}
|
||||
raise RuntimeError(f"Failed to delete bucket: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket deletion")
|
||||
|
||||
|
||||
def list_buckets(
|
||||
client: Any, # DeltaGliderClient
|
||||
**kwargs: Any,
|
||||
) -> dict[str, Any]:
|
||||
"""List all S3 buckets (boto3-compatible).
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
**kwargs: Additional S3 parameters (for compatibility)
|
||||
|
||||
Returns:
|
||||
Response dict with bucket list
|
||||
|
||||
Example:
|
||||
>>> client = create_client()
|
||||
>>> response = client.list_buckets()
|
||||
>>> for bucket in response['Buckets']:
|
||||
... print(bucket['Name'])
|
||||
"""
|
||||
storage_adapter = client.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
response = storage_adapter.client.list_buckets()
|
||||
return {
|
||||
"Buckets": response.get("Buckets", []),
|
||||
"Owner": response.get("Owner", {}),
|
||||
"ResponseMetadata": {
|
||||
"HTTPStatusCode": 200,
|
||||
},
|
||||
}
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Failed to list buckets: {e}") from e
|
||||
else:
|
||||
raise NotImplementedError("Storage adapter does not support bucket listing")
|
||||
124
src/deltaglider/client_operations/presigned.py
Normal file
124
src/deltaglider/client_operations/presigned.py
Normal file
@@ -0,0 +1,124 @@
|
||||
"""Presigned URL operations for DeltaGlider client.
|
||||
|
||||
This module contains boto3-compatible presigned URL operations:
|
||||
- generate_presigned_url
|
||||
- generate_presigned_post
|
||||
"""
|
||||
|
||||
from typing import Any
|
||||
|
||||
|
||||
def try_boto3_presigned_operation(
|
||||
client: Any, # DeltaGliderClient
|
||||
operation: str,
|
||||
**kwargs: Any,
|
||||
) -> Any | None:
|
||||
"""Try to generate presigned operation using boto3 client, return None if not available."""
|
||||
storage_adapter = client.service.storage
|
||||
|
||||
# Check if storage adapter has boto3 client
|
||||
if hasattr(storage_adapter, "client"):
|
||||
try:
|
||||
if operation == "url":
|
||||
return str(storage_adapter.client.generate_presigned_url(**kwargs))
|
||||
elif operation == "post":
|
||||
return dict(storage_adapter.client.generate_presigned_post(**kwargs))
|
||||
except AttributeError:
|
||||
# storage_adapter does not have a 'client' attribute
|
||||
pass
|
||||
except Exception as e:
|
||||
# Fall back to manual construction if needed
|
||||
client.service.logger.warning(f"Failed to generate presigned {operation}: {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def generate_presigned_url(
|
||||
client: Any, # DeltaGliderClient
|
||||
ClientMethod: str,
|
||||
Params: dict[str, Any],
|
||||
ExpiresIn: int = 3600,
|
||||
) -> str:
|
||||
"""Generate presigned URL (boto3-compatible).
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
ClientMethod: Method name ('get_object' or 'put_object')
|
||||
Params: Parameters dict with Bucket and Key
|
||||
ExpiresIn: URL expiration in seconds
|
||||
|
||||
Returns:
|
||||
Presigned URL string
|
||||
"""
|
||||
# Try boto3 first, fallback to manual construction
|
||||
url = try_boto3_presigned_operation(
|
||||
client,
|
||||
"url",
|
||||
ClientMethod=ClientMethod,
|
||||
Params=Params,
|
||||
ExpiresIn=ExpiresIn,
|
||||
)
|
||||
if url is not None:
|
||||
return str(url)
|
||||
|
||||
# Fallback: construct URL manually (less secure, for dev/testing only)
|
||||
bucket = Params.get("Bucket", "")
|
||||
key = Params.get("Key", "")
|
||||
|
||||
if client.endpoint_url:
|
||||
base_url = client.endpoint_url
|
||||
else:
|
||||
base_url = f"https://{bucket}.s3.amazonaws.com"
|
||||
|
||||
# Warning: This is not a real presigned URL, just a placeholder
|
||||
client.service.logger.warning("Using placeholder presigned URL - not suitable for production")
|
||||
return f"{base_url}/{key}?expires={ExpiresIn}"
|
||||
|
||||
|
||||
def generate_presigned_post(
|
||||
client: Any, # DeltaGliderClient
|
||||
Bucket: str,
|
||||
Key: str,
|
||||
Fields: dict[str, str] | None = None,
|
||||
Conditions: list[Any] | None = None,
|
||||
ExpiresIn: int = 3600,
|
||||
) -> dict[str, Any]:
|
||||
"""Generate presigned POST data for HTML forms (boto3-compatible).
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
Bucket: S3 bucket name
|
||||
Key: Object key
|
||||
Fields: Additional fields to include
|
||||
Conditions: Upload conditions
|
||||
ExpiresIn: URL expiration in seconds
|
||||
|
||||
Returns:
|
||||
Dict with 'url' and 'fields' for form submission
|
||||
"""
|
||||
# Try boto3 first, fallback to manual construction
|
||||
response = try_boto3_presigned_operation(
|
||||
client,
|
||||
"post",
|
||||
Bucket=Bucket,
|
||||
Key=Key,
|
||||
Fields=Fields,
|
||||
Conditions=Conditions,
|
||||
ExpiresIn=ExpiresIn,
|
||||
)
|
||||
if response is not None:
|
||||
return dict(response)
|
||||
|
||||
# Fallback: return minimal structure for compatibility
|
||||
if client.endpoint_url:
|
||||
url = f"{client.endpoint_url}/{Bucket}"
|
||||
else:
|
||||
url = f"https://{Bucket}.s3.amazonaws.com"
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"fields": {
|
||||
"key": Key,
|
||||
**(Fields or {}),
|
||||
},
|
||||
}
|
||||
337
src/deltaglider/client_operations/stats.py
Normal file
337
src/deltaglider/client_operations/stats.py
Normal file
@@ -0,0 +1,337 @@
|
||||
"""Statistics and analysis operations for DeltaGlider client.
|
||||
|
||||
This module contains DeltaGlider-specific statistics operations:
|
||||
- get_bucket_stats
|
||||
- get_object_info
|
||||
- estimate_compression
|
||||
- find_similar_files
|
||||
"""
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from ..client_models import BucketStats, CompressionEstimate, ObjectInfo
|
||||
|
||||
|
||||
def get_object_info(
|
||||
client: Any, # DeltaGliderClient
|
||||
s3_url: str,
|
||||
) -> ObjectInfo:
|
||||
"""Get detailed object information including compression stats.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
s3_url: S3 URL of the object
|
||||
|
||||
Returns:
|
||||
ObjectInfo with detailed metadata
|
||||
"""
|
||||
# Parse URL
|
||||
if not s3_url.startswith("s3://"):
|
||||
raise ValueError(f"Invalid S3 URL: {s3_url}")
|
||||
|
||||
s3_path = s3_url[5:]
|
||||
parts = s3_path.split("/", 1)
|
||||
bucket = parts[0]
|
||||
key = parts[1] if len(parts) > 1 else ""
|
||||
|
||||
# Get object metadata
|
||||
obj_head = client.service.storage.head(f"{bucket}/{key}")
|
||||
if not obj_head:
|
||||
raise FileNotFoundError(f"Object not found: {s3_url}")
|
||||
|
||||
metadata = obj_head.metadata
|
||||
is_delta = key.endswith(".delta")
|
||||
|
||||
return ObjectInfo(
|
||||
key=key,
|
||||
size=obj_head.size,
|
||||
last_modified=metadata.get("last_modified", ""),
|
||||
etag=metadata.get("etag"),
|
||||
original_size=int(metadata.get("file_size", obj_head.size)),
|
||||
compressed_size=obj_head.size,
|
||||
compression_ratio=float(metadata.get("compression_ratio", 0.0)),
|
||||
is_delta=is_delta,
|
||||
reference_key=metadata.get("ref_key"),
|
||||
)
|
||||
|
||||
|
||||
def get_bucket_stats(
|
||||
client: Any, # DeltaGliderClient
|
||||
bucket: str,
|
||||
detailed_stats: bool = False,
|
||||
) -> BucketStats:
|
||||
"""Get statistics for a bucket with optional detailed compression metrics.
|
||||
|
||||
This method provides two modes:
|
||||
- Quick stats (default): Fast overview using LIST only (~50ms)
|
||||
- Detailed stats: Accurate compression metrics with HEAD requests (slower)
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
bucket: S3 bucket name
|
||||
detailed_stats: If True, fetch accurate compression ratios for delta files (default: False)
|
||||
|
||||
Returns:
|
||||
BucketStats with compression and space savings info
|
||||
|
||||
Performance:
|
||||
- With detailed_stats=False: ~50ms for any bucket size (1 LIST call per 1000 objects)
|
||||
- With detailed_stats=True: ~2-3s per 1000 objects (adds HEAD calls for delta files only)
|
||||
|
||||
Example:
|
||||
# Quick stats for dashboard display
|
||||
stats = client.get_bucket_stats('releases')
|
||||
print(f"Objects: {stats.object_count}, Size: {stats.total_size}")
|
||||
|
||||
# Detailed stats for analytics (slower but accurate)
|
||||
stats = client.get_bucket_stats('releases', detailed_stats=True)
|
||||
print(f"Compression ratio: {stats.average_compression_ratio:.1%}")
|
||||
"""
|
||||
# List all objects with smart metadata fetching
|
||||
all_objects = []
|
||||
continuation_token = None
|
||||
|
||||
while True:
|
||||
response = client.list_objects(
|
||||
Bucket=bucket,
|
||||
MaxKeys=1000,
|
||||
ContinuationToken=continuation_token,
|
||||
FetchMetadata=detailed_stats, # Only fetch metadata if detailed stats requested
|
||||
)
|
||||
|
||||
# Extract S3Objects from response (with Metadata containing DeltaGlider info)
|
||||
for obj_dict in response["Contents"]:
|
||||
# Convert dict back to ObjectInfo for backward compatibility with stats calculation
|
||||
metadata = obj_dict.get("Metadata", {})
|
||||
# Parse compression ratio safely (handle "unknown" value)
|
||||
compression_ratio_str = metadata.get("deltaglider-compression-ratio", "0.0")
|
||||
try:
|
||||
compression_ratio = (
|
||||
float(compression_ratio_str) if compression_ratio_str != "unknown" else 0.0
|
||||
)
|
||||
except ValueError:
|
||||
compression_ratio = 0.0
|
||||
|
||||
all_objects.append(
|
||||
ObjectInfo(
|
||||
key=obj_dict["Key"],
|
||||
size=obj_dict["Size"],
|
||||
last_modified=obj_dict.get("LastModified", ""),
|
||||
etag=obj_dict.get("ETag"),
|
||||
storage_class=obj_dict.get("StorageClass", "STANDARD"),
|
||||
original_size=int(metadata.get("deltaglider-original-size", obj_dict["Size"])),
|
||||
compressed_size=obj_dict["Size"],
|
||||
is_delta=metadata.get("deltaglider-is-delta", "false") == "true",
|
||||
compression_ratio=compression_ratio,
|
||||
reference_key=metadata.get("deltaglider-reference-key"),
|
||||
)
|
||||
)
|
||||
|
||||
if not response.get("IsTruncated"):
|
||||
break
|
||||
|
||||
continuation_token = response.get("NextContinuationToken")
|
||||
|
||||
# Calculate statistics
|
||||
total_size = 0
|
||||
compressed_size = 0
|
||||
delta_count = 0
|
||||
direct_count = 0
|
||||
|
||||
for obj in all_objects:
|
||||
# Skip reference.bin files - they are internal implementation details
|
||||
# and their size is already accounted for in delta metadata
|
||||
if obj.key.endswith("/reference.bin") or obj.key == "reference.bin":
|
||||
continue
|
||||
|
||||
compressed_size += obj.size
|
||||
|
||||
if obj.is_delta:
|
||||
delta_count += 1
|
||||
# Use actual original size if we have it, otherwise estimate
|
||||
total_size += obj.original_size or obj.size
|
||||
else:
|
||||
direct_count += 1
|
||||
# For non-delta files, original equals compressed
|
||||
total_size += obj.size
|
||||
|
||||
space_saved = total_size - compressed_size
|
||||
avg_ratio = (space_saved / total_size) if total_size > 0 else 0.0
|
||||
|
||||
return BucketStats(
|
||||
bucket=bucket,
|
||||
object_count=len(all_objects),
|
||||
total_size=total_size,
|
||||
compressed_size=compressed_size,
|
||||
space_saved=space_saved,
|
||||
average_compression_ratio=avg_ratio,
|
||||
delta_objects=delta_count,
|
||||
direct_objects=direct_count,
|
||||
)
|
||||
|
||||
|
||||
def estimate_compression(
|
||||
client: Any, # DeltaGliderClient
|
||||
file_path: str | Path,
|
||||
bucket: str,
|
||||
prefix: str = "",
|
||||
sample_size: int = 1024 * 1024,
|
||||
) -> CompressionEstimate:
|
||||
"""Estimate compression ratio before upload.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
file_path: Local file to estimate
|
||||
bucket: Target bucket
|
||||
prefix: Target prefix (for finding similar files)
|
||||
sample_size: Bytes to sample for estimation (default 1MB)
|
||||
|
||||
Returns:
|
||||
CompressionEstimate with predicted compression
|
||||
"""
|
||||
file_path = Path(file_path)
|
||||
file_size = file_path.stat().st_size
|
||||
|
||||
# Check file extension
|
||||
ext = file_path.suffix.lower()
|
||||
delta_extensions = {
|
||||
".zip",
|
||||
".tar",
|
||||
".gz",
|
||||
".tar.gz",
|
||||
".tgz",
|
||||
".bz2",
|
||||
".tar.bz2",
|
||||
".xz",
|
||||
".tar.xz",
|
||||
".7z",
|
||||
".rar",
|
||||
".dmg",
|
||||
".iso",
|
||||
".pkg",
|
||||
".deb",
|
||||
".rpm",
|
||||
".apk",
|
||||
".jar",
|
||||
".war",
|
||||
".ear",
|
||||
}
|
||||
|
||||
# Already compressed formats that won't benefit from delta
|
||||
incompressible = {".jpg", ".jpeg", ".png", ".mp4", ".mp3", ".avi", ".mov"}
|
||||
|
||||
if ext in incompressible:
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=file_size,
|
||||
estimated_ratio=0.0,
|
||||
confidence=0.95,
|
||||
should_use_delta=False,
|
||||
)
|
||||
|
||||
if ext not in delta_extensions:
|
||||
# Unknown type, conservative estimate
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=file_size,
|
||||
estimated_ratio=0.0,
|
||||
confidence=0.5,
|
||||
should_use_delta=file_size > 1024 * 1024, # Only for files > 1MB
|
||||
)
|
||||
|
||||
# Look for similar files in the target location
|
||||
similar_files = find_similar_files(client, bucket, prefix, file_path.name)
|
||||
|
||||
if similar_files:
|
||||
# If we have similar files, estimate high compression
|
||||
estimated_ratio = 0.99 # 99% compression typical for similar versions
|
||||
confidence = 0.9
|
||||
recommended_ref = similar_files[0]["Key"] if similar_files else None
|
||||
else:
|
||||
# First file of its type
|
||||
estimated_ratio = 0.0
|
||||
confidence = 0.7
|
||||
recommended_ref = None
|
||||
|
||||
estimated_size = int(file_size * (1 - estimated_ratio))
|
||||
|
||||
return CompressionEstimate(
|
||||
original_size=file_size,
|
||||
estimated_compressed_size=estimated_size,
|
||||
estimated_ratio=estimated_ratio,
|
||||
confidence=confidence,
|
||||
recommended_reference=recommended_ref,
|
||||
should_use_delta=True,
|
||||
)
|
||||
|
||||
|
||||
def find_similar_files(
|
||||
client: Any, # DeltaGliderClient
|
||||
bucket: str,
|
||||
prefix: str,
|
||||
filename: str,
|
||||
limit: int = 5,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""Find similar files that could serve as references.
|
||||
|
||||
Args:
|
||||
client: DeltaGliderClient instance
|
||||
bucket: S3 bucket
|
||||
prefix: Prefix to search in
|
||||
filename: Filename to match against
|
||||
limit: Maximum number of results
|
||||
|
||||
Returns:
|
||||
List of similar files with scores
|
||||
"""
|
||||
# List objects in the prefix (no metadata needed for similarity check)
|
||||
response = client.list_objects(
|
||||
Bucket=bucket,
|
||||
Prefix=prefix,
|
||||
MaxKeys=1000,
|
||||
FetchMetadata=False, # Don't need metadata for similarity
|
||||
)
|
||||
|
||||
similar: list[dict[str, Any]] = []
|
||||
base_name = Path(filename).stem
|
||||
ext = Path(filename).suffix
|
||||
|
||||
for obj in response["Contents"]:
|
||||
obj_key = obj["Key"]
|
||||
obj_base = Path(obj_key).stem
|
||||
obj_ext = Path(obj_key).suffix
|
||||
|
||||
# Skip delta files and references
|
||||
if obj_key.endswith(".delta") or obj_key.endswith("reference.bin"):
|
||||
continue
|
||||
|
||||
score = 0.0
|
||||
|
||||
# Extension match
|
||||
if ext == obj_ext:
|
||||
score += 0.5
|
||||
|
||||
# Base name similarity
|
||||
if base_name in obj_base or obj_base in base_name:
|
||||
score += 0.3
|
||||
|
||||
# Version pattern match
|
||||
if re.search(r"v?\d+[\.\d]*", base_name) and re.search(r"v?\d+[\.\d]*", obj_base):
|
||||
score += 0.2
|
||||
|
||||
if score > 0.5:
|
||||
similar.append(
|
||||
{
|
||||
"Key": obj_key,
|
||||
"Size": obj["Size"],
|
||||
"Similarity": score,
|
||||
"LastModified": obj["LastModified"],
|
||||
}
|
||||
)
|
||||
|
||||
# Sort by similarity
|
||||
similar.sort(key=lambda x: x["Similarity"], reverse=True) # type: ignore
|
||||
|
||||
return similar[:limit]
|
||||
@@ -47,3 +47,15 @@ class PolicyViolationWarning(Warning):
|
||||
"""Policy violation warning."""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class CacheMissError(DeltaGliderError):
|
||||
"""Cache miss - file not found in cache."""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class CacheCorruptionError(DeltaGliderError):
|
||||
"""Cache corruption - SHA mismatch or tampering detected."""
|
||||
|
||||
pass
|
||||
|
||||
@@ -230,7 +230,10 @@ class DeltaService:
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
tmp_path = Path(tmpdir)
|
||||
delta_path = tmp_path / "delta"
|
||||
ref_path = self.cache.ref_path(delta_space.bucket, delta_space.prefix)
|
||||
# SECURITY: Use validated ref to prevent TOCTOU attacks
|
||||
ref_path = self.cache.get_validated_ref(
|
||||
delta_space.bucket, delta_space.prefix, delta_meta.ref_sha256
|
||||
)
|
||||
out_path = tmp_path / "output"
|
||||
|
||||
# Download delta
|
||||
@@ -408,7 +411,8 @@ class DeltaService:
|
||||
if not cache_hit:
|
||||
self._cache_reference(delta_space, ref_sha256)
|
||||
|
||||
ref_path = self.cache.ref_path(delta_space.bucket, delta_space.prefix)
|
||||
# SECURITY: Use validated ref to prevent TOCTOU attacks
|
||||
ref_path = self.cache.get_validated_ref(delta_space.bucket, delta_space.prefix, ref_sha256)
|
||||
|
||||
# Create delta
|
||||
with tempfile.NamedTemporaryFile(suffix=".delta") as delta_file:
|
||||
|
||||
@@ -15,6 +15,26 @@ class CachePort(Protocol):
|
||||
"""Check if reference exists and matches SHA."""
|
||||
...
|
||||
|
||||
def get_validated_ref(self, bucket: str, prefix: str, expected_sha: str) -> Path:
|
||||
"""Get cached reference with atomic SHA validation.
|
||||
|
||||
This method MUST be used instead of ref_path() to prevent TOCTOU attacks.
|
||||
It validates the SHA256 hash at the time of use, not just at cache check time.
|
||||
|
||||
Args:
|
||||
bucket: S3 bucket name
|
||||
prefix: Prefix/deltaspace within bucket
|
||||
expected_sha: Expected SHA256 hash of the file
|
||||
|
||||
Returns:
|
||||
Path to the validated cached file
|
||||
|
||||
Raises:
|
||||
CacheMissError: If cached file doesn't exist
|
||||
CacheCorruptionError: If SHA doesn't match (file corrupted or tampered)
|
||||
"""
|
||||
...
|
||||
|
||||
def write_ref(self, bucket: str, prefix: str, src: Path) -> Path:
|
||||
"""Cache reference file."""
|
||||
...
|
||||
|
||||
152
src/deltaglider/response_builders.py
Normal file
152
src/deltaglider/response_builders.py
Normal file
@@ -0,0 +1,152 @@
|
||||
"""Type-safe response builders using TypedDicts for internal type safety.
|
||||
|
||||
This module provides builder functions that construct boto3-compatible responses
|
||||
with full compile-time type validation using TypedDicts. At runtime, TypedDicts
|
||||
are plain dicts, so there's no conversion overhead.
|
||||
|
||||
Benefits:
|
||||
- Field name typos caught by mypy (e.g., "HTTPStatusCode" → "HttpStatusCode")
|
||||
- Wrong types caught by mypy (e.g., string instead of int)
|
||||
- Missing required fields caught by mypy
|
||||
- Extra unknown fields caught by mypy
|
||||
"""
|
||||
|
||||
from typing import Any
|
||||
|
||||
from .types import (
|
||||
CommonPrefix,
|
||||
DeleteObjectResponse,
|
||||
GetObjectResponse,
|
||||
ListObjectsV2Response,
|
||||
PutObjectResponse,
|
||||
ResponseMetadata,
|
||||
S3Object,
|
||||
)
|
||||
|
||||
|
||||
def build_response_metadata(status_code: int = 200) -> ResponseMetadata:
|
||||
"""Build ResponseMetadata with full type safety via TypedDict.
|
||||
|
||||
TypedDict is a dict at runtime - no conversion needed!
|
||||
mypy validates all fields match ResponseMetadata TypedDict.
|
||||
Uses our types.py TypedDict which has proper NotRequired fields.
|
||||
"""
|
||||
# Build as TypedDict - mypy validates field names and types!
|
||||
metadata: ResponseMetadata = {
|
||||
"HTTPStatusCode": status_code,
|
||||
# All other fields are NotRequired - can be omitted!
|
||||
}
|
||||
return metadata # Returns dict at runtime, ResponseMetadata type at compile-time
|
||||
|
||||
|
||||
def build_put_response(
|
||||
etag: str,
|
||||
*,
|
||||
version_id: str | None = None,
|
||||
deltaglider_info: dict[str, Any] | None = None,
|
||||
) -> PutObjectResponse:
|
||||
"""Build PutObjectResponse with full type safety via TypedDict.
|
||||
|
||||
Uses our types.py TypedDict which has proper NotRequired fields.
|
||||
mypy validates all field names, types, and structure.
|
||||
"""
|
||||
# Build as TypedDict - mypy catches typos and type errors!
|
||||
response: PutObjectResponse = {
|
||||
"ETag": etag,
|
||||
"ResponseMetadata": build_response_metadata(),
|
||||
}
|
||||
|
||||
if version_id:
|
||||
response["VersionId"] = version_id
|
||||
|
||||
# DeltaGlider extension - add as Any field
|
||||
if deltaglider_info:
|
||||
response["DeltaGliderInfo"] = deltaglider_info # type: ignore[typeddict-item]
|
||||
|
||||
return response # Returns dict at runtime, PutObjectResponse type at compile-time
|
||||
|
||||
|
||||
def build_get_response(
|
||||
body: Any,
|
||||
content_length: int,
|
||||
etag: str,
|
||||
metadata: dict[str, Any],
|
||||
) -> GetObjectResponse:
|
||||
"""Build GetObjectResponse with full type safety via TypedDict.
|
||||
|
||||
Uses our types.py TypedDict which has proper NotRequired fields.
|
||||
mypy validates all field names, types, and structure.
|
||||
"""
|
||||
# Build as TypedDict - mypy catches typos and type errors!
|
||||
response: GetObjectResponse = {
|
||||
"Body": body,
|
||||
"ContentLength": content_length,
|
||||
"ETag": etag,
|
||||
"Metadata": metadata,
|
||||
"ResponseMetadata": build_response_metadata(),
|
||||
}
|
||||
return response # Returns dict at runtime, GetObjectResponse type at compile-time
|
||||
|
||||
|
||||
def build_list_objects_response(
|
||||
bucket: str,
|
||||
prefix: str,
|
||||
delimiter: str,
|
||||
max_keys: int,
|
||||
contents: list[S3Object],
|
||||
common_prefixes: list[CommonPrefix] | None,
|
||||
is_truncated: bool,
|
||||
next_continuation_token: str | None,
|
||||
continuation_token: str | None,
|
||||
) -> ListObjectsV2Response:
|
||||
"""Build ListObjectsV2Response with full type safety via TypedDict.
|
||||
|
||||
Uses our types.py TypedDict which has proper NotRequired fields.
|
||||
mypy validates all field names, types, and structure.
|
||||
"""
|
||||
# Build as TypedDict - mypy catches typos and type errors!
|
||||
response: ListObjectsV2Response = {
|
||||
"IsTruncated": is_truncated,
|
||||
"Contents": contents,
|
||||
"Name": bucket,
|
||||
"Prefix": prefix,
|
||||
"Delimiter": delimiter,
|
||||
"MaxKeys": max_keys,
|
||||
"KeyCount": len(contents),
|
||||
"ResponseMetadata": build_response_metadata(),
|
||||
}
|
||||
|
||||
# Add optional fields
|
||||
if common_prefixes:
|
||||
response["CommonPrefixes"] = common_prefixes
|
||||
|
||||
if next_continuation_token:
|
||||
response["NextContinuationToken"] = next_continuation_token
|
||||
|
||||
if continuation_token:
|
||||
response["ContinuationToken"] = continuation_token
|
||||
|
||||
return response # Returns dict at runtime, ListObjectsV2Response type at compile-time
|
||||
|
||||
|
||||
def build_delete_response(
|
||||
delete_marker: bool = False,
|
||||
status_code: int = 204,
|
||||
deltaglider_info: dict[str, Any] | None = None,
|
||||
) -> DeleteObjectResponse:
|
||||
"""Build DeleteObjectResponse with full type safety via TypedDict.
|
||||
|
||||
Uses our types.py TypedDict which has proper NotRequired fields.
|
||||
mypy validates all field names, types, and structure.
|
||||
"""
|
||||
# Build as TypedDict - mypy catches typos and type errors!
|
||||
response: DeleteObjectResponse = {
|
||||
"DeleteMarker": delete_marker,
|
||||
"ResponseMetadata": build_response_metadata(status_code),
|
||||
}
|
||||
|
||||
# DeltaGlider extension
|
||||
if deltaglider_info:
|
||||
response["DeltaGliderInfo"] = deltaglider_info # type: ignore[typeddict-item]
|
||||
|
||||
return response # Returns dict at runtime, DeleteObjectResponse type at compile-time
|
||||
355
src/deltaglider/types.py
Normal file
355
src/deltaglider/types.py
Normal file
@@ -0,0 +1,355 @@
|
||||
"""Type definitions for boto3-compatible responses.
|
||||
|
||||
These TypedDict definitions provide type hints for DeltaGlider's boto3-compatible
|
||||
responses. All methods return plain `dict[str, Any]` at runtime for maximum
|
||||
flexibility and boto3 compatibility.
|
||||
|
||||
## Basic Usage (Recommended)
|
||||
|
||||
Use DeltaGlider with simple dict access - no type imports needed:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
|
||||
client = create_client()
|
||||
|
||||
# Returns plain dict - 100% boto3 compatible
|
||||
response = client.put_object(Bucket='my-bucket', Key='file.zip', Body=data)
|
||||
print(response['ETag'])
|
||||
|
||||
# List objects with dict access
|
||||
listing = client.list_objects(Bucket='my-bucket')
|
||||
for obj in listing['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']} bytes")
|
||||
```
|
||||
|
||||
## Optional Type Hints
|
||||
|
||||
For IDE autocomplete and type checking, you can use our convenience TypedDicts:
|
||||
|
||||
```python
|
||||
from deltaglider import create_client
|
||||
from deltaglider.types import PutObjectResponse, ListObjectsV2Response
|
||||
|
||||
client = create_client()
|
||||
response: PutObjectResponse = client.put_object(...) # IDE autocomplete
|
||||
listing: ListObjectsV2Response = client.list_objects(...)
|
||||
```
|
||||
|
||||
## Advanced: boto3-stubs Integration
|
||||
|
||||
For strictest type checking (requires boto3-stubs installation):
|
||||
|
||||
```bash
|
||||
pip install boto3-stubs[s3]
|
||||
```
|
||||
|
||||
```python
|
||||
from mypy_boto3_s3.type_defs import PutObjectOutputTypeDef
|
||||
response: PutObjectOutputTypeDef = client.put_object(...)
|
||||
```
|
||||
|
||||
**Note**: boto3-stubs TypedDefs are very strict and require ALL optional fields.
|
||||
DeltaGlider returns partial dicts for better boto3 compatibility, so boto3-stubs
|
||||
types may show false positive errors. Use `dict[str, Any]` or our TypedDicts instead.
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
DeltaGlider returns `dict[str, Any]` from all boto3-compatible methods because:
|
||||
1. **Flexibility**: boto3 responses vary by service and operation
|
||||
2. **Compatibility**: Exact match with boto3 runtime behavior
|
||||
3. **Simplicity**: No complex type dependencies for users
|
||||
4. **Optional Typing**: Users choose their preferred level of type safety
|
||||
"""
|
||||
|
||||
from datetime import datetime
|
||||
from typing import Any, Literal, NotRequired, TypedDict
|
||||
|
||||
# ============================================================================
|
||||
# S3 Object Types
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class S3Object(TypedDict):
|
||||
"""An S3 object returned in list operations.
|
||||
|
||||
Compatible with boto3's S3.Client.list_objects_v2() response Contents.
|
||||
"""
|
||||
|
||||
Key: str
|
||||
Size: int
|
||||
LastModified: datetime
|
||||
ETag: NotRequired[str]
|
||||
StorageClass: NotRequired[str]
|
||||
Owner: NotRequired[dict[str, str]]
|
||||
Metadata: NotRequired[dict[str, str]]
|
||||
|
||||
|
||||
class CommonPrefix(TypedDict):
|
||||
"""A common prefix (directory) in S3 listing.
|
||||
|
||||
Compatible with boto3's S3.Client.list_objects_v2() response CommonPrefixes.
|
||||
"""
|
||||
|
||||
Prefix: str
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Response Metadata (used in all responses)
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class ResponseMetadata(TypedDict):
|
||||
"""Metadata about the API response.
|
||||
|
||||
Compatible with all boto3 responses.
|
||||
"""
|
||||
|
||||
RequestId: NotRequired[str]
|
||||
HostId: NotRequired[str]
|
||||
HTTPStatusCode: int
|
||||
HTTPHeaders: NotRequired[dict[str, str]]
|
||||
RetryAttempts: NotRequired[int]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# List Operations Response Types
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class ListObjectsV2Response(TypedDict):
|
||||
"""Response from list_objects_v2 operation.
|
||||
|
||||
100% compatible with boto3's S3.Client.list_objects_v2() response.
|
||||
|
||||
Example:
|
||||
```python
|
||||
client = create_client()
|
||||
response: ListObjectsV2Response = client.list_objects(
|
||||
Bucket='my-bucket',
|
||||
Prefix='path/',
|
||||
Delimiter='/'
|
||||
)
|
||||
|
||||
for obj in response['Contents']:
|
||||
print(f"{obj['Key']}: {obj['Size']} bytes")
|
||||
|
||||
for prefix in response.get('CommonPrefixes', []):
|
||||
print(f"Directory: {prefix['Prefix']}")
|
||||
```
|
||||
"""
|
||||
|
||||
Contents: list[S3Object]
|
||||
Name: NotRequired[str] # Bucket name
|
||||
Prefix: NotRequired[str]
|
||||
Delimiter: NotRequired[str]
|
||||
MaxKeys: NotRequired[int]
|
||||
CommonPrefixes: NotRequired[list[CommonPrefix]]
|
||||
EncodingType: NotRequired[str]
|
||||
KeyCount: NotRequired[int]
|
||||
ContinuationToken: NotRequired[str]
|
||||
NextContinuationToken: NotRequired[str]
|
||||
StartAfter: NotRequired[str]
|
||||
IsTruncated: NotRequired[bool]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Put/Get/Delete Response Types
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class PutObjectResponse(TypedDict):
|
||||
"""Response from put_object operation.
|
||||
|
||||
Compatible with boto3's S3.Client.put_object() response.
|
||||
"""
|
||||
|
||||
ETag: str
|
||||
VersionId: NotRequired[str]
|
||||
ServerSideEncryption: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
class GetObjectResponse(TypedDict):
|
||||
"""Response from get_object operation.
|
||||
|
||||
Compatible with boto3's S3.Client.get_object() response.
|
||||
"""
|
||||
|
||||
Body: Any # StreamingBody in boto3, bytes in DeltaGlider
|
||||
ContentLength: int
|
||||
ContentType: NotRequired[str]
|
||||
ETag: NotRequired[str]
|
||||
LastModified: NotRequired[datetime]
|
||||
Metadata: NotRequired[dict[str, str]]
|
||||
VersionId: NotRequired[str]
|
||||
StorageClass: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
class DeleteObjectResponse(TypedDict):
|
||||
"""Response from delete_object operation.
|
||||
|
||||
Compatible with boto3's S3.Client.delete_object() response.
|
||||
"""
|
||||
|
||||
DeleteMarker: NotRequired[bool]
|
||||
VersionId: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
class DeletedObject(TypedDict):
|
||||
"""A successfully deleted object.
|
||||
|
||||
Compatible with boto3's S3.Client.delete_objects() response Deleted.
|
||||
"""
|
||||
|
||||
Key: str
|
||||
VersionId: NotRequired[str]
|
||||
DeleteMarker: NotRequired[bool]
|
||||
DeleteMarkerVersionId: NotRequired[str]
|
||||
|
||||
|
||||
class DeleteError(TypedDict):
|
||||
"""An error that occurred during deletion.
|
||||
|
||||
Compatible with boto3's S3.Client.delete_objects() response Errors.
|
||||
"""
|
||||
|
||||
Key: str
|
||||
Code: str
|
||||
Message: str
|
||||
VersionId: NotRequired[str]
|
||||
|
||||
|
||||
class DeleteObjectsResponse(TypedDict):
|
||||
"""Response from delete_objects operation.
|
||||
|
||||
Compatible with boto3's S3.Client.delete_objects() response.
|
||||
"""
|
||||
|
||||
Deleted: NotRequired[list[DeletedObject]]
|
||||
Errors: NotRequired[list[DeleteError]]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Head Object Response
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class HeadObjectResponse(TypedDict):
|
||||
"""Response from head_object operation.
|
||||
|
||||
Compatible with boto3's S3.Client.head_object() response.
|
||||
"""
|
||||
|
||||
ContentLength: int
|
||||
ContentType: NotRequired[str]
|
||||
ETag: NotRequired[str]
|
||||
LastModified: NotRequired[datetime]
|
||||
Metadata: NotRequired[dict[str, str]]
|
||||
VersionId: NotRequired[str]
|
||||
StorageClass: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Bucket Operations
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class Bucket(TypedDict):
|
||||
"""An S3 bucket.
|
||||
|
||||
Compatible with boto3's S3.Client.list_buckets() response Buckets.
|
||||
"""
|
||||
|
||||
Name: str
|
||||
CreationDate: datetime
|
||||
|
||||
|
||||
class ListBucketsResponse(TypedDict):
|
||||
"""Response from list_buckets operation.
|
||||
|
||||
Compatible with boto3's S3.Client.list_buckets() response.
|
||||
"""
|
||||
|
||||
Buckets: list[Bucket]
|
||||
Owner: NotRequired[dict[str, str]]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
class CreateBucketResponse(TypedDict):
|
||||
"""Response from create_bucket operation.
|
||||
|
||||
Compatible with boto3's S3.Client.create_bucket() response.
|
||||
"""
|
||||
|
||||
Location: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Multipart Upload Types
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class CompletedPart(TypedDict):
|
||||
"""A completed part in a multipart upload."""
|
||||
|
||||
PartNumber: int
|
||||
ETag: str
|
||||
|
||||
|
||||
class CompleteMultipartUploadResponse(TypedDict):
|
||||
"""Response from complete_multipart_upload operation."""
|
||||
|
||||
Location: NotRequired[str]
|
||||
Bucket: NotRequired[str]
|
||||
Key: NotRequired[str]
|
||||
ETag: NotRequired[str]
|
||||
VersionId: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Copy Operations
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class CopyObjectResponse(TypedDict):
|
||||
"""Response from copy_object operation.
|
||||
|
||||
Compatible with boto3's S3.Client.copy_object() response.
|
||||
"""
|
||||
|
||||
CopyObjectResult: NotRequired[dict[str, Any]]
|
||||
ETag: NotRequired[str]
|
||||
LastModified: NotRequired[datetime]
|
||||
VersionId: NotRequired[str]
|
||||
ResponseMetadata: NotRequired[ResponseMetadata]
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Type Aliases for Convenience
|
||||
# ============================================================================
|
||||
|
||||
# Common parameter types
|
||||
BucketName = str
|
||||
ObjectKey = str
|
||||
Prefix = str
|
||||
Delimiter = str
|
||||
|
||||
# Storage class options
|
||||
StorageClass = Literal[
|
||||
"STANDARD",
|
||||
"REDUCED_REDUNDANCY",
|
||||
"STANDARD_IA",
|
||||
"ONEZONE_IA",
|
||||
"INTELLIGENT_TIERING",
|
||||
"GLACIER",
|
||||
"DEEP_ARCHIVE",
|
||||
"GLACIER_IR",
|
||||
]
|
||||
@@ -8,7 +8,7 @@ from unittest.mock import Mock
|
||||
import pytest
|
||||
|
||||
from deltaglider.adapters import (
|
||||
FsCacheAdapter,
|
||||
ContentAddressedCache,
|
||||
NoopMetricsAdapter,
|
||||
Sha256Adapter,
|
||||
StdLoggerAdapter,
|
||||
@@ -59,9 +59,9 @@ def real_hasher():
|
||||
|
||||
@pytest.fixture
|
||||
def cache_adapter(temp_dir, real_hasher):
|
||||
"""Create filesystem cache adapter."""
|
||||
"""Create content-addressed storage cache adapter."""
|
||||
cache_dir = temp_dir / "cache"
|
||||
return FsCacheAdapter(cache_dir, real_hasher)
|
||||
return ContentAddressedCache(cache_dir, real_hasher)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@@ -10,7 +10,6 @@ from deltaglider import create_client
|
||||
from deltaglider.client import (
|
||||
BucketStats,
|
||||
CompressionEstimate,
|
||||
ListObjectsResponse,
|
||||
ObjectInfo,
|
||||
)
|
||||
|
||||
@@ -125,7 +124,7 @@ class MockStorage:
|
||||
@pytest.fixture
|
||||
def client(tmp_path):
|
||||
"""Create a client with mocked storage."""
|
||||
client = create_client(cache_dir=str(tmp_path / "cache"))
|
||||
client = create_client()
|
||||
|
||||
# Replace storage with mock
|
||||
mock_storage = MockStorage()
|
||||
@@ -157,7 +156,6 @@ class TestCredentialHandling:
|
||||
aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
|
||||
aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
|
||||
region_name="us-west-2",
|
||||
cache_dir=str(tmp_path / "cache"),
|
||||
)
|
||||
|
||||
# Verify the client was created
|
||||
@@ -180,7 +178,6 @@ class TestCredentialHandling:
|
||||
aws_access_key_id="ASIAIOSFODNN7EXAMPLE",
|
||||
aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
|
||||
aws_session_token="FwoGZXIvYXdzEBEaDH...",
|
||||
cache_dir=str(tmp_path / "cache"),
|
||||
)
|
||||
|
||||
assert client is not None
|
||||
@@ -189,7 +186,7 @@ class TestCredentialHandling:
|
||||
def test_create_client_without_credentials_uses_environment(self, tmp_path):
|
||||
"""Test that omitting credentials falls back to environment/IAM."""
|
||||
# This should use boto3's default credential chain
|
||||
client = create_client(cache_dir=str(tmp_path / "cache"))
|
||||
client = create_client()
|
||||
|
||||
assert client is not None
|
||||
assert client.service.storage.client is not None
|
||||
@@ -200,7 +197,6 @@ class TestCredentialHandling:
|
||||
endpoint_url="http://localhost:9000",
|
||||
aws_access_key_id="minioadmin",
|
||||
aws_secret_access_key="minioadmin",
|
||||
cache_dir=str(tmp_path / "cache"),
|
||||
)
|
||||
|
||||
assert client is not None
|
||||
@@ -279,27 +275,35 @@ class TestBoto3Compatibility:
|
||||
assert response["ContentLength"] == len(content)
|
||||
|
||||
def test_list_objects(self, client):
|
||||
"""Test list_objects with various options."""
|
||||
"""Test list_objects with various options (boto3-compatible dict response)."""
|
||||
# List all objects (default: FetchMetadata=False)
|
||||
response = client.list_objects(Bucket="test-bucket")
|
||||
|
||||
assert isinstance(response, ListObjectsResponse)
|
||||
assert response.key_count > 0
|
||||
assert len(response.contents) > 0
|
||||
# Response is now a boto3-compatible dict (not ListObjectsResponse)
|
||||
assert isinstance(response, dict)
|
||||
assert response["KeyCount"] > 0
|
||||
assert len(response["Contents"]) > 0
|
||||
|
||||
# Verify S3Object structure
|
||||
for obj in response["Contents"]:
|
||||
assert "Key" in obj
|
||||
assert "Size" in obj
|
||||
assert "LastModified" in obj
|
||||
assert "Metadata" in obj # DeltaGlider metadata
|
||||
|
||||
# Test with FetchMetadata=True (should only affect delta files)
|
||||
response_with_metadata = client.list_objects(Bucket="test-bucket", FetchMetadata=True)
|
||||
assert isinstance(response_with_metadata, ListObjectsResponse)
|
||||
assert response_with_metadata.key_count > 0
|
||||
assert isinstance(response_with_metadata, dict)
|
||||
assert response_with_metadata["KeyCount"] > 0
|
||||
|
||||
def test_list_objects_with_delimiter(self, client):
|
||||
"""Test list_objects with delimiter for folder simulation."""
|
||||
"""Test list_objects with delimiter for folder simulation (boto3-compatible dict response)."""
|
||||
response = client.list_objects(Bucket="test-bucket", Prefix="", Delimiter="/")
|
||||
|
||||
# Should have common prefixes for folders
|
||||
assert len(response.common_prefixes) > 0
|
||||
assert {"Prefix": "folder1/"} in response.common_prefixes
|
||||
assert {"Prefix": "folder2/"} in response.common_prefixes
|
||||
assert len(response.get("CommonPrefixes", [])) > 0
|
||||
assert {"Prefix": "folder1/"} in response["CommonPrefixes"]
|
||||
assert {"Prefix": "folder2/"} in response["CommonPrefixes"]
|
||||
|
||||
def test_delete_object(self, client):
|
||||
"""Test delete_object."""
|
||||
|
||||
@@ -71,7 +71,7 @@ def mock_storage():
|
||||
def client(tmp_path):
|
||||
"""Create DeltaGliderClient with mock storage."""
|
||||
# Use create_client to get a properly configured client
|
||||
client = create_client(cache_dir=str(tmp_path / "cache"))
|
||||
client = create_client()
|
||||
|
||||
# Replace storage with mock
|
||||
mock_storage = MockStorage()
|
||||
|
||||
@@ -53,8 +53,11 @@ class TestSDKFiltering:
|
||||
client = DeltaGliderClient(service)
|
||||
response = client.list_objects(Bucket="test-bucket", Prefix="releases/")
|
||||
|
||||
# Response is now a boto3-compatible dict
|
||||
contents = response["Contents"]
|
||||
|
||||
# Verify .delta suffix is stripped
|
||||
keys = [obj.key for obj in response.contents]
|
||||
keys = [obj["Key"] for obj in contents]
|
||||
assert "releases/app-v1.zip" in keys
|
||||
assert "releases/app-v2.zip" in keys
|
||||
assert "releases/README.md" in keys
|
||||
@@ -63,8 +66,10 @@ class TestSDKFiltering:
|
||||
for key in keys:
|
||||
assert not key.endswith(".delta"), f"Found .delta suffix in: {key}"
|
||||
|
||||
# Verify is_delta flag is set correctly
|
||||
delta_objects = [obj for obj in response.contents if obj.is_delta]
|
||||
# Verify is_delta flag is set correctly in Metadata
|
||||
delta_objects = [
|
||||
obj for obj in contents if obj.get("Metadata", {}).get("deltaglider-is-delta") == "true"
|
||||
]
|
||||
assert len(delta_objects) == 2
|
||||
|
||||
def test_list_objects_filters_reference_bin(self):
|
||||
@@ -106,15 +111,18 @@ class TestSDKFiltering:
|
||||
client = DeltaGliderClient(service)
|
||||
response = client.list_objects(Bucket="test-bucket", Prefix="releases/")
|
||||
|
||||
# Response is now a boto3-compatible dict
|
||||
contents = response["Contents"]
|
||||
|
||||
# Verify NO reference.bin files in output
|
||||
keys = [obj.key for obj in response.contents]
|
||||
keys = [obj["Key"] for obj in contents]
|
||||
for key in keys:
|
||||
assert not key.endswith("reference.bin"), f"Found reference.bin in: {key}"
|
||||
|
||||
# Should only have the app.zip (with .delta stripped)
|
||||
assert len(response.contents) == 1
|
||||
assert response.contents[0].key == "releases/app.zip"
|
||||
assert response.contents[0].is_delta is True
|
||||
assert len(contents) == 1
|
||||
assert contents[0]["Key"] == "releases/app.zip"
|
||||
assert contents[0].get("Metadata", {}).get("deltaglider-is-delta") == "true"
|
||||
|
||||
def test_list_objects_combined_filtering(self):
|
||||
"""Test filtering of both .delta and reference.bin together."""
|
||||
@@ -170,12 +178,15 @@ class TestSDKFiltering:
|
||||
client = DeltaGliderClient(service)
|
||||
response = client.list_objects(Bucket="test-bucket", Prefix="data/")
|
||||
|
||||
# Response is now a boto3-compatible dict
|
||||
contents = response["Contents"]
|
||||
|
||||
# Should filter out 2 reference.bin files
|
||||
# Should strip .delta from 3 files
|
||||
# Should keep 1 regular file as-is
|
||||
assert len(response.contents) == 4 # 3 deltas + 1 regular file
|
||||
assert len(contents) == 4 # 3 deltas + 1 regular file
|
||||
|
||||
keys = [obj.key for obj in response.contents]
|
||||
keys = [obj["Key"] for obj in contents]
|
||||
expected_keys = ["data/file1.zip", "data/file2.zip", "data/file3.txt", "data/sub/app.jar"]
|
||||
assert sorted(keys) == sorted(expected_keys)
|
||||
|
||||
|
||||
189
tests/unit/test_cache_encrypted.py
Normal file
189
tests/unit/test_cache_encrypted.py
Normal file
@@ -0,0 +1,189 @@
|
||||
"""Tests for encrypted cache adapter."""
|
||||
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
from cryptography.fernet import Fernet
|
||||
|
||||
from deltaglider.adapters import ContentAddressedCache, EncryptedCache, Sha256Adapter
|
||||
from deltaglider.core.errors import CacheCorruptionError, CacheMissError
|
||||
|
||||
|
||||
class TestEncryptedCache:
|
||||
"""Test encrypted cache wrapper functionality."""
|
||||
|
||||
@pytest.fixture
|
||||
def temp_dir(self):
|
||||
"""Create temporary directory for tests."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
yield Path(tmpdir)
|
||||
|
||||
@pytest.fixture
|
||||
def hasher(self):
|
||||
"""Create SHA256 hasher."""
|
||||
return Sha256Adapter()
|
||||
|
||||
@pytest.fixture
|
||||
def backend(self, temp_dir, hasher):
|
||||
"""Create CAS backend."""
|
||||
return ContentAddressedCache(temp_dir, hasher)
|
||||
|
||||
@pytest.fixture
|
||||
def encrypted_cache(self, backend):
|
||||
"""Create encrypted cache with ephemeral key."""
|
||||
return EncryptedCache(backend)
|
||||
|
||||
def test_ephemeral_key_generation(self, backend):
|
||||
"""Test that ephemeral key is generated automatically."""
|
||||
cache = EncryptedCache(backend)
|
||||
|
||||
assert cache._ephemeral is True
|
||||
assert cache._key is not None
|
||||
assert len(cache._key) == 44 # Base64-encoded 32-byte key
|
||||
|
||||
def test_provided_key_usage(self, backend):
|
||||
"""Test using provided encryption key."""
|
||||
key = Fernet.generate_key()
|
||||
cache = EncryptedCache(backend, encryption_key=key)
|
||||
|
||||
assert cache._ephemeral is False
|
||||
assert cache._key == key
|
||||
|
||||
def test_write_and_read_encrypted(self, encrypted_cache, temp_dir):
|
||||
"""Test writing and reading encrypted content."""
|
||||
# Create test file
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Secret data that should be encrypted"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
# Compute expected SHA
|
||||
import hashlib
|
||||
|
||||
expected_sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
# Write to encrypted cache
|
||||
encrypted_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Read back and validate
|
||||
decrypted_path = encrypted_cache.get_validated_ref(
|
||||
"test-bucket", "test-prefix", expected_sha
|
||||
)
|
||||
|
||||
# Verify decrypted content matches original
|
||||
decrypted_content = decrypted_path.read_bytes()
|
||||
assert decrypted_content == test_content
|
||||
|
||||
def test_encrypted_storage_not_readable(self, encrypted_cache, backend, temp_dir):
|
||||
"""Test that stored data is actually encrypted."""
|
||||
# Create test file
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Plaintext secret"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
# Write to encrypted cache
|
||||
encrypted_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Get the encrypted file path from backend
|
||||
backend_path = backend.ref_path("test-bucket", "test-prefix")
|
||||
|
||||
# Read encrypted content directly
|
||||
encrypted_content = backend_path.read_bytes()
|
||||
|
||||
# Verify content is NOT the same as plaintext
|
||||
assert encrypted_content != test_content
|
||||
# Verify content doesn't contain plaintext substring
|
||||
assert b"secret" not in encrypted_content.lower()
|
||||
|
||||
def test_cache_miss(self, encrypted_cache):
|
||||
"""Test cache miss error."""
|
||||
with pytest.raises(CacheMissError):
|
||||
encrypted_cache.get_validated_ref("no-bucket", "no-prefix", "fakehash")
|
||||
|
||||
def test_decryption_with_wrong_sha(self, encrypted_cache, temp_dir):
|
||||
"""Test that wrong SHA is detected after decryption."""
|
||||
# Create test file
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Test content"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
# Write to cache
|
||||
encrypted_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Try to read with wrong SHA
|
||||
with pytest.raises(CacheCorruptionError, match="SHA mismatch"):
|
||||
encrypted_cache.get_validated_ref("test-bucket", "test-prefix", "wrong_sha_hash_here")
|
||||
|
||||
def test_decryption_with_wrong_key(self, temp_dir):
|
||||
"""Test that decryption fails with wrong key."""
|
||||
# Create shared backend
|
||||
from deltaglider.adapters import ContentAddressedCache, Sha256Adapter
|
||||
|
||||
hasher = Sha256Adapter()
|
||||
backend = ContentAddressedCache(temp_dir / "shared", hasher)
|
||||
|
||||
# Create two caches with different keys sharing same backend
|
||||
cache1 = EncryptedCache(backend)
|
||||
|
||||
# Write with cache1
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Encrypted data"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
import hashlib
|
||||
|
||||
expected_sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
cache1.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Create cache2 with different key (fresh instance, different ephemeral key)
|
||||
# and manually add to its mapping (simulating persistent storage scenario)
|
||||
cache2 = EncryptedCache(backend)
|
||||
cache2._plaintext_sha_map[("test-bucket", "test-prefix")] = expected_sha
|
||||
|
||||
# Try to read with cache2 (different key) - should fail decryption
|
||||
with pytest.raises(CacheCorruptionError, match="Decryption failed"):
|
||||
cache2.get_validated_ref("test-bucket", "test-prefix", expected_sha)
|
||||
|
||||
def test_evict_cleans_decrypted_files(self, encrypted_cache, temp_dir):
|
||||
"""Test that evict cleans up .decrypted temporary files."""
|
||||
# Create and store file
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Test"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
import hashlib
|
||||
|
||||
expected_sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
encrypted_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Read to create .decrypted file
|
||||
decrypted_path = encrypted_cache.get_validated_ref(
|
||||
"test-bucket", "test-prefix", expected_sha
|
||||
)
|
||||
assert decrypted_path.exists()
|
||||
|
||||
# Evict
|
||||
encrypted_cache.evict("test-bucket", "test-prefix")
|
||||
|
||||
# Verify .decrypted file is removed
|
||||
assert not decrypted_path.exists()
|
||||
|
||||
def test_from_env_with_no_key(self, backend, monkeypatch):
|
||||
"""Test from_env creates ephemeral key when env var not set."""
|
||||
monkeypatch.delenv("DG_CACHE_ENCRYPTION_KEY", raising=False)
|
||||
|
||||
cache = EncryptedCache.from_env(backend)
|
||||
|
||||
assert cache._ephemeral is True
|
||||
|
||||
def test_from_env_with_key(self, backend, monkeypatch):
|
||||
"""Test from_env uses key from environment."""
|
||||
key = Fernet.generate_key()
|
||||
monkeypatch.setenv("DG_CACHE_ENCRYPTION_KEY", key.decode("utf-8"))
|
||||
|
||||
cache = EncryptedCache.from_env(backend)
|
||||
|
||||
assert cache._ephemeral is False
|
||||
assert cache._key == key
|
||||
200
tests/unit/test_cache_memory.py
Normal file
200
tests/unit/test_cache_memory.py
Normal file
@@ -0,0 +1,200 @@
|
||||
"""Tests for in-memory cache adapter."""
|
||||
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from deltaglider.adapters import MemoryCache, Sha256Adapter
|
||||
from deltaglider.core.errors import CacheCorruptionError, CacheMissError
|
||||
|
||||
|
||||
class TestMemoryCache:
|
||||
"""Test in-memory cache functionality."""
|
||||
|
||||
@pytest.fixture
|
||||
def temp_dir(self):
|
||||
"""Create temporary directory for tests."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
yield Path(tmpdir)
|
||||
|
||||
@pytest.fixture
|
||||
def hasher(self):
|
||||
"""Create SHA256 hasher."""
|
||||
return Sha256Adapter()
|
||||
|
||||
@pytest.fixture
|
||||
def memory_cache(self, hasher, temp_dir):
|
||||
"""Create memory cache with 1MB limit."""
|
||||
return MemoryCache(hasher, max_size_mb=1, temp_dir=temp_dir)
|
||||
|
||||
def test_write_and_read(self, memory_cache, temp_dir):
|
||||
"""Test basic write and read functionality."""
|
||||
# Create test file
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Hello, memory cache!"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
# Compute expected SHA
|
||||
import hashlib
|
||||
|
||||
expected_sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
# Write to memory cache
|
||||
memory_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Read back
|
||||
retrieved_path = memory_cache.get_validated_ref("test-bucket", "test-prefix", expected_sha)
|
||||
|
||||
# Verify content
|
||||
assert retrieved_path.read_bytes() == test_content
|
||||
|
||||
def test_has_ref_true(self, memory_cache, temp_dir):
|
||||
"""Test has_ref returns True for existing content."""
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Test"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
import hashlib
|
||||
|
||||
sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
memory_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
assert memory_cache.has_ref("test-bucket", "test-prefix", sha) is True
|
||||
|
||||
def test_has_ref_false(self, memory_cache):
|
||||
"""Test has_ref returns False for non-existent content."""
|
||||
assert memory_cache.has_ref("no-bucket", "no-prefix", "fakehash") is False
|
||||
|
||||
def test_cache_miss(self, memory_cache):
|
||||
"""Test cache miss error."""
|
||||
with pytest.raises(CacheMissError):
|
||||
memory_cache.get_validated_ref("no-bucket", "no-prefix", "fakehash")
|
||||
|
||||
def test_sha_mismatch_detection(self, memory_cache, temp_dir):
|
||||
"""Test that SHA mismatch is detected."""
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_file.write_bytes(b"Content")
|
||||
|
||||
memory_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Try to read with wrong SHA
|
||||
with pytest.raises(CacheCorruptionError, match="SHA mismatch"):
|
||||
memory_cache.get_validated_ref("test-bucket", "test-prefix", "wrong_sha")
|
||||
|
||||
def test_lru_eviction(self, hasher, temp_dir):
|
||||
"""Test LRU eviction when cache is full."""
|
||||
# Create small cache (only 10KB)
|
||||
small_cache = MemoryCache(hasher, max_size_mb=0.01, temp_dir=temp_dir)
|
||||
|
||||
# Create files that will exceed cache limit
|
||||
file1 = temp_dir / "file1.txt"
|
||||
file2 = temp_dir / "file2.txt"
|
||||
file3 = temp_dir / "file3.txt"
|
||||
|
||||
# Each file is 5KB
|
||||
file1.write_bytes(b"A" * 5000)
|
||||
file2.write_bytes(b"B" * 5000)
|
||||
file3.write_bytes(b"C" * 5000)
|
||||
|
||||
# Write file1 and file2 (total 10KB, at limit)
|
||||
small_cache.write_ref("bucket", "prefix1", file1)
|
||||
small_cache.write_ref("bucket", "prefix2", file2)
|
||||
|
||||
# Verify both are in cache
|
||||
import hashlib
|
||||
|
||||
sha1 = hashlib.sha256(b"A" * 5000).hexdigest()
|
||||
sha2 = hashlib.sha256(b"B" * 5000).hexdigest()
|
||||
|
||||
assert small_cache.has_ref("bucket", "prefix1", sha1) is True
|
||||
assert small_cache.has_ref("bucket", "prefix2", sha2) is True
|
||||
|
||||
# Write file3 (5KB) - should evict file1 (LRU)
|
||||
small_cache.write_ref("bucket", "prefix3", file3)
|
||||
|
||||
# file1 should be evicted
|
||||
assert small_cache.has_ref("bucket", "prefix1", sha1) is False
|
||||
|
||||
# file2 and file3 should still be in cache
|
||||
sha3 = hashlib.sha256(b"C" * 5000).hexdigest()
|
||||
assert small_cache.has_ref("bucket", "prefix2", sha2) is True
|
||||
assert small_cache.has_ref("bucket", "prefix3", sha3) is True
|
||||
|
||||
def test_file_too_large_for_cache(self, hasher, temp_dir):
|
||||
"""Test error when file exceeds cache size limit."""
|
||||
small_cache = MemoryCache(hasher, max_size_mb=0.001, temp_dir=temp_dir) # 1KB limit
|
||||
|
||||
large_file = temp_dir / "large.txt"
|
||||
large_file.write_bytes(b"X" * 2000) # 2KB file
|
||||
|
||||
with pytest.raises(CacheCorruptionError, match="too large"):
|
||||
small_cache.write_ref("bucket", "prefix", large_file)
|
||||
|
||||
def test_evict_removes_from_memory(self, memory_cache, temp_dir):
|
||||
"""Test that evict removes content from memory."""
|
||||
test_file = temp_dir / "test.txt"
|
||||
test_content = b"Test"
|
||||
test_file.write_bytes(test_content)
|
||||
|
||||
import hashlib
|
||||
|
||||
sha = hashlib.sha256(test_content).hexdigest()
|
||||
|
||||
memory_cache.write_ref("test-bucket", "test-prefix", test_file)
|
||||
|
||||
# Verify it's in cache
|
||||
assert memory_cache.has_ref("test-bucket", "test-prefix", sha) is True
|
||||
|
||||
# Evict
|
||||
memory_cache.evict("test-bucket", "test-prefix")
|
||||
|
||||
# Verify it's gone
|
||||
assert memory_cache.has_ref("test-bucket", "test-prefix", sha) is False
|
||||
|
||||
def test_clear_removes_all(self, memory_cache, temp_dir):
|
||||
"""Test that clear removes all cached content."""
|
||||
# Add multiple files
|
||||
for i in range(3):
|
||||
test_file = temp_dir / f"test{i}.txt"
|
||||
test_file.write_bytes(f"Content {i}".encode())
|
||||
memory_cache.write_ref("bucket", f"prefix{i}", test_file)
|
||||
|
||||
# Verify cache is not empty
|
||||
assert memory_cache._current_size > 0
|
||||
assert len(memory_cache._cache) == 3
|
||||
|
||||
# Clear
|
||||
memory_cache.clear()
|
||||
|
||||
# Verify cache is empty
|
||||
assert memory_cache._current_size == 0
|
||||
assert len(memory_cache._cache) == 0
|
||||
assert len(memory_cache._access_order) == 0
|
||||
|
||||
def test_access_order_updated_on_read(self, memory_cache, temp_dir):
|
||||
"""Test that LRU access order is updated on reads."""
|
||||
# Create two files
|
||||
file1 = temp_dir / "file1.txt"
|
||||
file2 = temp_dir / "file2.txt"
|
||||
file1.write_bytes(b"File 1")
|
||||
file2.write_bytes(b"File 2")
|
||||
|
||||
# Write both
|
||||
memory_cache.write_ref("bucket", "prefix1", file1)
|
||||
memory_cache.write_ref("bucket", "prefix2", file2)
|
||||
|
||||
# Access order should be: [prefix1, prefix2]
|
||||
assert memory_cache._access_order[0] == ("bucket", "prefix1")
|
||||
assert memory_cache._access_order[1] == ("bucket", "prefix2")
|
||||
|
||||
# Read prefix1 again
|
||||
import hashlib
|
||||
|
||||
sha1 = hashlib.sha256(b"File 1").hexdigest()
|
||||
memory_cache.get_validated_ref("bucket", "prefix1", sha1)
|
||||
|
||||
# Access order should now be: [prefix2, prefix1]
|
||||
assert memory_cache._access_order[0] == ("bucket", "prefix2")
|
||||
assert memory_cache._access_order[1] == ("bucket", "prefix1")
|
||||
Reference in New Issue
Block a user