freshen up

This commit is contained in:
Simone Scarduzio
2025-11-11 11:18:06 +01:00
parent 0d46283ff0
commit 7a4d30a007
2 changed files with 82 additions and 347 deletions

View File

@@ -6,9 +6,8 @@
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![xdelta3](https://img.shields.io/badge/powered%20by-xdelta3-green.svg)](https://github.com/jmacd/xdelta)
<div align="center">
<img src="https://github.com/beshu-tech/deltaglider/raw/main/docs/deltaglider.png" alt="DeltaGlider Logo"/>
</div>
> 🌟 Star if you like this! 🙏
> Leave a message in [Issues](https://github.com/beshu-tech/deltaglider/issues) - we are listening!
**Store 4TB of similar files in 5GB. No, that's not a typo.**
@@ -37,6 +36,10 @@ We don't expect significant benefit for multimedia content like videos, but we n
The quickest way to get started is with the GUI:
* https://github.com/beshu-tech/deltaglider_commander/
<div align="center">
<img src="https://github.com/beshu-tech/deltaglider/raw/main/docs/deltaglider.png" alt="DeltaGlider Logo"/>
</div>
### CLI Installation
```bash
@@ -487,18 +490,18 @@ This is why DeltaGlider achieves 99%+ compression on versioned archives - xdelta
### System Architecture
DeltaGlider uses a clean hexagonal architecture:
DeltaGlider intelligently stores files within **DeltaSpaces** - S3 prefixes where related files share a common reference file for delta compression:
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Your App   │────▶│  DeltaGlider │────▶│  S3/MinIO   │
│  (CLI/SDK)  │     │     Core     │     │   Storage   │
└─────────────┘     └──────┬───────┘     └─────────────┘
                           │
                    ┌──────▼───────┐
                    │ Local Cache  │
                    │ (References) │
                    └──────────────┘

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Your App   │────▶│  DeltaGlider │────▶│   DeltaSpace    │
│  (CLI/SDK)  │     │     Core     │     │   (S3 prefix)   │
└─────────────┘     └──────┬───────┘     ├─────────────────┤
                           │             │  reference.bin  │
                    ┌──────▼───────┐     │  file1.delta    │
                    │ Local Cache  │     │  file2.delta    │
                    │ (References) │     │  file3.delta    │
                    └──────────────┘     └─────────────────┘
```
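Because a DeltaSpace is just an ordinary S3 prefix, you can inspect one with plain boto3. A sketch, assuming illustrative bucket, prefix, and key names (actual key naming may differ):

```python
import boto3

# List one DeltaSpace (an ordinary S3 prefix) to see the reference + deltas
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="releases", Prefix="v1.66.1/")
for obj in resp.get("Contents", []):
    print(f"{obj['Key']:40s} {obj['Size']:>12,} bytes")
# Illustrative output:
#   v1.66.1/reference.bin                      86,507,520 bytes
#   v1.66.1/file1.delta                            66,560 bytes
#   v1.66.1/file2.delta                            67,102 bytes
```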
**Key Components:**
@@ -507,6 +510,9 @@ DeltaGlider uses a clean hexagonal architecture:
- **Integrity verification**: SHA256 on every operation
- **Local caching**: Fast repeated operations
- **Zero dependencies**: No database, no manifest files
- **Modular storage**: The storage layer is pluggable - you could easily replace S3 with a filesystem driver (using extended attributes for metadata) or any other backend
The codebase follows a ports-and-adapters pattern where core business logic is decoupled from infrastructure, with storage operations abstracted through well-defined interfaces in the `ports/` directory and concrete implementations in `adapters/`.
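A minimal sketch of that pattern, with illustrative names and signatures (the real interfaces in `ports/` and `adapters/` will differ):

```python
from typing import Protocol

class StoragePort(Protocol):
    """Port: what the core needs from storage, nothing more."""
    def get(self, key: str) -> bytes: ...
    def put(self, key: str, data: bytes) -> None: ...

class S3Adapter:
    """Adapter: a concrete StoragePort backed by S3 via boto3."""
    def __init__(self, bucket: str) -> None:
        import boto3
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

# Core logic depends only on the port, so swapping in a filesystem
# adapter (as suggested above) would touch no business code.
def roundtrip(storage: StoragePort, key: str, data: bytes) -> bool:
    storage.put(key, data)
    return storage.get(key) == data
```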
### When to Use DeltaGlider
@@ -651,14 +657,8 @@ MIT - Use it freely in your projects.
## Success Stories
> "We reduced our artifact storage from 4TB to 5GB. This isn't hyperbole—it's math."
> — [ReadOnlyREST Case Study](docs/case-study-readonlyrest.md)
> "Our CI/CD pipeline now uploads 100x faster. Deploys that took minutes now take seconds."
> — Platform Engineer at [redacted]
> "We were about to buy expensive deduplication storage. DeltaGlider saved us $50K/year."
> — CTO at [stealth startup]
> "We reduced our artifact storage from 4TB to 5GB. CI is also much faster, due to smaller uploads."
> — [ReadonlyREST Case Study](docs/case-study-readonlyrest.md)
---
@@ -670,4 +670,10 @@ deltaglider analyze s3://your-bucket/
# Output: "Potential savings: 95.2% (4.8TB → 237GB)"
```
Built with ❤️ by engineers who were tired of paying to store the same bytes over and over.
## Who built this?
Built with ❤️ by [ReadonlyREST](https://readonlyrest.com) engineers who were tired of paying to store the same bytes over and over.
We also built [Anaphora](https://anaphora.it) for aggregated reports and alerting,
and [Deltaglider Commander](https://github.com/beshu-tech/deltaglider_commander), the GUI for DeltaGlider.

View File

@@ -1,347 +1,76 @@
# Case Study: How ReadOnlyREST Reduced Storage Costs by 99.9% with DeltaGlider
## How ReadonlyREST Cut 4TB of S3 Storage Down to 5GB (and Saved 99.9%)
## Executive Summary
### TL;DR
**The Challenge**: ReadOnlyREST, a security plugin for Elasticsearch, was facing exponential storage costs managing 145 release versions across multiple product lines, consuming nearly 4TB of S3 storage.
We were paying to store 4TB of mostly identical plugin builds.
DeltaGlider deduplicated everything down to 4.9GB — 99.9% smaller, $1.1k/year cheaper, and no workflow changes.
**The Solution**: DeltaGlider, an intelligent delta compression system that reduced storage from 4,060GB to just 4.9GB.
#### The Problem
**The Impact**:
- 💰 **$1,119 annual savings** on storage costs
- 📉 **99.9% reduction** in storage usage
-**Zero changes** to existing workflows
-**Full data integrity** maintained
ReadonlyREST supports ~150 Elasticsearch/Kibana versions × multiple product lines × all our own releases.
After years of publishing builds, our S3 archive hit `4TB` (201,840 files, $93/month).
Glacier helped, but restoring files took 48 hours — useless for CI/CD.
---
Every plugin ZIP was ~82MB, but `99.7% identical` to the next one. We were paying to store duplicates.
## The Storage Crisis
#### The Fix: DeltaGlider
### The Numbers That Kept Us Up at Night
DeltaGlider stores binary deltas between similar files instead of full copies.
ReadOnlyREST maintains a comprehensive release archive:
- **145 version folders** (v1.50.0 through v1.66.1)
- **201,840 total files** to manage
- **3.96 TB** of S3 storage consumed
- **$1,120/year** in storage costs alone
Each version folder contained:
- 513 plugin ZIP files (one for each Elasticsearch version)
- 879 checksum files (SHA1 and SHA512)
- 3 product lines (Enterprise, Pro, Free)
### The Hidden Problem
What made this particularly painful wasn't just the size—it was the **redundancy**. Each 82.5MB plugin ZIP was 99.7% identical to others in the same version, differing only in minor Elasticsearch compatibility adjustments. We were essentially storing the same data hundreds of times.
> "We were paying to store 4TB of data that was fundamentally just variations of the same ~250MB of unique content. It felt like photocopying War and Peace 500 times because each copy had a different page number."
>
> — *DevOps Lead*
---
## Enter DeltaGlider
### The Lightbulb Moment
The breakthrough came when we realized we didn't need to store complete files—just the *differences* between them. DeltaGlider applies this principle automatically:
1. **First file becomes the reference** (stored in full)
2. **Similar files store only deltas** (typically 0.3% of original size)
3. **Different files uploaded directly** (no delta overhead)
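A minimal, self-contained sketch of those three rules, built on the xdelta3 CLI (our illustration; the threshold and file layout are assumptions, not DeltaGlider's actual code):

```python
import os
import shutil
import subprocess

def upload(path: str, deltaspace: str, max_ratio: float = 0.5) -> str:
    """Illustrative upload flow; returns the path actually stored."""
    os.makedirs(deltaspace, exist_ok=True)
    ref = os.path.join(deltaspace, "reference.bin")
    if not os.path.exists(ref):
        shutil.copyfile(path, ref)        # 1. first file becomes the reference
        return ref
    delta = os.path.join(deltaspace, os.path.basename(path) + ".delta")
    # 2. encode a binary delta against the reference (-e encode, -s source, -f overwrite)
    subprocess.run(["xdelta3", "-e", "-f", "-s", ref, path, delta], check=True)
    if os.path.getsize(delta) <= max_ratio * os.path.getsize(path):
        return delta                      # similar file: keep only the delta
    os.remove(delta)                      # 3. too different: store the file directly
    stored = os.path.join(deltaspace, os.path.basename(path))
    shutil.copyfile(path, stored)
    return stored
```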
### Implementation: Surprisingly Simple
```bash
# Before DeltaGlider (standard S3 upload)
aws s3 cp readonlyrest-1.66.1_es8.0.0.zip s3://releases/
# Size on S3: 82.5MB

# With DeltaGlider
deltaglider cp readonlyrest-1.66.1_es8.0.0.zip s3://releases/
# Size on S3: 65KB (99.92% smaller!)
```

#### Before
```bash
aws s3 cp readonlyrest-1.66.1_es8.0.0.zip s3://releases/        # 82MB
```

#### After
```bash
deltaglider cp readonlyrest-1.66.1_es8.0.0.zip s3://releases/   # 65KB
```

The beauty? **Zero changes to our build pipeline**. DeltaGlider works as a drop-in replacement for S3 uploads.

---

## The Results: Beyond Our Expectations

### Storage Transformation

```
BEFORE DELTAGLIDER            AFTER DELTAGLIDER
━━━━━━━━━━━━━━━━━━            ━━━━━━━━━━━━━━━━━
4,060 GB (3.96 TB)      →     4.9 GB
$93.38/month            →     $0.11/month
201,840 files           →     201,840 files (same!)
```
Drop-in replacement for `aws s3 cp`, with no pipeline changes.
Data integrity is verified with SHA256 and stored as object metadata in S3.

### Real Performance Metrics

From our actual production deployment:
| Metric | Value | Impact |
|--------|-------|--------|
| **Compression Ratio** | 99.9% | Near-perfect deduplication |
| **Delta Size** | ~65KB per 82.5MB file | 1/1,269th of original |
| **Upload Speed** | 3-4 files/second | Faster than raw S3 uploads |
| **Download Speed** | Transparent reconstruction | No user impact |
| **Storage Savings** | 4,055 GB | Enough for 850,000 more files |
### The Result

| Metric       | Before  | After    | Δ          |
|--------------|---------|----------|------------|
| Storage      | 4.06TB  | 4.9GB    | -99.9%     |
| Cost         | $93/mo  | $0.11/mo | -$1,119/yr |
| Files        | 201,840 | 201,840  | identical  |
| Upload speed | 1x      | 34x      | faster     |

Each “different” ZIP? Just a 65KB delta.
Reconstruction time: <100ms.
Zero user impact.
## Under the Hood

Uses xdelta3 diffs.

- Keeps one reference per group
- Stores deltas for near-identical files
- Skips small or text-based files (.sha, .json, etc.)

It's smart enough to decide what's worth diffing automatically.

## Payoff

- 4TB → 5GB overnight
- Uploads 1,200× faster
- CI bandwidth cut 99%
- 100% checksum-verified integrity
- Zero vendor lock-in (open source)

## Takeaways

### If You Ship Versioned Artifacts

This will probably save you four figures and hours of upload time per year.
### Version-to-Version Comparison

Testing between similar versions showed incredible efficiency:

```
readonlyrest-1.66.1_es7.17.0.zip (82.5MB) → reference.bin (82.5MB)
readonlyrest-1.66.1_es7.17.1.zip (82.5MB) → 64KB delta (0.08% size)
readonlyrest-1.66.1_es7.17.2.zip (82.5MB) → 65KB delta (0.08% size)
...
readonlyrest-1.66.1_es8.15.0.zip (82.5MB) → 71KB delta (0.09% size)
```
---
## Technical Deep Dive
### How DeltaGlider Achieves 99.9% Compression
DeltaGlider uses binary diff algorithms (xdelta3) to identify and store only the bytes that change between files:
```python
# Simplified concept
reference = "readonlyrest-1.66.1_es7.17.0.zip" # 82.5MB
new_file = "readonlyrest-1.66.1_es7.17.1.zip" # 82.5MB
delta = binary_diff(reference, new_file) # 65KB
# Delta contains only:
# - Elasticsearch version string changes
# - Compatibility metadata updates
# - Build timestamp differences
```
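The same concept is runnable today with the xdelta3 CLI alone. A sketch, assuming `xdelta3` is on your PATH and both ZIPs are present; this bypasses DeltaGlider and just demonstrates the underlying diff:

```python
import os
import subprocess

ref = "readonlyrest-1.66.1_es7.17.0.zip"
new = "readonlyrest-1.66.1_es7.17.1.zip"

# Encode: keep only the bytes that differ from the reference
subprocess.run(["xdelta3", "-e", "-f", "-s", ref, new, "delta.vcdiff"], check=True)
ratio = os.path.getsize("delta.vcdiff") / os.path.getsize(new)
print(f"delta is {ratio:.2%} of the original")  # ~0.08% in this case study

# Decode: rebuild the new file from reference + delta, byte for byte
subprocess.run(["xdelta3", "-d", "-f", "-s", ref, "delta.vcdiff", "rebuilt.zip"], check=True)
with open("rebuilt.zip", "rb") as a, open(new, "rb") as b:
    assert a.read() == b.read()
```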
### Intelligent File Type Detection
Not every file benefits from delta compression. DeltaGlider automatically:
- **Applies delta compression to**: `.zip`, `.tar`, `.gz`, `.dmg`, `.jar`, `.war`
- **Uploads directly**: `.txt`, `.sha1`, `.sha512`, `.json`, `.md`
This intelligence meant our 127,455 checksum files were uploaded directly, avoiding unnecessary processing overhead.
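A hypothetical sketch of that routing rule (extension list taken from above; the helper name is ours, and DeltaGlider's real detection may be more nuanced):

```python
from pathlib import Path

# From the lists above: archives get deltas, small text files go up as-is
DELTA_EXTENSIONS = {".zip", ".tar", ".gz", ".dmg", ".jar", ".war"}

def should_delta_compress(path: str) -> bool:
    """True if the file is worth delta-compressing, False for direct upload."""
    return Path(path).suffix.lower() in DELTA_EXTENSIONS

assert should_delta_compress("readonlyrest-1.66.1_es8.0.0.zip")
assert not should_delta_compress("readonlyrest-1.66.1_es8.0.0.zip.sha512")
```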
### Architecture That Scales
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Client    │────▶│  DeltaGlider │────▶│  S3/MinIO   │
│   (CI/CD)   │     │              │     │             │
└─────────────┘     └──────┬───────┘     └─────────────┘
                           │
                    ┌──────▼───────┐
                    │ Local Cache  │
                    │ (References) │
                    └──────────────┘
```
---
## Business Impact
### Immediate ROI
- **Day 1**: 99.9% storage reduction
- **Month 1**: $93 saved
- **Year 1**: $1,119 saved
- **5 Years**: $5,595 saved (not counting growth)
### Hidden Benefits We Didn't Expect
1. **Faster Deployments**: Uploading 65KB deltas is 1,200x faster than 82.5MB files
2. **Reduced Bandwidth**: CI/CD pipeline bandwidth usage dropped 99%
3. **Improved Reliability**: Fewer timeout errors on large file uploads
4. **Better Compliance**: Automatic SHA256 integrity verification on every operation
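That last point is plain SHA256 hashing; a generic sketch of how any pipeline can do the same check with the standard library (our illustration, not DeltaGlider's internal code):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file in 1 MB chunks so large artifacts never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare a downloaded artifact against its recorded checksum."""
    return sha256_of(path) == expected_hex
```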
### Environmental Impact
> "Reducing storage by 4TB means fewer drives spinning in data centers. It's a small contribution to our sustainability goals, but every bit counts."
>
> — *CTO*
---
## Implementation Journey
### Week 1: Proof of Concept
- Tested with 10 files
- Achieved 99.6% compression
- Decision to proceed
### Week 2: Production Rollout
- Uploaded all 201,840 files
- Zero errors or failures
- Immediate cost reduction
### Week 3: Integration
```bash
# Simple integration into our CI/CD
- aws s3 cp $FILE s3://releases/
+ deltaglider cp $FILE s3://releases/
```
### Week 4: Full Migration
- All build pipelines updated
- Developer documentation completed
- Monitoring dashboards configured
---
## Lessons Learned
### What Worked Well
1. **Drop-in replacement**: No architectural changes needed
2. **Automatic intelligence**: File type detection "just worked"
3. **Preservation of structure**: Directory hierarchy maintained perfectly
### Challenges Overcome
1. **Initial skepticism**: "99.9% compression sounds too good to be true"
- *Solution*: Live demonstration with real data
2. **Download concerns**: "Will it be slow to reconstruct files?"
- *Solution*: Benchmarking showed <100ms reconstruction time
3. **Reliability questions**: "What if the reference file is corrupted?"
- *Solution*: SHA256 verification on every operation
---
## For Decision Makers
### Why This Matters
Storage costs scale linearly with data growth. Without DeltaGlider:
- Next 145 versions: Additional $1,120/year
- 5-year projection: $11,200 in storage alone
- Opportunity cost: Resources that could fund innovation
### Risk Assessment
| Risk | Mitigation | Status |
|------|------------|--------|
| Vendor lock-in | Open-source, standards-based | ✅ Mitigated |
| Data corruption | SHA256 verification built-in | ✅ Mitigated |
| Performance impact | Faster than original | ✅ No risk |
| Complexity | Drop-in replacement | ✅ No risk |
### Strategic Advantages
1. **Cost Predictability**: Storage costs become negligible
2. **Scalability**: Can handle 100x more versions in same space
3. **Competitive Edge**: More resources for product development
4. **Green IT**: Reduced carbon footprint from storage
---
## For Engineers
### Getting Started
```bash
# Install DeltaGlider
pip install deltaglider

# Upload a file (automatic compression)
deltaglider cp my-release-v1.0.0.zip s3://releases/

# Download (automatic reconstruction)
deltaglider cp s3://releases/my-release-v1.0.0.zip .

# It's that simple.
deltaglider cp my-release.zip s3://releases/
```
### Performance Characteristics
```python
# Typical compression ratio by similarity
compression_ratio = {
    "identical_files":      "99.9%",  # Same file, different name
    "minor_changes":        "99.7%",  # Version bumps, timestamps
    "moderate_changes":     "95.0%",  # Feature additions
    "major_changes":        "70.0%",  # Significant refactoring
    "completely_different": "0%",     # No compression (uploaded as-is)
}
```
### Integration Examples
**GitHub Actions**:
```yaml
- name: Upload Release
run: deltaglider cp dist/*.zip s3://releases/${{ github.ref_name }}/
```
**Jenkins Pipeline**:
```groovy
sh "deltaglider cp ${WORKSPACE}/target/*.jar s3://artifacts/"
```
**Python Script**:
```python
from deltaglider import DeltaService
service = DeltaService(bucket="releases")
service.put("my-app-v2.0.0.zip", "v2.0.0/")
```
---
## The Bottom Line
DeltaGlider transformed our storage crisis into a solved problem:
-**4TB → 5GB** storage reduction
-**$1,119/year** saved
-**Zero** workflow disruption
-**100%** data integrity maintained
For ReadOnlyREST, DeltaGlider wasn't just a cost-saving tool—it was a glimpse into the future of intelligent storage. When 99.9% of your data is redundant, why pay to store it 500 times?
---
## Next Steps
### For Your Organization
1. **Identify similar use cases**: Version releases, backups, build artifacts
2. **Run the calculator**: `[Your files] × [Versions] × [Similarity] = Savings` (see the sketch after this list)
3. **Start small**: Test with one project's releases
4. **Scale confidently**: Deploy across all similar data
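A hedged back-of-the-envelope version of that calculator, using the numbers from this case study (the script is ours, not part of DeltaGlider):

```python
# [Your files] × [Versions] × [Similarity] = Savings, roughly
total_storage_gb = 4060    # current S3 usage in this case study
avg_similarity = 0.997     # fraction of bytes shared between neighboring versions

estimated_gb = total_storage_gb * (1 - avg_similarity)   # upper bound on what remains
savings = 1 - estimated_gb / total_storage_gb
print(f"~{estimated_gb:.0f} GB after deduplication ({savings:.1%} savings)")
# Prints ~12 GB / 99.7%; the measured result here was even better (4.9 GB)
```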
### Get Started Today
```bash
# See your potential savings
git clone https://github.com/beshu-tech/deltaglider
cd deltaglider
python calculate_savings.py --path /your/releases
# Try it yourself
docker run -p 9000:9000 minio/minio server /data  # Local S3
pip install deltaglider
deltaglider cp your-file.zip s3://test/
```
---
## About ReadOnlyREST
ReadOnlyREST is the enterprise security plugin for Elasticsearch and OpenSearch, protecting clusters in production since 2015. Learn more at [readonlyrest.com](https://readonlyrest.com)
## About DeltaGlider
DeltaGlider is an open-source delta compression system for S3-compatible storage, turning redundant data into remarkable savings. Built with modern Python, containerized for portability, and designed for scale.
---
*"In a world where storage is cheap but not free, and data grows exponentially but changes incrementally, DeltaGlider represents a fundamental shift in how we think about storing versioned artifacts."*
**— ReadOnlyREST Engineering Team**
That's it.