docs: evaluations instructions and readme refactoring

Per Stark
2025-12-22 18:32:59 +01:00
parent 30b8a65377
commit 9a623cbc3f
7 changed files with 570 additions and 232 deletions

docs/architecture.md Normal file

@@ -0,0 +1,74 @@
# Architecture
## Tech Stack
| Layer | Technology |
|-------|------------|
| Backend | Rust with Axum (SSR) |
| Frontend | HTML + HTMX + minimal JS |
| Database | SurrealDB (graph, document, vector) |
| AI | OpenAI-compatible API |
| Web Processing | Headless Chromium |
## Crate Structure
```
minne/
├── main/ # Combined server + worker binary
├── api-router/ # REST API routes
├── html-router/ # SSR web interface
├── ingestion-pipeline/ # Content processing pipeline
├── retrieval-pipeline/ # Search and retrieval logic
├── common/ # Shared types, storage, utilities
├── evaluations/ # Benchmarking framework
└── json-stream-parser/ # Streaming JSON utilities
```
## Process Modes
| Binary | Purpose |
|--------|---------|
| `main` | All-in-one: serves UI and processes content |
| `server` | UI and API only (no background processing) |
| `worker` | Background processing only (no UI) |
Split deployment is useful for scaling or resource isolation.
## Data Flow
```
Content In → Ingestion Pipeline → SurrealDB
                   ├─ Entity Extraction
                   ├─ Embedding Generation
                   └─ Graph Relationships

Query → Retrieval Pipeline → Results
              ├─ Vector Search + FTS + Graph
              └─ RRF Fusion → (Optional Rerank) → Response
```
## Database Schema
SurrealDB stores:
- **TextContent** — Raw ingested content
- **TextChunk** — Chunked content with embeddings
- **KnowledgeEntity** — Extracted entities (people, concepts, etc.)
- **KnowledgeRelationship** — Connections between entities
- **User** — Authentication and preferences
- **SystemSettings** — Model configuration
Embeddings are stored in dedicated tables with HNSW indexes for fast vector search.
## Retrieval Strategy
1. **Collect candidates** — Vector similarity + full-text search
2. **Merge ranks** — Reciprocal Rank Fusion (RRF)
3. **Attach context** — Link chunks to parent entities
4. **Rerank** (optional) — Cross-encoder rescoring
5. **Return** — Top-k results with metadata
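The fusion step (2) can be sketched as follows. This is a hedged illustration of standard RRF, not Minne's actual implementation; the constant `k = 60` is the conventional default from the RRF literature, not a value taken from the codebase:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)),
/// with 1-based ranks. Documents ranked highly by several retrievers rise to the top.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let vector_hits = vec!["chunk-a", "chunk-b", "chunk-c"];
    let fts_hits = vec!["chunk-b", "chunk-a", "chunk-d"];
    let fused = rrf_fuse(&[vector_hits, fts_hits], 60.0);
    // chunk-a and chunk-b appear in both lists, so they tie for the top spots.
    for (doc, score) in &fused {
        println!("{}: {:.4}", doc, score);
    }
}
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the vector and full-text retrievers.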

docs/configuration.md Normal file

@@ -0,0 +1,77 @@
# Configuration
Minne can be configured via environment variables or a `config.yaml` file. Environment variables take precedence.
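The precedence rule can be sketched like this; the function and parameter names are illustrative, not Minne's actual configuration API:

```rust
/// Sketch of the documented precedence: an environment variable beats the
/// config.yaml value, which beats the built-in default.
fn resolve_http_port(env_val: Option<&str>, file_val: Option<u16>) -> u16 {
    env_val
        .and_then(|v| v.parse().ok()) // malformed env values fall through
        .or(file_val)
        .unwrap_or(3000) // documented default
}

fn main() {
    assert_eq!(resolve_http_port(Some("9000"), Some(8080)), 9000); // env wins
    assert_eq!(resolve_http_port(None, Some(8080)), 8080);         // file next
    assert_eq!(resolve_http_port(None, None), 3000);               // default
    println!("precedence ok");
}
```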
## Required Settings
| Variable | Description | Example |
|----------|-------------|---------|
| `OPENAI_API_KEY` | API key for OpenAI-compatible endpoint | `sk-...` |
| `SURREALDB_ADDRESS` | WebSocket address of SurrealDB | `ws://127.0.0.1:8000` |
| `SURREALDB_USERNAME` | SurrealDB username | `root_user` |
| `SURREALDB_PASSWORD` | SurrealDB password | `root_password` |
| `SURREALDB_DATABASE` | Database name | `minne_db` |
| `SURREALDB_NAMESPACE` | Namespace | `minne_ns` |
## Optional Settings
| Variable | Description | Default |
|----------|-------------|---------|
| `HTTP_PORT` | Server port | `3000` |
| `DATA_DIR` | Local data directory | `./data` |
| `OPENAI_BASE_URL` | Custom AI provider URL | OpenAI default |
| `RUST_LOG` | Logging level | `info` |
### Reranking (Optional)
| Variable | Description | Default |
|----------|-------------|---------|
| `RERANKING_ENABLED` | Enable FastEmbed reranking | `false` |
| `RERANKING_POOL_SIZE` | Concurrent reranker workers | `2` |
| `FASTEMBED_CACHE_DIR` | Model cache directory | `<data_dir>/fastembed/reranker` |
> [!NOTE]
> Enabling reranking downloads ~1.1 GB of model data on first startup.
## Example config.yaml
```yaml
surrealdb_address: "ws://127.0.0.1:8000"
surrealdb_username: "root_user"
surrealdb_password: "root_password"
surrealdb_database: "minne_db"
surrealdb_namespace: "minne_ns"
openai_api_key: "sk-your-key-here"
data_dir: "./minne_data"
http_port: 3000
# Optional reranking
reranking_enabled: true
reranking_pool_size: 2
```
## AI Provider Setup
Minne works with any OpenAI-compatible API that supports structured outputs.
### OpenAI (Default)
Set `OPENAI_API_KEY` only. The default base URL points to OpenAI.
### Ollama
```bash
export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1"
```
### Other Providers
Any provider exposing an OpenAI-compatible endpoint works. Set `OPENAI_BASE_URL` accordingly.
## Model Selection
1. Access `/admin` in your Minne instance
2. Select models for content processing and chat
3. **Content Processing**: Must support structured outputs
4. **Embedding Dimensions**: Update when changing embedding models (e.g., 1536 for `text-embedding-3-small`)
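The dimension setting matters because vectors stored in the HNSW index must match the configured length. A minimal, hypothetical illustration of such a check (not Minne's actual code):

```rust
// Hypothetical validation; Minne's real ingestion code may differ.
fn check_embedding_dims(embedding: &[f32], configured_dim: usize) -> Result<(), String> {
    if embedding.len() == configured_dim {
        Ok(())
    } else {
        Err(format!(
            "embedding has {} dimensions, but the index is configured for {}",
            embedding.len(),
            configured_dim
        ))
    }
}

fn main() {
    // text-embedding-3-small produces 1536-dimensional vectors.
    let embedding = vec![0.0_f32; 1536];
    assert!(check_embedding_dims(&embedding, 1536).is_ok());
    assert!(check_embedding_dims(&embedding, 768).is_err());
    println!("dimension checks passed");
}
```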

docs/features.md Normal file

@@ -0,0 +1,64 @@
# Features
## Search vs Chat
**Search** — Use when you know what you're looking for. Full-text search matches query terms across your content.
**Chat** — Use when exploring concepts or reasoning about your knowledge. The AI analyzes your query and retrieves relevant context from your entire knowledge base.
## Content Processing
Minne automatically processes saved content:
1. **Web scraping** extracts readable text from URLs (via headless Chrome)
2. **Text analysis** identifies key concepts and relationships
3. **Graph creation** builds connections between related content
4. **Embedding generation** enables semantic search
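The four stages above can be sketched as a simple pipeline. Every type and function body here is an illustrative stub, not Minne's actual ingestion code; the real work is done by headless Chrome and the configured AI models:

```rust
// Illustrative stubs only, to show how the stages compose.
struct Entity {
    name: String,
}

fn scrape(url: &str) -> String {
    format!("text from {}", url) // 1. web scraping (stubbed)
}

fn extract_entities(text: &str) -> Vec<Entity> {
    // 2. text analysis (stubbed: pretend the first two words are concepts)
    text.split_whitespace()
        .take(2)
        .map(|w| Entity { name: w.to_string() })
        .collect()
}

fn link_entities(entities: &[Entity]) -> Vec<(usize, usize)> {
    // 3. graph creation (stubbed: link every entity to the first one)
    (1..entities.len()).map(|i| (0, i)).collect()
}

fn embed(_text: &str) -> Vec<f32> {
    vec![0.0; 4] // 4. embedding generation (stubbed: a tiny fixed vector)
}

fn main() {
    let text = scrape("https://example.com");
    let entities = extract_entities(&text);
    let edges = link_entities(&entities);
    let embedding = embed(&text);
    println!(
        "{} entities ({} first), {} edge(s), {}-dim embedding",
        entities.len(),
        entities[0].name,
        edges.len(),
        embedding.len()
    );
}
```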
## Knowledge Graph
Explore your knowledge as an interactive network:
- **Manual curation** — Create entities and relationships yourself
- **AI automation** — Let AI extract entities and discover relationships
- **Hybrid approach** — AI suggests connections for your approval
The D3-based graph visualization shows entities as nodes and relationships as edges.
## Hybrid Retrieval
Minne combines multiple retrieval strategies:
- **Vector similarity** — Semantic matching via embeddings
- **Full-text search** — Keyword matching with BM25
- **Graph traversal** — Following relationships between entities
Results from all strategies are merged using Reciprocal Rank Fusion (RRF) to balance their relevance signals.
## Reranking (Optional)
When enabled, retrieval results are rescored with a cross-encoder model for improved relevance. Powered by [fastembed-rs](https://github.com/Anush008/fastembed-rs).
**Trade-offs:**
- Downloads ~1.1 GB of model data
- Adds latency per query
- Potentially improves answer quality; see the [blog post](https://blog.stark.pub/posts/eval-retrieval-refactor/)
Enable via `RERANKING_ENABLED=true`. See [Configuration](./configuration.md).
## Multi-Format Ingestion
Supported content types:
- Plain text and notes
- URLs (web pages)
- PDF documents
- Audio files
- Images
## Scratchpad
Quickly capture content without committing to permanent storage. Convert to full content when ready.
## iOS Shortcut
Use the [Minne iOS Shortcut](https://www.icloud.com/shortcuts/e433fbd7602f4e2eaa70dca162323477) for quick content capture from your phone.

docs/installation.md Normal file

@@ -0,0 +1,67 @@
# Installation
Minne can be installed through several methods. Choose the one that best fits your setup.
## Docker Compose (Recommended)
The fastest way to get Minne running with all dependencies:
```bash
git clone https://github.com/perstarkse/minne.git
cd minne
docker compose up -d
```
The included `docker-compose.yml` handles SurrealDB and Chromium automatically.
**Required:** Set your `OPENAI_API_KEY` in `docker-compose.yml` before starting.
## Nix
Run Minne directly with Nix (includes Chromium):
```bash
nix run 'github:perstarkse/minne#main'
```
Configure via environment variables or a `config.yaml` file. See [Configuration](./configuration.md).
## Pre-built Binaries
Download binaries for Windows, macOS, and Linux from [GitHub Releases](https://github.com/perstarkse/minne/releases/latest).
**Requirements:**
- SurrealDB instance (local or remote)
- Chromium (for web scraping)
## Build from Source
```bash
git clone https://github.com/perstarkse/minne.git
cd minne
cargo build --release --bin main
```
The binary will be at `target/release/main`.
**Requirements:**
- Rust toolchain
- SurrealDB accessible at configured address
- Chromium in PATH
## Process Modes
Minne offers flexible deployment:
| Binary | Description |
|--------|-------------|
| `main` | Combined server + worker (recommended) |
| `server` | Web interface and API only |
| `worker` | Background processing only |
For most users, `main` is the right choice. Split deployments are useful for resource optimization or scaling.
## Next Steps
- [Configuration](./configuration.md) — Environment variables and config.yaml
- [Features](./features.md) — What Minne can do

docs/vision.md Normal file

@@ -0,0 +1,48 @@
# Vision
## The "Why" Behind Minne
Personal knowledge management has always fascinated me. I wanted something that made it incredibly easy to capture content—snippets of text, URLs, media—while automatically discovering connections between ideas. But I also wanted control over my knowledge structure.
Traditional tools like Logseq and Obsidian are excellent, but manual linking often becomes a hindrance. Fully automated systems sometimes miss important context or create relationships I wouldn't have chosen.
Minne offers the best of both worlds: effortless capture with AI-assisted relationship discovery, plus the flexibility to manually curate, edit, or override connections. Let AI handle the heavy lifting, take full control yourself, or use a hybrid approach where AI suggests and you approve.
## Design Principles
- **Capture should be instant** — No friction between thought and storage
- **Connections should emerge** — AI finds relationships you might miss
- **Control should be optional** — Automate by default, curate when it matters
- **Privacy should be default** — Self-hosted, your data stays yours
## Roadmap
### Near-term
- [ ] TUI frontend with system editor integration
- [ ] Enhanced retrieval recall via improved reranking
- [ ] Additional content type support (e-books, research papers)
### Medium-term
- [ ] Embedded SurrealDB option (zero-config `nix run` with just `OPENAI_API_KEY`)
- [ ] Browser extension for seamless capture
- [ ] Mobile-native apps
### Long-term
- [ ] Federated knowledge sharing (opt-in)
- [ ] Local LLM integration (fully offline operation)
- [ ] Plugin system for custom entity extractors
## Related Projects
If Minne isn't quite right for you, check out:
- [Karakeep](https://github.com/karakeep-app/karakeep) (formerly Hoarder) — Excellent bookmark/read-later with AI tagging
- [Logseq](https://logseq.com/) — Outliner-based PKM with manual linking
- [Obsidian](https://obsidian.md/) — Markdown-based PKM with plugin ecosystem
## Contributing
Feature requests and contributions are welcome. Minne was built for personal use first, but the self-hosted community benefits when we share.