docs: evaluations instructions and readme refactoring

Per Stark
2025-12-22 18:32:59 +01:00
parent 30b8a65377
commit 9a623cbc3f
7 changed files with 570 additions and 232 deletions

docs/architecture.md Normal file

@@ -0,0 +1,74 @@
# Architecture
## Tech Stack
| Layer | Technology |
|-------|------------|
| Backend | Rust with Axum (SSR) |
| Frontend | HTML + HTMX + minimal JS |
| Database | SurrealDB (graph, document, vector) |
| AI | OpenAI-compatible API |
| Web Processing | Headless Chromium |
## Crate Structure
```
minne/
├── main/ # Combined server + worker binary
├── api-router/ # REST API routes
├── html-router/ # SSR web interface
├── ingestion-pipeline/ # Content processing pipeline
├── retrieval-pipeline/ # Search and retrieval logic
├── common/ # Shared types, storage, utilities
├── evaluations/ # Benchmarking framework
└── json-stream-parser/ # Streaming JSON utilities
```
## Process Modes
| Binary | Purpose |
|--------|---------|
| `main` | All-in-one: serves UI and processes content |
| `server` | UI and API only (no background processing) |
| `worker` | Background processing only (no UI) |
Split deployment is useful for scaling or resource isolation.
## Data Flow
```
Content In → Ingestion Pipeline → SurrealDB
                   ├─ Entity Extraction
                   ├─ Embedding Generation
                   └─ Graph Relationships

Query → Retrieval Pipeline → Results
              ├─ Vector Search + FTS + Graph
              └─ RRF Fusion → (Optional Rerank) → Response
```
## Database Schema
SurrealDB stores:
- **TextContent** — Raw ingested content
- **TextChunk** — Chunked content with embeddings
- **KnowledgeEntity** — Extracted entities (people, concepts, etc.)
- **KnowledgeRelationship** — Connections between entities
- **User** — Authentication and preferences
- **SystemSettings** — Model configuration
Embeddings are stored in dedicated tables with HNSW indexes for fast vector search.
## Retrieval Strategy
1. **Collect candidates** — Vector similarity + full-text search
2. **Merge ranks** — Reciprocal Rank Fusion (RRF)
3. **Attach context** — Link chunks to parent entities
4. **Rerank** (optional) — Cross-encoder rescoring
5. **Return** — Top-k results with metadata
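The fusion step (2) can be sketched as follows. This is a hedged illustration of standard RRF, not Minne's actual implementation; the constant `k = 60` is the conventional default from the RRF literature, not a value taken from the codebase:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)),
/// with 1-based ranks. Documents ranked highly by several retrievers rise to the top.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let vector_hits = vec!["chunk-a", "chunk-b", "chunk-c"];
    let fts_hits = vec!["chunk-b", "chunk-a", "chunk-d"];
    let fused = rrf_fuse(&[vector_hits, fts_hits], 60.0);
    // chunk-a and chunk-b appear in both lists, so they tie for the top spots.
    for (doc, score) in &fused {
        println!("{}: {:.4}", doc, score);
    }
}
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the vector and full-text retrievers.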

docs/configuration.md Normal file

@@ -0,0 +1,77 @@
# Configuration
Minne can be configured via environment variables or a `config.yaml` file. Environment variables take precedence.
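The precedence rule can be sketched like this; the function and parameter names are illustrative, not Minne's actual configuration API:

```rust
/// Sketch of the documented precedence: an environment variable beats the
/// config.yaml value, which beats the built-in default.
fn resolve_http_port(env_val: Option<&str>, file_val: Option<u16>) -> u16 {
    env_val
        .and_then(|v| v.parse().ok()) // malformed env values fall through
        .or(file_val)
        .unwrap_or(3000) // documented default
}

fn main() {
    assert_eq!(resolve_http_port(Some("9000"), Some(8080)), 9000); // env wins
    assert_eq!(resolve_http_port(None, Some(8080)), 8080);         // file next
    assert_eq!(resolve_http_port(None, None), 3000);               // default
    println!("precedence ok");
}
```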
## Required Settings
| Variable | Description | Example |
|----------|-------------|---------|
| `OPENAI_API_KEY` | API key for OpenAI-compatible endpoint | `sk-...` |
| `SURREALDB_ADDRESS` | WebSocket address of SurrealDB | `ws://127.0.0.1:8000` |
| `SURREALDB_USERNAME` | SurrealDB username | `root_user` |
| `SURREALDB_PASSWORD` | SurrealDB password | `root_password` |
| `SURREALDB_DATABASE` | Database name | `minne_db` |
| `SURREALDB_NAMESPACE` | Namespace | `minne_ns` |
## Optional Settings
| Variable | Description | Default |
|----------|-------------|---------|
| `HTTP_PORT` | Server port | `3000` |
| `DATA_DIR` | Local data directory | `./data` |
| `OPENAI_BASE_URL` | Custom AI provider URL | OpenAI default |
| `RUST_LOG` | Logging level | `info` |
### Reranking (Optional)
| Variable | Description | Default |
|----------|-------------|---------|
| `RERANKING_ENABLED` | Enable FastEmbed reranking | `false` |
| `RERANKING_POOL_SIZE` | Concurrent reranker workers | `2` |
| `FASTEMBED_CACHE_DIR` | Model cache directory | `<data_dir>/fastembed/reranker` |
> [!NOTE]
> Enabling reranking downloads ~1.1 GB of model data on first startup.
## Example config.yaml
```yaml
surrealdb_address: "ws://127.0.0.1:8000"
surrealdb_username: "root_user"
surrealdb_password: "root_password"
surrealdb_database: "minne_db"
surrealdb_namespace: "minne_ns"
openai_api_key: "sk-your-key-here"
data_dir: "./minne_data"
http_port: 3000
# Optional reranking
reranking_enabled: true
reranking_pool_size: 2
```
## AI Provider Setup
Minne works with any OpenAI-compatible API that supports structured outputs.
### OpenAI (Default)
Set `OPENAI_API_KEY` only. The default base URL points to OpenAI.
### Ollama
```bash
export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1"
```
### Other Providers
Any provider exposing an OpenAI-compatible endpoint works. Set `OPENAI_BASE_URL` accordingly.
## Model Selection
1. Access `/admin` in your Minne instance
2. Select models for content processing and chat
3. **Content Processing**: Must support structured outputs
4. **Embedding Dimensions**: Update when changing embedding models (e.g., 1536 for `text-embedding-3-small`)
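The dimension setting matters because vectors stored in the HNSW index must match the configured length. A minimal, hypothetical illustration of such a check (not Minne's actual code):

```rust
// Hypothetical validation; Minne's real ingestion code may differ.
fn check_embedding_dims(embedding: &[f32], configured_dim: usize) -> Result<(), String> {
    if embedding.len() == configured_dim {
        Ok(())
    } else {
        Err(format!(
            "embedding has {} dimensions, but the index is configured for {}",
            embedding.len(),
            configured_dim
        ))
    }
}

fn main() {
    // text-embedding-3-small produces 1536-dimensional vectors.
    let embedding = vec![0.0_f32; 1536];
    assert!(check_embedding_dims(&embedding, 1536).is_ok());
    assert!(check_embedding_dims(&embedding, 768).is_err());
    println!("dimension checks passed");
}
```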

docs/features.md Normal file

@@ -0,0 +1,64 @@
# Features
## Search vs Chat
**Search** — Use when you know what you're looking for. Full-text search matches query terms across your content.
**Chat** — Use when exploring concepts or reasoning about your knowledge. The AI analyzes your query and retrieves relevant context from your entire knowledge base.
## Content Processing
Minne automatically processes saved content:
1. **Web scraping** extracts readable text from URLs (via headless Chrome)
2. **Text analysis** identifies key concepts and relationships
3. **Graph creation** builds connections between related content
4. **Embedding generation** enables semantic search
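The four stages above can be sketched as a simple pipeline. Every type and function body here is an illustrative stub, not Minne's actual ingestion code; the real work is done by headless Chrome and the configured AI models:

```rust
// Illustrative stubs only, to show how the stages compose.
struct Entity {
    name: String,
}

fn scrape(url: &str) -> String {
    format!("text from {}", url) // 1. web scraping (stubbed)
}

fn extract_entities(text: &str) -> Vec<Entity> {
    // 2. text analysis (stubbed: pretend the first two words are concepts)
    text.split_whitespace()
        .take(2)
        .map(|w| Entity { name: w.to_string() })
        .collect()
}

fn link_entities(entities: &[Entity]) -> Vec<(usize, usize)> {
    // 3. graph creation (stubbed: link every entity to the first one)
    (1..entities.len()).map(|i| (0, i)).collect()
}

fn embed(_text: &str) -> Vec<f32> {
    vec![0.0; 4] // 4. embedding generation (stubbed: a tiny fixed vector)
}

fn main() {
    let text = scrape("https://example.com");
    let entities = extract_entities(&text);
    let edges = link_entities(&entities);
    let embedding = embed(&text);
    println!(
        "{} entities ({} first), {} edge(s), {}-dim embedding",
        entities.len(),
        entities[0].name,
        edges.len(),
        embedding.len()
    );
}
```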
## Knowledge Graph
Explore your knowledge as an interactive network:
- **Manual curation** — Create entities and relationships yourself
- **AI automation** — Let AI extract entities and discover relationships
- **Hybrid approach** — AI suggests connections for your approval
The D3-based graph visualization shows entities as nodes and relationships as edges.
## Hybrid Retrieval
Minne combines multiple retrieval strategies:
- **Vector similarity** — Semantic matching via embeddings
- **Full-text search** — Keyword matching with BM25
- **Graph traversal** — Following relationships between entities
Results from all strategies are merged using Reciprocal Rank Fusion (RRF) to balance their relevance signals.
## Reranking (Optional)
When enabled, retrieval results are rescored with a cross-encoder model for improved relevance. Powered by [fastembed-rs](https://github.com/Anush008/fastembed-rs).
**Trade-offs:**
- Downloads ~1.1 GB of model data
- Adds latency per query
- Potentially improves answer quality; see the [blog post](https://blog.stark.pub/posts/eval-retrieval-refactor/)
Enable via `RERANKING_ENABLED=true`. See [Configuration](./configuration.md).
## Multi-Format Ingestion
Supported content types:
- Plain text and notes
- URLs (web pages)
- PDF documents
- Audio files
- Images
## Scratchpad
Quickly capture content without committing to permanent storage. Convert to full content when ready.
## iOS Shortcut
Use the [Minne iOS Shortcut](https://www.icloud.com/shortcuts/e433fbd7602f4e2eaa70dca162323477) for quick content capture from your phone.

docs/installation.md Normal file

@@ -0,0 +1,67 @@
# Installation
Minne can be installed through several methods. Choose the one that best fits your setup.
## Docker Compose (Recommended)
The fastest way to get Minne running with all dependencies:
```bash
git clone https://github.com/perstarkse/minne.git
cd minne
docker compose up -d
```
The included `docker-compose.yml` handles SurrealDB and Chromium automatically.
**Required:** Set your `OPENAI_API_KEY` in `docker-compose.yml` before starting.
## Nix
Run Minne directly with Nix (includes Chromium):
```bash
nix run 'github:perstarkse/minne#main'
```
Configure via environment variables or a `config.yaml` file. See [Configuration](./configuration.md).
## Pre-built Binaries
Download binaries for Windows, macOS, and Linux from [GitHub Releases](https://github.com/perstarkse/minne/releases/latest).
**Requirements:**
- SurrealDB instance (local or remote)
- Chromium (for web scraping)
## Build from Source
```bash
git clone https://github.com/perstarkse/minne.git
cd minne
cargo build --release --bin main
```
The binary will be at `target/release/main`.
**Requirements:**
- Rust toolchain
- SurrealDB accessible at configured address
- Chromium in PATH
## Process Modes
Minne offers flexible deployment:
| Binary | Description |
|--------|-------------|
| `main` | Combined server + worker (recommended) |
| `server` | Web interface and API only |
| `worker` | Background processing only |
For most users, `main` is the right choice. Split deployments are useful for resource optimization or scaling.
## Next Steps
- [Configuration](./configuration.md) — Environment variables and config.yaml
- [Features](./features.md) — What Minne can do

docs/vision.md Normal file

@@ -0,0 +1,48 @@
# Vision
## The "Why" Behind Minne
Personal knowledge management has always fascinated me. I wanted something that made it incredibly easy to capture content—snippets of text, URLs, media—while automatically discovering connections between ideas. But I also wanted control over my knowledge structure.
Traditional tools like Logseq and Obsidian are excellent, but manual linking often becomes a hindrance. Fully automated systems sometimes miss important context or create relationships I wouldn't have chosen.
Minne offers the best of both worlds: effortless capture with AI-assisted relationship discovery, plus the flexibility to manually curate, edit, or override connections. Let AI handle the heavy lifting, take full control yourself, or use a hybrid approach where AI suggests and you approve.
## Design Principles
- **Capture should be instant** — No friction between thought and storage
- **Connections should emerge** — AI finds relationships you might miss
- **Control should be optional** — Automate by default, curate when it matters
- **Privacy should be default** — Self-hosted, your data stays yours
## Roadmap
### Near-term
- [ ] TUI frontend with system editor integration
- [ ] Enhanced retrieval recall via improved reranking
- [ ] Additional content type support (e-books, research papers)
### Medium-term
- [ ] Embedded SurrealDB option (zero-config `nix run` with just `OPENAI_API_KEY`)
- [ ] Browser extension for seamless capture
- [ ] Mobile-native apps
### Long-term
- [ ] Federated knowledge sharing (opt-in)
- [ ] Local LLM integration (fully offline operation)
- [ ] Plugin system for custom entity extractors
## Related Projects
If Minne isn't quite right for you, check out:
- [Karakeep](https://github.com/karakeep-app/karakeep) (formerly Hoarder) — Excellent bookmark/read-later with AI tagging
- [Logseq](https://logseq.com/) — Outliner-based PKM with manual linking
- [Obsidian](https://obsidian.md/) — Markdown-based PKM with plugin ecosystem
## Contributing
Feature requests and contributions are welcome. Minne was built for personal use first, but the self-hosted community benefits when we share.