docs: evaluations instructions and readme refactoring

2026-06-27 12:26:21 +02:00 · 2025-12-22 18:32:59 +01:00
parent 30b8a65377
commit 9a623cbc3f
7 changed files with 570 additions and 232 deletions
@@ -0,0 +1,212 @@
+# Evaluations
+
+The `evaluations` crate provides a retrieval evaluation framework for benchmarking Minne's information retrieval pipeline against standard datasets.
+
+## Quick Start
+
+```bash
+# Run SQuAD v2.0 evaluation (vector-only, recommended)
+cargo run --package evaluations -- --ingest-chunks-only
+
+# Run a specific dataset
+cargo run --package evaluations -- --dataset fiqa --ingest-chunks-only
+
+# Convert dataset only (no evaluation)
+cargo run --package evaluations -- --convert-only
+```
+
+## Prerequisites
+
+### 1. SurrealDB
+
+Start a SurrealDB instance before running evaluations:
+
+```bash
+docker-compose up -d surrealdb
+```
+
+Or start SurrealDB directly, using the credentials that match the default configuration:
+
+```bash
+surreal start --user root_user --pass root_password
+```
+
+### 2. Download Raw Datasets
+
+Raw datasets must be downloaded manually and placed in `evaluations/data/raw/`. See [Dataset Sources](#dataset-sources) below for links and formats.
+
+## Directory Structure
+
+```
+evaluations/
+├── data/
+│   ├── raw/          # Downloaded raw datasets (manual)
+│   │   ├── squad/    # SQuAD v2.0
+│   │   ├── nq-dev/   # Natural Questions
+│   │   ├── fiqa/     # BEIR: FiQA-2018
+│   │   ├── fever/    # BEIR: FEVER
+│   │   ├── hotpotqa/ # BEIR: HotpotQA
+│   │   └── ...       # Other BEIR subsets
+│   └── converted/    # Auto-generated (Minne JSON format)
+├── cache/            # Ingestion and embedding caches
+├── reports/          # Evaluation output (JSON + Markdown)
+├── manifest.yaml     # Dataset and slice definitions
+└── src/              # Evaluation source code
+```
+
+## Dataset Sources
+
+### SQuAD v2.0
+
+Download the dev set and place it at `evaluations/data/raw/squad/dev-v2.0.json`:
+
+```bash
+mkdir -p evaluations/data/raw/squad
+curl -L https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json \
+  -o evaluations/data/raw/squad/dev-v2.0.json
+```
+
+### Natural Questions (NQ)
+
+Download the dev set and place it at `evaluations/data/raw/nq-dev/dev-all.jsonl`:
+
+```bash
+mkdir -p evaluations/data/raw/nq-dev
+# Download from Google's Natural Questions page or HuggingFace
+# File: dev-all.jsonl (simplified JSONL format)
+```
+
+Source: [Google Natural Questions](https://ai.google.com/research/NaturalQuestions)
+
+### BEIR Datasets
+
+All BEIR datasets share the same directory layout:
+
+```
+data/raw/<dataset>/
+├── corpus.jsonl      # Document corpus
+├── queries.jsonl     # Query set
+└── qrels/
+    └── test.tsv      # Relevance judgments (or dev.tsv)
+```
+
+Download datasets from the [BEIR Benchmark repository](https://github.com/beir-cellar/beir). Each dataset zip extracts to the required directory structure.
+
+| Dataset    | Directory     |
+|------------|---------------|
+| FEVER      | `fever/`      |
+| FiQA-2018  | `fiqa/`       |
+| HotpotQA   | `hotpotqa/`   |
+| NFCorpus   | `nfcorpus/`   |
+| Quora      | `quora/`      |
+| TREC-COVID | `trec-covid/` |
+| SciFact    | `scifact/`    |
+| NQ (BEIR)  | `nq/`         |
+
+Example download:
+
+```bash
+cd evaluations/data/raw
+curl -L https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip -o fiqa.zip
+unzip fiqa.zip && rm fiqa.zip
+```
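
After extracting a dataset, it can help to confirm that it matches the layout described above before starting a run. The following is a minimal sketch; `check_beir_layout` is a hypothetical helper written for this README, not something the `evaluations` crate provides.

```shell
# Hypothetical helper (not part of the evaluations crate): verify that a
# BEIR dataset directory contains the files the layout above requires.
check_beir_layout() {
  dir="$1"
  rc=0
  for f in corpus.jsonl queries.jsonl; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      rc=1
    fi
  done
  # Relevance judgments may be shipped as qrels/test.tsv or qrels/dev.tsv.
  if [ ! -f "$dir/qrels/test.tsv" ] && [ ! -f "$dir/qrels/dev.tsv" ]; then
    echo "missing: $dir/qrels/test.tsv (or dev.tsv)" >&2
    rc=1
  fi
  return "$rc"
}
```

Run from the repository root, for example: `check_beir_layout evaluations/data/raw/fiqa && echo "fiqa layout looks complete"`.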
+
+## Dataset Conversion
+
+Raw datasets are automatically converted to Minne's internal JSON format on first run. To force reconversion:
+
+```bash
+cargo run --package evaluations -- --force-convert
+```
+
+Converted files are written to `evaluations/data/converted/` and reused on subsequent runs.
+
+## CLI Reference
+
+### Common Options
+
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--dataset <NAME>` | Dataset to evaluate | `squad-v2` |
+| `--limit <N>` | Max questions to evaluate (0 = all) | `200` |
+| `--k <N>` | Precision@k cutoff | `5` |
+| `--slice <ID>` | Use a predefined slice from manifest | — |
+| `--rerank` | Enable FastEmbed reranking stage | disabled |
+| `--embedding-backend <BE>` | `fastembed` or `hashed` | `fastembed` |
+| `--ingest-chunks-only` | Skip entity extraction, ingest only text chunks | disabled |
+
+> [!TIP]
+> Use `--ingest-chunks-only` when evaluating vector-only retrieval strategies. This skips the LLM-based entity extraction and graph generation, significantly speeding up ingestion while focusing on pure chunk-based vector search.
+
+### Available Datasets
+
+```
+squad-v2, natural-questions, beir, fever, fiqa, hotpotqa,
+nfcorpus, quora, trec-covid, scifact, nq-beir
+```
+
+### Database Configuration
+
+| Flag | Environment | Default |
+|------|-------------|---------|
+| `--db-endpoint` | `EVAL_DB_ENDPOINT` | `ws://127.0.0.1:8000` |
+| `--db-username` | `EVAL_DB_USERNAME` | `root_user` |
+| `--db-password` | `EVAL_DB_PASSWORD` | `root_password` |
+| `--db-namespace` | `EVAL_DB_NAMESPACE` | auto-generated |
+| `--db-database` | `EVAL_DB_DATABASE` | auto-generated |
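
As a sketch of the environment-variable route, the connection settings can be exported once per shell session instead of being passed as flags on every run. The values below are the defaults from the table. This assumes each variable is read when the corresponding flag is omitted; the table does not state what happens when both are set.

```shell
# Values are the defaults from the table above; replace them with your own.
export EVAL_DB_ENDPOINT="ws://127.0.0.1:8000"
export EVAL_DB_USERNAME="root_user"
export EVAL_DB_PASSWORD="root_password"

# Assumption: with these exported, the --db-* flags can be left out, e.g.
# cargo run --package evaluations -- --dataset fiqa --ingest-chunks-only
```

The namespace and database are left unset here so that they stay auto-generated.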
+
+### Example Runs
+
+```bash
+# Vector-only evaluation (recommended for benchmarking)
+cargo run --package evaluations -- \
+  --dataset fiqa \
+  --ingest-chunks-only \
+  --limit 200
+
+# Larger FiQA run (500 questions) with reranking and k = 10
+cargo run --package evaluations -- \
+  --dataset fiqa \
+  --ingest-chunks-only \
+  --limit 500 \
+  --rerank \
+  --k 10
+
+# Use a predefined slice for reproducibility
+cargo run --package evaluations -- --slice fiqa-test-200 --ingest-chunks-only
+
+# Run the mixed BEIR benchmark
+cargo run --package evaluations -- --dataset beir --slice beir-mix-600 --ingest-chunks-only
+```
+
+## Slices
+
+Slices are predefined, reproducible subsets defined in `manifest.yaml`. Each slice specifies:
+
+- **limit**: Number of questions
+- **corpus_limit**: Maximum corpus size
+- **seed**: Fixed RNG seed for reproducibility
+
+View available slices in [manifest.yaml](./manifest.yaml).
+
+## Reports
+
+Evaluations generate reports in `reports/`:
+
+- **JSON**: Full structured results (`*-report.json`)
+- **Markdown**: Human-readable summary with sample mismatches (`*-report.md`)
+- **History**: Timestamped run history (`history/`)
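
To find the newest report after a run, something like the following may be convenient. It is a sketch only: `latest_report` is a hypothetical helper, and it relies on nothing beyond the `*-report.json` naming pattern listed above.

```shell
# Hypothetical helper: print the most recently modified JSON report in a
# directory (default: evaluations/reports); returns non-zero if none exist.
latest_report() {
  dir="${1:-evaluations/reports}"
  # ls -t sorts newest first; the error from an unmatched glob is silenced.
  newest="$(ls -t "$dir"/*-report.json 2>/dev/null | head -n 1)"
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

From the repository root, `python3 -m json.tool "$(latest_report)"` pretty-prints that file.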
+
+## Performance Tuning
+
+```bash
+# Log per-stage performance timings
+cargo run --package evaluations -- --perf-log-console
+
+# Save telemetry to file
+cargo run --package evaluations -- --perf-log-json ./perf.json
+```
+
+## License
+
+See [../LICENSE](../LICENSE).