Evaluations

The evaluations crate benchmarks Minne's retrieval pipeline against standard datasets.

Quick Start

# One-time prep (convert, slice ledger, corpus cache, DB seed)
cargo eval --warm --dataset beir --slice beir-mix-600

# Check readiness
cargo eval --status --dataset beir --slice beir-mix-600

# Run benchmark (steady state after warm)
cargo eval --dataset beir --slice beir-mix-600 --require-ready

The default dataset is beir. When --slice is omitted, the first catalog slice for the dataset is used (for beir, that is beir-mix-600).
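The fallback rule for --slice can be sketched as below. This is a minimal illustration, not the crate's code: `resolve_slice_id` and its parameters are hypothetical names, and the catalog is assumed to be an ordered list of slice ids read from manifest.yaml.

```rust
/// Pick the slice id to use: the explicit `--slice` value if one was given,
/// otherwise the first slice listed in the dataset's catalog.
/// Hypothetical helper; the name and signature are assumptions.
fn resolve_slice_id(requested: Option<&str>, catalog_slices: &[&str]) -> Option<String> {
    requested
        // Fall back to the first catalog entry only when no id was passed.
        .or_else(|| catalog_slices.first().copied())
        .map(str::to_owned)
}
```

Returning `None` when neither an explicit id nor a catalog entry exists leaves the error handling to the caller.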

Chunk-only ingestion is the default. Pass --include-entities to opt into entity extraction during ingestion (requires OPENAI_API_KEY).

Custom slice sizes

--slice takes a ledger id, which does not have to be a catalog name. You can use any id; --limit controls how many questions the ledger contains:

# 200-case BEIR mix (default --limit is 200)
cargo eval --warm --dataset beir --slice beir-mix-200
cargo eval --dataset beir --slice beir-mix-200 --require-ready

The catalog slice beir-mix-600 in manifest.yaml is a preset with limit: 600 and negative_multiplier: 9.0.

BEIR mix layout

beir is a virtual mix across eight subset datasets (FEVER, FiQA, HotpotQA, NFCorpus, Quora, TREC-COVID, SciFact, NQ-BEIR). There is no monolithic beir-minne/ store.

1. Build an in-memory qrels-world mix from raw subset data
2. Resolve the slice ledger (cache/slices/beir/<slice-id>.json)
3. Materialize only ledger paragraph ids into per-subset stores (fever-minne/, fiqa-minne/, …)
4. Ingest the slice corpus and seed SurrealDB
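The on-disk locations named in these steps can be sketched as path helpers. This is an illustration only: `slice_ledger_path` and `subset_store_dir` are hypothetical names, the paths are taken relative to evaluations/, and generalizing the beir ledger path to other datasets is an assumption.

```rust
use std::path::PathBuf;

/// Ledger location for a slice, e.g. cache/slices/beir/beir-mix-600.json.
/// Hypothetical helper; using the same layout for other datasets is an assumption.
fn slice_ledger_path(dataset: &str, slice_id: &str) -> PathBuf {
    PathBuf::from("cache/slices")
        .join(dataset)
        .join(format!("{slice_id}.json"))
}

/// Per-subset converted store, e.g. data/converted/fever-minne.
/// Hypothetical helper following the `<subset>-minne` naming in this README.
fn subset_store_dir(subset: &str) -> PathBuf {
    PathBuf::from("data/converted").join(format!("{subset}-minne"))
}
```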

Conversion is qrels-closed: only documents that appear in qrels are exported, not the full BEIR corpus.
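A qrels-closed export amounts to filtering the corpus by the set of judged document ids. The sketch below shows the idea under stated assumptions: `Qrel` and `qrels_closed` are hypothetical names, and a qrel is modeled as a bare (query id, document id) pair with no relevance score.

```rust
use std::collections::HashSet;

/// One relevance judgement as (query id, document id). Hypothetical shape.
type Qrel = (String, String);

/// Keep only corpus documents whose id is referenced by at least one qrel,
/// preserving corpus order.
fn qrels_closed<'a>(corpus_ids: &'a [String], qrels: &[Qrel]) -> Vec<&'a str> {
    // Collect every judged document id once so each lookup is O(1).
    let judged: HashSet<&str> = qrels.iter().map(|(_, doc_id)| doc_id.as_str()).collect();
    corpus_ids
        .iter()
        .map(String::as_str)
        .filter(|id| judged.contains(*id))
        .collect()
}
```

Documents that no query references are dropped, so the converted store scales with the number of judgements and not with the size of the full corpus.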

Chunk-only mode may evaluate fewer cases than the slice ledger size when some questions are impossible or lack verifiable answer chunks.

Reports include a Retrieved Context Volume section: the total characters and estimated tokens (approximately chars/4) across all chunks returned per query. The estimate is computed the same way for every run, so it is comparable across --chunk-result-cap sweeps; use it to compare the cost of raising --chunk-result-cap.
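The chars/4 estimate can be sketched as follows. This is a minimal illustration, not the crate's implementation: `ContextVolume` and `context_volume` are hypothetical names, and counting Unicode scalar values and rounding up are assumptions of the sketch.

```rust
/// Totals for the chunks returned by one query. Hypothetical type.
#[derive(Debug, PartialEq)]
struct ContextVolume {
    chars: usize,
    estimated_tokens: usize,
}

/// Sum characters across all returned chunks and estimate tokens as chars / 4.
/// Rounding up is an assumption; the report may round differently.
fn context_volume(chunks: &[&str]) -> ContextVolume {
    let chars: usize = chunks.iter().map(|c| c.chars().count()).sum();
    ContextVolume {
        chars,
        estimated_tokens: (chars + 3) / 4,
    }
}
```

Because the estimate depends only on character counts, two runs that differ only in --chunk-result-cap can be compared directly on this number.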

Prerequisites

SurrealDB

docker-compose up -d surrealdb

Raw datasets

Place raw datasets under evaluations/data/raw/. See manifest.yaml for paths.

BEIR subsets live in sibling directories (data/raw/fever, data/raw/fiqa, …). The data/raw/beir entry is a virtual catalog placeholder; warm uses the subset paths.

Directory structure

evaluations/
├── data/
│   ├── raw/           # Downloaded datasets (manual)
│   │   ├── fever/     # BEIR subset raw dirs (corpus.jsonl, queries.jsonl, qrels/)
│   │   ├── fiqa/
│   │   └── …
│   └── converted/     # Sharded stores (auto-generated)
│       ├── fever-minne/  # per-BEIR-subset stores
│       ├── fiqa-minne/
│       └── …             # BEIR mix loads from subset stores (no monolithic beir-minne/)
├── cache/
│   ├── slices/        # Slice ledgers
│   └── ingested/      # Corpus ingestion caches (manifest includes namespace seed)
├── reports/           # JSON + Markdown output from benchmark runs
├── manifest.yaml
└── src/

After upgrading: delete old monolithic *-minne.json files, any legacy beir-minne/ merged store, cache/snapshots/ directories, and stale reports/history/ artifacts, then re-run --warm.

Common flags

| Flag | Description | Default |
| --- | --- | --- |
| `--dataset` | Dataset to evaluate | `beir` |
| `--slice` | Slice ledger id (catalog or custom) | first catalog slice |
| `--limit` | Max questions in the slice ledger | `200` |
| `--warm` | Prepare without running queries | — |
| `--status` | Print readiness | — |
| `--require-ready` | Fail if not warmed | — |
| `--include-entities` | Entity extraction during ingestion | off |
| `--force-convert` | Rebuild converted store | — |
| `--chunk-result-cap` | Max chunks returned per query (raise with `--k`) | `5` |
| `--perf-log-console` | Print per-stage timings after a run | off |
| `--label` | Label stored in JSON/Markdown reports | — |

See REFACTOR.md for architecture notes.
