Evaluations
The evaluations crate provides a framework for benchmarking Minne's information retrieval pipeline against standard retrieval datasets.
Quick Start
# Run SQuAD v2.0 evaluation (vector-only, recommended)
cargo run --package evaluations -- --ingest-chunks-only
# Run a specific dataset
cargo run --package evaluations -- --dataset fiqa --ingest-chunks-only
# Convert dataset only (no evaluation)
cargo run --package evaluations -- --convert-only
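For a full list of flags beyond the ones documented below, the binary's own help output can be consulted; this assumes the usual clap-style --help flag is wired up:
# Show all CLI options
cargo run --package evaluations -- --help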
Prerequisites
1. SurrealDB
Start a SurrealDB instance before running evaluations:
docker-compose up -d surrealdb
Or start a local instance directly, matching the default endpoint configuration:
surreal start --user root_user --pass root_password
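Alternatively, a throwaway instance can be run from the official surrealdb/surrealdb Docker image. This is a minimal sketch: the latest tag and the in-memory storage argument are assumptions, and the credentials simply mirror the defaults used elsewhere in this document.
# Ephemeral in-memory SurrealDB on the default endpoint (port 8000)
docker run --rm -p 8000:8000 surrealdb/surrealdb:latest \
  start --user root_user --pass root_password memory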
2. Download Raw Datasets
Raw datasets must be downloaded manually and placed in evaluations/data/raw/. See Dataset Sources below for links and formats.
Directory Structure
evaluations/
├── data/
│ ├── raw/ # Downloaded raw datasets (manual)
│ │ ├── squad/ # SQuAD v2.0
│ │ ├── nq-dev/ # Natural Questions
│ │ ├── fiqa/ # BEIR: FiQA-2018
│ │ ├── fever/ # BEIR: FEVER
│ │ ├── hotpotqa/ # BEIR: HotpotQA
│ │ └── ... # Other BEIR subsets
│ └── converted/ # Auto-generated (Minne JSON format)
├── cache/ # Ingestion and embedding caches
├── reports/ # Evaluation output (JSON + Markdown)
├── manifest.yaml # Dataset and slice definitions
└── src/ # Evaluation source code
Dataset Sources
SQuAD v2.0
Download and place at data/raw/squad/dev-v2.0.json:
mkdir -p evaluations/data/raw/squad
curl -L https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json \
-o evaluations/data/raw/squad/dev-v2.0.json
Natural Questions (NQ)
Download and place at data/raw/nq-dev/dev-all.jsonl:
mkdir -p evaluations/data/raw/nq-dev
# Download from Google's Natural Questions page or HuggingFace
# File: dev-all.jsonl (simplified JSONL format)
Source: Google Natural Questions
BEIR Datasets
All BEIR datasets share the same directory layout:
data/raw/<dataset>/
├── corpus.jsonl # Document corpus
├── queries.jsonl # Query set
└── qrels/
└── test.tsv # Relevance judgments (or dev.tsv)
Download datasets from the BEIR Benchmark repository. Each dataset zip extracts to the required directory structure.
| Dataset | Directory |
|---|---|
| FEVER | fever/ |
| FiQA-2018 | fiqa/ |
| HotpotQA | hotpotqa/ |
| NFCorpus | nfcorpus/ |
| Quora | quora/ |
| TREC-COVID | trec-covid/ |
| SciFact | scifact/ |
| NQ (BEIR) | nq/ |
Example download:
cd evaluations/data/raw
curl -L https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip -o fiqa.zip
unzip fiqa.zip && rm fiqa.zip
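To fetch several BEIR subsets in one go, the same pattern can be looped. This sketch assumes each archive is named <dataset>.zip, matching the directory names in the table above:
cd evaluations/data/raw
for ds in fever hotpotqa nfcorpus quora trec-covid scifact nq; do
  # Download and unpack each BEIR subset into its own directory
  curl -L "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/${ds}.zip" -o "${ds}.zip"
  unzip "${ds}.zip" && rm "${ds}.zip"
done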
Dataset Conversion
Raw datasets are automatically converted to Minne's internal JSON format on first run. To force reconversion:
cargo run --package evaluations -- --force-convert
Converted files are saved to data/converted/ and cached for subsequent runs.
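Conversion can also be triggered for a single dataset without running an evaluation, assuming --dataset and --convert-only compose as expected:
# Convert HotpotQA to the internal format and exit
cargo run --package evaluations -- --dataset hotpotqa --convert-only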
CLI Reference
Common Options
| Flag | Description | Default |
|---|---|---|
| --dataset <NAME> | Dataset to evaluate | squad-v2 |
| --limit <N> | Max questions to evaluate (0 = all) | 200 |
| --k <N> | Precision@k cutoff | 5 |
| --slice <ID> | Use a predefined slice from manifest | — |
| --rerank | Enable FastEmbed reranking stage | disabled |
| --embedding-backend <BE> | fastembed or hashed | fastembed |
| --ingest-chunks-only | Skip entity extraction, ingest only text chunks | disabled |
Tip
Use --ingest-chunks-only when evaluating vector-only retrieval strategies. This skips the LLM-based entity extraction and graph generation, significantly speeding up ingestion while focusing on pure chunk-based vector search.
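For quick smoke tests where embedding quality does not matter, the flags above can be combined with the hashed backend. A sketch under the assumption that the hashed backend needs no model download:
# Fast, model-free smoke test on a small sample
cargo run --package evaluations -- \
  --dataset squad-v2 \
  --embedding-backend hashed \
  --limit 50 \
  --ingest-chunks-only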
Available Datasets
squad-v2, natural-questions, beir, fever, fiqa, hotpotqa,
nfcorpus, quora, trec-covid, scifact, nq-beir
Database Configuration
| Flag | Environment | Default |
|---|---|---|
| --db-endpoint | EVAL_DB_ENDPOINT | ws://127.0.0.1:8000 |
| --db-username | EVAL_DB_USERNAME | root_user |
| --db-password | EVAL_DB_PASSWORD | root_password |
| --db-namespace | EVAL_DB_NAMESPACE | auto-generated |
| --db-database | EVAL_DB_DATABASE | auto-generated |
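The same settings can be supplied through the environment instead of flags, for example when pointing at a non-default SurrealDB instance (the endpoint value here is illustrative):
EVAL_DB_ENDPOINT=ws://127.0.0.1:8001 \
EVAL_DB_USERNAME=root_user \
EVAL_DB_PASSWORD=root_password \
cargo run --package evaluations -- --dataset fiqa --ingest-chunks-only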
Example Runs
# Vector-only evaluation (recommended for benchmarking)
cargo run --package evaluations -- \
--dataset fiqa \
--ingest-chunks-only \
--limit 200
# Full FiQA evaluation with reranking
cargo run --package evaluations -- \
--dataset fiqa \
--ingest-chunks-only \
--limit 500 \
--rerank \
--k 10
# Use a predefined slice for reproducibility
cargo run --package evaluations -- --slice fiqa-test-200 --ingest-chunks-only
# Run the mixed BEIR benchmark
cargo run --package evaluations -- --dataset beir --slice beir-mix-600 --ingest-chunks-only
Slices
Slices are predefined, reproducible subsets defined in manifest.yaml. Each slice specifies:
- limit: Number of questions
- corpus_limit: Maximum corpus size
- seed: Fixed RNG seed for reproducibility
View available slices in manifest.yaml.
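As an illustration only, a slice entry might look roughly like the following; apart from limit, corpus_limit, and seed, the key names and YAML layout are assumptions, so consult the real manifest.yaml for the authoritative schema:
slices:
  - id: fiqa-test-200        # hypothetical id, referenced via --slice fiqa-test-200
    dataset: fiqa
    limit: 200               # number of questions
    corpus_limit: 10000      # maximum corpus size
    seed: 42                 # fixed RNG seed for reproducibility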
Reports
Evaluations generate reports in reports/:
- JSON: Full structured results (*-report.json)
- Markdown: Human-readable summary with sample mismatches (*-report.md)
- History: Timestamped run history (history/)
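A quick way to open the most recent summary after a run, relying only on the *-report.md naming above (paths assumed relative to the repository root):
# Print the newest Markdown report
ls -t evaluations/reports/*-report.md | head -n 1 | xargs cat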
Performance Tuning
# Log per-stage performance timings
cargo run --package evaluations -- --perf-log-console
# Save telemetry to file
cargo run --package evaluations -- --perf-log-json ./perf.json
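Both telemetry flags can be combined with a normal evaluation run, assuming they compose with the other options:
# Timed FiQA run with console and JSON telemetry
cargo run --package evaluations -- \
  --dataset fiqa \
  --ingest-chunks-only \
  --perf-log-console \
  --perf-log-json ./perf.json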
License
See ../LICENSE.