# Evaluations

The `evaluations` crate provides a retrieval evaluation framework for benchmarking Minne's information retrieval pipeline against standard datasets.

## Quick Start

```bash
# Run SQuAD v2.0 evaluation (vector-only, recommended)
cargo run --package evaluations -- --ingest-chunks-only

# Run a specific dataset
cargo run --package evaluations -- --dataset fiqa --ingest-chunks-only

# Convert dataset only (no evaluation)
cargo run --package evaluations -- --convert-only
```

## Prerequisites

### 1. SurrealDB

Start a SurrealDB instance before running evaluations:

```bash
docker-compose up -d surrealdb
```

Or start a local instance matching the default endpoint configuration:

```bash
surreal start --user root_user --pass root_password
```

### 2. Download Raw Datasets

Raw datasets must be downloaded manually and placed in `evaluations/data/raw/`. See [Dataset Sources](#dataset-sources) below for links and formats.

## Directory Structure

```
evaluations/
├── data/
│   ├── raw/              # Downloaded raw datasets (manual)
│   │   ├── squad/        # SQuAD v2.0
│   │   ├── nq-dev/       # Natural Questions
│   │   ├── fiqa/         # BEIR: FiQA-2018
│   │   ├── fever/        # BEIR: FEVER
│   │   ├── hotpotqa/     # BEIR: HotpotQA
│   │   └── ...           # Other BEIR subsets
│   └── converted/        # Auto-generated (Minne JSON format)
├── cache/                # Ingestion and embedding caches
├── reports/              # Evaluation output (JSON + Markdown)
├── manifest.yaml         # Dataset and slice definitions
└── src/                  # Evaluation source code
```

## Dataset Sources

### SQuAD v2.0

Download and place at `data/raw/squad/dev-v2.0.json`:

```bash
mkdir -p evaluations/data/raw/squad
curl -L https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json \
  -o evaluations/data/raw/squad/dev-v2.0.json
```

### Natural Questions (NQ)

Download and place at `data/raw/nq-dev/dev-all.jsonl`:

```bash
mkdir -p evaluations/data/raw/nq-dev
# Download from Google's Natural Questions page or HuggingFace
# File: dev-all.jsonl (simplified JSONL format)
```

Source: [Google Natural Questions](https://ai.google.com/research/NaturalQuestions)

### BEIR Datasets

All BEIR datasets follow the same format structure:

```
data/raw/<dataset>/
├── corpus.jsonl          # Document corpus
├── queries.jsonl         # Query set
└── qrels/
    └── test.tsv          # Relevance judgments (or dev.tsv)
```

Download datasets from the [BEIR Benchmark repository](https://github.com/beir-cellar/beir). Each dataset zip extracts to the required directory structure.

| Dataset    | Directory     |
|------------|---------------|
| FEVER      | `fever/`      |
| FiQA-2018  | `fiqa/`       |
| HotpotQA   | `hotpotqa/`   |
| NFCorpus   | `nfcorpus/`   |
| Quora      | `quora/`      |
| TREC-COVID | `trec-covid/` |
| SciFact    | `scifact/`    |
| NQ (BEIR)  | `nq/`         |

Example download:

```bash
cd evaluations/data/raw
curl -L https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip -o fiqa.zip
unzip fiqa.zip && rm fiqa.zip
```
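After extraction, the three files follow the published BEIR conventions. As an illustration (the IDs and text below are invented placeholders, not records from a real dataset):

```
# corpus.jsonl — one JSON object per document
{"_id": "doc0", "title": "Example title", "text": "Example passage text ..."}

# queries.jsonl — one JSON object per query
{"_id": "q0", "text": "example query?"}

# qrels/test.tsv — tab-separated: query ID, document ID, relevance score
query-id	corpus-id	score
q0	doc0	1
```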
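A quick way to spot-check an extracted dataset before converting it (standard Unix tools, with `fiqa` as the example):

```bash
# Print the first record of the corpus and query files, and the qrels header
head -n 1 evaluations/data/raw/fiqa/corpus.jsonl evaluations/data/raw/fiqa/queries.jsonl
head -n 3 evaluations/data/raw/fiqa/qrels/test.tsv
```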
## Dataset Conversion

Raw datasets are automatically converted to Minne's internal JSON format on first run. To force reconversion:

```bash
cargo run --package evaluations -- --force-convert
```

Converted files are saved to `data/converted/` and cached for subsequent runs.

## CLI Reference

### Common Options

| Flag | Description | Default |
|------|-------------|---------|
| `--dataset <name>` | Dataset to evaluate | `squad-v2` |
| `--limit <n>` | Max questions to evaluate (0 = all) | `200` |
| `--k <n>` | Precision@k cutoff | `5` |
| `--slice <name>` | Use a predefined slice from the manifest | — |
| `--rerank` | Enable FastEmbed reranking stage | disabled |
| `--embedding-backend <backend>` | `fastembed` or `hashed` | `fastembed` |
| `--ingest-chunks-only` | Skip entity extraction; ingest only text chunks | disabled |

> [!TIP]
> Use `--ingest-chunks-only` when evaluating vector-only retrieval strategies. It skips the LLM-based entity extraction and graph generation, significantly speeding up ingestion while focusing on pure chunk-based vector search.

### Available Datasets

```
squad-v2, natural-questions, beir, fever, fiqa, hotpotqa,
nfcorpus, quora, trec-covid, scifact, nq-beir
```

### Database Configuration

| Flag | Environment | Default |
|------|-------------|---------|
| `--db-endpoint` | `EVAL_DB_ENDPOINT` | `ws://127.0.0.1:8000` |
| `--db-username` | `EVAL_DB_USERNAME` | `root_user` |
| `--db-password` | `EVAL_DB_PASSWORD` | `root_password` |
| `--db-namespace` | `EVAL_DB_NAMESPACE` | auto-generated |
| `--db-database` | `EVAL_DB_DATABASE` | auto-generated |

### Example Runs

```bash
# Vector-only evaluation (recommended for benchmarking)
cargo run --package evaluations -- \
  --dataset fiqa \
  --ingest-chunks-only \
  --limit 200

# Full FiQA evaluation with reranking
cargo run --package evaluations -- \
  --dataset fiqa \
  --ingest-chunks-only \
  --limit 500 \
  --rerank \
  --k 10

# Use a predefined slice for reproducibility
cargo run --package evaluations -- --slice fiqa-test-200 --ingest-chunks-only

# Run the mixed BEIR benchmark
cargo run --package evaluations -- --dataset beir --slice beir-mix-600 --ingest-chunks-only
```

## Slices

Slices are predefined, reproducible subsets defined in `manifest.yaml`. Each slice specifies:

- **limit**: Number of questions
- **corpus_limit**: Maximum corpus size
- **seed**: Fixed RNG seed for reproducibility

View available slices in [manifest.yaml](./manifest.yaml).

## Reports

Evaluations generate reports in `reports/`:

- **JSON**: Full structured results (`*-report.json`)
- **Markdown**: Human-readable summary with sample mismatches (`*-report.md`)
- **History**: Timestamped run history (`history/`)

## Performance Tuning

```bash
# Log per-stage performance timings
cargo run --package evaluations -- --perf-log-console

# Save telemetry to file
cargo run --package evaluations -- --perf-log-json ./perf.json
```

## License

See [../LICENSE](../LICENSE).