evals: eval crate overhaul, simplification and performance improvements

2026-06-24 10:56:29 +02:00 · 2026-06-17 19:23:11 +02:00
parent adc04d8c6d
commit fb51a8b55f
53 changed files with 2852 additions and 1831 deletions
@@ -0,0 +1,98 @@
+# Evaluations crate refactor plan
+
+This document records the architecture review and the simplification work applied to the
+`evaluations` crate. **No backwards compatibility** is maintained for converted JSON layouts,
+legacy report history, or old cache artifact formats.
+
+## Goals
+
+- Smaller, linear pipeline (no state machine ceremony)
+- Sharded converted store for **all** datasets (memory-efficient partial loading)
+- Slice-first loading when a catalog slice is selected
+- In-memory SurrealDB for ingestion (no ephemeral server namespaces)
+- Single DB lifecycle module (`db/`)
+- CLI helpers under `cli/`
+
+## Primary workflow
+
+```bash
+# One-time prep (converts raw data if needed, builds slice ledger, corpus cache, DB seed)
+cargo eval --warm --dataset beir --slice beir-mix-600
+
+# Check readiness
+cargo eval --status --dataset beir --slice beir-mix-600
+
+# Steady-state benchmark
+cargo eval --dataset beir --slice beir-mix-600 --require-ready
+```
+
+Default dataset is `beir`. Chunk-only ingestion is the default; pass `--include-entities` to
+opt into entity extraction (requires `OPENAI_API_KEY`). Slice tuning such as
+`negative_multiplier` lives in `manifest.yaml` (e.g. `beir-mix-600` uses `9.0`).
+
+## Cache layers (after refactor)
+
+| Layer | Location | Purpose |
+|-------|----------|---------|
+| Converted store | `data/converted/<name>/` | Sharded paragraphs + question catalog |
+| Slice ledger | `cache/slices/<dataset>/<slice-id>.json` | Deterministic questions + paragraph set |
+| Corpus cache | `cache/ingested/<dataset>/<slice-id>/` | Ingestion paragraph shards, manifest, and namespace reuse seed |
+
+Namespace reuse state lives in the corpus manifest (`metadata.namespace_seed`), not in a separate
+`snapshots/` tree. After upgrading, delete any old monolithic `*-minne.json` files and
+`cache/snapshots/` directories, then re-run with `--warm`.
+
+## Phases applied
+
+### Phase 0 — dead code
+
+- Removed unused `criterion` dependency
+- Removed unused `EmbeddingCache`
+- Updated README for current CLI
+
+### Phase 1 — structure
+
+- Flattened pipeline to linear `async fn` stages
+- Removed `eval.rs` hub; imports go to owning modules
+- Merged `namespace.rs`, `db_helpers.rs` → `db/`; dropped standalone `snapshot.rs`
+- Moved `status.rs` → `cli/status.rs`
+- Fixed catalog slice bootstrap (build ledger when explicit slice manifest is missing)
+
+### Phase 2 — no legacy paths
+
+- All datasets use sharded converted store only
+- Removed legacy JSON layout and migration
+- Removed legacy report history format
+- Auto-apply first catalog slice when `--slice` omitted
+- Namespace seed folded into corpus manifest (removed `cache/snapshots/`)
+
+### Phase 3 — performance
+
+- Ingestion always uses in-memory SurrealDB
+- Slice-first partial load when ledger is complete
+- Default catalog slice for dataset when `--slice` not passed
+- Split `slice/` into `mod.rs`, `build.rs`, and `beir.rs`
+
+### Phase 4 — BEIR mix slice-first
+
+- `beir` is a virtual mix: slice ledger references prefixed ids (`fever-…`, `fiqa-…`, …)
+- Conversion is **qrels-closed** per subset (only documents that appear in qrels are converted, not the full corpus)
+- Slice ledger is resolved for the requested `--slice` (catalog preset or custom id + `--limit`)
+- Only ledger paragraph ids are materialized into per-subset stores (`fever-minne/`, `fiqa-minne/`, …)
+- No monolithic `beir-minne/` merged store
+- Raw BEIR data lives in per-subset dirs under `data/raw/`; `data/raw/beir` is a catalog placeholder
+
+## Do not re-introduce
+
+- Monolithic `*-minne.json` converted files
+- Monolithic `beir-minne/` merged converted store (use per-subset stores + virtual mix loader)
+- `state-machines` pipeline for this linear flow
+- `eval.rs` re-export hub
+- Legacy history migration in reports
+- Ephemeral `ingest_eval_*` namespaces on the shared SurrealDB server
+- Separate `cache/snapshots/` namespace state files
+
+## Open follow-ups
+
+- Generate `DatasetKind` from `manifest.yaml` at build time
+- Split `report.rs` when touching reporting again