evals: eval crate overhaul, simplification and performance improvements

Per Stark
2026-06-17 19:23:11 +02:00
parent adc04d8c6d
commit fb51a8b55f
53 changed files with 2852 additions and 1831 deletions
@@ -0,0 +1,98 @@
# Evaluations crate refactor plan
This document records the architecture review and the simplification work applied to the
`evaluations` crate. **No backwards compatibility** is maintained for converted JSON layouts,
legacy report history, or old cache artifact formats.
## Goals
- Smaller, linear pipeline (no state machine ceremony)
- Sharded converted store for **all** datasets (memory-efficient partial loading)
- Slice-first loading when a catalog slice is selected
- In-memory SurrealDB for ingestion (no ephemeral server namespaces)
- Single DB lifecycle module (`db/`)
- CLI helpers under `cli/`
## Primary workflow
```bash
# One-time prep (converts raw data if needed, builds slice ledger, corpus cache, DB seed)
cargo eval --warm --dataset beir --slice beir-mix-600
# Check readiness
cargo eval --status --dataset beir --slice beir-mix-600
# Steady-state benchmark
cargo eval --dataset beir --slice beir-mix-600 --require-ready
```
Default dataset is `beir`. Chunk-only ingestion is the default; pass `--include-entities` to
opt into entity extraction (requires `OPENAI_API_KEY`). Slice tuning such as
`negative_multiplier` lives in `manifest.yaml` (e.g. `beir-mix-600` uses `9.0`).
## Cache layers (after refactor)
| Layer | Location | Purpose |
|-------|----------|---------|
| Converted store | `data/converted/<name>/` | Sharded paragraphs + question catalog |
| Slice ledger | `cache/slices/<dataset>/<slice-id>.json` | Deterministic questions + paragraph set |
| Corpus cache | `cache/ingested/<dataset>/<slice-id>/` | Ingestion paragraph shards, manifest, and namespace reuse seed |
Namespace reuse state lives in the corpus manifest (`metadata.namespace_seed`), not a separate
`snapshots/` tree. After upgrading, delete old monolithic `*-minne.json` files and any
`cache/snapshots/` directories, then re-run with `--warm`.
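The three cache locations in the table can be derived from a dataset name and a slice id. The following is a minimal sketch of that mapping, not the crate's actual code: `CacheLayout`, its method names, and the `root` parameter are hypothetical, and only the path shapes come from the table above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper; the type and method names are illustrative only.
struct CacheLayout {
    /// Assumed base directory that contains `data/` and `cache/`.
    root: PathBuf,
}

impl CacheLayout {
    fn new(root: impl AsRef<Path>) -> Self {
        Self { root: root.as_ref().to_path_buf() }
    }

    /// Converted store: `data/converted/<name>/`
    fn converted_store(&self, name: &str) -> PathBuf {
        self.root.join("data").join("converted").join(name)
    }

    /// Slice ledger: `cache/slices/<dataset>/<slice-id>.json`
    fn slice_ledger(&self, dataset: &str, slice_id: &str) -> PathBuf {
        self.root
            .join("cache")
            .join("slices")
            .join(dataset)
            .join(format!("{slice_id}.json"))
    }

    /// Corpus cache: `cache/ingested/<dataset>/<slice-id>/`
    fn corpus_cache(&self, dataset: &str, slice_id: &str) -> PathBuf {
        self.root.join("cache").join("ingested").join(dataset).join(slice_id)
    }
}
```

With `CacheLayout::new("root")`, `slice_ledger("beir", "beir-mix-600")` resolves to `root/cache/slices/beir/beir-mix-600.json`.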
## Phases applied
### Phase 0 — dead code
- Removed unused `criterion` dependency
- Removed unused `EmbeddingCache`
- Updated README for current CLI
### Phase 1 — structure
- Flattened pipeline to linear `async fn` stages
- Removed `eval.rs` hub; imports go to owning modules
- Merged `namespace.rs` and `db_helpers.rs` into `db/`; dropped standalone `snapshot.rs`
- Moved `status.rs` to `cli/status.rs`
- Fixed catalog slice bootstrap (build ledger when explicit slice manifest is missing)
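To illustrate what "linear `async fn` stages" can look like, here is a rough sketch under stated assumptions. The stage names (`prepare_slice`, `ingest`, `run_queries`), the `RunState` struct, and its fields are all hypothetical. The tiny `block_on` executor exists only so the sketch runs with the standard library; real code would presumably use an async runtime instead.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// State threaded through the stages (hypothetical fields).
#[derive(Debug, Default)]
struct RunState {
    slice_id: String,
    paragraphs: usize,
    log: Vec<&'static str>,
}

type StageResult = Result<RunState, String>;

// Each stage is a plain `async fn` that takes the state and hands it back.
async fn prepare_slice(mut s: RunState) -> StageResult {
    if s.slice_id.is_empty() {
        return Err("no slice selected".to_string());
    }
    s.log.push("prepare_slice");
    Ok(s)
}

async fn ingest(mut s: RunState) -> StageResult {
    s.paragraphs = 3; // placeholder for real ingestion work
    s.log.push("ingest");
    Ok(s)
}

async fn run_queries(mut s: RunState) -> StageResult {
    s.log.push("run_queries");
    Ok(s)
}

/// The pipeline is one function: stages run top to bottom and `?` stops
/// at the first failure. There is no state enum or transition table.
async fn run_pipeline(slice_id: &str) -> StageResult {
    let s = RunState { slice_id: slice_id.to_string(), ..Default::default() };
    let s = prepare_slice(s).await?;
    let s = ingest(s).await?;
    run_queries(s).await
}

// Minimal std-only executor so the sketch is runnable without dependencies.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Calling `block_on(run_pipeline("beir-mix-600"))` runs the three stages in order. An empty slice id fails in the first stage and the later stages never run.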
### Phase 2 — no legacy paths
- All datasets use sharded converted store only
- Removed legacy JSON layout and migration
- Removed legacy report history format
- Auto-apply first catalog slice when `--slice` omitted
- Namespace seed folded into corpus manifest (removed `cache/snapshots/`)
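The auto-apply rule for an omitted `--slice` can be sketched as a single fallback. This is an illustration only: `resolve_slice` is a hypothetical name, and the catalog is assumed to be an ordered list of slice ids.

```rust
/// Pick the slice to run: an explicit `--slice` value wins; otherwise fall
/// back to the first slice listed in the dataset's catalog (if any).
fn resolve_slice<'a>(cli_slice: Option<&'a str>, catalog: &[&'a str]) -> Option<&'a str> {
    cli_slice.or_else(|| catalog.first().copied())
}
```

An empty catalog with no `--slice` yields `None`, which leaves the caller to decide how to handle a dataset that has no catalog slices.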
### Phase 3 — performance
- Ingestion always uses in-memory SurrealDB
- Slice-first partial load when ledger is complete
- Default catalog slice for dataset when `--slice` not passed
- Split `slice/` into `mod.rs`, `build.rs`, and `beir.rs`
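The slice-first decision can be sketched as a small planning function. This is a sketch under assumptions, not the crate's implementation: `SliceLedger`, its `complete` flag, and `LoadPlan` are hypothetical, and what makes a ledger "complete" is not specified here.

```rust
use std::collections::BTreeSet;

/// Hypothetical ledger shape: the paragraphs a slice needs, plus a flag
/// saying whether the ledger was fully built.
struct SliceLedger {
    paragraph_ids: BTreeSet<String>,
    complete: bool,
}

#[derive(Debug, PartialEq)]
enum LoadPlan {
    /// Read only the listed paragraphs from the sharded store.
    Partial(BTreeSet<String>),
    /// Fall back to reading the whole converted store.
    Full,
}

/// Slice-first: a complete ledger allows a partial load; a missing or
/// incomplete ledger falls back to a full load.
fn plan_load(ledger: Option<&SliceLedger>) -> LoadPlan {
    match ledger {
        Some(l) if l.complete => LoadPlan::Partial(l.paragraph_ids.clone()),
        _ => LoadPlan::Full,
    }
}
```

Keeping the decision in one pure function makes the fallback explicit and easy to test without touching the store.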
### Phase 4 — BEIR mix slice-first
- `beir` is a virtual mix: slice ledger references prefixed ids (`fever-…`, `fiqa-…`, …)
- Conversion is **qrels-closed** per subset (only documents appearing in qrels, not full corpus)
- Slice ledger is resolved for the requested `--slice` (catalog preset or custom id + `--limit`)
- Only ledger paragraph ids are materialized into per-subset stores (`fever-minne/`, `fiqa-minne/`, …)
- No monolithic `beir-minne/` merged store
- Raw BEIR data lives in per-subset dirs under `data/raw/`; `data/raw/beir` is a catalog placeholder
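Two ideas from this phase lend themselves to a short sketch: routing prefixed ledger ids to per-subset stores, and the qrels-closed filter. Everything named here is hypothetical (`subset_of`, `group_by_store`, `qrels_closed`). The exact id format after the prefix is not specified above, so the code assumes only that an id starts with the subset name followed by a hyphen.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Find the subset a prefixed id belongs to. The longest matching name wins,
/// so a subset whose name itself contains a hyphen is not cut short.
fn subset_of<'a>(id: &str, subsets: &[&'a str]) -> Option<&'a str> {
    subsets
        .iter()
        .copied()
        .filter(|s| id.strip_prefix(*s).map_or(false, |rest| rest.starts_with('-')))
        .max_by_key(|s| s.len())
}

/// Group ledger paragraph ids by per-subset store directory (`<subset>-minne`).
/// Ids that match no known subset are skipped in this sketch.
fn group_by_store(ledger_ids: &[&str], subsets: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut stores: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for id in ledger_ids {
        if let Some(subset) = subset_of(id, subsets) {
            stores
                .entry(format!("{subset}-minne"))
                .or_default()
                .push(id.to_string());
        }
    }
    stores
}

/// Qrels-closed filter: keep only documents that some qrels row references.
fn qrels_closed<'a>(corpus_ids: &[&'a str], qrels_doc_ids: &BTreeSet<&str>) -> Vec<&'a str> {
    corpus_ids
        .iter()
        .copied()
        .filter(|id| qrels_doc_ids.contains(*id))
        .collect()
}
```

Grouping by store name first means each per-subset store is opened once and only the ledger's paragraph ids are materialized into it.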
## Do not re-introduce
- Monolithic `*-minne.json` converted files
- Monolithic `beir-minne/` merged converted store (use per-subset stores + virtual mix loader)
- `state-machines` pipeline for this linear flow
- `eval.rs` re-export hub
- Legacy history migration in reports
- Ephemeral `ingest_eval_*` namespaces on the shared SurrealDB server
- Separate `cache/snapshots/` namespace state files
## Open follow-ups
- Generate `DatasetKind` from `manifest.yaml` at build time
- Split `report.rs` when touching reporting again