The right metric is retrieval quality
A model can look strong on public benchmarks and still miss your internal documents. Build a small evaluation set of real queries, each paired with the sources that should be retrieved and with examples of bad answers. Then compare models on whether the correct evidence appears in the top-k results.
- Measure recall at top-k and inspect evidence quality.
- Separate multilingual, code, table, and long-document queries.
- Track cost and latency at realistic batch sizes.
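The first two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the evaluation-set schema (`query`, `expected`, `category`) and the `search` callable are assumptions standing in for your own data and retriever, and recall here is the fraction of expected sources found in the top-k results.

```python
from collections import defaultdict


def recall_at_k(retrieved, expected, k):
    """Fraction of expected source IDs found among the first k retrieved IDs."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("each query needs at least one expected source")
    top_k = set(retrieved[:k])
    return len(top_k & expected_set) / len(expected_set)


def recall_by_category(eval_set, search, k=5):
    """Average recall@k per query category.

    eval_set: list of dicts with 'query', 'expected', and 'category' keys
              (a hypothetical schema, adapt it to your own data).
    search:   callable(query, k) -> ranked list of document IDs
              (a placeholder for whatever retriever you are testing).
    """
    scores = defaultdict(list)
    for item in eval_set:
        retrieved = search(item["query"], k)
        scores[item["category"]].append(
            recall_at_k(retrieved, item["expected"], k)
        )
    # Report categories separately so a weak slice (e.g. tables) is not
    # hidden behind a strong overall average.
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}
```

Running `recall_by_category` once per candidate model on the same evaluation set gives a per-category comparison. Queries where recall is low are the ones worth inspecting by hand for evidence quality.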