Evaluate the failure you fear
Start with the failure that would hurt the product most: a wrong answer, a hallucinated citation, malformed JSON, an unsafe tool call, a latency spike, or an expensive retry loop. The eval should make that failure visible before users find it.
- Keep the golden set small enough to run on every prompt or model change.
- Score retrieval quality, answer quality, policy behavior, and formatting reliability separately, so a regression in one is not hidden by a good result in another.
- Add negative cases where the correct behavior is to refuse or ask a clarifying question.
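
The points above can be sketched as a small harness. This is a minimal illustration, not a recommended framework: `GoldenCase`, `run_evals`, `fake_model`, the prompts, and the `doc-17` source ID are all hypothetical names invented for the example, and the string checks are deliberately crude stand-ins for whatever grading the product actually needs.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    name: str
    dimension: str                # "retrieval", "answer", "policy", or "format"
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def is_valid_json(output: str) -> bool:
    """Formatting check: the output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def refuses_or_clarifies(output: str) -> bool:
    """Negative-case check: a refusal phrase or a closing question counts as a pass."""
    text = output.lower().rstrip()
    return "can't help" in text or "cannot help" in text or text.endswith("?")


# A deliberately tiny golden set, one case per dimension (all cases are made up).
GOLDEN_CASES: List[GoldenCase] = [
    GoldenCase("refund_source", "retrieval", "What is our refund window?",
               lambda out: "doc-17" in out),
    GoldenCase("capital_fact", "answer", "What is the capital of France?",
               lambda out: "paris" in out.lower()),
    GoldenCase("ambiguous_request", "policy", "Cancel it.", refuses_or_clarifies),
    GoldenCase("json_shape", "format", "Return the order as JSON.", is_valid_json),
]


def run_evals(model: Callable[[str], str],
              cases: List[GoldenCase]) -> Dict[str, Tuple[int, int]]:
    """Run every case and return {dimension: (passed, total)}."""
    results: Dict[str, Tuple[int, int]] = {}
    for case in cases:
        passed, total = results.get(case.dimension, (0, 0))
        ok = case.check(model(case.prompt))
        results[case.dimension] = (passed + int(ok), total + 1)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned outputs for the prompts above."""
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days [doc-17].",
        "What is the capital of France?": "The capital of France is Paris.",
        "Cancel it.": "Which order would you like to cancel?",
        "Return the order as JSON.": '{"order_id": 1, "status": "shipped"}',
    }
    return canned.get(prompt, "")


if __name__ == "__main__":
    for dimension, (passed, total) in run_evals(fake_model, GOLDEN_CASES).items():
        print(f"{dimension}: {passed}/{total}")
```

Reporting a pass count per dimension, rather than one overall score, keeps a formatting regression from hiding behind good answer quality. Replacing `fake_model` with the real model call is the only change needed to run the same cases on each prompt or model change.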