Separate benchmark types
A model can rank highly in chat preference and still be the wrong model for a codebase, legal policy search, local deployment, or low-latency customer support workflow. Always ask what the benchmark is measuring.
- Preference scores are not the same as factual accuracy or tool reliability.
- Academic benchmarks are not the same as production latency and cost.
- Open-weight model choice adds hardware, quantization, and serving constraints.