AI Model Benchmark 2026
This hub explains how major 2026 AI model evaluation sources differ. Arena reflects human preference, Artificial Analysis helps compare capability, speed, and price, Vals AI focuses on industry tasks, and HELM emphasizes transparency and reproducibility.
Guozhen AI Composite Ranking v0.1
This is Guozhen AI's original synthesis layer. It normalizes Arena multi-domain preference, Vals real-task evidence, Artificial Analysis production signals, and HELM-style transparency signals into a 0-100 weighted score.
The ranking combines public benchmark signals from LMArena Text, WebDev, Vision, and Document, then uses Vals, Artificial Analysis, and HELM-style methodology for editorial calibration. If an external source is temporarily unavailable, the page keeps a stable composite ranking without exposing fetch diagnostics to readers.
Arena multi-domain preference: Combines Text, WebDev, Vision, Document, and related preference signals from real users.
Vals real-task evidence: Uses coding, terminal, industry, and agentic task evidence to avoid chat-only evaluation.
Artificial Analysis production signals: Adds production signals such as intelligence, speed, latency, and price.
HELM-style transparency signals: Rewards reproducibility, robustness, multi-metric reporting, and research transparency.
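As a rough illustration of the normalization and weighting described above, the sketch below rescales raw signals to 0-100 and combines four component scores into one weighted value. The weights in `ASSUMED_WEIGHTS`, the function names, and the raw values are assumptions made for this example; this page does not publish the actual composite weights, so the result is not expected to reproduce the published composite scores.

```python
# Hypothetical weights in percent. The real composite weights are not published here.
ASSUMED_WEIGHTS = {"arena": 35, "tasks": 30, "efficiency": 20, "transparency": 15}


def min_max_normalize(values, higher_is_better=True):
    """Rescale raw values to 0-100; invert when a lower raw value is better (price, latency)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All candidates are tied on this signal, so give each the full score.
        return [100.0 for _ in values]
    scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [100.0 - s for s in scaled]


def composite_score(signals, weights=ASSUMED_WEIGHTS):
    """Weighted average of 0-100 component scores, rounded to one decimal place."""
    total = sum(weights.values())
    return round(sum(signals[name] * w for name, w in weights.items()) / total, 1)


if __name__ == "__main__":
    # Illustrative raw values only: preference ratings and prices on very different scales.
    print(min_max_normalize([1400, 1500, 1450]))
    print(min_max_normalize([2.0, 10.0, 6.0], higher_is_better=False))
    print(composite_score({"arena": 100, "tasks": 100, "efficiency": 84, "transparency": 76}))
```

Inverting lower-is-better signals before weighting keeps every component pointing the same way, so a higher composite always means a stronger candidate under the chosen weights.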
| Rank | Model | Vendor | Composite | Arena | Tasks | Efficiency | Transparency | Best for | LMArena Text rank | LMArena WebDev rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-fable-5 | Anthropic | 93.4 | 100 | 100 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 1 |
| 2 | claude-opus-4-7-thinking | Anthropic | 92.9 | 99 | 99 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 2 |
| 3 | claude-opus-4-7 | Anthropic | 91.3 | 96 | 97 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 4 |
| 4 | claude-opus-4-6-thinking | Anthropic | 90.4 | 96 | 95 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 6 |
| 5 | claude-opus-4-8-thinking | Anthropic | 89.9 | 94 | 95 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 3 |
| 6 | claude-opus-4-6 | Anthropic | 89.4 | 94 | 93 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 7 |
| 7 | qwen3.7-max-20260517 | Alibaba | 89.1 | 91 | 91 | 88 | 82 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 8 |
| 8 | claude-opus-4-8 | Anthropic | 88.1 | 91 | 93 | 84 | 76 | Coding, WebDev, engineering agents, frontend tasks | Not covered | 5 |
| 9 | gpt-5.5-high | OpenAI | 88.1 | 89 | 89 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | Not covered |
| 10 | glm-5.1 | Zai | 87.3 | 89 | 89 | 82 | 88 | General Q&A, writing, knowledge organization | Not covered | 9 |
| 11 | gpt-5.5 | OpenAI | 86.9 | 87 | 88 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | Not covered |
| 12 | gpt-5.4-high | OpenAI | 86.2 | 88 | 84 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | Not covered |
| 13 | claude-sonnet-4-6 | Anthropic | 85.5 | 87 | 89 | 84 | 76 | Long documents, knowledge organization, report analysis | Not covered | 10 |
| 14 | muse-spark | Meta | 84.8 | 87 | 84 | 82 | 82 | Multimodal, vision understanding, image-text tasks | Not covered | 13 |
| 15 | gemini-3.5-flash | Google | 83.9 | 82 | 82 | 91 | 76 | General Q&A, writing, knowledge organization | Not covered | 14 |
| 16 | gpt-5.5-instant | OpenAI | 83.2 | 81 | 82 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | Not covered |
| 17 | gpt-5.5-xhigh (codex-harness) | OpenAI | 82.8 | 81 | 81 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | 15 |
| 18 | gpt-5.2-chat-latest-20260210 | OpenAI | 82.7 | 82 | 78 | 90 | 76 | General Q&A, writing, knowledge organization | Not covered | Not covered |
| 19 | kimi-k2.6 | Moonshot AI | 82.4 | 80 | 82 | 85 | 88 | General Q&A, writing, knowledge organization | Not covered | 12 |
| 20 | qwen3.6-max-preview | Alibaba | 81.2 | 78 | 78 | 88 | 82 | General Q&A, writing, knowledge organization | Not covered | 17 |
Note: Each row is an auto snapshot that combines LMArena Text/WebDev/Vision/Document signals.
Trusted Sources
Arena: Human preference
Uses anonymous pairwise voting from real users. It is useful for general chat, writing, and preference-driven quality, but a single score should not be treated as the best choice for every workflow.
Limitation: Preference data can be affected by sampling, traffic allocation, prompt mix, and model exposure.
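To make the idea of pairwise voting concrete, the sketch below turns a stream of "winner beat loser" votes into ratings with a simple Elo-style update. This is an illustrative technique only and is not claimed to be the method Arena uses; the model names, the starting rating of 1000, and the `k` factor are assumptions for the example.

```python
def expected_win_probability(rating_a, rating_b):
    """Elo-style probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings, votes, k=32.0, initial=1000.0):
    """Apply pairwise votes in order; each vote is a (winner, loser) tuple."""
    ratings = dict(ratings)  # work on a copy so the caller's dict is untouched
    for winner, loser in votes:
        r_win = ratings.setdefault(winner, initial)
        r_lose = ratings.setdefault(loser, initial)
        # The winner gains more when the win was less expected.
        delta = k * (1.0 - expected_win_probability(r_win, r_lose))
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between placeholder models.
    votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
    for name, rating in sorted(update_ratings({}, votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on which pairs are shown and in what order, the resulting ratings reflect the sampling and exposure effects listed in the limitation above.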
Artificial Analysis: Capability, speed, and cost
Tracks intelligence, throughput, latency, and pricing, making it useful for API selection, cost control, and production trade-off analysis.
Limitation: Composite scores cannot represent every private workflow; teams still need task-specific evaluation.
Vals AI: Industry task evaluation
Focuses on high-value industry tasks such as finance, law, healthcare, coding, and education, with attention to documents, long context, and agentic workflows.
Limitation: Some datasets and judging details are private, so it is best used as an industry signal rather than a fully reproducible experiment.
HELM: Transparent reproducible evaluation
Emphasizes transparent scenarios, metrics, and reproducible evaluation, which helps research-minded readers inspect model capability and robustness.
Limitation: Updates may be slower than commercial leaderboards, so the newest models can lag behind.
Guozhen AI Scorecard
Compare reasoning, science, math, knowledge, and instruction following instead of trusting one top-ranked model.
Prefer evidence from documents, codebases, tool use, multi-turn workflows, and long context over exam-only scores.
Check hallucination risk, format consistency, and whether the model stays coherent across long tasks.
For similar quality, compare input and output price, latency, throughput, and context window.
Separate closed APIs, open weights, local deployment, compliance, and auditability.
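For the price, latency, and throughput comparison in the scorecard, a small calculation is often enough to compare two models of similar quality. The sketch below assumes per-million-token pricing and a simple "time to first token plus generation time" latency model; the function names and all numbers are hypothetical and do not describe any specific provider.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request when prices are quoted in USD per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


def estimated_latency_s(output_tokens, time_to_first_token_s, tokens_per_second):
    """Rough end-to-end latency: wait for the first token, then stream the rest."""
    return time_to_first_token_s + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical workload: 2,000 input tokens and 500 output tokens per request.
    cost = request_cost_usd(2000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
    latency = estimated_latency_s(500, time_to_first_token_s=0.5, tokens_per_second=100.0)
    print(f"cost per request: ${cost:.4f}, estimated latency: {latency:.1f}s")
```

Multiplying the per-request cost by expected daily volume gives a budget figure that can be set against quality differences between candidates.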
Model Selection
Writing, Q&A, and knowledge organization: Start with Arena-style preference data, then check Artificial Analysis for speed and cost.
Coding and agent workflows: Use LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.
Industry workflows: Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.
Research analysis: Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.
Local, private, and compliance-sensitive workflows: Compare open weights, licenses, deployment cost, context windows, and data retention policy.
Start with Arena preference data, then add Artificial Analysis speed and cost signals.
Weights: Arena Text/Document preference 50%, general intelligence 20%, speed and cost 20%, knowledge organization 10%.
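The weights stated above can be applied as a plain weighted average once each component has a 0-100 score. The sketch below uses the percentages from this scenario; the component key names, the candidate names, and their scores are placeholders invented for illustration, not published results.

```python
# Percentages taken from the weights line above; the key names are assumptions.
WRITING_WEIGHTS = {
    "arena_text_document": 50,
    "general_intelligence": 20,
    "speed_and_cost": 20,
    "knowledge_organization": 10,
}


def scenario_score(signals, weights):
    """Weighted average of 0-100 component scores for one scenario profile."""
    if sum(weights.values()) != 100:
        raise ValueError("scenario weights must sum to 100 percent")
    missing = set(weights) - set(signals)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(signals[name] * pct for name, pct in weights.items()) / 100


def rank_candidates(candidates, weights):
    """Return candidate names ordered from highest to lowest scenario score."""
    return sorted(candidates, key=lambda name: scenario_score(candidates[name], weights), reverse=True)


if __name__ == "__main__":
    # Hypothetical candidates with made-up component scores.
    candidates = {
        "candidate-x": {"arena_text_document": 90, "general_intelligence": 80,
                        "speed_and_cost": 70, "knowledge_organization": 85},
        "candidate-y": {"arena_text_document": 80, "general_intelligence": 90,
                        "speed_and_cost": 90, "knowledge_organization": 80},
    }
    for name in rank_candidates(candidates, WRITING_WEIGHTS):
        print(name, scenario_score(candidates[name], WRITING_WEIGHTS))
```

The other scenario profiles on this page can be passed to the same function as a different weights dictionary, provided their percentages also sum to 100.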
Best overall for high-quality writing, long-answer structure, complex Q&A, and document summaries.
Very stable in Text and Document signals, especially for long documents and deep writing.
Strong multimodal, long-context, and information organization ability.
Good structured output, production API fit, and general Q&A performance.
Not the highest quality, but useful for fast summarization, rewriting, and lightweight Q&A.
Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.
Prioritize LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.
Weights: Vals/SWE real tasks 40%, WebDev/Arena engineering preference 25%, agent reliability 20%, speed and cost 15%.
Strong coding, long-context, and repository-level understanding signals.
Strong SWE-style repair, tool use, and production API behavior.
Excellent WebDev and reasoning signal for frontend refactors and architecture analysis.
Notable WebDev signal and worth testing for Chinese engineering workflows.
Balanced for code explanation, local fixes, and lighter agent workflows.
Extended candidate for coding and agent workflows; validate on your own repository and test suite.
Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.
Weights: Vals industry tasks 45%, long-document reasoning 25%, compliance control 15%, cost and speed 15%.
Strong long-document reasoning and safer professional-answer style.
Strong long context and multimodal handling for reports and industry documents.
Good tool ecosystem for knowledge bases, customer support, and internal workflow automation.
Stable document reasoning for professional material review.
Worth testing for Chinese long-document and cost-sensitive industry workflows.
Extended candidate for industry workflows; combine public signals with private internal evaluation.
Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.
Weights: transparent academic evaluation 35%, reasoning and knowledge 30%, reproducibility 20%, tools and retrieval 15%.
Strong for complex reasoning, paper summaries, and long-form research analysis.
Strong general knowledge, tool ecosystem, and structured analysis.
Strong long-context and multimodal analysis for papers, charts, and data materials.
Stable reasoning and document comprehension for serious reading.
Useful for visual and multimodal interpretation of figures and experiment materials.
Extended candidate for research analysis; check transparent benchmark methodology and source reliability.
Compare open weights, licenses, deployment cost, context windows, and data retention policy separately.
Weights: openness and deployability 35%, data control 25%, Chinese usability 15%, cost efficiency 15%, capability 10%.
Strong Chinese ecosystem, open community, and practical private-deployment route.
Good Chinese capability and enterprise deployment fit.
Interesting for Chinese long documents and internal knowledge Q&A tests.
Strong open ecosystem, though Chinese and industry coverage need more validation.
Not a local-first model, but useful for low-cost high-throughput workloads after data sanitization.
Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.
This page does not copy external leaderboards or claim that one model is always best. Guozhen AI combines public benchmark sources, methodology differences, and practical scenarios so readers can make better 2026 model decisions.