AI Model Benchmark Hub
This hub explains how major AI model evaluation sites differ. Arena reflects human preference, Artificial Analysis helps compare capability, speed, and price, Vals AI focuses on industry tasks, and HELM emphasizes transparency and reproducibility.
Guozhen AI Composite Ranking v0.1
This is Guozhen AI's original synthesis layer. It normalizes Arena multi-domain preference, Vals real-task evidence, Artificial Analysis production signals, and HELM-style transparency signals into a 0-100 weighted score.
The automated ranking currently draws on the latest snapshots of the official LMArena dataset on Hugging Face for Text, WebDev, Vision, and Document. Vals, Artificial Analysis, and HELM are not yet ingested automatically; they inform the methodology, editorial calibration, and model-selection guidance.
Arena multi-domain preference: Combines Text, WebDev, Vision, Document, and related preference signals from real users.
Vals real-task evidence: Uses coding, terminal, industry, and agentic task evidence to avoid chat-only evaluation.
Artificial Analysis production signals: Adds production signals such as intelligence, speed, latency, and price.
HELM-style transparency signals: Rewards reproducibility, robustness, multi-metric reporting, and research transparency.
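The normalize-then-weight step described above can be sketched roughly as follows. This is a minimal illustration, not the v0.1 implementation: the page does not publish the composite weights or the normalization bounds, so the `WEIGHTS` values, the function names, and the latency range in the example are all assumptions made for this sketch.

```python
from typing import Dict

# Hypothetical weights: the page does not publish the v0.1 composite weights.
WEIGHTS: Dict[str, float] = {
    "arena": 0.40,
    "tasks": 0.30,
    "efficiency": 0.20,
    "transparency": 0.10,
}


def min_max_to_100(value: float, lo: float, hi: float,
                   higher_is_better: bool = True) -> float:
    """Rescale a raw signal onto 0-100, flipping it when lower raw values are better."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)  # keep out-of-range values inside [lo, hi]
    scaled = (clipped - lo) * 100.0 / (hi - lo)
    return scaled if higher_is_better else 100.0 - scaled


def composite(scores: Dict[str, float],
              weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of 0-100 component scores, rounded to one decimal."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return round(sum(weights[name] * scores[name] for name in weights), 1)


# A latency of 500 ms on an assumed 200-1200 ms range; lower is better.
print(min_max_to_100(500, lo=200, hi=1200, higher_is_better=False))  # 70.0
print(composite({"arena": 90, "tasks": 80, "efficiency": 70, "transparency": 60}))  # 80.0
```

Rescaling every signal to the same 0-100 range before weighting keeps a large-valued raw metric (for example, a rating in the thousands) from dominating the sum.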
| Rank | Model | Vendor | Composite | Arena | Tasks | Efficiency | Transparency | Best for | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7-thinking | Anthropic | 94.8 | 99 | 94 | 90 | 87 | Complex reasoning, long documents, engineering agents, WebDev | Strong across Arena-style preference and real engineering tasks, making it the strongest composite choice in this snapshot. |
| 2 | claude-opus-4-6-thinking | Anthropic | 93.6 | 98 | 92 | 89 | 87 | Document understanding, deep writing, reasoning-heavy work | Very stable across Text, Vision, and Document preference signals, slightly behind the newer thinking model. |
| 3 | gemini-3.1-pro-preview | Google | 91.7 | 91 | 96 | 91 | 84 | Coding, long context, multimodal, search-augmented tasks | Strong Vals coding and long-context signals lift its composite score beyond pure chat preference ranking. |
| 4 | gpt-5.5-high | OpenAI | 90.9 | 88 | 95 | 96 | 83 | General intelligence, code repair, production API selection | Strong on SWE-style tasks and general intelligence signals, with favorable production trade-offs. |
| 5 | claude-opus-4-7 | Anthropic | 89.4 | 96 | 88 | 86 | 86 | Writing, chat, documents, lighter agent workflows | Still very strong in Arena and WebDev, but slightly less reliable than the thinking variant for complex tasks. |
| 6 | claude-opus-4-6 | Anthropic | 88.8 | 95 | 87 | 86 | 86 | Text creation, visual understanding, document analysis | A stable all-round model for high-quality content and complex material analysis. |
| 7 | gemini-3-pro | Google | 88.3 | 90 | 89 | 91 | 84 | Vision, multimodal, long context | Strong vision and multimodal performance keep it high among Google models. |
| 8 | gpt-5.4-high | OpenAI | 84.1 | 87 | 88 | 85 | 82 | Competitive coding, stable API use, general assistant tasks | Still strong in selected academic and coding tasks, but trails GPT-5.5 and newer Claude models overall. |
| 9 | qwen3.7-max-20260517 | Alibaba | 83.7 | 86 | 83 | 86 | 79 | Chinese tasks, WebDev, cost-sensitive API use | Notable WebDev performance, with extra value for Chinese-language and cost-aware use cases. |
| 10 | gemini-3.5-flash | Google | 82.6 | 84 | 81 | 93 | 80 | Low latency, multimodal, high-throughput workloads | Not the strongest intelligence model, but speed and cost make it useful at production scale. |
| 11 | claude-sonnet-4-6 | Anthropic | 80.8 | 82 | 81 | 82 | 84 | Daily writing, code explanation, cost-controlled tasks | Below the Opus tier, but balanced for quality and cost. |
| 12 | glm-5.1 | Zhipu AI | 79.2 | 82 | 78 | 82 | 75 | Chinese Q&A, domestic ecosystem, enterprise private deployment review | Good WebDev signal, worth further testing for Chinese and domestic ecosystem scenarios. |
| 13 | kimi-k2.6 | Moonshot AI | 78.4 | 81 | 77 | 82 | 74 | Chinese long documents, knowledge organization, cost-aware workflows | Interesting for Chinese long-document work, though cross-source coverage is less complete than top labs. |
| 14 | muse-spark | Meta | 77.1 | 85 | 73 | 78 | 76 | General chat and open-ecosystem tracking | Strong Text preference signal, but weaker cross-source task and production coverage lowers the composite rank. |
| 15 | deepseek-r1-202605 | DeepSeek | 76.4 | 78 | 79 | 83 | 72 | Chinese reasoning, math, cost-sensitive API use | Good reasoning and cost signals, worth testing for Chinese technical Q&A and budget-sensitive work. |
| 16 | deepseek-v3.1 | DeepSeek | 75.8 | 77 | 76 | 86 | 72 | General Chinese tasks, batch processing, tool use | Efficient and cost-aware, useful as a candidate for batch workflows. |
| 17 | llama-4-maverick | Meta | 74.9 | 75 | 74 | 78 | 88 | Open ecosystem, local deployment, research reproducibility | Strong openness and transparency, though top task capability trails frontier closed models. |
| 18 | qwen3.7-plus | Alibaba | 74.2 | 76 | 73 | 84 | 76 | Chinese apps, low-cost production, domestic ecosystem | Good Chinese ecosystem and cost profile, useful as an enterprise fallback. |
| 19 | grok-4 | xAI | 73.6 | 76 | 72 | 77 | 70 | Fresh information, creative Q&A, social context | Interesting for freshness and creative Q&A, with less cross-source coverage than top labs. |
| 20 | mistral-large-2 | Mistral | 72.8 | 73 | 72 | 80 | 79 | EU compliance, open ecosystem, multilingual tasks | Useful for multilingual and compliance-sensitive work, though not a top composite performer. |
Trusted Sources
Arena: Human preference
Uses anonymous pairwise voting from real users. It is useful for general chat, writing, and preference-driven quality, but a single preference score does not identify the best model for every workflow.
Limitation: Preference data can be affected by sampling, traffic allocation, prompt mix, and model exposure.
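To make the link between pairwise votes and a leaderboard score concrete, here is a minimal sketch that turns a stream of (winner, loser) votes into ratings with a plain online Elo update. It is an illustration only and is not claimed to be Arena's actual rating procedure; the `k` factor, the starting rating, and the model names are assumptions chosen for the example.

```python
from typing import Dict, Iterable, Tuple


def expected_win(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rate_from_votes(votes: Iterable[Tuple[str, str]],
                    k: float = 16.0,
                    start: float = 1000.0) -> Dict[str, float]:
    """Apply one Elo update per (winner, loser) vote, in the order given."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # The less expected the win, the larger the rating transfer.
        surprise = 1.0 - expected_win(r_w, r_l)
        ratings[winner] = r_w + k * surprise
        ratings[loser] = r_l - k * surprise
    return ratings


# Hypothetical votes between three placeholder models.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for name, rating in sorted(rate_from_votes(votes).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on which pairs are shown and in what order, the sketch also illustrates why sampling, traffic allocation, and model exposure can shift the resulting scores.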
Artificial Analysis: Capability, speed, and cost
Tracks intelligence, throughput, latency, and pricing, making it useful for API selection, cost control, and production trade-off analysis.
Limitation: Composite scores cannot represent every private workflow; teams still need task-specific evaluation.
Vals AI: Industry task evaluation
Focuses on high-value industry tasks such as finance, law, healthcare, coding, and education, with attention to documents, long context, and agentic workflows.
Limitation: Some datasets and judging details are private, so it is best used as an industry signal rather than a fully reproducible experiment.
HELM: Transparent reproducible evaluation
Emphasizes transparent scenarios, metrics, and reproducible evaluation, which helps research-minded readers inspect model capability and robustness.
Limitation: Updates may be slower than commercial leaderboards, so coverage of the newest models can lag.
Guozhen AI Scorecard
Compare reasoning, science, math, knowledge, and instruction following instead of trusting one top-ranked model.
Prefer evidence from documents, codebases, tool use, multi-turn workflows, and long context over exam-only scores.
Check hallucination risk, format consistency, and whether the model stays coherent across long tasks.
For similar quality, compare input and output price, latency, throughput, and context window.
Separate closed APIs, open weights, local deployment, compliance, and auditability.
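The price comparison in the scorecard comes down to simple arithmetic on token counts. The sketch below shows one way to estimate cost per request and per month; the `Pricing` class, the function names, and every price and traffic figure are hypothetical placeholders, not quotes for any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens (hypothetical figure)
    output_per_million: float  # USD per 1M output tokens (hypothetical figure)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request given its input and output token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """USD cost of a steady workload over the given number of days."""
    return request_cost(pricing, input_tokens, output_tokens) * requests_per_day * days


# Placeholder prices and traffic, for illustration only.
example = Pricing(input_per_million=3.0, output_per_million=15.0)
print(f"per request: ${request_cost(example, 2_000, 500):.4f}")
print(f"per month:   ${monthly_cost(example, 2_000, 500, requests_per_day=10_000):,.2f}")
```

Output tokens are usually priced higher than input tokens, so two models with similar headline prices can differ noticeably once the workload's input-to-output ratio is taken into account.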
Model Selection
Start with Arena-style preference data, then check Artificial Analysis for speed and cost.
Use LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.
Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.
Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.
Compare open weights, licenses, deployment cost, context windows, and data retention policy.
Start with Arena preference data, then add Artificial Analysis speed and cost signals.
Weights: Arena Text/Document preference 50%, general intelligence 20%, speed and cost 20%, knowledge organization 10%.
Best overall for high-quality writing, long-answer structure, complex Q&A, and document summaries.
Very stable in Text and Document signals, especially for long documents and deep writing.
Strong multimodal, long-context, and information organization ability.
Good structured output, production API fit, and general Q&A performance.
Not the highest quality, but useful for fast summarization, rewriting, and lightweight Q&A.
Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.
Prioritize LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.
Weights: Vals/SWE real tasks 40%, WebDev/Arena engineering preference 25%, agent reliability 20%, speed and cost 15%.
Strong coding, long-context, and repository-level understanding signals.
Strong SWE-style repair, tool use, and production API behavior.
Excellent WebDev and reasoning signal for frontend refactors and architecture analysis.
Notable WebDev signal and worth testing for Chinese engineering workflows.
Balanced for code explanation, local fixes, and lighter agent workflows.
Extended candidate for coding and agent workflows; validate on your own repository and test suite.
Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.
Weights: Vals industry tasks 45%, long-document reasoning 25%, compliance control 15%, cost and speed 15%.
Strong long-document reasoning and safer professional-answer style.
Strong long context and multimodal handling for reports and industry documents.
Good tool ecosystem for knowledge bases, customer support, and internal workflow automation.
Stable document reasoning for professional material review.
Worth testing for Chinese long-document and cost-sensitive industry workflows.
Extended candidate for industry workflows; combine public signals with private internal evaluation.
Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.
Weights: transparent academic evaluation 35%, reasoning and knowledge 30%, reproducibility 20%, tools and retrieval 15%.
Strong for complex reasoning, paper summaries, and long-form research analysis.
Strong general knowledge, tool ecosystem, and structured analysis.
Strong long-context and multimodal analysis for papers, charts, and data materials.
Stable reasoning and document comprehension for serious reading.
Useful for visual and multimodal interpretation of figures and experiment materials.
Extended candidate for research analysis; check transparent benchmark methodology and source reliability.
Compare open weights, licenses, deployment cost, context windows, and data retention policy separately.
Weights: openness and deployability 35%, data control 25%, Chinese usability 15%, cost efficiency 15%, capability 10%.
Strong Chinese ecosystem, open community, and practical private-deployment route.
Good Chinese capability and enterprise deployment fit.
Interesting for Chinese long documents and internal knowledge Q&A tests.
Strong open ecosystem, though Chinese and industry coverage need more validation.
Not a local-first model, but useful for low-cost high-throughput workloads after data sanitization.
Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.
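The five "Weights:" lines above can be expressed as scoring profiles. The sketch below copies the percentages from those lines; the scenario keys, the dimension keys, and the sub-scores in the example are shorthand and placeholder values invented for this illustration, and the page does not state that its rankings are computed this way.

```python
from typing import Dict

# Percent weights copied from the "Weights:" lines above.
# Scenario and dimension keys are shorthand invented for this sketch.
SCENARIO_WEIGHTS: Dict[str, Dict[str, int]] = {
    "writing_qa": {"arena_text_document": 50, "general_intelligence": 20,
                   "speed_cost": 20, "knowledge_organization": 10},
    "coding_agents": {"real_tasks": 40, "engineering_preference": 25,
                      "agent_reliability": 20, "speed_cost": 15},
    "industry": {"industry_tasks": 45, "long_document_reasoning": 25,
                 "compliance_control": 15, "cost_speed": 15},
    "research": {"transparent_evaluation": 35, "reasoning_knowledge": 30,
                 "reproducibility": 20, "tools_retrieval": 15},
    "local_private": {"openness_deployability": 35, "data_control": 25,
                      "chinese_usability": 15, "cost_efficiency": 15,
                      "capability": 10},
}


def scenario_score(scenario: str, sub_scores: Dict[str, float]) -> float:
    """Weighted 0-100 score for one scenario, rounded to one decimal."""
    weights = SCENARIO_WEIGHTS[scenario]
    if sum(weights.values()) != 100:
        raise ValueError(f"weights for {scenario!r} must sum to 100")
    missing = set(weights) - set(sub_scores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return round(sum(weights[d] * sub_scores[d] for d in weights) / 100, 1)


# Placeholder sub-scores for a hypothetical model in the coding scenario.
print(scenario_score("coding_agents", {
    "real_tasks": 90, "engineering_preference": 80,
    "agent_reliability": 70, "speed_cost": 60,
}))  # 79.0
```

Keeping one profile per scenario makes it easy to see why the same model can rank differently for coding than for private deployment: only the weights change, not the underlying evidence.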
This page does not copy external leaderboards or claim that one model is always best. Guozhen AI combines public benchmark sources, methodology differences, and practical scenarios so readers can make better model decisions.