Guozhen AI: Global AI field notes and model intelligence

AI Model Benchmark 2026

Read 2026 AI model leaderboards without choosing the wrong model

This hub explains how major 2026 AI model evaluation sources differ. Arena reflects human preference, Artificial Analysis helps compare capability, speed, and price, Vals AI focuses on industry tasks, and HELM emphasizes transparency and reproducibility.

Guozhen AI Composite Ranking v0.1

Weighted composite ranking

This is Guozhen AI's original synthesis layer. It normalizes Arena multi-domain preference, Vals real-task evidence, Artificial Analysis production signals, and HELM-style transparency signals into a 0-100 weighted score.

Auto snapshot: 2026-06-10
Refreshes every 3 days; next refresh around 2026-06-24

The ranking combines public benchmark signals from LMArena Text, WebDev, Vision, and Document, then uses Vals, Artificial Analysis, and HELM-style methodology for editorial calibration. If an external source is temporarily unavailable, the page keeps a stable composite ranking without exposing fetch diagnostics to readers.
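The fallback behavior described above can be sketched as a "last good snapshot" pattern. This is a minimal sketch under stated assumptions, not the page's actual refresh code: `refresh`, `fetch`, `last_good`, and the snapshot shape are all hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

# Hypothetical snapshot shape, e.g. {"date": "...", "rows": [...]}.
Snapshot = dict


def refresh(fetch: Callable[[], Snapshot], last_good: Optional[Snapshot]) -> Snapshot:
    """Return a fresh snapshot, or the last good one if fetching fails."""
    try:
        return fetch()
    except Exception:
        # Swallow the fetch error so readers never see diagnostics
        # (assumption: errors are logged elsewhere, not rendered).
        if last_good is None:
            raise RuntimeError("no snapshot available yet")
        return last_good
```

With this shape, a failed fetch leaves the published ranking unchanged until the next successful refresh.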

40%

Arena multi-domain preference

Combines Text, WebDev, Vision, Document, and related preference signals from real users.

25%

Vals and real tasks

Uses coding, terminal, industry, and agentic task evidence to avoid chat-only evaluation.

25%

Artificial Analysis

Adds production signals such as intelligence, speed, latency, and price.

10%

HELM and transparent evals

Rewards reproducibility, robustness, multi-metric reporting, and research transparency.
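The weighting above can be sketched as a plain weighted average. This is a minimal sketch that assumes each component is already on a 0-100 scale; the page's normalization and editorial calibration steps are not specified, so published composites may differ slightly from what this computes. The names `WEIGHTS` and `composite` are illustrative. The example inputs come from the second row of the ranking table, read as Arena 99, Tasks 99, Efficiency 84, Transparency 76, where the published composite is 92.9.

```python
# Weights as published above, in percent. Integer weights keep the sum exact.
WEIGHTS = {"arena": 40, "tasks": 25, "efficiency": 25, "transparency": 10}


def composite(scores: dict[str, float]) -> float:
    """Weighted 0-100 composite from four 0-100 component scores."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected components {sorted(WEIGHTS)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = composite({"arena": 99, "tasks": 99, "efficiency": 84, "transparency": 76})
print(example)  # 92.95
```

Because Arena carries 40% of the weight, a one-point change in Arena preference moves the composite four times as much as a one-point change in transparency.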

| Rank | Model | Vendor | Composite | Arena | Tasks | Efficiency | Transparency | WebDev rank | Best for |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-fable-5 | Anthropic | 93.4 | 100 | 100 | 84 | 76 | 1 | Coding, WebDev, engineering agents, frontend tasks |
| 2 | claude-opus-4-7-thinking | Anthropic | 92.9 | 99 | 99 | 84 | 76 | 2 | Coding, WebDev, engineering agents, frontend tasks |
| 3 | claude-opus-4-7 | Anthropic | 91.3 | 96 | 97 | 84 | 76 | 4 | Coding, WebDev, engineering agents, frontend tasks |
| 4 | claude-opus-4-6-thinking | Anthropic | 90.4 | 96 | 95 | 84 | 76 | 6 | Coding, WebDev, engineering agents, frontend tasks |
| 5 | claude-opus-4-8-thinking | Anthropic | 89.9 | 94 | 95 | 84 | 76 | 3 | Coding, WebDev, engineering agents, frontend tasks |
| 6 | claude-opus-4-6 | Anthropic | 89.4 | 94 | 93 | 84 | 76 | 7 | Coding, WebDev, engineering agents, frontend tasks |
| 7 | qwen3.7-max-20260517 | Alibaba | 89.1 | 91 | 91 | 88 | 82 | 8 | Coding, WebDev, engineering agents, frontend tasks |
| 8 | claude-opus-4-8 | Anthropic | 88.1 | 91 | 93 | 84 | 76 | 5 | Coding, WebDev, engineering agents, frontend tasks |
| 9 | gpt-5.5-high | OpenAI | 88.1 | 89 | 89 | 90 | 76 | not covered | General Q&A, writing, knowledge organization |
| 10 | glm-5.1 | Zai | 87.3 | 89 | 89 | 82 | 88 | 9 | General Q&A, writing, knowledge organization |
| 11 | gpt-5.5 | OpenAI | 86.9 | 87 | 88 | 90 | 76 | not covered | General Q&A, writing, knowledge organization |
| 12 | gpt-5.4-high | OpenAI | 86.2 | 88 | 84 | 90 | 76 | not covered | General Q&A, writing, knowledge organization |
| 13 | claude-sonnet-4-6 | Anthropic | 85.5 | 87 | 89 | 84 | 76 | 10 | Long documents, knowledge organization, report analysis |
| 14 | muse-spark | Meta | 84.8 | 87 | 84 | 82 | 82 | 13 | Multimodal, vision understanding, image-text tasks |
| 15 | gemini-3.5-flash | Google | 83.9 | 82 | 82 | 91 | 76 | 14 | General Q&A, writing, knowledge organization |
| 16 | gpt-5.5-instant | OpenAI | 83.2 | 81 | 82 | 90 | 76 | not covered | General Q&A, writing, knowledge organization |
| 17 | gpt-5.5-xhigh (codex-harness) | OpenAI | 82.8 | 81 | 81 | 90 | 76 | 15 | General Q&A, writing, knowledge organization |
| 18 | gpt-5.2-chat-latest-20260210 | OpenAI | 82.7 | 82 | 78 | 90 | 76 | not covered | General Q&A, writing, knowledge organization |
| 19 | kimi-k2.6 | Moonshot AI | 82.4 | 80 | 82 | 85 | 88 | 12 | General Q&A, writing, knowledge organization |
| 20 | qwen3.6-max-preview | Alibaba | 81.2 | 78 | 78 | 88 | 82 | 17 | General Q&A, writing, knowledge organization |

Note for all rows: the auto snapshot combines LMArena Text/WebDev/Vision/Document signals. The Text rank is not covered for any model in this snapshot; the WebDev rank column shows the LMArena WebDev rank where one is available.

Trusted Sources

How to read major model benchmark sites

Arena / LMArena

Human preference

Uses anonymous pairwise voting from real users. It is useful for general chat, writing, and preference-driven quality, but a single score should not be treated as the best choice for every workflow.

General chat, Writing quality, Multimodal preference, Frontier tracking

Limitation: Preference data can be affected by sampling, traffic allocation, prompt mix, and model exposure.
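To make the pairwise-voting idea concrete, here is a generic Elo-style rating update. It is a minimal sketch for intuition only, not LMArena's published method; the model names, the starting rating of 1000, and the `k` factor are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from the loser to the winner after one vote."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    delta = k * surprise  # upsets move more points than expected wins
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"model-a": 1000.0, "model-b": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
for winner, loser in votes:
    update(ratings, winner, loser)
```

Because each update depends on which pairs were shown and in what order, the same models can end up with different ratings under a different vote sample, which is one way the sampling and exposure limitation shows up in practice.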

Artificial Analysis

Capability, speed, and cost

Tracks intelligence, throughput, latency, and pricing, making it useful for API selection, cost control, and production trade-off analysis.

API selection, Cost comparison, Latency and speed, General capability

Limitation: Composite scores cannot represent every private workflow; teams still need task-specific evaluation.
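Price, latency, and throughput become comparable once they are turned into per-request numbers for your own traffic. The sketch below shows that arithmetic; the prices, token counts, and speeds are hypothetical placeholders, not figures from Artificial Analysis, and `Pricing`, `cost_per_request`, and `response_seconds` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per 1M input tokens (hypothetical)
    output_per_mtok: float  # USD per 1M output tokens (hypothetical)


def cost_per_request(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request given token counts."""
    total = input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok
    return total / 1_000_000


def response_seconds(first_token_latency_s: float, tokens_per_second: float,
                     output_tokens: int) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest."""
    return first_token_latency_s + output_tokens / tokens_per_second


cost = cost_per_request(Pricing(2.0, 8.0), input_tokens=3000, output_tokens=500)
seconds = response_seconds(0.5, tokens_per_second=100.0, output_tokens=500)
print(cost, seconds)  # 0.01 5.5
```

Running the same calculation with your real prompt and completion lengths often changes which model looks cheapest, since output-heavy workloads weight the output price far more.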

Vals AI

Industry task evaluation

Focuses on high-value industry tasks such as finance, law, healthcare, coding, and education, with attention to documents, long context, and agentic workflows.

Finance and law, Industry documents, Long context, Agent workflows

Limitation: Some datasets and judging details are private, so it is best used as an industry signal rather than a fully reproducible experiment.

Stanford HELM

Transparent reproducible evaluation

Emphasizes transparent scenarios, metrics, and reproducible evaluation, which helps research-minded readers inspect model capability and robustness.

Research reproducibility, Capability breakdowns, Evaluation methods, Multi-metric analysis

Limitation: Updates may be slower than commercial leaderboards, so the newest models can lag behind.
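One small ingredient of reproducible evaluation is making any random choice repeatable. The sketch below draws an evaluation subset from a fixed seed so that two runs score the same items. It illustrates the general idea only and is not taken from HELM's tooling; `sample_eval_subset` and the item names are hypothetical.

```python
import random


def sample_eval_subset(items: list[str], k: int, seed: int) -> list[str]:
    """Pick k evaluation items deterministically for a given seed."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(items, k)


items = [f"question-{i}" for i in range(100)]
first_run = sample_eval_subset(items, k=10, seed=7)
second_run = sample_eval_subset(items, k=10, seed=7)
```

Publishing the seed alongside the scores lets another reader rebuild the exact subset and check the reported numbers.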

Guozhen AI Scorecard

A practical synthesis framework

30%

General intelligence

Compare reasoning, science, math, knowledge, and instruction following instead of trusting one top-ranked model.

25%

Real tasks

Prefer evidence from documents, codebases, tool use, multi-turn workflows, and long context over exam-only scores.

20%

Reliability

Check hallucination risk, format consistency, and whether the model stays coherent across long tasks.

15%

Cost and speed

For similar quality, compare input and output price, latency, throughput, and context window.

10%

Openness and control

Separate closed APIs, open weights, local deployment, compliance, and auditability.
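The "Cost and speed" rule in the scorecard (for similar quality, compare price) can be sketched as a two-step pick: shortlist everything within a tolerance of the best quality score, then take the cheapest. The tolerance of 2 points, the field names, and the candidate figures are assumptions for illustration, not values from this page.

```python
def pick(candidates: list[dict], tolerance: float = 2.0) -> dict:
    """Among candidates within `tolerance` points of the best quality, pick the cheapest."""
    if not candidates:
        raise ValueError("no candidates")
    best_quality = max(c["quality"] for c in candidates)
    shortlist = [c for c in candidates if best_quality - c["quality"] <= tolerance]
    return min(shortlist, key=lambda c: c["cost_per_request"])


# Hypothetical candidates; quality is a 0-100 score, cost is USD per request.
candidates = [
    {"name": "model-x", "quality": 90.0, "cost_per_request": 0.030},
    {"name": "model-y", "quality": 89.0, "cost_per_request": 0.010},
    {"name": "model-z", "quality": 80.0, "cost_per_request": 0.001},
]
choice = pick(candidates)
print(choice["name"])  # model-y
```

Here model-z is cheapest overall but falls outside the quality tolerance, so the slightly weaker but much cheaper model-y wins over the top-ranked model-x.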

Model Selection

Choose models by real scenario

Writing, Q&A, and knowledge organization

Start with Arena-style preference data, then check Artificial Analysis for speed and cost.

Coding, debugging, and engineering agents

Use LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.

Finance, law, healthcare, and education

Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.

Research and model capability analysis

Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.

Local deployment, private data, and compliance

Compare open weights, licenses, deployment cost, context windows, and data retention policy.

Writing, Q&A, and knowledge organization

Start with Arena preference data, then add Artificial Analysis speed and cost signals.

Weights: Arena Text/Document preference 50%, general intelligence 20%, speed and cost 20%, knowledge organization 10%.

#1
claude-opus-4-7-thinking
Anthropic
96.2

Best overall for high-quality writing, long-answer structure, complex Q&A, and document summaries.

#2
claude-opus-4-6-thinking
Anthropic
95.1

Very stable in Text and Document signals, especially for long documents and deep writing.

#3
gemini-3.1-pro-preview
Google
91.8

Strong multimodal, long-context, and information organization ability.

#4
gpt-5.5-high
OpenAI
90.7

Good structured output, production API fit, and general Q&A performance.

#5
gemini-3.5-flash
Google
84.4

Not the highest quality, but useful for fast summarization, rewriting, and lightweight Q&A.

#6
claude-opus-4-7
Anthropic
88.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#7
gemini-3-pro
Google
87.6

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#8
gpt-5.4-high
OpenAI
86.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#9
claude-sonnet-4-6
Anthropic
85.7

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#10
qwen3.7-max-20260517
Alibaba
84.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#11
deepseek-r1-202605
DeepSeek
83.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#12
kimi-k2.6
Moonshot AI
82.7

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#13
glm-5.1
Zhipu AI
81.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#14
deepseek-v3.1
DeepSeek
81.3

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#15
qwen3.7-plus
Alibaba
80.6

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#16
muse-spark
Meta
79.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#17
llama-4-maverick
Meta
78.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#18
grok-4
xAI
78.1

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#19
mistral-large-2
Mistral
76.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#20
command-r-plus-next
Cohere
75.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

Coding, debugging, and engineering agents

Prioritize LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.

Weights: Vals/SWE real tasks 40%, WebDev/Arena engineering preference 25%, agent reliability 20%, speed and cost 15%.

#1
gemini-3.1-pro-preview
Google
96.0

Strong coding, long-context, and repository-level understanding signals.

#2
gpt-5.5-high
OpenAI
95.2

Strong SWE-style repair, tool use, and production API behavior.

#3
claude-opus-4-7-thinking
Anthropic
94.5

Excellent WebDev and reasoning signal for frontend refactors and architecture analysis.

#4
qwen3.7-max-20260517
Alibaba
87.6

Notable WebDev signal and worth testing for Chinese engineering workflows.

#5
claude-sonnet-4-6
Anthropic
84.9

Balanced for code explanation, local fixes, and lighter agent workflows.

#6
claude-opus-4-6-thinking
Anthropic
84.0

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#7
gpt-5.4-high
OpenAI
83.4

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#8
claude-opus-4-7
Anthropic
82.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#9
gemini-3-pro
Google
82.1

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#10
deepseek-r1-202605
DeepSeek
81.6

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#11
deepseek-v3.1
DeepSeek
80.7

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#12
qwen3.7-plus
Alibaba
79.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#13
glm-5.1
Zhipu AI
78.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#14
kimi-k2.6
Moonshot AI
77.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#15
llama-4-maverick
Meta
76.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#16
mistral-large-2
Mistral
75.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#17
muse-spark
Meta
74.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#18
grok-4
xAI
74.1

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#19
command-r-plus-next
Cohere
73.4

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#20
yi-large-next
01.AI
72.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.
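The repeated advice above, to validate candidates on your own repository and test suite, can start as a very small harness that measures a pass rate. This is a minimal sketch: `solve` stands in for whatever calls a model and is hypothetical, as are the stub and the two toy cases. A real setup would run the repository's own tests against generated patches.

```python
from typing import Callable

Case = tuple[str, Callable[[str], bool]]  # (prompt, check on the model output)


def pass_rate(solve: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose check accepts the model's output."""
    if not cases:
        raise ValueError("no test cases")
    passed = 0
    for prompt, check in cases:
        try:
            if check(solve(prompt)):
                passed += 1
        except Exception:
            pass  # a crash in the model call or the check counts as a failure
    return passed / len(cases)


def stub_solve(prompt: str) -> str:
    """Stand-in for a model call; only knows one answer."""
    return "4" if prompt == "2+2" else "?"


cases: list[Case] = [
    ("2+2", lambda out: out == "4"),
    ("3+3", lambda out: out == "6"),
]
rate = pass_rate(stub_solve, cases)
print(rate)  # 0.5
```

Running the same cases against each shortlisted model gives a like-for-like number that reflects your codebase rather than a public leaderboard's prompt mix.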

Finance, law, healthcare, and education

Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.

Weights: Vals industry tasks 45%, long-document reasoning 25%, compliance control 15%, cost and speed 15%.

#1
claude-opus-4-7-thinking
Anthropic
95.0

Strong long-document reasoning and safer professional-answer style.

#2
gemini-3.1-pro-preview
Google
93.8

Strong long context and multimodal handling for reports and industry documents.

#3
gpt-5.5-high
OpenAI
92.9

Good tool ecosystem for knowledge bases, customer support, and internal workflow automation.

#4
claude-opus-4-6-thinking
Anthropic
91.5

Stable document reasoning for professional material review.

#5
kimi-k2.6
Moonshot AI
82.3

Worth testing for Chinese long-document and cost-sensitive industry workflows.

#6
claude-opus-4-7
Anthropic
89.9

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#7
gemini-3-pro
Google
88.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#8
gpt-5.4-high
OpenAI
87.8

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#9
qwen3.7-max-20260517
Alibaba
86.2

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#10
claude-sonnet-4-6
Anthropic
85.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#11
deepseek-r1-202605
DeepSeek
84.1

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#12
glm-5.1
Zhipu AI
83.3

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#13
deepseek-v3.1
DeepSeek
82.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#14
qwen3.7-plus
Alibaba
81.6

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#15
llama-4-maverick
Meta
80.5

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#16
mistral-large-2
Mistral
79.7

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#17
muse-spark
Meta
78.8

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#18
command-r-plus-next
Cohere
78.0

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#19
grok-4
xAI
77.1

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#20
yi-large-next
01.AI
76.2

Extended candidate for industry workflows; combine public signals with private internal evaluation.

Research, papers, and model capability analysis

Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.

Weights: transparent academic evaluation 35%, reasoning and knowledge 30%, reproducibility 20%, tools and retrieval 15%.

#1
claude-opus-4-7-thinking
Anthropic
94.2

Strong for complex reasoning, paper summaries, and long-form research analysis.

#2
gpt-5.5-high
OpenAI
93.4

Strong general knowledge, tool ecosystem, and structured analysis.

#3
gemini-3.1-pro-preview
Google
92.8

Strong long-context and multimodal analysis for papers, charts, and data materials.

#4
claude-opus-4-6-thinking
Anthropic
91.0

Stable reasoning and document comprehension for serious reading.

#5
gemini-3-pro
Google
87.1

Useful for visual and multimodal interpretation of figures and experiment materials.

#6
claude-opus-4-7
Anthropic
88.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#7
gpt-5.4-high
OpenAI
88.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#8
deepseek-r1-202605
DeepSeek
86.7

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#9
qwen3.7-max-20260517
Alibaba
85.6

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#10
claude-sonnet-4-6
Anthropic
84.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#11
llama-4-maverick
Meta
84.1

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#12
deepseek-v3.1
DeepSeek
83.3

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#13
glm-5.1
Zhipu AI
82.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#14
kimi-k2.6
Moonshot AI
81.5

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#15
qwen3.7-plus
Alibaba
80.6

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#16
mistral-large-2
Mistral
79.8

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#17
muse-spark
Meta
78.7

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#18
grok-4
xAI
77.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#19
command-r-plus-next
Cohere
77.0

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#20
yi-large-next
01.AI
76.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

Local deployment, private data, and compliance

Compare open weights, licenses, deployment cost, context windows, and data retention policy separately.

Weights: openness and deployability 35%, data control 25%, Chinese usability 15%, cost efficiency 15%, capability 10%.

#1
qwen3.7-max / Qwen open ecosystem
Alibaba
89.0

Strong Chinese ecosystem, open community, and practical private-deployment route.

#2
glm-5.1 / GLM open ecosystem
Zhipu AI
86.4

Good Chinese capability and enterprise deployment fit.

#3
kimi-k2.6 / Moonshot ecosystem
Moonshot AI
83.2

Interesting for Chinese long documents and internal knowledge Q&A tests.

#4
muse-spark / Meta open ecosystem
Meta
81.5

Strong open ecosystem, though Chinese and industry coverage need more validation.

#5
gemini-3.5-flash
Google
78.8

Not a local-first model, but useful for low-cost high-throughput workloads after data sanitization.

#6
deepseek-r1 / DeepSeek open ecosystem
DeepSeek
77.9

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#7
deepseek-v3.1 / DeepSeek ecosystem
DeepSeek
77.2

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#8
mistral-large-2 / Mistral ecosystem
Mistral
76.5

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#9
qwen3.7-plus / Qwen open ecosystem
Alibaba
75.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#10
command-r-plus-next
Cohere
74.9

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#11
yi-large-next
01.AI
74.0

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#12
baichuan-4-next
Baichuan
73.2

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#13
internlm3-latest
Shanghai AI Lab
72.6

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#14
minimax-text-01
MiniMax
71.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#15
ernie-4.5
Baidu
71.1

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#16
gemini-3.5-flash
Google
70.5

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#17
gpt-5.5-high
OpenAI
69.4

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#18
claude-sonnet-4-6
Anthropic
68.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#19
claude-opus-4-7-thinking
Anthropic
68.1

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#20
gemini-3.1-pro-preview
Google
67.6

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

Editorial note

This page does not copy external leaderboards or claim that one model is always best. Guozhen AI combines public benchmark sources, methodology differences, and practical scenarios so readers can make better 2026 model decisions.