Guozhen AI: Global AI field notes and model intelligence

English translation

I Tested Gemini-3.5 Against DeepSeek-V4 and GPT-5.5. The Result Was Unexpected

2026-05-21 | Field note #103 | Screenshots preserved from the original article

Hi, I am Guozhen.

Nearly half a year after version 3, Gemini has been upgraded to version 3.5.

So far only Gemini 3.5 Flash has been released, and it is claimed to surpass Gemini's own 3.1 Pro.

So I ran a hands-on comparison test.

1. Gemini 3.5 Flash

First, look at the model card scores.

For coding, its Terminal-bench 2.1 score reached 76.2%, two points below GPT-5.5's 78.2% and clearly above Gemini 3 Flash and Gemini 3.1 Pro.

[Screenshot]

More importantly, its agent capability looks strong.

MCP Atlas reached 83.6%, higher than GPT-5.5 and Claude Opus 4.7.

Toolathlon reached 56.5%, suggesting strong performance in MCP, multi-tool calling, and real task flows.

UI operation is also strong: OSWorld-Verified reached 78.4%, just 0.3 points behind GPT-5.5's 78.7%.

On these scores, Gemini 3.5 Flash looks like a very competitive mainstream model for agents, MCP, and real tool-use scenarios.

2. Comparison test

The plan was to fix the test environment, choose the models to compare, and then send each model's output to Gemini-3.1-Pro as the judge.

The test task was a small agent task:

I will upload an Excel file. Please read and analyze the data.
Identify fields, data types, row count, column count, and check empty values, outliers, and duplicate values.
Automatically choose fields suitable for bar charts, line charts, and pie charts.
Only output one directly runnable HTML file containing HTML/CSS/JS.
Use ECharts to draw a bar chart, line chart, and pie chart.
The page should include a data overview, three charts, and a Chinese conclusion for each chart.
Do not invent fields or values that do not exist. All conclusions must come from the Excel file.
If a chart type is not suitable, explain the reason on the page and provide an alternative chart.
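The checks this prompt asks for (row and column counts, data types, empty values, duplicates, outliers) can be sketched in a few lines. What follows is a minimal Python illustration, not what any of the tested models produced. It assumes the sheet has already been loaded into a list of dicts, treats a column as numeric only if every non-empty value is a number, and uses the common 1.5 x IQR rule for outliers. The function name `profile_rows` is hypothetical.

```python
from collections import Counter
from statistics import quantiles


def profile_rows(rows):
    """Profile a table given as a list of dicts (one dict per spreadsheet row)."""
    columns = list(rows[0].keys()) if rows else []
    report = {"row_count": len(rows), "column_count": len(columns), "columns": {}}

    # Duplicate rows: count every repeat of a row that was already seen.
    seen = Counter(tuple(row.get(c) for c in columns) for row in rows)
    report["duplicate_rows"] = sum(n - 1 for n in seen.values() if n > 1)

    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        # Assumption: a column is numeric only if all non-empty cells are numbers.
        is_numeric = bool(present) and len(numeric) == len(present)

        outliers = []
        if is_numeric and len(numeric) >= 4:
            # 1.5 x IQR rule: flag values far outside the middle 50% of the data.
            q1, _, q3 = quantiles(numeric, n=4)
            spread = q3 - q1
            low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
            outliers = [v for v in numeric if v < low or v > high]

        report["columns"][col] = {
            "type": "number" if is_numeric else "text",
            "empty": len(values) - len(present),
            "outliers": outliers,
        }
    return report
```

The task itself asked for this logic inside a single HTML file, so the models' own versions would be written in JavaScript; the Python here only illustrates what a correct data overview has to compute before any chart or conclusion can be trusted.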

The models used were Gemini-3.5-Flash, DeepSeek-V4-Flash, DeepSeek-V4-Pro, and GPT-5.5.

First, I selected Gemini-3.5-Flash:

[Screenshot]

Then I sent the small agent task to it:

[Screenshot]

I saved the generated HTML file:

[Screenshot]

Gemini-3.5-Flash result:

[Screenshot]

After uploading an Excel file, the data was displayed:

[Screenshot]

Visualization results:

[Three screenshots]

Then I sent the same small agent task to DeepSeek-V4-Flash:

[Screenshot]

After I uploaded the same Excel file to DeepSeek-V4-Flash, the result looked like this:

[Two screenshots]

DeepSeek-V4-Flash data visualizations:

[Two screenshots]

Then I sent the same task to DeepSeek-V4-Pro:

[Screenshot]

DeepSeek-V4-Pro data-analysis visualization:

[Screenshot]

DeepSeek-V4-Pro data display:

[Screenshot]

DeepSeek-V4-Pro visualizations:

[Three screenshots]

Finally, I sent the same task to GPT-5.5:

[Screenshot]

GPT-5.5 data display:

[Screenshot]

Visualization results:

[Three screenshots]

3. Judge scores

Intuition alone gives a rough sense of which result looks better. To make the evaluation more objective, I asked Gemini-3.1-Pro to judge the outputs:

[Screenshot]
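The exact judge prompt is only visible in the original screenshot, so the following is a hedged sketch of how such a prompt could be assembled rather than a record of what was sent. It assumes each model's HTML output is available as a string, and the function name `build_judge_prompt`, the section markers, and the 0 to 100 scale are all hypothetical choices for illustration.

```python
def build_judge_prompt(task, outputs, criteria):
    """Assemble one judging prompt from the task, a rubric, and each model's output.

    task:     the original task prompt given to every model
    outputs:  dict mapping a model label to the HTML text it produced
    criteria: list of rubric items the judge should score against
    """
    lines = [
        "You are judging several answers to the same task.",
        "",
        "Task:",
        task,
        "",
        "Criteria:",
    ]
    lines += [f"- {item}" for item in criteria]

    # Each answer gets its own clearly delimited section.
    for label, html in outputs.items():
        lines += ["", f"=== Answer from {label} ===", html]

    # Hypothetical scoring scale; the article does not state the one used.
    lines += ["", "Score each answer from 0 to 100 and explain the score briefly."]
    return "\n".join(lines)
```

Keeping the task text and rubric identical for every model is what makes the scores comparable. The rubric items could be taken directly from the task prompt, for example whether the page has a data overview, three charts, and conclusions that come only from the Excel file.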

The judge gave the scores:

[Screenshot]

The judge also gave detailed explanations:

[Screenshot]

Then I asked the judge to summarize the result in three sentences:

[Screenshot]

DeepSeek-V4-Pro won because of its rigorous logic and professional validation. It is the first choice for production-grade accurate reports.

Gemini-3.5-Flash ranked second because of strong fault tolerance and stability, while DeepSeek-V4-Flash became the best prototyping tool because of its excellent visual design.

GPT-5.5 fell behind because of a simple UI and weaker intelligent insights. The overall conclusion: choose Pro for precision, choose Flash for visual polish.

I was surprised that GPT-5.5 ranked last, so I tried several more times. The result stayed the same.
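Checking whether a ranking holds across reruns can be made explicit. The sketch below is a minimal illustration under stated assumptions: each rerun yields one numeric judge score per model, and the names `rank_models` and `summarize_runs` are hypothetical. It does not contain any scores from this test.

```python
from statistics import mean


def rank_models(scores):
    """Return model names ordered from highest to lowest judge score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def summarize_runs(runs):
    """Summarize repeated judge runs.

    runs: list of dicts, each mapping model name -> judge score for one run.
    """
    models = list(runs[0])
    rankings = [rank_models(run) for run in runs]
    return {
        # Average score per model across all runs.
        "mean_scores": {m: mean(run[m] for run in runs) for m in models},
        # Which model finished last in each run.
        "last_place": [ranking[-1] for ranking in rankings],
        # True only if every run produced exactly the same ordering.
        "ranking_stable": all(r == rankings[0] for r in rankings),
    }
```

If `ranking_stable` comes back true over several reruns, the ordering is at least not an artifact of one sampled response, although it still reflects a single task and a single judge.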

Final thoughts

To keep this article short, the test used only one small agent task for an initial look at Gemini 3.5 Flash's real-world performance. Overall, its completion quality was stable.

If you need a rigorous report, DeepSeek-V4-Pro is stronger. Gemini 3.5 Flash's advantage is balance, stability, and suitability for real office-automation scenarios.

GPT-5.5's poor showing on this task surprised me. Of course, this is only a small-sample test, and I will continue testing with more complex tasks later.

This English edition preserves the screenshots and workflow order from the original Chinese article.
