Benchmark tables make these four models look easy to rank, but the evidence points to different winners for different jobs. The strongest shared table has Claude Opus 4.7 ahead on GPQA Diamond and Humanity’s Last Exam without tools, GPT-5.5 ahead on Terminal-Bench 2.0, and GPT-5.5 Pro ahead on Humanity’s Last Exam with tools and BrowseComp [4]. Separate reporting adds GPT-5.5 leads over Claude on OSWorld-Verified and FrontierMath, while Claude Opus 4.7 is reported #1 in Vision & Document Arena [5][1].
The practical takeaway: use the benchmark category that matches your workload, not a single overall leaderboard.
Quick verdict
| Workload | Best-supported leader | Evidence |
|---|---|---|
| Science reasoning | Claude Opus 4.7 | 94.2% on GPQA Diamond, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4] |
| No-tools exam reasoning | Claude Opus 4.7 | 46.9% on Humanity’s Last Exam without tools [4] |
| Tool-augmented exam reasoning | GPT-5.5 Pro | 57.2% on Humanity’s Last Exam with tools [4] |
| Terminal and agentic computing | GPT-5.5 | 82.7% on Terminal-Bench 2.0 [4] |
| OS operation | GPT-5.5 | 78.7% on OSWorld-Verified versus Claude Opus 4.7 at 78.0% [5] |
| Frontier math | GPT-5.5 | 51.7% on FrontierMath Tiers 1–3 versus Claude Opus 4.7 at 43.8% [5] |
| Software engineering in the shared table | Claude Opus 4.7 | 64.3% on SWE-Bench Pro / SWE Pro [4] |
| Browsing | GPT-5.5 Pro | 90.1% on BrowseComp [4] |
| Public tool workflow / MCP | Claude Opus 4.7 | 79.1% on MCP Atlas / MCPAtlas Public [4] |
| Vision and document work | Claude Opus 4.7 | Reported #1 in Vision & Document Arena [1] |
| Cost-performance claim | DeepSeek V4 | Reported near state-of-the-art at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not expose the full pricing methodology [4] |
| Most under-comparable model | Kimi K2.6 | Useful reported scores, but fewer clean four-way comparisons against GPT-5.5, Claude Opus 4.7 and DeepSeek V4 [11][13] |
Benchmark comparison table
| Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 | Best-supported read |
|---|---|---|---|---|---|---|
| GPQA Diamond | 93.6% [4] | Not provided | 94.2% [4] | 90.1% for DeepSeek-V4-Pro-Max [4] | Not provided | Claude Opus 4.7 leads [4] |
| Humanity’s Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% for DeepSeek-V4-Pro-Max [4] | Not provided | Claude Opus 4.7 leads [4] |
| Humanity’s Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% for DeepSeek-V4-Pro-Max [4] | 54.0% in a separate Kimi comparison [13] | GPT-5.5 Pro leads the shared table [4] |
| Terminal-Bench 2.0 | 82.7% [4] | Not provided | 69.4% [4] | 67.9% for DeepSeek-V4-Pro-Max [4] | 66.7% in a separate Kimi comparison [13] | GPT-5.5 leads [4] |
| SWE-Bench Pro / SWE Pro | 58.6% [4] | Not provided | 64.3% [4] | 55.4% for DeepSeek-V4-Pro-Max [4] | 58.6% in a separate Kimi comparison [13] | Claude Opus 4.7 leads the shared table [4] |
| BrowseComp | 84.4% [4] | 90.1% [4] | 79.3% [4] | 83.4% for DeepSeek-V4-Pro-Max [4] | 83.2% in a Kimi vs DeepSeek comparison [11] | GPT-5.5 Pro leads the shared table [4] |
| MCP Atlas / MCPAtlas Public | 75.3% [4] | Not provided | 79.1% [4] | 73.6% for DeepSeek-V4-Pro-Max [4] | Not provided | Claude Opus 4.7 leads [4] |
| OSWorld-Verified | 78.7% [5] | Not provided | 78.0% [5] | Not provided | Not provided | GPT-5.5 leads Claude in the cited comparison [5] |
| FrontierMath Tiers 1–3 | 51.7% [5] | Not provided | 43.8% [5] | Not provided | Not provided | GPT-5.5 leads Claude in the cited comparison [5] |
| Vision & Document Arena | Not provided | Not provided | Reported #1 overall [1] | Not provided | Not provided | Claude Opus 4.7 has the only cited result [1] |
| AIME 2026 | Not provided | Not provided | Not provided | Not available in the cited Kimi vs DeepSeek table [11] | 96.4% in Thinking mode [11] | Useful Kimi data, not a four-way ranking [11] |
| APEX Agents | Not provided | Not provided | Not provided | Not available in the cited Kimi vs DeepSeek table [11] | 27.9% in Thinking mode [11] | Useful Kimi data, not a four-way ranking [11] |
| Context window | Not provided | Not provided | 1,000k tokens in one Artificial Analysis comparison [3] | 1,000k tokens for DeepSeek V4 Pro in the same comparison [3] | Not provided | Claude Opus 4.7 and DeepSeek V4 Pro match in that comparison [3] |
GPT-5.5 and GPT-5.5 Pro: strongest on terminal, OS, math and tool use
GPT-5.5’s clearest win in the provided evidence is Terminal-Bench 2.0, where it scores 82.7% versus Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5]. It also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% to 78.0%, and leads more clearly on FrontierMath Tiers 1–3, 51.7% to 43.8% [5].
GPT-5.5 does not lead every reasoning benchmark. In the shared VentureBeat table, Claude Opus 4.7 narrowly leads GPQA Diamond at 94.2%, while GPT-5.5 scores 93.6% and DeepSeek-V4-Pro-Max scores 90.1% [4].
GPT-5.5 Pro changes the picture when tools are allowed. It leads Humanity’s Last Exam with tools at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4]. It also leads BrowseComp at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% [4].
OpenAI’s own GPT-5.5 announcement reports GPT-5.5 at 95.0% on ARC-AGI-1 Verified and 85.0% on ARC-AGI-2 Verified, but it also notes that GPT evaluations were run with reasoning effort set to xhigh in a research environment that may differ from production ChatGPT [8]. Those ARC results are useful GPT-5.5 context, but the supplied excerpt does not include DeepSeek V4 or Kimi K2.6, so it cannot settle this four-way comparison [8].
Some GPT-5.5-only domain results are also promising: 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics [7]. Treat those as domain context rather than direct comparison evidence because the supplied excerpt does not report the same benchmarks for Claude Opus 4.7, DeepSeek V4 and Kimi K2.6 [7].
Claude Opus 4.7: strongest on cited science, no-tools reasoning and documents
Claude Opus 4.7 has the best academic-reasoning profile in the main shared table. It leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9% [4]. It also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% in that table [4].
Claude’s weaker spot in the cited data is terminal and OS-style operation. GPT-5.5 leads Claude Opus 4.7 by more than 13 points on Terminal-Bench 2.0, 82.7% to 69.4%, and also leads Claude on OSWorld-Verified and FrontierMath Tiers 1–3 in the cited GPT-5.5 comparison [4][5].
Claude has the strongest multimodal and document signal in the supplied evidence. One source reports Claude Opus 4.7 taking #1 in Vision & Document Arena, improving by 4 points over Opus 4.6 in Document Arena, and winning diagram, homework and OCR subcategories [1]. The excerpt does not provide numeric Vision & Document Arena scores for GPT-5.5, DeepSeek V4 or Kimi K2.6, so this supports Claude’s document strength but not a complete four-way multimodal ranking [1].
DeepSeek V4: competitive, cheaper on the cited claim, but not the benchmark leader here
The supplied evidence uses more than one DeepSeek label: DeepSeek-V4-Pro-Max appears in the VentureBeat benchmark table, while DeepSeek V4 Pro appears in an Artificial Analysis context-window comparison [4][3]. That matters because benchmark conclusions should not be assumed to transfer perfectly across variants.
In the main shared table, DeepSeek-V4-Pro-Max is competitive but does not lead any row. It scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public [4].
DeepSeek’s most important product claim is cost-performance. VentureBeat describes DeepSeek V4 as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5 [4]. The excerpt does not include enough detail to verify the pricing basis, workload assumptions, latency tradeoffs or token normalization behind that comparison, so teams should validate cost and quality on their own tasks before treating the ratio as universal [4].
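One lightweight way to pressure-test a cost ratio like that on your own workload is to compute cost per solved task rather than raw price per token. The sketch below is a minimal illustration only: the per-token prices, pass rates and token counts are placeholders, not values from the cited sources, and you would substitute published pricing plus the pass rates you measure yourself.

```python
# Rough cost-per-solved-task comparison on your own eval set.
# All prices and pass rates below are placeholders; substitute published
# pricing and the pass rates you measure on your own tasks.

EVALS = {
    # model: (USD per 1M input tokens, USD per 1M output tokens, measured pass rate)
    "claude-opus-4.7": (15.00, 75.00, 0.64),  # hypothetical numbers
    "gpt-5.5":         (10.00, 40.00, 0.59),  # hypothetical numbers
    "deepseek-v4":     (2.00,  8.00,  0.55),  # hypothetical numbers
}

AVG_INPUT_TOKENS = 6_000    # average prompt size per task on your workload
AVG_OUTPUT_TOKENS = 1_500   # average completion size per task

for model, (in_price, out_price, pass_rate) in EVALS.items():
    cost_per_task = (AVG_INPUT_TOKENS * in_price + AVG_OUTPUT_TOKENS * out_price) / 1_000_000
    cost_per_solved = cost_per_task / pass_rate if pass_rate else float("inf")
    print(f"{model:18s} ${cost_per_task:.4f}/task  ${cost_per_solved:.4f}/solved task")
```

A cheaper model with a lower pass rate can still win or lose on this metric, which is why the one-sixth claim needs checking against your own tasks.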
For long-context screening, Artificial Analysis lists both DeepSeek V4 Pro and Claude Opus 4.7 with 1,000k-token context windows in one comparison [3]. That supports parity for those two listed configurations, not a broader claim about every DeepSeek or Claude mode [3].
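If long context is the deciding factor, a quick sanity check is whether your largest jobs actually fit in a 1,000k-token window at all. The sketch below uses a rough characters-per-token heuristic rather than a real tokenizer, so treat the result as an estimate and confirm with each provider’s own tokenizer.

```python
# Quick estimate of whether a set of documents fits a 1,000k-token window.
# Uses the common ~4 characters-per-token heuristic for English text;
# run the provider's actual tokenizer before relying on the answer.

CONTEXT_WINDOW_TOKENS = 1_000_000
CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_window(documents: list[str], reserve_for_output: int = 8_000) -> bool:
    total = sum(estimated_tokens(d) for d in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW_TOKENS

docs = ["example document text " * 1_000]  # replace with your own documents
print(fits_in_window(docs))
```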
Kimi K2.6: promising scores, but the four-way evidence is thinner
Kimi K2.6 is the hardest model to rank cleanly against the other three because the supplied evidence is less complete as a four-way comparison. A Kimi-focused comparison reports Kimi K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 [13]. The same source says those K2.6 numbers come from a Moonshot AI official model card, but its comparison set is mainly Claude Opus 4.6 and GPT-5.4 rather than GPT-5.5, Claude Opus 4.7 and DeepSeek V4 together [13].
A separate Kimi vs DeepSeek comparison reports Kimi K2.6 at 96.4% on AIME 2026 in Thinking mode, 27.9% on APEX Agents in Thinking mode, and 83.2% on BrowseComp with Thinking mode and context management [11]. In that same excerpt, DeepSeek-V4 Pro is listed at 83.4% on BrowseComp, while DeepSeek values are not available for AIME 2026 and APEX Agents [11].
There is also a Reddit snippet claiming DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and Kimi K2.6 is #2 [18]. Because that evidence is user-generated and lacks a score table or methodology, it should not drive the main ranking [18].
Which model should you test first?
- Test Claude Opus 4.7 first for science reasoning, no-tools expert Q&A, SWE-Bench-style software engineering, MCP-style workflows, and document-heavy multimodal work; those are Claude’s strongest cited areas [4][1].
- Test GPT-5.5 first for terminal-heavy agents, OS-operation tasks, and frontier math work; it leads the cited Terminal-Bench 2.0, OSWorld-Verified and FrontierMath results [4][5].
- Test GPT-5.5 Pro first when tool-augmented reasoning or browsing is central; it leads Humanity’s Last Exam with tools and BrowseComp in the shared table [4].
- Test DeepSeek V4 first when cost-performance is the primary constraint and you can run your own quality checks; the main cited advantage is the reported near-frontier performance at about one-sixth the cost of Opus 4.7 and GPT-5.5 [4].
- Test Kimi K2.6 first if you specifically want to evaluate its reported coding, agentic and browsing scores, but run the same prompts and harnesses yourself because the available Kimi data is less complete as a direct four-way comparison [11][13].
Evidence caveats
The benchmark picture is useful, but it is not a universal leaderboard. The evidence mixes GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7 and Kimi K2.6 results from different sources and settings [3][4][5][11][13]. At least one GPT-5.5 comparison notes that benchmark values are vendor-reported, and OpenAI’s own ARC reporting says its GPT evaluations were run in a research environment with xhigh reasoning effort [5][8].
That does not make the numbers useless. It means close results should be treated as directional, while large gaps are more actionable. In the supplied evidence, the large actionable gaps are GPT-5.5’s Terminal-Bench lead, GPT-5.5’s FrontierMath lead over Claude, Claude’s Vision & Document Arena signal, and GPT-5.5 Pro’s tool-augmented HLE lead [4][5][1].
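One way to make “directional versus actionable” concrete is to look at the sampling noise implied by benchmark size. The sketch below uses a simple binomial standard error; the question count is an assumption for illustration (GPQA Diamond is commonly described as roughly 198 questions), not a figure from the cited sources.

```python
# Binomial standard error for a benchmark score: a quick check on whether
# a small gap between two models could plausibly be sampling noise.
import math

def score_margin(accuracy_pct: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a benchmark score, in percentage points."""
    p = accuracy_pct / 100.0
    se = math.sqrt(p * (1 - p) / n_questions)
    return z * se * 100

# Assumed question count for illustration; check the benchmark's own documentation.
N_GPQA_DIAMOND = 198

for model, score in [("Claude Opus 4.7", 94.2), ("GPT-5.5", 93.6)]:
    print(f"{model}: {score}% +/- {score_margin(score, N_GPQA_DIAMOND):.1f} pts")
# A 0.6-point gap sits well inside a ~3-point margin, so treat it as directional.
```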
For production decisions, the best benchmark is still your own task set: same prompts, same tool access, same context size, same latency constraints, and the same scoring rules across all candidate models.
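A minimal harness that enforces those constraints might look like the sketch below. The `call_model` function, model identifiers and grading callback are placeholders, not any vendor’s actual API; wire them to your own provider SDKs and scoring rules.

```python
# Minimal model-comparison harness: identical prompts, identical grading rule,
# one pass-rate per model. `call_model` is a placeholder to be wired to your
# own provider SDKs (OpenAI, Anthropic, DeepSeek, Moonshot, etc.).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # the same grading rule is applied to every model

def call_model(model: str, prompt: str) -> str:
    """Placeholder: replace with the real API call for each provider."""
    raise NotImplementedError

def run_eval(models: list[str], tasks: list[Task]) -> dict[str, float]:
    results = {}
    for model in models:
        passed = sum(1 for t in tasks if t.check(call_model(model, t.prompt)))
        results[model] = passed / len(tasks)
    return results

# Example usage with hypothetical model identifiers:
# scores = run_eval(["gpt-5.5", "claude-opus-4.7", "deepseek-v4", "kimi-k2.6"], my_tasks)
```

Keeping prompts, tool access and grading identical across candidates is what makes the resulting pass rates comparable, which is exactly what the mixed public benchmarks above cannot guarantee.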