No single model wins the sourced benchmark set: Claude Opus 4.7 leads GPQA Diamond at 94.2% and no-tools Humanity’s Last Exam at 46.9%, GPT-5.5 leads Terminal-Bench 2.0 at 82.7%, and GPT-5.5 Pro leads tool-assisted Humanity’s Last Exam at 57.2% [4][5]. DeepSeek V4 Pro Max is competitive in the shared table but does not lead any listed row; its biggest cited advantage is VentureBeat’s cost-performance framing at about one-sixth the cost of Opus 4.7 and GPT-5.5 [4].

Benchmark tables make this four-way matchup look easier than it is. The strongest shared table covers GPT-5.5, GPT-5.5 Pro where available, Claude Opus 4.7, and DeepSeek-V4-Pro-Max; Kimi K2.6 appears mainly in separate comparisons, so it is less cleanly comparable across every category [4][11][13]. The right conclusion is category-specific: pick the benchmark that resembles your workload, then test the finalists on your own prompts.
| Workload | Best-supported pick | Why |
|---|---|---|
| Science reasoning | Claude Opus 4.7 | 94.2% on GPQA Diamond, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4] |
| No-tools expert reasoning | Claude Opus 4.7 | 46.9% on Humanity’s Last Exam without tools, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% [4] |
| Tool-assisted exam reasoning | GPT-5.5 Pro | 57.2% on Humanity’s Last Exam with tools, ahead of Claude Opus 4.7 at 54.7% [4] |
| Terminal and agentic computing | GPT-5.5 | 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5] |
| OS operation | GPT-5.5 | 78.7% on OSWorld-Verified versus Claude Opus 4.7 at 78.0% [5] |
| Frontier math | GPT-5.5 | 51.7% on FrontierMath Tiers 1–3 versus Claude Opus 4.7 at 43.8% [5] |
| Software engineering in the shared table | Claude Opus 4.7 | 64.3% on SWE-Bench Pro / SWE Pro, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% [4] |
| Browsing | GPT-5.5 Pro | 90.1% on BrowseComp, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% [4] |
| MCP-style public tool workflow | Claude Opus 4.7 | 79.1% on MCP Atlas / MCPAtlas Public, ahead of GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% [4] |
| Vision and document analysis | Claude Opus 4.7 | Reported #1 in Vision & Document Arena, with wins in diagram, homework, and OCR subcategories [1] |
| Cost-sensitive evaluation | DeepSeek V4 | VentureBeat reports near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but that cost claim should be validated on your own workload [4] |
| Least clean four-way comparison | Kimi K2.6 | Kimi has useful reported scores, but the cited Kimi evidence is mostly separate from the GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max table [11][13] |

Treat close wins as directional because the sources mix base and Pro modes, DeepSeek variants, separate Kimi comparisons, and vendor-reported or research-environment settings [3][5][8][11][13].
| Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 / V4 Pro Max | Kimi K2.6 | Best-supported read |
|---|---|---|---|---|---|---|
| GPQA Diamond | 93.6% [4] | Not reported | 94.2% [4] | 90.1% for DeepSeek-V4-Pro-Max [4] | Not reported | Claude leads the shared table [4] |
| Humanity’s Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% for DeepSeek-V4-Pro-Max [4] | Not reported | Claude leads the shared table [4] |
| Humanity’s Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% for DeepSeek-V4-Pro-Max [4] | 54.0% in a separate Kimi comparison [13] | GPT-5.5 Pro leads the shared table [4] |
| Terminal-Bench 2.0 | 82.7% [4][5] | Not reported | 69.4% [4] | 67.9% for DeepSeek-V4-Pro-Max [4] | 66.7% in a separate Kimi comparison [13] | GPT-5.5 leads [4][5] |
| SWE-Bench Pro / SWE Pro | 58.6% [4] | Not reported | 64.3% [4] | 55.4% for DeepSeek-V4-Pro-Max [4] | 58.6% in a separate Kimi comparison [13] | Claude leads the shared table [4] |
| BrowseComp | 84.4% [4] | 90.1% [4] | 79.3% [4] | 83.4% for DeepSeek-V4-Pro-Max [4] | 83.2% in a Kimi vs DeepSeek comparison [11] | GPT-5.5 Pro leads the shared table [4] |
| MCP Atlas / MCPAtlas Public | 75.3% [4] | Not reported | 79.1% [4] | 73.6% for DeepSeek-V4-Pro-Max [4] | Not reported | Claude leads [4] |
| OSWorld-Verified | 78.7% [5] | Not reported | 78.0% [5] | Not reported | Not reported | GPT-5.5 leads Claude by a small margin [5] |
| FrontierMath Tiers 1–3 | 51.7% [5] | Not reported | 43.8% [5] | Not reported | Not reported | GPT-5.5 leads Claude [5] |
| Vision & Document Arena | Not reported | Not reported | Reported #1 overall [1] | Not reported | Not reported | Claude has the only cited result [1] |
| AIME 2026 | Not reported | Not reported | Not reported | Not available in the cited Kimi vs DeepSeek table [11] | 96.4% in Thinking mode [11] | Useful Kimi signal, not a four-way ranking [11] |
| APEX Agents | Not reported | Not reported | Not reported | Not available in the cited Kimi vs DeepSeek table [11] | 27.9% in Thinking mode [11] | Useful Kimi signal, not a four-way ranking [11] |
| Context window | Not reported | Not reported | 1,000k tokens in one Artificial Analysis comparison [3] | 1,000k tokens for DeepSeek V4 Pro in the same comparison [3] | Not reported | Claude and DeepSeek V4 Pro match in that comparison [3] |
Rows that mix sources should be read carefully. A Kimi score reported in a separate Kimi-focused comparison is useful, but it is not as strong as a result produced in the same shared table and harness as GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max [4][11][13].
GPT-5.5’s clearest win is Terminal-Bench 2.0: 82.7% versus Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% in the shared table [4][5]. That is one of the largest gaps in the sourced benchmark set.
It also leads Claude Opus 4.7 on OSWorld-Verified, but by a narrow 78.7% to 78.0% margin [5]. On FrontierMath Tiers 1–3, the GPT-5.5 lead is larger: 51.7% versus Claude’s 43.8% [5].
GPT-5.5 Pro changes the picture when tools or browsing are central. It leads Humanity’s Last Exam with tools at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4]. It also leads BrowseComp at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% [4].
GPT-5.5 does not lead every reasoning test. Claude Opus 4.7 narrowly beats it on GPQA Diamond, 94.2% to 93.6%, in the shared table [4]. A separate GPT-5.5 guide reports GPT-5.5-only domain results including 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench, but those should not be treated as four-way wins because the cited excerpt does not report the same scores for Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 [7].
Claude Opus 4.7 has the best no-tools reasoning profile in the main shared table. It leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9% [4]. It also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% in that same table [4].
Claude’s weaker area in the cited data is terminal-style operation. GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0, 82.7% to 69.4%, and also leads Claude on OSWorld-Verified and FrontierMath Tiers 1–3 [4][5].
Claude has the strongest cited multimodal and document signal. One source reports Claude Opus 4.7 taking #1 in Vision & Document Arena, improving by 4 points over Opus 4.6 in Document Arena, and winning diagram, homework, and OCR subcategories [1]. The same source does not provide comparable numeric Vision & Document Arena scores for GPT-5.5, DeepSeek V4, or Kimi K2.6, so this supports Claude’s document strength but not a complete four-way multimodal ranking [1].
The sources use more than one DeepSeek label. The shared benchmark table reports DeepSeek-V4-Pro-Max, while the Artificial Analysis comparison reports DeepSeek V4 Pro with a 1,000k-token context window [4][3]. Those labels should not be treated as automatically interchangeable.
In the main shared table, DeepSeek-V4-Pro-Max is competitive but does not lead any row. It scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public [4].
DeepSeek’s strongest cited product claim is cost-performance rather than a category win. VentureBeat describes DeepSeek V4 as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5 [4]. That is a reason to test it for cost-sensitive workloads, not a reason to skip workload-level validation.
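One way to run that validation is to convert the cost claim into cost per solved task: what you pay per attempt divided by how often the model succeeds on your workload. The sketch below is illustrative only; every price and success rate in it is a placeholder assumption, not a sourced benchmark figure.

```python
# Cost per solved task = cost per attempt / success rate.
# A cheaper model with a lower success rate can still win or lose on this
# metric; plug in your own measured numbers. All values below are
# illustrative placeholders, not sourced prices or scores.

def cost_per_solved_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one successful completion."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical: a frontier model at $0.60/task with 64% success,
# versus a budget model at $0.10/task (roughly one-sixth) with 55% success.
frontier = cost_per_solved_task(0.60, 0.64)  # ~$0.94 per solve
budget = cost_per_solved_task(0.10, 0.55)    # ~$0.18 per solve
print(f"frontier: ${frontier:.2f}/solve, budget: ${budget:.2f}/solve")
```

On these made-up numbers the one-sixth sticker price survives the quality adjustment, but a larger accuracy gap on your tasks could erase it, which is why the per-workload check matters.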
For long-context screening, one Artificial Analysis comparison lists both DeepSeek V4 Pro and Claude Opus 4.7 at 1,000k-token context windows [3]. That supports parity for those listed configurations, not a broader claim about every DeepSeek or Claude mode [3].
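Before weighting this row heavily, it is worth estimating whether your corpus actually needs a window that size. A minimal sketch using the rough 4-characters-per-token heuristic for English text; the ratio varies by tokenizer and language, so this is an estimate, not a measurement:

```python
# Rough fit check for a long-context window. The 4-chars-per-token ratio is
# a common English-text heuristic, not a tokenizer measurement; use the
# model's real tokenizer for anything close to the limit.

CHARS_PER_TOKEN = 4.0          # heuristic; varies by tokenizer and language
CONTEXT_WINDOW = 1_000_000     # 1,000k tokens, as listed in the comparison

def estimated_tokens(text: str) -> int:
    return int(len(text) / CHARS_PER_TOKEN)

def fits(text: str, reserve_for_output: int = 8_000) -> bool:
    """Leave headroom for the prompt scaffold and the model's reply."""
    return estimated_tokens(text) + reserve_for_output <= CONTEXT_WINDOW

corpus = "example document " * 50_000  # stand-in for your real corpus
print(estimated_tokens(corpus), fits(corpus))
```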
Kimi K2.6 is the hardest model to rank cleanly in this set because it is not included in the main shared table against GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max [4]. A Kimi-focused comparison reports K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 [13]. That source says the K2.6 numbers come from a Moonshot AI official model card, but the comparison set is mainly Claude Opus 4.6 and GPT-5.4 rather than the exact four-way lineup here [13].
A separate Kimi vs DeepSeek comparison reports Kimi K2.6 at 96.4% on AIME 2026 in Thinking mode, 27.9% on APEX Agents in Thinking mode, and 83.2% on BrowseComp with Thinking mode and context management [11]. In that same source, DeepSeek-V4 Pro is listed at 83.4% on BrowseComp, while DeepSeek values are not available for AIME 2026 and APEX Agents [11].
That makes Kimi worth testing, especially for coding, agentic, and browsing workloads, but the sourced material does not support a clean overall ranking against GPT-5.5 and Claude Opus 4.7 across the same benchmark suite [11][13].
This is not a universal leaderboard. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6 [3][4][11][13]. Some results are also vendor-reported, and OpenAI notes that its GPT evaluations for ARC were run with reasoning effort set to xhigh in a research environment that may differ from production ChatGPT [5][8].
Close margins should be treated as directional. Claude’s GPQA Diamond lead over GPT-5.5 is 0.6 points, and GPT-5.5’s OSWorld-Verified lead over Claude is 0.7 points [4][5]. Larger gaps are more actionable: GPT-5.5’s Terminal-Bench 2.0 lead over Claude is more than 13 points, and its FrontierMath lead over Claude is 7.9 points [5].
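A quick way to see why sub-point margins are directional: on a fixed-size test set, an accuracy score carries binomial sampling noise. A minimal sketch of a normal-approximation 95% interval; the question count below is a placeholder assumption, not a sourced set size:

```python
import math

# 95% normal-approximation confidence interval for a benchmark accuracy,
# treating the score as a binomial proportion over n questions.
# Ignores prompt/harness variance, which usually adds further noise.

def accuracy_ci(score: float, n_questions: int, z: float = 1.96):
    se = math.sqrt(score * (1 - score) / n_questions)
    return score - z * se, score + z * se

# Placeholder set size: with a few hundred questions, a 0.6-point gap
# (e.g. 94.2% vs 93.6%) sits well inside overlapping intervals.
for score in (0.942, 0.936):
    lo, hi = accuracy_ci(score, n_questions=200)
    print(f"{score:.1%}: [{lo:.1%}, {hi:.1%}]")
```

On 200 questions each interval spans roughly plus or minus 3 points, so a 0.6-point gap is noise-level; a 13-point gap is not.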
The practical bottom line: there is no single winner across GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6. Use the benchmark category that maps to your real workload, then rerun the same evaluation across the models you can actually deploy.
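A minimal shape for that rerun, as a sketch: the model adapters and checkers below are hypothetical stand-ins (no real vendor SDK calls), and the point is only that every model sees identical prompts, settings, and scoring.

```python
# Minimal same-harness comparison loop. Each value in `models` is a
# hypothetical adapter you implement per provider; nothing here is a real
# vendor SDK call.
from typing import Callable

def evaluate(models: dict[str, Callable[[str], str]],
             cases: list[tuple[str, Callable[[str], bool]]]) -> dict[str, float]:
    """Run every model on the same cases and return pass rates."""
    results = {}
    for name, call_model in models.items():
        passed = sum(1 for prompt, check in cases if check(call_model(prompt)))
        results[name] = passed / len(cases)
    return results

# Hypothetical usage: same prompts, same checkers, for each finalist.
cases = [
    ("Return the SQL to count rows in table t.", lambda out: "count(" in out.lower()),
    ("Name the HTTP status code for 'not found'.", lambda out: "404" in out),
]
models = {
    "model-a": lambda p: "SELECT COUNT(*) FROM t;",  # stub adapters for the demo
    "model-b": lambda p: "404",
}
print(evaluate(models, cases))
```

Keeping the harness identical across models is the same discipline the shared-table caveat above is about: scores from different harnesses are not directly comparable.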
Domain-specific benchmark results reported for GPT-5.5 [7]:

| Benchmark | GPT-5.5 | Notes |
|---|---|---|
| Harvey BigLaw Bench | 91.7% (43% perfect scores) | Legal reasoning, audience calibration |
| Internal Investment Banking | 88.5% | Financial analysis tasks |
| BixBench (bioinformatics) | 80.5% (up from 74.0%) | +6.5 pts |
Abstract reasoning results from the OpenAI-reported set [5][8]:

| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| ARC-AGI-1 (Verified) | 95.0% | 93.7% | - | 94.5% | 93.5% | 98.0% |
| ARC-AGI-2 (Verified) | 85.0% | 73.3% | - | 83.3% | 75.8% | 77.1% |

Evals of GPT were run with reasoning effort set to xhigh and were conducted in a research environment that may differ from production ChatGPT.
From the Kimi vs DeepSeek comparison [11]:

| Benchmark | Kimi K2.6 | DeepSeek-V4 Pro |
|---|---|---|
| AIME 2026 (contest-level math) | 96.4% (Thinking mode) | Not available |
| APEX Agents (long-horizon agent evaluation) | 27.9% (Thinking mode) | Not available |
| BrowseComp | 83.2% (Thinking mode with context management) | 83.4% |
From the Kimi-focused comparison, against Claude Opus 4.6 and GPT-5.4 [13]:

| Benchmark | K2.6 | Claude Opus 4.6 | GPT-5.4 | Notes |
|---|---|---|---|---|
| SWE-Bench Pro | 58.6% | 53.4% | 57.7% | Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% |
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% | Tight cluster; Opus 4.7 now leads at 87.6% |
| Terminal-Bench 2.0 | 66.7% | Not shown | Not shown | K2.6 score as cited in the same source [13] |