| Tool-assisted exam reasoning | GPT-5.5 Pro | 57.2% on Humanity’s Last Exam with tools, ahead of Claude Opus 4.7 at 54.7% |
| Terminal and agentic computing | GPT-5.5 | 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% |
| OS operation | GPT-5.5 | 78.7% on OSWorld-Verified versus Claude Opus 4.7 at 78.0% |
| Frontier math | GPT-5.5 | 51.7% on FrontierMath Tiers 1–3 versus Claude Opus 4.7 at 43.8% |
| Software engineering in the shared table | Claude Opus 4.7 | 64.3% on SWE-Bench Pro / SWE Pro, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% |
| Browsing | GPT-5.5 Pro | 90.1% on BrowseComp, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% |
| MCP-style public tool workflow | Claude Opus 4.7 | 79.1% on MCP Atlas / MCPAtlas Public, ahead of GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% |
| Vision and document analysis | Claude Opus 4.7 | Reported #1 in Vision & Document Arena, with wins in diagram, homework, and OCR subcategories |
| Least clean four-way comparison | Kimi K2.6 | Kimi has useful reported scores, but the cited Kimi evidence is mostly separate from the GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max table |
Rows that mix sources should be read carefully. A Kimi score reported in a separate Kimi-focused comparison is useful, but it is not as strong as a result produced in the same shared table and harness as GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max.
GPT-5.5’s clearest win is Terminal-Bench 2.0: 82.7% versus Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% in the shared table. That 13.3-point lead over Claude is one of the largest gaps in the sourced benchmark set.
It also leads Claude Opus 4.7 on OSWorld-Verified, but by a narrow 78.7% to 78.0% margin. On FrontierMath Tiers 1–3, the GPT-5.5 lead is larger: 51.7% versus Claude’s 43.8%.
GPT-5.5 Pro changes the picture when tools or browsing are central. It leads Humanity’s Last Exam with tools at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2%. It also leads BrowseComp at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3%.
GPT-5.5 does not lead every reasoning test. Claude Opus 4.7 narrowly beats it on GPQA Diamond, 94.2% to 93.6%, in the shared table. A separate GPT-5.5 guide reports GPT-5.5-only domain results, including 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench. Those should not be treated as four-way wins, because the cited excerpt does not report scores on the same benchmarks for Claude Opus 4.7, DeepSeek V4, and Kimi K2.6.
Claude Opus 4.7 has the best no-tools reasoning profile in the main shared table. It leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%. It also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% in that same table.
Claude’s weaker area in the cited data is terminal-style operation. GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0, 82.7% to 69.4%, and also leads Claude on OSWorld-Verified and FrontierMath Tiers 1–3.
Claude has the strongest cited multimodal and document signal. One source reports Claude Opus 4.7 taking #1 in Vision & Document Arena, improving by 4 points over Opus 4.6 in Document Arena, and winning diagram, homework, and OCR subcategories. The same source does not provide comparable numeric Vision & Document Arena scores for GPT-5.5, DeepSeek V4, or Kimi K2.6, so this supports Claude’s document strength but not a complete four-way multimodal ranking.
The sources use more than one DeepSeek label. The shared benchmark table reports DeepSeek-V4-Pro-Max, while the Artificial Analysis comparison reports DeepSeek V4 Pro with a 1,000k-token context window. Those labels should not be treated as automatically interchangeable.
In the main shared table, DeepSeek-V4-Pro-Max is competitive but does not lead any row. It scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public.
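As a quick check on the claim that DeepSeek-V4-Pro-Max leads no row, the scores quoted in this section can be tabulated and the leader computed per benchmark. This is only a minimal sketch: `SCORES` and `row_leaders` are illustrative names, and the dictionary contains only figures quoted in the text, so a row is incomplete wherever a model's score is not given (for example, GPT-5.5 on Humanity’s Last Exam without tools).

```python
# Scores (percent) as quoted in this section. A model missing from a row
# was not quoted in the text; it is NOT a score of zero.
SCORES = {
    "GPQA Diamond": {
        "Claude Opus 4.7": 94.2, "GPT-5.5": 93.6, "DeepSeek-V4-Pro-Max": 90.1,
    },
    "HLE (no tools)": {
        "Claude Opus 4.7": 46.9, "DeepSeek-V4-Pro-Max": 37.7,
    },
    "HLE (with tools)": {
        "GPT-5.5 Pro": 57.2, "Claude Opus 4.7": 54.7,
        "GPT-5.5": 52.2, "DeepSeek-V4-Pro-Max": 48.2,
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "Claude Opus 4.7": 69.4, "DeepSeek-V4-Pro-Max": 67.9,
    },
    "SWE-Bench Pro": {
        "Claude Opus 4.7": 64.3, "GPT-5.5": 58.6, "DeepSeek-V4-Pro-Max": 55.4,
    },
    "BrowseComp": {
        "GPT-5.5 Pro": 90.1, "GPT-5.5": 84.4,
        "DeepSeek-V4-Pro-Max": 83.4, "Claude Opus 4.7": 79.3,
    },
    "MCP Atlas": {
        "Claude Opus 4.7": 79.1, "GPT-5.5": 75.3, "DeepSeek-V4-Pro-Max": 73.6,
    },
}


def row_leaders(scores):
    """Map each benchmark to the highest-scoring model among those quoted."""
    return {bench: max(row, key=row.get) for bench, row in scores.items()}


leaders = row_leaders(SCORES)
for bench, model in leaders.items():
    print(f"{bench}: {model} ({SCORES[bench][model]}%)")
```

Because the rows mix GPT-5.5 and GPT-5.5 Pro, the computed leader reflects whichever variants were quoted for that benchmark, not a like-for-like comparison.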
DeepSeek’s strongest cited product claim is cost-performance rather than a category win. VentureBeat describes DeepSeek V4 as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5. That is a reason to test it for cost-sensitive workloads, not a reason to skip workload-level validation.
For long-context screening, one Artificial Analysis comparison lists both DeepSeek V4 Pro and Claude Opus 4.7 at 1,000k-token context windows. That supports parity for those listed configurations, not a broader claim about every DeepSeek or Claude mode.
Kimi K2.6 is the hardest model to rank cleanly in this set because it is not included in the main shared table against GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max. A Kimi-focused comparison reports K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6. That source says the K2.6 numbers come from a Moonshot AI official model card, but the comparison set is mainly Claude Opus 4.6 and GPT-5.4 rather than the exact four-way lineup here.
A separate Kimi vs DeepSeek comparison reports Kimi K2.6 at 96.4% on AIME 2026 in Thinking mode, 27.9% on APEX Agents in Thinking mode, and 83.2% on BrowseComp with Thinking mode and context management. In that same source, DeepSeek-V4 Pro is listed at 83.4% on BrowseComp, while DeepSeek values are not available for AIME 2026 and APEX Agents.
That makes Kimi worth testing, especially for coding, agentic, and browsing workloads, but the sourced material does not support a clean overall ranking against GPT-5.5 and Claude Opus 4.7 across the same benchmark suite.
This is not a universal leaderboard. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6. Some results are also vendor-reported, and OpenAI notes that its GPT evaluations for ARC were run with reasoning effort set to xhigh in a research environment that may differ from production ChatGPT.
Close margins should be treated as directional. Claude’s GPQA Diamond lead over GPT-5.5 is 0.6 points, and GPT-5.5’s OSWorld-Verified lead over Claude is 0.7 points. Larger gaps are more actionable: GPT-5.5’s Terminal-Bench 2.0 lead over Claude is more than 13 points, and its FrontierMath lead over Claude is 7.9 points.
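The margin arithmetic behind that distinction can be sketched in a few lines. The 1.0-point cutoff below is a hypothetical threshold chosen only for illustration (the sources define no such cutoff), and `PAIRS` and `classify` are illustrative names. Any cutoff between 0.7 and 7.9 points would sort these four results the same way.

```python
# Head-to-head scores (percent) quoted in this section:
# benchmark -> (leader, leader score, runner-up score).
PAIRS = {
    "GPQA Diamond": ("Claude Opus 4.7", 94.2, 93.6),
    "OSWorld-Verified": ("GPT-5.5", 78.7, 78.0),
    "Terminal-Bench 2.0": ("GPT-5.5", 82.7, 69.4),
    "FrontierMath Tiers 1-3": ("GPT-5.5", 51.7, 43.8),
}

# Hypothetical cutoff (an assumption, not from the sources): margins below
# this many points are treated as directional only.
DIRECTIONAL_BELOW = 1.0


def classify(pairs, cutoff=DIRECTIONAL_BELOW):
    """Return benchmark -> (leader, margin in points, label)."""
    out = {}
    for bench, (leader, top, second) in pairs.items():
        margin = round(top - second, 1)  # round away float noise
        label = "directional" if margin < cutoff else "actionable"
        out[bench] = (leader, margin, label)
    return out


margins = classify(PAIRS)
for bench, (leader, margin, label) in margins.items():
    print(f"{bench}: {leader} +{margin} pts ({label})")
```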
The practical bottom line: there is no single winner across GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6. Use the benchmark category that maps to your real workload, then rerun the same evaluation across the models you can actually deploy.
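One way to act on that advice is a small harness that runs the same task set through every candidate model and scores each one identically. The sketch below is an illustration under stated assumptions: `evaluate`, the exact-match scoring rule, the sample tasks, and the stub models are all hypothetical, and a real run would replace each stub with a call to the relevant provider's client.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]  # (prompt, expected answer)


def evaluate(
    models: Dict[str, Callable[[str], str]], tasks: List[Task]
) -> Dict[str, float]:
    """Return each model's exact-match accuracy (percent) on the same tasks."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    results = {}
    for name, ask in models.items():
        correct = sum(ask(prompt).strip() == expected for prompt, expected in tasks)
        results[name] = 100.0 * correct / len(tasks)
    return results


# Hypothetical tasks and stub models standing in for real API calls.
tasks = [("2+2", "4"), ("capital of France", "Paris")]
models = {
    "model-a": lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, ""),
    "model-b": lambda p: "4",  # always answers "4"
}

report = evaluate(models, tasks)
print(report)
```

Keeping the task list and the scoring rule fixed across models is the point: it avoids the mixed-harness problem that makes the Kimi K2.6 numbers hard to compare.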