A dash means the score was not found in the cited sources for that model, not that the model scored zero. The GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max rows mostly come from one shared comparison; Kimi K2.6 figures come from separate Kimi sources.
OpenAI describes GPT-5.5 as built for complex tasks such as coding, research and data analysis. In the shared VentureBeat comparison, GPT-5.5 posts 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9%. It also scores 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro and 84.4% on BrowseComp in that table.
The main caveat is that GPT-5.5 Pro is a separate comparison point. In the same shared table, GPT-5.5 Pro reaches 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but those numbers should not be merged with base GPT-5.5 when comparing cost, latency or model settings.
For procurement context, BenchLM lists GPT-5.5 with a 1M-token context window, while one pricing report lists GPT-5.5 at $5 per million input tokens and $30 per million output tokens. Treat that pricing as a signal to verify against current provider pricing before budgeting.
Claude Opus 4.7 has the strongest cited software-repair signals in this group. LLM Stats lists it at 87.6% on SWE-Bench Verified, and the shared comparison reports 64.3% on SWE-Bench Pro. It also leads three rows of the shared comparison: GPQA Diamond at 94.2%, Humanity’s Last Exam without tools at 46.9% and MCP Atlas at 79.1%.
LLM Stats reports a 1M-token context window and $5/$25 per million-token pricing for Claude Opus 4.7. The comparability caveat is important: Anthropic notes that some benchmark results used internal implementations or updated harness parameters, and that some scores are not directly comparable to public leaderboard scores.
Kimi K2.6 is the strongest open-weight candidate in the cited material. Release coverage describes it as an open-weight 1T-parameter MoE model with 32B active parameters, 384 experts, native multimodality, INT4 quantization and 256K context. Its Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0 and 89.6 on LiveCodeBench v6.
The same release coverage reports 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp for Kimi K2.6. LLM Stats lists Kimi K2.6 with 262K context, $0.95/$4.00 in its price columns and an Open Source label. The limitation is that Kimi’s figures do not come from the same shared table as GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max, so close score differences should be treated as prompts for testing rather than definitive wins.
DeepSeek-V4-Pro-Max looks like the value candidate rather than the clear all-around benchmark leader. LLM Stats lists it with 1.6T size, 1M context, 80.6% on SWE-Bench Verified and $1.74/$3.48 in its cost columns. In the shared comparison, it scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp and 73.6% on MCP Atlas.
Those numbers make DeepSeek-V4-Pro-Max worth testing for cost-sensitive workloads. But the same shared table shows GPT-5.5, GPT-5.5 Pro or Claude Opus 4.7 leading most of the reported benchmark rows, so DeepSeek should be validated on your own tasks before replacing a premium model in production.
Pricing and context windows are not always reported by the same source or provider. Use these as procurement signals, not final quotes.
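To see how the cited prices translate into a budget estimate, here is a minimal Python sketch. It assumes that every cited price pair is in USD per million input and output tokens (the sources state this unit only for GPT-5.5 and Claude Opus 4.7), and the monthly token volumes are hypothetical placeholders rather than figures from any source.

```python
# Cited list prices as (input, output) in USD per million tokens.
# Assumption: the Kimi and DeepSeek price columns use the same unit.
PRICES_PER_MILLION = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the monthly spend in USD for a given token volume."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price


# Hypothetical workload: 200M input tokens and 40M output tokens per month.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 40_000_000

ESTIMATES = {
    name: monthly_cost(name, INPUT_TOKENS, OUTPUT_TOKENS)
    for name in PRICES_PER_MILLION
}

if __name__ == "__main__":
    for name, cost in sorted(ESTIMATES.items(), key=lambda item: item[1]):
        print(f"{name}: ${cost:,.2f}")
```

Because output tokens cost several times more than input tokens at every cited price point, the input-to-output ratio of your workload can matter as much as the headline price.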
The cited rows measure different skills. GPQA Diamond and Humanity’s Last Exam emphasize hard reasoning, Terminal-Bench 2.0 and SWE-Bench variants emphasize coding and agentic software work, and BrowseComp measures browsing-style retrieval performance in the shared comparison. A model can lead one row and trail another because the task, tool access and evaluation harness differ.
Even the same named benchmark can vary by implementation. LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, while LMCouncil lists Claude Opus 4.7 at 83.5% ± 1.7 under its setup. Anthropic also states that some of its results used internal implementations or updated harness parameters, limiting direct comparability with public leaderboard scores.
That is why one- or two-point gaps should not decide a production rollout by themselves. Public benchmarks are best used to narrow the shortlist; your own evaluation should make the final call.
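One rough way to apply this is to compare each score gap against the spread already seen between harnesses. The sketch below uses the 4.1-point difference between the two cited SWE-Bench Verified figures for Claude Opus 4.7 as a noise margin. That choice is an illustrative assumption, not a statistical test, and `gap_is_decisive` is a hypothetical helper name.

```python
# Spread between two cited results for the same model on the same benchmark:
# Claude Opus 4.7 on SWE-Bench Verified, 87.6 (LLM Stats) vs 83.5 (LMCouncil).
CROSS_HARNESS_SPREAD = round(87.6 - 83.5, 1)


def gap_is_decisive(score_a: float, score_b: float,
                    margin: float = CROSS_HARNESS_SPREAD) -> bool:
    """Return True when the gap exceeds the chosen noise margin (a heuristic)."""
    return abs(score_a - score_b) > margin


# Terminal-Bench 2.0: GPT-5.5 (82.7) vs Claude Opus 4.7 (69.4).
TERMINAL_BENCH_DECISIVE = gap_is_decisive(82.7, 69.4)

# GPQA Diamond: Claude Opus 4.7 (94.2) vs GPT-5.5 (93.6).
GPQA_DECISIVE = gap_is_decisive(94.2, 93.6)

if __name__ == "__main__":
    print("Terminal-Bench 2.0 gap decisive:", TERMINAL_BENCH_DECISIVE)
    print("GPQA Diamond gap decisive:", GPQA_DECISIVE)
```

Under this margin, the Terminal-Bench 2.0 gap stands out while the GPQA Diamond gap does not. A stricter or looser margin may suit your risk tolerance better.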
Before committing to one model, test the top two or three candidates on tasks that resemble your actual workload.
If you want the highest-end shortlist, test GPT-5.5 and Claude Opus 4.7 side by side: GPT-5.5 has the strongest cited Terminal-Bench 2.0 result, while Claude Opus 4.7 has the strongest cited SWE-Bench Pro and SWE-Bench Verified results. If you need open weights, start with Kimi K2.6. If cost is the constraint, include DeepSeek-V4-Pro-Max, but validate it on your own workload before treating it as a drop-in replacement for the premium options.
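A side-by-side test on your own tasks can start very small. The following Python sketch is one possible shape for such a harness, not a reference implementation: the task format, the `pass_rate` and `compare` helpers, and the stand-in model functions are all hypothetical, and real use would replace the stand-ins with calls to each provider's client.

```python
from typing import Callable

# Hypothetical task format: a prompt plus a checker that returns True on success.
Task = tuple[str, Callable[[str], bool]]
Model = Callable[[str], str]


def pass_rate(model: Model, tasks: list[Task]) -> float:
    """Fraction of tasks whose model output passes its checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)


def compare(models: dict[str, Model], tasks: list[Task]) -> list[tuple[str, float]]:
    """Rank candidates by pass rate on the same task set, best first."""
    scores = [(name, pass_rate(fn, tasks)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-in "models" so the sketch runs without any API access.
def candidate_a(prompt: str) -> str:
    return prompt.upper()


def candidate_b(prompt: str) -> str:
    return prompt


# Toy tasks; real ones should mirror your actual workload.
TASKS: list[Task] = [
    ("hello", lambda out: out == "HELLO"),
    ("abc", lambda out: out.isupper()),
]

RANKING = compare({"candidate_a": candidate_a, "candidate_b": candidate_b}, TASKS)

if __name__ == "__main__":
    for name, rate in RANKING:
        print(f"{name}: {rate:.0%}")
```

Running every candidate on the same task set with the same checkers avoids the cross-harness comparability problem that affects the public numbers.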