That separation matters because some older headline tests are running out of room at the top. Nanonets explains that MMLU is calculated with 5-shot prompting and that, by 2026, top models are clustered above roughly 88%, which makes the benchmark less useful for separating frontier models. In other words, a one-point edge on a saturated test may matter less than a clear lead on the benchmark closest to your workflow.
Among the three models with BenchLM pages in this source set, Claude Opus 4.7 posts the strongest overall score. BenchLM lists it at 97/100, #2 of 110 on the provisional leaderboard and #2 of 14 on the verified leaderboard.
GPT-5.5 is also in the top tier on BenchLM, but its displayed overall score is lower: 89/100, #5 of 112 on the provisional leaderboard and #2 of 16 on the verified leaderboard. Kimi K2.6 appears lower still on the same site, at 85/100, #12 of 115, with 27 published benchmark scores shown.
That does not make BenchLM a universal ranking of all four models. The comparison pools are not identical, and the DeepSeek numbers available in this source set come primarily from separate DeepSeek V4-Pro-Max reporting rather than an equivalent BenchLM entry.
For software engineering, Claude Opus 4.7 has the most direct public evidence here. MindStudio reports Claude Opus 4.7 at 82.4% on SWE-bench Verified, roughly 11 points higher than Opus 4.6, and presents SWE-bench Verified as the most meaningful coding benchmark in that breakdown. The same source reports FinanceBench at 82.7% and a 9.5-point improvement on MathVista for visual math reasoning.
GPT-5.5 may still be relevant to engineering teams, but the OpenAI figures included here emphasize professional work and computer-use agents rather than a headline SWE-bench score: GDPval 84.9%, OSWorld-Verified 78.7% and Tau2-bench Telecom 98.0%.
For Kimi K2.6, GMI Cloud says the model tops SWE-Bench Pro and can run 300 parallel sub-agents on 4x H100S, but the provided snippet does not give a precise score or a same-conditions comparison against the other three models. For DeepSeek V4, the most concrete numbers in this source set are stronger on reasoning and math than on coding.
If your use case is knowledge work, desktop or browser control, or customer-service automation, GPT-5.5’s official benchmark set is especially relevant. OpenAI says GPT-5.5 scores 84.9% on GDPval, a benchmark for producing well-specified knowledge-work outputs across 44 occupations. It also reports 78.7% on OSWorld-Verified, which tests whether a model can operate real computer environments, and 98.0% on Tau2-bench Telecom, which measures complex customer-service workflows without prompt tuning.
Claude Opus 4.7 has agentic evidence too, but it is in a different format. Anthropic says its internal research-agent benchmark gives Claude Opus 4.7 a tied top overall score of 0.715 across six modules, with a General Finance score of 0.813 versus 0.767 for Opus 4.6.
Those are useful signals, but they are not directly comparable scores. GPT-5.5’s 84.9% on GDPval and Claude Opus 4.7’s 0.715 on Anthropic’s internal research-agent benchmark are not on the same scale, do not cover the same task set and do not support a simple head-to-head ranking.
The clearest DeepSeek V4 numbers here are for the V4-Pro-Max setting. DataCamp, citing DeepSeek internal results, reports 87.5% on MMLU-Pro, 90.1% on GPQA Diamond and 92.6% on GSM8K for math. These are strong reference points, but the internal-results label matters when comparing them with third-party leaderboards.
A Hugging Face table for DeepSeek-V4-Pro puts DeepSeek V4-Pro-Max and Kimi K2.6 Thinking side by side on several knowledge and reasoning rows.
| Benchmark | DeepSeek V4-Pro-Max | Kimi K2.6 Thinking | Higher value in this table |
|---|---|---|---|
| MMLU-Pro | 87.5 | 87.1 | DeepSeek V4-Pro-Max |
| SimpleQA-Verified | 57.9 | 36.9 | DeepSeek V4-Pro-Max |
| Chinese-SimpleQA | 84.4 | 75.9 | DeepSeek V4-Pro-Max |
| GPQA Diamond | 90.1 | 90.5 | Kimi K2.6 Thinking |
| HLE | 37.7 | 36.4 | DeepSeek V4-Pro-Max |
On this table alone, DeepSeek V4-Pro-Max leads Kimi K2.6 Thinking on MMLU-Pro, SimpleQA-Verified, Chinese-SimpleQA and HLE, while Kimi is slightly higher on GPQA Diamond. The table does not settle the four-way race, however, because its other comparison columns are Opus-4.6 Max and GPT-5.4 xHigh rather than Claude Opus 4.7 and GPT-5.5.
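To make the size of each gap explicit, the rows above can be compared programmatically. The following is a minimal sketch: the scores are copied from the table, while `ROWS` and `row_summary` are illustrative names and not part of any cited source.

```python
# Scores copied from the table above, in the order
# (DeepSeek V4-Pro-Max, Kimi K2.6 Thinking). Names are illustrative.
ROWS = {
    "MMLU-Pro": (87.5, 87.1),
    "SimpleQA-Verified": (57.9, 36.9),
    "Chinese-SimpleQA": (84.4, 75.9),
    "GPQA Diamond": (90.1, 90.5),
    "HLE": (37.7, 36.4),
}


def row_summary(rows):
    """Return {benchmark: (leader, margin_in_points)} for each table row."""
    summary = {}
    for name, (deepseek, kimi) in rows.items():
        if deepseek > kimi:
            leader = "DeepSeek V4-Pro-Max"
        elif kimi > deepseek:
            leader = "Kimi K2.6 Thinking"
        else:
            leader = "tie"
        # Round to one decimal to match the precision of the table.
        summary[name] = (leader, round(abs(deepseek - kimi), 1))
    return summary


if __name__ == "__main__":
    for name, (leader, margin) in row_summary(ROWS).items():
        print(f"{name}: {leader} by {margin} points")
```

The margins differ widely: two rows are separated by 0.4 points, while SimpleQA-Verified shows a 21.0-point gap. Counting row wins alone would hide that difference.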
For teams that care about open weights and operating economics, Kimi K2.6 stands out for a different reason. Artificial Analysis describes Moonshot’s Kimi K2.6 as the leading open weights model and places it #4 overall on the Artificial Analysis Intelligence Index with a score of 54.
Vals AI also gives Kimi K2.6 useful operating data: Accuracy 63.94% ± 1.97, Latency 373.57s and Cost/Test $0.21. GPT-5.5’s Vals entry shows higher Accuracy at 67.76% ± 1.79 and Latency of 409.09s, with a 1M context window. On those Vals entries alone, GPT-5.5 has the higher displayed accuracy, while Kimi K2.6 has the lower displayed latency and an explicit cost-per-test figure.
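The ± figures give a rough sense of how firm the accuracy gap is. The sketch below assumes each ± value is a symmetric half-width in percentage points; the source set does not say whether it is a standard error or a confidence interval, so treat the result as a sanity check rather than a significance test. The helper names are illustrative.

```python
def interval(mean, half_width):
    """Return (low, high) for a score reported as mean ± half_width."""
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Vals AI accuracy figures quoted above (assumed to be in percentage points).
kimi = interval(63.94, 1.97)   # roughly 61.97 to 65.91
gpt = interval(67.76, 1.79)    # roughly 65.97 to 69.55

if __name__ == "__main__":
    print("Overlap:", intervals_overlap(kimi, gpt))
    print("Gap between ranges:", round(gpt[0] - kimi[1], 2))
```

Under that assumption the two ranges do not overlap, but the space between them is only about 0.06 points, so the accuracy lead is real on paper yet narrow.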
Do not average these into one master number. Artificial Analysis, Vals AI and BenchLM use different scoring systems, so Kimi’s Intelligence Index 54, Vals Accuracy 63.94% and BenchLM 85/100 should not be merged as if they were three versions of the same metric.
The public evidence points to different leaders for different jobs. Claude Opus 4.7 has the cleanest coding and BenchLM case; GPT-5.5 has the most concrete official metrics for knowledge-work agents and computer use; DeepSeek V4-Pro-Max has strong reasoning and math figures; and Kimi K2.6 is notable for open weights, cost and latency signals.
What the evidence does not support is a confident 1st-through-4th ranking across all four models. Use the benchmark table as a shortlist, then run your own evaluation on the actual work: codebases, finance documents, browser tasks, support workflows, long-running agents and budget limits. That approach aligns better with how 2026 benchmarks are organized and with the known limits of high-level scores.
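One way to structure such an in-house evaluation is a small pass-rate harness run against your own tasks. The sketch below is a toy under stated assumptions: `Task`, `pass_rate`, the two stand-in model functions and the sample tasks are all hypothetical, and a real setup would replace the stand-ins with calls to each vendor's API and use checks drawn from your actual workload.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Task:
    """One evaluation item: a prompt plus a pass/fail check on the output."""
    prompt: str
    check: Callable[[str], bool]


def pass_rate(model: Callable[[str], str], tasks: Sequence[Task]) -> float:
    """Fraction of tasks whose output passes its check."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


# Hypothetical stand-ins for real model calls.
def echo_model(prompt: str) -> str:
    return prompt


def upper_model(prompt: str) -> str:
    return prompt.upper()


# Toy tasks; real ones would come from your codebases, documents or workflows.
TASKS = [
    Task("refund policy", lambda out: "refund" in out),
    Task("invoice total", lambda out: "total" in out),
    Task("RESET PASSWORD", lambda out: out.isupper()),
]

if __name__ == "__main__":
    for name, model in [("echo", echo_model), ("upper", upper_model)]:
        print(f"{name}: {pass_rate(model, TASKS):.2f}")
```

Keeping every candidate on the same task list and the same checks is the point of the design: it gives the same-conditions comparison that the public benchmark figures above do not provide.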