GPT-5.5 is the best-supported all-rounder. In the available Artificial Analysis excerpt, GPT-5.5 appears first in the xhigh configuration with an Intelligence Index score of 60, followed by GPT-5.5 high at 59 and Claude Opus 4.7 at 57. On BrowseComp, GPT-5.5 scores 84.4%, just ahead of DeepSeek V4 at 83.4%, while Claude Opus 4.7 trails at 79.3%.
Claude Opus 4.7 is especially strong for software and knowledge-heavy work. It leads GPT-5.5 on SWE-Bench Pro, 64.3% versus 58.6%, and narrowly on GPQA Diamond, 94.2% versus 93.6%. But GPT-5.5 leads by a wide margin on Terminal-Bench 2.0, scoring 82.7% versus Claude Opus 4.7’s 69.4%.
DeepSeek V4 is the price-performance challenger. VentureBeat reports DeepSeek V4 at 83.4% on BrowseComp, only one percentage point below GPT-5.5 and above Claude Opus 4.7. Mashable also cites API pricing of $1.74 per 1 million input tokens and $3.48 per 1 million output tokens for DeepSeek V4, compared with $5/$30 for GPT-5.5 and $5/$25 for Claude Opus 4.7.
Kimi K2.6 should not be forced into the same ranking. DocsBot describes Kimi K2.6 as an open-source, natively multimodal, agentic model with a 1T-parameter mixture-of-experts architecture, 32B activated parameters and a 256K-token context window. But the provided sources do not include enough direct benchmark values against GPT-5.5, Claude Opus 4.7 and DeepSeek V4 to rank it fairly.
The main limitation is the evidence base. DataCamp notes in a related frontier-model comparison that benchmark scores can be vendor-reported and may use different harness configurations. That is a warning against treating the numbers as if they came from one neutral tournament.
The model variants also differ. Artificial Analysis lists GPT-5.5 xhigh, GPT-5.5 high and Claude Opus 4.7 with Adaptive Reasoning and Max Effort. VentureBeat’s DeepSeek discussion refers to DeepSeek-V4-Pro-Max. Those settings can matter a lot in reasoning, coding and agentic tasks, where tool use, reasoning effort and test setup can affect the final score.
So the better question is not simply, “Which model is number one?” It is: which model is best supported for the workload you actually care about?
The cleanest overall signal in the provided sources is the Artificial Analysis Intelligence Index excerpt. It places GPT-5.5 xhigh at 60 points, GPT-5.5 high at 59 points and Claude Opus 4.7 with Adaptive Reasoning and Max Effort at 57 points.
That supports a modest but visible GPT-5.5 lead over Claude Opus 4.7 in this index. The same available excerpt, however, does not provide complete directly citable Intelligence Index values for DeepSeek V4 and Kimi K2.6, so it cannot support a clean four-model ranking.
BrowseComp is the strongest directly cited three-way comparison among GPT-5.5, Claude Opus 4.7 and DeepSeek V4. VentureBeat reports 90.1% for GPT-5.5 Pro, 84.4% for GPT-5.5, 83.4% for DeepSeek V4 and 79.3% for Claude Opus 4.7.
VentureBeat’s broader interpretation is important: despite DeepSeek V4’s strong result, the report says DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 across the directly comparable benchmarks overall. The fair reading is narrower: DeepSeek V4 is very close to GPT-5.5 on BrowseComp, but that single benchmark does not prove an overall win.
For coding, there is no single obvious champion. Claude Opus 4.7 leads on SWE-Bench Pro with 64.3%, compared with GPT-5.5 at 58.6%. Vellum also cites 87.6% for Claude Opus 4.7 on SWE-Bench Verified. But on Terminal-Bench 2.0, the picture flips: GPT-5.5 scores 82.7%, while Claude Opus 4.7 scores 69.4%.
The provided sources do not give enough equivalent coding numbers for DeepSeek V4 and Kimi K2.6 to place them in the same table. VentureBeat says DeepSeek V4 comes close to top models on several directly comparable benchmarks, but the clearest specific numbers in the available excerpt are for BrowseComp. For Kimi K2.6, DocsBot mainly provides model and architecture details rather than a full benchmark matrix against the other three models.
On knowledge and reasoning tests, GPT-5.5 and Claude Opus 4.7 are close, and the leader depends on the benchmark and tool setup. Vellum lists GPQA Diamond at 93.6% for GPT-5.5 and 94.2% for Claude Opus 4.7. Mashable cites the same GPQA Diamond values and adds Humanity’s Last Exam results: without tools, GPT-5.5 scores 40.6% versus Claude Opus 4.7 at 31.2%; with tools, Claude Opus 4.7 scores 54.7% versus GPT-5.5 at 52.2%.
Professional and agentic benchmarks are also mixed. Vellum reports that GPT-5.5 leads on GDPval, 84.9% versus 80.3% for Claude Opus 4.7, and narrowly on OSWorld-Verified, 78.7% versus 78.0%, while Claude Opus 4.7 leads on MCP Atlas, 79.1% versus 75.3%. OpenAI reports FinanceAgent v1.1 at 60.0% for GPT-5.5 and 64.4% for Claude Opus 4.7.
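The head-to-head figures cited so far for GPT-5.5 and Claude Opus 4.7 can be tallied in a few lines. The sketch below is illustrative only: the scores are copied from the text above, while the names `SCORES` and `margins` are made up for this example. Because the numbers come from different sources and possibly different harness configurations, the tally shows how split the evidence is rather than producing a ranking.

```python
# Scores in percent, copied from the figures cited in this article.
# Each entry maps a benchmark to (GPT-5.5 score, Claude Opus 4.7 score).
# Caveat: sources and test setups differ, so this is not a neutral tournament.
SCORES = {
    "BrowseComp": (84.4, 79.3),
    "SWE-Bench Pro": (58.6, 64.3),
    "Terminal-Bench 2.0": (82.7, 69.4),
    "GPQA Diamond": (93.6, 94.2),
    "Humanity's Last Exam (no tools)": (40.6, 31.2),
    "Humanity's Last Exam (with tools)": (52.2, 54.7),
    "GDPval": (84.9, 80.3),
    "OSWorld-Verified": (78.7, 78.0),
    "MCP Atlas": (75.3, 79.1),
    "FinanceAgent v1.1": (60.0, 64.4),
}


def margins(scores):
    """Return GPT-5.5 minus Claude Opus 4.7, in percentage points, per benchmark."""
    return {name: round(gpt - claude, 1) for name, (gpt, claude) in scores.items()}


gaps = margins(SCORES)
gpt_leads = [name for name, gap in gaps.items() if gap > 0]
claude_leads = [name for name, gap in gaps.items() if gap < 0]

# Print the largest gaps first, whichever model they favor.
for name, gap in sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True):
    leader = "GPT-5.5" if gap > 0 else "Claude Opus 4.7"
    print(f"{name}: {leader} by {abs(gap):.1f} points")

print(f"GPT-5.5 leads on {len(gpt_leads)}, Claude Opus 4.7 on {len(claude_leads)}")
```

On these ten cited comparisons each model leads on five, and the largest single gap is GPT-5.5's 13.3-point margin on Terminal-Bench 2.0.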
Anthropic also points to an internal research-agent benchmark in which, by its own account, Claude Opus 4.7 tied for the top overall score across six modules at 0.715 and scored 0.813 on General Finance versus 0.767 for Opus 4.6. Because that benchmark is internal and does not cover all four models equally in the provided material, it is best read as evidence of Claude’s agentic strength rather than as an independent four-way ranking.
For real production use, one extra benchmark point may matter less than cost, context length and throughput. Mashable cites DeepSeek V4 at $1.74 per 1 million input tokens and $3.48 per 1 million output tokens, with a 1 million-token context window. The same source gives GPT-5.5 at $5 per 1 million input tokens and $30 per 1 million output tokens, and Claude Opus 4.7 at $5 input and $25 output per 1 million tokens, each also listed with a 1 million-token context window.
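To see what those list prices mean in practice, the arithmetic can be sketched for a single workload. The prices below are the Mashable figures quoted above; the workload size (10 million input tokens and 2 million output tokens) and the names `PRICES` and `workload_cost` are assumptions chosen purely for illustration, and the sketch ignores caching, batch discounts and other pricing details the sources do not cover.

```python
# USD per 1 million tokens as (input price, output price), from the cited figures.
PRICES = {
    "DeepSeek V4": (1.74, 3.48),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def workload_cost(input_tokens, output_tokens, input_price, output_price):
    """Return the USD cost of a workload at per-million-token list prices."""
    return (input_tokens / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price


# Hypothetical workload: an assumption for illustration, not taken from any source.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

costs = {
    model: round(workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price), 2)
    for model, (in_price, out_price) in PRICES.items()
}

# Print cheapest first.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model}: ${cost:.2f}")
```

Under this assumed mix the bill comes to about $24.36 for DeepSeek V4, $100 for Claude Opus 4.7 and $110 for GPT-5.5. The gap widens for output-heavy workloads, because output tokens are where the cited prices differ most.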
Kimi K2.6 is a special case here. DocsBot describes it as having a 256K-token context window, a 1T-parameter MoE architecture, 32B activated parameters and agentic orchestration scaling to 300 sub-agents and 4,000 coordinated steps. Those are relevant technical details, but they are not a substitute for direct benchmark and pricing comparisons against GPT-5.5, Claude Opus 4.7 and DeepSeek V4.
The strongest conclusion is not that one model wins everything. GPT-5.5 is the best-supported all-rounder in the available sources because it leads the Artificial Analysis excerpt and performs strongly on BrowseComp and several professional benchmarks. Claude Opus 4.7 remains a top-tier model, especially on SWE-Bench Pro, SWE-Bench Verified, GPQA Diamond and selected agentic finance tasks. DeepSeek V4 is the clearest value story, coming within one percentage point of GPT-5.5 on BrowseComp while carrying much lower cited API prices. Kimi K2.6 should be treated as promising but under-evidenced in this comparison: the available sources do not provide the direct benchmark and price data needed for a fair ranking.