Benchmark charts make this look like a simple race, but this comparison is not a clean single leaderboard. The closest shared table among the cited sources compares GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max; Kimi K2.6 is reported in separate Kimi-focused sources, including release coverage, Hugging Face, and LLM Stats [1][6][11][24]. Use the numbers as a shortlist for testing, not as a final universal ranking.
For DeepSeek V4, this article uses DeepSeek-V4-Pro-Max because that is the DeepSeek V4 variant with benchmark and cost rows in the cited sources [18][24]. GPT-5.5 and GPT-5.5 Pro are also kept separate where the sources report different results [24].
Quick verdict
- Best terminal-style coding-agent signal: GPT-5.5, with 82.7% on Terminal-Bench 2.0 in the shared comparison [24].
- Best cited software-repair signal: Claude Opus 4.7, with 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified in the cited data [18][24].
- Best open-weight candidate: Kimi K2.6, described as an open-weight 1T-parameter MoE model with 32B active parameters and strong coding benchmark rows on its Hugging Face model card [1][6].
- Best value candidate to validate: DeepSeek-V4-Pro-Max, which LLM Stats lists with 1.6T size, 1M context, 80.6% on SWE-Bench Verified, and $1.74/$3.48 in its cost columns [18].
- Best caveat: GPT-5.5 Pro is not the same comparison point as base GPT-5.5; in the shared table it leads BrowseComp and HLE with tools where reported [24].
Benchmark comparison table
A dash means the score was not found in the cited sources for that model, not that the model scored zero. The GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max rows mostly come from one shared comparison, while Kimi K2.6 figures come from separate Kimi sources [1][6][24].
| Benchmark | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | Kimi K2.6 | DeepSeek-V4-Pro-Max |
|---|---|---|---|---|---|
| GPQA Diamond | 93.6% [24] | — | 94.2% [24] | ≈91% | 90.1% [24] |
| Humanity’s Last Exam, no tools | 41.4% [24] | 43.1% [24] | 46.9% [24] | — | 37.7% [24] |
| Humanity’s Last Exam, with tools | 52.2% [24] | 57.2% [24] | 54.7% [24] | 54.0% [1] | 48.2% [24] |
| Terminal-Bench 2.0 | 82.7% [24] | — | 69.4% [24] | 66.7% [6] | 67.9% [24] |
| SWE-Bench Pro | 58.6% [24] | — | 64.3% [24] | 58.6% [6] | 55.4% [24] |
| BrowseComp | 84.4% [24] | 90.1% [24] | 79.3% [24] | 83.2% [1] | 83.4% [24] |
| MCP Atlas / MCPAtlas Public | 75.3% [24] | — | 79.1% [24] | — | 73.6% [24] |
| SWE-Bench Verified | — | — | 87.6% [18] | 80.2% [6] | 80.6% [18] |
Best model by workload
| Priority | Start with | Why |
|---|---|---|
| Terminal-heavy coding agents | GPT-5.5 | It has the highest Terminal-Bench 2.0 score in the shared comparison at 82.7% [24]. |
| Software-engineering repair | Claude Opus 4.7 | It leads the cited SWE-Bench Pro row and the cited SWE-Bench Verified row among these models [18][24]. |
| Hard reasoning without tools | Claude Opus 4.7 | It leads GPQA Diamond and HLE without tools in the shared comparison [24]. |
| Tool-assisted hard reasoning or browsing | GPT-5.5 Pro | It leads HLE with tools at 57.2% and BrowseComp at 90.1% where GPT-5.5 Pro is reported [24]. |
| Open-weight deployment | Kimi K2.6 | It is the clearest open-weight model in the cited sources and has strong coding rows on Hugging Face [1][6]. |
| Cost-sensitive hosted inference | DeepSeek-V4-Pro-Max, then Kimi K2.6 | DeepSeek is listed at $1.74/$3.48 in LLM Stats cost columns, while Kimi K2.6 is listed at $0.95/$4.00 in LLM Stats price columns [11][18]. |
| 1M-token context needs | GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max | The cited sources list 1M context for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max; Kimi K2.6 is reported around 256K context [1][16][18][27]. |
Model-by-model notes
GPT-5.5
GPT-5.5 is the clearest terminal-agent pick in the cited comparison. OpenAI describes GPT-5.5 as built for complex tasks such as coding, research, and data analysis [38]. In the shared VentureBeat table, GPT-5.5 posts 82.7% on Terminal-Bench 2.0, above Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [24]. It also scores 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro, and 84.4% on BrowseComp in that table [24].
The important caveat is that GPT-5.5 Pro is reported separately. In the same shared table, GPT-5.5 Pro reaches 90.1% on BrowseComp and 57.2% on HLE with tools, but those rows should not be merged with base GPT-5.5 when comparing cost, latency, or effort settings [24].
For procurement context, BenchLM lists GPT-5.5 with a 1M-token context window, while one pricing report lists GPT-5.5 at $5 per million input tokens and $30 per million output tokens [27][36]. Treat that as a pricing signal to verify against current OpenAI API pricing before making a buying decision.
Claude Opus 4.7
Claude Opus 4.7 has the strongest cited software-repair results in this group. LLM Stats lists it at 87.6% on SWE-Bench Verified, and the shared VentureBeat table reports 64.3% on SWE-Bench Pro [18][24]. It also leads the shared GPQA Diamond row at 94.2%, HLE without tools at 46.9%, and MCP Atlas at 79.1% [24].
LLM Stats reports a 1M-token context window and $5/$25 per million-token pricing for Claude Opus 4.7 [16]. The main caveat is comparability: Anthropic notes that some of its benchmark results used internal implementations or updated harness parameters, and that some scores are not directly comparable to public leaderboard scores [17].
Kimi K2.6
Kimi K2.6 is the strongest open-weight candidate in the cited material. Release coverage describes it as an open-weight 1T-parameter MoE model with 32B active parameters, 384 experts, native multimodality, INT4 quantization, and 256K context [1]. Its Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, and 89.6 on LiveCodeBench v6 [6].
The same release coverage reports 54.0 on HLE with tools and 83.2 on BrowseComp for Kimi K2.6 [1]. LLM Stats lists Kimi K2.6 with 262K context, $0.95/$4.00 in its price columns, and an Open Source label [11]. The limitation is that Kimi’s figures do not come from the same shared table as GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, so close score differences should be treated as reasons to test rather than as definitive wins [1][6][24].
DeepSeek-V4-Pro-Max
DeepSeek-V4-Pro-Max is the value candidate rather than the clear all-around benchmark leader. LLM Stats lists it with 1.6T size, 1M context, 80.6% on SWE-Bench Verified, and $1.74/$3.48 in its cost columns [18]. In the shared VentureBeat comparison, it scores 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas [24].
Those numbers make DeepSeek-V4-Pro-Max worth testing for cost-sensitive workloads, especially because the cited cost columns are below the cited GPT-5.5 and Claude Opus 4.7 price signals [16][18][36]. But the same shared table shows GPT-5.5 or Claude Opus 4.7 leading most of the reported benchmark rows, so DeepSeek should be validated on your own tasks before replacing a premium closed model [24].
Context and pricing signals
Pricing and context windows are not always reported by the same source or provider, so treat this as a procurement checklist rather than a final quote; a back-of-the-envelope cost sketch follows the table.
| Model | Cited context and pricing signal | Practical read |
|---|---|---|
| GPT-5.5 | BenchLM lists 1M context; one pricing report lists $5 input and $30 output per million tokens [27][36]. | Premium closed-model option; verify live OpenAI pricing. |
| Claude Opus 4.7 | LLM Stats reports 1M context and $5/$25 per million-token pricing [16]. | Premium coding, reasoning, and long-context option. |
| Kimi K2.6 | Release coverage reports 256K context; LLM Stats lists 262K context and $0.95/$4.00 in price columns [1][11]. | Strong open-weight candidate; hosted pricing may vary by provider. |
| DeepSeek-V4-Pro-Max | LLM Stats lists 1M context, 1.6T size, 80.6% on SWE-Bench Verified, and $1.74/$3.48 in cost columns [18]. | Strong value candidate if quality holds on your workload. |
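To make those price columns concrete, here is a minimal back-of-the-envelope sketch in Python. It applies the cited per-million-token prices to a purely hypothetical monthly workload; the request count and token volumes are assumptions for illustration, and live provider pricing should be confirmed before budgeting.

```python
# Back-of-the-envelope cost estimate from the cited per-million-token price signals.
# The traffic numbers below are hypothetical; swap in your own volumes and verify live pricing.

PRICES_PER_MTOK = {
    # model: (input $/M tokens, output $/M tokens), as cited above
    "GPT-5.5": (5.00, 30.00),             # pricing report [36]
    "Claude Opus 4.7": (5.00, 25.00),     # LLM Stats [16]
    "Kimi K2.6": (0.95, 4.00),            # LLM Stats [11]
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),  # LLM Stats [18]
}

# Hypothetical monthly workload: 20k requests, ~6k input and ~1.5k output tokens each.
REQUESTS = 20_000
INPUT_TOKENS = 6_000
OUTPUT_TOKENS = 1_500

for model, (price_in, price_out) in PRICES_PER_MTOK.items():
    monthly_cost = REQUESTS * (
        INPUT_TOKENS / 1_000_000 * price_in
        + OUTPUT_TOKENS / 1_000_000 * price_out
    )
    print(f"{model:<22} ~${monthly_cost:,.0f}/month")
```

Under these assumed volumes the cited prices imply roughly $1,500/month for GPT-5.5, $1,350 for Claude Opus 4.7, about $313 for DeepSeek-V4-Pro-Max, and about $234 for Kimi K2.6, which is why the two lower-cost models belong in any cost-sensitive evaluation.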
Why benchmark rankings can disagree
The cited rows do not test one skill. The comparison separates GPQA Diamond, HLE with and without tools, Terminal-Bench 2.0, SWE-Bench Pro, BrowseComp, MCP Atlas, and SWE-Bench Verified [18][24]. A model can lead one row and trail another because the tasks, tool access, and scoring harnesses differ.
Even the same named benchmark can vary across implementations. LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, while LMCouncil lists Claude Opus 4.7 at 83.5% ± 1.7 under its setup [18][30]. Anthropic also says some benchmark results used internal implementations or updated harness parameters, limiting direct comparability with public leaderboard scores [17].
That is why one- or two-point gaps should not drive a production decision by themselves. Public benchmarks are best used to narrow the shortlist; your own evaluation should make the final call.
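To see why a one- or two-point gap is often within measurement noise, here is an illustrative Python sketch using a simple normal-approximation confidence interval for a pass rate. The 84% score and the 500-task suite size are assumptions chosen for illustration, not figures from the cited sources.

```python
# Rough sampling-noise check: how much a benchmark pass rate can wobble
# purely from the number of tasks, before harness differences are considered.
import math

def pass_rate_interval(pass_rate: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate measured on n_tasks items."""
    se = math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)
    return pass_rate - z * se, pass_rate + z * se

# Illustrative numbers: an ~84% pass rate on a hypothetical 500-task suite.
low, high = pass_rate_interval(0.84, 500)
print(f"84.0% on 500 tasks -> roughly {low:.1%} to {high:.1%}")
# Prints roughly 80.8% to 87.2%, so a one- or two-point gap between two
# models can sit entirely inside this band.
```

Harness differences, like the ones Anthropic flags, add further variance on top of this sampling noise [17].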
How to evaluate the finalists
Before committing to one model, test the top two or three candidates on a representative internal eval set; a minimal harness sketch follows this checklist:
- Use your real prompts and files. Benchmark-style prompts are useful, but they rarely capture your repository, documents, policies, or user behavior.
- Match your tool environment. Coding-agent results can change when the model has or lacks terminal access, browsing, repository context, retrieval, or internal APIs.
- Measure cost and latency with the same settings. Higher-effort or Pro modes can change quality, spend, and response time.
- Review failures manually. For coding tasks, inspect tests, diffs, maintainability, security regressions, and hallucinated dependencies.
- Include at least one lower-cost challenger. Kimi K2.6 and DeepSeek-V4-Pro-Max are worth including if open weights or inference cost matter [1][18].
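As referenced above, here is a minimal sketch of what such a side-by-side harness could look like in Python. The `call_model` callable, the example check, and the per-token prices are placeholders and assumptions rather than a real provider integration; wire them to the SDKs, gateways, and pricing you actually use.

```python
# Minimal internal-eval harness sketch: run the same prompts through each
# candidate, record pass/fail, latency, and a token-based cost estimate, then compare.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    model: str
    pass_rate: float
    avg_latency_s: float
    est_cost_usd: float

def run_eval(
    model: str,
    cases: list[dict],                      # each case: {"prompt": str, "check": Callable[[str], bool]}
    call_model: Callable[[str, str], str],  # (model, prompt) -> completion text; placeholder
    price_in: float,                        # $ per million input tokens
    price_out: float,                       # $ per million output tokens
) -> EvalResult:
    passes, latencies, cost = 0, [], 0.0
    for case in cases:
        start = time.perf_counter()
        output = call_model(model, case["prompt"])
        latencies.append(time.perf_counter() - start)
        passes += bool(case["check"](output))
        # Crude token estimate (~4 chars/token); prefer the provider's usage fields in practice.
        cost += len(case["prompt"]) / 4 / 1e6 * price_in + len(output) / 4 / 1e6 * price_out
    return EvalResult(model, passes / len(cases), sum(latencies) / len(latencies), cost)

# Example usage with a stubbed model call and one trivial check.
if __name__ == "__main__":
    cases = [{"prompt": "Write a Python one-liner that reverses a string.",
              "check": lambda out: "[::-1]" in out}]
    stub = lambda model, prompt: "s[::-1]"  # replace with a real API call
    for candidate in ["GPT-5.5", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]:
        print(run_eval(candidate, cases, stub, price_in=5.0, price_out=25.0))
```

Keeping the checks, prompts, and pricing in plain data like this makes it easy to rerun the same evaluation whenever a provider ships a new model or changes effort settings.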
Bottom line
If you want the safest high-end shortlist, test GPT-5.5 and Claude Opus 4.7 side by side: GPT-5.5 has the strongest cited Terminal-Bench 2.0 result, while Claude Opus 4.7 has the strongest cited SWE-Bench Pro and SWE-Bench Verified results [18][24]. If you need open weights, start with Kimi K2.6 [1][6]. If cost is the constraint, put DeepSeek-V4-Pro-Max into the evaluation, but validate it against your own workload before treating it as a replacement for the premium models [18][24].