
Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Research answer

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena [4][1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools [4][5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison [4][18][1].

Key findings

  • Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4].

  • Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% [4].

  • GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4].

  • GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5].

  • GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4-point improvement over Opus 4.6 in Document Arena and wins in the diagram, homework, and OCR subcategories [1].

  • DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology [4].

  • DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report [18].

  • Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

Benchmark comparison table

| Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 | Leader in available evidence |
|---|---|---|---|---|---|---|
| GPQA Diamond | 93.6% [4] | Insufficient evidence | 94.2% [4] | 90.1% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | Claude Opus 4.7 [4] |
| Humanity's Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | Claude Opus 4.7 [4] |
| Humanity's Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | GPT-5.5 Pro [4] |
| Terminal-Bench 2.0 | 82.7% [4][5] | Insufficient evidence | 69.4% [4][5] | 67.9% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | GPT-5.5 [4][5] |
| OSWorld-Verified | 78.7% [5] | Insufficient evidence | 78.0% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5] |
| FrontierMath Tiers 1–3 | 51.7% [5] | Insufficient evidence | 43.8% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5] |
| Vision & Document Arena | Insufficient evidence | Insufficient evidence | Reported #1 overall [1] | Insufficient evidence | Insufficient evidence | Claude Opus 4.7 [1] |
| Vibe Code Benchmark | Insufficient evidence | Insufficient evidence | Insufficient evidence | Claimed #1 open-weight model [18] | Claimed #2 open-weight model [18] | DeepSeek V4 among open-weight models (low-confidence evidence) [18] |
| Context window | Insufficient evidence | Insufficient evidence | 1,000k tokens in one cited comparison [3] | 1,000k tokens (DeepSeek V4 Pro) in one cited comparison [3] | Insufficient evidence | Tie between Claude Opus 4.7 and DeepSeek V4 Pro [3] |
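
For transparency, the leader column above can be recomputed mechanically from the cited figures. The following minimal Python sketch does so under one assumption that is this report's, not the sources': a model absent from a benchmark's cited scores is treated as insufficient evidence and simply omitted, so leaders are determined only among models with cited numbers.

```python
# Recompute the "Leader in available evidence" column from the cited scores.
# Values are the percentages cited in [4][5]; models without a cited score
# on a benchmark are omitted rather than counted as zero.

scores = {
    "GPQA Diamond": {
        "GPT-5.5": 93.6, "Claude Opus 4.7": 94.2, "DeepSeek-V4-Pro-Max": 90.1,
    },
    "Humanity's Last Exam, no tools": {
        "GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1,
        "Claude Opus 4.7": 46.9, "DeepSeek-V4-Pro-Max": 37.7,
    },
    "Humanity's Last Exam, with tools": {
        "GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2,
        "Claude Opus 4.7": 54.7, "DeepSeek-V4-Pro-Max": 48.2,
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "Claude Opus 4.7": 69.4, "DeepSeek-V4-Pro-Max": 67.9,
    },
    "OSWorld-Verified": {"GPT-5.5": 78.7, "Claude Opus 4.7": 78.0},
    "FrontierMath Tiers 1-3": {"GPT-5.5": 51.7, "Claude Opus 4.7": 43.8},
}

for benchmark, cited in scores.items():
    leader = max(cited, key=cited.get)
    print(f"{benchmark}: {leader} at {cited[leader]}% "
          f"(among {len(cited)} models with cited scores)")
```

The Vision & Document Arena, Vibe Code, and context-window rows are rankings or capabilities rather than percentage scores, so they are excluded from this numeric recomputation.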

Model-by-model assessment

GPT-5.5

  • GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score [4][5].

  • GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% [4].

  • GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% [4].

  • Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics [7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks [7].

Claude Opus 4.7

  • Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% [4].

  • Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% [4].

  • Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% [4].

  • Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% [4][5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories [1].

  • Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro [3].
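
The 1,000k-token figure comes from a single cited comparison [3] and is not corroborated elsewhere in the evidence. As a practical aside, a rough way to judge whether a workload fits such a window is a characters-per-token heuristic; the ratio in the sketch below is a common English-text rule of thumb, not a number from the cited source, and a real tokenizer should be used for anything precise.

```python
# Rough sizing check against a 1,000k-token context window [3].
# CHARS_PER_TOKEN is an assumed rule of thumb, not a cited figure;
# actual ratios vary by tokenizer, language, and content type.

CONTEXT_WINDOW_TOKENS = 1_000_000
CHARS_PER_TOKEN = 4  # assumption; measure with the model's own tokenizer

def fits_in_context(text: str, reserved_output_tokens: int = 8_000) -> bool:
    """Estimate whether a prompt fits while leaving room for the response."""
    estimated_prompt_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS

# A ~3 MB plain-text corpus is roughly 750k estimated tokens, so it fits.
print(fits_in_context("x" * 3_000_000))  # True under these assumptions
```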

DeepSeek V4

  • DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results [4].

  • DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 [4].

  • DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions (see the cost sketch after this list) [4].

  • DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 [3].

  • A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table [18].
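
To make the missing normalization concrete, the sketch below computes cost per benchmark task from per-million-token prices and an assumed workload. Every number in it is a placeholder chosen for illustration, not a figure from the cited evidence; the placeholder prices are deliberately set so the ratio lands near six, which is exactly why the claim cannot be verified without the real price schedule.

```python
# Hypothetical sketch of the normalization the "1/6th the cost" claim
# would require. All prices and token counts are placeholders, not
# figures from the cited evidence.

def cost_per_run(input_tokens: int, output_tokens: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """USD cost of one benchmark task given per-million-token prices."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Assumed workload: 20k input tokens and 5k output tokens per task.
workload = {"input_tokens": 20_000, "output_tokens": 5_000}

# Placeholder price schedules (USD per million tokens).
frontier = cost_per_run(**workload, price_in_per_mtok=15.0, price_out_per_mtok=75.0)
cheaper = cost_per_run(**workload, price_in_per_mtok=2.5, price_out_per_mtok=12.5)

print(f"frontier: ${frontier:.3f}/task, cheaper: ${cheaper:.3f}/task, "
      f"ratio: {frontier / cheaper:.1f}x")  # 6.0x under these placeholders
```

Even this simple version shows how the ratio moves with the assumed input/output mix, and it ignores retries, latency, caching discounts, and failure reruns, any of which could change the comparison.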

Kimi K2.6

  • Kimi K2.6 has the weakest quantitative coverage in the available evidence [1][18].

  • One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

  • The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models [18].

  • Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

  • The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro (where available), and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 [4].

  • GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results [5].

  • The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison [3].

  • The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 [1].

  • The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here [2][6].

Limitations / uncertainty

  • Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence; the sketch after this list shows why missing confidence intervals matter for narrow gaps.

  • Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported [1][18].

  • The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings [3][4].

  • The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology [18].

  • The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions [4].
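
The absence of confidence intervals matters most for the narrowest result here, GPT-5.5's 78.7% versus Claude Opus 4.7's 78.0% on OSWorld-Verified [5]. The sketch below applies a standard two-proportion error estimate under an assumed task count; the count is a placeholder, since the cited excerpts do not state how many tasks the benchmark contains, and the formula treats the two runs as independent even though they likely share a task set.

```python
# Illustrative only: is a 0.7-point gap (78.7% vs 78.0%) larger than
# sampling noise? That depends on the task count, which the cited
# excerpts do not provide; n_tasks below is a placeholder assumption.
import math

def two_proportion_margin(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin for the difference of two independent proportions."""
    return z * math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

p_gpt, p_claude, n_tasks = 0.787, 0.780, 400  # n_tasks is assumed
gap = p_gpt - p_claude
margin = two_proportion_margin(p_gpt, p_claude, n_tasks)
print(f"gap = {gap:.3f}, approx 95% margin = {margin:.3f}")
# With n = 400, the margin (~0.057) dwarfs the 0.007 gap, so the
# OSWorld-Verified lead should be read as a statistical tie.
```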

Summary

  • Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools [4].

  • Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools [4].

  • Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 [4][5].

  • Best cited computer-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath [5].

  • Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting [1].

  • Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet [18].

  • Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison [1][18].

Supporting visuals

Image captions recovered from the embedded visuals:

  • GPT-5.5 Review: Benchmarks, Pricing & Vs Claude (2026): a comparison table of benchmark scores for GPT-5.5, GPT-5.5 Pro, Claude 4.7, GPT-5.4, and Gemini 3.1 Pro across categories such as Terminal-Bench 2.0, OSWorld-Verified, and GPQA Diamond.

  • Kimi K2.6 vs DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7: an infographic comparing the models' strengths, best use cases, access routes, and key features.

  • Kimi K2.6 vs DeepSeek V2 vs GPT-5.5 vs Claude Opus 4.7: an image comparing 2026 coding-reasoning benchmark criteria and testing strategies.

  • Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4: Agentic Coding Benchmarks (2026).

  • The Best AI Models So Far in 2026 (Design for Online): a ranking and benchmark comparison of models released in February 2026, including Gemini 3.1 Pro, Claude Sonnet 4.6, and Grok 4.20.


Sources