
Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Research answer

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena [4][1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools [4][5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison [4][18][1].

Key findings

  • Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4].

  • Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% [4].

  • GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4].

  • GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5].

  • GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4-point improvement over Opus 4.6 in Document Arena and wins in the diagram, homework, and OCR subcategories [1].

  • DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology [4].

  • DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report [18].

  • Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

Benchmark comparison table

| Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 | Leader in available evidence |
|---|---|---|---|---|---|---|
| GPQA Diamond | 93.6% [4] | Insufficient evidence | 94.2% [4] | 90.1% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | Claude Opus 4.7 [4] |
| Humanity's Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | Claude Opus 4.7 [4] |
| Humanity's Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | GPT-5.5 Pro [4] |
| Terminal-Bench 2.0 | 82.7% [4][5] | Insufficient evidence | 69.4% [4][5] | 67.9% (DeepSeek-V4-Pro-Max) [4] | Insufficient evidence | GPT-5.5 [4][5] |
| OSWorld-Verified | 78.7% [5] | Insufficient evidence | 78.0% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5] |
| FrontierMath Tiers 1–3 | 51.7% [5] | Insufficient evidence | 43.8% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5] |
| Vision & Document Arena | Insufficient evidence | Insufficient evidence | Reported #1 overall [1] | Insufficient evidence | Insufficient evidence | Claude Opus 4.7 [1] |
| Vibe Code Benchmark | Insufficient evidence | Insufficient evidence | Insufficient evidence | Claimed #1 open-weight model [18] | Claimed #2 open-weight model [18] | DeepSeek V4 among open-weight models (low-confidence evidence) [18] |
| Context window | Insufficient evidence | Insufficient evidence | 1,000k tokens in one cited comparison [3] | 1,000k tokens (DeepSeek V4 Pro) in one cited comparison [3] | Insufficient evidence | Tie between Claude Opus 4.7 and DeepSeek V4 Pro [3] |
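
For transparency, the leader column above can be recomputed mechanically from the cited figures. The following minimal Python sketch does so under one assumption that is this report's, not the sources': a model absent from a benchmark's cited scores is treated as insufficient evidence and simply omitted, so leaders are determined only among models with cited numbers.

```python
# Recompute the "Leader in available evidence" column from the cited scores.
# Values are the percentages cited in [4][5]; models without a cited score
# on a benchmark are omitted rather than counted as zero.

scores = {
    "GPQA Diamond": {
        "GPT-5.5": 93.6, "Claude Opus 4.7": 94.2, "DeepSeek-V4-Pro-Max": 90.1,
    },
    "Humanity's Last Exam, no tools": {
        "GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1,
        "Claude Opus 4.7": 46.9, "DeepSeek-V4-Pro-Max": 37.7,
    },
    "Humanity's Last Exam, with tools": {
        "GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2,
        "Claude Opus 4.7": 54.7, "DeepSeek-V4-Pro-Max": 48.2,
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "Claude Opus 4.7": 69.4, "DeepSeek-V4-Pro-Max": 67.9,
    },
    "OSWorld-Verified": {"GPT-5.5": 78.7, "Claude Opus 4.7": 78.0},
    "FrontierMath Tiers 1-3": {"GPT-5.5": 51.7, "Claude Opus 4.7": 43.8},
}

for benchmark, cited in scores.items():
    leader = max(cited, key=cited.get)
    print(f"{benchmark}: {leader} at {cited[leader]}% "
          f"(among {len(cited)} models with cited scores)")
```

The Vision & Document Arena, Vibe Code, and context-window rows are rankings or capabilities rather than percentage scores, so they are excluded from this numeric recomputation.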

Model-by-model assessment

GPT-5.5

  • GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score [4][5].

  • GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% [4].

  • GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% [4].

  • Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics [7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks [7].

Claude Opus 4.7

  • Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% [4].

  • Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% [4].

  • Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% [4].

  • Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% [4][5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories [1].

  • Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro [3].
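
The 1,000k-token figure comes from a single cited comparison [3] and is not corroborated elsewhere in the evidence. As a practical aside, a rough way to judge whether a workload fits such a window is a characters-per-token heuristic; the ratio in the sketch below is a common English-text rule of thumb, not a number from the cited source, and a real tokenizer should be used for anything precise.

```python
# Rough sizing check against a 1,000k-token context window [3].
# CHARS_PER_TOKEN is an assumed rule of thumb, not a cited figure;
# actual ratios vary by tokenizer, language, and content type.

CONTEXT_WINDOW_TOKENS = 1_000_000
CHARS_PER_TOKEN = 4  # assumption; measure with the model's own tokenizer

def fits_in_context(text: str, reserved_output_tokens: int = 8_000) -> bool:
    """Estimate whether a prompt fits while leaving room for the response."""
    estimated_prompt_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS

# A ~3 MB plain-text corpus is roughly 750k estimated tokens, so it fits.
print(fits_in_context("x" * 3_000_000))  # True under these assumptions
```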

DeepSeek V4

  • DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results [4].

  • DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 [4].

  • DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions (see the cost sketch after this list) [4].

  • DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 [3].

  • A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table [18].
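
To make the missing normalization concrete, the sketch below computes cost per benchmark task from per-million-token prices and an assumed workload. Every number in it is a placeholder chosen for illustration, not a figure from the cited evidence; the placeholder prices are deliberately set so the ratio lands near six, which is exactly why the claim cannot be verified without the real price schedule.

```python
# Hypothetical sketch of the normalization the "1/6th the cost" claim
# would require. All prices and token counts are placeholders, not
# figures from the cited evidence.

def cost_per_run(input_tokens: int, output_tokens: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """USD cost of one benchmark task given per-million-token prices."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Assumed workload: 20k input tokens and 5k output tokens per task.
workload = {"input_tokens": 20_000, "output_tokens": 5_000}

# Placeholder price schedules (USD per million tokens).
frontier = cost_per_run(**workload, price_in_per_mtok=15.0, price_out_per_mtok=75.0)
cheaper = cost_per_run(**workload, price_in_per_mtok=2.5, price_out_per_mtok=12.5)

print(f"frontier: ${frontier:.3f}/task, cheaper: ${cheaper:.3f}/task, "
      f"ratio: {frontier / cheaper:.1f}x")  # 6.0x under these placeholders
```

Even this simple version shows how the ratio moves with the assumed input/output mix, and it ignores retries, latency, caching discounts, and failure reruns, any of which could change the comparison.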

Kimi K2.6

  • Kimi K2.6 has the weakest quantitative coverage in the available evidence [1][18].

  • One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

  • The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models [18].

  • Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

  • The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro (where available), and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 [4].

  • GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results [5].

  • The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison [3].

  • The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 [1].

  • The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here [2][6].

Limitations / uncertainty

  • Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence; the sketch after this list shows why missing confidence intervals matter for narrow gaps.

  • Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported [1][18].

  • The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings [3][4].

  • The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology [18].

  • The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions [4].
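
The absence of confidence intervals matters most for the narrowest result here, GPT-5.5's 78.7% versus Claude Opus 4.7's 78.0% on OSWorld-Verified [5]. The sketch below applies a standard two-proportion error estimate under an assumed task count; the count is a placeholder, since the cited excerpts do not state how many tasks the benchmark contains, and the formula treats the two runs as independent even though they likely share a task set.

```python
# Illustrative only: is a 0.7-point gap (78.7% vs 78.0%) larger than
# sampling noise? That depends on the task count, which the cited
# excerpts do not provide; n_tasks below is a placeholder assumption.
import math

def two_proportion_margin(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin for the difference of two independent proportions."""
    return z * math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

p_gpt, p_claude, n_tasks = 0.787, 0.780, 400  # n_tasks is assumed
gap = p_gpt - p_claude
margin = two_proportion_margin(p_gpt, p_claude, n_tasks)
print(f"gap = {gap:.3f}, approx 95% margin = {margin:.3f}")
# With n = 400, the margin (~0.057) dwarfs the 0.007 gap, so the
# OSWorld-Verified lead should be read as a statistical tie.
```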

Summary

  • Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools [4].

  • Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools [4].

  • Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 [4][5].

  • Best cited computer-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath [5].

  • Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting [1].

  • Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet [18].

  • Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison [1][18].

Supporting visuals

Image captions recovered from the embedded visuals:

  • GPT-5.5 Review: Benchmarks, Pricing & Vs Claude (2026): a comparison table of benchmark scores for GPT-5.5, GPT-5.5 Pro, Claude 4.7, GPT-5.4, and Gemini 3.1 Pro across categories such as Terminal-Bench 2.0, OSWorld-Verified, and GPQA Diamond.

  • Kimi K2.6 vs DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7: an infographic comparing the models' strengths, best use cases, access routes, and key features.

  • Kimi K2.6 vs DeepSeek V2 vs GPT-5.5 vs Claude Opus 4.7: an image comparing 2026 coding-reasoning benchmark criteria and testing strategies.

  • Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4: Agentic Coding Benchmarks (2026).

  • The Best AI Models So Far in 2026 (Design for Online): a ranking and benchmark comparison of models released in February 2026, including Gemini 3.1 Pro, Claude Sonnet 4.6, and Grok 4.20.


Sources