GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: benchmark and pricing comparison

There is no universal winner: GPT-5.5 leads the available Artificial Analysis Intelligence Index at 60/59, Claude Opus 4.7 wins several shared VentureBeat reasoning and SWE rows, and DeepSeek V4 is the price-value outlier on listed API cost. For coding, Claude leads VentureBeat’s shared SWE-Bench Pro row at 64.3%, while DeepSeek V4 Pro has the strongest disclosed LiveCodeBench figure in the available sources at 93.5%, plus a Codeforces rating of 3206.

Chart: Artificial Analysis Intelligence Index with a breakdown of open-weights versus proprietary models, with key data points highlighted around Claude Opus 4.7 (from “Kimi K2.6: The new leading open weights model”).

Frontier-model comparisons are easy to overstate because the answer changes with the benchmark, reasoning setting, provider endpoint, and price model. The available evidence supports a practical verdict: GPT-5.5 has the strongest aggregate ranking signal, Claude Opus 4.7 wins several hard reasoning and software-engineering rows, DeepSeek V4 is the standout on listed API cost, and Kimi K2.6 is promising but has less direct head-to-head coverage against GPT-5.5 and Claude Opus 4.7.[2][16][15][18][19]

Quick verdict

| Priority | Best-supported pick | Why |
| --- | --- | --- |
| Highest aggregate intelligence score | GPT-5.5 | Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.[2] |
| Shared current-generation reasoning and SWE rows | Mixed: Claude Opus 4.7 and GPT-5.5 | In VentureBeat’s shared table, Claude leads GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 leads Terminal-Bench 2.0 and base BrowseComp, while GPT-5.5 Pro leads HLE with tools and BrowseComp where that variant is shown.[16] |
| Lowest listed flagship API cost | DeepSeek V4 | Mashable lists DeepSeek V4 at $1.74 per 1M input tokens and $3.48 per 1M output tokens, versus GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.[15] |
| Disclosed coding and competitive-programming metrics | DeepSeek V4 Pro | Together AI lists DeepSeek V4 Pro at 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[25] |
| Kimi K2.6 evaluation | Promising, but not settled against GPT-5.5 and Opus 4.7 | Kimi K2.6 has useful coding and agentic numbers, but much of the available Kimi-focused evidence compares it with GPT-5.4 and Claude Opus 4.6 rather than GPT-5.5 and Claude Opus 4.7.[18][19] |

What the aggregate rankings say

The clearest aggregate signal favors GPT-5.5. Artificial Analysis lists GPT-5.5 xhigh first with an Intelligence Index of 60 and GPT-5.5 high second at 59; Claude Opus 4.7 Adaptive Reasoning Max Effort is listed at 57.[2]

Kimi K2.6 appears below that GPT-5.5/Claude tier in the available composite snippets: OpenRouter shows Kimi K2.6 at 53.9 Intelligence, 47.1 Coding, and 66.0 Agentic, while LLMBase’s DeepSeek V4 Flash High vs Kimi K2.6 comparison lists Kimi at 53.9 Intelligence and 47.1 Coding.[3][1] That same LLMBase comparison lists DeepSeek V4 Flash High at 44.9 Intelligence and 39.8 Coding, but that is the Flash variant, not DeepSeek V4 Pro or Pro-Max.[1]

The important caveat: the available Artificial Analysis top-model snippet gives a clean ranking for GPT-5.5 and Claude Opus 4.7, but it does not provide a single complete four-way leaderboard row for GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro-Max, and Kimi K2.6 together.[2]

Shared benchmark results: Claude and GPT-5.5 split the wins

VentureBeat provides the most useful direct table across DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro (where shown), and Claude Opus 4.7.[16]

| Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro, where shown | Claude Opus 4.7 | Best result in that source |
| --- | --- | --- | --- | --- | --- |
| GPQA Diamond | 90.1% | 93.6% | not shown | 94.2% | Claude Opus 4.7[16] |
| Humanity’s Last Exam, no tools | 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7[16] |
| Humanity’s Last Exam, with tools | 48.2% | 52.2% | 57.2% | 54.7% | GPT-5.5 Pro[16] |
| Terminal-Bench 2.0 | 67.9% | 82.7% | not shown | 69.4% | GPT-5.5[16] |
| SWE-Bench Pro / SWE Pro | 55.4% | 58.6% | not shown | 64.3% | Claude Opus 4.7[16] |
| BrowseComp | 83.4% | 84.4% | 90.1% | 79.3% | GPT-5.5 Pro[16] |
| MCP Atlas / MCPAtlas Public | 73.6% | 75.3% | not shown | 79.1% | Claude Opus 4.7[16] |

Read that table as a split decision, not a sweep. Claude Opus 4.7 has the stronger case on difficult reasoning and repository-style software engineering in this source, especially GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas.[16] GPT-5.5 has the stronger base-model results on Terminal-Bench 2.0 and BrowseComp, and GPT-5.5 Pro is higher where VentureBeat includes it for HLE with tools and BrowseComp.[16]

DeepSeek-V4-Pro-Max is competitive in the same table but does not beat the best GPT-5.5 or Claude Opus 4.7 result on those shared rows. Its closest shared row is BrowseComp, where it scores 83.4% versus GPT-5.5 at 84.4% and Claude Opus 4.7 at 79.3%.[16]
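
For readers who want to sanity-check the “split decision” reading, the short sketch below re-encodes the shared VentureBeat rows exactly as listed above and tallies which model holds the best reported score per row. It assumes higher is better on every row and treats the rows where GPT-5.5 Pro is not shown as missing rather than zero; it is a bookkeeping aid, not an additional benchmark.

```python
# Minimal sketch: tally per-row leaders in the shared VentureBeat table [16].
# Scores are the percentages quoted above; None marks rows where a variant
# (here GPT-5.5 Pro) is not reported. Higher is assumed to be better.
from collections import Counter

shared_rows = {
    "GPQA Diamond":       {"DeepSeek-V4-Pro-Max": 90.1, "GPT-5.5": 93.6, "GPT-5.5 Pro": None, "Claude Opus 4.7": 94.2},
    "HLE (no tools)":     {"DeepSeek-V4-Pro-Max": 37.7, "GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1, "Claude Opus 4.7": 46.9},
    "HLE (with tools)":   {"DeepSeek-V4-Pro-Max": 48.2, "GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2, "Claude Opus 4.7": 54.7},
    "Terminal-Bench 2.0": {"DeepSeek-V4-Pro-Max": 67.9, "GPT-5.5": 82.7, "GPT-5.5 Pro": None, "Claude Opus 4.7": 69.4},
    "SWE-Bench Pro":      {"DeepSeek-V4-Pro-Max": 55.4, "GPT-5.5": 58.6, "GPT-5.5 Pro": None, "Claude Opus 4.7": 64.3},
    "BrowseComp":         {"DeepSeek-V4-Pro-Max": 83.4, "GPT-5.5": 84.4, "GPT-5.5 Pro": 90.1, "Claude Opus 4.7": 79.3},
    "MCP Atlas":          {"DeepSeek-V4-Pro-Max": 73.6, "GPT-5.5": 75.3, "GPT-5.5 Pro": None, "Claude Opus 4.7": 79.1},
}

wins = Counter()
for benchmark, scores in shared_rows.items():
    reported = {model: score for model, score in scores.items() if score is not None}
    leader = max(reported, key=reported.get)
    wins[leader] += 1
    print(f"{benchmark}: {leader} ({reported[leader]}%)")

print(dict(wins))  # Claude Opus 4.7 leads 4 rows, GPT-5.5 Pro 2, base GPT-5.5 1
```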

Coding benchmarks: Claude leads one shared SWE row, DeepSeek has the richest disclosed coding profile

Coding is where the comparison becomes especially benchmark-dependent. In VentureBeat’s shared SWE-Bench Pro row, Claude Opus 4.7 leads at 64.3%, followed by GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4%.[16]

DeepSeek V4 Pro, however, has the most detailed disclosed coding profile in the available provider snippets. Together AI lists DeepSeek V4 Pro at 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[25] NVIDIA’s model card also breaks out DeepSeek V4 Flash and V4 Pro variants across benchmarks, including GPQA Diamond, HLE, LiveCodeBench, and Codeforces, with V4-Pro Max shown at 93.5 on LiveCodeBench and 3206 on Codeforces.[31]

Kimi K2.6 also has meaningful coding evidence, but the comparisons are usually against slightly earlier flagship models. Lorka lists Kimi K2.6 at 58.6% on SWE-Bench Pro, 54.0% on HLE-Full with tools, 90.5% on GPQA-Diamond, and 79.4% on MMMU-Pro in a table comparing it with GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.[18] Verdent lists Kimi K2.6 at 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on HLE with tools, and 89.6% on LiveCodeBench v6, while also noting that Opus 4.7 leads SWE-Bench Verified at 87.6%.[19]

That means Kimi K2.6 should be taken seriously for coding and agentic workflows, but the available evidence is not enough to declare it ahead of GPT-5.5 or Claude Opus 4.7 overall.[18][19]

Pricing: DeepSeek V4 is the clear cost outlier

If API cost matters, DeepSeek V4 has the strongest price argument in the available sources. Mashable lists DeepSeek V4 at $1.74 per 1M input tokens and $3.48 per 1M output tokens, compared with GPT-5.5 at $5 per 1M input tokens and $30 per 1M output tokens, and Claude Opus 4.7 at $5 per 1M input tokens and $25 per 1M output tokens.[15]

| Model or variant | Listed input price | Listed output price | Notes |
| --- | --- | --- | --- |
| GPT-5.5 | $5 per 1M tokens | $30 per 1M tokens | Mashable lists a 1M context window for this comparison.[15] |
| Claude Opus 4.7 | $5 per 1M tokens | $25 per 1M tokens | Mashable lists a 1M context window for this comparison.[15] |
| DeepSeek V4 | $1.74 per 1M tokens | $3.48 per 1M tokens | Mashable lists a 1M context window for this comparison.[15] |
| DeepSeek V4 Flash | $0.14 per 1M tokens | $0.28 per 1M tokens | LLMBase lists a $0.18 blended price in its DeepSeek V4 Flash High vs Kimi K2.6 comparison.[1] |
| Kimi K2.6 | $0.95 per 1M tokens | $4.00 per 1M tokens | LLMBase lists a $1.71 blended price in the same comparison.[1] |

Do not assume every endpoint has the same context limit. Mashable lists 1M context windows for DeepSeek V4, GPT-5.5, and Claude Opus 4.7 in its pricing comparison, but an OpenRouter DeepSeek V4 Pro snippet lists 256K max tokens and 66K max output tokens.[15][3] For production decisions, verify the exact provider, model variant, and reasoning mode you plan to call.
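
To see how the listed prices translate into spend, a minimal estimator like the one below can help. The per-1M-token prices are the figures listed above from Mashable and LLMBase; the workload profile (tokens per request and requests per month) is an assumed example only, and real bills also depend on caching, batching, reasoning-token accounting, and provider-specific discounts that this sketch ignores.

```python
# Rough monthly-cost estimator using the listed per-1M-token prices ([1][15]).
# The workload numbers below are illustrative assumptions, not measurements.
PRICES_PER_1M = {                       # (input $, output $) per 1M tokens
    "GPT-5.5":           (5.00, 30.00),
    "Claude Opus 4.7":   (5.00, 25.00),
    "DeepSeek V4":       (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Kimi K2.6":         (0.95, 4.00),
}

def monthly_cost(model, input_tokens, output_tokens, requests_per_month):
    """Estimated monthly spend for a fixed per-request token profile."""
    in_price, out_price = PRICES_PER_1M[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_month

# Assumed workload: 8K input tokens and 1K output tokens per call, 100K calls per month.
for model in PRICES_PER_1M:
    print(f"{model:17s} ${monthly_cost(model, 8_000, 1_000, 100_000):>10,.2f}/month")
```

Because the output-token prices diverge far more than the input-token prices, the ranking this produces shifts noticeably as you make the assumed workload more output-heavy.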

Model-by-model guidance

GPT-5.5: best aggregate signal, strong terminal and browsing rows

GPT-5.5 is the safest pick if your decision is driven by the available aggregate ranking. It holds the top two Artificial Analysis Intelligence Index positions in the provided snippet: 60 for GPT-5.5 xhigh and 59 for GPT-5.5 high.[2]

It also performs especially well on two shared task rows in VentureBeat’s table: 82.7% on Terminal-Bench 2.0 and 84.4% on BrowseComp for base GPT-5.5, with GPT-5.5 Pro shown at 90.1% on BrowseComp where that variant appears.[16]

Claude Opus 4.7: strongest on several hard reasoning and SWE rows

Claude Opus 4.7 is close behind GPT-5.5 on the aggregate ranking, with an Artificial Analysis Intelligence Index score of 57 for the Adaptive Reasoning Max Effort setting.[2] In VentureBeat’s shared table, it leads GPT-5.5 and DeepSeek-V4-Pro-Max on GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas.[16]

Anthropic’s launch material also highlights internal research-agent results, including a tied top overall score of 0.715 across six modules and a General Finance score of 0.813 versus 0.767 for Opus 4.6, but those are internal benchmark claims rather than a neutral public leaderboard.[17]

DeepSeek V4: strongest value case, with serious coding numbers

DeepSeek V4’s most obvious advantage is price. In Mashable’s comparison, its listed input and output prices are far below GPT-5.5 and Claude Opus 4.7: $1.74 input and $3.48 output per 1M tokens versus GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.[15]

DeepSeek V4 Pro also has a strong disclosed coding profile, including 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual in Together AI’s listing.[25] The tradeoff is that DeepSeek-V4-Pro-Max trails the top GPT-5.5 or Claude Opus 4.7 result on the shared VentureBeat rows, even when it is close on BrowseComp.[16]

Kimi K2.6: credible coding and agentic signals, but fewer direct current-gen comparisons

Kimi K2.6 is harder to place in a four-way ranking because the available Kimi-focused benchmark tables mostly compare it with GPT-5.4 and Claude Opus 4.6 rather than GPT-5.5 and Claude Opus 4.7.[18][19] Still, the signals are not weak: OpenRouter lists Kimi K2.6 at 53.9 Intelligence, 47.1 Coding, and 66.0 Agentic, while Verdent lists 80.2% SWE-Bench Verified and 89.6% LiveCodeBench v6.[3][19]

The right conclusion is not that Kimi K2.6 is outclassed; it is that the available direct evidence is thinner. If Kimi’s cost, deployment route, or agentic behavior fits your stack, it deserves evaluation, but the sources here do not support calling it the overall winner against GPT-5.5 or Claude Opus 4.7.[18][19]

Caveats before choosing a model

  • Variant names matter. DeepSeek V4 appears in the sources as V4, V4 Flash, V4 Pro, and DeepSeek-V4-Pro-Max, and the benchmark results differ by variant and reasoning setting.[1][15][25][31]
  • Kimi comparisons are less direct. The strongest Kimi K2.6 benchmark tables in the available evidence often compare it with GPT-5.4 and Claude Opus 4.6, not GPT-5.5 and Claude Opus 4.7.[18][19]
  • Humanity’s Last Exam no-tools is inconsistent across snippets. LLM Stats and VentureBeat report GPT-5.5 at 41.4% and Claude Opus 4.7 at 46.9%, while a Mashable GPT-vs-Claude snippet reports GPT-5.5 at 40.6% and Opus 4.7 at 31.2%.[7][16][9]
  • Internal benchmarks are useful but not the same as independent leaderboards. Anthropic’s Opus 4.7 launch material reports internal research-agent gains, but those results should be read differently from cross-provider public comparisons.[17]
  • Pricing and context limits are provider-specific. The same model family can appear with different context windows or token limits depending on endpoint and listing, so check the exact listing you plan to use; one way to do that is sketched below.[3][15]
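
One practical way to check a specific listing is to query a provider catalog directly. The sketch below pulls the listed context length and per-token pricing from OpenRouter's public model index; the /api/v1/models route and its fields are as documented by OpenRouter at the time of writing and should be verified against current docs, and the model slug shown is an assumption used purely for illustration.

```python
# Sketch: look up a model's listed context length and pricing in OpenRouter's
# public model index before committing to an endpoint. Field names follow
# OpenRouter's documented response shape; verify against current docs.
import json
import urllib.request

MODELS_URL = "https://openrouter.ai/api/v1/models"
SLUG = "deepseek/deepseek-v4-pro"  # assumed slug for illustration; replace with the exact variant you intend to call

with urllib.request.urlopen(MODELS_URL) as resp:
    catalog = json.load(resp)["data"]

match = next((m for m in catalog if m["id"] == SLUG), None)
if match is None:
    print(f"{SLUG} is not listed; check the exact variant name and provider")
else:
    pricing = match["pricing"]  # USD per single token, returned as strings
    print("context_length:", match.get("context_length"))
    print("input  $/1M tokens:", float(pricing["prompt"]) * 1_000_000)
    print("output $/1M tokens:", float(pricing["completion"]) * 1_000_000)
```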

Bottom line

Pick GPT-5.5 if the available aggregate intelligence ranking is your top criterion.[2] Pick Claude Opus 4.7 if your workload resembles the shared hard reasoning and software-engineering rows where it leads, including GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas.[16] Pick DeepSeek V4 if price-performance is central and you can validate the exact V4 variant you plan to use; its listed API pricing is far lower than GPT-5.5 and Claude Opus 4.7, and DeepSeek V4 Pro has strong disclosed coding metrics.[15][25] Treat Kimi K2.6 as a credible coding and agentic candidate, but not as a proven overall winner against GPT-5.5 or Claude Opus 4.7 based on the available direct evidence.[18][19]

Key takeaways

  • There is no universal winner: GPT-5.5 leads the available Artificial Analysis Intelligence Index at 60/59, Claude Opus 4.7 wins several shared VentureBeat reasoning and SWE rows, and DeepSeek V4 is the price-value outlier on listed API cost.[2][15][16]
  • For coding, Claude leads VentureBeat’s shared SWE-Bench Pro row at 64.3%, while DeepSeek V4 Pro has the strongest disclosed LiveCodeBench figure in the available sources at 93.5%, plus a Codeforces rating of 3206.[16][25]
  • Treat context length, pricing, and benchmark scores as variant specific: DeepSeek V4, V4 Flash, V4 Pro, and V4 Pro Max are not interchangeable endpoints, and provider limits differ.[1][3][15][31]

Supporting visuals

Chart: comparison of official results across multiple AI benchmark and evaluation categories (from “Kimi K2.6: The new leading open weights model”).
Image: transportation-problem optimal solution showing the shipping plan, cost breakdown, and MODI optimality verification, generated by Claude Opus 4.7.

People also ask

What is the short answer to "GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: benchmark and pricing comparison"?

There is no universal winner: GPT-5.5 leads the available Artificial Analysis Intelligence Index at 60/59, Claude Opus 4.7 wins several shared VentureBeat reasoning and SWE rows, and DeepSeek V4 is the price-value outlier on listed API cost.

What are the key points to validate first?

There is no universal winner: GPT-5.5 leads the available Artificial Analysis Intelligence Index at 60/59, Claude Opus 4.7 wins several shared VentureBeat reasoning and SWE rows, and DeepSeek V4 is the price-value outlier on listed API cost. For coding, Claude leads VentureBeat’s shared SWE-Bench Pro row at 64.3%, while DeepSeek V4 Pro has the strongest disclosed LiveCodeBench figure in the available sources at 93.5%, plus a Codeforces rating of 3206.[16][25]

What should I do next in practice?

Treat context length, pricing, and benchmark scores as variant specific: DeepSeek V4, V4 Flash, V4 Pro, and V4 Pro Max are not interchangeable endpoints, and provider limits differ.[1][3][15][31]

Which related topic should I explore next?

Continue with "PocketOS Database Deletion: What the Reported Claude/Cursor Incident Shows About AI-Agent Risk" for another angle and extra citations.


What should I compare this against?

Cross-check this answer against "GPT-5.5 vs Claude Opus 4.7 vs Kimi K2.6 vs DeepSeek V4: Benchmarks Compared".


Continue your research

Research conversation

You

Research question

Research benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 and compare as comprehensively as possible

Studio Global AI (32 sources)

Cited answer

Using only the supplied evidence, the strongest benchmark signal is that GPT-5.5 leads the available aggregate Intelligence Index, with GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, while Claude Opus 4.7 Adaptive Reasoning Max Effort scores 57.[5] DeepSeek V4 Pro has the richest disclosed task-level benchmark data in the evidence, including 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] For Kimi K2.6, the supplied evidence contains positioning and pricing but no comparable numeric benchmark scores, so a full apples-to-apples benchmark ranking is not supported.[1][3]

Key findings

  • GPT-5.5 is the best-supported leader on the only cross-model aggregate benchmark available here: GPT-5.5 xhigh ranks first with an Intelligence Index of 60, and GPT-5.5 high ranks second with 59.[5]

  • Claude Opus 4.7 is close behind on that same aggregate measure: Claude Opus 4.7 Adaptive Reasoning Max Effort ranks third with an Intelligence Index of 57.[5]

  • DeepSeek V4 Pro has the clearest disclosed coding/SWE benchmark profile in the supplied evidence: 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6]

  • Kimi K2.6 is described as an “open-source flagship” agent-coding model in the launch/news evidence, but the supplied evidence does not provide numeric benchmark scores for Kimi K2.6.[1]

  • DeepSeek V4 is materially cheaper than GPT-5.5 in the pricing evidence: DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens, while GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens, with both shown as 1M-context models.[4]

  • DeepSeek V4 Flash is far cheaper than Kimi K2.6 in the direct pricing comparison: DeepSeek V4 Flash is listed at $0.14 per 1M input tokens, $0.28 per 1M output tokens, and $0.18 blended, while Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.[3]

  • The supplied evidence treats DeepSeek V4 as a family with multiple variants, including V4 Flash and V4 Pro, so comparisons depend on which variant is meant.[2][3][6]

Benchmark and capability comparison

| Area | GPT-5.5 | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 |
| --- | --- | --- | --- | --- |
| Aggregate Intelligence Index | 60 for GPT-5.5 xhigh; 59 for GPT-5.5 high.[5] | 57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.[5] | No numeric Intelligence Index for DeepSeek V4 is provided in the supplied evidence. | No numeric Intelligence Index for Kimi K2.6 is provided in the supplied evidence. |
| Coding / SWE benchmarks | No exact coding benchmark scores are provided in the supplied evidence. | The launch evidence says Claude Opus 4.7 has improved programming and a threefold vision upgrade, but no exact benchmark numbers are provided.[1] | DeepSeek V4 Pro is listed with 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] | Kimi K2.6 is positioned as an open-source flagship agent-coding model, but no exact benchmark numbers are provided.[1] |
| Context window evidence | GPT-5.5 is listed with a 1M context window in the pricing comparison.[4] | The supplied pricing snippet does not clearly provide Claude Opus 4.7’s context window. | DeepSeek V4 is listed with a 1M context window in one comparison, and DeepSeek V4 Flash / Pro are also described as 1M-context options in enterprise guidance.[2][4] | No context-window figure for Kimi K2.6 is provided in the supplied evidence. |
| Pricing evidence | $5 per 1M input tokens and $30 per 1M output tokens.[4] | $5 per 1M input tokens is visible in the supplied snippet, but the output price is truncated.[4] | DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens; DeepSeek V4 Flash is listed at $0.14 input, $0.28 output, and $0.18 blended.[3][4] | Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.[3] |
| Best-supported use case from evidence | Highest aggregate intelligence among the four where evidence is available.[5] | High-end reasoning close to GPT-5.5 on the available Intelligence Index.[5] | Strongest supplied coding/SWE evidence and strong cost positioning, especially for Flash and Pro variants.[2][3][6] | Potentially strong open-source agent-coding positioning, but benchmark evidence is insufficient.[1] |

Pricing and value takeaways

  • On the available aggregate benchmark, GPT-5.5 leads Claude Opus 4.7 by 3 points at the xhigh setting and by 2 points at the high setting.[5]

  • On the available API pricing comparison, GPT-5.5 costs about 2.9x DeepSeek V4 for input tokens and about 8.6x DeepSeek V4 for output tokens; these multiples, and the blended prices below, are reproduced in the sketch after this list.[4]

  • In the DeepSeek V4 Flash vs Kimi K2.6 pricing comparison, Kimi K2.6’s blended price of $1.71 per 1M tokens is about 9.5x DeepSeek V4 Flash’s $0.18 blended price.[3]

  • One enterprise-oriented comparison recommends DeepSeek V4 Flash for high-volume text summarization, customer-service classification, and internal knowledge-base Q&A because of low cost and 1M context.[2]

  • The same enterprise-oriented comparison recommends DeepSeek V4 Pro for complex document analysis, cross-department workflows, and automated agents because it is described as lower-cost than closed flagship models while offering high-end open-weight capability and 1M context.[2]
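
As a quick check, the multiples quoted in this list follow directly from the listed per-1M prices. The short sketch below reproduces them, including the 3:1 input-to-output blended price shown in the LLMBase listing; that 3:1 weighting is the convention used in that listing, not a universal one.

```python
# Reproducing the price ratios quoted above from the listed per-1M prices.
# Blended price uses the 3:1 input:output weighting shown in the LLMBase listing [3].
def blended_3_to_1(input_price, output_price):
    return (3 * input_price + 1 * output_price) / 4

print(round(5.00 / 1.74, 1))        # ~2.9x: GPT-5.5 vs DeepSeek V4, input price [4]
print(round(30.00 / 3.48, 1))       # ~8.6x: GPT-5.5 vs DeepSeek V4, output price [4]
print(blended_3_to_1(0.14, 0.28))   # ~0.18: DeepSeek V4 Flash blended $/1M [3]
print(blended_3_to_1(0.95, 4.00))   # ~1.71: Kimi K2.6 blended $/1M [3]
print(round(1.71 / 0.18, 1))        # ~9.5x: blended-price ratio, Kimi K2.6 vs V4 Flash [3]
```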

Evidence notes

  • The strongest single benchmark source in the supplied evidence is the Intelligence Index listing, which ranks GPT-5.5 xhigh first at 60, GPT-5.5 high second at 59, and Claude Opus 4.7 Adaptive Reasoning Max Effort third at 57.[5]

  • The strongest task-level benchmark evidence is for DeepSeek V4 Pro, because it includes exact coding and software-engineering metrics rather than only positioning language.[6]

  • The Kimi K2.6 evidence is mostly launch positioning and pricing, not benchmark results.[1][3]

  • A user post describes four one-shot head-to-head experiments comparing Codex on GPT-5.5 with Claude Code on Opus 4.7, but the supplied snippet does not provide the outcomes or benchmark scores.[31]

  • There is conflicting or variant-specific DeepSeek context/pricing evidence: one comparison lists DeepSeek V4 with a 1M context window, while another provider listing for DeepSeek V4 Pro shows 256K max tokens and 66K max output tokens.[4][7]

Limitations / uncertainty

  • Insufficient evidence for a full, comprehensive benchmark comparison across Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6.

  • Insufficient evidence to rank DeepSeek V4 or Kimi K2.6 against GPT-5.5 and Claude Opus 4.7 on the same aggregate Intelligence Index, because the supplied excerpt only provides numeric Intelligence Index scores for GPT-5.5 and Claude Opus 4.7 among the requested models.[5]

  • Insufficient evidence to compare all four models on the same coding benchmarks, because only DeepSeek V4 Pro has detailed numeric coding/SWE benchmark scores in the supplied evidence.[6]

  • Insufficient evidence to compare multimodal performance across all four models, because the supplied evidence only includes a launch-positioning claim that Claude Opus 4.7 has a threefold vision upgrade and does not provide comparable multimodal benchmark numbers for the others.[1]

Summary

GPT-5.5 has the strongest supplied aggregate benchmark result, with Intelligence Index scores of 60 and 59 depending on reasoning setting.[5] Claude Opus 4.7 is close behind at 57 and is the only other requested model with a numeric aggregate score in the supplied evidence.[5] DeepSeek V4 Pro has the best-supported coding/SWE benchmark profile, while DeepSeek V4 Flash and V4 Pro appear especially strong on cost and long-context enterprise use cases.[2][3][6] Kimi K2.6 may be competitively positioned as an open-source agent-coding model, but the supplied evidence does not include enough numeric benchmark data to verify how it compares against GPT-5.5, Claude Opus 4.7, or DeepSeek V4.[1][3]

Sources

  • [1] Compare DeepSeek V4 Flash (Reasoning, High Effort) vs Kimi K2.6 | AI Model Comparison (llmbase.ai)

    Metric DeepSeek logo De DeepSeek V4 Flash (Reasoning, High Effort) DeepSeek Kimi logo Ki Kimi K2.6 Kimi --- Pricing per 1M tokens Input Cost $0.14/1M $0.95/1M Output Cost $0.28/1M $4.00/1M Blended (3:1) $0.18/1M $1.71/1M Specifications Organization DeepSeek...

  • [2] DeepSeek V4 Pro (Reasoning, High Effort) vs Kimi K2.6: Model Comparison (artificialanalysis.ai)

    What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...

  • [3] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)

    Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...

  • [7] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)

    Reasoning & knowledge Benchmark GPT-5.5 Opus 4.7 Lead --- --- GPQA Diamond 93.6% 94.2% Opus +0.6 HLE (no tools) 41.4% 46.9% Opus +5.5 HLE (with tools) 52.2% 54.7% Opus +2.5 The HLE no-tools margin (+5.5pp) is the most informative entry in the table because...

  • [9] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)

    Thanks for signing up! SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent\ Humanity's...

  • [15] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)

    Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...

  • [16] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)

    BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...

  • [17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)

    Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...

  • [18] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)

    Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...

  • [19] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)

    Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

  • [25] DeepSeek V4 Pro API - Together AI (together.ai)

    Coding & Software Engineering: • 93.5% LiveCodeBench and Codeforces 3206 for competitive and production code generation • 80.6% SWE-Bench Verified for autonomous software engineering across repositories • 76.2% SWE-Bench Multilingual for cross-language soft...

  • [31] deepseek-v4-pro Model by Deepseek-ai | NVIDIA NIM - NVIDIA Build (build.nvidia.com)

    Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max --- --- --- Knowledge & Reasoning MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5 SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9 Chinese-SimpleQA...