
GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: Which Is the Strongest AI Model?


A practical comparison of leading AI models depends on the benchmark, the variant, the reasoning setting, and the API price.

Don't rush to arrange these four models into one absolute ranking. Comparisons of frontier large language models are easily misled by any single benchmark score. On the current sources, the safer reading is this: GPT-5.5 has the strongest overall ranking signal; Claude Opus 4.7 leads several hard-reasoning and software-engineering benchmarks; DeepSeek V4 has the clearest API cost advantage; and Kimi K2.6 shows real coding and agentic-workflow signals but offers less direct evidence against GPT-5.5 and Opus 4.7.[2][16][15][18][19]

At a glance

| If you care most about… | Better-supported pick | Why |
| --- | --- | --- |
| Overall intelligence ranking | GPT-5.5 | Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, above Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.[2] |
| Hard reasoning and software engineering | Claude Opus 4.7, with GPT-5.5 close behind | In VentureBeat's shared table, Claude leads on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 is stronger on Terminal-Bench 2.0 and base BrowseComp, and GPT-5.5 Pro is highest on the listed HLE-with-tools and BrowseComp rows.[16] |
| API cost | DeepSeek V4 | Mashable lists DeepSeek V4 at US$1.74 per 1M input tokens and US$3.48 per 1M output tokens, below GPT-5.5 at US$5/US$30 and Claude Opus 4.7 at US$5/US$25.[15] |
| Disclosed coding metrics | DeepSeek V4 Pro | Together AI lists DeepSeek V4 Pro at LiveCodeBench 93.5%, Codeforces 3206, SWE-Bench Verified 80.6%, and SWE-Bench Multilingual 76.2%.[25] |
| Kimi K2.6's positioning | Worth testing, not yet settled | Kimi K2.6 has coding and agentic numbers, but the main Kimi tables compare it with GPT-5.4 and Claude Opus 4.6 rather than GPT-5.5 and Claude Opus 4.7.[18][19] |

The aggregate leaderboard: GPT-5.5 has the clearest signal

The cleanest overall ordering in the current sources is the Artificial Analysis Intelligence Index summary: GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, with Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.[2]

Kimi K2.6 sits below that GPT-5.5/Claude tier in the visible aggregate snippets. OpenRouter lists Kimi K2.6 at Intelligence 53.9, Coding 47.1, and Agentic 66.0; LLMBase's DeepSeek V4 Flash High vs Kimi K2.6 comparison likewise lists Kimi at Intelligence 53.9 and Coding 47.1.[3][1] The same LLMBase comparison lists DeepSeek V4 Flash High at Intelligence 44.9 and Coding 39.8, but that is the Flash variant and cannot stand in for DeepSeek V4 Pro or Pro-Max.[1]

So the conclusion available here is limited: the overall ranking signal favoring GPT-5.5 over Claude Opus 4.7 is relatively clear, but the current sources offer no single leaderboard that scores GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro-Max, and Kimi K2.6 under the same conditions.[2]
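
To make the variant caveat concrete, here is a minimal Python sketch that collates the aggregate scores quoted above and sorts them. The dictionary is illustrative and mixes vendors, variants, and reasoning settings exactly the way the source snippets do, which is precisely the problem:

```python
# Aggregate scores quoted above (Artificial Analysis [2]; OpenRouter [3];
# LLMBase [1]). Illustrative only: entries mix variants and reasoning
# settings, so this is not a single like-for-like leaderboard.
scores = {
    "GPT-5.5 (xhigh)": 60,
    "GPT-5.5 (high)": 59,
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": 57,
    "Kimi K2.6": 53.9,
    "DeepSeek V4 Flash (High)": 44.9,  # Flash only; says nothing about V4 Pro / Pro-Max
}

# Sort highest first, as a leaderboard page would display them.
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:>5}  {model}")
```

The last entry is the trap: a low Flash score next to a Pro-Max price quote can make a "DeepSeek V4" conclusion wrong in both directions.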

Head-to-head benchmarks: Claude and GPT-5.5 take different battlegrounds

VentureBeat's shared table is currently the best same-row data for comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where listed, and Claude Opus 4.7.[16]

| Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro (where listed) | Claude Opus 4.7 | Best in this source |
| --- | --- | --- | --- | --- | --- |
| GPQA Diamond | 90.1% | 93.6% | — | 94.2% | Claude Opus 4.7[16] |
| Humanity's Last Exam, no tools | 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7[16] |
| Humanity's Last Exam, with tools | 48.2% | 52.2% | 57.2% | 54.7% | GPT-5.5 Pro[16] |
| Terminal-Bench 2.0 | 67.9% | 82.7% | — | 69.4% | GPT-5.5[16] |
| SWE-Bench Pro | 55.4% | 58.6% | — | 64.3% | Claude Opus 4.7[16] |
| BrowseComp | 83.4% | 84.4% | 90.1% | 79.3% | GPT-5.5 Pro[16] |
| MCP Atlas / MCPAtlas Public | 73.6% | 75.3% | — | 79.1% | Claude Opus 4.7[16] |

This is not a sweep; the wins split by benchmark. Claude Opus 4.7 has the stronger evidence on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 takes Terminal-Bench 2.0 and base BrowseComp, and GPT-5.5 Pro is highest on the HLE-with-tools and BrowseComp rows VentureBeat lists.[16]

DeepSeek-V4-Pro-Max comes close on several rows but never beats the best GPT-5.5 or Claude Opus 4.7 result in this shared table. Its closest row is BrowseComp: DeepSeek-V4-Pro-Max at 83.4% against 84.4% for GPT-5.5 and 79.3% for Claude Opus 4.7.[16]
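
The "best in this source" column above is nothing more than a per-row maximum over the cells the source actually lists. A minimal sketch that reproduces it from the table (None marks a cell VentureBeat does not list):

```python
# Rows transcribed from the shared VentureBeat table [16].
rows = {
    "GPQA Diamond":       {"DeepSeek-V4-Pro-Max": 90.1, "GPT-5.5": 93.6, "GPT-5.5 Pro": None, "Claude Opus 4.7": 94.2},
    "HLE, no tools":      {"DeepSeek-V4-Pro-Max": 37.7, "GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1, "Claude Opus 4.7": 46.9},
    "HLE, with tools":    {"DeepSeek-V4-Pro-Max": 48.2, "GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2, "Claude Opus 4.7": 54.7},
    "Terminal-Bench 2.0": {"DeepSeek-V4-Pro-Max": 67.9, "GPT-5.5": 82.7, "GPT-5.5 Pro": None, "Claude Opus 4.7": 69.4},
    "SWE-Bench Pro":      {"DeepSeek-V4-Pro-Max": 55.4, "GPT-5.5": 58.6, "GPT-5.5 Pro": None, "Claude Opus 4.7": 64.3},
    "BrowseComp":         {"DeepSeek-V4-Pro-Max": 83.4, "GPT-5.5": 84.4, "GPT-5.5 Pro": 90.1, "Claude Opus 4.7": 79.3},
    "MCP Atlas":          {"DeepSeek-V4-Pro-Max": 73.6, "GPT-5.5": 75.3, "GPT-5.5 Pro": None, "Claude Opus 4.7": 79.1},
}

for bench, cells in rows.items():
    # Drop the cells the source does not list, then take the per-row max.
    listed = {model: score for model, score in cells.items() if score is not None}
    winner = max(listed, key=listed.get)
    print(f"{bench:<20} -> {winner} ({listed[winner]}%)")
```

Running it prints four Claude wins, one GPT-5.5 win, and two GPT-5.5 Pro wins, matching the split described above.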

Coding: are you fixing a repo, grinding contests, or building an agent?

If the task looks like repository-level software engineering, Claude Opus 4.7 is strongest on VentureBeat's shared SWE-Bench Pro row: 64.3%, above GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4%.[16]

If instead you care about competitive programming, code generation, and multilingual software engineering, DeepSeek V4 Pro has some of the most complete disclosed coding metrics among this article's sources. Together AI lists DeepSeek V4 Pro at LiveCodeBench 93.5%, Codeforces 3206, SWE-Bench Verified 80.6%, and SWE-Bench Multilingual 76.2%.[25] NVIDIA's model card also breaks out multiple reasoning settings for DeepSeek V4 Flash and V4 Pro, and shows V4-Pro Max at LiveCodeBench 93.5 and Codeforces 3206.[31]

Kimi K2.6 also has coding evidence worth reading, just with less direct head-to-head coverage. Lorka's table lists Kimi K2.6 at SWE-Bench Pro 58.6%, HLE-Full with tools 54.0%, GPQA-Diamond 90.5%, and MMMU-Pro 79.4%, but that table mainly compares it with GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.[18] Verdent lists Kimi K2.6 at SWE-Bench Verified 80.2%, Terminal-Bench 2.0 66.7%, HLE with tools 54.0%, and LiveCodeBench v6 89.6%, and notes that Opus 4.7 now leads SWE-Bench Verified at 87.6%.[19]

In other words, Kimi K2.6 belongs on the candidate list for coding agents and agentic pipelines; but on the direct evidence available, you cannot yet say it beats GPT-5.5 or Claude Opus 4.7 overall.[18][19]

Pricing: DeepSeek V4's advantage is the most straightforward

If API cost is the core concern, DeepSeek V4 has the clearest price argument. All prices below are per 1M tokens; a token is the basic unit a provider meters as the model processes text.[15][1]

| Model or variant | Input price | Output price | Notes |
| --- | --- | --- | --- |
| GPT-5.5 | US$5 / 1M tokens | US$30 / 1M tokens | Mashable lists a 1M context window in this comparison.[15] |
| Claude Opus 4.7 | US$5 / 1M tokens | US$25 / 1M tokens | Mashable lists a 1M context window in this comparison.[15] |
| DeepSeek V4 | US$1.74 / 1M tokens | US$3.48 / 1M tokens | Mashable lists a 1M context window in this comparison.[15] |
| DeepSeek V4 Flash | US$0.14 / 1M tokens | US$0.28 / 1M tokens | LLMBase also lists a blended price of US$0.18.[1] |
| Kimi K2.6 | US$0.95 / 1M tokens | US$4.00 / 1M tokens | LLMBase also lists a blended price of US$1.71.[1] |

A price table, however, is not the same thing as the actual limits of every endpoint. Mashable lists DeepSeek V4, GPT-5.5, and Claude Opus 4.7 all at a 1M context window in its comparison, yet OpenRouter's DeepSeek V4 Pro page shows 256K max tokens and 66K max output tokens.[15][3] Before going live, confirm which provider, which variant, and which reasoning tier you are calling, and what the real context and output limits are.
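
To turn the price table into bills, here is a minimal sketch. The per-1M-token prices come from the table above; the monthly workload is hypothetical, and "blended" follows LLMBase's 3:1 input-to-output convention:

```python
# Listed USD prices per 1M tokens (Mashable [15]; LLMBase [1]).
PRICES = {  # model -> (input, output)
    "GPT-5.5":           (5.00, 30.00),
    "Claude Opus 4.7":   (5.00, 25.00),
    "DeepSeek V4":       (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Kimi K2.6":         (0.95, 4.00),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """USD cost for input_m / output_m million tokens per month."""
    inp, out = PRICES[model]
    return input_m * inp + output_m * out

def blended(model: str) -> float:
    """Blended per-1M-token price at a 3:1 input:output ratio (LLMBase convention)."""
    inp, out = PRICES[model]
    return (3 * inp + out) / 4

# Hypothetical workload: 300M input and 100M output tokens per month.
for model in PRICES:
    print(f"{model:<18} ${monthly_cost(model, 300, 100):>9,.2f}/mo  blended ${blended(model):.2f}/1M")
```

At that hypothetical 300M-input/100M-output workload, GPT-5.5 works out to US$4,500 per month against US$870 for DeepSeek V4, which is the whole cost argument in one line. The blended figures also reproduce LLMBase's US$0.18 for V4 Flash and US$1.71 for Kimi K2.6.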

How to choose among the four

GPT-5.5: the steadiest pick when you need a high-end general-purpose default

If your decision rests on overall ranking signal, GPT-5.5 is the best-supported default. Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, the two highest Intelligence Index positions visible in this article's sources.[2]

In VentureBeat's shared table, GPT-5.5 also reaches 82.7% on Terminal-Bench 2.0 and 84.4% on base BrowseComp, while GPT-5.5 Pro reaches 90.1% on the listed BrowseComp row.[16]

Claude Opus 4.7: strong at hard reasoning and repo-level engineering

Claude Opus 4.7 ranks slightly below GPT-5.5 overall but stays in the top tier: Artificial Analysis lists Claude Opus 4.7 Adaptive Reasoning Max Effort at an Intelligence Index of 57.[2] In VentureBeat's shared table, it leads GPT-5.5 and DeepSeek-V4-Pro-Max on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas.[16]

Anthropic's own announcement also cites internal research-agent results for Claude Opus 4.7, including a tie for the top overall score of 0.715 across six modules and a General Finance score of 0.813, up from Opus 4.6's 0.767.[17] Treat such internal benchmarks as supporting background, not as neutral leaderboards.[17]

DeepSeek V4: most attractive for cost-sensitive or token-heavy workloads

DeepSeek V4's most obvious advantage is price. In Mashable's comparison, DeepSeek V4 costs US$1.74 per 1M input tokens and US$3.48 per 1M output tokens, versus US$5/US$30 for GPT-5.5 and US$5/US$25 for Claude Opus 4.7.[15]

DeepSeek V4 Pro's coding metrics are no pushover either: Together AI lists LiveCodeBench 93.5%, Codeforces 3206, SWE-Bench Verified 80.6%, and SWE-Bench Multilingual 76.2%.[25] The trade-off is that DeepSeek-V4-Pro-Max still trails the best GPT-5.5 or Claude Opus 4.7 result on every row of VentureBeat's shared table, even where it comes very close, as on BrowseComp.[16]

Kimi K2.6: worth adding to coding-agent evaluations, but don't crown it early

The difficulty with Kimi K2.6 is that the main Kimi-focused tables compare it with GPT-5.4 and Claude Opus 4.6, not GPT-5.5 and Claude Opus 4.7.[18][19] The signal itself is not weak: OpenRouter lists Kimi K2.6 at Intelligence 53.9, Coding 47.1, and Agentic 66.0, and Verdent lists SWE-Bench Verified 80.2% and LiveCodeBench v6 89.6%.[3][19]

The practical conclusion is not that Kimi K2.6 falls short, but that the direct evidence is thin. If its price, deployment path, or agent behavior fits your stack, run your own tests; the current sources simply cannot support calling it the overall winner of the four.[18][19]

Close these gaps before you choose

  • Variant names matter. DeepSeek V4 appears in the sources as V4, V4 Flash, V4 Pro, and DeepSeek-V4-Pro-Max; prices, limits, and scores shift with the variant and the reasoning setting.[1][15][25][31]
  • Don't mix reasoning tiers. GPT-5.5 has xhigh and high settings; Claude Opus 4.7 has Adaptive Reasoning Max Effort; DeepSeek V4 Pro likewise has multiple reasoning modes plus a Max setting.[2][25][31]
  • Direct Kimi comparisons are scarce. The available Kimi K2.6 strength tables mostly compare against GPT-5.4 and Claude Opus 4.6 and cannot be extrapolated automatically to GPT-5.5 and Claude Opus 4.7.[18][19]
  • The Humanity's Last Exam no-tools snippets disagree. LLM Stats and VentureBeat both list GPT-5.5 at 41.4% and Claude Opus 4.7 at 46.9%; Mashable's GPT-vs-Claude snippet lists GPT-5.5 at 40.6% and Opus 4.7 at 31.2%.[7][16][9]
  • Internal benchmarks are not neutral leaderboards. Anthropic's Opus 4.7 announcement includes internal research-agent scores, but read them separately from cross-vendor public comparisons.[17]
  • Pricing and context length depend on the endpoint. The same model family can show different context windows, max tokens, and max output tokens across provider pages.[3][15]

Bottom line

Pick by what you optimize for:

  • GPT-5.5, if you weight the available overall Intelligence Index signal most heavily.[2]
  • Claude Opus 4.7, if your work resembles the hard-reasoning and software-engineering rows: GPQA Diamond, HLE without tools, SWE-Bench Pro, MCP Atlas.[16]
  • DeepSeek V4, if cost-effectiveness matters most and you can first verify the exact V4 variant you will use; its listed API prices are well below GPT-5.5 and Claude Opus 4.7, and DeepSeek V4 Pro posts strong coding metrics.[15][25]
  • Kimi K2.6, as a coding and agentic candidate worth testing, but not one to call the overall champion of the four while direct evidence is thin.[18][19]
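
If you want that decision rule as code, here is a minimal sketch; the priority keys are hypothetical names invented for illustration, and the mapping encodes this article's reading of the sources, not any vendor's recommendation:

```python
# Hypothetical priority keys -> this article's reading of the sources.
PICK_BY_PRIORITY = {
    "aggregate_intelligence": "GPT-5.5",                          # [2]
    "hard_reasoning_and_repo_swe": "Claude Opus 4.7",             # [16]
    "api_cost": "DeepSeek V4 (verify the exact variant first)",   # [15][25]
    "agentic_coding_to_test": "Kimi K2.6 (run your own evals)",   # [18][19]
}

print(PICK_BY_PRIORITY["api_cost"])
```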


Key takeaways

  • GPT-5.5 has the strongest overall Intelligence Index signal at Artificial Analysis: 60 at xhigh and 59 at high; Claude Opus 4.7 Adaptive Reasoning Max Effort sits at 57.[2]
  • Claude Opus 4.7 leads VentureBeat's shared table on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 is stronger on Terminal-Bench 2.0 and some BrowseComp results.[16]
  • If API cost matters, DeepSeek V4's listed prices are the most favorable: US$1.74 per 1M input tokens and US$3.48 per 1M output tokens, below GPT-5.5 and Claude Opus 4.7.[15]
  • Kimi K2.6 has coding and agentic metrics worth reading, but the main Kimi tables compare it with GPT-5.4 and Claude Opus 4.6, so they cannot show it beating GPT-5.5 or Opus 4.7 directly.[18][19]

People also ask

What is the short answer to "GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: which is the strongest AI model?"

GPT-5.5 has the strongest overall Intelligence Index signal at Artificial Analysis: 60 at xhigh and 59 at high; Claude Opus 4.7 Adaptive Reasoning Max Effort sits at 57.[2]

What is most worth verifying first?

GPT-5.5 has the strongest overall Intelligence Index signal at Artificial Analysis: 60 at xhigh and 59 at high, against 57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.[2] Claude Opus 4.7 leads VentureBeat's shared table on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 is stronger on Terminal-Bench 2.0 and some BrowseComp results.[16]

What should you do next in practice?

If API cost matters, DeepSeek V4's listed prices are the most favorable: US$1.74 per 1M input tokens and US$3.48 per 1M output tokens, below GPT-5.5 and Claude Opus 4.7.[15]

Which related topic is worth exploring next?

Continue with "Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 benchmark comparison: who deserves trust in 2026?" to see more cited sources from another angle.


What should I compare this against?

Cross-check this answer against "DeepSeek V4 engineering deep dive: 1M context, MoE, and API migration notes."


Further research

Research conversation

Research question

Research benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 and compare as comprehensively as possible

Studio Global AI · 32 sources

Answer with citations

Using only the supplied evidence, the strongest benchmark signal is that GPT-5.5 leads the available aggregate Intelligence Index, with GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, while Claude Opus 4.7 Adaptive Reasoning Max Effort scores 57.[5] DeepSeek V4 Pro has the richest disclosed task-level benchmark data in the evidence, including 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] For Kimi K2.6, the supplied evidence contains positioning and pricing but no comparable numeric benchmark scores, so a full apples-to-apples benchmark ranking is not supported.[1][3]

Key findings

  • GPT-5.5 is the best-supported leader on the only cross-model aggregate benchmark available here: GPT-5.5 xhigh ranks first with an Intelligence Index of 60, and GPT-5.5 high ranks second with 59.[5]

  • Claude Opus 4.7 is close behind on that same aggregate measure: Claude Opus 4.7 Adaptive Reasoning Max Effort ranks third with an Intelligence Index of 57.[5]

  • DeepSeek V4 Pro has the clearest disclosed coding/SWE benchmark profile in the supplied evidence: 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6]

  • Kimi K2.6 is described as an “open-source flagship” agent-coding model in the launch/news evidence, but the supplied evidence does not provide numeric benchmark scores for Kimi K2.6.[1]

  • DeepSeek V4 is materially cheaper than GPT-5.5 in the pricing evidence: DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens, while GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens, with both shown as 1M-context models.[4]

  • DeepSeek V4 Flash is far cheaper than Kimi K2.6 in the direct pricing comparison: DeepSeek V4 Flash is listed at $0.14 per 1M input tokens, $0.28 per 1M output tokens, and $0.18 blended, while Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.[3]

  • The supplied evidence treats DeepSeek V4 as a family with multiple variants, including V4 Flash and V4 Pro, so comparisons depend on which variant is meant.[2][3][6]

Benchmark and capability comparison

| Area | GPT-5.5 | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 |
| --- | --- | --- | --- | --- |
| Aggregate Intelligence Index | 60 for GPT-5.5 xhigh; 59 for GPT-5.5 high.[5] | 57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.[5] | No numeric Intelligence Index provided in the supplied evidence. | No numeric Intelligence Index provided in the supplied evidence. |
| Coding / SWE benchmarks | No exact coding benchmark scores provided in the supplied evidence. | Launch evidence cites improved programming and a threefold vision upgrade, but no exact benchmark numbers.[1] | DeepSeek V4 Pro is listed with 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] | Positioned as an open-source flagship agent-coding model, but no exact benchmark numbers provided.[1] |
| Context window evidence | Listed with a 1M context window in the pricing comparison.[4] | The supplied pricing snippet does not clearly give a context window. | Listed with a 1M context window in one comparison; V4 Flash / Pro are also described as 1M-context options in enterprise guidance.[2][4] | No context-window figure provided in the supplied evidence. |
| Pricing evidence | $5 per 1M input tokens and $30 per 1M output tokens.[4] | $5 per 1M input tokens is visible in the supplied snippet, but the output price is truncated.[4] | $1.74 per 1M input and $3.48 per 1M output for V4; V4 Flash at $0.14 input, $0.28 output, $0.18 blended.[3][4] | $0.95 per 1M input tokens, $4.00 per 1M output tokens, $1.71 blended.[3] |
| Best-supported use case from evidence | Highest aggregate intelligence among the four where evidence is available.[5] | High-end reasoning close to GPT-5.5 on the available Intelligence Index.[5] | Strongest supplied coding/SWE evidence and strong cost positioning, especially the Flash and Pro variants.[2][3][6] | Potentially strong open-source agent-coding positioning, but benchmark evidence is insufficient.[1] |

Pricing and value takeaways

  • On the available aggregate benchmark, GPT-5.5 leads Claude Opus 4.7 by 3 points at the xhigh setting and by 2 points at the high setting.[5]

  • On the available API pricing comparison, GPT-5.5 costs about 2.9x DeepSeek V4 for input tokens and about 8.6x DeepSeek V4 for output tokens.[4]

  • In the DeepSeek V4 Flash vs Kimi K2.6 pricing comparison, Kimi K2.6’s blended price of $1.71 per 1M tokens is about 9.5x DeepSeek V4 Flash’s $0.18 blended price; this multiple and the two above are recomputed in the sketch after this list.[3]

  • One enterprise-oriented comparison recommends DeepSeek V4 Flash for high-volume text summarization, customer-service classification, and internal knowledge-base Q&A because of low cost and 1M context.[2]

  • The same enterprise-oriented comparison recommends DeepSeek V4 Pro for complex document analysis, cross-department workflows, and automated agents because it is described as lower-cost than closed flagship models while offering high-end open-weight capability and 1M context.[2]
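
A quick recomputation of those multiples from the listed per-1M-token prices (a minimal sketch; the bracketed numbers follow this research answer's evidence numbering):

```python
# Recompute the quoted price multiples from the listed per-1M-token prices.
print(f"GPT-5.5 vs DeepSeek V4, input:  {5.00 / 1.74:.1f}x")   # ~2.9x [4]
print(f"GPT-5.5 vs DeepSeek V4, output: {30.00 / 3.48:.1f}x")  # ~8.6x [4]
print(f"Kimi K2.6 vs V4 Flash, blended: {1.71 / 0.18:.1f}x")   # ~9.5x [3]
```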

Evidence notes

  • The strongest single benchmark source in the supplied evidence is the Intelligence Index listing, which ranks GPT-5.5 xhigh first at 60, GPT-5.5 high second at 59, and Claude Opus 4.7 Adaptive Reasoning Max Effort third at 57.[5]

  • The strongest task-level benchmark evidence is for DeepSeek V4 Pro, because it includes exact coding and software-engineering metrics rather than only positioning language.[6]

  • The Kimi K2.6 evidence is mostly launch positioning and pricing, not benchmark results.[1][3]

  • A user post describes four one-shot head-to-head experiments comparing Codex on GPT-5.5 with Claude Code on Opus 4.7, but the supplied snippet does not provide the outcomes or benchmark scores.[31]

  • There is conflicting or variant-specific DeepSeek context/pricing evidence: one comparison lists DeepSeek V4 with a 1M context window, while another provider listing for DeepSeek V4 Pro shows 256K max tokens and 66K max output tokens.[4][7]

Limitations / uncertainty

  • Insufficient evidence for a full, comprehensive benchmark comparison across Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6.

  • Insufficient evidence to rank DeepSeek V4 or Kimi K2.6 against GPT-5.5 and Claude Opus 4.7 on the same aggregate Intelligence Index, because the supplied excerpt only provides numeric Intelligence Index scores for GPT-5.5 and Claude Opus 4.7 among the requested models.[5]

  • Insufficient evidence to compare all four models on the same coding benchmarks, because only DeepSeek V4 Pro has detailed numeric coding/SWE benchmark scores in the supplied evidence.[6]

  • Insufficient evidence to compare multimodal performance across all four models, because the supplied evidence only includes a launch-positioning claim that Claude Opus 4.7 has a threefold vision upgrade and does not provide comparable multimodal benchmark numbers for the others.[1]

Summary

GPT-5.5 has the strongest supplied aggregate benchmark result, with Intelligence Index scores of 60 and 59 depending on reasoning setting.[5] Claude Opus 4.7 is close behind at 57 and is the only other requested model with a numeric aggregate score in the supplied evidence.[5] DeepSeek V4 Pro has the best-supported coding/SWE benchmark profile, while DeepSeek V4 Flash and V4 Pro appear especially strong on cost and long-context enterprise use cases.[2][3][6] Kimi K2.6 may be competitively positioned as an open-source agent-coding model, but the supplied evidence does not include enough numeric benchmark data to verify how it compares against GPT-5.5, Claude Opus 4.7, or DeepSeek V4.[1][3]

Sources

  • [1] Compare DeepSeek V4 Flash (Reasoning, High Effort) vs Kimi K2.6 | AI Model Comparison (llmbase.ai)

    Metric DeepSeek logo De DeepSeek V4 Flash (Reasoning, High Effort) DeepSeek Kimi logo Ki Kimi K2.6 Kimi --- Pricing per 1M tokens Input Cost $0.14/1M $0.95/1M Output Cost $0.28/1M $4.00/1M Blended (3:1) $0.18/1M $1.71/1M Specifications Organization DeepSeek...

  • [2] DeepSeek V4 Pro (Reasoning, High Effort) vs Kimi K2.6: Model Comparison (artificialanalysis.ai)

    What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...

  • [3] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)

    Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...

  • [7] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)

    Reasoning & knowledge Benchmark GPT-5.5 Opus 4.7 Lead --- --- GPQA Diamond 93.6% 94.2% Opus +0.6 HLE (no tools) 41.4% 46.9% Opus +5.5 HLE (with tools) 52.2% 54.7% Opus +2.5 The HLE no-tools margin (+5.5pp) is the most informative entry in the table because...

  • [9] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)

    Thanks for signing up! SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent\ Humanity's...

  • [15] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)

    Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...

  • [16] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)

    BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...

  • [17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)

    Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...

  • [18] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)

    Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...

  • [19] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)

    Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

  • [25] DeepSeek V4 Pro API - Together AI (together.ai)

    Coding & Software Engineering: • 93.5% LiveCodeBench and Codeforces 3206 for competitive and production code generation • 80.6% SWE-Bench Verified for autonomous software engineering across repositories • 76.2% SWE-Bench Multilingual for cross-language soft...

  • [31] deepseek-v4-pro Model by Deepseek-ai | NVIDIA NIM - NVIDIA Build (build.nvidia.com)

    Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max --- --- --- Knowledge & Reasoning MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5 SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9 Chinese-SimpleQA...