GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: Which Is Most Worth Using?

A practical comparison of leading AI models depends on the benchmark, variant, reasoning setting, and API price.

When comparing frontier AI models, the easiest trap is treating a single benchmark as an overall championship. The safer reading is this: GPT-5.5 has the strongest overall-ranking signal; Claude Opus 4.7 comes out ahead on several hard-reasoning and software-engineering rows; DeepSeek V4 has the clearest API cost advantage; and Kimi K2.6 is worth watching for coding and agentic work, but there is less evidence pitting it directly against GPT-5.5 and Opus 4.7.[2][16][15][18][19]

Quick conclusions

If you care most about… | Better-grounded pick | Why
Overall intelligence ranking signal | GPT-5.5 | Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, above Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.[2]
Hard reasoning and software engineering | Claude Opus 4.7, with GPT-5.5 close behind | In VentureBeat's shared table, Claude leads GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5 wins Terminal-Bench 2.0 and base BrowseComp, while GPT-5.5 Pro leads HLE with tools and BrowseComp where listed.[16]
Flagship-tier API cost | DeepSeek V4 | Mashable lists DeepSeek V4 at $1.74 per 1M input tokens and $3.48 per 1M output tokens, below GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.[15]
Disclosed coding/competitive-programming data | DeepSeek V4 Pro | Together AI lists DeepSeek V4 Pro at 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[25]
A Kimi K2.6 evaluation | Promising, but not settled | Kimi K2.6 has coding and agentic data, but most of the available Kimi evidence compares it with GPT-5.4 and Claude Opus 4.6, not directly with GPT-5.5 and Claude Opus 4.7.[18][19]

Overall rankings: GPT-5.5 has the edge

Among the available sources, the cleanest overall signal comes from Artificial Analysis, which lists GPT-5.5 xhigh with an Intelligence Index of 60 and GPT-5.5 high at 59, against 57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.[2]

In the visible aggregate-metric snippets, Kimi K2.6 sits below this GPT-5.5/Claude tier. OpenRouter lists Kimi K2.6 at 53.9 Intelligence, 47.1 Coding, and 66.0 Agentic; LLMBase's DeepSeek V4 Flash High vs Kimi K2.6 comparison also lists Kimi at 53.9 Intelligence and 47.1 Coding.[3][1] The same LLMBase comparison lists DeepSeek V4 Flash High at 44.9 Intelligence and 39.8 Coding, but note that this is the Flash variant, not DeepSeek V4 Pro or Pro-Max.[1]

The key point: the overall rankings give a clear signal for GPT-5.5 vs Claude Opus 4.7, but there is currently no single, complete leaderboard that lists GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro-Max, and Kimi K2.6 side by side.[2]

Shared benchmarks: Claude and GPT-5.5 split the wins

Among the available sources, VentureBeat's shared benchmark table is the best fit for comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where listed, and Claude Opus 4.7 on the same rows.[16]

Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro (where shown) | Claude Opus 4.7 | Best result in this source
GPQA Diamond | 90.1% | 93.6% | — | 94.2% | Claude Opus 4.7 [16]
Humanity’s Last Exam, no tools | 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7 [16]
Humanity’s Last Exam, with tools | 48.2% | 52.2% | 57.2% | 54.7% | GPT-5.5 Pro [16]
Terminal-Bench 2.0 | 67.9% | 82.7% | — | 69.4% | GPT-5.5 [16]
SWE-Bench Pro | 55.4% | 58.6% | — | 64.3% | Claude Opus 4.7 [16]
BrowseComp | 83.4% | 84.4% | 90.1% | 79.3% | GPT-5.5 Pro [16]
MCP Atlas | 73.6% | 75.3% | — | 79.1% | Claude Opus 4.7 [16]

So it is not one-sided. Claude Opus 4.7 is more convincing on GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas; base GPT-5.5 is stronger on Terminal-Bench 2.0 and BrowseComp; and GPT-5.5 Pro posts the highest HLE with tools and BrowseComp scores where VentureBeat lists it.[16]

DeepSeek-V4-Pro-Max stays close on several rows, but in this shared table it never beats the best GPT-5.5 or Claude Opus 4.7 result. The closest row is BrowseComp: DeepSeek-V4-Pro-Max 83.4%, GPT-5.5 84.4%, Claude Opus 4.7 79.3%.[16]

Coding: it depends on what you are writing

For repository-level software-engineering tasks, Claude Opus 4.7 posts the strongest SWE-Bench Pro result in VentureBeat's shared table: 64.3%, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4%.[16]

That said, DeepSeek V4 Pro has the most complete coding-metric disclosure among the available sources. Together AI lists DeepSeek V4 Pro at 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[25] NVIDIA's model card also breaks out GPQA Diamond, HLE, LiveCodeBench, Codeforces, and other benchmarks by variant (DeepSeek V4 Flash, V4 Pro, and so on), with V4-Pro Max showing LiveCodeBench 93.5 and Codeforces 3206.[31]

Kimi K2.6 also has coding evidence worth noting, but the strongest Kimi-focused tables mostly compare it against previous-generation or earlier rivals. Lorka lists Kimi K2.6 at 58.6% SWE-Bench Pro, 54.0% HLE-Full with tools, 90.5% GPQA-Diamond, and 79.4% MMMU-Pro, with comparison columns for GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.[18] Verdent lists Kimi K2.6 at 80.2% SWE-Bench Verified, 66.7% Terminal-Bench 2.0, 54.0% HLE with tools, and 89.6% LiveCodeBench v6, and notes that Opus 4.7 now leads SWE-Bench Verified at 87.6%.[19]

The practical takeaway: Kimi K2.6 belongs on your shortlist for coding and agent-workflow testing, but the available evidence does not show it beating GPT-5.5 or Claude Opus 4.7 overall.[18][19]

Pricing: DeepSeek V4's cost advantage is the clearest

If API cost is a core consideration, DeepSeek V4 makes the most persuasive price case. Mashable lists DeepSeek V4 at $1.74 per 1M input tokens and $3.48 per 1M output tokens, against $5/$30 for GPT-5.5 and $5/$25 for Claude Opus 4.7.[15]

Model/variant | Listed input price | Listed output price | Notes
GPT-5.5 | $5 per 1M tokens | $30 per 1M tokens | Mashable lists a 1M context window in this comparison.[15]
Claude Opus 4.7 | $5 per 1M tokens | $25 per 1M tokens | Mashable lists a 1M context window in this comparison.[15]
DeepSeek V4 | $1.74 per 1M tokens | $3.48 per 1M tokens | Mashable lists a 1M context window in this comparison.[15]
DeepSeek V4 Flash | $0.14 per 1M tokens | $0.28 per 1M tokens | LLMBase's DeepSeek V4 Flash High vs Kimi K2.6 comparison lists a blended price of $0.18.[1]
Kimi K2.6 | $0.95 per 1M tokens | $4.00 per 1M tokens | LLMBase lists a blended price of $1.71 in the same comparison.[1]
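The blended figures above follow LLMBase's "Blended (3:1)" convention, i.e. a 3:1 input-to-output token mix. A minimal sketch of that calculation, plus a cost estimate for a hypothetical workload (the 10M/2M daily token volumes are illustrative assumptions, not from any source):

```python
# Blended (3:1) price per 1M tokens: a 3:1 input-to-output token mix.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (3 * input_per_m + output_per_m) / 4

print(blended_price(0.14, 0.28))  # DeepSeek V4 Flash -> 0.175, listed as $0.18 [1]
print(blended_price(0.95, 4.00))  # Kimi K2.6 -> 1.7125, listed as $1.71 [1]

# Total USD cost for a workload, with prices quoted per 1M tokens.
def workload_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Hypothetical daily workload: 10M input + 2M output tokens.
print(workload_cost(10_000_000, 2_000_000, 5.00, 30.00))  # GPT-5.5: 110.0
print(workload_cost(10_000_000, 2_000_000, 5.00, 25.00))  # Claude Opus 4.7: 100.0
print(workload_cost(10_000_000, 2_000_000, 1.74, 3.48))   # DeepSeek V4: 24.36
```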

But do not assume every endpoint shares the same context limit. Mashable's pricing comparison lists DeepSeek V4, GPT-5.5, and Claude Opus 4.7 all with a 1M context window, yet OpenRouter's DeepSeek V4 Pro listing shows 256K max tokens and a 66K max output.[15][3] Before going to production, verify the provider, model variant, and reasoning mode you will actually call.
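One lightweight way to run that check is to read limits straight from the provider's model catalog rather than from third-party tables. A minimal sketch against OpenRouter's public models endpoint; the response fields follow OpenRouter's documented shape, and the `deepseek` substring filter is just an illustrative way to surface the relevant listings:

```python
import requests

# Print context limits from OpenRouter's public model catalog.
# Response shape ({"data": [{"id", "context_length", "top_provider": ...}]})
# follows OpenRouter's documented models API; verify against current docs.
resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
resp.raise_for_status()

for model in resp.json()["data"]:
    if "deepseek" in model["id"].lower():
        top = model.get("top_provider") or {}
        print(model["id"],
              "| context:", model.get("context_length"),
              "| max output:", top.get("max_completion_tokens"))
```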

Model-by-model guidance

GPT-5.5: the default if you want the most defensible overall ranking

If your decision leans mainly on the available overall rankings, GPT-5.5 is the safer default. Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, the top two Intelligence Index scores in the available snippet.[2]

It also has two standout rows in VentureBeat's shared table: 82.7% on Terminal-Bench 2.0 and 84.4% on BrowseComp for base GPT-5.5, with GPT-5.5 Pro reaching 90.1% on BrowseComp where listed.[16]

Claude Opus 4.7: a better fit for hard reasoning and many software-engineering tasks

Claude Opus 4.7 sits just behind GPT-5.5 on the overall rankings: Artificial Analysis lists its Adaptive Reasoning Max Effort configuration at an Intelligence Index of 57.[2] In VentureBeat's shared table, it leads both GPT-5.5 and DeepSeek-V4-Pro-Max on GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas.[16]

Anthropic's own launch material also cites an internal research-agent benchmark: Claude Opus 4.7 tied for the top overall score across six modules at 0.715, and scored 0.813 on General Finance, up from Opus 4.6's 0.767.[17] These are vendor-internal benchmarks, though, and should be read as supplementary context rather than neutral leaderboard evidence.[17]

DeepSeek V4: the price-performance standout, if the variant fits the job

DeepSeek V4's clearest advantage is price. In Mashable's comparison it is listed at $1.74/$3.48 per 1M input/output tokens, well below GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.[15]

DeepSeek V4 Pro also has strong coding metrics, including Together AI's listed 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[25] The trade-off: in VentureBeat's shared table, DeepSeek-V4-Pro-Max still falls short of the best GPT-5.5 or Claude Opus 4.7 result on every row, even where it comes close, as on BrowseComp.[16]

Kimi K2.6: worth trying for coding/agents, but the four-way evidence is thin

Kimi K2.6 is the hardest to call, because the available Kimi-focused benchmarks mostly compare it against GPT-5.4 and Claude Opus 4.6 rather than directly against GPT-5.5 and Claude Opus 4.7.[18][19] Its signal is not weak, though: OpenRouter lists Kimi K2.6 at 53.9 Intelligence, 47.1 Coding, and 66.0 Agentic, and Verdent lists 80.2% SWE-Bench Verified and 89.6% LiveCodeBench v6.[3][19]

So the right reading is not that Kimi K2.6 is weak; it is that the direct evidence is thin. If Kimi's pricing, deployment path, or agentic behavior suits your existing stack, it is worth a hands-on trial, but on the available data it cannot be called an overall winner over GPT-5.5 or Claude Opus 4.7.[18][19]

Before you choose

  • Variant names matter. DeepSeek V4 appears in the sources as V4, V4 Flash, V4 Pro, and DeepSeek-V4-Pro-Max; prices, limits, and benchmark results vary by variant and reasoning setting.[1][15][25][31]
  • Kimi comparisons are less direct. The strongest available Kimi K2.6 benchmark tables usually compare against GPT-5.4 and Claude Opus 4.6 rather than GPT-5.5 and Claude Opus 4.7.[18][19]
  • Humanity’s Last Exam no-tools numbers are inconsistent. LLM Stats and VentureBeat report GPT-5.5 at 41.4% and Claude Opus 4.7 at 46.9%, but Mashable's GPT vs Claude snippet reports GPT-5.5 at 40.6% and Opus 4.7 at 31.2%.[7][16][9]
  • Internal benchmarks are not independent leaderboards. Anthropic's Opus 4.7 launch post includes internal research-agent results; read them separately from cross-vendor public comparisons.[17]
  • Prices and context limits vary by provider. The same model family can have different context windows, token limits, and output caps at different endpoints.[3][15]

Bottom line

If you care most about the available overall intelligence rankings, GPT-5.5 is the better-grounded pick.[2] If your work looks like GPQA Diamond, HLE no-tools, SWE-Bench Pro, or MCP Atlas, Claude Opus 4.7 is more convincing.[16] If you value price-performance and can verify the specific V4 variant you will use, DeepSeek V4's API pricing is clearly below GPT-5.5's and Claude Opus 4.7's, and DeepSeek V4 Pro has strong coding metrics.[15][25] As for Kimi K2.6, treat it as a promising coding/agentic candidate; the available direct evidence does not support calling it an overall winner over GPT-5.5 or Claude Opus 4.7.[18][19]


Key points

  • On the overall Intelligence Index, Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.[2]
  • Shared benchmarks split the wins: Claude Opus 4.7 takes GPQA Diamond, HLE no-tools, SWE-Bench Pro, and MCP Atlas; GPT-5.5/GPT-5.5 Pro take Terminal-Bench 2.0, BrowseComp, and HLE with tools (Pro).[16]
  • If you are cost-sensitive, look at DeepSeek V4: Mashable lists it at $1.74/$3.48 per 1M input/output tokens, below GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.[15]


Research conversation

Research question

Research benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 and compare as comprehensively as possible

Studio Global AI · 32 sources

Cited answer

Using only the supplied evidence, the strongest benchmark signal is that GPT-5.5 leads the available aggregate Intelligence Index, with GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, while Claude Opus 4.7 Adaptive Reasoning Max Effort scores 57.[5] DeepSeek V4 Pro has the richest disclosed task-level benchmark data in the evidence, including 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] For Kimi K2.6, the supplied evidence contains positioning and pricing but no comparable numeric benchmark scores, so a full apples-to-apples benchmark ranking is not supported.[1][3]

Key findings

  • GPT-5.5 is the best-supported leader on the only cross-model aggregate benchmark available here: GPT-5.5 xhigh ranks first with an Intelligence Index of 60, and GPT-5.5 high ranks second with 59.[5]

  • Claude Opus 4.7 is close behind on that same aggregate measure: Claude Opus 4.7 Adaptive Reasoning Max Effort ranks third with an Intelligence Index of 57.[5]

  • DeepSeek V4 Pro has the clearest disclosed coding/SWE benchmark profile in the supplied evidence: 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6]

  • Kimi K2.6 is described as an “open-source flagship” agent-coding model in the launch/news evidence, but the supplied evidence does not provide numeric benchmark scores for Kimi K2.6.[1]

  • DeepSeek V4 is materially cheaper than GPT-5.5 in the pricing evidence: DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens, while GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens, with both shown as 1M-context models.[4]

  • DeepSeek V4 Flash is far cheaper than Kimi K2.6 in the direct pricing comparison: DeepSeek V4 Flash is listed at $0.14 per 1M input tokens, $0.28 per 1M output tokens, and $0.18 blended, while Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.[3]

  • The supplied evidence treats DeepSeek V4 as a family with multiple variants, including V4 Flash and V4 Pro, so comparisons depend on which variant is meant.[2][3][6]

Benchmark and capability comparison

Area | GPT-5.5 | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6
Aggregate Intelligence Index | 60 for GPT-5.5 xhigh; 59 for GPT-5.5 high.[5] | 57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.[5] | No numeric Intelligence Index for DeepSeek V4 is provided in the supplied evidence. | No numeric Intelligence Index for Kimi K2.6 is provided in the supplied evidence.
Coding / SWE benchmarks | No exact coding benchmark scores are provided in the supplied evidence. | The launch evidence says Claude Opus 4.7 has improved programming and a threefold vision upgrade, but no exact benchmark numbers are provided.[1] | DeepSeek V4 Pro is listed with 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.[6] | Kimi K2.6 is positioned as an open-source flagship agent-coding model, but no exact benchmark numbers are provided.[1]
Context window evidence | GPT-5.5 is listed with a 1M context window in the pricing comparison.[4] | The supplied pricing snippet does not clearly provide Claude Opus 4.7’s context window. | DeepSeek V4 is listed with a 1M context window in one comparison, and DeepSeek V4 Flash / Pro are also described as 1M-context options in enterprise guidance.[2][4] | No context-window figure for Kimi K2.6 is provided in the supplied evidence.
Pricing evidence | $5 per 1M input tokens and $30 per 1M output tokens.[4] | $5 per 1M input tokens is visible in the supplied snippet, but the output price is truncated.[4] | DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens; DeepSeek V4 Flash is listed at $0.14 input, $0.28 output, and $0.18 blended.[3][4] | Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.[3]
Best-supported use case from evidence | Highest aggregate intelligence among the four where evidence is available.[5] | High-end reasoning close to GPT-5.5 on the available Intelligence Index.[5] | Strongest supplied coding/SWE evidence and strong cost positioning, especially for Flash and Pro variants.[2][3][6] | Potentially strong open-source agent-coding positioning, but benchmark evidence is insufficient.[1]

Pricing and value takeaways

  • On the available aggregate benchmark, GPT-5.5 leads Claude Opus 4.7 by 3 points at the xhigh setting and by 2 points at the high setting.[5]

  • On the available API pricing comparison, GPT-5.5 costs about 2.9x DeepSeek V4 for input tokens and about 8.6x DeepSeek V4 for output tokens; these multipliers are checked in the sketch after this list.[4]

  • In the DeepSeek V4 Flash vs Kimi K2.6 pricing comparison, Kimi K2.6’s blended price of $1.71 per 1M tokens is about 9.5x DeepSeek V4 Flash’s $0.18 blended price.[3]

  • One enterprise-oriented comparison recommends DeepSeek V4 Flash for high-volume text summarization, customer-service classification, and internal knowledge-base Q&A because of low cost and 1M context.[2]

  • The same enterprise-oriented comparison recommends DeepSeek V4 Pro for complex document analysis, cross-department workflows, and automated agents because it is described as lower-cost than closed flagship models while offering high-end open-weight capability and 1M context.[2]
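A quick arithmetic check on the multipliers quoted above, using nothing beyond the per-1M prices already cited in this answer:

```python
# Verify the quoted price multipliers from the listed per-1M-token prices.
gpt55_in, gpt55_out = 5.00, 30.00          # GPT-5.5 [4]
ds_v4_in, ds_v4_out = 1.74, 3.48           # DeepSeek V4 [4]
kimi_blended, flash_blended = 1.71, 0.18   # blended (3:1) prices [3]

print(round(gpt55_in / ds_v4_in, 1))           # 2.9x on input tokens
print(round(gpt55_out / ds_v4_out, 1))         # 8.6x on output tokens
print(round(kimi_blended / flash_blended, 1))  # 9.5x blended, Kimi vs V4 Flash
```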

Evidence notes

  • The strongest single benchmark source in the supplied evidence is the Intelligence Index listing, which ranks GPT-5.5 xhigh first at 60, GPT-5.5 high second at 59, and Claude Opus 4.7 Adaptive Reasoning Max Effort third at 57.[5]

  • The strongest task-level benchmark evidence is for DeepSeek V4 Pro, because it includes exact coding and software-engineering metrics rather than only positioning language.[6]

  • The Kimi K2.6 evidence is mostly launch positioning and pricing, not benchmark results.[1][3]

  • A user post describes four one-shot head-to-head experiments comparing Codex on GPT-5.5 with Claude Code on Opus 4.7, but the supplied snippet does not provide the outcomes or benchmark scores.[31]

  • There is conflicting or variant-specific DeepSeek context/pricing evidence: one comparison lists DeepSeek V4 with a 1M context window, while another provider listing for DeepSeek V4 Pro shows 256K max tokens and 66K max output tokens.[4][7]

Limitations / uncertainty

  • Insufficient evidence for a full, comprehensive benchmark comparison across Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6.

  • Insufficient evidence to rank DeepSeek V4 or Kimi K2.6 against GPT-5.5 and Claude Opus 4.7 on the same aggregate Intelligence Index, because the supplied excerpt only provides numeric Intelligence Index scores for GPT-5.5 and Claude Opus 4.7 among the requested models.[5]

  • Insufficient evidence to compare all four models on the same coding benchmarks, because only DeepSeek V4 Pro has detailed numeric coding/SWE benchmark scores in the supplied evidence.[6]

  • Insufficient evidence to compare multimodal performance across all four models, because the supplied evidence only includes a launch-positioning claim that Claude Opus 4.7 has a threefold vision upgrade and does not provide comparable multimodal benchmark numbers for the others.[1]

Summary

GPT-5.5 has the strongest supplied aggregate benchmark result, with Intelligence Index scores of 60 and 59 depending on reasoning setting.[5] Claude Opus 4.7 is close behind at 57 and is the only other requested model with a numeric aggregate score in the supplied evidence.[5] DeepSeek V4 Pro has the best-supported coding/SWE benchmark profile, while DeepSeek V4 Flash and V4 Pro appear especially strong on cost and long-context enterprise use cases.[2][3][6] Kimi K2.6 may be competitively positioned as an open-source agent-coding model, but the supplied evidence does not include enough numeric benchmark data to verify how it compares against GPT-5.5, Claude Opus 4.7, or DeepSeek V4.[1][3]

Sources

  • [1] Compare DeepSeek V4 Flash (Reasoning, High Effort) vs Kimi K2.6 | AI Model Comparison (llmbase.ai)

    Metric DeepSeek logo De DeepSeek V4 Flash (Reasoning, High Effort) DeepSeek Kimi logo Ki Kimi K2.6 Kimi --- Pricing per 1M tokens Input Cost $0.14/1M $0.95/1M Output Cost $0.28/1M $4.00/1M Blended (3:1) $0.18/1M $1.71/1M Specifications Organization DeepSeek...

  • [2] DeepSeek V4 Pro (Reasoning, High Effort) vs Kimi K2.6: Model Comparison (artificialanalysis.ai)

    What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...

  • [3] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)

    Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...

  • [7] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)

    Reasoning & knowledge Benchmark GPT-5.5 Opus 4.7 Lead --- --- GPQA Diamond 93.6% 94.2% Opus +0.6 HLE (no tools) 41.4% 46.9% Opus +5.5 HLE (with tools) 52.2% 54.7% Opus +2.5 The HLE no-tools margin (+5.5pp) is the most informative entry in the table because...

  • [9] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)

    Thanks for signing up! SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent\ Humanity's...

  • [15] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)

    Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...

  • [16] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)

    BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...

  • [17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)

    Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...

  • [18] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)

    Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...

  • [19] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)

    Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

  • [25] DeepSeek V4 Pro API - Together AI (together.ai)

    Coding & Software Engineering: • 93.5% LiveCodeBench and Codeforces 3206 for competitive and production code generation • 80.6% SWE-Bench Verified for autonomous software engineering across repositories • 76.2% SWE-Bench Multilingual for cross-language soft...

  • [31] deepseek-v4-pro Model by Deepseek-ai | NVIDIA NIM - NVIDIA Build (build.nvidia.com)

    Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max --- --- --- Knowledge & Reasoning MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5 SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9 Chinese-SimpleQA...