GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: Benchmark Winners by Category

Claude Opus 4.7 leads GPQA Diamond at 94.2% and the no-tools Humanity’s Last Exam at 46.9%, while GPT-5.5 takes Terminal-Bench 2.0 at 82.7% [4][5]. GPT-5.5 Pro leads the tool-assisted HLE at 57.2% and BrowseComp at 90.1%; DeepSeek-V4-Pro-Max is competitive but wins no single row in the main table [4].


Looking only at leaderboards, this four-way matchup is easily reduced to a single question of who is strongest. But if you are putting a model into a product, an agent pipeline, or an internal evaluation, the real question is not the overall champion. It is: which benchmark does your workload most resemble?

The tidiest shared comparison table currently covers GPT-5.5, GPT-5.5 Pro on some items, Claude Opus 4.7, and DeepSeek-V4-Pro-Max. Kimi K2.6 data mostly appears in separate comparison articles or model-card digests, so a clean four-way head-to-head is not available [4][11][13].

At a glance: provisional winners by scenario

Use case | Evidence-based first pick | Reading
Scientific reasoning | Claude Opus 4.7 | 94.2% on GPQA Diamond, above GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4]
Expert reasoning without tools | Claude Opus 4.7 | 46.9% on Humanity’s Last Exam without tools, above GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% [4]
Tool-assisted exam reasoning | GPT-5.5 Pro | 57.2% on Humanity’s Last Exam with tools, above Claude Opus 4.7 at 54.7% [4]
Terminal and agentic computing | GPT-5.5 | 82.7% on Terminal-Bench 2.0, above Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5]
Operating-system operation | GPT-5.5 | 78.7% on OSWorld-Verified, slightly above Claude Opus 4.7 at 78.0% [5]
Frontier math | GPT-5.5 | 51.7% on FrontierMath Tiers 1–3, above Claude Opus 4.7 at 43.8% [5]
Software engineering in the shared table | Claude Opus 4.7 | 64.3% on SWE-Bench Pro / SWE Pro, above GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% [4]
Web browsing and comprehension | GPT-5.5 Pro | 90.1% on BrowseComp, above GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% [4]
MCP-style tool workflows | Claude Opus 4.7 | 79.1% on MCP Atlas / MCPAtlas Public, above GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% [4]
Vision and document analysis | Claude Opus 4.7 | One source reports it #1 in Vision & Document Arena, with wins in the diagram, homework, and OCR subcategories [1]
Cost-sensitive evaluation | DeepSeek V4 | VentureBeat reports near-frontier intelligence at roughly one-sixth the cost of Opus 4.7 and GPT-5.5; still verify on your own workload [4]
Least suited to a hard four-way ranking | Kimi K2.6 | Kimi has citable scores, but most come from comparison contexts separate from the main table [11][13]

The full benchmark table: keep same-table and cross-table numbers apart

Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 / V4 Pro Max | Kimi K2.6 | Safer reading
GPQA Diamond | 93.6% [4] | Not reported | 94.2% [4] | 90.1% (DeepSeek-V4-Pro-Max) [4] | Not reported | Claude leads in the shared table [4]
Humanity’s Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% (DeepSeek-V4-Pro-Max) [4] | Not reported | Claude leads in the shared table [4]
Humanity’s Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% (DeepSeek-V4-Pro-Max) [4] | 54.0% in a separate table [13] | GPT-5.5 Pro leads in the shared table [4]
Terminal-Bench 2.0 | 82.7% [4][5] | Not reported | 69.4% [4][5] | 67.9% (DeepSeek-V4-Pro-Max) [4] | 66.7% in a separate table [13] | GPT-5.5 leads [4][5]
SWE-Bench Pro / SWE Pro | 58.6% [4] | Not reported | 64.3% [4] | 55.4% (DeepSeek-V4-Pro-Max) [4] | 58.6% in a separate table [13] | Claude leads in the shared table [4]
BrowseComp | 84.4% [4] | 90.1% [4] | 79.3% [4] | 83.4% (DeepSeek-V4-Pro-Max) [4]; 83.4% for DeepSeek-V4 Pro in a separate table [11] | 83.2% in a separate table [11] | GPT-5.5 Pro leads in the shared table [4]
MCP Atlas / MCPAtlas Public | 75.3% [4] | Not reported | 79.1% [4] | 73.6% (DeepSeek-V4-Pro-Max) [4] | Not reported | Claude leads [4]
OSWorld-Verified | 78.7% [5] | Not reported | 78.0% [5] | Not reported | Not reported | GPT-5.5 narrowly leads Claude [5]
FrontierMath Tiers 1–3 | 51.7% [5] | Not reported | 43.8% [5] | Not reported | Not reported | GPT-5.5 leads Claude [5]
Vision & Document Arena | Not reported | Not reported | Reported #1 overall [1] | Not reported | Not reported | Claude has the only citable result [1]
AIME 2026 | Not reported | Not reported | Not reported | Not provided in the cited Kimi vs DeepSeek table [11] | 96.4% in Thinking mode [11] | A Kimi signal, not a four-way ranking [11]
APEX Agents | Not reported | Not reported | Not reported | Not provided in the cited Kimi vs DeepSeek table [11] | 27.9% in Thinking mode [11] | A Kimi signal, not a four-way ranking [11]
Context window | Not reported | Not reported | 1,000k tokens in one Artificial Analysis comparison [3] | 1,000k tokens for DeepSeek V4 Pro in the same comparison [3] | Not reported | Claude and DeepSeek V4 Pro tie in that comparison [3]

Treat any row that mixes sources with extra care. Kimi K2.6’s scores from a separate Kimi-focused comparison are useful reference points, but they do not deserve the same confidence as results run side by side with GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max in a single test environment [4][11][13].

GPT-5.5/GPT-5.5 Pro: strongest on terminal, OS, math, and tool workflows

GPT-5.5’s clearest win is Terminal-Bench 2.0: 82.7%, above Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5]. Among the citable numbers, this is one of the widest margins.

It also leads Claude Opus 4.7 on OSWorld-Verified, though only just: 78.7% versus 78.0% [5]. On FrontierMath Tiers 1–3, GPT-5.5’s advantage is clearer, at 51.7% versus Claude’s 43.8% [5].

If the workload centers on tool-assisted reasoning or browsing, GPT-5.5 Pro stands out more. It scores 57.2% on Humanity’s Last Exam with tools, above Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4]. On BrowseComp, GPT-5.5 Pro also leads at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% [4].

GPT-5.5 does not lead every reasoning item, however. Claude Opus 4.7 narrowly beats it on GPQA Diamond, 94.2% versus 93.6% [4]. A GPT-5.5 guide also lists GPT-5.5-only domain results, such as 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench; because the same excerpt gives no corresponding scores for Claude Opus 4.7, DeepSeek V4, or Kimi K2.6, these should not be read as four-way wins [7].

Claude Opus 4.7: strong signals on no-tools reasoning, software engineering, and documents

Claude Opus 4.7 posts the best no-tools reasoning results in the main shared table: 94.2% on GPQA Diamond and 46.9% on Humanity’s Last Exam without tools [4]. In the same table, Claude also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% [4].

Claude’s weak spot in the cited data is terminal-style operation. GPT-5.5 leads it on Terminal-Bench 2.0 by more than 13 points, 82.7% versus 69.4%, and also leads on OSWorld-Verified and FrontierMath Tiers 1–3 [4][5].

On multimodal and document work, Claude has the strongest citable signal available. One source reports Claude Opus 4.7 taking #1 in Vision & Document Arena, improving 4 points over Opus 4.6 in Document Arena and winning the diagram, homework, and OCR subcategories [1]. That source gives no same-arena numbers for GPT-5.5, DeepSeek V4, or Kimi K2.6, so it supports Claude’s document strength without establishing a full four-way multimodal ranking [1].

DeepSeek V4: rarely first in the main table, but worth testing for cost-efficiency

The data carries more than one DeepSeek label. The main shared table uses DeepSeek-V4-Pro-Max; the Artificial Analysis comparison uses DeepSeek V4 Pro and lists a 1,000k-token context window [4][3]. These names should not automatically be treated as interchangeable.

In the main shared table, DeepSeek-V4-Pro-Max is competitive but leads no row: 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public [4].

DeepSeek’s most notable cited claim is cost-efficiency rather than any single benchmark win. VentureBeat describes DeepSeek V4 as delivering near-frontier intelligence at roughly one-sixth the cost of Opus 4.7 and GPT-5.5 [4]. That is a reason to put DeepSeek on the shortlist, not a reason to skip your own testing.
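One way to make that verification concrete is to compare cost per solved task on your own workload rather than per-token price alone, since a cheaper model that fails more often gives back part of its price advantage. The sketch below is a minimal illustration of that normalization; every price, token count, and pass rate in it is a hypothetical placeholder, not a figure from the cited sources.

```python
# Hedged sketch: turn per-token pricing into cost per solved task.
# All numbers below are invented placeholders for illustration only.
def cost_per_solved_task(
    price_in_per_mtok: float,   # USD per million input tokens (assumed)
    price_out_per_mtok: float,  # USD per million output tokens (assumed)
    tokens_in: int,             # average input tokens per task on your workload
    tokens_out: int,            # average output tokens per task
    pass_rate: float,           # fraction of tasks the model actually solves
) -> float:
    per_task = (
        tokens_in / 1e6 * price_in_per_mtok
        + tokens_out / 1e6 * price_out_per_mtok
    )
    return per_task / pass_rate  # failed attempts inflate the true cost

# A model at one-sixth the per-token price but a lower pass rate is not
# automatically six times cheaper per solved task.
print(cost_per_solved_task(2.00, 8.00, 4_000, 1_000, 0.85))  # pricier, stronger
print(cost_per_solved_task(0.33, 1.33, 4_000, 1_000, 0.55))  # cheaper, weaker
```

Under these made-up inputs the cheaper model still wins, but by roughly 4x rather than 6x, which is exactly why the cost claim needs your own workload numbers behind it.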

If long context matters to you, one Artificial Analysis comparison lists DeepSeek V4 Pro and Claude Opus 4.7 at the same 1,000k-token context window [3]. That supports equal context length only for the configurations listed in that comparison; it should not be generalized to every DeepSeek or Claude mode [3].

Kimi K2.6: real highlights, but the hardest to rank cleanly four-way

Kimi K2.6 is the hardest of these models to rank directly, because it does not appear in the main shared table alongside GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max [4].

A Kimi-focused comparison lists K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 [13]. That source says the K2.6 numbers come from Moonshot AI’s official model card, but its comparison targets are mainly Claude Opus 4.6 and GPT-5.4 rather than this article’s exact four-model lineup [13].

A separate Kimi-versus-DeepSeek comparison lists Kimi K2.6 in Thinking mode at 96.4% on AIME 2026 and 27.9% on APEX Agents, and at 83.2% on BrowseComp with Thinking mode and context management [11]. In the same source, DeepSeek-V4 Pro scores 83.4% on BrowseComp, with no DeepSeek values given for AIME 2026 or APEX Agents [11].

So Kimi K2.6 is worth testing, especially for coding, agentic tasks, math, and browsing; but the available sources cannot support an overall ranking against GPT-5.5 and Claude Opus 4.7 on a single shared benchmark suite [11][13].

Which should you test first?

  • If your tasks are terminal agents, OS operation, or FrontierMath-style work, test GPT-5.5 first; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results [4][5].
  • If tool-assisted reasoning or browsing is the core, test GPT-5.5 Pro first; it leads Humanity’s Last Exam with tools and BrowseComp in the main shared table [4].
  • If the focus is GPQA-style scientific reasoning, no-tools expert Q&A, SWE-Bench Pro-style software engineering, MCP-style workflows, or document-heavy multimodal work, test Claude Opus 4.7 first [4][1].
  • If cost is your main constraint and you can run your own quality validation, put DeepSeek V4 on the shortlist first; its strongest cited advantage is the claim of roughly one-sixth the cost of Opus 4.7 and GPT-5.5 [4].
  • If you specifically want to validate Kimi K2.6’s coding, agentic, math, and browsing scores, test Kimi first; just be sure to use the same prompts, tools, context limits, latency targets, and scoring rules across every model, as in the harness sketch after this list [11][13].
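To keep those comparisons honest, the simplest structural guard is to route every model through one harness, so prompts, decoding settings, and grading cannot drift between vendors. The sketch below is a minimal illustration under that assumption; `call_model`, `Task`, and the exact-match grader are placeholders to be replaced with your real SDK calls and scoring rules, not any vendor’s actual API.

```python
# Minimal shared-harness sketch: one call path and one grader for all models.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str    # identical prompt text sent to every model
    expected: str  # reference answer consumed by the shared grader

def call_model(model_id: str, prompt: str, max_tokens: int = 1024) -> str:
    """Single choke point for all API traffic, so context limits and
    decoding settings cannot silently differ between models."""
    raise NotImplementedError(f"wire up the real SDK for {model_id}")

def grade(output: str, expected: str) -> bool:
    # One grading rule for every model; exact match is the simplest choice.
    return output.strip().lower() == expected.strip().lower()

def run_eval(model_ids: list[str], tasks: list[Task]) -> dict[str, float]:
    """Run the same task list through every model and return pass rates."""
    scores: dict[str, float] = {}
    for model_id in model_ids:
        passed = sum(
            grade(call_model(model_id, t.prompt), t.expected) for t in tasks
        )
        scores[model_id] = passed / len(tasks)
    return scores
```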

Benchmark pitfalls: don’t treat a 0.6-point gap as settled

This is not a universal leaderboard. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6 [3][4][11][13]. Some numbers are also vendor-reported; OpenAI notes that its GPT ARC evaluations were run with reasoning effort set to xhigh in a research environment, so results may differ slightly from production ChatGPT [5][8].

Small gaps should be read only as directional signals. Claude’s lead over GPT-5.5 on GPQA Diamond is 0.6 points; GPT-5.5’s lead over Claude on OSWorld-Verified is 0.7 points [4][5]. Larger gaps carry more operational weight: GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0 and by 7.9 points on FrontierMath [5].
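A quick back-of-envelope check shows why. If you model accuracy on an N-question benchmark as a binomial proportion, the sampling noise on a sub-point gap can exceed the gap itself. The question count below is an assumed placeholder, since the cited sources publish neither benchmark sizes nor error bars, and the unpaired error model is deliberately rough.

```python
# Noise check for small benchmark gaps, assuming n independent questions.
# n = 200 is a hypothetical size, not a figure from the cited sources.
import math

def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent questions."""
    return math.sqrt(p * (1 - p) / n)

p_a, p_b, n = 0.942, 0.936, 200  # the GPQA-style 0.6-point gap, assumed n
gap = p_a - p_b
se_gap = math.hypot(accuracy_stderr(p_a, n), accuracy_stderr(p_b, n))
print(f"gap = {gap:.3f}, stderr of gap ≈ {se_gap:.3f}")
# The stderr (≈0.024) is about four times the 0.006 gap, so a 0.6-point
# lead is a directional signal, not a settled ranking.
```

A paired, per-question comparison would tighten this estimate, but the practical takeaway, that sub-point gaps deserve replication before they drive a model choice, usually survives.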

The practical conclusion is simple: there is no single overall champion among GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6. The safer selection method is to find the benchmark that most resembles your real workload, then rerun the same evaluation with the models you can actually deploy.


Further research

Research conversation

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI (21 sources)

Answer with citations

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena [4][1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools [4][5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison [4][18][1].

Key findings

  • Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% [4].

  • Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% [4].

  • GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% [4].

  • GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% [4][5].

  • GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories [1].

  • DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology [4].

  • DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report [18].

  • Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

Benchmark comparison table

Benchmark / capability | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | DeepSeek V4 | Kimi K2.6 | Leader in available evidence
GPQA Diamond | 93.6% [4] | Insufficient evidence | 94.2% [4] | 90.1% for DeepSeek-V4-Pro-Max [4] | Insufficient evidence | Claude Opus 4.7 [4]
Humanity’s Last Exam, no tools | 41.4% [4] | 43.1% [4] | 46.9% [4] | 37.7% for DeepSeek-V4-Pro-Max [4] | Insufficient evidence | Claude Opus 4.7 [4]
Humanity’s Last Exam, with tools | 52.2% [4] | 57.2% [4] | 54.7% [4] | 48.2% for DeepSeek-V4-Pro-Max [4] | Insufficient evidence | GPT-5.5 Pro [4]
Terminal-Bench 2.0 | 82.7% [4][5] | Insufficient evidence | 69.4% [4][5] | 67.9% for DeepSeek-V4-Pro-Max [4] | Insufficient evidence | GPT-5.5 [4][5]
OSWorld-Verified | 78.7% [5] | Insufficient evidence | 78.0% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5]
FrontierMath Tiers 1–3 | 51.7% [5] | Insufficient evidence | 43.8% [5] | Insufficient evidence | Insufficient evidence | GPT-5.5 [5]
Vision & Document Arena | Insufficient evidence | Insufficient evidence | Reported #1 overall [1] | Insufficient evidence | Insufficient evidence | Claude Opus 4.7 [1]
Vibe Code Benchmark | Insufficient evidence | Insufficient evidence | Insufficient evidence | Claimed #1 open-weight model [18] | Claimed #2 open-weight model [18] | DeepSeek V4 among open-weight models, low-confidence evidence [18]
Context window | Insufficient evidence | Insufficient evidence | 1,000k tokens in one cited comparison [3] | 1,000k tokens for DeepSeek V4 Pro in one cited comparison [3] | Insufficient evidence | Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence [3]

Model-by-model assessment

GPT-5.5

  • GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score [4][5].

  • GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% [5].

  • GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% [5].

  • GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% [4].

  • GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% [4].

  • Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics [7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks [7].

Claude Opus 4.7

  • Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% [4].

  • Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% [4].

  • Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% [4].

  • Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% [4][5].

  • Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories [1].

  • Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro [3].

DeepSeek V4

  • DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results [4].

  • DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 [4].

  • DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions [4].

  • DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 [3].

  • A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table [18].

Kimi K2.6

  • Kimi K2.6 has the weakest quantitative coverage in the available evidence [1][18].

  • One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 [1].

  • The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models [18].

  • Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

  • The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 [4].

  • GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results [5].

  • The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison [3].

  • The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 [1].

  • The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here [2][6].

Limitations / uncertainty

  • Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.

  • Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported [1][18].

  • The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings [3][4].

  • The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology [18].

  • The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions [4].

Summary

  • Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools [4].

  • Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools [4].

  • Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 [4][5].

  • Best cited OS-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath [5].

  • Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting [1].

  • Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet [18].

  • Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison [1][18].

Sources

  • [1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) (latent.space)

    Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking 1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...

  • [3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison (artificialanalysis.ai)

    Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...

  • [4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)

    BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...

  • [5] Everything You Need to Know About GPT-5.5 (vellum.ai)

    The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...

  • [7] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)

    Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...

  • [8] Introducing GPT-5.5 - OpenAI (openai.com)

    Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...

  • [11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI (docsbot.ai)

    Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...

  • [13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)

    Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...