Report published April 28, 2026 · Last edited May 6, 2026 · 8 sources

GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6: Benchmark Winners by Category

Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, while GPT-5.5 takes Terminal-Bench 2.0 at 82.7% [4][5]. GPT-5.5 Pro leads tool-assisted HLE at 57.2% and BrowseComp at 90.1%; DeepSeek-V4-Pro-Max is competitive but does not take first place on any row of the main table [4].


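Looking only at the leaderboard, it is easy to reduce this four-way matchup to "which model is strongest." But if you are putting a model into a product, an agent workflow, or an internal evaluation, the real question is not who the overall champion is. It is: which benchmark does your workload most resemble?

The cleanest shared comparison table currently covers mainly GPT-5.5, GPT-5.5 Pro on some rows, Claude Opus 4.7, and DeepSeek-V4-Pro-Max. Kimi K2.6 data mostly appears in separate comparison articles or model-card summaries, so a direct four-way comparison is less clean ^[4]^[11]^[13].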

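At a glance: provisional winners by scenario

Use case	Best-supported first pick	Reading
Scientific reasoning	Claude Opus 4.7	GPQA Diamond at 94.2%, above GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4]
Expert reasoning without tools	Claude Opus 4.7	Humanity’s Last Exam without tools at 46.9%, above GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4]
Tool-assisted exam reasoning	GPT-5.5 Pro	Humanity’s Last Exam with tools at 57.2%, above Claude Opus 4.7 at 54.7% ^[4]
Terminal and agentic computing	GPT-5.5	Terminal-Bench 2.0 at 82.7%, above Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5]
Operating-system operation	GPT-5.5	OSWorld-Verified at 78.7%, slightly above Claude Opus 4.7 at 78.0% ^[5]
Frontier math	GPT-5.5	FrontierMath Tiers 1–3 at 51.7%, above Claude Opus 4.7 at 43.8% ^[5]
Software engineering in the shared table	Claude Opus 4.7	SWE-Bench Pro / SWE Pro at 64.3%, above GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% ^[4]
Web browsing and comprehension	GPT-5.5 Pro	BrowseComp at 90.1%, above GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4]
MCP-style public tool workflows	Claude Opus 4.7	MCP Atlas / MCPAtlas Public at 79.1%, above GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% ^[4]
Vision and document analysis	Claude Opus 4.7	One source reports it ranked first in Vision & Document Arena, with wins in the diagram, homework, and OCR subcategories ^[1]
Cost-sensitive evaluation	DeepSeek V4	VentureBeat says DeepSeek V4 delivers near-frontier intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but this still needs validation on your own workload ^[4]
Least suited to a forced four-way ranking	Kimi K2.6	Kimi has usable scores, but most come from comparison contexts different from the main table ^[11]^[13]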

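Full benchmark table: separate same-table results from cross-table results

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Safer reading
GPQA Diamond	93.6% ^[4]	Not reported	94.2% ^[4]	DeepSeek-V4-Pro-Max 90.1% ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	DeepSeek-V4-Pro-Max 37.7% ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	DeepSeek-V4-Pro-Max 48.2% ^[4]	54.0% in a separate table ^[13]	GPT-5.5 Pro leads the shared table ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Not reported	69.4% ^[4]^[5]	DeepSeek-V4-Pro-Max 67.9% ^[4]	66.7% in a separate table ^[13]	GPT-5.5 leads ^[4]^[5]
SWE-Bench Pro / SWE Pro	58.6% ^[4]	Not reported	64.3% ^[4]	DeepSeek-V4-Pro-Max 55.4% ^[4]	58.6% in a separate table ^[13]	Claude leads the shared table ^[4]
BrowseComp	84.4% ^[4]	90.1% ^[4]	79.3% ^[4]	DeepSeek-V4-Pro-Max 83.4% ^[4]; DeepSeek-V4 Pro 83.4% in a separate table ^[11]	83.2% in a separate table ^[11]	GPT-5.5 Pro leads the shared table ^[4]
MCP Atlas / MCPAtlas Public	75.3% ^[4]	Not reported	79.1% ^[4]	DeepSeek-V4-Pro-Max 73.6% ^[4]	Not reported	Claude leads ^[4]
OSWorld-Verified	78.7% ^[5]	Not reported	78.0% ^[5]	Not reported	Not reported	GPT-5.5 narrowly leads Claude ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Not reported	43.8% ^[5]	Not reported	Not reported	GPT-5.5 leads Claude ^[5]
Vision & Document Arena	Not reported	Not reported	Reported as first overall ^[1]	Not reported	Not reported	Claude has the only citable result ^[1]
AIME 2026	Not reported	Not reported	Not reported	Not provided in the cited Kimi vs DeepSeek table ^[11]	96.4% in Thinking mode ^[11]	A Kimi signal, not a four-way ranking ^[11]
APEX Agents	Not reported	Not reported	Not reported	Not provided in the cited Kimi vs DeepSeek table ^[11]	27.9% in Thinking mode ^[11]	A Kimi signal, not a four-way ranking ^[11]
Context window	Not reported	Not reported	Listed at 1,000k tokens in one Artificial Analysis comparison ^[3]	DeepSeek V4 Pro listed at 1,000k tokens in the same comparison ^[3]	Not reported	Claude and DeepSeek V4 Pro are equal in that comparison ^[3]

Treat any row that mixes sources with extra care. Kimi K2.6’s scores from a separate Kimi-focused comparison are useful reference points, but they do not carry the same weight as results run side by side with GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max in a single test environment ^[4]^[11]^[13].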
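
One way to keep same-table and cross-table numbers apart is to tag every score with the source it came from and only pick a leader within a single source. The sketch below is illustrative only: the scores and source numbers are copied from the table above, while the function names `leaders_within_source` and `cross_source_rows` are hypothetical helpers, not part of any cited tool.

```python
# (benchmark, model, score in %, source id) copied from the table above.
# Source 4 is the shared table; sources 11 and 13 are the separate Kimi comparisons.
SCORES = [
    ("GPQA Diamond", "GPT-5.5", 93.6, 4),
    ("GPQA Diamond", "Claude Opus 4.7", 94.2, 4),
    ("GPQA Diamond", "DeepSeek-V4-Pro-Max", 90.1, 4),
    ("Terminal-Bench 2.0", "GPT-5.5", 82.7, 4),
    ("Terminal-Bench 2.0", "Claude Opus 4.7", 69.4, 4),
    ("Terminal-Bench 2.0", "DeepSeek-V4-Pro-Max", 67.9, 4),
    ("Terminal-Bench 2.0", "Kimi K2.6", 66.7, 13),
    ("BrowseComp", "GPT-5.5", 84.4, 4),
    ("BrowseComp", "GPT-5.5 Pro", 90.1, 4),
    ("BrowseComp", "Claude Opus 4.7", 79.3, 4),
    ("BrowseComp", "DeepSeek-V4-Pro-Max", 83.4, 4),
    ("BrowseComp", "Kimi K2.6", 83.2, 11),
]


def leaders_within_source(scores, source):
    """Return {benchmark: (model, score)} using only rows from one source."""
    best = {}
    for benchmark, model, score, src in scores:
        if src != source:
            continue  # never rank across different test setups
        if benchmark not in best or score > best[benchmark][1]:
            best[benchmark] = (model, score)
    return best


def cross_source_rows(scores, main_source):
    """List rows from outside the main table so they can be reported separately."""
    return [row for row in scores if row[3] != main_source]


if __name__ == "__main__":
    for bench, (model, score) in leaders_within_source(SCORES, 4).items():
        print(f"{bench}: {model} at {score}%")
    for bench, model, score, src in cross_source_rows(SCORES, 4):
        print(f"Reference only (source {src}): {model} {score}% on {bench}")
```

The Kimi K2.6 rows are kept in the data but never enter the leader calculation, which mirrors how the table above labels them as separate-table results.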

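GPT-5.5 / GPT-5.5 Pro: stronger on terminal, OS, math, and tool workflows

GPT-5.5’s clearest win is Terminal-Bench 2.0: 82.7%, above Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5]. Among the citable figures here, this is one of the larger gaps.

It also leads Claude Opus 4.7 on OSWorld-Verified, but narrowly: 78.7% versus 78.0% ^[5]. On FrontierMath Tiers 1–3 the advantage is clearer, at 51.7% versus Claude’s 43.8% ^[5].

If the task centers on tool-assisted reasoning or browsing, GPT-5.5 Pro stands out more. It scores 57.2% on Humanity’s Last Exam with tools, above Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4]. On BrowseComp, GPT-5.5 Pro also leads at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4].

GPT-5.5 does not lead every reasoning benchmark, however. Claude Opus 4.7 narrowly beats it on GPQA Diamond, 94.2% to 93.6% ^[4]. A separate GPT-5.5 guide lists GPT-5.5-only domain results, such as 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench. Because the same excerpt gives no corresponding scores for Claude Opus 4.7, DeepSeek V4, or Kimi K2.6, these should not be read as four-way wins ^[7].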

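Claude Opus 4.7: strong signals in no-tools reasoning, software engineering, and documents

Claude Opus 4.7 has the best no-tools reasoning results in the main shared table: 94.2% on GPQA Diamond and 46.9% on Humanity’s Last Exam without tools ^[4]. In the same table, Claude also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% ^[4].

Claude’s weak spot in the cited data is terminal-style operation. GPT-5.5 leads Claude on Terminal-Bench 2.0 by more than 13 percentage points, 82.7% to 69.4%; GPT-5.5 also leads Claude on OSWorld-Verified and FrontierMath Tiers 1–3 ^[4]^[5].

On multimodal and document work, Claude has the strongest citable signal so far. One source reports Claude Opus 4.7 taking first place in Vision & Document Arena, gaining 4 points over Opus 4.6 in Document Arena, and winning the diagram, homework, and OCR subcategories ^[1]. That source provides no same-arena figures for GPT-5.5, DeepSeek V4, or Kimi K2.6, so it supports Claude’s document strength without amounting to a full four-way multimodal ranking ^[1].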

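DeepSeek V4: no first places in the main table, but worth testing for cost efficiency

The DeepSeek labels in the data are not uniform. The main shared table uses DeepSeek-V4-Pro-Max; the Artificial Analysis comparison uses DeepSeek V4 Pro and lists a 1,000k-token context window ^[4]^[3]. These names should not be assumed to be interchangeable.

In the main shared table, DeepSeek-V4-Pro-Max is competitive but does not lead any row. It scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public ^[4].

The most notable cited claim for DeepSeek is cost efficiency, not a single benchmark win. VentureBeat describes DeepSeek V4 as delivering near-frontier intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4]. That is a reason to shortlist DeepSeek, not a reason to skip your own testing.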
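
To see why a lower score can still be attractive at a lower price, it can help to normalize cost by pass rate. The sketch below is a rough illustration under loud assumptions: the "relative cost" values are unitless placeholders derived only from the "about one-sixth" claim, not from any published price list; it assumes both models use the same number of tokens per task, which the source does not establish; and it pairs the DeepSeek V4 cost claim with the DeepSeek-V4-Pro-Max score even though those labels may not be interchangeable. The function name `cost_per_solved_task` is hypothetical.

```python
def cost_per_solved_task(relative_cost: float, pass_rate_pct: float) -> float:
    """Relative cost spent per successfully solved task.

    relative_cost is a unitless placeholder (1.0 = the pricier model),
    not a real price. pass_rate_pct is a benchmark score in percent.
    """
    if pass_rate_pct <= 0:
        raise ValueError("pass rate must be positive")
    return relative_cost / (pass_rate_pct / 100.0)


# Terminal-Bench 2.0 scores from the shared table; cost ratio from the 1/6 claim.
gpt_cost = cost_per_solved_task(1.0, 82.7)        # GPT-5.5
deepseek_cost = cost_per_solved_task(1 / 6, 67.9)  # DeepSeek-V4-Pro-Max (assumed pairing)

# How many times more GPT-5.5 would cost per solved task under these assumptions.
ratio = gpt_cost / deepseek_cost

if __name__ == "__main__":
    print(f"GPT-5.5 relative cost per solved task: {gpt_cost:.3f}")
    print(f"DeepSeek relative cost per solved task: {deepseek_cost:.3f}")
    print(f"Ratio: {ratio:.2f}x")
```

Under these assumptions the sixfold price gap shrinks to a bit under fivefold once the lower pass rate is accounted for. Real token usage, retries, and latency can move that figure in either direction, which is why the claim still needs checking on your own workload.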

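If long context matters to you, one Artificial Analysis comparison lists DeepSeek V4 Pro and Claude Opus 4.7 with the same 1,000k-token context window ^[3]. This only supports equal context length for the configurations listed in that comparison; it should not be stretched to mean every DeepSeek or Claude mode is the same ^[3].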

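Kimi K2.6: bright spots, but the hardest to place in a clean four-way ranking

Kimi K2.6 is the hardest model in this group to rank directly, because it does not appear in the main shared table with GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[4].

One Kimi-focused comparison lists K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 ^[13]. The source says the K2.6 figures come from Moonshot AI’s official model card, but its comparison targets are mainly Claude Opus 4.6 and GPT-5.4, not the exact four-model lineup of this article ^[13].

Another Kimi vs DeepSeek comparison lists Kimi K2.6 at 96.4% on AIME 2026 and 27.9% on APEX Agents in Thinking mode, and at 83.2% on BrowseComp in Thinking mode with context management ^[11]. In the same source, DeepSeek-V4 Pro scores 83.4% on BrowseComp, but there are no DeepSeek values for AIME 2026 or APEX Agents ^[11].

So Kimi K2.6 is worth testing, especially for coding, agentic tasks, math, and browsing. Existing sources are not enough, however, to support an overall ranking against GPT-5.5 and Claude Opus 4.7 on one shared set of benchmarks ^[11]^[13].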

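Which one should you test first?

If the task is terminal agents, OS operation, or FrontierMath-style work, test GPT-5.5 first; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results ^[4]^[5].
If tool-assisted reasoning or browsing is central, test GPT-5.5 Pro first; it leads Humanity’s Last Exam with tools and BrowseComp in the main shared table ^[4].
If the focus is GPQA-style scientific reasoning, no-tools expert Q&A, SWE-Bench Pro-style software engineering, MCP-style workflows, or document-heavy multimodal work, test Claude Opus 4.7 first ^[4]^[1].
If cost is the main constraint and you can validate quality yourself, put DeepSeek V4 on the shortlist; its strongest cited advantage is the claim of roughly one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4].
If you specifically want to verify Kimi K2.6’s coding, agentic, math, and browsing scores, test Kimi first; but be sure to use the same prompts, tools, context limits, latency targets, and scoring rules as for the other models ^[11]^[13].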
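
The "same prompts, tools, context limits, latency targets, and scoring rules" requirement can be enforced structurally by passing one frozen configuration and one task list to every model. The following is a minimal sketch, not a real harness: `EvalConfig`, `Task`, `run_eval`, and `exact_match` are hypothetical names, each "model" is just a Python callable standing in for whatever client you actually use, and the latency budget is recorded in the config but not enforced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Settings held constant across every model under test."""
    system_prompt: str
    max_context_tokens: int
    tools: Tuple[str, ...]
    latency_budget_s: float  # recorded only; this sketch does not enforce it


@dataclass(frozen=True)
class Task:
    prompt: str
    expected: str


# A "model" is any callable taking (config, prompt) and returning text.
ModelFn = Callable[[EvalConfig, str], str]


def exact_match(output: str, expected: str) -> bool:
    """One shared scoring rule, applied identically to every model."""
    return output.strip() == expected.strip()


def run_eval(models: Dict[str, ModelFn], tasks: List[Task],
             config: EvalConfig) -> Dict[str, float]:
    """Return each model's pass rate in percent on the same tasks and config."""
    if not tasks:
        raise ValueError("need at least one task")
    results = {}
    for name, model in models.items():
        passed = sum(exact_match(model(config, t.prompt), t.expected) for t in tasks)
        results[name] = 100.0 * passed / len(tasks)
    return results
```

Because the config is frozen and shared, a difference in scores cannot come from one model quietly getting a longer context or a different system prompt. Swapping `exact_match` for a rubric or test-suite scorer keeps the same structure.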

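Benchmark pitfalls: do not treat 0.6 points as conclusive

This is not a universal leaderboard. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6 ^[3]^[4]^[11]^[13]. Some figures are also vendor-reported; OpenAI notes that its ARC evaluations of GPT were run with reasoning effort set to xhigh in a research environment, so results may differ slightly from production ChatGPT ^[5]^[8].

Small gaps are only directional signals. Claude’s lead over GPT-5.5 on GPQA Diamond is 0.6 percentage points; GPT-5.5’s lead over Claude on OSWorld-Verified is 0.7 points ^[4]^[5]. By contrast, larger gaps carry more operational value: GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0 and by 7.9 points on FrontierMath ^[5].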
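
The gap arithmetic above can be reproduced in a few lines. In this sketch the scores are copied from the text, while the 2.0-point cutoff separating "directional" from "actionable" gaps is purely an illustrative assumption; the sources give no such threshold, and `gap_report` is a hypothetical helper.

```python
# (leader, leader score %, runner-up, runner-up score %) copied from the text above.
PAIRS = {
    "GPQA Diamond": ("Claude Opus 4.7", 94.2, "GPT-5.5", 93.6),
    "OSWorld-Verified": ("GPT-5.5", 78.7, "Claude Opus 4.7", 78.0),
    "Terminal-Bench 2.0": ("GPT-5.5", 82.7, "Claude Opus 4.7", 69.4),
    "FrontierMath Tiers 1-3": ("GPT-5.5", 51.7, "Claude Opus 4.7", 43.8),
}

DIRECTIONAL_BELOW = 2.0  # hypothetical cutoff in percentage points


def gap_report(pairs, cutoff=DIRECTIONAL_BELOW):
    """Return {benchmark: (leader, gap in points, label)}."""
    report = {}
    for bench, (leader, high, _runner_up, low) in pairs.items():
        gap = round(high - low, 1)  # round away floating-point noise
        label = "directional" if gap < cutoff else "actionable"
        report[bench] = (leader, gap, label)
    return report


if __name__ == "__main__":
    for bench, (leader, gap, label) in gap_report(PAIRS).items():
        print(f"{bench}: {leader} +{gap} pts ({label})")
```

Whatever cutoff you choose, set it before looking at the results, ideally informed by the run-to-run variance you observe when you repeat your own evaluation.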

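The practical conclusion is simple: there is no single overall champion among GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6. Find the benchmark that most resembles your real workload, then rerun the same evaluation on the models you can actually deploy. That is the more reliable way to choose.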


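Key takeaways

Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, while GPT-5.5 takes Terminal-Bench 2.0 at 82.7% [4][5].
GPT-5.5 Pro leads tool-assisted HLE at 57.2% and BrowseComp at 90.1%; DeepSeek-V4-Pro-Max is competitive but does not take first place on any row of the main table [4].
Kimi K2.6 has scores for SWE-Bench, Terminal-Bench, BrowseComp, and others in separate tables, but it is not in the same main table as GPT-5.5, Claude, and DeepSeek, so a forced four-way overall ranking is not supported [4][11][13].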


Research conversation

You · April 27, 2026

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI · April 27, 2026 · 21 sources

Answer with citations

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena ^[4]^[1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools ^[4]^[5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison ^[4]^[18]^[1].

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited OS-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?). latent.space
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking 1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison. artificialanalysis.ai
Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... venturebeat.com
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[5] Everything You Need to Know About GPT-5.5. vellum.ai
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI. o-mega.ai
Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI. openai.com
Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI. docsbot.ai
Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI. verdent.ai
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].
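To see how the "one-sixth the cost" claim could interact with the accuracy gap, here is a rough sketch that assumes the claim means a uniform per-attempt cost ratio. The source does not confirm that assumption, and the function name and relative-cost units are hypothetical. Only the 1/6 ratio and the GPQA Diamond scores come from the cited material.

```python
def relative_cost_per_correct(relative_cost, accuracy_pct):
    """Expected relative cost per correct answer.

    Assumes every attempt costs the same and that `accuracy_pct` is the
    chance of a correct answer. Both are simplifying assumptions.
    """
    if not 0 < accuracy_pct <= 100:
        raise ValueError("accuracy_pct must be in (0, 100]")
    return relative_cost / (accuracy_pct / 100)


# Baseline: Claude Opus 4.7 at relative cost 1.0, GPQA Diamond 94.2%.
baseline = relative_cost_per_correct(1.0, 94.2)

# DeepSeek-V4-Pro-Max at the claimed 1/6 relative cost, GPQA Diamond 90.1%.
deepseek = relative_cost_per_correct(1 / 6, 90.1)

# How many times cheaper per correct answer under these assumptions.
ratio = baseline / deepseek
print(f"Cost advantage per correct answer: about {ratio:.2f}x")
```

Under these assumptions the lower accuracy shrinks the nominal 6x advantage only slightly. Real token pricing, output length, and retries could change the picture, so the result is no more reliable than the unverified cost claim it starts from.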

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].
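Because no confidence intervals are reported, a back-of-the-envelope check helps gauge how much weight a small gap can bear. The sketch below uses a simple binomial standard error. It assumes GPQA Diamond has 198 questions (the size of the public set, not stated in the cited excerpts) and treats the two models' scores as independent, which is a simplification because both answer the same questions.

```python
import math


def standard_error_points(accuracy_pct, n_questions):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_questions)


# Assumption: question count of the public GPQA Diamond set.
GPQA_DIAMOND_N = 198

se_claude = standard_error_points(94.2, GPQA_DIAMOND_N)  # Claude Opus 4.7
se_gpt = standard_error_points(93.6, GPQA_DIAMOND_N)     # GPT-5.5

gap = 94.2 - 93.6
# Standard error of the difference, treating the two scores as independent.
se_gap = math.sqrt(se_claude ** 2 + se_gpt ** 2)

print(f"Gap: {gap:.1f} points, standard error of the gap: {se_gap:.1f} points")
```

Under these assumptions the 0.6-point GPQA Diamond gap is smaller than its own standard error, so on its own it is weak evidence of a real difference between the two models.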

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited computer-use and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified (78.7% versus 78.0%) and FrontierMath Tiers 1–3 (51.7% versus 43.8%) ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?). latent.space
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking #1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison. artificialanalysis.ai
Columns: Metric | DeepSeek V4 Pro (Reasoning, Max Effort) | Claude Opus 4.7 (Non-reasoning, High Effort) | Analysis. Creator: DeepSeek | Anthropic. Context Window: 1000k tokens (1500 A4 pages of size 12 Arial font) | 1000k tokens (1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... venturebeat.com
Columns: Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro, where shown | Claude Opus 4.7 | Best result among these. GPQA Diamond: 90.1% | 93.6% | - | 94.2% | Claude Opus 4.7. Humanity’s Last Exam, no tools: 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7. Humanity’s Last Exam, with tools: 48.2% | 52.2% | 57.2% | 54...
[5] Everything You Need to Know About GPT-5.5. vellum.ai
The headline numbers: GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI. o-mega.ai
Domain-Specific Benchmarks. Columns: Benchmark | GPT-5.5 | Notes. Harvey BigLaw Bench: 91.7% (43% perfect scores) | Legal reasoning, audience calibration. Internal Investment Banking: 88.5% | Financial analysis tasks. BixBench (bioinformatics): 80.5% (up from 74.0%) | +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI. openai.com
Abstract reasoning. Columns: Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro. ARC-AGI-1 (Verified): 95.0% | 93.7% | - | 94.5% | 93.5% | 98.0%. ARC-AGI-2 (Verified): 85.0% | 73.3% | - | 83.3% | 75.8% | 77.1%. Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI. docsbot.ai
Columns: Benchmark | Kimi K2.6 | DeepSeek-V4 Pro. AIME 2026 (American Invitational Mathematics Examination 2026; evaluates advanced mathematical problem-solving abilities, contest-level math): 96.4%, Thinking mode, Source | Not available. APEX Agents: Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI. verdent.ai
Columns: Benchmark | K2.6 | Claude Opus 4.6 | GPT-5.4 | Notes. SWE-Bench Pro: 58.60% | 53.40% | 57.70% | Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9%. SWE-Bench Verified: 80.20% | 80.80% | 80% | Tight cluster; Opus 4.7 now leads at 87.6%. T...





