Report published 3 months ago · Last edited 2 months ago · 18 sources

GPT-5.5, Claude Opus 4.7, DeepSeek V4, Kimi K2.6: Which One Wins the Benchmarks?

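No single model sweeps the board: Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam (HLE) without tools at 46.9%; GPT-5.5 leads Terminal-Bench 2.0 at 82.7%; GPT-5.5 Pro leads tool-assisted HLE at 57.2% [4][5]. DeepSeek-V4-Pro-Max is competitive in the shared table but does not rank first in any row; its most prominent cited selling point is VentureBeat's description of near-frontier intelligence at roughly one-sixth the cost of Opus 4.7 and GPT-5.5 [4].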

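A score table looks as if it can settle the matter in one stroke, but the real conclusion of this GPT-5.5, Claude Opus 4.7, DeepSeek V4 and Kimi K2.6 comparison depends on the scenario. The most complete head-to-head data mainly covers GPT-5.5, GPT-5.5 Pro in part, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; Kimi K2.6 mostly appears in a separate set of comparisons, so not every row should be read as a same-run, same-spec ranking.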

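In practice, first ask yourself: do you need the model for scientific reasoning, coding agents, browsing, document OCR, or cost control? Each is broken down below.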

At a glance: first picks by task

Task type	Best-supported first pick	Why
Scientific reasoning	Claude Opus 4.7	GPQA Diamond 94.2%, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1%
Tool-free expert reasoning	Claude Opus 4.7	Humanity’s Last Exam without tools 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4% and DeepSeek-V4-Pro-Max at 37.7%
Tool-assisted exam reasoning	GPT-5.5 Pro	Humanity’s Last Exam with tools 57.2%, ahead of Claude Opus 4.7 at 54.7%
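
To make the size of each lead concrete, the three rows above can be turned into percentage-point margins. This is only a minimal sketch: the scores are copied from the table, while the `rows` dictionary and the `lead_margin` helper are hypothetical names introduced here for illustration.

```python
# Scores (percent) copied from the at-a-glance rows above.
# `rows` and `lead_margin` are illustrative names, not part of any source.
rows = {
    "GPQA Diamond": {
        "Claude Opus 4.7": 94.2,
        "GPT-5.5": 93.6,
        "DeepSeek-V4-Pro-Max": 90.1,
    },
    "HLE, no tools": {
        "Claude Opus 4.7": 46.9,
        "GPT-5.5 Pro": 43.1,
        "GPT-5.5": 41.4,
        "DeepSeek-V4-Pro-Max": 37.7,
    },
    "HLE, with tools": {
        "GPT-5.5 Pro": 57.2,
        "Claude Opus 4.7": 54.7,
    },
}


def lead_margin(scores):
    """Return (leader, runner_up, gap in percentage points) for one row."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 1)


margins = {name: lead_margin(scores) for name, scores in rows.items()}
for name, (leader, runner_up, gap) in margins.items():
    print(f"{name}: {leader} leads {runner_up} by {gap} points")
```

The GPQA Diamond gap is well under one point, while the tool-free HLE gap is several points, so the same word "leads" covers quite different margins.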

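People also ask

What is the short answer to "GPT-5.5, Claude Opus 4.7, DeepSeek V4, Kimi K2.6: Which One Wins the Benchmarks?"

No single model sweeps the board: Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam (HLE) without tools at 46.9%; GPT-5.5 leads Terminal-Bench 2.0 at 82.7%; GPT-5.5 Pro leads tool-assisted HLE at 57.2% [4][5].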

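What should I do next in practice?

Kimi K2.6 posts a number of eye-catching scores, but most come from separate comparisons; watch for mixed sources, Pro versus non-Pro modes, DeepSeek version names, and vendor or research-environment benchmark settings [3][5][8][11][13].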

Sources

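Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Safest reading
GPQA Diamond	93.6%	Not reported	94.2%	DeepSeek-V4-Pro-Max 90.1%	Not reported	Claude leads the shared table
Humanity’s Last Exam, no tools	41.4%	43.1%	46.9%	DeepSeek-V4-Pro-Max 37.7%	Not reported	Claude leads the shared table
Humanity’s Last Exam, with tools	52.2%	57.2%	54.7%	DeepSeek-V4-Pro-Max 48.2%	54.0% (separate comparison)	GPT-5.5 Pro leads the shared table
Terminal-Bench 2.0	82.7%	Not reported	69.4%	DeepSeek-V4-Pro-Max 67.9%	66.7% (separate comparison)	GPT-5.5 leads by a wide margin
SWE-Bench Pro / SWE Pro	58.6%	Not reported	64.3%	DeepSeek-V4-Pro-Max 55.4%	58.6% (separate comparison)	Claude leads the shared table
BrowseComp	84.4%	90.1%	79.3%	DeepSeek-V4-Pro-Max 83.4%; another comparison lists DeepSeek-V4 Pro at 83.4%	83.2% (separate comparison)	GPT-5.5 Pro leads the shared table
MCP Atlas / MCPAtlas Public	75.3%	Not reported	79.1%	DeepSeek-V4-Pro-Max 73.6%	Not reported	Claude leads
OSWorld-Verified	78.7%	Not reported	78.0%	Not reported	Not reported	GPT-5.5 is slightly ahead of Claude
FrontierMath Tiers 1–3	51.7%	Not reported	43.8%	Not reported	Not reported	GPT-5.5 is ahead of Claude
Vision & Document Arena	Not reported	Not reported	Reported as #1 overall	Not reported	Not reported	Only Claude has a cited result
AIME 2026	Not reported	Not reported	Not reported	No usable figure in the Kimi vs DeepSeek table	Thinking mode 96.4%	A signal for Kimi, but not a four-way ranking
APEX Agents	Not reported	Not reported	Not reported	No usable figure in the Kimi vs DeepSeek table	Thinking mode 27.9%	A signal for Kimi, but not a four-way ranking
Context window	Not reported	Not reported	Listed as 1,000k tokens in one Artificial Analysis comparison	The same comparison lists DeepSeek V4 Pro at 1,000k tokens	Not reported	Claude and DeepSeek V4 Pro are equal in that comparison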
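
One way to read the table without mixing sources is to rank only the shared-table scores in each row and list separate-comparison figures alongside, unranked. The sketch below illustrates that for four rows. The numbers come from the table above; the `Score` type, its `shared` flag, and both helper functions are hypothetical names introduced here, not part of any cited source.

```python
from typing import NamedTuple


class Score(NamedTuple):
    value: float  # benchmark score in percent
    shared: bool  # True: shared table; False: separate comparison (assumed flag)


# Four rows copied from the table above; Kimi K2.6 figures are flagged
# as coming from a separate comparison, as the table itself notes.
table = {
    "HLE, with tools": {
        "GPT-5.5": Score(52.2, True),
        "GPT-5.5 Pro": Score(57.2, True),
        "Claude Opus 4.7": Score(54.7, True),
        "DeepSeek-V4-Pro-Max": Score(48.2, True),
        "Kimi K2.6": Score(54.0, False),
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": Score(82.7, True),
        "Claude Opus 4.7": Score(69.4, True),
        "DeepSeek-V4-Pro-Max": Score(67.9, True),
        "Kimi K2.6": Score(66.7, False),
    },
    "SWE-Bench Pro": {
        "GPT-5.5": Score(58.6, True),
        "Claude Opus 4.7": Score(64.3, True),
        "DeepSeek-V4-Pro-Max": Score(55.4, True),
        "Kimi K2.6": Score(58.6, False),
    },
    "BrowseComp": {
        "GPT-5.5": Score(84.4, True),
        "GPT-5.5 Pro": Score(90.1, True),
        "Claude Opus 4.7": Score(79.3, True),
        "DeepSeek-V4-Pro-Max": Score(83.4, True),
        "Kimi K2.6": Score(83.2, False),
    },
}


def shared_leader(row):
    """Pick the top model using shared-table scores only."""
    shared = {model: s.value for model, s in row.items() if s.shared}
    return max(shared, key=shared.get)


def cross_source(row):
    """List models whose score comes from a separate comparison."""
    return sorted(model for model, s in row.items() if not s.shared)


leaders = {name: shared_leader(row) for name, row in table.items()}
for name, row in table.items():
    print(f"{name}: leader={leaders[name]}, unranked={cross_source(row)}")
```

Keeping the provenance flag next to each number makes it harder to treat a separate-comparison score as a same-run result by accident.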