Report published 2 months ago · Last edited last month · 18 sources

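May 2026 AI Showdown: Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Flash. Which One Comes Out on Top?

The newly released Claude Opus 4.8 leads in agentic coding (SWE-Bench Pro) and knowledge work, while GPT-5.5 is the top model for terminal coding and posts the strongest reasoning scores. Gemini 3.5 Flash is by far the fastest at 289 tokens per second and is among the cheapest, which makes it hard to beat on value.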



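The AI industry has been noisy lately, with vendors shipping new models in rapid succession. On May 28, Anthropic launched its latest flagship, Claude Opus 4.8. Add OpenAI's GPT-5.5, Google DeepMind's Gemini 3.5 Flash, xAI's Grok 4.3, and DeepSeek's V4 Pro from the preceding weeks, plus the previous-generation Claude Opus 4.7, and the market has turned into a six-way contest. Which model is really in charge? We have pulled together the most complete data available as of late May 2026 so you can see how they stack up at a glance.

Six Models Head to Head: Full Benchmark Comparison

The best way to compare these top models is to look at their scores on public benchmarks. The table below gathers their results on the major tests in one place: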

Benchmark	Claude Opus 4.8	Claude Opus 4.7	GPT-5.5	Gemini 3.5 Flash	Grok 4.3	DeepSeek V4 Pro
SWE-Bench Pro (agentic coding)	69.2%	64.3%	58.6%	~21.4%*	~19.4%*	~18.1%*
SWE-Bench Verified	~83% (est)	87.6%	85.0%	82.1%	81.0%	80.6%
Terminal-Bench 2.0/2.1 (terminal coding)	74.6%	66.1–69.4%	78.2–82.7%	76.2%	68.5%	65.0%
OSWorld-Verified (computer use)	83.4%	82.8%	78.7%	75.0%	72.1%	70.5%
GDPval-AA (knowledge work, Elo)	1890	1753	1620–1769	1656	1500–1570	1550
Humanity's Last Exam (with tools)	57.9%	54.7%	—	—	—	—
Humanity's Last Exam (no tools)	49.8%	—	—	—	—	—
GPQA Diamond (PhD-level science)	~94% (est)	94.2%	96.0%	92.4%	90.1–91.5%	95.1%
ARC-AGI-2 (abstract reasoning)	~80% (est)	80.2%	85.0%	75.8%	76.1%	74.0%
MCP Atlas (tool-use reliability)	—	77.3%	79.1%	83.6%	74.2%	71.5%
AA Intelligence Index (v4.0)	~59–60 (est)	59	60	57	53	55
Finance Agent v2 (financial analysis)	53.9%	51.5%	—	—	—	—
LiveCodeBench (Pass@1)	—	—	~91–92% (est)	—	—	93.5%
Codeforces Elo (competitive programming)	—	~3050 (est)	3168	—	—	3206
FrontierMath Tier 1–3 (math)	—	43.8%	51.7%	—	—	—
MMLU-Pro (multitask language understanding)	—	—	—	—	—	87.5%
AIME 2025 (math)	—	—	95.2%	—	—	—
BrowseComp	—	79.3%	84.4%	—	—	—

* The SWE-Bench Pro scores for Gemini 3.5 Flash, Grok 4.3, and DeepSeek V4 Pro come from a third-party test. Google reports a different score in its own model card, so treat these figures with care.
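
Because the cells mix single values, ranges, and estimates, picking a "leader" per row depends on how those are read. The sketch below is one possible reading, not the method used to build the table: it copies three rows from the table, parses each cell into a (low, high, is_estimate) triple, and picks the model with the highest lower bound. The names `MODELS`, `ROWS`, `parse_cell`, and `leader` are made up for illustration, and ranking by the lower bound is an assumption (a deliberately conservative one).

```python
import re

MODELS = ["Claude Opus 4.8", "Claude Opus 4.7", "GPT-5.5",
          "Gemini 3.5 Flash", "Grok 4.3", "DeepSeek V4 Pro"]

# Three rows copied from the benchmark table; "—" would mark a missing score.
ROWS = {
    "SWE-Bench Pro": ["69.2%", "64.3%", "58.6%", "~21.4%*", "~19.4%*", "~18.1%*"],
    "Terminal-Bench 2.0/2.1": ["74.6%", "66.1–69.4%", "78.2–82.7%", "76.2%", "68.5%", "65.0%"],
    "GPQA Diamond": ["~94% (est)", "94.2%", "96.0%", "92.4%", "90.1–91.5%", "95.1%"],
}


def parse_cell(cell):
    """Return (low, high, is_estimate), or None when the cell has no number."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cell)]
    if not numbers:
        return None
    is_estimate = "~" in cell or "est" in cell
    return min(numbers), max(numbers), is_estimate


def leader(cells):
    """Pick the model with the highest lower bound (assumed tie-break rule)."""
    parsed = [(parse_cell(cell), model) for cell, model in zip(cells, MODELS)]
    parsed = [(score, model) for score, model in parsed if score is not None]
    best_score, best_model = max(parsed, key=lambda pair: pair[0][0])
    return best_model, best_score


for benchmark, cells in ROWS.items():
    model, (low, high, is_estimate) = leader(cells)
    note = " (estimate)" if is_estimate else ""
    print(f"{benchmark}: {model}, {low}-{high}{note}")
```

Ranking by the lower bound means a model only counts as the leader if it wins even on its least favorable reported score; ranking by the upper bound or the midpoint could give a different answer on rows where ranges overlap.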

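Price, Speed, and Basic Specs Compared

Benchmarks matter, but in real-world use, value for money and the actual experience count most. The table below lays out each model's usage cost and technical specifications: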

Metric	Claude Opus 4.8	Claude Opus 4.7	GPT-5.5	Gemini 3.5 Flash	Grok 4.3	DeepSeek V4 Pro
Input price (per 1M tokens)	US$5.00	US$15.00	US$5.00	US$1.50	US$1.25–1.50	~US$0.50–2.00 (est)
Output price (per 1M tokens)	US$25.00	~US$75.00 (est)	US$30.00	US$9.00	~US$6.00–8.00 (est)	~US$2.00–8.00 (est)
Output speed (tokens/sec)	~90–100 (est)	~67–78	~71	289	~159–207	~80–100 (est)
Context window	1M	200K	400K	1M	1M	1M
Release date	May 28, 2026	April 16, 2026	April 23, 2026	May 19, 2026	April 30, 2026	April 24, 2026
BenchLM rank (provisional)	#2/119	—	#5/119	—	—	—
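
To turn the per-token prices into a bill, multiply each token count by its price per million tokens. The sketch below does this for the three models whose input and output prices are listed as single figures in the table. Models with price ranges or estimates are left out rather than guessed. The workload size (200M input and 50M output tokens) and the names `PRICES` and `workload_cost` are assumptions for illustration only.

```python
# Prices in US$ per 1M tokens, copied from the specification table.
# Only models with a single listed input and output price are included.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the cost in US$ of a job with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 200M input tokens and 50M output tokens.
for model in PRICES:
    cost = workload_cost(model, 200_000_000, 50_000_000)
    print(f"{model}: US${cost:,.2f}")
```

For this assumed workload the totals come to US$2,250 for Claude Opus 4.8, US$2,500 for GPT-5.5, and US$750 for Gemini 3.5 Flash. The gap shifts with the input-to-output ratio, since output tokens cost five to six times more than input tokens on all three.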

In-Depth Analysis: Which Model Rules Which Domain?

With the numbers laid out, here is where each model is strongest:

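🏆 Claude Opus 4.8: The Freshly Released All-Rounder

Released on May 28, Claude Opus 4.8 arrived with strong momentum. Among the models with reported scores, it ranks first on demanding tasks such as agentic coding (SWE-Bench Pro, 69.2%), knowledge work (GDPval-AA, 1890 Elo), computer use (OSWorld-Verified, 83.4%), cross-disciplinary reasoning (Humanity's Last Exam), and financial analysis (Finance Agent v2, 53.9%). It also places second overall on a composite leaderboard with a score of 93/100.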

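🧠 GPT-5.5: The Reasoning and Terminal Coding Specialist

OpenAI's GPT-5.5 dominates a second front. Its strengths are terminal coding (Terminal-Bench, 78.2–82.7%) and abstract visual reasoning (ARC-AGI-2, 85.0%). It also leads in PhD-level science (GPQA Diamond, 96.0%) and math (FrontierMath, 51.7%), and its composite AA Intelligence Index score of 60 is the highest in the field.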

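⚡ Gemini 3.5 Flash: The Speed Demon with Outstanding Value

Google's Gemini 3.5 Flash lives up to the "Flash" name. It is not trying to top every subject. It targets practical use and efficiency instead. Its tool-use orchestration score (MCP Atlas, 83.6%) is the best in the field, and its output speed of 289 tokens per second is roughly 3 to 4 times the Claude Opus and GPT-5.5 figures and about 1.4 to 1.8 times Grok 4.3. Its price is also among the lowest in the table. For high-volume work that needs fast, accurate output on a tight budget, it is a clear first choice.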
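
The speed claim can be checked against the output-speed row of the specification table. The sketch below divides Gemini 3.5 Flash's 289 tokens per second by each rival's listed speed, keeping ranges as (low, high) pairs. Entries marked "(est)" in the table are omitted, the remaining rival figures are approximate ("~") in the source, and the names `SPEED_TPS` and `speedup_range` are hypothetical.

```python
# Output speeds in tokens per second, from the specification table.
# Ranges are kept as (low, high); entries marked "(est)" are omitted.
SPEED_TPS = {
    "Gemini 3.5 Flash": (289, 289),
    "Claude Opus 4.7": (67, 78),
    "GPT-5.5": (71, 71),
    "Grok 4.3": (159, 207),
}


def speedup_range(fast, slow):
    """Return the (min, max) ratio of the fast model's speed to the slow model's."""
    fast_low, fast_high = SPEED_TPS[fast]
    slow_low, slow_high = SPEED_TPS[slow]
    # Smallest ratio: slowest "fast" figure over fastest "slow" figure.
    return fast_low / slow_high, fast_high / slow_low


for rival in ("Claude Opus 4.7", "GPT-5.5", "Grok 4.3"):
    low, high = speedup_range("Gemini 3.5 Flash", rival)
    print(f"vs {rival}: {low:.1f}x to {high:.1f}x")
```

On these figures the ratio is about 3.7 to 4.3 against Claude Opus 4.7, about 4.1 against GPT-5.5, and about 1.4 to 1.8 against Grok 4.3, so the "4 times faster" figure holds against the slowest models rather than against every rival.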

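💰 DeepSeek V4 Pro: The Under-the-Radar Competitive Programming Ace

DeepSeek V4 Pro is clearly here to challenge the incumbents. Its competitive programming results are striking: a Codeforces Elo of 3206 and 93.5% on LiveCodeBench, both the best among the models with reported scores. It lags somewhat on complex agentic tasks, but given that its estimated price is a fraction of what the flagship models charge, it offers very good value for strong programmers and developers.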

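🔧 Grok 4.3: The Steady Mid-Pack Contender

xAI's Grok 4.3 is solid but unspectacular: 53 on the AA Intelligence Index, 90.1–91.5% on GPQA Diamond, good speed (159–207 tokens per second), and competitive pricing. Its scores in specialized fields such as law and finance stand out, but it trails the leaders on most agentic benchmarks.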

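📉 Claude Opus 4.7: The Previous Generation Bows Out

As the predecessor to Claude Opus 4.8, Opus 4.7 still holds up well on SWE-Bench Verified (87.6%), but its successor has overtaken it on most benchmarks where both have reported scores. Tech is unforgiving that way: stand still and you fall behind.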

A Friendly Reminder: Key Pitfalls When Reading Benchmarks

Keep the following points in mind when reading these comparisons:

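Data sources differ: These scores come from different vendors and independent testers, whose test harnesses and methods can vary widely. The same benchmark can therefore show different scores depending on the source (for example, GPT-5.5's Terminal-Bench score is 78.2% in Google's report, while OpenAI itself claims 82.7%).
SWE-Bench Pro and SWE-Bench Verified are two different exams: Pro is much harder and involves complex changes across multiple files, while Verified is easier and mostly covers single-issue fixes. Claude leads on Pro, but on Verified the scores are tightly bunched.
Cheap can still be good: The per-token cost of DeepSeek, Grok, and Gemini 3.5 Flash is clearly lower than that of Claude Opus and GPT-5.5. For high-volume or latency-sensitive work, they offer much better value.
"Flash" is not "Pro": Gemini 3.5 Flash is an efficiency-first lightweight model, so it is not a like-for-like comparison with the full flagship models. Even so, it performs well on many agentic benchmarks while being fast and cheap.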

What Is Still Unclear?

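No single, fair, unified test: No independent body has yet evaluated all six models in one pass under identical settings. The current "full comparison" is assembled from a set of tests that overlap but are not identical.
Less data from second-tier vendors: Compared with the big three (Anthropic, OpenAI, Google), Grok 4.3 and DeepSeek V4 Pro have far less public data on agentic and long-context benchmarks.
Claude Opus 4.8 is too new: It launched on May 28, so independent third-party verification is very limited, and most of its scores currently rely on Anthropic's own figures.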

Most Trustworthy Sources

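Official vendor pages: Anthropic (Claude Opus 4.8), OpenAI (GPT-5.5), and Google DeepMind (Gemini 3.5 Flash model card) provide first-hand primary data.
NIST CAISI evaluation: An independent government evaluation of DeepSeek V4 Pro by the US National Institute of Standards and Technology.
Duke University analysis: Academic coverage of Gemini 3.5 Flash.
Third-party aggregators: Side-by-side tests on dev.to, BenchLM.ai, and Artificial Analysis make cross-model comparison convenient, but they carry less authority than official data.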
