Answer published May 5, 2026. Last edited May 6, 2026. 7 sources.

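GPT-5.4, GPT-5.3-Codex, and Claude Opus 4.6: The Real Differences in Coding Benchmarks

If you look only at SWE-Bench Verified-style repository fixes, Claude Opus 4.6 has public scores of roughly 79%–81% and is the model most worth testing first [3][5][7][9]. For terminal-agent scenarios, the model name alone is not enough: the public Terminal-Bench 2.0 leaderboard ranks agent/model pairings, and both GPT-5.3-Codex and Claude Opus 4.6 change places depending on the harness [1][3].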


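If the only question is "which model is best at writing code", these public benchmarks do not give a one-sentence answer. A safer reading: Claude Opus 4.6 shows the strongest signal on repository bug-fixing evaluations such as SWE-Bench Verified; GPT-5.3-Codex is the OpenAI model that stands out on Terminal-Bench 2.0; and GPT-5.4's direct coding gains over GPT-5.3-Codex look like a small iteration rather than a decisive jump ^[1]^[3]^[5]^[7]^[9].

The key problem is that these scores do not come from the same exam. SWE-Bench Verified, SWE-Bench Pro, and SWE-Bench Pro Public cannot simply be compared side by side, and public Terminal-Bench 2.0 results also depend on the agent harness ^[1]^[6]^[7]^[10].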

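The conclusion first: choose by workload, not by overall ranking

Your scenario	Test first	Basis	Main caveat
Repository bug fixing similar to SWE-Bench Verified	Claude Opus 4.6	Several reports put Opus 4.6's SWE-Bench Verified score at about 79.2%–80.8% ^[3]^[5]^[7]^[9].	Do not compare Verified scores with SWE-Bench Pro Public scores as if they came from the same test ^[6]^[7]^[10].
Terminal, command-line, and script-execution agent workflows	GPT-5.3-Codex, but pin the evaluation harness	One GPT-5.4 comparison lists GPT-5.3-Codex at 77.3% on Terminal-Bench 2.0, above GPT-5.4 at 75.1% and Claude Opus 4.6 at 65.4% ^[3].	The public leaderboard ranks agent/model pairings; Claude Opus 4.6 paired with ForgeCode reaches 79.8% ^[1].
Choosing only among OpenAI models	GPT-5.4 is worth trying, but do not expect a leap	In the same comparison, GPT-5.4 scores 57.7% on SWE-Bench Pro and GPT-5.3-Codex scores 56.8% ^[3].	The same source also shows GPT-5.4 below GPT-5.3-Codex on Terminal-Bench 2.0 ^[3].
Tool-heavy, MCP-related systems	GPT-5.4 deserves a separate evaluation	The GPT-5.4 analysis says tool search cuts MCP token usage by 47% by loading tool definitions on demand ^[3].	Better token cost or context efficiency is not the same as an accuracy win on SWE-Bench or Terminal-Bench ^[3].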
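
The table above can also be read as a simple lookup from workload to the model to test first. The Python sketch below is only an illustration: the workload keys and the function name `first_candidate` are hypothetical labels, and the mapping restates the table rather than adding any new ranking.

```python
# Workload -> (model to test first, main caveat), restating the table above.
# The short workload keys are hypothetical labels, not terms from the sources.
FIRST_CANDIDATE = {
    "repo_bug_fixing": (
        "Claude Opus 4.6",
        "Do not compare Verified scores with SWE-Bench Pro Public scores.",
    ),
    "terminal_agents": (
        "GPT-5.3-Codex",
        "Pin the agent harness; the leaderboard ranks agent/model pairings.",
    ),
    "openai_only": (
        "GPT-5.4",
        "Expect a small step; it trails GPT-5.3-Codex on Terminal-Bench 2.0.",
    ),
    "tool_heavy_mcp": (
        "GPT-5.4",
        "Token efficiency is not an accuracy win.",
    ),
}


def first_candidate(workload: str) -> str:
    """Return the model to test first for a known workload label."""
    if workload not in FIRST_CANDIDATE:
        raise KeyError(f"No recommendation for workload: {workload!r}")
    model, caveat = FIRST_CANDIDATE[workload]
    return f"{model} (caveat: {caveat})"


print(first_candidate("terminal_agents"))
```

The result is a starting point for your own tests, not a verdict; the caveat travels with the recommendation on purpose.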

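The biggest trap: do not stitch different benchmarks into one leaderboard

SWE-Bench Verified and SWE-Bench Pro Public are not the same thing

The strongest case for Claude Opus 4.6 comes from SWE-Bench Verified. In the cited reports, its Verified score clusters between roughly 79% and 81%: the GPT-5.4 analysis gives 79.2%, the Opus-vs-Codex comparisons give 79.4%, and other benchmark roundups give 80.8% ^[3]^[5]^[6]^[7]^[9].

GPT-5.3-Codex's SWE-Bench numbers are harder to read because different reports use different variants. Some reports list GPT-5.3-Codex at 78.2% on SWE-Bench Pro Public, while the GPT-5.4 analysis lists it at 56.8% on SWE-Bench Pro ^[3]^[6]^[7]. That is not a reason to average the two. It is a reminder that SWE-Bench Verified, SWE-Bench Pro, and SWE-Bench Pro Public are not interchangeable ^[6]^[7]^[10].

Within the same GPT-5.4 analysis, GPT-5.4's clearest advantage over GPT-5.3-Codex is narrow: 57.7% versus 56.8% on SWE-Bench Pro ^[3]. Another summary also mentions GPT-5.4's 57.7% SWE-Bench Pro Public figure, and cautions that broader Claude-versus-GPT comparisons are not apples to apples ^[10].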
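
One way to avoid mixing variants is to group the reported scores by benchmark variant before comparing anything. The Python sketch below uses only the numbers quoted in this section; the tuple layout and the function names `by_variant` and `comparable` are assumptions made for illustration.

```python
from collections import defaultdict

# (model, benchmark variant, score in %, source ids) as quoted in this article.
REPORTED = [
    ("Claude Opus 4.6", "SWE-Bench Verified", 79.2, "[3]"),
    ("Claude Opus 4.6", "SWE-Bench Verified", 79.4, "[6][7]"),
    ("Claude Opus 4.6", "SWE-Bench Verified", 80.8, "[5][9]"),
    ("GPT-5.3-Codex", "SWE-Bench Pro Public", 78.2, "[6][7]"),
    ("GPT-5.3-Codex", "SWE-Bench Pro", 56.8, "[3]"),
    ("GPT-5.4", "SWE-Bench Pro", 57.7, "[3]"),
]


def by_variant(rows):
    """Group scores by benchmark variant so each table is like for like."""
    tables = defaultdict(list)
    for model, variant, score, source in rows:
        tables[variant].append((model, score, source))
    return dict(tables)


def comparable(rows, model_a, model_b):
    """Return the variants on which both models have a reported score."""
    shared = []
    for variant, entries in by_variant(rows).items():
        models = {model for model, _score, _source in entries}
        if model_a in models and model_b in models:
            shared.append(variant)
    return shared


print(comparable(REPORTED, "GPT-5.4", "GPT-5.3-Codex"))
print(comparable(REPORTED, "Claude Opus 4.6", "GPT-5.3-Codex"))
```

With these rows, GPT-5.4 and GPT-5.3-Codex share only SWE-Bench Pro, while Claude Opus 4.6 and GPT-5.3-Codex share no variant at all, which is why the 79%–81% and 78.2% figures should not be placed in one ranking.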

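Terminal-Bench 2.0 measures "model + agent harness"

Terminal-Bench 2.0 is even easier to misread. Its public leaderboard lists agent/model pairings rather than testing base models in isolation ^[1]. On that leaderboard, GPT-5.3-Codex scores 78.4% with SageAgent, 77.3% with Droid, and 75.1% with Simple Codex ^[1]. Claude Opus 4.6 scores 79.8% with ForgeCode, 75.3% with Capy, and 62.9% with Terminus 2 ^[1].

That spread is large enough to change the "winner". In the GPT-5.4 analysis, GPT-5.3-Codex leads Claude Opus 4.6 on Terminal-Bench 2.0 by 77.3% to 65.4% ^[3]; on the public leaderboard, ForgeCode/Claude Opus 4.6 at 79.8% is above SageAgent/GPT-5.3-Codex at 78.4% ^[1]. So when evaluating terminal agents, fix the harness first and only then compare models.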
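
To see how much the harness moves the result, the leaderboard rows quoted above can be summarized per model. This is a minimal Python sketch; the list layout and the helper names `spread_by_model` and `top_pairing` are assumptions, and the data is limited to the six pairings cited from [1].

```python
# (agent harness, model, accuracy in %) as quoted from the public leaderboard [1].
PAIRINGS = [
    ("SageAgent", "GPT-5.3-Codex", 78.4),
    ("Droid", "GPT-5.3-Codex", 77.3),
    ("Simple Codex", "GPT-5.3-Codex", 75.1),
    ("ForgeCode", "Claude Opus 4.6", 79.8),
    ("Capy", "Claude Opus 4.6", 75.3),
    ("Terminus 2", "Claude Opus 4.6", 62.9),
]


def spread_by_model(pairings):
    """Return {model: (lowest, highest, spread)} across harnesses."""
    scores = {}
    for _harness, model, accuracy in pairings:
        scores.setdefault(model, []).append(accuracy)
    return {
        model: (min(vals), max(vals), round(max(vals) - min(vals), 1))
        for model, vals in scores.items()
    }


def top_pairing(pairings):
    """Return the single highest-scoring (harness, model, accuracy) row."""
    return max(pairings, key=lambda row: row[2])


for model, (low, high, spread) in spread_by_model(PAIRINGS).items():
    print(f"{model}: {low}% to {high}% (spread {spread} points)")
print("Top pairing:", top_pairing(PAIRINGS))
```

On these six rows, Claude Opus 4.6 spans 16.9 points across harnesses and GPT-5.3-Codex spans 3.3 points, so the harness choice alone can outweigh the gap between the two models.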

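How to read each of the three models

Claude Opus 4.6: the first candidate for repository fixes

If your proxy metric is SWE-Bench Verified, Claude Opus 4.6 is the best-supported starting point among these sources. Its public scores on the Verified variant are reported as 79.2%, 79.4%, and 80.8% ^[3]^[5]^[6]^[7]^[9].

That does not mean it wins every programming task. Its Terminal-Bench 2.0 result depends on the pairing: the comparison reports cite 65.4%, while the public leaderboard shows 79.8% with ForgeCode and 62.9% with Terminus 2 ^[1]^[3]^[7]^[9]. The takeaway: for Verified-style real-repository fixes, try Opus 4.6 first; for terminal agents, do not go by model name alone.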

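GPT-5.3-Codex: the terminal-agent strength in the OpenAI camp

The strongest case for GPT-5.3-Codex among OpenAI models comes from Terminal-Bench-style terminal tasks. Comparison reports list it at 77.3% on Terminal-Bench 2.0, and the public leaderboard shows pairing results of 78.4% with SageAgent, 77.3% with Droid, and 75.1% with Simple Codex ^[1]^[3]^[7]^[9].

Its SWE-Bench scores have to be read by variant. Some reports list GPT-5.3-Codex at 78.2% on SWE-Bench Pro Public, while others list 56.8% on SWE-Bench Pro ^[3]^[6]^[7]^[9]. Since the sources themselves warn that these variants are not directly interchangeable, judge the model on the exact benchmark version and setup you actually plan to use ^[6]^[7]^[10].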

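GPT-5.4: small steps in coding, with tool use the more notable change

Judging from this set of sources, GPT-5.4 does not look like a breakthrough in coding ability. In the same analysis, it is only 0.9 percentage points ahead of GPT-5.3-Codex on SWE-Bench Pro, 57.7% versus 56.8%, and it is lower on Terminal-Bench 2.0, 75.1% versus 77.3% ^[3].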
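
The two gaps above are plain differences in percentage points. A short Python sketch of the arithmetic, where the helper name `delta_points` is an assumption for illustration:

```python
def delta_points(new: float, old: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


# Scores for GPT-5.4 and GPT-5.3-Codex as reported in the cited analysis [3].
swe_bench_pro = delta_points(57.7, 56.8)    # GPT-5.4 minus GPT-5.3-Codex
terminal_bench = delta_points(75.1, 77.3)   # GPT-5.4 minus GPT-5.3-Codex

print(f"SWE-Bench Pro: {swe_bench_pro:+.1f} points")
print(f"Terminal-Bench 2.0: {terminal_bench:+.1f} points")
```

This prints +0.9 points for SWE-Bench Pro and -2.2 points for Terminal-Bench 2.0, so the reported regression on terminal tasks is larger than the reported gain on repository tasks.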

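GPT-5.4's more distinctive feature is tool use. The analysis says tool search reduces MCP token usage by 47% by loading tool definitions on demand instead of placing every definition in context ^[3]. If you are building multi-tool, long-context, automated coding agents, this could be a practical system-level advantage, but it should be evaluated separately from bug-fixing accuracy.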
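
The general idea of loading tool definitions on demand can be illustrated with a toy model. The Python sketch below is not OpenAI's implementation or API: the tool names, descriptions, and the word-count stand-in for a tokenizer are all invented for illustration, and the saving it computes is unrelated to the reported 47% figure.

```python
# Hypothetical tool registry; names and descriptions are invented for illustration.
TOOLS = {
    "read_file": "Read a file from the workspace and return its text.",
    "run_tests": "Run the project test suite and return failures.",
    "search_issues": "Search the issue tracker for matching tickets.",
    "deploy": "Deploy the current build to a staging environment.",
}


def rough_tokens(text: str) -> int:
    """Very rough stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def eager_context(tools: dict) -> int:
    """Context cost when every tool definition is loaded up front."""
    return sum(rough_tokens(desc) for desc in tools.values())


def on_demand_context(tools: dict, needed: list) -> int:
    """Context cost when only the definitions a task needs are loaded."""
    return sum(rough_tokens(tools[name]) for name in needed)


eager = eager_context(TOOLS)
lazy = on_demand_context(TOOLS, ["run_tests"])
saving = 1 - lazy / eager
print(f"eager={eager}, on demand={lazy}, saving={saving:.0%}")
```

The saving in a real system depends on how many tools are registered and how many a task actually uses, which is one reason to measure it on your own workload rather than assume a fixed percentage.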

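A more reliable way to compare

Fix the benchmark version first, then name a winner. SWE-Bench Verified, SWE-Bench Pro, and SWE-Bench Pro Public should not be merged into one simple ranking table ^[6]^[7]^[10].
Fix the agent harness for terminal tasks. The public Terminal-Bench 2.0 leaderboard shows that the same model paired with different agent harnesses gets noticeably different accuracy ^[1].
Keep coding accuracy and tool efficiency separate. The reported 47% drop in MCP token usage for GPT-5.4 is relevant to tool-heavy systems, but it is not a claim of victory on SWE-Bench or Terminal-Bench ^[3].
Treat mixed-source rankings as a direction, not a verdict. Different sources support different winners on different benchmarks, which is exactly why a single overall champion overstates the evidence ^[1]^[3]^[6]^[7]^[10].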
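
The first two rules can be enforced mechanically in your own evaluation scripts. The sketch below is one possible guard, assuming a hypothetical `EvalRun` record and a `rank_like_for_like` helper; the `harness` value "not stated" is a placeholder, because the cited analysis does not say which harness it used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    model: str
    benchmark: str   # exact variant, e.g. "SWE-Bench Pro"
    harness: str     # agent harness, or a placeholder if the source omits it
    score: float     # accuracy in %
    source: str      # where the number came from


def rank_like_for_like(runs):
    """Rank runs by score, refusing to mix benchmark variants or harnesses."""
    benchmarks = {run.benchmark for run in runs}
    harnesses = {run.harness for run in runs}
    if len(benchmarks) > 1:
        raise ValueError(f"Mixed benchmark variants: {sorted(benchmarks)}")
    if len(harnesses) > 1:
        raise ValueError(f"Mixed agent harnesses: {sorted(harnesses)}")
    return sorted(runs, key=lambda run: run.score, reverse=True)


# Both rows come from the same analysis [3] and the same benchmark variant.
same_setup = [
    EvalRun("GPT-5.3-Codex", "SWE-Bench Pro", "not stated", 56.8, "[3]"),
    EvalRun("GPT-5.4", "SWE-Bench Pro", "not stated", 57.7, "[3]"),
]
for run in rank_like_for_like(same_setup):
    print(run.model, run.score)
```

Adding a SWE-Bench Verified row, or a Terminal-Bench row produced with a different harness, makes the function raise instead of producing a misleading ranking.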

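How to choose in the end

If your main task is repository-level bug fixing, test Claude Opus 4.6 first. If you are building terminal agents or command-line automation, put GPT-5.3-Codex on the shortlist and fix the harness. If you are considering only OpenAI models, or care specifically about tool search and MCP token cost, test GPT-5.4 separately ^[1]^[3]^[5]^[7]^[9].

The safest conclusion is not that one model dominates programming, but that the winner changes with the benchmark version, the agent harness, and the real workload ^[1]^[6]^[7]^[10].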


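Key takeaways

If you look only at SWE-Bench Verified-style repository fixes, Claude Opus 4.6 has public scores of roughly 79%–81% and is the model most worth testing first [3][5][7][9].
For terminal-agent scenarios, the model name alone is not enough: the public Terminal-Bench 2.0 leaderboard ranks agent/model pairings, and both GPT-5.3-Codex and Claude Opus 4.6 change places depending on the harness [1][3].
GPT-5.4's coding benchmark gains are limited, but tool search is reported to cut MCP token usage by 47%, so it is better evaluated separately in tool-heavy systems [3].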


Sources

[1] 2.0 Leaderboard (tbench.ai)
Rank, Agent, Model, Date, Agent Org, Model Org, Accuracy: 4, ForgeCode, Claude Opus 4.6, 2026-03-12, ForgeCode, Anthropic, 79.8% ± 1.6; 5, SageAgent, GPT-5.3-Codex, 2026-03-13, OpenSage, OpenAI, 78.4% ± 2.2; 6, ForgeCode, Gemini 3.1 Pro, 2026-03-02, ForgeCode...
[3] GPT-5.4: The Real Leap Isn't Coding | Blog - Alex Lavaee (alexlavaee.me)
- Coding benchmarks are flat. SWE-Bench Pro: 57.7% vs 56.8% for GPT-5.3-Codex. Terminal-Bench 2.0 actually regressed from 77.3% to 75.1%. - Tool search cuts MCP token usage by 47% by loading tool definitions on demand instead of cramming them all into conte...
[5] Best AI for Coding 2026: SWE-Bench Breakdown - Opus 4.6 ... (marc0.dev)
I dug into all of them. Here's what the benchmarks actually say, what they don't, and which model is worth your money depending on what you actually build. … Benchmark Claude Opus 4.6 GPT-5.3 Codex Winner -- -- -- -- SWE-bench Verified 80.8% 56.8% Opus 4.6...
[6] Claude Opus 4.6 vs GPT-5.3 Codex: Complete Comparison (digitalapplied.com)
79.4% Claude SWE-bench Verified; 78.2% GPT-5.3 SWE-bench Pro; 77.3% Claude GPQA Diamond; 25% GPT-5.3 Speed Gain. Key Takeaways: Claude leads SWE-bench Verified: Opus 4.6 scores 79.4% on SWE-bench Verified while GPT-5.3-Codex leads SWE-bench Pro Public at 78.2%...
[7] Claude Opus 4.6 vs GPT-5.3 Codex: We Tested Both on Real ... (intelligibberish.com)
The Benchmark Numbers Before getting to practical testing, here’s how the flagship models compare on standardized benchmarks. Claude Opus 4.6: - SWE-bench Verified: 79.4% - GPQA Diamond: 77.3% - Terminal-Bench 2.0: 65.4% GPT-5.3 Codex: - SWE-bench Pro Publi...
[9] New GPT and Claude Releases Continue to One-Up Themselves (blog.kilo.ai)
- Agent Teams (preview) — multiple Claude instances collaborating in parallel on tasks like code review, testing, and documentation - 80.8% on SWE-Bench Verified — the highest score on real-world bug-fixing evaluations - 65.4% on Terminal-Bench 2.0 — a new...
[10] SWE-bench 2026: Claude Opus 4.6 vs GPT-5.4 Coding Benchmarks (evolink.ai)
Here is the practical answer: - Claude Opus 4.6 has strong official coding claims from Anthropic, including public discussion of SWE-bench Verified methodology and strong performance on Terminal-Bench 2.0. - GPT-5.4 has strong official coding claims from Op...
