The clean benchmark answer is workload-specific, not a single leaderboard crown. In the available public reports, GPT-5.4 narrowly improves on GPT-5.3-Codex on SWE-Bench Pro, GPT-5.3-Codex remains stronger in one cited Terminal-Bench 2.0 comparison, and Claude Opus 4.6 has the strongest reported SWE-Bench Verified results. Those claims sit on different benchmark variants and, for terminal tasks, different agent harnesses, so they should not be collapsed into one universal score [1][3][7][10].
The benchmark caveats that matter
Two caveats dominate this comparison.
First, SWE-Bench Pro and SWE-Bench Verified are not the same test. Multiple sources warn that Anthropic and OpenAI results are often reported on different SWE-bench variants, so direct score comparison is technically invalid or at least not apples-to-apples [6][7][10].
Second, Terminal-Bench 2.0 is not only a model comparison. Its public leaderboard reports agent/model pairs. The same leaderboard lists Claude Opus 4.6 at 79.8% with ForgeCode and 75.3% with Capy, while GPT-5.3-Codex appears at 78.4% with SageAgent and 77.3% with Droid [1]. That means a terminal benchmark result can change meaningfully when the wrapper, tool loop, timeout, or agent design changes.
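To make the harness effect concrete, here is a small illustrative sketch in Python. The scores are the agent/model pairs quoted above; the grouping code is purely hypothetical and has nothing to do with how Terminal-Bench actually builds its leaderboard.

```python
# Illustrative only: scores are the agent/model pairs cited above [1];
# the grouping helper is a hypothetical example, not Terminal-Bench tooling.
from collections import defaultdict

leaderboard = [
    {"agent": "ForgeCode", "model": "Claude Opus 4.6", "score": 79.8},
    {"agent": "SageAgent", "model": "GPT-5.3-Codex",   "score": 78.4},
    {"agent": "Droid",     "model": "GPT-5.3-Codex",   "score": 77.3},
    {"agent": "Capy",      "model": "Claude Opus 4.6", "score": 75.3},
]

# Best reported score per model depends entirely on which harness it was paired with.
best_per_model = defaultdict(float)
for row in leaderboard:
    best_per_model[row["model"]] = max(best_per_model[row["model"]], row["score"])

for model, score in sorted(best_per_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: best reported pair = {score}%")

# Dropping a single harness (say, ForgeCode) is enough to flip the top model.
without_forgecode = [r for r in leaderboard if r["agent"] != "ForgeCode"]
top = max(without_forgecode, key=lambda r: r["score"])
print(f"Top pair without ForgeCode: {top['model']} via {top['agent']} at {top['score']}%")
```

The takeaway is that a "model score" on this benchmark is really a pair score, so any single-number comparison silently fixes the harness.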
Benchmark snapshot
| Benchmark | GPT-5.4 | GPT-5.3-Codex | Claude Opus 4.6 | Best read |
|---|---|---|---|---|
| SWE-Bench Pro | 57.7% [3] | 56.8% [3] | Not consistently available; one combined result lists 53.4% | GPT-5.4 has the small edge over GPT-5.3-Codex on this Pro comparison, but Pro scores should not be treated as interchangeable with Verified scores. |
| SWE-Bench Verified | 77.2% in one Vals.ai comparison cited by a GPT-5.4 analysis [3] | No directly comparable Verified score in the provided sources | 79.2% in the same GPT-5.4 analysis [3] | Claude Opus 4.6 has the stronger reported Verified signal. |
| Terminal-Bench 2.0 | 75.1% in GPT-5.4 reports [3] | 77.3% in several reports | 65.4% in several reports | GPT-5.3-Codex wins the cited GPT-5.4 comparison, but the live leaderboard can put Opus ahead when paired with a different agent [1]. |
| GPQA Diamond | Not found in the provided GPT-5.4 snippets | 73.8% in one comparison | 77.3% in the same comparison | The provided data supports a GPT-5.3-Codex vs Opus 4.6 comparison, not a full three-way GPT-5.4 ranking. |
What this means in practice
Claude Opus 4.6: strongest for SWE-Bench Verified-style bug fixing
Claude Opus 4.6 has the clearest advantage where the provided sources report SWE-Bench Verified. Those reports put Opus 4.6 at 79.2%, 79.4%, or 80.8% on Verified, and one source describes the 80.8% result as a real-world bug-fixing evaluation [3][7][9].
That is the strongest reason to pick Opus 4.6 if your main concern is benchmarked repository repair or Verified-style bug fixing. The caveat is important: do not compare that 79-81% Verified band directly against GPT-5.4's 57.7% SWE-Bench Pro or GPT-5.3-Codex's 56.8% SWE-Bench Pro, because those are different benchmark variants [3][6][7][10].
GPT-5.3-Codex: still a very strong terminal-agent pick
GPT-5.3-Codex's best case is Terminal-Bench 2.0. The sources repeatedly report 77.3% for GPT-5.3-Codex, and the public Terminal-Bench 2.0 leaderboard lists GPT-5.3-Codex at 78.4% when paired with SageAgent [1][3][7][9].
That makes it a strong choice if your evaluation is terminal-agent-heavy. However, the same public leaderboard lists Claude Opus 4.6 at 79.8% with ForgeCode, so the stricter conclusion is not that the base GPT-5.3-Codex model always beats Opus. It is that GPT-5.3-Codex is consistently competitive in terminal workflows, while the agent harness can flip the top-line result [1].
GPT-5.4: a modest coding bump, not a blowout
GPT-5.4 looks better than GPT-5.3-Codex on one SWE-Bench Pro comparison, 57.7% versus 56.8%, but worse on the same report's Terminal-Bench 2.0 line, 75.1% versus 77.3% [3]. That supports a narrow conclusion: GPT-5.4 is worth testing if you are evaluating OpenAI's newer model, but the available coding benchmark evidence does not show a dramatic leap over GPT-5.3-Codex.
The more distinctive GPT-5.4 feature in the provided analysis sits outside the headline coding scores: a tool-search capability that reduces MCP token usage by 47% in tool-heavy workflows [3]. If your system leans heavily on tools and context management, that may matter more than the small SWE-Bench Pro delta.
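As a rough, illustrative calculation, the savings compound quickly in tool-heavy traffic. Only the 47% reduction comes from the cited analysis; the per-request token counts below are invented placeholders.

```python
# Back-of-the-envelope sketch: the 47% figure comes from the cited GPT-5.4
# analysis [3]; the token counts and request volume are assumed placeholders.
mcp_tool_tokens_per_request = 20_000   # assumed: MCP/tool-schema overhead per request
requests_per_day = 5_000               # assumed: daily agent traffic for a tool-heavy system
reduction = 0.47                       # reported MCP token-usage reduction [3]

tokens_saved_per_request = mcp_tool_tokens_per_request * reduction
tokens_saved_per_day = tokens_saved_per_request * requests_per_day

print(f"~{tokens_saved_per_request:,.0f} tokens saved per request")
print(f"~{tokens_saved_per_day:,.0f} tokens saved per day")
# With 20k tokens of tool definitions per request, that is roughly 9,400 tokens
# back per call, which can dwarf a sub-point SWE-Bench Pro delta in practice.
```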
Recommended ranking by task
- Repo bug-fixing benchmark priority: Claude Opus 4.6. Its reported SWE-Bench Verified numbers are the strongest available in this source set, generally around 79-81% [3][5][7][9].
- Terminal-agent benchmark priority: GPT-5.3-Codex, with a harness caveat. It has repeated 77.3% Terminal-Bench 2.0 reports and a 78.4% SageAgent leaderboard entry, but Opus 4.6 reaches 79.8% with ForgeCode on the same public leaderboard [1][3][7][9].
- OpenAI-only coding upgrade path: GPT-5.4. It edges GPT-5.3-Codex on SWE-Bench Pro by 0.9 percentage points but trails GPT-5.3-Codex by 2.2 points in the cited Terminal-Bench 2.0 comparison [3].
How to read the numbers without fooling yourself
- Compare benchmark variants, not just benchmark names. SWE-Bench Verified and SWE-Bench Pro are different, and direct comparisons across them are not apples-to-apples [6][7][10].
- Compare agent harnesses on Terminal-Bench. The public leaderboard reports agent/model pairs, and results vary across ForgeCode, SageAgent, Droid, Capy, and other wrappers [1].
- Run your own evaluation when the workload matters. The safest comparison uses the same tasks, same prompt policy, same agent wrapper, and same pass/fail criteria across all models; that recommendation follows directly from the cross-benchmark and harness differences documented in the cited sources [1][6][7][10]. A minimal sketch of that setup follows this list.
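Here is a minimal sketch of such a like-for-like evaluation loop, assuming a hypothetical `run_agent(model, prompt)` callable that hides each vendor's API behind the same wrapper. All names are placeholders, not vendor APIs.

```python
# Hypothetical like-for-like evaluation sketch: identical tasks, prompts,
# wrapper, and pass/fail check shared across every model under test.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str                      # identical prompt text for every model
    passes: Callable[[str], bool]    # identical pass/fail criterion for every model

def evaluate(models: list[str], tasks: list[Task],
             run_agent: Callable[[str, str], str]) -> dict[str, float]:
    """Run every model through the same task set and return pass rates."""
    results = {}
    for model in models:
        passed = sum(1 for t in tasks if t.passes(run_agent(model, t.prompt)))
        results[model] = passed / len(tasks)
    return results

# Usage sketch (run_agent and the task list are placeholders you supply):
# rates = evaluate(["gpt-5.4", "gpt-5.3-codex", "claude-opus-4.6"], my_tasks, run_agent)
# print(rates)
```

The point is not the specific helper names, which are made up, but that the tasks, prompts, wrapper, and pass/fail check are shared objects rather than per-model variations.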
Bottom line
Choose Claude Opus 4.6 for SWE-Bench Verified-style bug fixing, GPT-5.3-Codex for Terminal-Bench-style agent work, and GPT-5.4 when you want OpenAI's newer model but should not expect a dramatic coding benchmark jump over GPT-5.3-Codex. The numbers justify a task-based choice, not a universal champion [1][3][6][7][9][10].