| — |
| — |
| GPT-5.5 | OpenAI | April 24, 2026 | Artificial Analysis Index: 60.2 | — | — |
| Gemini 3.5 Flash | Google | May 2026 | "Flash" tier with frontier-model-level performance | $1.50 | $9.00 |
| Grok 4.3 | xAI | April 30, 2026 | Artificial Analysis Index: 53.2 | $1.25 | $2.50 |
| DeepSeek V4 Pro | DeepSeek | April 2026 | Artificial Analysis Index (Max Effort): 52 | $0.435 | $0.87 |
Claude Fable 5 has taken a decisive lead on composite intelligence metrics, but the gap between models varies enormously depending on what task you're measuring.
Coding is where Fable 5 makes its strongest claim to the top spot. On SWE-Bench Pro, the benchmark that most closely mirrors production software engineering, Fable 5 scores 80.3%, ahead of every other public model and 11.1 points clear of the runner-up, Claude Opus 4.8 (69.2%). That margin is far larger than the incremental gains typical between model releases.
Why the gap matters: On SWE-Bench Pro, Fable 5's 21.7-point lead over GPT-5.5 means hundreds of real software engineering tasks where Fable 5 ships working code and GPT-5.5 does not. The gap widens further on harder problems: on FrontierCode Diamond, the hardest 50 tasks, Fable 5 scores 29.3%, more than double Opus 4.8's 13.4% and over five times GPT-5.5's 5.7%. As Anthropic put it, the longer and more complex the task, the larger Fable 5's lead over every other model.
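The arithmetic behind these comparisons can be checked with a minimal sketch. The scores are the ones quoted in this article; the GPT-5.5 SWE-Bench Pro score is only implied by the stated 21.7-point lead, and the task count of 1,000 is a hypothetical placeholder, since the benchmark's size is not given here.

```python
# Scores quoted in this article (percent of tasks resolved).
FABLE_SWE_PRO = 80.3
OPUS_SWE_PRO = 69.2
LEAD_OVER_GPT55 = 21.7  # stated lead over GPT-5.5, in percentage points

FRONTIERCODE_DIAMOND = {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7}


def point_gap(leader: float, other: float) -> float:
    """Absolute gap between two scores, in percentage points."""
    return round(leader - other, 1)


def score_ratio(leader: float, other: float) -> float:
    """How many times larger the leader's score is."""
    return leader / other


def extra_tasks_solved(gap_points: float, n_tasks: int) -> int:
    """Tasks the leader solves beyond the other model on an n_tasks benchmark."""
    return round(gap_points / 100 * n_tasks)


# Implied, not quoted: GPT-5.5's score derived from the stated lead.
gpt55_swe_pro_implied = round(FABLE_SWE_PRO - LEAD_OVER_GPT55, 1)

# Hypothetical benchmark size, used only to illustrate the conversion.
HYPOTHETICAL_TASK_COUNT = 1000

print("Implied GPT-5.5 SWE-Bench Pro score:", gpt55_swe_pro_implied)
print("Fable 5 lead over Opus 4.8:", point_gap(FABLE_SWE_PRO, OPUS_SWE_PRO))
for rival in ("Opus 4.8", "GPT-5.5"):
    r = score_ratio(FRONTIERCODE_DIAMOND["Fable 5"], FRONTIERCODE_DIAMOND[rival])
    print(f"FrontierCode Diamond, Fable 5 vs {rival}: {r:.2f}x")
print("Extra tasks per", HYPOTHETICAL_TASK_COUNT, "tasks:",
      extra_tasks_solved(LEAD_OVER_GPT55, HYPOTHETICAL_TASK_COUNT))
```

Reporting both the point gap and the ratio is deliberate: on a hard benchmark where all scores are low, a small point gap can still be a large multiple.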
Agentic performance—a model's ability to carry out multi-step, real-world tasks—is where the economic value of these models is determined. Here, the Anthropic models have established a clear duopoly at the top, while Grok 4.3 has carved out a surprising niche as the best instruction-follower.
Claude Opus 4.8 also recorded a perfect score on Anthropic's internal Super-Agent benchmark, completing every single case end-to-end, something GPT-5.5 could not match at comparable cost. Meanwhile, Grok 4.3's 81% on IFBench is the highest instruction-following score recorded among frontier models, making it a standout for tool-calling and structured output generation.
GPT-5.5 remains the strongest model on the hardest mathematical reasoning benchmarks, but its lead comes with a significant asterisk.
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.5 Flash | Grok 4.3 | DeepSeek V4 Pro |
|---|---|---|---|---|---|---|
| FrontierMath T1–3 | — | — | 51.7% | — | — | — |
| FrontierMath T4 | — | — | 35.4% | — | — | — |
| USAMO 2026 Math | — | 96.7% | — | — | — | — |
| GPQA Diamond | — | 93.6% | 93.5% | — | 90.1% | 90.1% |
While GPT-5.5 leads on FrontierMath, it was Opus 4.8 that delivered the most striking single-cycle math improvement on record: a 27-point leap on USAMO 2026, from 69.3% on Opus 4.7 to 96.7%.
The hallucination trade-off: GPT-5.5's math and reasoning prowess comes at a cost. Artificial Analysis recorded an 86% hallucination rate on the AA-Omniscience knowledge benchmark, 2.5× higher than Claude Opus 4.7. The model posted the highest accuracy score any model has ever achieved on that evaluation (57%), but also generated the most confident falsehoods. For research applications, this makes GPT-5.5 a high-risk choice. Claude Opus 4.8, by contrast, had the lowest incorrect-rate of six models tested on every factual benchmark, achieving this by abstaining when uncertain rather than guessing.
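A 57% accuracy score and an 86% hallucination rate can look contradictory. The sketch below shows one way they fit together, under an assumption that is not stated in this article: that the hallucination rate is the share of non-correct responses that are wrong answers rather than abstentions, and that accuracy is a share of all questions. The function name is illustrative.

```python
def split_responses(accuracy: float, hallucination_rate: float) -> dict:
    """Split all responses into correct / wrong / abstained shares.

    ASSUMPTION (not from the article): hallucination_rate is defined as
    wrong / (wrong + abstained), i.e. measured only over the questions
    the model did not answer correctly.
    """
    not_correct = 1.0 - accuracy
    wrong = not_correct * hallucination_rate
    abstained = not_correct - wrong
    return {"correct": accuracy, "wrong": wrong, "abstained": abstained}


# Figures quoted for GPT-5.5 on AA-Omniscience.
shares = split_responses(accuracy=0.57, hallucination_rate=0.86)
for label, share in shares.items():
    print(f"{label:>9}: {share:.1%}")
```

Under that reading, roughly 37% of all answers would be confidently wrong and about 6% would be abstentions. That is the pattern the paragraph describes: a model that rarely declines to answer, set against one that abstains when uncertain.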
For users who prioritize cost or speed over raw intelligence, the landscape is different.
| Metric | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.5 Flash | Grok 4.3 | DeepSeek V4 Pro |
|---|---|---|---|---|---|---|
| Output Speed | — | — | 68.2 tok/s | 4× faster than 3.1 Pro | 123-159+ tok/s | — |
| Pricing (Output / 1M tok) | $50.00 | — | $30.00 | $9.00 | $2.50 | $0.87 |
Gemini 3.5 Flash offers compelling value at $9.00/M output tokens while achieving 55.1% on SWE-Bench Pro, slightly surpassing its own more expensive predecessor, Gemini 3.1 Pro (54.2%). Grok 4.3, at just $2.50/M output tokens, punches above its weight class on agentic tasks and offers the fastest output speeds in the set. DeepSeek V4 Pro, despite the lowest pricing at $0.87/M output tokens and competitive coding scores on self-reported benchmarks, lags significantly on independent evaluations, as the next section explores.
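To make the price differences concrete, here is a minimal sketch that converts the per-million output prices from the table into a bill for a given volume. The 20M-token monthly workload is a hypothetical figure chosen for illustration, and input-token costs are ignored.

```python
# Output prices in USD per 1M tokens, as listed in the table above.
OUTPUT_PRICE_PER_M = {
    "Fable 5": 50.00,
    "GPT-5.5": 30.00,
    "Gemini 3.5 Flash": 9.00,
    "Grok 4.3": 2.50,
    "DeepSeek V4 Pro": 0.87,
}


def output_cost(model: str, output_tokens: int) -> float:
    """USD cost of generating output_tokens with the given model."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def price_multiple(model: str, baseline: str) -> float:
    """How many times more expensive model is than baseline, per output token."""
    return OUTPUT_PRICE_PER_M[model] / OUTPUT_PRICE_PER_M[baseline]


# Hypothetical workload: 20M output tokens per month.
MONTHLY_OUTPUT_TOKENS = 20_000_000

for name in OUTPUT_PRICE_PER_M:
    cost = output_cost(name, MONTHLY_OUTPUT_TOKENS)
    multiple = price_multiple(name, "DeepSeek V4 Pro")
    print(f"{name:<17} ${cost:>8.2f}/month  ({multiple:.1f}x DeepSeek V4 Pro)")
```

Price per token is only part of the picture. A model that resolves more tasks per attempt may need fewer retries, which this sketch does not model.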
The most important independent data point for DeepSeek V4 Pro comes from CAISI, the U.S. government's AI evaluation body housed at NIST. Their conclusion is unambiguous: DeepSeek V4's capabilities lag behind the frontier by about 8 months.
The gap between self-reported and independent scores: DeepSeek's self-reported benchmarks position V4 Pro as roughly comparable to Claude Opus 4.6 and GPT-5.4, which were released about 2 months prior. However, CAISI's evaluations, which include non-public benchmarks, found that DeepSeek V4 Pro performed similarly to GPT-5, which was released 8 months earlier. This discrepancy is a critical reminder that models tend to score better on vendor-chosen benchmarks than on independent, non-public evaluations.
On the safety front, the UK's AI Security Institute measured GPT-5.5 at 71.4% (±8.0%) on a standard test of expert-level cybersecurity tasks, stating it "may be the strongest model we have tested" on that measure.
For production software engineering: Choose Claude Fable 5 if you need the highest resolve rate on real GitHub issues (80.3% on SWE-Bench Pro). Claude Opus 4.8 is the strong runner-up at 69.2% and may offer better cost-efficiency for less complex tasks.
For high-stakes research or factual work: Avoid GPT-5.5 where factual accuracy matters; its 86% hallucination rate on AA-Omniscience makes confident errors too likely. Claude Opus 4.8's deliberate abstention from uncertain answers makes it the safer choice for research, despite occasionally refusing questions it could answer.
For speed and instruction-following: Grok 4.3's 81% IFBench score and 98% τ²-Bench Telecom score make it the best model for reliable tool-calling, structured output generation, and high-throughput applications—at a fraction of the price of the top-tier models.
For value and agentic tasks: Gemini 3.5 Flash offers a strong balance of performance and cost, with 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas for multi-step tool-use workflows.
For open-source coding: DeepSeek V4 Pro delivers strong coding performance (LiveCodeBench 93.5%) at $0.87/M output tokens, but users should be aware that independent evaluations place its overall capabilities about 8 months behind the frontier.
The AI model landscape in June 2026 is one of striking contrasts: a model that leads the hardest math benchmarks but posts an 86% hallucination rate on a factual-knowledge benchmark; a government evaluation that cuts through vendor claims to reveal an 8-month capability gap; and a new leader, Claude Fable 5, that opens an 11-point gap on the production-coding benchmark that matters most to developers. These contrasts make it clear that no single benchmark tells the whole story, and that the most useful model depends entirely on the specific task, tolerance for error, and budget at hand.