The model also performs strongly on knowledge‑work benchmarks. On GDPval, which evaluates tasks across professions like law, finance, and product management, GPT‑5.5 matches or beats professionals in about 84.9% of comparisons.
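As a rough illustration of how a "matches or beats" figure like this can be computed, the sketch below tallies pairwise comparison outcomes into a win-or-tie rate. The function name, the outcome labels, and the sample data are hypothetical and are not taken from GDPval's actual grading pipeline.

```python
from collections import Counter

def win_or_tie_rate(outcomes):
    """Share of pairwise comparisons in which the model won or tied.

    `outcomes` is an iterable of labels "win", "tie", or "loss",
    one per model-vs-professional comparison (hypothetical format).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return (counts["win"] + counts["tie"]) / total

# Made-up sample: 4 comparisons, 3 of which are wins or ties.
sample = ["win", "tie", "loss", "win"]
print(f"{win_or_tie_rate(sample):.1%}")
```

Because ties count toward the headline number under this definition, a high rate does not by itself show how often the model's output was strictly preferred.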
Taken together, these results suggest GPT‑5.5 is especially strong at autonomous multi‑step workflows and agent‑style tasks.
Anthropic’s Claude Opus 4.7 is widely regarded as one of the strongest AI systems for software engineering.
Key results include:
The SWE‑bench family is one of the most realistic coding tests available because it uses real bugs from open‑source repositories. Opus 4.7's reported 87.6% resolution rate on SWE‑bench Verified places it among the best publicly reported coding agents.
While its Terminal‑Bench score trails GPT‑5.5, its performance on software engineering benchmarks remains among the strongest in the field.
Google’s Gemini 3.5 Flash is unusual because it’s positioned as a fast and cost‑efficient model rather than a heavyweight flagship.
Even so, its benchmark results remain competitive:
According to Google, Gemini 3.5 Flash can generate outputs roughly four times faster than comparable frontier models while outperforming the earlier Gemini 3.1 Pro on several agentic benchmarks.
In practical deployments, its biggest advantage is the speed‑to‑capability balance: near‑flagship performance combined with lower latency and cost.
DeepSeek V4 is notable as one of the most capable AI models released with open weights.
The model family includes two versions:
According to technical summaries and model reports, V4‑Pro running in its maximum reasoning mode achieves:
These numbers place it surprisingly close to leading proprietary models in several coding‑related benchmarks.
However, an independent evaluation from the U.S. National Institute of Standards and Technology (NIST) found that the model’s capabilities lag the frontier by roughly eight months, highlighting the gap that can exist between vendor‑reported and independent results.
xAI’s Grok 4.3 represents a substantial improvement over earlier Grok releases, particularly in agentic workflows.
Reported figures include:
The most notable improvement is on GDPval‑AA, where Grok 4.3 gained more than 300 Elo compared with the previous version, suggesting major progress on real‑world task automation.
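To give a sense of what a 300-point gap means, the sketch below applies the standard Elo expected-score formula. It assumes the conventional 400-point logistic scale; whether GDPval‑AA uses exactly this scale is an assumption, not something stated here, and the function name is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo curve.

    Assumes the conventional 400-point logistic scale, which the
    benchmark may or may not use.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# Expected head-to-head score at a few rating gaps.
for gap in (0, 100, 300):
    print(gap, round(elo_expected_score(gap), 3))
```

Under that assumption, a 300-point advantage corresponds to an expected score of about 0.85 in head-to-head comparisons against the earlier version.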
However, independent analyses generally still place the model below the newest OpenAI and Anthropic systems on broad capability benchmarks.
Looking across these evaluations, a pattern appears:
These conclusions are directional rather than definitive because each company highlights different evaluation suites.
Frontier AI benchmarks remain unstable for several reasons:
Because of these factors, the true ranking of frontier models often becomes clearer only after months of independent testing.
The latest benchmark data does not show a single model dominating every domain.
Instead, the frontier has become increasingly specialized:
As more independent evaluations appear and benchmark standards stabilize, the relative ranking of these models will likely continue to evolve.