On the broader agentic coding suite, GPT-5.5 still holds a lead in specific areas. On the Terminal-Bench 2.1 agentic terminal coding evaluation, GPT-5.5 scored 78.2%, ahead of Opus 4.8's 74.6% and Gemini 3.1 Pro's 70.3%.
Anthropic's internal benchmarks also report gains on knowledge-work tasks. The model achieved a score of 1890 on the GDPval-AA evaluation for economically valuable knowledge work, compared to GPT-5.5's 1769 and Gemini's 1314. Across the full suite, Anthropic claims Opus 4.8 outperforms both rival models in several key categories, though it does not lead every single test.
In a shift away from raw intelligence benchmarks, Anthropic heavily emphasized improvements in model trustworthiness. The company reported that Opus 4.8 is approximately four times less likely than Opus 4.7 to let flaws in its own generated code pass without comment.
Early tester feedback highlighted that the model is significantly more likely to flag uncertainty and less prone to making unsupported claims during complex, multi-step workflows. The company directly framed "honesty" as a marquee product feature in this release, stating the model is less likely to present insufficiently supported information as factual.
Alongside the base model, Anthropic launched new user-facing features specifically for developers and power users.
Dynamic Workflows: Available as a research preview in Claude Code, this feature lets the model plan a task, orchestrate it across hundreds of parallel subagents, and verify the results before reporting back. It is designed for massive-scale code migrations, auditing, and bug hunting within a single session.
Adjustable Engagement / Effort Control: Users can now set the model's depth of reasoning. The "effort" parameter on claude.ai and Claude Code trades off intelligence against token cost and speed. The documentation recommends using the xhigh level for the most difficult coding and agentic use cases, and a minimum of high for other intelligence-sensitive tasks.
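The effort guidance above can be written down as a small lookup. This is only a sketch: the task categories and the function name recommended_effort are hypothetical labels invented for illustration, and the only level names used are the two the article mentions (xhigh and high). It does not call any real API.

```python
from typing import Optional

# Hypothetical task categories, invented for this sketch.
HARD_CODING_OR_AGENTIC = "hard_coding_or_agentic"
INTELLIGENCE_SENSITIVE = "intelligence_sensitive"
OTHER = "other"


def recommended_effort(task_kind: str) -> Optional[str]:
    """Map a task category to the effort level the article describes.

    Returns None when the article gives no recommendation.
    """
    if task_kind == HARD_CODING_OR_AGENTIC:
        # Recommended for the most difficult coding and agentic use cases.
        return "xhigh"
    if task_kind == INTELLIGENCE_SENSITIVE:
        # Described as a minimum, so callers may still choose to go higher.
        return "high"
    if task_kind == OTHER:
        # No guidance quoted for other tasks; leave the choice to the caller.
        return None
    raise ValueError(f"unknown task kind: {task_kind!r}")


if __name__ == "__main__":
    for kind in (HARD_CODING_OR_AGENTIC, INTELLIGENCE_SENSITIVE, OTHER):
        print(kind, "->", recommended_effort(kind))
```

Returning None for the remaining category keeps the sketch from implying a recommendation the article does not make.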
Prompt caching rates are set at $6.25 per million tokens for 5-minute cache writes, $10 per million tokens for 1-hour cache writes, and $0.50 per million tokens for cache hits and refreshes.
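To make the caching rates concrete, here is a rough cost calculation in Python. It uses only the three rates quoted above; the function name cache_cost, the "5m"/"1h" labels, and the example token counts are assumptions made for illustration, and the sketch ignores regular input and output token charges.

```python
from decimal import Decimal

# Rates in USD per million tokens, as quoted in the article.
CACHE_WRITE_5M = Decimal("6.25")   # 5-minute cache writes
CACHE_WRITE_1H = Decimal("10.00")  # 1-hour cache writes
CACHE_READ = Decimal("0.50")       # cache hits and refreshes
PER_MILLION = Decimal(1_000_000)


def cache_cost(write_tokens: int, read_tokens: int, ttl: str = "5m") -> Decimal:
    """Estimate cache-related cost in USD.

    write_tokens: total tokens written to the cache
    read_tokens:  total tokens served from the cache (hits and refreshes)
    ttl:          "5m" or "1h" (hypothetical labels for the two write tiers)
    """
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    write_rate = CACHE_WRITE_5M if ttl == "5m" else CACHE_WRITE_1H
    total = Decimal(write_tokens) * write_rate + Decimal(read_tokens) * CACHE_READ
    return total / PER_MILLION


if __name__ == "__main__":
    # Assumed scenario: a 200,000-token prefix written once and read 50 times.
    prefix = 200_000
    reads = 50 * prefix
    print("5-minute tier:", cache_cost(prefix, reads, "5m"))
    print("1-hour tier:  ", cache_cost(prefix, reads, "1h"))
```

In this assumed scenario the write costs $1.25 on the 5-minute tier or $2.00 on the 1-hour tier, and the 50 reads cost $5.00 in either case, so the totals come to $6.25 and $7.00 respectively.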
The Opus 4.8 release is not purely a benchmark bump; it is a targeted enterprise and developer upgrade. The product story centers on reliability for agents, explicit uncertainty handling, and giving programmers control over cost-performance trade-offs through effort levels. Pricing remains conservative, with no increase for standard API calls, while the Fast mode price drop makes high-speed inference more accessible for latency-critical applications.