METR later revisited the study design in early 2026, adjusting for task heterogeneity. The revised analysis found a modest 6% sample-wide speedup, but with extreme variation: some developers gained up to 25% on certain tasks, while others remained net slower. The core conclusion held: AI's benefit is highly task-dependent, and self-reported speed is not a reliable metric.
If time-to-completion numbers are noisy, code quality data is clearer. CodeRabbit's "State of AI vs Human Code Generation" report analyzed 470 real-world GitHub pull requests (320 AI-coauthored and 150 human-only) across open-source projects.
The headline is stark: AI-generated pull requests contained about 1.7x as many issues on average as human-written ones (10.83 issues per PR vs. 6.45). The quality deficit wasn't limited to style or formatting. It was concentrated in areas that cause real incidents:
CodeRabbit's analysis also identified a "heavier review tail" for AI-authored code, meaning human reviewers spent disproportionately more time finding and diagnosing problems in AI-generated changes. As the report's authors put it, humans and AI make the same kinds of mistakes — AI just makes many of them more often and at a larger scale.
The pattern aligns with CodeRabbit's broader observation that 2025 was defined by AI speed, but 2026 must become the year of AI quality. Postmortems increasingly traced operational incidents back to subtle logic errors, configuration oversights, and design misunderstandings introduced by AI assistants.
The quality deficit translates directly into financial waste. Developer productivity platform Entelligence.AI aggregated data from 2,444 companies and produced a breakdown that has reverberated through engineering circles:
| Where the dollar goes | Share of each $1 of AI token spend |
|---|---|
| Fixing AI-introduced bugs | $0.44 |
| Rework | $0.27 |
| Review friction | $0.11 |
| Actual value reaching users | $0.18 |
In other words, 82 cents of every dollar spent on AI tokens goes to bugs, rework, and review overhead. Only 18 cents delivers user-facing value. The cost isn't theoretical. Uber exhausted its entire 2026 AI coding budget within four months and recorded zero measurable productivity gain. An unnamed Uber executive stated bluntly that the link between AI spend and product improvement "doesn't exist yet."
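The 82-cent figure follows directly from the table, and the arithmetic is easy to reproduce. The snippet below is a minimal sketch that sums the overhead rows and applies the value share to a token budget. The names `BREAKDOWN_CENTS`, `overhead_cents`, and `value_delivered`, and the $50,000 budget in the usage note, are illustrative assumptions, not part of the Entelligence.AI data.

```python
# Per-dollar breakdown from the table above, stored as integer cents
# to avoid floating-point rounding when summing.
BREAKDOWN_CENTS = {
    "Fixing AI-introduced bugs": 44,
    "Rework": 27,
    "Review friction": 11,
    "Actual value reaching users": 18,
}

VALUE_ROW = "Actual value reaching users"


def overhead_cents(rows, value_row=VALUE_ROW):
    """Cents of each dollar that go to everything except user-facing value."""
    return sum(cents for label, cents in rows.items() if label != value_row)


def value_delivered(token_spend_dollars, rows=BREAKDOWN_CENTS, value_row=VALUE_ROW):
    """Dollars of user-facing value implied by a given token spend.

    Assumes the reported per-dollar split applies uniformly to the whole
    budget, which is a simplification made for illustration.
    """
    return token_spend_dollars * rows[value_row] / 100


if __name__ == "__main__":
    print(sum(BREAKDOWN_CENTS.values()))    # rows account for the full dollar
    print(overhead_cents(BREAKDOWN_CENTS))  # overhead share in cents
    print(value_delivered(50_000))          # hypothetical $50,000 budget
```

Under these assumptions, a hypothetical $50,000 token budget would translate into $9,000 of user-facing value, with the remaining $41,000 absorbed by bug fixes, rework, and review friction.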
A complementary study from Stanford and MIT found that AI agents fixing code bugs can burn over a million tokens per task — approximately 1,000 times the token consumption of standard code Q&A tasks. The economics suggest that for many organizations, the downstream costs of AI adoption are currently eating the promised productivity gains.
Perhaps the most psychologically striking finding is that developers who have seen this data still refuse to work without AI. Multiple outlets have reported that participants in the METR study resisted returning to unaided coding even after being shown their own slowdown figures. This has been described as an "AI dependency paradox": once developers become accustomed to AI assistance, they lose confidence in their unaided ability, even when the tool is demonstrably slowing them down.
As one developer put it, AI "handles the boring parts — boilerplate, syntax, the stuff that feels like work but isn't where the actual difficulty lives." The tool makes coding feel faster even when the stopwatch says otherwise, because the friction shifts from writing initial drafts to conducting meticulous reviews.
Across METR's controlled trials, CodeRabbit's pull request analysis, and Entelligence.AI's enterprise data, a consistent set of recommendations has emerged:
The emerging evidence doesn't suggest that AI coding tools are useless. In specific contexts (onboarding onto unfamiliar codebases, generating boilerplate, and tasks where developers predicted AI would help substantially), measurable speedups do appear. But across the broader population of experienced developers working on their own mature codebases, the net effect from mid-2025 through 2026 has been slower deliveries, more defects, and a dependency that resists the data.