Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud on hallucination control, abstention behavior, and calibrated uncertainty
Research answer
This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” [10][5][6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. [10][5][6]
Key findings
Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. [10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. [10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. [10][4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. [15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. [15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. [15] A short worked version of that arithmetic appears in the sketch after these findings.
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. [8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. [9][7][3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. [4][3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. [4][15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. [10][15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. [10][15]
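To make that trade-off concrete, here is a minimal sketch (my own illustration, not an official evaluation script) that re-derives the comparison from the SimpleQA rates quoted above: under an accuracy-only score the near-never-abstaining model wins narrowly, but once a wrong answer costs anything meaningful relative to an abstention, the abstaining model comes out ahead.

```python
# Sketch: re-doing the arithmetic behind the quoted SimpleQA figures.
# Rates are fractions of all questions, so correct + abstain + error = 1.

def abstention_report(name, accuracy, abstention, error, wrong_penalty=1.0):
    attempted = accuracy + error
    # Share of attempted answers that are wrong (the "hallucination rate
    # when the model does answer").
    error_among_attempts = error / attempted if attempted else 0.0
    # Accuracy-only leaderboards implicitly set wrong_penalty = 0, which
    # rewards guessing; any positive penalty starts rewarding abstention.
    penalized_score = accuracy - wrong_penalty * error
    return {
        "model": name,
        "abstained": abstention,
        "error_among_attempts": round(error_among_attempts, 3),
        "penalized_score": round(penalized_score, 3),
    }

# Figures as quoted from OpenAI's SimpleQA example [15].
print(abstention_report("gpt-5-thinking-mini", accuracy=0.22, abstention=0.52, error=0.26))
print(abstention_report("o4-mini", accuracy=0.24, abstention=0.01, error=0.75))
# error_among_attempts: ~0.542 vs ~0.758; penalized_score: -0.04 vs -0.51.
# The crossover is roughly 0.04 points of penalty per wrong answer: above
# that, the abstaining model scores higher despite its lower raw accuracy.
```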
Evidence notes
The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” [10][5][6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. [15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. [8][9][7][3]
Limitations / uncertainty
Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” [5][6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. [10][15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. [15][8][4]
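As a sketch of what identical scoring rules would look like in practice, the hypothetical harness below runs both models on the same questions and classifies every response as correct, abstained, or wrong before computing the same three rates for each. The grading rubric, abstention markers, model-calling functions, and dataset are all placeholders, not any vendor's real API.

```python
# Hypothetical comparison harness (sketch only): one shared question set,
# one shared rubric for separating abstentions from errors.
from collections import Counter
from typing import Callable, Dict, List, Tuple

ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot verify")

def grade(response: str, gold: str) -> str:
    """Classify a response as 'correct', 'abstain', or 'error' under one shared rule."""
    text = response.strip().lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstain"
    return "correct" if gold.lower() in text else "error"

def evaluate(ask_model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of correct, abstained, and wrong answers over the dataset."""
    counts = Counter(grade(ask_model(question), gold) for question, gold in dataset)
    total = len(dataset)
    return {label: counts[label] / total for label in ("correct", "abstain", "error")}

# Usage sketch: `ask_claude` and `ask_openai` stand in for real API calls made
# with identical prompting and tool access; `questions` is a shared benchmark
# of (question, gold_answer) pairs.
# results = {"claude-opus-4.7": evaluate(ask_claude, questions),
#            "gpt-5-mini": evaluate(ask_openai, questions)}
```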
Summary
Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. [10][5][6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. [15][8][9][10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” [10][5][6]
Sources
[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
[4] A comprehensive taxonomy of hallucinations in large language models
If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.
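As a starting point, that scorecard could be as simple as the placeholder structure below. The axis names come from the offer above; every score and source field is deliberately left empty, to be filled from shared-benchmark runs rather than vendor marketing.

```python
# Placeholder scorecard: three axes only, values to be filled in later.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AxisScore:
    score: Optional[float] = None  # e.g., accuracy, abstention recall, calibration error
    sources: List[str] = field(default_factory=list)  # citations or benchmark run IDs

@dataclass
class ModelScorecard:
    model: str
    factuality: AxisScore = field(default_factory=AxisScore)
    abstention_behavior: AxisScore = field(default_factory=AxisScore)
    uncertainty_calibration: AxisScore = field(default_factory=AxisScore)

cards = [ModelScorecard("Claude Opus 4.7"), ModelScorecard("GPT-5 mini")]
```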