Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud on hallucination control, abstention behavior, and calibrated uncertainty
Research answer
This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” [10][5][6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. [10][5][6]
Key findings
Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. [10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. [10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. [10][4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. [15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. [15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. [15] A short worked version of that arithmetic appears in the sketch after these findings.
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. [8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. [9][7][3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. [4][3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. [4][15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. [10][15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. [10][15]
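To make that trade-off concrete, here is a minimal sketch (my own illustration, not an official evaluation script) that re-derives the comparison from the SimpleQA rates quoted above: under an accuracy-only score the near-never-abstaining model wins narrowly, but once a wrong answer costs anything meaningful relative to an abstention, the abstaining model comes out ahead.

```python
# Sketch: re-doing the arithmetic behind the quoted SimpleQA figures.
# Rates are fractions of all questions, so correct + abstain + error = 1.

def abstention_report(name, accuracy, abstention, error, wrong_penalty=1.0):
    attempted = accuracy + error
    # Share of attempted answers that are wrong (the "hallucination rate
    # when the model does answer").
    error_among_attempts = error / attempted if attempted else 0.0
    # Accuracy-only leaderboards implicitly set wrong_penalty = 0, which
    # rewards guessing; any positive penalty starts rewarding abstention.
    penalized_score = accuracy - wrong_penalty * error
    return {
        "model": name,
        "abstained": abstention,
        "error_among_attempts": round(error_among_attempts, 3),
        "penalized_score": round(penalized_score, 3),
    }

# Figures as quoted from OpenAI's SimpleQA example [15].
print(abstention_report("gpt-5-thinking-mini", accuracy=0.22, abstention=0.52, error=0.26))
print(abstention_report("o4-mini", accuracy=0.24, abstention=0.01, error=0.75))
# error_among_attempts: ~0.542 vs ~0.758; penalized_score: -0.04 vs -0.51.
# The crossover is roughly 0.04 points of penalty per wrong answer: above
# that, the abstaining model scores higher despite its lower raw accuracy.
```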
Evidence notes
The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” [10][5][6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. [15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. [8][9][7][3]
Limitations / uncertainty
Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” [5][6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. [10][15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. [15][8][4]
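As a sketch of what identical scoring rules would look like in practice, the hypothetical harness below runs both models on the same questions and classifies every response as correct, abstained, or wrong before computing the same three rates for each. The grading rubric, abstention markers, model-calling functions, and dataset are all placeholders, not any vendor's real API.

```python
# Hypothetical comparison harness (sketch only): one shared question set,
# one shared rubric for separating abstentions from errors.
from collections import Counter
from typing import Callable, Dict, List, Tuple

ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot verify")

def grade(response: str, gold: str) -> str:
    """Classify a response as 'correct', 'abstain', or 'error' under one shared rule."""
    text = response.strip().lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstain"
    return "correct" if gold.lower() in text else "error"

def evaluate(ask_model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of correct, abstained, and wrong answers over the dataset."""
    counts = Counter(grade(ask_model(question), gold) for question, gold in dataset)
    total = len(dataset)
    return {label: counts[label] / total for label in ("correct", "abstain", "error")}

# Usage sketch: `ask_claude` and `ask_openai` stand in for real API calls made
# with identical prompting and tool access; `questions` is a shared benchmark
# of (question, gold_answer) pairs.
# results = {"claude-opus-4.7": evaluate(ask_claude, questions),
#            "gpt-5-mini": evaluate(ask_openai, questions)}
```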
Summary
Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. [10][5][6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. [15][8][9][10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” [10][5][6]
Sources
[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
[4] A comprehensive taxonomy of hallucinations in large language models
If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.
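As a starting point, that scorecard could be as simple as the placeholder structure below. The axis names come from the offer above; every score and source field is deliberately left empty, to be filled from shared-benchmark runs rather than vendor marketing.

```python
# Placeholder scorecard: three axes only, values to be filled in later.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AxisScore:
    score: Optional[float] = None  # e.g., accuracy, abstention recall, calibration error
    sources: List[str] = field(default_factory=list)  # citations or benchmark run IDs

@dataclass
class ModelScorecard:
    model: str
    factuality: AxisScore = field(default_factory=AxisScore)
    abstention_behavior: AxisScore = field(default_factory=AxisScore)
    uncertainty_calibration: AxisScore = field(default_factory=AxisScore)

cards = [ModelScorecard("Claude Opus 4.7"), ModelScorecard("GPT-5 mini")]
```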