Reports · Published Apr 29, 2026 · Last edited May 6, 2026 · 20 sources

Claude Opus 4.7 vs. "GPT-5.5 Spud": What the Data Actually Says About Hallucinations

Claude Opus 4.7 is officially documented by Anthropic, while GPT-5.5 Spud is not confirmed in the official OpenAI materials provided, so there is no proven winner in a hallucination comparison. In OpenAI's SimpleQA example, gpt-5-thinking-mini is listed with 52% abstentions, 22% accuracy, and 26% errors, while o4 mi...

At first glance, this looks like a routine comparison of two flagship models: which hallucinates less, Claude Opus 4.7 or GPT-5.5 Spud? But in the available sources the problem starts earlier: one side of the comparison is confirmed and the other is not.

Anthropic documents Claude Opus 4.7 and the API identifier claude-opus-4-7; the official OpenAI materials provided cover GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4, but no public model named GPT-5.5 Spud ^[12]^[16]^[23]^[25]^[26]^[29]^[45]. The careful conclusion is therefore this: Claude Opus 4.7 can be evaluated as a real model, while "GPT-5.5 Spud" cannot serve as a benchmark target until the name is tied to an official release, a model card, or API documentation.

Short verdict from the sources

Question	What the sources confirm
Is Claude Opus 4.7 an official model?	Yes. Anthropic describes Claude Opus 4.7, and the announcement states that developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Is GPT-5.5 Spud an official OpenAI model?	Not according to the official sources provided. They cover GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 materials, but not GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45].
Where does Spud appear at all?	In Reddit posts and an OpenAI Developer Community feature-request thread, not in release notes or API documentation ^[7]^[8]^[10]^[28].
Is there a Claude Opus 4.7 vs. GPT-5.5 Spud hallucination test?	The sources provided contain no shared test with identical tasks and an identical scoring scheme; a sound test must count abstentions and factual errors separately ^[68].

This does not mean that some future or unreleased Spud model is impossible. It means only one thing: the current evidence base does not support treating GPT-5.5 Spud as an official OpenAI model, let alone declaring a winner on hallucination rate.

What is actually known about Claude Opus 4.7

The strongest evidence on Claude Opus 4.7 is Anthropic's product documentation, not an independent leaderboard of which model errs less. Anthropic states that developers can use claude-opus-4-7 via the Claude API ^[16]. The documentation also says that Claude Opus 4.7 introduces task budgets ^[12].

For developers, this is an important mechanism for controlling model behavior. But a task budget is not in itself a public benchmark of calibrated uncertainty. It does not automatically show how well the model recognizes when it should say "there is not enough data."

There is also a separate signal related to honesty. Mashable, citing Anthropic's system card for Opus 4.7, reported a score of 91.7% on the MASK honesty metric; the same report says Claude Opus 4.7 is less likely to hallucinate or engage in sycophancy than previous Anthropic models and other frontier models ^[14]. That is relevant to assessing honesty, but it still does not answer the "Claude vs. Spud" question: there is no confirmed GPT-5.5 Spud model and no shared test in which both systems completed the same tasks.

What the OpenAI sources say instead

In the set provided, official OpenAI materials confirm other names: GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45]. The "Spud" trail leads to Reddit and a thread on the OpenAI Developer Community ^[7]^[8]^[10]^[28]. Such discussions can be a useful signal of sentiment or rumor, but they are not the same as an official model page, a model card, an API identifier, or a release announcement.

More relevant to hallucinations is another OpenAI piece: its explanation of why language models hallucinate. In it, OpenAI writes that common training and evaluation procedures reward guessing over acknowledging uncertainty. By that logic, it is better for a model to express uncertainty or ask for clarification than to confidently give incorrect information ^[3].
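A toy expected-score calculation can make the guessing incentive concrete. The scoring rule here (1 point for a correct answer, 0 for abstaining, a configurable penalty for a wrong answer) and the function names are illustrative assumptions, not a rule quoted from the OpenAI post.

```python
ABSTAIN_SCORE = 0.0  # assumption: abstaining earns nothing and costs nothing


def expected_score_if_answering(p_correct: float, wrong_penalty: float) -> float:
    """Expected score when answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when the expected score beats the score for abstaining."""
    return expected_score_if_answering(p_correct, wrong_penalty) > ABSTAIN_SCORE


# Accuracy-only grading (no penalty): even a 10%-confident guess beats abstaining.
print(should_answer(0.10, wrong_penalty=0.0))  # True
# With a 1-point penalty per wrong answer, answering pays only above 50% confidence.
print(should_answer(0.10, wrong_penalty=1.0))  # False
print(should_answer(0.60, wrong_penalty=1.0))  # True
```

Under this assumed rule the break-even confidence is penalty / (1 + penalty), so a grader that never penalizes wrong answers gives a model no reason to abstain.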

OpenAI's SimpleQA example shows why accuracy alone is not enough. There, gpt-5-thinking-mini is listed with 52% abstentions, 22% accuracy, and 26% errors, while o4-mini has 1% abstentions, 24% accuracy, and 75% errors ^[3]. In this example the first model answers less often but makes far fewer errors ^[3]. For a real product, especially where the cost of an error is high, that can matter more than a confident answer to every query.
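The same figures can be re-expressed as accuracy among attempted answers. This is a small arithmetic sketch: the three percentages per model come from the example above, while the derived "correct when answering" figure is computed here and is not a number reported by OpenAI.

```python
# Figures quoted from OpenAI's SimpleQA example (percent of all questions).
results = {
    "gpt-5-thinking-mini": {"abstain": 52, "correct": 22, "wrong": 26},
    "o4-mini": {"abstain": 1, "correct": 24, "wrong": 75},
}


def accuracy_when_answering(row: dict) -> float:
    """Share of attempted answers that were correct (abstentions excluded)."""
    attempted = row["correct"] + row["wrong"]
    return row["correct"] / attempted


for name, row in results.items():
    assert sum(row.values()) == 100  # the three shares cover every question
    print(f"{name}: answered {100 - row['abstain']}%, "
          f"correct when answering {accuracy_when_answering(row):.1%}")
```

Read this way, the model that abstains on about half of the questions is right on roughly 46% of the answers it does give, against roughly 24% for the model that almost never abstains.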

Why the key test is not who answers more boldly, but who calibrates confidence better

Hallucination control is not simply refusing to answer everything. A useful model should answer when the data is sufficient, ask clarifying questions when the request is vague, and abstain when an answer cannot be justified. That is the practical meaning of calibrated uncertainty.

Research supports this framing, with caveats. A 2024 paper reports that uncertainty-based abstention (declining to answer when uncertainty is high) improves correctness, reduces hallucinations, and improves safety in question-answering tasks ^[1]^[4]. I-CALM describes epistemic abstention as declining to answer factual questions with a verifiable answer when the model lacks sufficient confidence; the authors also note that current LLMs still often fail to abstain where they should ^[54]. A study on behaviorally calibrated reinforcement learning examines how to incentivize models to admit uncertainty by abstaining ^[61].

Surveys of uncertainty quantification treat uncertainty estimation as a tool for detecting hallucinations and stress that calibrated uncertainty helps determine when a model's answer can be trusted, when it should be verified, and when the task is better handed to a human or an external source ^[53]^[55]. But calibration is the point: a model that says "I don't know" too often is safer but less useful; a model that never doubts is convenient but risky.
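The trade-off between safety and usefulness can be sketched with a confidence threshold. The (confidence, is_correct) pairs below are made up purely for illustration and are not measured data for any model; `sweep` is a hypothetical helper name.

```python
# Made-up (confidence, is_correct) pairs for illustration only; not measured data.
samples = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.50, True), (0.40, False),
    (0.30, False), (0.20, False),
]


def sweep(threshold: float) -> tuple[float, float]:
    """Return (coverage, error_rate) when answering only at or above the threshold.

    Both values are fractions of all questions, so abstentions lower coverage
    without counting as errors.
    """
    answered = [ok for conf, ok in samples if conf >= threshold]
    coverage = len(answered) / len(samples)
    error_rate = sum(not ok for ok in answered) / len(samples)
    return coverage, error_rate


for t in (0.0, 0.5, 0.9):
    coverage, error_rate = sweep(t)
    print(f"threshold={t:.1f} coverage={coverage:.0%} errors={error_rate:.0%}")
```

On this toy data, raising the threshold from 0.0 to 0.9 drives errors from 50% to 0% but cuts coverage from 100% to 20%, which is the tension a calibrated model has to balance.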

How to compare Claude and OpenAI models fairly on hallucinations

Use only official model IDs. For Claude, that is claude-opus-4-7; for OpenAI, a documented model such as GPT-5 or GPT-5 mini, not the unconfirmed Spud label ^[16]^[23]^[25]^[29].
Build a mixed task set. The test should include answerable questions, underspecified requests, and questions that cannot be answered safely or factually. These are the cases where abstention research shows the benefit of declining to guess ^[1]^[4].
Count abstentions separately. Record correct answers, incorrect answers, correct abstentions, and incorrect abstentions. The abstention survey defines separate metrics: abstention accuracy, precision, and recall ^[68].
Do not conflate factual uncertainty with safety refusal. Refusing a harmful request is not the same as acknowledging insufficient data for a factual answer; I-CALM focuses specifically on epistemic abstention for verifiable factual questions ^[54].
Publish accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a model with a higher abstention rate can have comparable accuracy but a much lower error rate ^[3].
Keep conditions identical. Retrieval, browsing, tool access, context length, and system instructions can all change the result. If one model gets more external data, the test measures the whole experimental setup, not just the model.
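The counting described in this checklist can be sketched as a small scoring function. The `Item` fields and the metric definitions below are simplified assumptions for illustration; they are not the exact formulas from the abstention survey ^[68].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    should_abstain: bool   # ground truth: no defensible answer exists
    abstained: bool        # the model declined to answer
    correct: bool = False  # only meaningful for an answered, answerable item


def score(items: list[Item]) -> dict[str, float]:
    """Report accuracy, errors, and abstention behavior side by side."""
    n = len(items)
    correct = sum(not i.abstained and not i.should_abstain and i.correct for i in items)
    # Any answer to an unanswerable item counts as an error.
    wrong = sum(not i.abstained and (i.should_abstain or not i.correct) for i in items)
    good_abstain = sum(i.abstained and i.should_abstain for i in items)
    bad_abstain = sum(i.abstained and not i.should_abstain for i in items)
    abstained = good_abstain + bad_abstain
    warranted = sum(i.should_abstain for i in items)
    return {
        "accuracy": correct / n,
        "error_rate": wrong / n,
        "abstention_rate": abstained / n,
        "abstention_precision": good_abstain / abstained if abstained else 0.0,
        "abstention_recall": good_abstain / warranted if warranted else 0.0,
    }


items = [
    Item(False, False, True),   # correct answer
    Item(False, False, True),   # correct answer
    Item(False, False, False),  # wrong answer
    Item(False, True),          # unnecessary abstention
    Item(True, True),           # correct abstention
    Item(True, False),          # answered an unanswerable question
]
print(score(items))
```

Keeping the four outcome counts separate is what lets accuracy, error rate, and abstention rate be published together instead of being folded into one number.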

FAQ

Does GPT-5.5 Spud exist at all?

Not as an official OpenAI model, according to the evidence provided. The official OpenAI sources used here document GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4; Spud appears in Reddit posts and a community thread ^[7]^[8]^[10]^[23]^[25]^[26]^[28]^[29]^[45].

Can we say that Claude Opus 4.7 hallucinates less than GPT-5.5 Spud?

Strictly speaking, no. Claude Opus 4.7 is documented ^[12]^[16], and a secondary source reports 91.7% on MASK honesty ^[14]. But there is no confirmed test target named GPT-5.5 Spud and no shared benchmark for these two names ^[7]^[8]^[10]^[28]^[68].

What should buyers and developers compare?

Compare Claude Opus 4.7 with documented OpenAI models on identical tasks, with identical tools, prompts, and scoring rules. The metric set should include not only accuracy but also error rate and behavior under uncertainty, meaning whether the model abstains when it should ^[3]^[68].
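One way to enforce identical conditions is to pin them in a shared run configuration and check that two runs differ only in the model ID. This is a minimal sketch: `RunConfig`, its fields, and the prompt text are hypothetical, and "gpt-5-mini" is a placeholder for whichever documented OpenAI model ID the vendor docs list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    model: str
    system_prompt: str
    tools: tuple[str, ...]   # () means no tools
    web_access: bool
    max_context_tokens: int


def same_conditions(a: RunConfig, b: RunConfig) -> bool:
    """True when two runs differ in nothing except the model ID."""
    return replace(a, model="") == replace(b, model="")


claude = RunConfig("claude-opus-4-7", "Answer only if you can justify it.", (), False, 100_000)
# Placeholder ID for a documented OpenAI model; confirm it against the vendor docs.
openai_run = replace(claude, model="gpt-5-mini")
skewed = replace(openai_run, web_access=True)

print(same_conditions(claude, openai_run))  # True
print(same_conditions(claude, skewed))      # False
```

A check like this fails fast when one side of the comparison quietly gets browsing or extra tools, before any scores are computed.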

Bottom line

Neither "Claude won" nor "Spud won" can be honestly derived from these sources. The supported conclusion is narrower and more cautious: Claude Opus 4.7 is officially documented; GPT-5.5 Spud is not confirmed in the cited official OpenAI materials; and the best way to assess hallucination control is to reward calibrated uncertainty, including correct abstention where a claim cannot be verified ^[3]^[12]^[16]^[23]^[25]^[29]^[45]^[68].

Key takeaways

Claude Opus 4.7 is officially documented by Anthropic, while GPT-5.5 Spud is not confirmed in the official OpenAI materials provided, so there is no proven winner in a hallucination comparison.
In OpenAI's SimpleQA example, gpt-5-thinking-mini is listed with 52% abstentions, 22% accuracy, and 26% errors, while o4-mini has 1% abstentions, 24% accuracy, and 75% errors ^[3].
For a fair production test, count correct answers, errors, correct abstentions, and incorrect abstentions separately: abstention has its own accuracy, precision, and recall metrics ^[68].

Research conversation

You · Apr 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI · Apr 21, 2026 · 40 sources

Cited answer

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” ^[10]^[5]^[6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. ^[10]^[5]^[6]

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India (reddit.com)
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API (developers.openai.com)
[25] Introducing GPT-5 - OpenAI (openai.com)
At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI (openai.com)
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here - OpenAI (openai.com)
GPT-5: Text & vision, 400K context length, 128K max output tokens, Input $1.25, Output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4.
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being...
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when...
[68] Know Your Limits: A Survey of Abstention in Large Language Models, Bingbing Wen (aclanthology.org)
Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: $\mathrm{ACC} = \frac{N_1 + N_5}{N_1 + N_2 + N_3 + N_4 + N_5}$. Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...

Популярное в «Открыть»

ОтчетыОпубликовано29 апр. 2026Last edited 6 мая 202620 источники

Claude Opus 4.7 против «GPT-5.5 Spud»: что на самом деле говорят данные о галлюцинациях

Искать и проверять факты с Studio Global AI Смотреть больше в «Открыть»

18K0

Короткий вердикт по источникам

Вопрос	Что подтверждается источниками
Claude Opus 4.7 — официальная модель?	Да. Anthropic описывает Claude Opus 4.7, а в анонсе указано, что разработчики могут использовать `claude-opus-4-7` через Claude API ^[12]^[16].
GPT-5.5 Spud — официальная модель OpenAI?	Не по предоставленным официальным источникам. В них есть GPT-5, GPT-5 mini, GPT-5.2-Codex и материалы по GPT-5.4, но не GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45].
Где вообще встречается Spud?	В постах Reddit и теме OpenAI Developer Community с запросом функции, а не в релиз-нотах или API-документации ^[7]^[8]^[10]^[28].
Есть ли тест Claude Opus 4.7 vs GPT-5.5 Spud на галлюцинации?	В предоставленных источниках нет общего теста с одинаковыми задачами и одинаковой схемой оценки; корректный тест должен отдельно учитывать воздержания от ответа и фактические ошибки ^[68].

Что реально известно о Claude Opus 4.7

Что вместо этого говорят источники OpenAI

Почему главный тест — не «кто отвечает смелее», а кто лучше калибрует уверенность

Как честно сравнивать Claude и модели OpenAI по галлюцинациям

Брать только официальные model ID. Для Claude — claude-opus-4-7; для OpenAI — документированную модель вроде GPT-5 или GPT-5 mini, а не неподтверждённый ярлык Spud ^[16]^[23]^[25]^[29].
Собрать смешанный набор заданий. В тесте должны быть вопросы с ответом, недоопределённые запросы и вопросы, на которые нельзя безопасно или фактически ответить. Именно в таких случаях исследования abstention показывают пользу отказа от угадывания ^[1]^[4].
Считать воздержания отдельно. Нужно фиксировать правильные ответы, неправильные ответы, корректные воздержания и ошибочные воздержания. Обзор по abstention выделяет отдельные метрики: abstention accuracy, precision и recall ^[68].
Не смешивать фактическую неопределённость и safety refusal. Отказ от вредного запроса — не то же самое, что признание нехватки данных для фактического ответа; I-CALM фокусируется именно на эпистемическом воздержании для проверяемых фактических вопросов ^[54].
Публиковать точность, долю ошибок и долю воздержаний вместе. Пример OpenAI SimpleQA показывает, что модель с более высокой долей воздержаний может иметь сопоставимую точность, но намного меньшую долю ошибок ^[3].
Держать условия одинаковыми. Retrieval, браузинг, доступ к инструментам, длина контекста и системные инструкции могут менять результат. Если одной модели дать больше внешних данных, тест будет измерять уже не только модель, а всю настройку эксперимента.

FAQ

GPT-5.5 Spud вообще существует?

Можно ли сказать, что Claude Opus 4.7 галлюцинирует меньше, чем GPT-5.5 Spud?

Что сравнивать покупателям и разработчикам?

Итог

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Искать и проверять факты с Studio Global AI

Ключевые выводы

Claude Opus 4.7 официально задокументирован Anthropic, а GPT 5.5 Spud не подтверждён в предоставленных официальных материалах OpenAI, поэтому доказанного победителя в сравнении галлюцинаций нет.
В примере OpenAI SimpleQA модель gpt 5 thinking mini указана с 52% воздержаний, 22% точности и 26% ошибок, тогда как o4 mini — с 1% воздержаний, 24% точности и 75% ошибок [3].
Для честного производственного теста нужно отдельно считать правильные ответы, ошибки, корректные воздержания и ошибочные воздержания: у abstention есть собственные метрики точности, precision и recall [68].

Люди также спрашивают

Каков краткий ответ на вопрос «Claude Opus 4.7 против «GPT-5.5 Spud»: что на самом деле говорят данные о галлюцинациях»?

Какие ключевые моменты необходимо проверить в первую очередь?

Что мне делать дальше на практике?

Какую связанную тему мне следует изучить дальше?

Открыть связанную страницу

С чем мне это сравнить?

Сверьте этот ответ с «Northwest и Southeast timber: почему ответ — «larger; larger»».

Открыть связанную страницу

Продолжайте свое исследование

Claude Opus 4.7, GPT-5.5, DeepSeek V4 и Kimi K2.6: сравнение бенчмарков 2026 года

Claude Opus 4.7 против GPT-5.5, DeepSeek V4 и Kimi K2.6: кто сильнее по бенчмаркам

Northwest и Southeast timber: почему ответ — «larger; larger»

Почему при большей заменяемости Northwest и Southeast timber обе эластичности становятся larger

DeepSeek против OpenAI, Claude, Gemini и Grok: решат цена и доверие, а не рейтинги

Сможет ли DeepSeek обойти OpenAI, Claude, Gemini и Grok?

Инцидент PocketOS: что на самом деле показывает история с Claude, Cursor и удалённой базой

Удаление базы PocketOS: урок не в том, что «ИИ сошёл с ума»

Исследовательский разговор

Вы21 апр. 2026

Вопрос исследования

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI21 апр. 202640 источники

Цитируемый ответ

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and prompt guidance for GPT-5.4 rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]
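The point about accuracy-only evaluations can be illustrated with a toy scoring rule. The sketch below is an assumption for illustration, not OpenAI's scoring method: a correct answer earns 1 point, an abstention earns 0, and a wrong answer costs a penalty chosen here arbitrarily. The per-100-question counts are the SimpleQA figures quoted earlier.

```python
# (correct, wrong) per 100 questions, from the SimpleQA example quoted earlier.
MODELS = {"gpt-5-thinking-mini": (22, 26), "o4-mini": (24, 75)}


def penalized_score(correct, wrong, wrong_penalty):
    """+1 per correct answer, -wrong_penalty per wrong answer, 0 for abstaining."""
    return correct - wrong_penalty * wrong


def best_model(wrong_penalty):
    """Name of the model with the highest score under the given penalty."""
    return max(MODELS, key=lambda name: penalized_score(*MODELS[name], wrong_penalty))


for penalty in (0, 1):  # 0 reproduces accuracy-only scoring; 1 is an illustrative choice
    scores = {name: penalized_score(c, w, penalty) for name, (c, w) in MODELS.items()}
    print(f"penalty={penalty}: {scores} -> best: {best_model(penalty)}")
```

With no penalty the ranking follows raw accuracy and favors o4-mini (24 vs 22). As soon as wrong answers cost as much as right answers earn, the ranking flips (-4 vs -51), which is the incentive problem the source describes.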

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]
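One way to enforce the "same conditions" requirement is to record each run's setup and refuse to compare runs that differ on anything except the model. The sketch below is a hypothetical structure, not part of any vendor SDK; the field names are assumptions, `claude-opus-4-7` is the ID cited from Anthropic, and `gpt-5-mini` is an assumed API identifier for the documented GPT-5 mini model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConditions:
    model_id: str
    system_prompt: str
    tools_enabled: bool
    web_access: bool
    abstention_scoring: str  # free-text description of the scoring rule


def mismatched_conditions(a, b):
    """List the fields, other than model_id, on which two runs differ."""
    return [
        f.name
        for f in fields(a)
        if f.name != "model_id" and getattr(a, f.name) != getattr(b, f.name)
    ]


claude_run = RunConditions(
    "claude-opus-4-7", "Answer or say you do not know.", False, False, "abstain=0, wrong=-1"
)
openai_run = RunConditions(
    "gpt-5-mini", "Answer or say you do not know.", False, True, "abstain=0, wrong=-1"
)
print(mismatched_conditions(claude_run, openai_run))  # ['web_access']
```

An empty list means the two runs differ only in the model, which is the precondition for reading their scores side by side.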

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate


Sources

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India (reddit.com)
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API (developers.openai.com)
[25] Introducing GPT-5 - OpenAI (openai.com)
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI (openai.com)
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here - OpenAI (openai.com)
GPT-5: Text & vision; 400K context length; 128K max output tokens; Input $1.25, Output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4.
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on open-ended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models, Bingbing Wen et al. (aclanthology.org)
Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5). Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...


Short verdict from the sources

Question	What the sources confirm
Is Claude Opus 4.7 an official model?	Yes. Anthropic documents Claude Opus 4.7, and the announcement states that developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Is GPT-5.5 Spud an official OpenAI model?	Not according to the official sources provided. They cover GPT-5, GPT-5 mini, GPT-5.2-Codex, and materials on GPT-5.4, but not GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45].
Where does Spud appear at all?	In Reddit posts and an OpenAI Developer Community feature-request thread, not in release notes or API documentation ^[7]^[8]^[10]^[28].
Is there a Claude Opus 4.7 vs GPT-5.5 Spud hallucination test?	The provided sources contain no shared test with identical tasks and an identical scoring scheme; a sound test must count abstentions and factual errors separately ^[68].

What is actually known about Claude Opus 4.7

What the OpenAI sources say instead

Why the key test is not "who answers more boldly" but who calibrates confidence better

How to compare Claude and OpenAI models on hallucinations fairly

Use only official model IDs. For Claude, that is claude-opus-4-7; for OpenAI, a documented model such as GPT-5 or GPT-5 mini, not the unconfirmed Spud label ^[16]^[23]^[25]^[29].
Build a mixed task set. The test should include answerable questions, underspecified requests, and questions that cannot be answered safely or factually. These are the cases where abstention research shows the benefit of declining to guess ^[1]^[4].
Count abstentions separately. Record correct answers, wrong answers, correct abstentions, and wrong abstentions. The abstention survey defines separate metrics: abstention accuracy, precision, and recall ^[68].
Do not conflate factual uncertainty with safety refusal. Refusing a harmful request is not the same as admitting there is not enough data for a factual answer; I-CALM focuses specifically on epistemic abstention for verifiable factual questions ^[54].
Publish accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a model with a higher abstention rate can have comparable accuracy but a much lower error rate ^[3].
Keep conditions identical. Retrieval, browsing, tool access, context length, and system instructions can all change the result. If one model is given more external data, the test measures the whole experimental setup rather than the model alone.
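The bookkeeping in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Item` record, the four outcome labels, and the precision and recall definitions are simplified choices made here, not the exact formulas from the abstention survey, and the sample data is invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    should_abstain: bool   # True if the question has no safe or verifiable answer
    abstained: bool        # True if the model declined to answer
    correct: bool = False  # Only meaningful when the model answered


def classify(item):
    """Map one graded item to one of four outcomes."""
    if item.abstained:
        return "correct_abstention" if item.should_abstain else "wrong_abstention"
    # Answering a question that should have been declined counts as a wrong answer here.
    if item.should_abstain or not item.correct:
        return "wrong_answer"
    return "correct_answer"


def summarize(items):
    """Report accuracy, error rate, and abstention metrics for a non-empty item list."""
    counts = Counter(classify(i) for i in items)
    total = len(items)
    abstentions = counts["correct_abstention"] + counts["wrong_abstention"]
    should_abstain = sum(i.should_abstain for i in items)
    return {
        "accuracy": counts["correct_answer"] / total,
        "error_rate": counts["wrong_answer"] / total,
        "abstention_rate": abstentions / total,
        "abstention_precision": counts["correct_abstention"] / abstentions if abstentions else None,
        "abstention_recall": counts["correct_abstention"] / should_abstain if should_abstain else None,
    }


# Invented sample: two answerable questions answered, one abstained on,
# and two unanswerable questions (one declined, one answered anyway).
sample = [
    Item(should_abstain=False, abstained=False, correct=True),
    Item(should_abstain=False, abstained=False, correct=False),
    Item(should_abstain=True, abstained=True),
    Item(should_abstain=False, abstained=True),
    Item(should_abstain=True, abstained=False, correct=True),
]
print(summarize(sample))
```

Reporting all five numbers per model keeps a cautious model and a bold model from looking equivalent just because their accuracy is similar.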

FAQ

Does GPT-5.5 Spud exist at all?

Can we say that Claude Opus 4.7 hallucinates less than GPT-5.5 Spud?

What should buyers and developers compare?

Bottom line


Key takeaways

Claude Opus 4.7 is officially documented by Anthropic, while GPT-5.5 Spud is not confirmed in the official OpenAI materials provided, so there is no proven winner in a hallucination comparison.
In OpenAI's SimpleQA example, gpt-5-thinking-mini is listed with 52% abstentions, 22% accuracy, and 26% errors, while o4-mini is listed with 1% abstentions, 24% accuracy, and 75% errors ^[3].
For a fair production test, count correct answers, errors, correct abstentions, and wrong abstentions separately: abstention has its own accuracy, precision, and recall metrics ^[68].

