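Report published April 29, 2026. Last edited May 6, 2026. 20 sources.

Claude Opus 4.7 vs. GPT-5.5 Spud: What a Hallucination Comparison Can Actually Tell You

Claude Opus 4.7 can be confirmed in official Anthropic materials, but GPT-5.5 Spud cannot be confirmed in the official OpenAI materials provided. In OpenAI's SimpleQA example, gpt-5-thinking-mini is reported at 52% abstention, 22% correct, and 26% incorrect, while o4-mini is reported at 1% abstention, 24% correct, and 75% incorrect [3].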


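Which model is less prone to hallucination, meaning false information that looks factual: Claude Opus 4.7 or GPT-5.5 Spud? At first glance this looks like a model-versus-model ranking question. The first thing to check, however, is the model names.

Anthropic documents Claude Opus 4.7 and the API identifier claude-opus-4-7. By contrast, what the provided official OpenAI materials confirm is GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4, not a public model named GPT-5.5 Spud ^[12]^[16]^[23]^[25]^[26]^[29]^[45]. So what can be stated at this point is not a winner. Claude Opus 4.7 can be evaluated, but GPT-5.5 Spud should not be treated as a benchmark target unless it is tied to an official release, model card, or API documentation.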

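The bottom line first: no winner can be declared yet

Question	Evidence-based answer
Is Claude Opus 4.7 confirmed?	Yes. Anthropic documents Claude Opus 4.7 and states that developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Is GPT-5.5 Spud confirmed as an official OpenAI model?	Not in the official OpenAI materials provided. Those cover GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45].
Where does the name Spud appear?	In Reddit posts and an OpenAI Developer Community feature-request thread, not in release notes or API model documentation ^[7]^[8]^[10]^[28].
Is there a Claude vs. Spud hallucination benchmark?	The provided materials contain no direct comparison on the same tasks with the same scoring. A fair evaluation should score not only wrong answers but also abstentions, that is, decisions not to answer, as a separate category ^[68].

This does not prove that a future or unreleased Spud model cannot exist. It means only that the currently cited materials do not support treating GPT-5.5 Spud as an official OpenAI model, and that there is no basis for declaring a winner against Claude on hallucination rates.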

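What can be confirmed about Claude Opus 4.7

The strongest evidence for Claude Opus 4.7 is Anthropic's product documentation, not a side-by-side hallucination ranking. Anthropic states that developers can use claude-opus-4-7 via the Claude API ^[16], and its documentation says that Claude Opus 4.7 introduces task budgets ^[12].

Task budgets matter for controlling how much work the model does. On their own, however, they are not a public benchmark of when the model holds back uncertain factual claims, in other words how well calibrated its uncertainty is.

There is one notable data point on honesty. Citing Anthropic's Opus 4.7 system card, Mashable reported that Claude Opus 4.7 has a MASK honesty rate of 91.7% and shows less hallucination and sycophancy than earlier Anthropic models and other frontier AI models ^[14]. That is useful information, but it is not a same-conditions comparison against a verified GPT-5.5 Spud model, so it does not answer the Claude vs. Spud question.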

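What can be confirmed from the OpenAI materials

What the provided OpenAI materials confirm is GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45]. Within this source set, the name Spud appears in Reddit posts and an OpenAI Developer Community feature-request thread ^[7]^[8]^[10]^[28].

Community posts are a clue to rumors and to where interest lies. They cannot carry the same weight as an official model page, model card, API identifier, or release announcement.

OpenAI's explanation of hallucinations, on the other hand, is a useful reference for evaluation design. OpenAI explains that standard training and evaluation procedures reward guessing over acknowledging uncertainty, which pushes models toward being confidently wrong. It also states that expressing uncertainty or asking a clarifying question is better than giving confident misinformation ^[3].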
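
The incentive problem can be made concrete with a small expected-score calculation. The sketch below assumes a hypothetical scoring rule that is not taken from the OpenAI material: +1 for a correct answer, 0 for abstaining, and a deduction of `wrong_penalty` for a wrong answer. The function names are illustrative as well.

```python
def expected_score_if_answering(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the model is right with probability p_correct.

    Assumed rule (illustrative only): +1 if correct, -wrong_penalty if wrong.
    """
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when the expected score beats abstaining, which scores 0."""
    return expected_score_if_answering(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence above which answering beats abstaining: p > k / (1 + k)."""
    return wrong_penalty / (1.0 + wrong_penalty)


# Accuracy-only scoring (no penalty): even a 10% guess beats abstaining.
print(should_answer(0.10, wrong_penalty=0.0))
# With a penalty of 3 points per wrong answer, the model needs more than 75% confidence.
print(break_even_confidence(3.0))
```

Under this assumed rule, a zero penalty makes guessing the better strategy at any nonzero confidence, which is the behavior the passage above describes. Raising the penalty moves the break-even point up and makes abstention the rational choice for low-confidence questions.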

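OpenAI's SimpleQA example shows that accuracy alone can be misleading. There, gpt-5-thinking-mini is reported at 52% abstention, 22% correct, and 26% incorrect, whereas o4-mini is reported at 1% abstention, 24% correct, and 75% incorrect ^[3]. The former answers less often but makes far fewer errors ^[3]. In settings where a wrong answer is costly, such as medicine, law, finance, and enterprise decision support, this difference matters more than the accuracy figure alone.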
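
A short calculation shows what the SimpleQA figures quoted above imply once abstentions are set aside. The percentages come from the cited OpenAI example; the derived "accuracy when answering" figure and the class name are additions made here for illustration, not numbers reported by OpenAI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleQAResult:
    model: str
    abstained_pct: int
    correct_pct: int
    wrong_pct: int

    def accuracy_when_answering(self) -> float:
        """Share of attempted answers that were correct (abstentions excluded)."""
        attempted = self.correct_pct + self.wrong_pct
        return self.correct_pct / attempted


results = [
    SimpleQAResult("gpt-5-thinking-mini", abstained_pct=52, correct_pct=22, wrong_pct=26),
    SimpleQAResult("o4-mini", abstained_pct=1, correct_pct=24, wrong_pct=75),
]

for r in results:
    # The three reported categories should cover every question.
    assert r.abstained_pct + r.correct_pct + r.wrong_pct == 100
    print(
        f"{r.model}: answered {100 - r.abstained_pct}% of questions, "
        f"{r.accuracy_when_answering():.1%} of those answers were correct"
    )
```

On these figures, roughly 46% of gpt-5-thinking-mini's attempted answers were correct, against roughly 24% for o4-mini, even though the headline accuracy numbers (22% and 24%) look almost identical.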

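The metric that really matters is calibrated uncertainty

Controlling hallucination does not mean refusing everything. A usable AI has to answer when the evidence is sufficient, ask for clarification when the question is ambiguous, and hold back when it has no basis for an answer. That balance is calibrated uncertainty.

Research supports this view, although it is no cure-all. A 2024 study reports that uncertainty-based abstention improves correctness, hallucination, and safety in question answering ^[1]^[4]. I-CALM addresses epistemic abstention on factual questions with verifiable answers and points out that current LLMs can fail to abstain even when they should ^[54]. Work on behaviorally calibrated reinforcement learning likewise examines ways to encourage models to admit uncertainty and abstain ^[61].

Broader reviews also treat uncertainty quantification as a tool for hallucination detection and explain that calibrated uncertainty helps users decide whether to trust a model's answer, hand it off to a person, or verify it further ^[53]^[55].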
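
One common way to put a number on calibration is expected calibration error (ECE), which compares a model's stated confidence with how often it is actually right. The cited sources are not quoted here as prescribing this metric; the sketch below is an illustrative assumption, and it presumes each answer comes with a confidence score between 0 and 1, which not every API exposes.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy across equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a confidence bin; 1.0 goes into the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_confidence - accuracy)
    return ece


# Made-up example: the model claims 90% confidence but is right 3 times out of 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A value near zero means stated confidence tracks observed accuracy. A large value signals over- or under-confidence, which is the situation where a user cannot rely on the model's own signal to decide between trusting, escalating, and verifying.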

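More abstention is not automatically better, though. A model that only ever says "I don't know" is safe but useless, and a model that never abstains looks convenient but is dangerous. What matters is answering when an answer is warranted and stopping when it is not.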

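How to measure this in practice

Use official model IDs. On the Claude side, use claude-opus-4-7; on the OpenAI side, use documented models such as GPT-5 or GPT-5 mini rather than the unverified Spud label ^[16]^[23]^[25]^[29].
Mix the question set. Include answerable questions, requests with missing conditions, and questions that cannot be answered. The value of holding back when uncertainty is high or when no safe answer is possible is a central theme of abstention research ^[1]^[4].
Score abstention separately. Record correct answers, wrong answers, correct abstentions, and wrong abstentions as distinct outcomes. A survey of abstention catalogs metrics such as abstention accuracy, precision, and recall ^[68].
Separate factual uncertainty from safety refusals. Refusing a harmful request is not the same as holding back a factual answer for lack of evidence. I-CALM deals with epistemic abstention on verifiable factual questions ^[54].
Report accuracy, error rate, and abstention rate together. As OpenAI's SimpleQA example shows, a model with a high abstention rate can have a much lower error rate at a similar accuracy ^[3].
Hold the environment constant. If search, RAG, browsing, tool access, context length, or system instructions differ, the results will reflect differences in the evaluation environment and not just the model itself.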

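FAQ

Does GPT-5.5 Spud actually exist?

At least within this source set, it cannot be confirmed as an official OpenAI model. The cited official OpenAI materials confirm GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4, and Spud appears only in Reddit posts and a community feature-request thread ^[7]^[8]^[10]^[23]^[25]^[26]^[28]^[29]^[45].

Does Claude Opus 4.7 hallucinate less than GPT-5.5 Spud?

These materials alone do not allow a rigorous answer. Claude Opus 4.7 is officially documented ^[12]^[16], and there is secondary reporting of a 91.7% MASK honesty rate ^[14]. But there is no verified GPT-5.5 Spud to compare against, and no benchmark scoring both under the same conditions has been presented ^[7]^[8]^[10]^[28]^[68].

What should companies and developers compare?

Compare Claude Opus 4.7 with documented OpenAI models such as GPT-5 or GPT-5 mini, using the same tasks, the same tools, the same prompts, and the same scoring criteria. The metrics to watch are not accuracy alone but a set that also includes error rate and the quality of abstentions ^[3]^[68].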
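
One way to enforce the "same tools, same prompts, same scoring" requirement is to check run configurations against each other before any scores are compared. The sketch below is a minimal illustration: the `EvalConfig` fields and values are hypothetical, and `gpt-5-mini` is an assumed identifier that should be checked against the vendor's own model documentation.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    model_id: str
    system_prompt: str
    tools_enabled: bool
    web_access: bool
    max_context_tokens: int
    scoring_rubric: str


def mismatched_settings(a: EvalConfig, b: EvalConfig) -> list[str]:
    """Return the names of settings, other than model_id, that differ between two runs."""
    settings_a, settings_b = asdict(a), asdict(b)
    return [
        name
        for name in settings_a
        if name != "model_id" and settings_a[name] != settings_b[name]
    ]


claude_cfg = EvalConfig(
    model_id="claude-opus-4-7",
    system_prompt="Answer only when the evidence supports it.",
    tools_enabled=False,
    web_access=False,
    max_context_tokens=32_000,
    scoring_rubric="four-outcome-v1",  # hypothetical rubric name
)
# Assumed identifier for illustration; confirm it in the official model docs.
openai_cfg = replace(claude_cfg, model_id="gpt-5-mini")
drifted_cfg = replace(openai_cfg, web_access=True)

print(mismatched_settings(claude_cfg, openai_cfg))   # []
print(mismatched_settings(claude_cfg, drifted_cfg))  # ['web_access']
```

If the returned list is not empty, any difference in scores may reflect the evaluation environment and not the models themselves, so the run should be repeated under matched settings.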

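In closing

This evidence does not support saying that Claude won or that Spud won. The conclusion it does support is narrower. Claude Opus 4.7 is officially documented. GPT-5.5 Spud is not confirmed in the cited official OpenAI materials. And any evaluation of hallucination control needs calibrated uncertainty as an evaluation axis, including correct abstentions that avoid unsupported claims ^[3]^[12]^[16]^[23]^[25]^[29]^[45]^[68].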


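Key points

Claude Opus 4.7 can be confirmed in official Anthropic materials, but GPT-5.5 Spud cannot be confirmed in the official OpenAI materials provided.
In OpenAI's SimpleQA example, gpt-5-thinking-mini is reported at 52% abstention, 22% correct, and 26% incorrect, while o4-mini is reported at 1% abstention, 24% correct, and 75% incorrect [3].
For production comparisons, correct answers, wrong answers, correct abstentions, and wrong abstentions need to be evaluated separately [68].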


Research conversation

You, April 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI, April 21, 2026, 40 sources

Cited answer

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” ^[10]^[5]^[6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. ^[10]^[5]^[6]

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India (reddit.com)
GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI India.
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API (developers.openai.com)
[25] Introducing GPT-5 - OpenAI (openai.com)
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI (openai.com)
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here - OpenAI (openai.com)
GPT-5 Text & vision 400K context length 128K max output tokens Input $1.25 Output $10.00 per 1M tokens
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4.
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models, Bingbing Wen (aclanthology.org)
• Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5) • Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...

トレンドを発見する

レポート公開済み2026年4月29日Last edited 2026年5月6日20 ソース

Claude Opus 4.7対GPT-5.5 Spud：幻覚比較で本当に分かること

Studio Global AIで検索して事実確認 Discover からさらに閲覧する

18K0

まず結論：勝者はまだ決められない

論点	根拠に基づく答え
Claude Opus 4.7は確認済みか	はい。AnthropicはClaude Opus 4.7を文書化し、開発者がClaude APIで`claude-opus-4-7`を使えると説明しています ^[12]^[16]。
GPT-5.5 SpudはOpenAIの公式モデルとして確認済みか	提供されたOpenAI公式資料では確認できません。そこにあるのはGPT-5、GPT-5 mini、GPT-5.2-Codex、GPT-5.4のプロンプトガイダンスです ^[23]^[25]^[26]^[29]^[45]。
Spudという名前はどこに出てくるか	Reddit投稿とOpenAI Developer Communityの機能要望スレッドに出てきますが、リリースノートやAPIのモデル文書ではありません ^[7]^[8]^[10]^[28]。
Claude対Spudの幻覚ベンチマークはあるか	提供された資料には、同一タスク・同一採点の直接比較はありません。公平な評価では、誤答だけでなく棄権、つまり答えない判断も別に採点すべきです ^[68]。

Claude Opus 4.7について確認できること

OpenAIの資料で確認できること

本当に見るべき指標は較正された不確実性

実務で比較するならこう測る

公式モデルIDを使う。 Claude側はclaude-opus-4-7を使い、OpenAI側は未確認のSpudラベルではなく、GPT-5やGPT-5 miniなど文書化されたモデルを使います ^[16]^[23]^[25]^[29]。
問題セットを混ぜる。 答えられる質問、条件が足りない依頼、答えようがない質問を入れます。不確実性が高いときや安全に答えられないときに控える価値は、棄権研究の中心的な論点です ^[1]^[4]。
棄権を別に採点する。 正答、誤答、正しい棄権、誤った棄権を分けて記録します。棄権に関する調査では、棄権精度、適合率、再現率などの指標が整理されています ^[68]。
事実の不確実性と安全上の拒否を分ける。 有害な依頼を拒むことと、根拠不足のため事実回答を控えることは同じではありません。I-CALMが扱うのは、検証可能な事実質問に対するエピステミックな棄権です ^[54]。
正答率、誤答率、棄権率を一緒に出す。 OpenAIのSimpleQA例が示すように、棄権率が高いモデルは正答率が同程度でも誤答率が大きく下がることがあります ^[3]。
環境条件をそろえる。 検索、RAG、ブラウジング、ツールアクセス、コンテキスト長、システム指示が違えば、結果はモデル本体だけでなく評価環境の差を反映してしまいます。

FAQ

GPT-5.5 Spudは実在するのか

Claude Opus 4.7はGPT-5.5 Spudよりハルシネーションが少ないのか

企業や開発者は何を比べればいいのか

最後に

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AIで検索して事実確認

重要なポイント

Claude Opus 4.7はAnthropic公式資料で確認できるが、GPT 5.5 Spudは提供されたOpenAI公式資料では確認できない。
OpenAIのSimpleQA例では、gpt 5 thinking miniは棄権52%、正答22%、誤答26%、o4 miniは棄権1%、正答24%、誤答75%とされる [3]。
本番導入の比較では、正答・誤答・正しい棄権・誤った棄権を分けて評価する必要がある [68]。

人々も尋ねます

「Claude Opus 4.7対GPT-5.5 Spud：幻覚比較で本当に分かること」の短い答えは何ですか?

Claude Opus 4.7はAnthropic公式資料で確認できるが、GPT 5.5 Spudは提供されたOpenAI公式資料では確認できない。

最初に検証する重要なポイントは何ですか?

次の実践では何をすればいいでしょうか？

本番導入の比較では、正答・誤答・正しい棄権・誤った棄権を分けて評価する必要がある [68]。

次にどの関連トピックを検討すればよいでしょうか?

別の角度からの引用や追加の引用については、「Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論」に進みます。

関連ページを開く

これを何と比較すればいいでしょうか？

この回答を「DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点」と照合してください。

関連ページを開く

研究を続けてください

Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6：2026年ベンチマーク比較

DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点

Northwest Timber と Southeast Timber：なぜ答えは「larger; larger」なのか

Northwest Timber と Southeast Timber：代替性が高いと、なぜ「larger; larger」になるのか

DeepSeekはOpenAI、Claude、Gemini、Grokに勝てるのか　本当の焦点はコストと信頼

DeepSeekはOpenAI、Claude、Gemini、Grokに勝てるのか

研究に関する会話

あなた2026年4月21日

リサーチクエスチョン

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI2026年4月21日40 ソース

引用された回答

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

情報源

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinationsarxiv.org
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAIopenai.com
Why language models hallucinate OpenAI. Why language models hallucinate. Our new research paper⁠(opens in a new window) argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReviewopenreview.net
Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations. Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_Indiareddit.com
Skip to main contentGPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI India. Open menu Open navigation[]( to Reddit Home. Get App Get the Reddit app Log InLog in to Reddit. [ Go to AI India](
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AIreddit.com
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Redditreddit.com
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7platform.claude.com
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashablemashable.com
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company"), and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Skip to main contentSkip to footer. . Developers can use claude-opus-4-7 via the Claude API. ![Image 3: logo](
[23] GPT-5 mini Model | OpenAI APIdevelopers.openai.com
Search the API docs. Realtime API. Model optimization. Specialized models. Legacy APIs. Using Codex. + Building frontend UIs with Codex and Figma. API. How Perplexity Brought Voice Search to Millions Using the Realtime API. Building frontend UIs with Codex...
[25] Introducing GPT-5 - OpenAIopenai.com
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAIopenai.com
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community. Skip to main content. Topics. Announcements. [API]( "Questions, feedback, and best practices around building with OpenAI’s API. [Promptin...
[29] GPT-5 is here - OpenAI (openai.com)
Try it in ChatGPT(opens in a new window)Read the research. Start building(opens in a new window)Read the API Platform blog. [ GPT-5 Text & vision 400K context length 128K max output tokens Input $1.25 Output $10.00 per 1M tokens Learn more(opens in a new wi...
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4. Using tools. Skills. Shell. Computer use. + File search. + Overview. + Reasoning models. + Using realtime models. Use original for large, dense, or spatially sensitive images, especially [computer use, localization, OCR, and click-accuracy...
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models. Bingbing Wen et al. (aclanthology.org)
• Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5) • Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...

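Bottom line first: no winner can be declared yet

Question	Evidence-based answer
Is Claude Opus 4.7 confirmed?	Yes. Anthropic documents Claude Opus 4.7 and states that developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Is GPT-5.5 Spud confirmed as an official OpenAI model?	Not in the official OpenAI sources provided. Those sources cover GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45].
Where does the name Spud appear?	In Reddit posts and in a feature-request thread on the OpenAI Developer Community, not in release notes or API model documentation ^[7]^[8]^[10]^[28].
Is there a Claude vs. Spud hallucination benchmark?	The sources provided contain no direct comparison on the same tasks with the same scoring. A fair evaluation should score abstentions, meaning decisions not to answer, separately from wrong answers ^[68].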

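How to measure when comparing in practice

Use official model IDs. On the Claude side, use claude-opus-4-7; on the OpenAI side, use documented models such as GPT-5 or GPT-5 mini rather than the unverified Spud label ^[16]^[23]^[25]^[29].
Mix the question set. Include answerable questions, underspecified requests, and questions that cannot be answered. The value of holding back when uncertainty is high or when no safe answer is possible is a central theme of abstention research ^[1]^[4].
Score abstentions separately. Record correct answers, wrong answers, correct abstentions, and wrong abstentions as distinct outcomes. A survey of abstention catalogs metrics such as abstention accuracy, precision, and recall ^[68].
Separate factual uncertainty from safety refusals. Refusing a harmful request is not the same as declining to give a factual answer for lack of evidence. I-CALM addresses epistemic abstention on verifiable factual questions ^[54].
Report accuracy, error rate, and abstention rate together. As OpenAI's SimpleQA example shows, a model with a high abstention rate can have a far lower error rate at similar accuracy ^[3].
Hold the environment constant. If search, RAG, browsing, tool access, context length, or system instructions differ, the results reflect the evaluation environment as well as the model itself.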
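
The scoring steps above can be sketched in Python. This is a minimal illustration, not a method taken from any cited source: the `Item` fields, the four bucket names, and the rule that an abstention counts as correct only on a question the test author marked unanswerable are all assumptions made for this example. The precision and recall definitions follow the common reading of those terms and may differ in detail from the survey's ^[68].

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One graded response (hypothetical schema for this sketch)."""
    answerable: bool        # test author's label: does a verifiable answer exist?
    abstained: bool         # did the model decline to answer?
    correct: bool = False   # only meaningful when the model answered


def classify(item: Item) -> str:
    """Map one graded response to one of four outcome buckets."""
    if item.abstained:
        # Assumption: abstaining is "correct" only on unanswerable questions.
        return "wrong_abstention" if item.answerable else "correct_abstention"
    # Any answer to an unanswerable question is counted as an error here.
    return "correct" if (item.answerable and item.correct) else "error"


def score(items: list[Item]) -> dict[str, float]:
    """Report accuracy, error rate, and abstention metrics side by side."""
    counts = Counter(classify(i) for i in items)
    n = len(items)
    abstained = counts["correct_abstention"] + counts["wrong_abstention"]
    should_abstain = sum(1 for i in items if not i.answerable)
    return {
        "accuracy": counts["correct"] / n,
        "error_rate": counts["error"] / n,
        "abstention_rate": abstained / n,
        "abstention_precision": (
            counts["correct_abstention"] / abstained if abstained else 0.0
        ),
        "abstention_recall": (
            counts["correct_abstention"] / should_abstain if should_abstain else 0.0
        ),
    }


# Toy data: three answerable questions and two unanswerable ones.
sample = [
    Item(answerable=True, abstained=False, correct=True),
    Item(answerable=True, abstained=False, correct=False),
    Item(answerable=True, abstained=True),
    Item(answerable=False, abstained=True),
    Item(answerable=False, abstained=False),
]
report = score(sample)
print(report)
```

For the five toy items, accuracy is 0.2, error rate is 0.4, and abstention rate is 0.4, with abstention precision and recall both 0.5. Running the same scorer over identical items and identical settings for each model keeps the comparison on one footing; a team that treats abstaining on a hard but answerable question as acceptable would need to change the `classify` rule accordingly.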

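Key points

Claude Opus 4.7 can be confirmed in official Anthropic materials, but GPT-5.5 Spud cannot be confirmed in the official OpenAI materials provided.
In OpenAI's SimpleQA example, gpt-5-thinking-mini shows 52% abstention, 22% correct, and 26% wrong, while o4-mini shows 1% abstention, 24% correct, and 75% wrong [3].
For production comparisons, correct answers, wrong answers, correct abstentions, and wrong abstentions need to be evaluated separately [68].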

Research conversation

You, April 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI, April 21, 2026, 40 sources

Cited answer

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]
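
One way to read the SimpleQA figures quoted above is to ask how often each model is right when it does attempt an answer. The snippet below derives that from the three reported shares. The derived figure is not stated in the retrieved sources, and the function name is hypothetical.

```python
# Shares of all questions, as quoted from OpenAI's SimpleQA example.
results: dict[str, dict[str, float]] = {
    "gpt-5-thinking-mini": {"abstained": 0.52, "correct": 0.22, "wrong": 0.26},
    "o4-mini": {"abstained": 0.01, "correct": 0.24, "wrong": 0.75},
}


def accuracy_when_answering(shares: dict[str, float]) -> float:
    """Correct answers as a fraction of attempted (non-abstained) answers."""
    attempted = shares["correct"] + shares["wrong"]
    return shares["correct"] / attempted


for name, shares in results.items():
    print(f"{name}: {accuracy_when_answering(shares):.1%} correct when answering")
```

On these numbers the result is roughly 46% for gpt-5-thinking-mini and 24% for o4-mini, which illustrates how abstaining more can reduce hallucination even though top-line accuracy is slightly lower.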

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.
