Report published April 29, 2026 · Last edited May 6, 2026 · 20 sources

Claude Opus 4.7 vs. GPT-5.5 Spud: What Does the Hallucination Evidence Actually Say?

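Claude Opus 4.7 is a verifiable official model; GPT-5.5 Spud is not verified in the official OpenAI material provided for this report, so no evidence supports a win-or-lose hallucination verdict between Claude and Spud [12][16][23][25][26][29][45]. OpenAI's SimpleQA example shows gpt-5-thinking-mini with a 52% abstention rate, 22% accuracy, and a 26% error rate, against o4-mini with a 1% abstention rate, 24% accuracy, and a 75% error rate [3].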


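If you come across a question like "Claude Opus 4.7 vs. GPT-5.5 Spud: which one hallucinates less?", don't rush to a leaderboard. The first question here is not who wins, but whether both model names can be confirmed in official documentation.

The conclusion the evidence supports is narrow but important: Anthropic has publicly released Claude Opus 4.7 and lists the API identifier claude-opus-4-7 in its documentation and announcement ^[12]^[16]; the official OpenAI material provided for this report, however, documents GPT-5, GPT-5 mini, GPT-5.2-Codex, and the GPT-5.4 prompt guidance, with no verifiable public model named GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45]. In other words, nobody can responsibly claim "Claude wins" or "Spud wins" right now.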

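The verifiable conclusions first

Question	Answer the current evidence supports
Has Claude Opus 4.7 been confirmed?	Yes. Anthropic's documentation covers Claude Opus 4.7 and states that developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Has GPT-5.5 Spud been confirmed as an official OpenAI model?	Not in the official OpenAI sources provided for this report. Those sources document GPT-5, GPT-5 mini, GPT-5.2-Codex, and the GPT-5.4 prompt guidance ^[23]^[25]^[26]^[29]^[45].
Where does the name Spud appear?	In Reddit posts and a feature-request thread on the OpenAI Developer Community, not in a release announcement, model card, or API model documentation ^[7]^[8]^[10]^[28].
Is there a hallucination benchmark for Claude Opus 4.7 vs. GPT-5.5 Spud?	No source provides a head-to-head with the same questions, the same environment, and the same scoring rules; a fair test should also score abstention behavior separately from factual errors ^[68].

This does not mean a future or private version of Spud cannot exist; it only means that, on the evidence provided, GPT-5.5 Spud cannot be treated as a verified official OpenAI model, nor used to declare a winner on hallucination control.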

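Claude Opus 4.7: what can the official material prove?

The evidence base for Claude Opus 4.7 is clearer, but note that it is not a cross-vendor hallucination leaderboard. Anthropic says developers can use claude-opus-4-7 via the Claude API ^[16]; its documentation also notes that Claude Opus 4.7 introduces task budgets ^[12].

Task budgets are useful for product control: they concern how much processing or reasoning effort the model may spend on a given task. But that is not a public, calibrated uncertainty benchmark. On its own it cannot show under what conditions the model, faced with an uncertain fact, will admit it does not know, ask for more information, or stop guessing.

A signal more relevant to honesty comes from secondary reporting. Mashable, citing Anthropic's Opus 4.7 system card, reports that Claude Opus 4.7 has a MASK honesty rate of 91.7% and is less likely to hallucinate or behave sycophantically than earlier Anthropic models and other frontier AI models ^[14]. That is useful for assessing honesty, but it still does not answer the Claude-versus-Spud question, because it is not a same-question head-to-head against a verified GPT-5.5 Spud.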

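GPT-5.5 Spud: for now a community lead, not an official test subject

The OpenAI sources provided for this report confirm several GPT-5-family items: GPT-5, GPT-5 mini, GPT-5.2-Codex, and the prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45]. The Spud leads, by contrast, come from Reddit discussions and a feature-request thread on the OpenAI Developer Community ^[7]^[8]^[10]^[28].

Community posts like these can be leads for tracking market rumors or user expectations, but they cannot replace an official model page, an API model ID, a model card, or a formal release announcement. For procurement, development, or governance teams this matters especially: if the model name itself cannot be verified, any hallucination rate, capability comparison, or safety conclusion built on it is unstable.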

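Why "does it abstain?" matters more than accuracy alone

OpenAI's account of the hallucination problem is a useful reference for evaluation design. OpenAI notes that common training and evaluation procedures reward guessing rather than acknowledging uncertainty; when uncertain, a model should express that uncertainty or ask for clarification instead of confidently giving wrong information ^[3].

The SimpleQA example shows why accuracy alone can mislead. In the figures OpenAI lists, gpt-5-thinking-mini has a 52% abstention rate, 22% accuracy, and a 26% error rate; o4-mini has a 1% abstention rate, 24% accuracy, and a 75% error rate ^[3]. The former answers fewer questions, but in this example it is also wrong far less often ^[3]. For teams putting a model into a product workflow, that trade-off often matters more than a model that sounds confident on every question.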
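To make the SimpleQA trade-off concrete, the short Python sketch below recomputes one derived figure from the percentages quoted above: accuracy among the questions each model actually attempted. The metric name `accuracy_when_attempted` is our own label for illustration, not a term taken from OpenAI's write-up.

```python
# Rates quoted in the text from OpenAI's SimpleQA example (percent of all questions).
RATES = {
    "gpt-5-thinking-mini": {"abstain": 52, "correct": 22, "wrong": 26},
    "o4-mini": {"abstain": 1, "correct": 24, "wrong": 75},
}


def accuracy_when_attempted(correct: float, wrong: float) -> float:
    """Share of attempted (non-abstained) questions that were answered correctly."""
    attempted = correct + wrong
    return correct / attempted if attempted else 0.0


for name, r in RATES.items():
    # Sanity check: the three outcomes should cover every question.
    assert r["abstain"] + r["correct"] + r["wrong"] == 100
    attempted_acc = accuracy_when_attempted(r["correct"], r["wrong"])
    print(
        f"{name}: accuracy={r['correct']}%, error={r['wrong']}%, "
        f"accuracy when attempted={attempted_acc:.1%}"
    )
```

Headline accuracy is almost the same for the two models (22% vs. 24%), yet the share of attempted answers that are right differs by roughly a factor of two, which is the point the example is making.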

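What really needs testing is calibrated uncertainty

Hallucination control does not mean having the model refuse everything. A good model should answer when the evidence is sufficient, ask a follow-up when the question is underspecified, and abstain when it lacks adequate grounds. That is calibrated uncertainty: neither so conservative as to be useless nor so bold as to make things up.

Research supports this direction, with caveats. A 2024 study found that, in question-answering settings, abstaining on the basis of uncertainty can improve correctness, hallucination, and safety outcomes ^[1]^[4]. I-CALM focuses on epistemic abstention, that is, choosing to abstain on factual questions with a verifiable answer when the model does not know or the evidence is insufficient, and notes that current large language models can still fail to abstain when they should ^[54]. Work on behaviorally calibrated reinforcement learning also explores how reward mechanisms can encourage models to admit uncertainty and abstain ^[61].

Broader surveys treat uncertainty quantification as an important tool for detecting hallucinations; calibrated uncertainty helps users judge when to trust the model and when to hand off to a human or verify further ^[53]^[55]. The operative word is "calibrated": a model that says "I don't know" too often may be safe but unhelpful; a model that never abstains may be helpful but high-risk.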
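One way to see why the scoring rule shapes abstention behavior is a toy decision rule. The sketch below assumes a hypothetical scheme in which a correct answer earns +1, a wrong answer costs `wrong_penalty`, and abstaining scores 0; these values and function names are illustrative assumptions, not a scheme taken from the cited sources. Under it, answering only pays off once confidence exceeds `wrong_penalty / (1 + wrong_penalty)`.

```python
def expected_score(confidence: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return confidence - (1.0 - confidence) * wrong_penalty


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining (score 0) are equally good."""
    return wrong_penalty / (1.0 + wrong_penalty)


def decide(confidence: float, wrong_penalty: float) -> str:
    """Answer only when the expected score beats the abstention score of 0."""
    return "answer" if expected_score(confidence, wrong_penalty) > 0 else "abstain"


# Accuracy-only scoring (no penalty): guessing is never worse than abstaining.
print(decide(confidence=0.10, wrong_penalty=0.0))
# Penalised scoring: the same low-confidence guess is no longer worth it.
print(decide(confidence=0.10, wrong_penalty=3.0))
print(break_even_confidence(3.0))
```

With no penalty for wrong answers, even a 10%-confidence guess is "rational", which mirrors the incentive problem described above; with a penalty of 3, the break-even confidence rises to 0.75.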

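If you really want to compare Claude and OpenAI models, test like this

Use official model IDs. On the Claude side, test claude-opus-4-7; on the OpenAI side, use a documented model such as GPT-5 or GPT-5 mini rather than the unverified Spud label ^[16]^[23]^[25]^[29].
Build a mixed question set. Include answerable questions, underspecified questions, and unanswerable questions; abstention research is concerned precisely with whether a model can decline to guess under high uncertainty or when it cannot answer safely ^[1]^[4].
Score abstention separately. Tally correct answers, wrong answers, correct abstentions, and incorrect abstentions separately. Abstention research has already defined metrics such as abstention accuracy, abstention precision, and abstention recall ^[68].
Separate factual uncertainty from safety refusals. Refusing harmful content and admitting that a factual answer lacks evidence are not the same behavior; I-CALM focuses on epistemic abstention on factual questions with a verifiable answer ^[54].
Report accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a high abstention rate can accompany similar accuracy with a much lower error rate ^[3].
Fix the test environment. Retrieval, browsing, tool use, context length, and system prompts all affect results. If only one model gets extra material, you are no longer measuring the model alone but the whole setup.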
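The scoring steps above can be sketched as a small Python tally. This is a minimal illustration under stated assumptions: the `Item` fields, the four outcome labels, and the metric definitions are our own simplified reading (abstention precision as the share of abstentions that were warranted, abstention recall as the share of should-abstain items that were abstained on). The survey cited as [68] gives its own formal definitions, which should be checked before reporting numbers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One graded response (hypothetical schema for illustration)."""
    answerable: bool        # False for underspecified or unanswerable questions
    abstained: bool         # True if the model declined or asked for clarification
    correct: bool = False   # only meaningful when the model answered an answerable item


def classify(item: Item) -> str:
    """Map a graded response to one of four outcomes."""
    if item.abstained:
        return "wrong_abstain" if item.answerable else "correct_abstain"
    if item.answerable and item.correct:
        return "correct_answer"
    # Covers wrong answers and any answer given to a question that should be abstained on.
    return "wrong_answer"


def score(items: list[Item]) -> dict:
    counts = Counter(classify(i) for i in items)
    total = len(items)
    abstentions = counts["correct_abstain"] + counts["wrong_abstain"]
    should_abstain = sum(1 for i in items if not i.answerable)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "counts": dict(counts),
        "accuracy_with_abstention": ratio(counts["correct_answer"] + counts["correct_abstain"], total),
        "error_rate": ratio(counts["wrong_answer"], total),
        "abstention_rate": ratio(abstentions, total),
        "abstention_precision": ratio(counts["correct_abstain"], abstentions),
        "abstention_recall": ratio(counts["correct_abstain"], should_abstain),
    }


sample = [
    Item(answerable=True, abstained=False, correct=True),
    Item(answerable=True, abstained=False, correct=False),
    Item(answerable=False, abstained=True),
    Item(answerable=False, abstained=False),
    Item(answerable=True, abstained=True),
]
print(score(sample))
```

Keeping the four outcomes separate means a model cannot look good simply by refusing everything (low abstention precision) or by answering everything (low abstention recall).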

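Frequently asked questions

Is GPT-5.5 Spud real?

On the evidence provided, it is not a model confirmed by official OpenAI documentation. The official OpenAI sources document GPT-5, GPT-5 mini, GPT-5.2-Codex, and the GPT-5.4 prompt guidance; Spud appears in Reddit posts and a feature-request thread on the developer community ^[7]^[8]^[10]^[23]^[25]^[26]^[28]^[29]^[45].

Does Claude Opus 4.7 hallucinate less than GPT-5.5 Spud?

This cannot be answered rigorously from this material. Claude Opus 4.7 has official documentation to check ^[12]^[16], and secondary reporting cites a 91.7% MASK honesty rate ^[14]; but there is currently no verified GPT-5.5 Spud to test, nor a shared same-question benchmark for the two ^[7]^[8]^[10]^[28]^[68].

What should procurement or development teams compare?

Compare Claude Opus 4.7 with a documented OpenAI model on the same tasks, with the same tools, the same prompts, and the same scoring rules. The core metric should not be accuracy alone; error rate and abstention behavior belong alongside it ^[3]^[68].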
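A lightweight way to enforce "same tasks, same tools, same prompts, same scoring rules" is to record each run's settings and check that paired runs differ only in the model. The sketch below is a hypothetical illustration: the `RunConfig` fields, the example prompt, and the `scoring_rules` label are assumptions of ours, and `gpt-5-mini` is an assumed identifier that should be confirmed against OpenAI's model documentation.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of everything that should stay fixed across runs."""
    model_id: str
    system_prompt: str
    tools_enabled: bool
    web_access: bool
    max_context_tokens: int
    scoring_rules: str


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Names of the settings on which two runs disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


def is_fair_pair(a: RunConfig, b: RunConfig) -> bool:
    """A fair pair differs in the model under test and nothing else."""
    return differing_fields(a, b) == ["model_id"]


claude_run = RunConfig(
    model_id="claude-opus-4-7",  # identifier cited in the article
    system_prompt="Answer only when the evidence supports it; otherwise say you do not know.",
    tools_enabled=False,
    web_access=False,
    max_context_tokens=32_000,
    scoring_rules="abstention-aware-v1",  # illustrative label
)
openai_run = replace(claude_run, model_id="gpt-5-mini")  # assumed ID, verify before use
leaky_run = replace(openai_run, web_access=True)

print(is_fair_pair(claude_run, openai_run))
print(differing_fields(claude_run, leaky_run))
```

If one run quietly gains web access, as `leaky_run` does here, the check flags the extra differing field before any scores are compared.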

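Conclusion

Do not draw a "Claude wins" or "Spud wins" hallucination conclusion from the current evidence. What the evidence supports is this: Claude Opus 4.7 has official documentation; GPT-5.5 Spud has not been verified in the cited official OpenAI material; and the better way to evaluate hallucination control is to reward calibrated uncertainty, including correctly abstaining when evidence is lacking ^[3]^[12]^[16]^[23]^[25]^[29]^[45]^[68].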

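Key takeaways

Claude Opus 4.7 is a verifiable official model; GPT-5.5 Spud is not verified in the official OpenAI material provided for this report, so no evidence supports a win-or-lose hallucination verdict between Claude and Spud [12][16][23][25][26][29][45].
OpenAI's SimpleQA example shows gpt-5-thinking-mini with a 52% abstention rate, 22% accuracy, and a 26% error rate, against o4-mini with a 1% abstention rate, 24% accuracy, and a 75% error rate [3].
Benchmarks run before formal adoption should tally correct answers, wrong answers, correct abstentions, and incorrect abstentions separately, because abstention has its own accuracy, precision, and recall metrics [68].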

Research conversation

You · April 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI · April 21, 2026 · 40 sources

Answer with citations

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” ^[10]^[5]^[6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. ^[10]^[5]^[6]

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arXiv:2404.10960). arxiv.org
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI. openai.com
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview. openreview.net
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India. reddit.com
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI. reddit.com
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit. reddit.com
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7. platform.claude.com
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable. mashable.com
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic. anthropic.com
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API. developers.openai.com
[25] Introducing GPT-5 - OpenAI. openai.com
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI. openai.com
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community. community.openai.com
[29] GPT-5 is here - OpenAI. openai.com
GPT-5: Text & vision, 400K context length, 128K max output tokens; Input $1.25, Output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4 | OpenAI API. developers.openai.com
Latest: GPT-5.4. Use original for large, dense, or spatially sensitive images, especially computer use, localization, OCR, and click-accuracy...
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arxiv.org
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arXiv:2604.03904). arxiv.org
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models. arxiv.org
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS. ui.adsabs.harvard.edu
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models (Bingbing Wen). aclanthology.org
• Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5) • Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...





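Key takeaways

Claude Opus 4.7 is a verifiable official model; GPT-5.5 Spud is not verified in the official OpenAI material provided for this report, so no evidence supports a hallucination verdict between Claude and Spud [12][16][23][25][26][29][45].
OpenAI's SimpleQA example shows gpt-5-thinking-mini with a 52% abstention rate, 22% accuracy, and a 26% error rate; o4-mini shows a 1% abstention rate, 24% accuracy, and a 75% error rate [3].
Benchmarks run before production adoption should count correct answers, wrong answers, correct abstentions, and incorrect abstentions separately, because abstention has its own accuracy, precision, and recall metrics [68].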
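
To make the last takeaway concrete, here is a minimal Python sketch of the three abstention metrics. It assumes a five-way outcome split that mirrors the N1 to N5 counts in the survey's accuracy formula [68]. The field names, the mapping of outcomes to counts, and the precision and recall definitions are this sketch's own reading of that survey, not code from any cited source.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for one model on a mixed question set (names are hypothetical)."""
    correct: int          # answerable question, answered correctly
    wrong: int            # answerable question, answered incorrectly
    wrong_abstain: int    # answerable question, but the model abstained
    missed_abstain: int   # unanswerable question, but the model answered anyway
    correct_abstain: int  # unanswerable question, and the model abstained


def abstention_metrics(c: Counts) -> dict[str, float]:
    """Return abstention accuracy, precision, and recall under the assumed definitions."""
    total = c.correct + c.wrong + c.wrong_abstain + c.missed_abstain + c.correct_abstain
    abstained = c.wrong_abstain + c.correct_abstain        # every abstain decision
    should_abstain = c.missed_abstain + c.correct_abstain  # every case that warranted one
    return {
        # Share of all items handled correctly, counting a warranted abstention as correct.
        "accuracy": (c.correct + c.correct_abstain) / total if total else 0.0,
        # Share of abstain decisions that were warranted.
        "precision": c.correct_abstain / abstained if abstained else 0.0,
        # Share of warranted abstentions the model actually made.
        "recall": c.correct_abstain / should_abstain if should_abstain else 0.0,
    }
```

With made-up counts such as `Counts(correct=50, wrong=10, wrong_abstain=5, missed_abstain=15, correct_abstain=20)`, the function reports 0.70 accuracy, 0.80 precision, and about 0.57 recall. A model can therefore look cautious (high precision) while still guessing on many questions it should have declined (low recall).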



Research conversation

You, April 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI, April 21, 2026, 40 sources

Answer with citations

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]
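
The SimpleQA figures quoted in the key findings can be unpacked with a little arithmetic. The Python sketch below uses only the three published rates per model. The derived quantities (share of questions attempted, accuracy when the model does answer, and wrong answers per correct answer) are this sketch's own calculations and are not figures reported by OpenAI.

```python
# Published SimpleQA rates quoted above, as a percentage of all questions.
RATES = {
    "gpt-5-thinking-mini": {"abstain": 52, "correct": 22, "wrong": 26},
    "o4-mini": {"abstain": 1, "correct": 24, "wrong": 75},
}


def summarize(rates: dict[str, int]) -> dict[str, float]:
    """Derive attempt-conditioned figures from abstain/correct/wrong percentages."""
    attempted = rates["correct"] + rates["wrong"]  # questions the model chose to answer
    return {
        "attempted_pct": float(attempted),
        "accuracy_when_attempting": rates["correct"] / attempted,
        "wrong_per_correct": rates["wrong"] / rates["correct"],
    }


if __name__ == "__main__":
    for model, rates in RATES.items():
        s = summarize(rates)
        print(
            f"{model}: attempted {s['attempted_pct']:.0f}%, "
            f"accuracy when attempting {s['accuracy_when_attempting']:.1%}, "
            f"{s['wrong_per_correct']:.2f} wrong answers per correct answer"
        )
```

Under these numbers, gpt-5-thinking-mini is right on about 46% of the questions it attempts, versus about 24% for o4-mini, and it produces roughly 1.2 wrong answers per correct one instead of roughly 3.1. That is the sense in which near-identical top-line accuracy can hide very different hallucination behavior.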

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]
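
The comparability requirement in the last point can be sketched as a small harness that applies one frozen configuration and one grading rule to every model. Everything here is hypothetical: `RunConfig`, `Item`, the abstention phrases, and the `ask` callable stand in for whatever client code a team actually uses, and the substring-based grader is deliberately crude. No vendor API is called.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical phrases treated as an abstention; a real study needs a stricter rubric.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not enough information")


@dataclass(frozen=True)
class RunConfig:
    """Settings held constant across every model under test."""
    system_prompt: str
    web_access: bool
    tools: tuple[str, ...]


@dataclass(frozen=True)
class Item:
    question: str
    answer: Optional[str]  # None marks a question with no verifiable answer


def grade(item: Item, reply: str) -> str:
    """Label one reply, keeping abstentions separate from factual errors."""
    text = reply.strip().lower()
    abstained = any(marker in text for marker in ABSTAIN_MARKERS)
    if item.answer is None:
        return "correct_abstain" if abstained else "missed_abstain"
    if abstained:
        return "wrong_abstain"
    return "correct" if item.answer.lower() in text else "wrong"


def run(ask: Callable[[RunConfig, str], str], cfg: RunConfig, items: list[Item]) -> Counter:
    """Send every item through `ask` with the same config and tally the labels."""
    return Counter(grade(item, ask(cfg, item.question)) for item in items)


if __name__ == "__main__":
    cfg = RunConfig("Answer briefly. Say 'I don't know' if unsure.", web_access=False, tools=())
    items = [Item("What is the capital of France?", "Paris"),
             Item("What did the reader eat for breakfast today?", None)]

    # Stub "models" that only illustrate the two behaviors being separated.
    def always_guess(cfg: RunConfig, question: str) -> str:
        return "Paris"

    def cautious(cfg: RunConfig, question: str) -> str:
        return "Paris" if "France" in question else "I don't know."

    print(run(always_guess, cfg, items))
    print(run(cautious, cfg, items))
```

Passing the same `RunConfig` and the same `grade` function to each model's `ask` wrapper is what keeps the comparison about the models rather than about differing prompts, tool access, or scoring rules.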

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.
