
Claude Opus 4.7 vs GPT-5.5 Spud: What the Available Evidence Actually Says

Claude Opus 4.7 is backed by official Anthropic documentation; GPT-5.5 Spud is not verified in the official OpenAI material provided, so there is no evidence to support a Claude-versus-Spud verdict on hallucinations [12][16][23][25][26][29][45]. OpenAI's SimpleQA example shows the trade-off: gpt-5-thinking-mini is listed at 52% abstention, 22% accuracy, and 26% error, while o4-mini shows 1% abstention, 24% accuracy, and 75% error [3].


The conclusion up front: this looks like a question about whether Claude or Spud hallucinates less, but the first evidential step is to ask whether both names can even be verified.

Anthropic's documentation lists Claude Opus 4.7 together with the API identifier claude-opus-4-7; by contrast, the verifiable official OpenAI material here covers only GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 prompt guidance, and shows no public model named GPT-5.5 Spud [12][16][23][25][26][29][45]. So the responsible answer is not "Claude wins" or "Spud wins" but rather: Claude Opus 4.7 can be evaluated, while GPT-5.5 Spud should not yet be treated as a verified OpenAI model name fit for benchmarking unless an official release, model page, API documentation, or equivalent evidence backs it.
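
One practical way to apply that verification step, as a minimal sketch: list the model IDs each provider actually serves and check whether a candidate name appears. This assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are set and the official anthropic and openai Python SDKs are installed; claude-opus-4-7 is the ID cited above, while gpt-5.5-spud is a purely hypothetical stand-in for the unverified label.

```python
# Minimal verification sketch, assuming ANTHROPIC_API_KEY and OPENAI_API_KEY
# are set and the official anthropic / openai Python SDKs are installed.
# It lists the model IDs each provider currently serves to this account and
# checks whether a candidate name appears. "gpt-5.5-spud" is a purely
# hypothetical stand-in for the unverified label; availability of any ID
# depends on your account and the provider's current lineup.

import anthropic
import openai

def served_model_ids() -> set[str]:
    """Collect model IDs advertised by both providers' models endpoints."""
    ids = {m.id for m in anthropic.Anthropic().models.list()}
    ids |= {m.id for m in openai.OpenAI().models.list()}
    return ids

if __name__ == "__main__":
    available = served_model_ids()
    for candidate in ("claude-opus-4-7", "gpt-5.5-spud"):
        status = "served" if candidate in available else "not served / unverified"
        print(f"{candidate}: {status}")
```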

Judgments the evidence does support

Question: Is Claude Opus 4.7 verified?
Evidence-supported answer: Yes. Anthropic's documentation lists Claude Opus 4.7, and the announcement states that developers can access claude-opus-4-7 through the Claude API [12][16].

Question: Is GPT-5.5 Spud verified as an official OpenAI model?
Evidence-supported answer: Not in the official OpenAI sources provided for this check. The relevant official material documents GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 prompt guidance [23][25][26][29][45].

Question: Where does the name Spud appear in the material?
Evidence-supported answer: Mainly in Reddit posts and feature-request threads on the OpenAI Developer Community, not in formal release notes or API model documentation [7][8][10][28].

Question: Is there a hallucination benchmark pitting Claude Opus 4.7 against GPT-5.5 Spud?
Evidence-supported answer: The sources here provide no head-to-head comparison on the same questions with the same scoring; a fair test should score abstention behavior and factual errors separately [68].

None of this means a Spud-related model could never appear in the future or in a private testing environment. It only means that, on the cited material, there is not enough evidence to treat GPT-5.5 Spud as an official OpenAI model, let alone to declare a winner on hallucination control.

How far does the evidence for Claude Opus 4.7 actually go?

Anthropic's strongest evidence is product and developer documentation, not a cross-vendor hallucination leaderboard. The Anthropic announcement states that developers can use claude-opus-4-7 through the Claude API [16]; the documentation also says Claude Opus 4.7 introduces task budgets, i.e. controls on how much effort a task may consume [12].

Task budgets, however, are a product control, not a public, calibrated benchmark of uncertainty. In other words, they can shape how many resources the model spends on a task, but on their own they do not show how reliably the model admits "I don't know" when it lacks sufficient grounds for a factual claim.

More directly relevant to honesty is Mashable's report on Anthropic's Opus 4.7 system card: Claude Opus 4.7 posts a 91.7% MASK honesty rate and reportedly shows less hallucination and user-flattering behavior than earlier Anthropic models and other frontier AI models [14]. That is useful for assessing honesty, but it still cannot settle Claude versus Spud, because it is not a same-arena, same-scoring comparison against a verified GPT-5.5 Spud model.

What do the OpenAI sources actually verify?

The official OpenAI material provided here verifies GPT-5-family information such as GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 prompt guidance [23][25][26][29][45]. Spud, within this set of sources, comes mainly from Reddit posts and a feature-request thread on the OpenAI Developer Community [7][8][10][28].

Community posts can be early signals of rumors or user demand, but they are not the same as an official model page, model card, API identifier, or release announcement. For developers, procurement teams, or anyone running a PoC, that distinction matters: an unverified name should not go straight into a model comparison table.

OpenAI's own explanation of hallucinations is more useful for deciding how to design an evaluation. OpenAI notes that common training and evaluation procedures reward models for guessing rather than for acknowledging uncertainty, and says a model should express uncertainty or ask for clarification instead of confidently supplying incorrect information [3].
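
To make that incentive concrete, here is a minimal sketch of the expected score of guessing versus abstaining under two grading schemes. The 25% lucky-guess probability and the -1 penalty are illustrative assumptions, not figures from the cited sources.

```python
# Minimal sketch of the incentive problem: under accuracy-only grading,
# guessing dominates abstaining; penalizing wrong answers flips that.
# The 25% lucky-guess probability and the -1 penalty are illustrative
# assumptions, not figures from the cited sources.

def expected_scores(p_correct: float, correct: float, wrong: float, abstain: float) -> tuple[float, float]:
    """Return (expected score if the model guesses, score if it abstains)."""
    guess = p_correct * correct + (1 - p_correct) * wrong
    return guess, abstain

# Accuracy-only grading: correct = 1, wrong = 0, abstain = 0.
print(expected_scores(0.25, correct=1, wrong=0, abstain=0))   # (0.25, 0) -> guessing is rewarded
# Error-penalizing grading: correct = 1, wrong = -1, abstain = 0.
print(expected_scores(0.25, correct=1, wrong=-1, abstain=0))  # (-0.5, 0) -> abstaining is rewarded
```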

OpenAI's SimpleQA example also shows how misleading raw accuracy can be. It lists gpt-5-thinking-mini at 52% abstention, 22% accuracy, and 26% error, while o4-mini shows 1% abstention, 24% accuracy, and 75% error [3]. The former answers fewer questions but, in this example, makes far fewer mistakes [3]. For high-stakes use cases or products that need trustworthy output, that willingness not to guess often matters more than answering every question with confidence.
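
A minimal sketch of the same arithmetic: treating the cited percentages as counts over 100 illustrative questions, the error rate among attempted answers makes the gap even clearer. The summarize() helper is hypothetical, written only to restate the trade-off.

```python
# Restating the cited SimpleQA percentages as counts over 100 illustrative
# questions. The summarize() helper is hypothetical; it only recomputes the
# published rates plus the error rate among attempted answers.

def summarize(correct: int, wrong: int, abstained: int) -> dict[str, float]:
    total = correct + wrong + abstained
    attempted = correct + wrong
    return {
        "accuracy": correct / total,                # correct share of all questions
        "error_rate": wrong / total,                # wrong share of all questions
        "abstention_rate": abstained / total,       # declined share of all questions
        "error_when_attempted": wrong / attempted,  # how often a given answer is wrong
    }

print(summarize(correct=22, wrong=26, abstained=52))  # gpt-5-thinking-mini style: ~54% of its answers are wrong
print(summarize(correct=24, wrong=75, abstained=1))   # o4-mini style: ~76% of its answers are wrong
```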

What really needs comparing is calibrated uncertainty

An AI hallucination, put simply, is output that sounds convincing but is wrong, unsupported, or goes beyond the evidence. Controlling hallucinations does not mean making the model refuse everything. A useful model should do three things: answer when the evidence is sufficient, ask a follow-up when the question is ambiguous, and abstain when no answer can be supported. That is what calibrated uncertainty means in practice.
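
One way to operationalize those three behaviors in an evaluation, as a rough sketch: label each test item with the behavior it calls for and grade the model against that label. The label names and the grade() helper are hypothetical, introduced only for illustration.

```python
# Rough sketch: label each eval item with the behavior it calls for and grade
# the model against that label. The label names and grade() are hypothetical,
# introduced only to make "calibrated uncertainty" concrete as a rubric.

DESIRED_BEHAVIOR = {
    "answerable": "answer",     # evidence suffices -> give the answer
    "ambiguous": "clarify",     # question underspecified -> ask a follow-up
    "unanswerable": "abstain",  # no supportable answer -> decline
}

def grade(item_label: str, model_behavior: str) -> bool:
    """True when the model did what the item calls for."""
    return DESIRED_BEHAVIOR[item_label] == model_behavior

print(grade("unanswerable", "answer"))  # False: guessed instead of abstaining
print(grade("ambiguous", "clarify"))    # True: asked instead of guessing
```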

The research direction supports this framing, with caveats. A 2024 study reports that uncertainty-based abstention can improve correctness, hallucination, and safety results in question-answering settings [1][4]. I-CALM defines epistemic abstention as abstaining on factual questions that have verifiable answers, and notes that current LLMs still often fail to abstain when they should [54]. Work on behaviorally calibrated reinforcement learning likewise explores how to encourage models to admit that they do not know, and to abstain, when uncertain [61].

Broader surveys treat uncertainty quantification as a hallucination-detection tool and describe calibrated uncertainty as helping users decide when to trust an answer, hand it off to a human, or verify it further [53][55]. The key word is "calibrated": a model that constantly says it does not know may be safe but useless, while a model that never abstains may look helpful yet carry more risk.

If you genuinely want a fair Claude-versus-OpenAI comparison, how should it be run?

  1. Use official model IDs. On the Claude side, test claude-opus-4-7; on the OpenAI side, use documented models such as GPT-5 or GPT-5 mini, not the unverified Spud label [16][23][25][29].
  2. Mix the question set. Do not include only questions with clear answers; cover answerable questions, questions with insufficient information, ambiguous questions, and unanswerable questions. Abstention research is precisely about whether models decline to guess when uncertainty is high or no safe answer exists [1][4].
  3. Score abstentions separately. Count correct answers, wrong answers, correct abstentions, and wrong abstentions (see the scoring sketch after this list). Abstention surveys list dedicated metrics such as abstention accuracy, precision, and recall [68].
  4. Separate factual uncertainty from safety refusals. Declining harmful content and signaling that a factual answer lacks evidence are different behaviors; I-CALM focuses on epistemic abstention over factual questions with verifiable answers [54].
  5. Report accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a high-abstention model can reach similar accuracy with a much lower error rate [3].
  6. Keep the test environment identical. Retrieval, browsing, tool use, context length, system prompts, and prompt wording all affect results. If one model gets extra material and the other does not, you are measuring the whole setup, not just the model.
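
A minimal scoring sketch for item 3, assuming each item is labeled as answerable or unanswerable and each model response is either an attempted answer (graded correct or wrong) or an abstention. The Result fields and the score() helper are hypothetical illustrations, not an established benchmark harness.

```python
# Minimal scoring sketch for item 3, assuming each item is labeled answerable
# or unanswerable and each response is either an attempted answer (graded
# correct/wrong) or an abstention. Result and score() are hypothetical
# illustrations, not an established benchmark harness.

from dataclasses import dataclass

@dataclass
class Result:
    answerable: bool       # does the item have a supportable gold answer?
    abstained: bool        # did the model decline to answer?
    correct: bool = False  # only meaningful when the model answered

def score(results: list[Result]) -> dict[str, float]:
    correct_answers = sum(r.answerable and not r.abstained and r.correct for r in results)
    wrong_answers = sum(not r.abstained and not r.correct for r in results)
    correct_abstentions = sum(r.abstained and not r.answerable for r in results)
    wrong_abstentions = sum(r.abstained and r.answerable for r in results)
    abstentions = correct_abstentions + wrong_abstentions
    should_abstain = sum(not r.answerable for r in results)
    return {
        "correct_answers": correct_answers,
        "wrong_answers": wrong_answers,
        "correct_abstentions": correct_abstentions,
        "wrong_abstentions": wrong_abstentions,
        # Of the abstentions it made, how many were on genuinely unanswerable items?
        "abstention_precision": correct_abstentions / abstentions if abstentions else 0.0,
        # Of the items where abstaining was the right call, how many did it catch?
        "abstention_recall": correct_abstentions / should_abstain if should_abstain else 0.0,
    }
```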

FAQ

Does GPT-5.5 Spud actually exist?

On the evidence provided here, it cannot be confirmed as an official OpenAI model. The official OpenAI sources document GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 prompt guidance; Spud appears in Reddit posts and community feature-request discussions [7][8][10][23][25][26][28][29][45].

Does Claude Opus 4.7 hallucinate less than GPT-5.5 Spud?

The available sources cannot answer that rigorously. Claude Opus 4.7 has official documentation [12][16], and a secondary report cites a 91.7% MASK honesty rate [14]; but there is no verified GPT-5.5 Spud target here, and no shared benchmark covering both [7][8][10][28][68].

What should buyers or developers compare instead?

Compare Claude Opus 4.7 against officially documented OpenAI models under the same tasks, tools, prompts, and scoring rules. Do not judge on accuracy alone; track error rate and abstention behavior as well [3][68].

Bottom line

Do not use the available evidence to declare a "Claude wins" or "Spud wins" verdict on hallucinations. Only three conclusions hold up: Claude Opus 4.7 has official documentation; GPT-5.5 Spud is not verified in the cited official OpenAI material; and when evaluating hallucination control, the behavior most worth rewarding is calibrated uncertainty, including correctly abstaining when an answer cannot be supported [3][12][16][23][25][29][45][68].

Research conversation

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI · 40 sources

Cited answer

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” [10][5][6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. [10][5][6]

Key findings

  • Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. [10]

  • Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. [10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. [10][4]

  • OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. [15]

  • In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. [15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. [15]

  • The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. [8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. [9][7][3]

  • Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. [4][3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. [4][15]

  • For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. [10][15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. [10][15]

Evidence notes

  • The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” [10][5][6]

  • The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. [15]

  • The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. [8][9][7][3]

Limitations / uncertainty

  • Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” [5][6]

  • Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. [10][15]

  • Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. [15][8][4]

Summary

  • Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. [10][5][6]

  • Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. [15][8][9][10]

  • Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” [10][5][6]

Sources

  • [3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
  • [4] A comprehensive taxonomy of hallucinations in large language models
  • [5] OpenAI API docs: GPT-5 mini Model
  • [6] OpenAI API docs: Prompt guidance for GPT-5.4
  • [7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
  • [8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
  • [9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
  • [10] Anthropic docs: What’s new in Claude Opus 4.7
  • [15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources