Reports · Published 29 April 2026 · Last edited 6 May 2026 · 20 sources

Claude Opus 4.7 and GPT-5.5 Spud: What Does the Evidence Say About Hallucination?

Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources provided contain no model page or API identifier named GPT-5.5 Spud ^[12]^[16]^[23]^[25]^[26]^[29]^[45]. OpenAI's SimpleQA example shows why accuracy alone is not enough: gpt-5-thinking-mini recorded 52% abstention, 22% accuracy, and 26% error, versus o4-mini with 1% abstention, 24%...

Asking who wins on hallucination between Claude Opus 4.7 and GPT-5.5 Spud looks, on the surface, like a model-ranking question. But reading the sources leads to a more cautious result: we have an officially documented model, Claude Opus 4.7, and we have the name Spud, which appears in community threads and alleged leaks, not in release documentation or an official OpenAI model page among the sources provided.

Put more simply: a verdict such as "Claude hallucinates less than Spud", or the reverse, cannot be built before confirming that both sides exist as official products that can be tested in the same way.

The evidence-backed summary

Question	Source-backed answer
Is Claude Opus 4.7 documented?	Yes. Anthropic documents Claude Opus 4.7 and states that developers can use the API identifier `claude-opus-4-7` ^[12]^[16].
Is GPT-5.5 Spud documented as an official OpenAI model?	Not in the official OpenAI sources provided here. Those sources document GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 guidance, not a public model named GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45].
Where does the name Spud appear in this set of sources?	In Reddit posts and a feature-request thread in the OpenAI developer community, not in release notes or official API documentation ^[7]^[8]^[10]^[28].
Is there a direct hallucination benchmark between Claude Opus 4.7 and GPT-5.5 Spud?	No. No source here offers a shared test with the same tasks and the same scoring scheme, and any fair test must separate factual errors from abstention behavior ^[68].

This does not mean the name Spud cannot appear in the future, or that an internal model by that name definitely does not exist. The more precise reading is that the current evidence is not enough to treat it as an official OpenAI model or to declare a hallucination winner.

What do we know about Claude Opus 4.7?

The strongest evidence for Claude Opus 4.7 comes from Anthropic itself. The company says developers can use `claude-opus-4-7` via the Claude API ^[16], and its documentation notes that Claude Opus 4.7 introduces task budgets, that is, budgets or limits allocated to tasks ^[12].

This feature matters to anyone building a product on top of the model, because it relates to controlling how a task is carried out. But on its own it is not a hallucination metric. The existence of task-control settings does not automatically tell us when the model will say "I don't know," or when it will avoid fabricating an unconfirmed fact.

There is one honesty-related signal, but it does not settle the comparison with Spud. Mashable reported, citing Anthropic's Opus 4.7 system card, that Claude Opus 4.7 achieved a MASK honesty rate of 91.7% and that it is less prone to hallucination or sycophancy than previous Anthropic models and some other frontier AI models ^[14]. This is useful information about Claude, but it does not amount to a direct test against a documented OpenAI model named GPT-5.5 Spud.

What do the OpenAI sources say instead?

The official OpenAI sources provided establish other names within the GPT-5 family: GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance tied to GPT-5.4 ^[23]^[25]^[26]^[29]^[45]. Spud, by contrast, appears in Reddit posts and a thread in the OpenAI Developer Community ^[7]^[8]^[10]^[28].

The difference matters here. A post in a developer community or on Reddit may signal discussion, expectations, or rumors, but it is not a model card, an API identifier, or an official release announcement. It should therefore not be used as one side of a performance benchmark as though it were a documented product.

More important than the name Spud is that OpenAI itself has published a direct explanation of the hallucination problem. The company says common training and evaluation methods reward guessing over admitting uncertainty, and holds that it is better for a model to state its uncertainty or ask for clarification than to give wrong information confidently ^[3].

OpenAI's SimpleQA example illustrates the idea well: gpt-5-thinking-mini shows 52% abstention, 22% accuracy, and 26% error, while o4-mini shows 1% abstention, 24% accuracy, and 75% error ^[3]. On paper, accuracy is close. But the gap in error rate is large, because the first model chooses silence or caution more often when it lacks sufficient confidence ^[3].
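
To make the arithmetic concrete, here is a minimal sketch that takes the three percentages quoted above and derives one extra figure: accuracy among the questions each model actually answered. That derived metric and the names `results` and `attempted_accuracy` are illustrative assumptions, not something OpenAI reports in this form.

```python
# Percentages quoted above from OpenAI's SimpleQA example (share of all questions).
results = {
    "gpt-5-thinking-mini": {"abstained": 52, "correct": 22, "wrong": 26},
    "o4-mini": {"abstained": 1, "correct": 24, "wrong": 75},
}


def attempted_accuracy(row: dict) -> float:
    """Share of correct answers among the questions the model chose to answer.

    This is a derived, illustrative metric, not one taken from the source.
    """
    attempted = row["correct"] + row["wrong"]
    return row["correct"] / attempted


for name, row in results.items():
    total = row["abstained"] + row["correct"] + row["wrong"]
    print(
        f"{name}: total={total}%, error={row['wrong']}%, "
        f"accuracy when answering={attempted_accuracy(row):.1%}"
    )
```

Under these figures, gpt-5-thinking-mini is right on about 45.8% of the questions it attempts (22 of 48), against about 24.2% for o4-mini (24 of 99), which is the gap that the near-identical headline accuracy hides.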

Why is abstaining not always a weakness?

In everyday use, we tend to prefer a model that answers quickly and confidently. But in fields such as research, medicine, law, financial analysis, and high-stakes technical support, a confident wrong answer can be worse than a reply that says: I do not have enough evidence.

This is the core of calibrated uncertainty. A good model neither refuses everything nor answers everything. It answers when the evidence is sufficient, asks clarifying questions when the request is ambiguous, and abstains when the answer cannot be supported.

Research supports this direction, with some caveats. A 2024 study indicates that uncertainty-based abstention improves the overall correctness of answers, reduces hallucination, and increases safety in question-answering settings ^[1]^[4]. The I-CALM work focuses on epistemic abstention on factual questions with verifiable answers, and observes that current large language models often fail to abstain when they should ^[54]. A study of behaviorally calibrated reinforcement learning examines how to encourage models to admit uncertainty by abstaining ^[61].

Broader reviews treat uncertainty quantification as a tool for detecting hallucination, and describe calibrated uncertainty as useful for deciding when to trust a model's answer and when to defer the decision or verify externally ^[53]^[55]. But the essential condition is that abstention be calibrated: a model that says "I don't know" all the time may be safe but is not useful, and a model that never says it may look useful but is dangerous.
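
One way to picture calibration is as a break-even rule. The sketch below assumes a hypothetical scoring scheme (+1 for a correct answer, a fixed penalty for a wrong one, 0 for abstaining); the scheme and the function names are illustrative assumptions rather than a method taken from the cited sources.

```python
def should_answer(confidence: float, wrong_penalty: float) -> bool:
    """Answer only when answering has a higher expected score than abstaining.

    Assumed scoring (hypothetical): +1 if correct, -wrong_penalty if wrong,
    0 for abstaining. The expected score of answering is therefore
    confidence * 1 - (1 - confidence) * wrong_penalty.
    """
    expected_score = confidence - (1.0 - confidence) * wrong_penalty
    return expected_score > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence level above which answering has a positive expected score."""
    return wrong_penalty / (1.0 + wrong_penalty)


for penalty in (0.0, 1.0, 3.0):
    print(f"penalty={penalty}: answer above {break_even_confidence(penalty):.2f}")
```

With a penalty of 0, which is how accuracy-only scoring behaves, any nonzero confidence justifies a guess. With a penalty of 3, answering only pays off above 75% confidence, so `should_answer(0.6, 3.0)` is false and `should_answer(0.9, 3.0)` is true.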

What does a genuinely fair comparison look like?

If a technical team or a buyer of an AI service wants to test Claude against OpenAI on hallucination, the fair route is not to use an undocumented name. It is better to build a clear test along these lines:

Use official model identifiers. For Claude, `claude-opus-4-7` can be tested. For OpenAI, choose a documented model such as GPT-5 or GPT-5 mini rather than the Spud label, which is not established in the official sources provided ^[16]^[23]^[25]^[29].
Build a mixed test set. It should include answerable questions, underspecified requests, and questions that cannot be answered from the available information. Abstention research specifically studies the value of refusing or stopping when uncertainty is high or when a safe answer cannot be given ^[1]^[4].
Score abstention separately, not as an automatic error. Count correct answers, wrong answers, correct abstentions, and incorrect abstentions. The abstention survey defines metrics such as abstention accuracy, abstention precision, and abstention recall ^[68].
Separate factual uncertainty from safety refusals. Refusing to provide harmful instructions is not the same behavior as the model saying it lacks sufficient evidence for a particular fact. I-CALM focuses specifically on epistemic abstention on factual questions with verifiable answers ^[54].
Report accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a model that abstains more can achieve similar accuracy while making far fewer errors ^[3].
Fix the test environment. Web access, search tools, context size, system instructions, and retrieval method can all change the result. If you give one model better sources than the other, you are testing the setup, not the model alone.
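
The scoring steps above can be sketched as a small tally. This is a simplified illustration: the `Item` fields, the `summarize` function, and the precision and recall formulas used here are assumptions chosen for clarity, not the exact definitions from the abstention survey ^[68].

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    answerable: bool       # can the question be answered from the available information?
    abstained: bool        # did the model decline to answer?
    correct: bool = False  # only meaningful when the model answered


def summarize(items: list[Item]) -> dict[str, float]:
    """Tally the four outcomes and report accuracy, error, and abstention metrics."""
    counts = Counter()
    for it in items:
        if it.abstained:
            # Abstaining is right only when the question was not answerable.
            counts["wrong_abstain" if it.answerable else "right_abstain"] += 1
        else:
            # Answering an unanswerable question counts as wrong unless marked correct.
            counts["correct" if it.correct else "wrong"] += 1
    n = len(items)
    abstained = counts["right_abstain"] + counts["wrong_abstain"]
    unanswerable = sum(1 for it in items if not it.answerable)
    return {
        "accuracy": counts["correct"] / n,
        "error_rate": counts["wrong"] / n,
        "abstention_rate": abstained / n,
        "abstention_precision": counts["right_abstain"] / abstained if abstained else 0.0,
        "abstention_recall": counts["right_abstain"] / unanswerable if unanswerable else 0.0,
    }


# Hypothetical graded outputs, for illustration only.
items = [
    Item(answerable=True, abstained=False, correct=True),
    Item(answerable=True, abstained=False, correct=False),
    Item(answerable=True, abstained=True),
    Item(answerable=False, abstained=True),
    Item(answerable=False, abstained=False),
]
report = summarize(items)
print(report)
```

Keeping safety refusals out of `items` altogether, or tagging them in a separate field, is one way to respect the distinction between epistemic abstention and refusal that the list draws.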

Quick questions

Is GPT-5.5 Spud real?

Not as an official OpenAI model within the evidence provided here. The official sources cited document GPT-5, GPT-5 mini, GPT-5.2-Codex, and GPT-5.4 guidance, while Spud appears on Reddit and in a feature-request thread in the developer community ^[7]^[8]^[10]^[23]^[25]^[26]^[28]^[29]^[45].

Does Claude Opus 4.7 hallucinate less than GPT-5.5 Spud?

That cannot be determined from these sources. Claude Opus 4.7 is officially documented ^[12]^[16], and there is a secondary report of a MASK honesty rate of 91.7% ^[14]. But there is no documented target named GPT-5.5 Spud and no shared benchmark between the two names ^[7]^[8]^[10]^[28]^[68].

What is the best comparison for developers and buyers?

Compare Claude Opus 4.7 with documented OpenAI models, on the same tasks, with the same tools, the same system instructions, and the same scoring rules. Do not stop at accuracy; combine it with error rate and abstention behavior ^[3]^[68].
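
A hedged sketch of what "same tasks, same tools, same instructions" can look like in practice: one frozen configuration reused for every model, with only the model identifier changed. `claude-opus-4-7` is the identifier cited above; `gpt-5-mini`, the field names, and all values are hypothetical placeholders to be checked against each provider's documentation.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    model_id: str
    system_prompt: str
    web_access: bool
    tools: tuple[str, ...]
    max_context_tokens: int
    scoring_rules: str


base = RunConfig(
    model_id="claude-opus-4-7",  # identifier cited in the article
    system_prompt="Answer only when the evidence supports it; otherwise say you do not know.",
    web_access=False,
    tools=(),
    max_context_tokens=32_000,            # hypothetical shared cap
    scoring_rules="abstention-aware-v1",  # hypothetical label for the scoring scheme
)

# Hypothetical OpenAI identifier; verify it on the provider's model page before use.
other = replace(base, model_id="gpt-5-mini")


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of the settings that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if da[key] != db[key]]


print(differing_fields(base, other))
```

If `differing_fields` returns anything other than the model identifier, the two runs are no longer testing the models under the same conditions.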

The bottom line

The evidence provided contains no reliable result saying that Claude won or that Spud won on hallucination control. The supported result is narrower and more important: Claude Opus 4.7 is officially documented; GPT-5.5 Spud is not documented in the official OpenAI sources provided; and the best hallucination test should reward calibrated uncertainty, including correct abstention when a claim cannot be supported by evidence ^[3]^[12]^[16]^[23]^[25]^[29]^[45]^[68].

Key takeaways

Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources provided contain no model page or API identifier named GPT-5.5 Spud ^[12]^[16]^[23]^[25]^[26]^[29]^[45].
OpenAI's SimpleQA example shows why accuracy alone is not enough: gpt-5-thinking-mini recorded 52% abstention, 22% accuracy, and 26% error, versus o4-mini with 1% abstention, 24% accuracy, and 75% error ^[3].
A serious hallucination comparison should measure correct answers, errors, correct abstentions, and incorrect abstentions, because abstention has its own metrics such as precision and recall ^[68].

Research conversation

You · 21 April 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI · 21 April 2026 · 40 sources

Cited answer

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” ^[10]^[5]^[6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. ^[10]^[5]^[6]

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India (reddit.com)
GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI India.
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API (developers.openai.com)
[25] Introducing GPT-5 - OpenAI (openai.com)
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI (openai.com)
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here - OpenAI (openai.com)
GPT-5. Text & vision. 400K context length. 128K max output tokens. Input $1.25, Output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4. Use original for large, dense, or spatially sensitive images, especially computer use, localization, OCR, and click-accuracy...
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models. Bingbing Wen (aclanthology.org)
• Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5) • Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...

الأكثر رواجًا في الاكتشاف

التقاريرمنشور29 أبريل 2026Last edited 6 مايو 202620 المصادر

Claude Opus 4.7 وGPT-5.5 Spud: ماذا تقول الأدلة عن الهلوسة؟

ابحث وتحقق من الحقائق مع Studio Global AI تصفّح المزيد من الاكتشاف

18K0

الخلاصة المدعومة بالأدلة

السؤال	الجواب المدعوم بالمصادر
هل Claude Opus 4.7 موثّق؟	نعم. Anthropic توثّق Claude Opus 4.7 وتذكر أن المطورين يمكنهم استخدام معرف API: `claude-opus-4-7` ^[12]^[16].
هل GPT-5.5 Spud موثّق كنموذج رسمي من OpenAI؟	ليس في مصادر OpenAI الرسمية المقدمة هنا. هذه المصادر توثّق GPT-5 وGPT-5 mini وGPT-5.2-Codex وإرشادات GPT-5.4، لا نموذجًا عامًا باسم GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45].
أين يظهر اسم Spud في هذه المجموعة من المصادر؟	يظهر في منشورات Reddit وخيط طلب ميزة في مجتمع مطوري OpenAI، وليس في ملاحظات إصدار أو وثائق API رسمية ^[7]^[8]^[10]^[28].
هل يوجد معيار هلوسة مباشر بين Claude Opus 4.7 وGPT-5.5 Spud؟	لا. لا يقدم أي مصدر هنا اختبارًا مشتركًا بالمهام نفسها ونظام التقييم نفسه، وأي اختبار عادل يجب أن يفصل بين الأخطاء الواقعية وسلوك الامتناع عن الإجابة ^[68].

ما الذي نعرفه عن Claude Opus 4.7؟

ماذا تقول مصادر OpenAI بدلًا من ذلك؟

لماذا الامتناع عن الإجابة ليس ضعفًا دائمًا؟

كيف تبدو مقارنة عادلة فعلًا؟

استخدم معرفات نماذج رسمية. في حالة Claude، يمكن اختبار claude-opus-4-7. وفي حالة OpenAI، يجب اختيار نموذج موثّق مثل GPT-5 أو GPT-5 mini بدل تسمية Spud غير المثبتة في المصادر الرسمية المقدمة ^[16]^[23]^[25]^[29].
ابنِ مجموعة اختبار مختلطة. يجب أن تتضمن أسئلة قابلة للإجابة، وطلبات ناقصة التفاصيل، وأسئلة لا يمكن جوابها من المعلومات المتاحة. أبحاث الامتناع تدرس تحديدًا قيمة الرفض أو التوقف عندما يكون عدم اليقين عاليًا أو عندما لا يمكن تقديم جواب آمن ^[1]^[4].
قيّم الامتناع وحده، لا كخطأ تلقائي. احسب الإجابات الصحيحة، والإجابات الخاطئة، والامتناع الصحيح، والامتناع الخاطئ. مسح أبحاث الامتناع يعرّف مقاييس مثل دقة الامتناع، ودقة قرارات الامتناع، واسترجاع حالات الامتناع الصحيحة ^[68].
افصل بين عدم اليقين الواقعي والرفض لأسباب السلامة. رفض تقديم تعليمات ضارة ليس السلوك نفسه كقول النموذج إنه لا يملك دليلًا كافيًا على واقعة معينة. I-CALM يركز تحديدًا على الامتناع المعرفي في الأسئلة الواقعية ذات الإجابات القابلة للتحقق ^[54].
اعرض الدقة، ومعدل الخطأ، ومعدل الامتناع معًا. مثال SimpleQA من OpenAI يبين أن نموذجًا يمتنع أكثر قد يحقق دقة قريبة لكنه يخطئ أقل بكثير ^[3].
ثبّت بيئة الاختبار. الوصول إلى الويب، وأدوات البحث، وحجم السياق، وتعليمات النظام، وطريقة الاسترجاع كلها قد تغير النتيجة. إذا أعطيت نموذجًا مصادر أفضل من الآخر فأنت تختبر الإعداد، لا النموذج وحده.

أسئلة سريعة

هل GPT-5.5 Spud حقيقي؟

هل Claude Opus 4.7 يهلوس أقل من GPT-5.5 Spud؟

ما المقارنة الأفضل للمطورين والمشترين؟

الزبدة

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

ابحث وتحقق من الحقائق مع Studio Global AI

الوجبات السريعة الرئيسية

Claude Opus 4.7 موثّق رسميًا لدى Anthropic، بينما لا توجد في المصادر الرسمية المقدمة من OpenAI صفحة نموذج أو معرف API باسم GPT 5.5 Spud [12][16][23][25][26][29][45].
مثال SimpleQA من OpenAI يوضح لماذا لا تكفي الدقة وحدها: gpt 5 thinking mini سُجل له 52% امتناعًا، و22% دقة، و26% خطأ، مقابل o4 mini مع 1% امتناع، و24% دقة، و75% خطأ [3].
المقارنة الجادة في الهلوسة يجب أن تقيس الإجابات الصحيحة، والأخطاء، والامتناع الصحيح، والامتناع الخاطئ؛ لأن للامتناع مقاييسه الخاصة مثل الدقة والاسترجاع [68].

يسأل الناس أيضا

ما هي الإجابة المختصرة على "Claude Opus 4.7 وGPT-5.5 Spud: ماذا تقول الأدلة عن الهلوسة؟"؟

ما هي النقاط الأساسية التي يجب التحقق منها أولاً؟

ماذا يجب أن أفعل بعد ذلك في الممارسة العملية؟

ما هو الموضوع ذو الصلة الذي يجب أن أستكشفه بعد ذلك؟

تابع مع "Claude Opus 4.7 ضد GPT-5.5 وDeepSeek V4 وKimi K2.6: من يتصدر اختبارات أداء 2026؟" لزاوية أخرى واستشهادات إضافية.

فتح الصفحة ذات الصلة

ما الذي يجب أن أقارنه بهذا؟

تحقق من هذه الإجابة مقابل "DeepSeek V4: ما وراء مليون توكن وMoE وترحيل الـAPI".

فتح الصفحة ذات الصلة

مواصلة البحث الخاص بك

Claude Opus 4.7 ضد GPT-5.5 وDeepSeek V4 وKimi K2.6: من يتصدر اختبارات أداء 2026؟

Claude Opus 4.7 مقابل GPT-5.5 وDeepSeek V4 وKimi K2.6: قراءة دقيقة لبنشماركات 2026

DeepSeek V4: ما وراء مليون توكن وMoE وترحيل الـAPI

DeepSeek V4: قراءة هندسية في سياق 1M وMoE وخطوات ترحيل API

أخشاب Northwest وSoutheast: لماذا تكون الإجابة larger; larger؟

هل يهزم DeepSeek عمالقة الذكاء الاصطناعي؟ الكلفة والثقة أهم من جدول الترتيب

هل يستطيع DeepSeek هزيمة OpenAI وClaude وGemini وGrok؟

محادثة بحثية

أنت21 أبريل 2026

سؤال بحثي

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI21 أبريل 202640 المصادر

إجابة مستشهد بها

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] [2404.10960] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate | OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations | OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI : r/AI_India (reddit.com)
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming : r/OpenAI - Reddit (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations | Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model | OpenAI API (developers.openai.com)
[25] Introducing GPT-5 - OpenAI (openai.com)
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex - OpenAI (openai.com)
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release - ChatGPT / Feature requests - OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here - OpenAI (openai.com)
GPT-5: text and vision, 400K context length, 128K max output tokens, input $1.25 and output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4 | OpenAI API (developers.openai.com)
Latest: GPT-5.4. Use original for large, dense, or spatially sensitive images, especially computer use, localization, OCR, and click-accuracy...
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on open-ended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] [2604.03904] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning - ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models, Bingbing Wen et al. (aclanthology.org)
• Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5) • Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...


Evidence-backed summary

| Question | Source-backed answer |
| --- | --- |
| Is Claude Opus 4.7 documented? | Yes. Anthropic documents Claude Opus 4.7 and states that developers can use the API identifier `claude-opus-4-7` ^[12]^[16]. |
| Is GPT-5.5 Spud documented as an official OpenAI model? | Not in the official OpenAI sources provided here. Those sources document GPT-5, GPT-5 mini, GPT-5.2-Codex, and guidance for GPT-5.4, not a public model named GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45]. |
| Where does the name Spud appear in this set of sources? | In Reddit posts and a feature-request thread on the OpenAI developer community, not in release notes or official API documentation ^[7]^[8]^[10]^[28]. |
| Is there a direct hallucination benchmark between Claude Opus 4.7 and GPT-5.5 Spud? | No. No source here provides a shared test with the same tasks and the same scoring scheme, and any fair test must separate factual errors from abstention behavior ^[68]. |

What do we know about Claude Opus 4.7?

What do OpenAI's sources say instead?

Why is abstaining not always a weakness?

What does a genuinely fair comparison look like?

Use official model identifiers. For Claude, you can test claude-opus-4-7. For OpenAI, choose a documented model such as GPT-5 or GPT-5 mini rather than the Spud label, which is not established in the official sources provided ^[16]^[23]^[25]^[29].
Build a mixed test set. It should include answerable questions, underspecified requests, and questions that cannot be answered from the available information. Abstention research specifically studies the value of refusing or stopping when uncertainty is high or when no safe answer can be given ^[1]^[4].
Score abstention on its own, not as an automatic error. Count correct answers, wrong answers, correct abstentions, and incorrect abstentions. The abstention survey defines metrics such as abstention accuracy, abstention precision, and abstention recall ^[68].
Separate factual uncertainty from safety refusals. Refusing to provide harmful instructions is not the same behavior as a model saying it lacks sufficient evidence for a particular fact. I-CALM focuses specifically on epistemic abstention on factual questions with verifiable answers ^[54].
Report accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a model that abstains more can reach similar accuracy while making far fewer errors ^[3].
Fix the test environment. Web access, search tools, context size, system instructions, and retrieval method can all change the result. If you give one model better sources than the other, you are testing the setup, not the model alone.
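
The four-way count described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's exact formulation: `Outcome`, `score`, and the simplified precision and recall definitions below are hypothetical, and a real harness would also need a rule for grading partially correct answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    answerable: bool       # could the question be answered from the available information?
    abstained: bool        # did the model decline to answer?
    correct: bool = False  # only meaningful when the model actually answered


def score(outcomes):
    """Report answer quality and abstention quality side by side."""
    n = len(outcomes)
    answered = [o for o in outcomes if not o.abstained]
    abstained = [o for o in outcomes if o.abstained]
    correct_answers = sum(o.correct for o in answered)
    wrong_answers = len(answered) - correct_answers
    # Assumption: an abstention counts as correct when the question was unanswerable.
    correct_abstentions = sum(not o.answerable for o in abstained)
    should_abstain = sum(not o.answerable for o in outcomes)
    return {
        "accuracy": correct_answers / n,
        "error_rate": wrong_answers / n,
        "abstention_rate": len(abstained) / n,
        "abstention_precision": correct_abstentions / len(abstained) if abstained else 0.0,
        "abstention_recall": correct_abstentions / should_abstain if should_abstain else 0.0,
    }


sample = [
    Outcome(answerable=True, abstained=False, correct=True),    # correct answer
    Outcome(answerable=True, abstained=False, correct=False),   # wrong answer
    Outcome(answerable=False, abstained=True),                  # correct abstention
    Outcome(answerable=True, abstained=True),                   # incorrect abstention
    Outcome(answerable=False, abstained=False, correct=False),  # guessed on an unanswerable item
]
print(score(sample))
```

Keeping abstention precision and recall next to accuracy and error rate prevents a model from looking good simply by refusing everything, or by never refusing at all.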

Quick questions

Is GPT-5.5 Spud real?

Does Claude Opus 4.7 hallucinate less than GPT-5.5 Spud?

What is the better comparison for developers and buyers?

The bottom line


Key takeaways

Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources provided contain no model page or API identifier named GPT-5.5 Spud [12][16][23][25][26][29][45].
OpenAI's SimpleQA example shows why accuracy alone is not enough: gpt-5-thinking-mini recorded 52% abstention, 22% accuracy, and 26% error, versus o4-mini with 1% abstention, 24% accuracy, and 75% error [3].
A serious hallucination comparison must measure correct answers, errors, correct abstentions, and incorrect abstentions, because abstention has its own metrics such as precision and recall [68].
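
The SimpleQA figures can be worked through directly to see how the ranking changes once errors carry a cost. The sketch below uses only the percentages quoted above; the penalty scheme (correct minus penalized errors, abstentions scored as zero) is an illustrative assumption, not OpenAI's scoring rule, and the function names are hypothetical.

```python
# Percentages from OpenAI's SimpleQA example as quoted in the takeaways.
SIMPLEQA = {
    "gpt-5-thinking-mini": {"abstain": 52, "correct": 22, "error": 26},
    "o4-mini": {"abstain": 1, "correct": 24, "error": 75},
}


def penalized_score(rates, wrong_penalty=1):
    """Correct minus penalized errors; abstentions score zero (assumed scheme)."""
    return rates["correct"] - wrong_penalty * rates["error"]


def error_share_of_attempts(rates):
    """Fraction of attempted (non-abstained) questions that were answered wrongly."""
    attempted = rates["correct"] + rates["error"]
    return rates["error"] / attempted


for name, rates in SIMPLEQA.items():
    # The three outcomes partition the test set, so they must sum to 100.
    assert sum(rates.values()) == 100
    print(name, penalized_score(rates), round(error_share_of_attempts(rates), 3))
```

With a penalty of 1, the scores are -4 for gpt-5-thinking-mini and -51 for o4-mini. The two models are only two points apart on accuracy, but the penalized view separates them sharply.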


Research conversation

You, 21 April 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI, 21 April 2026, 40 sources

Cited answer

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]
