Report | Published 29 Apr 2026 | Last edited 6 May 2026 | 20 sources

Claude Opus 4.7 vs GPT-5.5 Spud: what the hallucination evidence actually shows

Claude Opus 4.7 has official documentation and the API ID claude-opus-4-7; GPT-5.5 Spud does not appear as an official OpenAI model in the cited sources [12][16][23][25][26][29][45]. OpenAI's SimpleQA example shows why accuracy, errors, and abstentions should be reported separately: gpt-5-thinking-mini is recorded at 52% abstention, 22% accuracy,...


The question sounds simple: between Claude Opus 4.7 and GPT-5.5 Spud, which model resists hallucination better? But before looking at any scoreboard, there is a more basic problem: one of the two names has not been verified.

Anthropic documents Claude Opus 4.7 and the API ID claude-opus-4-7 ^[12]^[16]. By contrast, the official OpenAI sources cited here document GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4, not a public model named GPT-5.5 Spud ^[23]^[25]^[26]^[29]^[45]. In this set of sources, traces of “Spud” appear in Reddit posts and a feature-request thread on the OpenAI Developer Community, not in model documentation or official release notes ^[7]^[8]^[10]^[28].

The responsible conclusion therefore has to be narrower: Claude Opus 4.7 can be evaluated as a documented model; GPT-5.5 Spud is not a valid benchmark target unless it is tied to an official release, model card, or API ID.

Short verdict based on the evidence

Question	Evidence-backed answer
Is Claude Opus 4.7 verified?	Yes. Anthropic documents Claude Opus 4.7, and its announcement says developers can use `claude-opus-4-7` via the Claude API ^[12]^[16].
Is GPT-5.5 Spud verified as an official OpenAI model?	Not in the official OpenAI sources cited. That material instead documents GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45].
Where does “Spud” appear in this set of sources?	In Reddit posts and an OpenAI Developer Community feature-request thread, not on a model page, model card, API documentation, or official release announcement ^[7]^[8]^[10]^[28].
Is there a Claude Opus 4.7 vs GPT-5.5 Spud hallucination benchmark?	No source provides a head-to-head test with the same tasks and scoring scheme; a fair test would also need to score abstention behavior separately from factual errors ^[68].

This does not prove that a private or future “Spud” model cannot exist. It means only one thing: the evidence available today is not enough to treat GPT-5.5 Spud as an official OpenAI model or to declare a winner on hallucination.

Evidence on Claude Opus 4.7: official, but not a cross-vendor leaderboard

The strongest source for Claude Opus 4.7 is Anthropic's product documentation, not a public benchmark comparing all vendors. Anthropic states that developers can use claude-opus-4-7 via the Claude API ^[16], and its documentation says Claude Opus 4.7 introduces task budgets ^[12].

Task budgets matter for product control, but they are not the same thing as a calibrated-uncertainty benchmark. The feature does not by itself show when the model will stop, ask for clarification, or say that a claim lacks sufficient evidence.

There is one signal relevant to model honesty. Mashable reports, citing Anthropic's Opus 4.7 system card, that Claude Opus 4.7 has a MASK honesty rate of 91.7% and is less likely to hallucinate or be sycophantic than earlier Anthropic models and other frontier AI models ^[14]. That matters, but it still does not settle a Claude-versus-Spud duel, because the report is not a benchmark paired directly against a verified GPT-5.5 Spud model.

What the OpenAI sources actually say

The cited OpenAI material verifies several GPT-5 family references: GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4 ^[23]^[25]^[26]^[29]^[45]. “Spud”, meanwhile, comes from Reddit posts and a feature-request thread on the OpenAI Developer Community in this set of sources ^[7]^[8]^[10]^[28]. Community signals can be worth watching, but they are not equivalent to an official model page, model card, API ID, or release announcement.

OpenAI's explanation of hallucination is actually more useful for evaluation design. OpenAI states that common training and evaluation procedures can give models an incentive to guess rather than admit uncertainty; it also says a model should express uncertainty or ask for clarification rather than give confident but incorrect information ^[3].
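A small sketch may make that incentive concrete. It assumes a hypothetical grading scheme in which a correct answer scores 1, an abstention scores 0, and a wrong answer costs `wrong_penalty` points. The function names and penalty values are illustrative and do not come from the OpenAI material.

```python
def expected_score_if_guessing(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the model is right with probability p_correct.

    Assumed (hypothetical) scheme: correct = +1, wrong = -wrong_penalty, abstain = 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when guessing beats the abstention score of 0."""
    return expected_score_if_guessing(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        # Break-even confidence: p - (1 - p) * penalty = 0  =>  p = penalty / (1 + penalty)
        threshold = penalty / (1.0 + penalty)
        print(f"penalty={penalty}: answer when confidence > {threshold:.2f}")
```

With a penalty of 0, which corresponds to accuracy-only grading, any nonzero chance of being right makes guessing worthwhile. A positive penalty raises the confidence a model needs before answering pays off.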

OpenAI's SimpleQA example shows why a single accuracy figure can mislead. In that example, gpt-5-thinking-mini is recorded at 52% abstention, 22% accuracy, and 26% error, while o4-mini shows 1% abstention, 24% accuracy, and 75% error ^[3]. The first model answers less often, but is wrong far less often in that example ^[3]. For high-stakes product use, a difference like this can matter more than a model that sounds confident on every prompt.
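The arithmetic behind that comparison can be checked in a few lines. The sketch below uses only the percentages quoted above; the derived quantity "accuracy when answering" (correct answers divided by attempted answers) is an illustrative addition, not a metric reported in the source.

```python
# Percentages quoted in the text from OpenAI's SimpleQA example.
RESULTS = {
    "gpt-5-thinking-mini": {"abstained": 52, "correct": 22, "wrong": 26},
    "o4-mini": {"abstained": 1, "correct": 24, "wrong": 75},
}


def accuracy_when_answering(row: dict) -> float:
    """Share of attempted answers that were correct (abstentions excluded)."""
    attempted = row["correct"] + row["wrong"]
    return row["correct"] / attempted


if __name__ == "__main__":
    for model, row in RESULTS.items():
        total = row["abstained"] + row["correct"] + row["wrong"]
        print(f"{model}: total={total}%, "
              f"accuracy when answering={accuracy_when_answering(row):.1%}")
```

On these figures the first model is right on roughly 46% of the questions it attempts, against roughly 24% for the second, even though their headline accuracy differs by only two points.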

Why calibrated uncertainty is the core of the benchmark

Hallucination control is not just a matter of making a model refuse often. A useful model should answer when the evidence is strong, ask when the request is unclear, and withhold an answer when a claim cannot be supported. That is the practical idea behind calibrated uncertainty.

Research supports this framing, with several caveats. A 2024 study reports that uncertainty-based abstention can improve correctness, reduce hallucinations, and improve safety in question-answering scenarios ^[1]^[4]. I-CALM frames epistemic abstention as abstention on factual questions with a verifiable answer, and notes that current LLMs can still fail to hold back when they should abstain ^[54]. Research on behaviorally calibrated reinforcement learning also studies how to incentivize models to admit uncertainty by withholding an answer ^[61].

Broader reviews position uncertainty quantification as a tool for detecting hallucinations, and describe calibrated uncertainty as a way to help users decide when to trust, defer, or verify a model's answer ^[53]^[55]. Calibration remains the key, though: a model that says it does not know too often may be safe but less useful; a model that never abstains may feel helpful but is risky.
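The trade-off between abstaining too often and never abstaining can be illustrated with a confidence threshold. This is a minimal sketch under stated assumptions: the `SAMPLE` data is invented for illustration, and the thresholding rule is a generic selective-answering scheme, not a method taken from any of the cited papers.

```python
from typing import List, Tuple

# Hypothetical (confidence, is_correct) pairs; not taken from any cited benchmark.
SAMPLE: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.55, True), (0.40, False),
    (0.30, False), (0.20, False),
]


def sweep(sample: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (coverage, error rate among answered) when abstaining below threshold."""
    answered = [ok for conf, ok in sample if conf >= threshold]
    coverage = len(answered) / len(sample)
    if not answered:
        # Nothing was answered, so there are no errors to count.
        return coverage, 0.0
    error_rate = sum(1 for ok in answered if not ok) / len(answered)
    return coverage, error_rate


if __name__ == "__main__":
    for t in (0.0, 0.5, 0.85, 0.99):
        coverage, error_rate = sweep(SAMPLE, t)
        print(f"threshold={t:.2f} coverage={coverage:.0%} error={error_rate:.0%}")
```

Raising the threshold lowers the error rate on answered questions but also shrinks coverage, which is why both numbers need to be reported together.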

How to build a fairer Claude vs OpenAI test

Use official model IDs. For Claude, test claude-opus-4-7; for OpenAI, use a documented model such as GPT-5 or GPT-5 mini, not the unverified Spud label ^[16]^[23]^[25]^[29].
Build a mixed question set. Include answerable questions, underspecified requests, and questions that genuinely cannot be answered; abstention research assesses the benefit of declining to answer when uncertainty is high or a question cannot be answered safely ^[1]^[4].
Score abstention separately. Record correct answers, wrong answers, correct abstentions, and wrong abstentions. The abstention survey defines metrics such as abstention accuracy, abstention precision, and abstention recall ^[68].
Separate factual uncertainty from safety refusals. Refusing to help create harmful content is not the same behavior as saying the factual evidence is insufficient; I-CALM focuses on epistemic abstention for factual questions with a verifiable answer ^[54].
Report accuracy, error rate, and abstention rate together. OpenAI's SimpleQA example shows that a model with much higher abstention can have similar accuracy but a far lower error rate ^[3].
Match the test environment. Retrieval, browsing, tool access, context length, and system instructions can all change results. If one model is given extra evidence and the other is not, the test measures the setup as well as the model.
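The scoring steps above can be sketched as a small four-outcome tally. Everything here is an assumption made for illustration: the `Item` structure, the exact-match grading, and the simplified precision and recall definitions are not the survey's formal metrics ^[68], and underspecified requests that call for a clarifying question are left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Item:
    gold: Optional[str]      # None marks a question that should not be answered
    response: Optional[str]  # None marks an abstention by the model


def classify(item: Item) -> str:
    """Map one graded item to one of four outcomes."""
    if item.response is None:
        return "correct_abstention" if item.gold is None else "wrong_abstention"
    if item.gold is not None and item.response.strip().lower() == item.gold.strip().lower():
        return "correct_answer"
    # Covers both a wrong answer and an answer to an unanswerable question.
    return "wrong_answer"


def _ratio(n: int, d: int) -> float:
    return n / d if d else 0.0


def report(items: Iterable[Item]) -> Dict[str, float]:
    rows = list(items)
    counts = Counter(classify(it) for it in rows)
    total = len(rows)
    abstained = counts["correct_abstention"] + counts["wrong_abstention"]
    unanswerable = sum(1 for it in rows if it.gold is None)
    return {
        "accuracy": _ratio(counts["correct_answer"], total),
        "error_rate": _ratio(counts["wrong_answer"], total),
        "abstention_rate": _ratio(abstained, total),
        # Of all abstentions, how many were warranted.
        "abstention_precision": _ratio(counts["correct_abstention"], abstained),
        # Of all unanswerable questions, how many the model abstained on.
        "abstention_recall": _ratio(counts["correct_abstention"], unanswerable),
    }


# Hypothetical graded items, one per outcome plus an answered unanswerable question.
DEMO = [
    Item("paris", "Paris"),
    Item("1969", "1970"),
    Item(None, None),
    Item(None, "42"),
    Item("h2o", None),
]

if __name__ == "__main__":
    for name, value in report(DEMO).items():
        print(f"{name}: {value:.2f}")
```

Exact string matching is the weakest part of this sketch; a real harness would likely need a more tolerant grader, applied identically to both models.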

FAQ

Is GPT-5.5 Spud real?

It has not been verified as an official OpenAI model in the cited evidence. The official OpenAI sources here document GPT-5, GPT-5 mini, GPT-5.2-Codex, and prompt guidance for GPT-5.4, while “Spud” appears in Reddit posts and a community thread ^[7]^[8]^[10]^[23]^[25]^[26]^[28]^[29]^[45].

Does Claude Opus 4.7 hallucinate less often than GPT-5.5 Spud?

That question cannot be answered rigorously from these sources. Claude Opus 4.7 is documented ^[12]^[16], and there is a secondary report of a 91.7% MASK honesty rate ^[14]. However, there is no verified GPT-5.5 Spud target and no shared benchmark for the two names ^[7]^[8]^[10]^[28]^[68].

What should buyers or developers compare?

Compare Claude Opus 4.7 with a documented OpenAI model, under the same tasks, tools, prompts, and scoring rules. The headline metrics should combine accuracy, error rate, and abstention behavior, not accuracy alone ^[3]^[68].
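One way to enforce "the same tasks, tools, prompts, and scoring rules" is to compare run configurations before any scores are read. The sketch below is hypothetical throughout: the `RunConfig` fields, the context sizes, and the OpenAI model string are placeholders to be replaced with whatever the official documentation lists.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    model_id: str
    system_prompt: str
    tools_enabled: bool
    web_access: bool
    max_context_tokens: int
    scoring_rules: str


def mismatched_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """List every setting, other than the model ID, that differs between two runs."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if key != "model_id" and da[key] != db[key]]


if __name__ == "__main__":
    prompt = "Answer the question, or say you do not know."
    claude_run = RunConfig("claude-opus-4-7", prompt, False, False, 128_000, "v1")
    # Placeholder ID: substitute the documented OpenAI model ID being tested.
    openai_run = RunConfig("documented-openai-model-id", prompt, False, True, 128_000, "v1")
    diff = mismatched_fields(claude_run, openai_run)
    print("comparable" if not diff else f"not comparable, differs on: {diff}")
```

Here the two runs differ on `web_access`, so any gap in their scores could come from the setup rather than from the models.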

Conclusion

Do not draw a “Claude wins” or “Spud wins” conclusion from this evidence. The defensible conclusions are: Claude Opus 4.7 is officially documented; GPT-5.5 Spud is not verified in the cited official OpenAI material; and the best way to assess hallucination control is to score calibrated uncertainty, including correct abstention when a claim cannot be supported ^[3]^[12]^[16]^[23]^[25]^[29]^[45]^[68].


Key points

Claude Opus 4.7 has official documentation and the API ID claude-opus-4-7; GPT-5.5 Spud does not appear as an official OpenAI model in the cited sources [12][16][23][25][26][29][45].
OpenAI's SimpleQA example shows why accuracy, errors, and abstentions should be reported separately: gpt-5-thinking-mini is recorded at 52% abstention, 22% accuracy, and 26% error; o4-mini at 1% abstention, 24% accuracy, and 75% error [3].
Production benchmarks should count correct answers, wrong answers, correct abstentions, and wrong abstentions, because abstention has its own metrics such as accuracy, precision, and recall [68].


Research conversation

You | 21 Apr 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Hallucination control, abstention behavior, and calibrated uncertainty

Studio Global AI | 21 Apr 2026 | 40 sources

Cited answer

This is not a clean apples-to-apples fact check because only one of the two named models is clearly verifiable in official sources. Claude Opus 4.7 is documented by Anthropic as its most capable generally available model, while the official OpenAI sources I retrieved document GPT-5 mini and a docs track labeled “Latest: GPT-5.4,” not a model named “GPT-5.5 Spud.” ^[10]^[5]^[6] So a strict “Claude Opus 4.7 vs GPT-5.5 Spud” comparison is not fully verifiable from official evidence. ^[10]^[5]^[6]

Key findings

Anthropic’s official Opus 4.7 docs describe it as “our most capable generally available model to date,” with 1M-token context, 128k max output, adaptive thinking, and new task-budget controls. ^[10]
Anthropic also says task budgets are advisory and that if a budget is too restrictive, Opus 4.7 may complete a task less thoroughly or refuse to do it entirely. ^[10] That is a control mechanism, but it is not the same thing as well-calibrated epistemic abstention about uncertain facts. ^[10]^[4]
OpenAI’s public explanation of hallucinations is more explicit on abstention and uncertainty. OpenAI says standard training and evaluation procedures reward guessing over acknowledging uncertainty, and says it is better for a model to express uncertainty or ask for clarification than to provide confident but incorrect information. ^[15]
In OpenAI’s own SimpleQA example, gpt-5-thinking-mini shows a 52% abstention rate, 22% accuracy, and 26% error rate, while o4-mini shows 1% abstention, 24% accuracy, and 75% error rate. ^[15] OpenAI uses that comparison to argue that lower hallucination can come from abstaining more when uncertain, even if top-line accuracy falls slightly. ^[15]
The academic literature broadly supports that framing. A 2024 study reports that uncertainty-based abstention in LLMs improves safety and reduces hallucinations. ^[8] More recent work argues that current LLMs still often fail to abstain when they should, so confidence-aware abstention and recalibration remain active research problems. ^[9]^[7]^[3]
Reviews and taxonomies in this area consistently treat calibrated uncertainty as important because it helps users decide when to trust, defer, or verify a model’s answer. ^[4]^[3] That means “hallucination control” is not just about fewer wrong answers, but also about whether the model knows when not to guess. ^[4]^[15]
For Claude Opus 4.7 specifically, I found official Anthropic product documentation, but not a comparably explicit official public write-up in the retrieved sources that quantifies calibrated uncertainty or abstention behavior the way OpenAI’s public materials do. ^[10]^[15] On that narrow question, the evidence base is thinner for Opus 4.7 than for OpenAI’s GPT-5-family materials. ^[10]^[15]

Evidence notes

The strongest fact-check result is about existence and naming: Claude Opus 4.7 is officially documented by Anthropic, while the official OpenAI sources I retrieved document GPT-5 mini and a GPT-5.4 docs track rather than “GPT-5.5 Spud.” ^[10]^[5]^[6]
The strongest source on abstention behavior is OpenAI’s “Why language models hallucinate,” because it directly discusses abstentions, error rates, and why accuracy-only evaluations incentivize guessing. ^[15]
The strongest cross-model conclusion I can support is general, not model-specific: abstention and calibrated uncertainty are credible ways to reduce hallucinations, but they require explicit incentives and calibration work. ^[8]^[9]^[7]^[3]

Limitations / uncertainty

Insufficient evidence for a verified comparison against an official OpenAI model called “GPT-5.5 Spud.” ^[5]^[6]
Insufficient evidence for a standardized, same-benchmark head-to-head on calibrated uncertainty between Claude Opus 4.7 and an official OpenAI counterpart in the retrieved sources. ^[10]^[15]
Vendor claims are not directly comparable unless the models are tested on the same tasks, with the same prompting, tool access, web access, and scoring rules for abstentions versus errors. ^[15]^[8]^[4]

Summary

Fact-check result: Claude Opus 4.7 is real and officially documented; “GPT-5.5 Spud” is not verified in the official OpenAI sources I retrieved. ^[10]^[5]^[6]
Research result: abstention and calibrated uncertainty are real levers for reducing hallucinations, and OpenAI’s public materials currently articulate that logic more directly than the retrieved official Anthropic Opus 4.7 docs do. ^[15]^[8]^[9]^[10]
Practical conclusion: if you want a rigorous comparison on hallucination control, the defensible comparison is Claude Opus 4.7 versus an official OpenAI model such as GPT-5 or GPT-5 mini, not versus the unverified label “GPT-5.5 Spud.” ^[10]^[5]^[6]

Sources

^[3] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions
^[4] A comprehensive taxonomy of hallucinations in large language models
^[5] OpenAI API docs: GPT-5 mini Model
^[6] OpenAI API docs: Prompt guidance for GPT-5.4
^[7] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
^[8] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
^[9] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
^[10] Anthropic docs: What’s new in Claude Opus 4.7
^[15] OpenAI: Why Language Models Hallucinate

If you want, I can turn this into a tighter scorecard comparing Claude Opus 4.7 against official OpenAI models on three axes only: factuality, abstention behavior, and uncertainty calibration.

Sources

[1] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations, arXiv:2404.10960 (arxiv.org)
This study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering.
[3] Why language models hallucinate, OpenAI (openai.com)
Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertaint...
[4] Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations, OpenReview (openreview.net)
Keywords: LLMs, uncertainty, abstention, correctness, hallucinations, safety. TL;DR: Abstention based on the right form of uncertainty improves correctness, hallucinations and...
[7] The “Spud” Leaks & The New Frontier of Omnimodal AI, r/AI_India (reddit.com)
[8] GPT-5.5: The “Spud” Leaks & The New Frontier of Omnimodal AI (reddit.com)
They've had ideas like that in the past (GPT5) and the user reaction was not good. The release was a flop, even though the model was quite good.
[10] GPT 5.5 Spud incoming, r/OpenAI (reddit.com)
You get stability, but you also get a model that is progressively falling behind the current release on safety and capability dimensions.
[12] What's new in Claude Opus 4.7 (platform.claude.com)
Claude Opus 4.7 introduces task budgets. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to 35% more, varying by content), and /v1/messages/count_tokens will return a different number of tok...
[14] Anthropic: Claude Opus 4.7 has a 92% honesty rate, fewer hallucinations, Mashable (mashable.com)
Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7. Anthropic has a reputation as a safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both prio...
[16] Introducing Claude Opus 4.7, Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[23] GPT-5 mini Model, OpenAI API (developers.openai.com)
[25] Introducing GPT-5, OpenAI (openai.com)
A smarter, more widely useful model. How to use GPT‑5. See here for full details on what GPT‑5 unlocks for developers. At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half...
[26] Introducing GPT-5.2-Codex, OpenAI (openai.com)
Pushing the frontier on real-world software engineering. Advancing the cyber frontier. Real-world cyber capabilities. Empowering cyberdefense through trusted access.
[28] Please Add an Optional Expression Mode with the SPUD Release, ChatGPT / Feature requests, OpenAI Developer Community (community.openai.com)
[29] GPT-5 is here, OpenAI (openai.com)
GPT-5: text & vision, 400K context length, 128K max output tokens, input $1.25 and output $10.00 per 1M tokens.
[45] Prompt guidance for GPT-5.4, OpenAI API (developers.openai.com)
Latest: GPT-5.4.
[53] Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions (arxiv.org)
… UQ as a tool for LLM hallucination detection on openended question-… We then review UQ for classification tasks, followed by … Don’t hallucinate, abstain: Identifying LLM knowledge gaps … 2025
[54] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation, arXiv:2604.03904 (arxiv.org)
Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being
[55] A comprehensive taxonomy of hallucinations in large language models (arxiv.org)
… This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a … interface communicates epistemic uncertainty clearly. Calibrated uncertainty helps users decide … 2508
[61] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning, ADS (ui.adsabs.harvard.edu)
This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when
[68] Know Your Limits: A Survey of Abstention in Large Language Models, Bingbing Wen et al. (aclanthology.org)
Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention: ACC = (N1 + N5) / (N1 + N2 + N3 + N4 + N5). Abstention Precision (Feng et al., 2024b) measures the proportion of model abstain decisions t...
