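Answer | Published 28 April 2026 | Last edited 6 May 2026 | 13 sources

GPT-5.5 vs Claude Opus 4.7: Benchmarks and Model Selection Guide

Benchmarks do not crown an overall winner; they show which model suits which kind of work. GPT-5.5 is stronger on Terminal-Bench 2.0, FrontierMath, and BrowseComp-style research, while Claude Opus 4.7 stands out on SWE-Bench Pro and MCP/tool orchestration. On coding, SWE-Bench Verified is close to a tie, but the harder SWE-Bench Pro shows Claude Opus 4.7 ahead by 5.7 percentage points, which is the more relevant signal for production coding agents.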


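If your team is choosing between GPT-5.5 and Claude Opus 4.7, the key question is not which one is the overall champion but what you need it to do. LLM Stats frames its comparison of the two the same way: benchmark numbers do not pick a universal winner; they are signals about different workloads ^[2]. The available data shows GPT-5.5 stronger on terminal-style execution, FrontierMath, and BrowseComp-style research, while Claude Opus 4.7 has the edge on harder software-engineering tasks and on MCP/tool orchestration ^[21]^[27]^[28]^[32].

Benchmarks at a glance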

Benchmark / area	GPT-5.5	Claude Opus 4.7	How to read it
SWE-Bench Verified	88.7%	87.6%	Close to a tie; GPT-5.5 is 1.1 percentage points higher, which is not a decisive gap ^[1]^[18].
SWE-Bench Pro	58.6%	64.3%	On harder software-engineering tasks, Claude leads more clearly ^[32].
Terminal-Bench 2.0	82.7%	69.4% (reported)	GPT-5.5 stands out on terminal-oriented execution, but sources do not fully agree on the public Opus score ^[1]^[18]^[27].
MCP Atlas	75.3%	77.3–79.1%	Claude has the edge on tool calling and orchestration ^[21]^[27]^[32].
FrontierMath Tier 1–3	51.7%	43.8%	GPT-5.5 has a clear advantage on mathematical reasoning tasks ^[28].
FrontierMath Tier 4	35.4%	22.9%	GPT-5.5 still leads on the harder math tier ^[28].
GPQA Diamond	93.6%	94.2%	Nearly level; Claude is slightly ahead ^[28].
Humanity's Last Exam, no tools	41.4%	46.9%	Claude scores higher on broad exam-style reasoning ^[28].
Humanity's Last Exam, with tools	52.2%	54.7%	With tools added, Claude keeps a small lead ^[28].
BrowseComp	84.4%	79.3%	GPT-5.5 has the higher reported score on BrowseComp-style research ^[5]^[27].

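Two rows need careful reading. For Terminal-Bench 2.0, LLM Stats and some summaries report Claude Opus 4.7 at 69.4%, but at least one comparison lists only GPT-5.5's 82.7% and gives no public Opus number ^[1]^[18]^[27]. For MCP Atlas, BenchLM's public snapshot shows Claude Opus 4.7 at 77.3% and GPT-5.5 at 75.3%, while other reports cite 79.1% for Claude against 75.3% for GPT-5.5 ^[21]^[27]^[32]. The directional conclusion holds either way: terminal-style execution favors GPT-5.5, and MCP/tool orchestration favors Claude Opus 4.7.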
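The gaps in the table can be recomputed rather than read off by eye. The following is a minimal sketch, not part of any cited source: the scores are copied from the table above, and the names `SCORES`, `gap_points`, and `leader` are illustrative. MCP Atlas is stored as a (low, high) range because two different Claude figures are reported.

```python
# Scores (percent) copied from the comparison table: (GPT-5.5, Claude Opus 4.7).
# A tuple for Claude means sources report a range rather than one number.
SCORES = {
    "SWE-Bench Verified": (88.7, 87.6),
    "SWE-Bench Pro": (58.6, 64.3),
    "Terminal-Bench 2.0": (82.7, 69.4),
    "MCP Atlas": (75.3, (77.3, 79.1)),
    "FrontierMath Tier 1-3": (51.7, 43.8),
    "FrontierMath Tier 4": (35.4, 22.9),
    "GPQA Diamond": (93.6, 94.2),
    "HLE, no tools": (41.4, 46.9),
    "HLE, with tools": (52.2, 54.7),
    "BrowseComp": (84.4, 79.3),
}


def gap_points(gpt, claude):
    """Return (min_gap, max_gap) in percentage points; positive means GPT-5.5 leads."""
    low, high = claude if isinstance(claude, tuple) else (claude, claude)
    gaps = (round(gpt - high, 1), round(gpt - low, 1))
    return min(gaps), max(gaps)


def leader(gpt, claude):
    """Name the leader only when the whole reported range agrees on the direction."""
    min_gap, max_gap = gap_points(gpt, claude)
    if min_gap > 0:
        return "GPT-5.5"
    if max_gap < 0:
        return "Claude Opus 4.7"
    return "unclear"


for name, (gpt, claude) in SCORES.items():
    min_gap, max_gap = gap_points(gpt, claude)
    print(f"{name}: gap {min_gap:+.1f} to {max_gap:+.1f} points, leader: {leader(gpt, claude)}")
```

Storing the disputed score as a range keeps the source disagreement visible instead of silently picking one figure; the leader is still Claude Opus 4.7 for either reported value.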

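Coding: look past the headline tie to SWE-Bench Pro

SWE-Bench tests a model's ability to resolve real GitHub issues, and the Pro variant is described as harder, with more complex issues ^[17]. On SWE-Bench Verified, GPT-5.5 scores 88.7% and Claude Opus 4.7 scores 87.6%, which is effectively a near tie ^[1]^[18].

The more telling coding signal is SWE-Bench Pro. There, Claude Opus 4.7 is reported at 64.3% and GPT-5.5 at 58.6%, a 5.7-percentage-point lead for Claude ^[32]. SWE-Bench Pro is also closer to complex engineering work: one overview notes that the Verified set has 500 tasks across 12 Python repositories, while the Pro set has 1,865 tasks across 41 repositories covering Python, Go, TypeScript, and JavaScript, and the average number of files changed rises from about 1 to 4.1 ^[22].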
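To make the difference in scale concrete, the dataset figures quoted above can be turned into simple ratios. This is a rough sketch under stated assumptions: the numbers come from the overview cited as ^[22], "about 1" file changed is taken as exactly 1.0, and `BenchmarkSet` and `scale_factors` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSet:
    name: str
    tasks: int
    repositories: int
    avg_files_changed: float


# "About 1" file changed for Verified is approximated as 1.0 (assumption).
VERIFIED = BenchmarkSet("SWE-Bench Verified", 500, 12, 1.0)
PRO = BenchmarkSet("SWE-Bench Pro", 1865, 41, 4.1)


def scale_factors(small, large):
    """How many times larger `large` is than `small` on each dimension."""
    return {
        "tasks": round(large.tasks / small.tasks, 2),
        "repositories": round(large.repositories / small.repositories, 2),
        "avg_files_changed": round(large.avg_files_changed / small.avg_files_changed, 2),
    }


factors = scale_factors(VERIFIED, PRO)
print(factors)
```

On these figures the Pro set has roughly 3.7 times the tasks, 3.4 times the repositories, and about 4 times the files touched per change, which is why a lead there says more about multi-file work.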

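In practice, if your work involves multi-file bug fixing, pull-request repair, or refactoring, or you want to build production coding agents, Claude Opus 4.7 should be the first model you try. MindStudio's coding comparison also notes that Opus 4.7 performs better on tasks in large codebases that require broad architectural reasoning ^[3].

Agents and tools: GPT-5.5 is sharper in the terminal, Claude is steadier at orchestration

If your workflow leans heavily on the shell, CLIs, file navigation, and step-by-step computer work, the case for GPT-5.5 is stronger. On Terminal-Bench 2.0, GPT-5.5 is reported at 82.7% and Claude Opus 4.7 at 69.4% ^[18]^[27]. However, because some public comparisons give no corresponding Opus number, this result is better treated as a directional signal than as absolute leaderboard truth ^[1].

Tool orchestration is a different story. MCP Atlas is a benchmark for tool calling over Model Context Protocol integrations and external tools; put simply, it checks whether a model can reliably chain multiple tools, APIs, or services ^[21]. BenchLM's public snapshot shows Claude Opus 4.7 at 77.3% and GPT-5.5 at 75.3% ^[21]; other reporting gives 79.1% against 75.3% ^[27]^[32]. If your agent has to call several APIs, services, and tools in sequence, Claude Opus 4.7 is the better first test.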

Reasoning and research: math is one category, broad exams are another

Treating reasoning as a single capability makes it easy to misjudge. OpenAI's GPT-5.5 table shows GPT-5.5 at 51.7% and Claude Opus 4.7 at 43.8% on FrontierMath Tier 1–3, and GPT-5.5 at 35.4% against Claude at 22.9% on FrontierMath Tier 4 ^[28]. In other words, GPT-5.5 has a fairly clear advantage on math-heavy reasoning.

GPQA Diamond and Humanity's Last Exam send a different signal. On GPQA Diamond the two are nearly tied: GPT-5.5 at 93.6% and Claude Opus 4.7 at 94.2% ^[28]. On Humanity's Last Exam Claude leads: 46.9% against GPT-5.5's 41.4% in the no-tools setting, and 54.7% against 52.2% in the with-tools setting ^[28].

For browsing-heavy research, GPT-5.5 is stronger on BrowseComp-style research, with a reported score of 84.4% against 79.3% for Claude Opus 4.7 ^[5]^[27]. So if you need large-scale web research automation or browsing-based analysis, GPT-5.5 is worth trying first.

Which model should you choose?

Choose GPT-5.5 if you need to:

- Run terminal execution, shell automation, CLI-based agents, or step-by-step computer work; GPT-5.5 scores higher in Terminal-Bench 2.0 comparisons ^[18]^[27].
- Handle math-heavy reasoning; GPT-5.5 leads on both FrontierMath Tier 1–3 and Tier 4 ^[28].
- Do BrowseComp-style web research or browsing-heavy analysis; GPT-5.5 is reported at 84.4% and Claude Opus 4.7 at 79.3% ^[5]^[27].

Choose Claude Opus 4.7 if you need to:

- Handle complex codebase changes, multi-file bug fixing, or hard engineering tasks of the SWE-Bench Pro kind; Claude leads this benchmark with 64.3% against 58.6% for GPT-5.5 ^[32].
- Build agents centered on MCP/API/tool orchestration; Claude Opus 4.7 scores above GPT-5.5 in MCP Atlas snapshots ^[21]^[27]^[32].
- Rely on architectural reasoning in a large codebase; the MindStudio comparison finds Opus 4.7 stronger at broad architectural reasoning in large codebases ^[3].
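The two lists above amount to a lookup from workload type to the model worth trialing first. A minimal sketch of that lookup follows; the workload keys and the names `FIRST_MODEL_TO_TEST` and `shortlist` are hypothetical labels chosen for illustration, and the mapping only restates the guidance in this section.

```python
from collections import defaultdict

# Workload label (hypothetical key) -> model the guide suggests testing first.
FIRST_MODEL_TO_TEST = {
    "terminal_execution": "GPT-5.5",
    "math_reasoning": "GPT-5.5",
    "web_research": "GPT-5.5",
    "multi_file_coding": "Claude Opus 4.7",
    "tool_orchestration": "Claude Opus 4.7",
    "architectural_reasoning": "Claude Opus 4.7",
}

NO_SIGNAL = "no benchmark signal: test both"


def shortlist(workloads):
    """Group a team's workloads by the model to trial first."""
    by_model = defaultdict(list)
    for workload in workloads:
        by_model[FIRST_MODEL_TO_TEST.get(workload, NO_SIGNAL)].append(workload)
    return dict(by_model)


print(shortlist(["terminal_execution", "tool_orchestration", "customer_support"]))
```

Workloads the benchmarks say nothing about fall into a separate bucket, so an unknown task type is never silently assigned to either model.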

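How should you read benchmarks? Not as a production guarantee

Public benchmark numbers should not be taken directly as production truth. Anthropic's Claude Opus 4.7 release notes mention harness changes, internal implementations, and methodology updates, and point out that some scores may not be directly comparable with public leaderboard scores ^[19]. Separately, a builder-focused summary of GPT-5.5 cautions that some benchmark scores are OpenAI-reported and lack third-party replication ^[31].

The safest approach is to run a small internal eval: take your recent tickets, repositories, tool chains, prompts, and pass/fail criteria, and test GPT-5.5 and Claude Opus 4.7 side by side. A leaderboard is a signpost, not a guarantee; the deciding factors are your workload, latency tolerance, tooling, and cost of failure.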
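A small internal eval of this kind can be sketched in a few lines. The code below is an illustration under assumptions, not a vendor API: `EvalCase`, `pass_rates`, and the two stub functions are hypothetical, and each stub stands in for a real client call to GPT-5.5 or Claude Opus 4.7 that you would supply yourself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]  # takes a prompt, returns the model's output


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    passes: Callable[[str], bool]  # your own pass/fail criterion for this case


def pass_rates(models: Dict[str, ModelFn], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case against every model and return the fraction passed."""
    results = {}
    for name, call in models.items():
        passed = 0
        for case in cases:
            try:
                output = call(case.prompt)
            except Exception:
                continue  # a crashed call counts as a failure
            if case.passes(output):
                passed += 1
        results[name] = passed / len(cases) if cases else 0.0
    return results


# Stub callables standing in for real API clients (hypothetical).
def stub_model_a(prompt: str) -> str:
    return prompt.upper()


def stub_model_b(prompt: str) -> str:
    return prompt


cases = [
    EvalCase("t1", "fix null check", lambda out: "null" in out.lower()),
    EvalCase("t2", "rename helper", lambda out: "helper" in out),
]
rates = pass_rates({"model_a": stub_model_a, "model_b": stub_model_b}, cases)
print(rates)
```

Keeping the pass/fail check as a per-case function lets each ticket carry its own acceptance test, so the same harness can cover code fixes, tool chains, and prompts without changing the loop.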

Verdict

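If you want a starting point geared toward general automation, terminal execution, math-heavy reasoning, and BrowseComp-style research, GPT-5.5 is currently the more reasonable choice ^[27]^[28]. If your core outcomes are hard coding tasks, production coding agents, or multi-tool orchestration, Claude Opus 4.7 is the stronger candidate ^[21]^[32].

In one sentence: GPT-5.5 is strong at broad execution and math; Claude Opus 4.7 is strong at hard software engineering and tool-agent workflows. The real question is not which model is best but which model best fits the job in front of you.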


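Key takeaways

- Benchmarks do not crown an overall winner; they show which model suits which kind of work. GPT-5.5 is stronger on Terminal-Bench 2.0, FrontierMath, and BrowseComp-style research; Claude Opus 4.7 stands out on SWE-Bench Pro and MCP/tool orchestration.
- On coding, SWE-Bench Verified is close to a tie, but the harder SWE-Bench Pro shows Claude Opus 4.7 ahead by 5.7 percentage points, which is the more relevant signal for production coding agents.
- Benchmark numbers should not be treated as a production guarantee; some results are affected by harness choices, vendor reporting, or a lack of third-party replication, so it is best to run an internal eval with your own repositories, tools, and prompts.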


Research conversation

You | 26 April 2026

Research question

Compare GPT-5.5 and Claude Opus 4.7 based on benchmark reports.

Studio Global AI | 26 April 2026 | 36 sources

Cited answer

Overall, GPT-5.5 appears ahead on reasoning, terminal/agent execution, and several academic benchmarks, while Claude Opus 4.7 is strong on hard software-engineering and multi-tool orchestration. Which model is "better" therefore depends on the use case: GPT-5.5 looks like the better option for general reasoning and terminal tasks, and Opus 4.7 for complex coding and agentic tool workflows.

Coding: mixed result
- On SWE-Bench Verified, GPT-5.5 is given a slight edge: 88.7% versus 87.6% for Claude Opus 4.7 ^[7].
- On SWE-Bench Pro, Claude Opus 4.7 is clearly ahead: 64.3% versus 58.6% for GPT-5.5 ^[7]^[6].
- Reports on Opus 4.7 describe an improvement on SWE-bench Pro from 53.4% for Opus 4.6 to 64.3%, a large gain on hard real-world coding tasks ^[3]^[6].
Terminal / agent execution
- On Terminal-Bench 2.0, GPT-5.5 is reported at 82.7% and Opus 4.7 at 69.4%, so GPT-5.5 looks stronger on shell/terminal-style execution tasks ^[12]^[7].
- Some reports give GPT-5.5's Terminal-Bench 2.0 score as 82.7%, but sources are inconsistent about a public Opus number, so this benchmark should be read with some caution ^[7]^[12].
Tool use and orchestration
- On MCP Atlas, a tool-integration benchmark, Claude Opus 4.7 is reported at 79.1% versus 75.3% for GPT-5.5, so Opus 4.7 looks ahead on multi-tool/API orchestration ^[10]^[2].
- Vellum's report also describes Opus 4.7 as strong for production-agent workflows, with 77.3% on MCP-Atlas ^[3].
Academic / reasoning
- OpenAI's GPT-5.5 report gives GPT-5.5 51.7% and Claude Opus 4.7 43.8% on FrontierMath Tier 1–3 ^[15]^[7].
- The same report gives GPT-5.5 35.4% and Claude Opus 4.7 22.9% on FrontierMath Tier 4 ^[15]^[7].
- On GPQA Diamond the two are very close: GPT-5.5 at 93.6% and Claude Opus 4.7 at 94.2% ^[15]^[7].
Conclusion
- Choose GPT-5.5 if you need broad reasoning, math-heavy tasks, terminal automation, or a general-purpose high-performance assistant.
- Choose Claude Opus 4.7 if your main work is complex codebase changes, hard coding tasks of the SWE-Bench Pro kind, or multi-tool enterprise agents.
- Few independent benchmarks run on the same harness are available; rather than treating official lab numbers and third-party summaries as the final truth, validate them with workload-specific testing.

Sources

[1] GPT-5.5 vs Claude Opus 4.7: 2026 Frontier Showdown (Benchmarks) (tokenmix.ai)
Head-to-Head: The Numbers That Matter Benchmark GPT-5.5 Claude Opus 4.7 Winner --- --- SWE-Bench Verified 88.7% 87.6% GPT-5.5 by 1.1 SWE-Bench Pro 58.6% 64.3% Opus 4.7 by 5.7 MMLU 92.4% 91% GPT-5.5 Terminal-Bench 2.0 82.7% — GPT-5.5 (no public Opus number)...
[2] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)
Within seven days, I had two new frontier models to compare against the workloads I run for LLM Stats: Claude Opus 4.7 shipped on April 16, 2026, and GPT-5.5 on April 23. Both land at the same input price. Both ship 1M-token context. Both pitch significantly b...
[3] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance ... (mindstudio.ai)
SWE-Bench and Coding Tasks On SWE-Bench Verified — the standard benchmark for evaluating real GitHub issue resolution — both models score competitively at the top of the 2026 leaderboard. GPT-5.5 holds a slight edge on problems requiring precise tool use an...
[5] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[17] Claude Opus 4.7: Anthropic's New Best (Available) Model (datacamp.com)
SWE-bench tests a model's ability to resolve real GitHub issues in open-source Python repositories. Pro is a harder variant with more complex issues. The 10.9-point gain over Opus 4.6 on SWE-bench Pro is the largest improvement in this release (percentage p...
[18] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Benchmarks Agentic coding Benchmark Opus 4.7 Opus 4.6 Delta --- --- SWE-bench Verified 87.6% 80.8% +6.8 SWE-bench Pro 64.3% 53.4% +10.9 Terminal-Bench 2.0 69.4% 65.4% +4.0 The jump on SWE-bench Pro (+10.9 points) is larger than on SWE-bench Verified, sugges...
[19] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[21] MCP Atlas Benchmark 2026: 13 model averages | BenchLM.ai (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools MCP Atlas A benchmark for tool-calling over Model Context Protocol integrations and external tools. Benchmark score on MCP Atlas — April 23, 2026 BenchLM mirrors the published s...
[22] SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81% (morphllm.com)
Dimension SWE-Bench Verified SWE-Bench Pro --- Tasks 500 1,865 Repositories 12 (all Python) 41 (Python, Go, TS, JS) Avg lines changed 11 (median: 4) 107.4 Avg files changed 1 4.1 Top score (Mar 2026) 80.9% (Claude Opus 4.5) 59% (agent systems) Contamination...
[27] GPT-5.5: The Honest Take on OpenAI's Response to Opus ... (alexlavaee.me)
Benchmark GPT-5.5 GPT-5.4 Opus 4.7 Gemini 3.1 Pro --- --- Terminal-Bench 2.0 82.7% 75.1% 69.4% 68.5% SWE-Bench Pro (public)\ 58.6% 57.7% 64.3% 54.2% Expert-SWE (OpenAI internal) 73.1% 68.5% — — OSWorld-Verified 78.7% 75.0% 78.0% — MCP Atlas (tool use) 75.3%...
[28] Introducing GPT-5.5 - OpenAI (openai.com)
Academic Eval GPT-5.5 GPT‑5.4 GPT-5.5 Pro GPT‑5.4 Pro Claude Opus 4.7 Gemini 3.1 Pro GeneBench 25.0% 19.0% 33.2% 25.6% - - FrontierMath Tier 1–3 51.7% 47.6% 52.4% 50.0% 43.8% 36.9% FrontierMath Tier 4 35.4% 27.1% 39.6% 38.0% 22.9% 16.7% BixBench 80.5% 74.0% - - - - GPQA Diamond 93.6%...
[31] What Is GPT-5.5 for Builders in 2026? | WaveSpeedAI Blog (wavespeed.ai)
Item Status --- Release date: April 23, 2026 Confirmed — OpenAI official Live in ChatGPT (Plus/Pro/Business/Enterprise) Confirmed — OpenAI official Live in Codex (Plus/Pro/Business/Enterprise/Edu/Go) Confirmed — OpenAI official 400K context in Codex Confirm...
[32] Everything You Need to Know About GPT-5.5 - Vellum (vellum.ai)
SWE-bench Pro: the coding crown stays with Anthropic Claude Opus 4.7 scores 64.3% versus GPT-5.5's 58.6% — a 5.7-point gap on real GitHub issue resolution. OpenAI's system card includes an asterisk noting "evidence of memorization" from other labs on this e...
