Answer · April 26, 2026 · 36 sources

A comprehensive comparison of GPT-5.5 vs Claude Opus 4.7 benchmarks

Key takeaways

No complete, standardized, independent, same-conditions benchmark suite has been published for GPT-5.5 and Claude Opus 4.7, so the comparison below should be read as a conditional synthesis. Overall, GPT-5.5 is stronger on terminal/agentic tasks that manipulate an environment, web browsing, OS tasks, and some economic benchmarks; Claude Opus 4.7 appears stronger on SWE-Bench Pro, near-ceiling reasoning, and some finance agents.

Research answer

Note on reliability

Anthropic's official source has a Claude Opus 4.7 page stating that the model clearly improves tool calls, planning, and agent decision-making, but the public snippet does not provide the full table of numbers needed for a direct comparison with GPT-5.5 ^[3].
The OpenAI sources found include the GPT-5.5 System Card and separate benchmark documents such as GeneBench, but no single official page publishes a complete head-to-head table of GPT-5.5 vs Claude Opus 4.7 in a common format ^[1]^[9].
The direct comparison figures below therefore rely mainly on benchmark aggregators and third-party analyses; treat them as “self-reported or aggregated”, not as fully independent verification ^[11]^[13].

Main benchmark table

Benchmark	GPT-5.5	Claude Opus 4.7	Leader	Notes
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5	GPT-5.5 leads by a wide margin on terminal/agentic tasks in command-line environments ^[8].
SWE-Bench Pro	58.6%	64.3%	Claude Opus 4.7	Claude Opus 4.7 is ahead on this harder benchmark of real-world bug fixing and software development ^[8].
SWE-Bench Verified	No consistent figure in the sources compared	82.4% or 87.6%, depending on the source	Inconclusive	Third-party sources conflict: one reports 82.4% for Opus 4.7, another reports 87.6% ^[4]^[6].
GPQA Diamond	93.6%	94.2%	Claude Opus 4.7, very close	The gap is small; one source notes that frontier models have nearly converged on GPQA Diamond ^[7]^[14].
GDPval	84.9%	80.3%	GPT-5.5	GPT-5.5 is ahead on this evaluation of economic/office work tasks, per the aggregated table ^[8].
OSWorld-Verified	Ahead, per the aggregator	Lower than GPT-5.5	GPT-5.5	The aggregator reports that GPT-5.5 leads on OSWorld-Verified, but the snippet does not show the full numbers ^[11].
CyberGym	Ahead, per the aggregator	Lower than GPT-5.5	GPT-5.5	The aggregator reports that GPT-5.5 leads on CyberGym, but the snippet does not show the full numbers ^[11].
FinanceAgent v1.1	Lower than Opus 4.7, per the aggregator	Higher than GPT-5.5	Claude Opus 4.7	The aggregator reports that Opus 4.7 leads on FinanceAgent v1.1 ^[11].
MCP Atlas	Lower than Opus 4.7, per the aggregator	Higher than GPT-5.5	Claude Opus 4.7	The aggregator reports that Opus 4.7 leads on MCP Atlas ^[11].
BrowseComp	Higher than Opus 4.7, per the aggregator	Lower than GPT-5.5	GPT-5.5	The aggregator reports that GPT-5.5 leads on BrowseComp ^[11].
Humanity’s Last Exam	Sources conflict	Sources conflict	Inconclusive	LLM Stats reports Opus 4.7 leading HLE both with and without tools, while another article reports GPT-5.5 scoring higher on HLE and HLE with tools; no conclusion should be drawn without an official same-conditions table ^[11].
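To make the size of each gap easier to compare, the four rows that report a number for both models can be reduced to percentage-point differences. The sketch below only restates the table's own figures; the dictionary layout and the function name `summarize_gaps` are illustrative choices, not part of any source.

```python
# Scores (percent) copied from the table above. Only rows with a
# numeric value for both models are included.
REPORTED_SCORES = {
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Claude Opus 4.7": 69.4},
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "GPQA Diamond": {"GPT-5.5": 93.6, "Claude Opus 4.7": 94.2},
    "GDPval": {"GPT-5.5": 84.9, "Claude Opus 4.7": 80.3},
}


def summarize_gaps(scores):
    """Return (benchmark, leader, gap in percentage points) for each row."""
    rows = []
    for benchmark, by_model in scores.items():
        leader = max(by_model, key=by_model.get)
        trailer = min(by_model, key=by_model.get)
        gap = round(by_model[leader] - by_model[trailer], 1)
        rows.append((benchmark, leader, gap))
    return rows


if __name__ == "__main__":
    for benchmark, leader, gap in summarize_gaps(REPORTED_SCORES):
        print(f"{benchmark}: {leader} leads by {gap} points")
```

These gaps inherit every caveat of the underlying figures: they are self-reported or aggregated, and the rows without numbers cannot be summarized this way.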

Conclusions by task type

Real-world coding: Claude Opus 4.7 has the edge on SWE-Bench Pro, at 64.3% versus 58.6% for GPT-5.5 ^[8]. GPT-5.5, however, is clearly ahead on Terminal-Bench 2.0, at 82.7% versus 69.4%, so for agents that operate a terminal, run commands, navigate an environment, and automate tasks, GPT-5.5 appears stronger ^[8].
Scientific reasoning and hard knowledge: Claude Opus 4.7 is slightly ahead on GPQA Diamond, at 94.2% versus roughly 93.6% for GPT-5.5, but the gap is small and may not matter in practice ^[7]^[14].
Agents and work workflows: GPT-5.5 leads on GDPval, at 84.9% versus 80.3% for Claude Opus 4.7, and is reported by the aggregator as leading on BrowseComp, OSWorld-Verified, and CyberGym ^[8]^[11]. Claude Opus 4.7 is reported as leading on MCP Atlas and FinanceAgent v1.1, which suggests an advantage in some finance-agent and tool-orchestration workflows ^[11].
Certainty of the comparison: There is not yet enough independent evidence, gathered with the same prompts, the same sampling, and the same tool setup, to say which model is “more comprehensive” in absolute terms. The practical conclusion is that GPT-5.5 appears better for terminal agents, browsing/OS/cyber tasks, and some productivity workflows, while Claude Opus 4.7 appears better for SWE-Bench Pro, near-ceiling reasoning, and some finance agents ^[8]^[11].
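The remark that the GPQA Diamond gap may not matter can be checked with a rough sampling-noise estimate. The sketch below treats each score as a binomial proportion and expresses the gap in standard errors of the difference. It rests on assumptions that are not in the sources: a benchmark size of 198 questions (the commonly cited size of GPQA Diamond), a single run per model, and independence between the two runs, which is a simplification because both models answer the same questions.

```python
import math

# Assumption: GPQA Diamond contains 198 questions (commonly cited size).
GPQA_DIAMOND_QUESTIONS = 198


def binomial_standard_error(accuracy: float, num_questions: int) -> float:
    """Standard error of an accuracy measured on num_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, num_questions: int) -> float:
    """Absolute gap divided by the standard error of the difference.

    Treats the two runs as independent, which is a simplification.
    """
    se_a = binomial_standard_error(acc_a, num_questions)
    se_b = binomial_standard_error(acc_b, num_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) / se_diff


if __name__ == "__main__":
    z = gap_in_standard_errors(0.942, 0.936, GPQA_DIAMOND_QUESTIONS)
    print(f"GPQA Diamond gap: {z:.2f} standard errors")
```

Under these assumptions the 0.6-point gap comes to about a quarter of one standard error, far below any usual significance threshold, which is consistent with reading the two models as tied on this benchmark.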

Which model should you choose?

Choose GPT-5.5 if you prioritize:
- Terminal agents, automation, and shell/tool execution ^[8].
- OS, browsing, and cyber tasks, or general productivity workflows ^[11].
- GDPval-style benchmarks and general knowledge work ^[8].
Choose Claude Opus 4.7 if you prioritize:
- Coding that involves fixing complex issues in real repositories, especially as measured by SWE-Bench Pro ^[8].
- Hard scientific reasoning, although the advantage on GPQA Diamond is small ^[7]^[14].
- Finance agents or orchestration, as measured by benchmarks such as FinanceAgent v1.1 and MCP Atlas ^[11].
For a real product, run internal evals on your own data; current public benchmarks are not consistent enough to replace real-world A/B testing.
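An internal eval of the kind recommended above can start very small: run both models on the same cases with the same grader and compare pass rates. The following is a minimal sketch, not a production harness. `EvalCase`, `exact_match`, and `run_paired_eval` are hypothetical names introduced here, and the two "models" are stub functions standing in for whatever API clients you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A model is any callable that maps a prompt to an output string.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Case-insensitive exact match; swap in a task-specific grader."""
    return output.strip().lower() == expected.strip().lower()


def run_paired_eval(
    cases: List[EvalCase],
    models: Dict[str, ModelFn],
    grader: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Run every model on the same cases and return each pass rate."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passes = {name: 0 for name in models}
    for case in cases:
        for name, model in models.items():
            if grader(model(case.prompt), case.expected):
                passes[name] += 1
    return {name: count / len(cases) for name, count in passes.items()}


if __name__ == "__main__":
    cases = [EvalCase("2 + 2 =", "4"), EvalCase("Capital of France?", "Paris")]
    # Stub models (hypothetical); replace with real client calls.
    models = {
        "model_a": lambda prompt: "4" if "2 + 2" in prompt else "Paris",
        "model_b": lambda prompt: "4" if "2 + 2" in prompt else "Lyon",
    }
    print(run_paired_eval(cases, models))
```

Keeping the cases, prompts, and grader identical across models is the point of the design: it removes the differences in prompt, sampling, and tool setup that make the public numbers hard to compare.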

Answered by Studio Global AI · April 26, 2026 · 36 sources

Sources

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ... (mindstudio.ai)
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ... (help.apiyi.com)
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[3] Claude Opus 4.7 Benchmark: Memory & Effort Levels Tested (datacamp.com)
Note: Pricing is $5 per million input tokens and $25 per million output tokens which is identical to Opus 4.6. If you want to explore this model in depth, this article by DataCamp team is a good read. A few numbers worth knowing before we test it: Benchmark...
[4] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[5] Everything You Need to Know About GPT-5.5 - Vellum (vellum.ai)
Columns: GPT-5.5, GPT-5.5 Pro, GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro. Terminal-Bench 2.0: 82.7%, —, 75.1%, 69.4%, 68.5%; SWE-Bench Pro: 58.6%, —, 57.7%, 64.3%, 54.2%; Expert-SWE (Internal): 73.1%, —, 68.5%, —, —; GDPval: 84.9%, 82.3%, 83.0%, 80.3%, 67.3%; OSWorld-Verifi...
[6] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)
Which is better, GPT-5.5 or Claude Opus 4.7? On the 10 benchmarks both providers report, Opus 4.7 leads on 6 (GPQA, HLE no tools, HLE with tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1) and GPT-5.5 leads on 4 (Terminal-Bench 2.0, BrowseComp, OSWorld-Ve...
[7] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance ... (mindstudio.ai)
SWE-Bench and Coding Tasks: On SWE-Bench Verified — the standard benchmark for evaluating real GitHub issue resolution — both models score competitively at the top of the 2026 leaderboard. GPT-5.5 holds a slight edge on problems requiring precise tool use an...
[8] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
Self-reported by the model provider. Score may not be independently verified. Similar Models: How GPT-5.5 compares to models with the closest performance across key benchmarks. GPT-5.5, GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.7, GPT-5.2 Pro, Claude Mythos...
[9] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Reasoning, Math, and Science. Columns: GPT-5.5, GPT-5.5 Pro, GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro. MMLU: 92.4% - -; GPQA Diamond: 93.6% 92.8% 94.2% 94.3%; ARC-AGI-2: 85.0% 73.3% 77.1%; ARC-AGI-1: 95.0% 93.7% -; FrontierMath T1-3: 51.7% 52.4% 47.6% 43.8%; F...
[10] OpenAI's GPT-5.5 masters agentic coding with 82.7% benchmark ... (interestingengineering.com)
GPT-5.5 crushes Claude Opus 4.7 in agentic coding with 82.7% terminal-bench score. GPT-5.5 introduces smarter task...
[11] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent. Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent. Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent. Humanity's...
[12] Deedy on X: "GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding. https://t.co/AiRKqgRjxS" (x.com)
[13] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAI (cdn.openai.com)
Figure residue: models listed are Kimi K2.5, Grok 4.20 (thinking), Qwen 3.6, MiMo V2.5 Pro, GLM 5.1, Kimi K2.6, Gemini 3.1 Pro (high); levels none, low, med, high, xhigh; axis "Mean pass rate across problems" (0% to 40%); 1.6% 1....
[14] Building more with GPT-5.1-Codex-Max - OpenAI (openai.com)
Frontier coding capabilities: GPT‑5.1‑Codex‑Max was trained on real-world software engineering tasks, like PR creation, code review, frontend coding, and Q&A and outperforms our previous models on many frontier coding evaluations. The model’s gains on benchm...
[15] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[16] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[17] Introducing GPT-5 - OpenAI (openai.com)
Evaluations: GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 witho...
[18] Introducing GPT-5.2-Codex - OpenAI (openai.com)
GPT‑5.2‑Codex achieves state-of-the-art performance on SWE-Bench Pro and Terminal-Bench 2.0, benchmarks designed to test agentic performance on a wide variety of tasks in realistic terminal environments. It is also much more effective and reliable at agenti...
[19] Introducing GPT-5.3-Codex (openai.com)
Appendix. Columns: GPT-5.3-Codex (xhigh), GPT-5.2-Codex (xhigh), GPT-5.2 (xhigh). SWE-Bench Pro (Public): 56.8%, 56.4%, 55.6%; Terminal-Bench 2.0: 77.3%, 64.0%, 62.2%; OSWorld-Verified: 64.7%, 38.2%, 37.9%; GDPval (wins or ties): 70.9%, -, 70.9% (high); Cybersecurity Capture The Flag Challenges: 77.6%...
[20] Introducing GPT-5.4 - OpenAI (openai.com)
Columns: GPT‑5.4, GPT‑5.4 Pro, GPT‑5.3-Codex, GPT‑5.2, GPT‑5.2 Pro. GDPval: 83.0%, 82.0%, 70.9%, 70.9%, 74.1%; FinanceAgent v1.1: 56.0%, 61.5%, 54.0%, 59.5%, —; Investment Banking Modeling Tasks (Internal): 87.3%, 83.6%, 79.3%, 68.4%, 71.7%; OfficeQA: 68.1%, —, 65.1%, 63.1%, —. Coding eval columns: GPT‑5.4, GPT‑5.4 Pro, GPT‑...
[21] Introducing GPT-5.4 mini and nano - OpenAI (openai.com)
Coding. Columns: GPT-5.4 (xhigh), GPT-5.4 mini (xhigh), GPT-5.4 nano (xhigh), GPT-5 mini (high¹). SWE-bench Pro (Public): 57.7%, 54.4%, 52.4%, 45.7%; Terminal-Bench 2.0: 75.1%, 60.0%, 46.3%, 38.2%. Tool-calling columns: GPT-5.4 (xhigh), GPT-5.4 mini (xhigh), GPT-5.4 nano (xhigh), GPT-...
[22] Introducing GPT‑5 for developers - OpenAI (openai.com)
GPT‑5 is state-of-the-art (SOTA) across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot. We trained GPT‑5 to be a true coding collaborator. It excels at producing high-quality code and handling tasks such as fixing bugs,...
[23] Introducing GPT-5.5 - OpenAI (openai.com)
Agentic coding: GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro,...
[24] [PDF] OpenAI o1 System Card (cdn.openai.com)
We also ran contextual evaluations not included here, including on GPQA biology, WMDP biology and chemistry splits, an organic chemistry molecular structure dataset, and a synthetic biology translation dataset. [...] o1 (Post-Mitigation) performs similarly...
[25] Claude Opus 4.7 - Anthropic (anthropic.com)
In our evals, we saw a double digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[26] Introducing Claude 4 - Anthropic (anthropic.com)
Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps...
[27] Introducing Claude Opus 4.5 - Anthropic (anthropic.com)
Methodology: All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), default sampling settings (temperature, top p), and averaged over 5 independent trials. Exceptions: SWE-bench Verified (no thinki...
[28] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[29] Introducing Claude Sonnet 4.5 - Anthropic (anthropic.com)
Related content: Introducing Claude Design by Anthropic Labs. Today, we’re launching Claude Design, a new Anthropic Labs product that lets you collaborate with Claude to create polished visual work like designs, prototypes, slides, one-pagers, and more. R...
[30] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
SWE-bench results are averaged over 25 trials. 2.4 SWE-bench (Verified and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[31] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
SWE-bench results are averaged over 25 trials. 2.4 SWE-bench (Verified and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[32] [PDF] Claude Sonnet 4.6 System Card - Anthropic (www-cdn.anthropic.com)
reasoning or knowledge). For details on how we attempt to decontaminate our evaluations, see Section 2.2 of the Claude Opus 4.5 System Card. Evaluation: Claude family models, Other models; Claude Sonnet 4.6, Claude Opus 4....
[33] Newsroom - Anthropic (anthropic.com)
Press inquiries: press@anthropic.com. Non-media inquiries: support@anthropic.com. Introducing Claude Opus 4.7 Introd...
[34] System Card: Claude Opus 4 & Claude Sonnet 4 (www-cdn.anthropic.com)
https://arxiv.org/abs/2110.08193. Columns: Model, Disambiguated bias (%), Ambiguous bias (%). Claude Opus 4: -0.60, 0.21; Claude Sonnet 4: -1.16, 0.61; Claude Sonnet 3.7: -0.98, 0.89; Claude Opus 3: 0.77, 1.21. Table 2.6.A Bias scores on the Bias Benchmark for Question Answerin...
[35] Anthropic's Transparency Hub (anthropic.com)
Model description: Claude Opus 4 and Claude Sonnet 4 are two new hybrid reasoning large language models from Anthropic. They have advanced capabilities in reasoning, visual analysis, computer use, and tool use. They are particularly adept at complex computer...
[36] Claude Opus 4.5 System Card (anthropic.com)
With a 128k thinking budget; with a 64k thinking budget, the score is 57.8%. Without extended thinking. 2.4 SWE-bench (Verified, Pro, and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. F...

Câu trả lờiURL công khai26 thg 4, 202636 nguồn

So sánh một cách toàn diện benchmarks của GPT-5.5 vs Claude Opus 4.7

Bài học chính

GPT-5.5 và Claude Opus 4.7 không có một bộ benchmark “chuẩn hóa, độc lập, cùng điều kiện” được công bố đầy đủ, nên so sánh dưới đây nên xem là tổng hợp có điều kiện. Nhìn chung: GPT-5.5 mạnh hơn ở tác vụ terminal/agentic thao tác môi trường, duyệt web, OS và một số benchmark kinh tế; Claude Opus 4.7
GPT-5.5 và Claude Opus 4.7 không có một bộ benchmark “chuẩn hóa, độc lập, cùng điều kiện” được công bố đầy đủ, nên so sánh dưới đây nên xem là tổng hợp có điều kiện. Nhìn chung: GPT-5.5 mạnh hơn ở tác vụ terminal/agentic thao tác môi trường, duyệt web, OS và một số benchmark kinh
## Lưu ý về độ tin cậy

Câu trả lời nghiên cứu

Lưu ý về độ tin cậy

Nguồn chính thức của Anthropic có trang Claude Opus 4.7 và nói mô hình này cải thiện rõ ở tool calls, planning và agent decision-making, nhưng snippet công khai không cho đủ toàn bộ bảng số để đối chiếu trực tiếp với GPT-5.5 ^[3].
Nguồn OpenAI tìm được gồm GPT-5.5 System Card và tài liệu benchmark riêng như GeneBench, nhưng không thấy một trang chính thức duy nhất công bố toàn bộ bảng đối đầu GPT-5.5 vs Claude Opus 4.7 theo cùng format ^[1]^[9].
Vì vậy, các con số so sánh trực tiếp bên dưới chủ yếu dựa trên các nguồn tổng hợp/benchmark aggregator và bài phân tích bên thứ ba; cần coi là “tự báo cáo hoặc tổng hợp”, không phải kết quả kiểm định độc lập hoàn toàn ^[11]^[13].

Bảng benchmark chính

Nhóm benchmark	GPT-5.5	Claude Opus 4.7	Mô hình nhỉnh hơn	Nhận xét
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5	GPT-5.5 dẫn khá xa ở tác vụ terminal/agentic trong môi trường dòng lệnh ^[8].
SWE-Bench Pro	58.6%	64.3%	Claude Opus 4.7	Claude Opus 4.7 nhỉnh hơn ở benchmark sửa lỗi/phát triển phần mềm thực tế dạng khó ^[8].
SWE-Bench Verified	Không đủ số nhất quán từ nguồn đối chiếu	82.4% hoặc 87.6% tùy nguồn	Không kết luận chắc	Có mâu thuẫn giữa nguồn bên thứ ba: một nguồn ghi Opus 4.7 đạt 82.4%, nguồn khác ghi 87.6% ^[4]^[6].
GPQA Diamond	93.6%	94.2%	Claude Opus 4.7, rất sát	Chênh lệch nhỏ; một nguồn nhận định các frontier model gần như đã hội tụ trên GPQA Diamond ^[7]^[14].
GDPval	84.9%	80.3%	GPT-5.5	GPT-5.5 nhỉnh hơn ở đánh giá tác vụ kinh tế/công việc văn phòng theo bảng tổng hợp ^[8].
OSWorld-Verified	Có lợi thế theo tổng hợp	Thấp hơn GPT-5.5	GPT-5.5	Nguồn tổng hợp nói GPT-5.5 dẫn ở OSWorld-Verified, nhưng snippet không hiển thị đầy đủ số ^[11].
CyberGym	Có lợi thế theo tổng hợp	Thấp hơn GPT-5.5	GPT-5.5	Nguồn tổng hợp nói GPT-5.5 dẫn ở CyberGym, nhưng snippet không hiển thị đầy đủ số ^[11].
FinanceAgent v1.1	Thấp hơn Opus 4.7 theo tổng hợp	Cao hơn GPT-5.5	Claude Opus 4.7	Nguồn tổng hợp nói Opus 4.7 dẫn ở FinanceAgent v1.1 ^[11].
MCP Atlas	Thấp hơn Opus 4.7 theo tổng hợp	Cao hơn GPT-5.5	Claude Opus 4.7	Nguồn tổng hợp nói Opus 4.7 dẫn ở MCP Atlas ^[11].
BrowseComp	Cao hơn Opus 4.7 theo tổng hợp	Thấp hơn GPT-5.5	GPT-5.5	Nguồn tổng hợp nói GPT-5.5 dẫn ở BrowseComp ^[11].
Humanity’s Last Exam	Nguồn mâu thuẫn	Nguồn mâu thuẫn	Không kết luận chắc	LLM Stats nói Opus 4.7 dẫn HLE có và không có tools, trong khi một bài khác ghi GPT-5.5 cao hơn ở HLE và HLE with tools; đây là điểm không nên kết luận nếu chưa có bảng chính thức cùng điều kiện ^[11].

Kết luận theo từng loại tác vụ

Coding thực chiến: Claude Opus 4.7 có lợi thế trên SWE-Bench Pro với 64.3% so với GPT-5.5 là 58.6% ^[8]. Tuy nhiên GPT-5.5 lại vượt rõ ở Terminal-Bench 2.0 với 82.7% so với 69.4%, nên nếu tác vụ là agent thao tác terminal, chạy lệnh, điều hướng môi trường và tự động hóa, GPT-5.5 có vẻ mạnh hơn ^[8].
Reasoning khoa học/tri thức khó: Claude Opus 4.7 nhỉnh hơn trên GPQA Diamond với 94.2% so với GPT-5.5 khoảng 93.6%, nhưng mức chênh này nhỏ và có thể không đáng kể trong thực tế ^[7]^[14].
Agent và workflow công việc: GPT-5.5 dẫn ở GDPval với 84.9% so với Claude Opus 4.7 là 80.3%, đồng thời được tổng hợp là dẫn ở BrowseComp, OSWorld-Verified và CyberGym ^[8]^[11]. Claude Opus 4.7 được tổng hợp là dẫn ở MCP Atlas và FinanceAgent v1.1, tức có lợi thế ở một số workflow agent tài chính/điều phối công cụ ^[11].
Độ chắc chắn của so sánh: Chưa có đủ bằng chứng độc lập, cùng prompt, cùng sampling, cùng tool setup để nói mô hình nào “toàn diện hơn” một cách tuyệt đối. Kết luận thực dụng là GPT-5.5 có vẻ tốt hơn cho terminal agents, browsing/OS/cyber và một số workflow năng suất; Claude Opus 4.7 có vẻ tốt hơn cho SWE-Bench Pro, reasoning sát ngưỡng cao và một số agent tài chính ^[8]^[11].

Chọn mô hình nào?

Chọn GPT-5.5 nếu bạn ưu tiên:
- Agent chạy terminal, automation, shell/tool execution ^[8].
- Tác vụ OS, browsing, cyber hoặc workflow năng suất tổng quát ^[11].
- Benchmark GDPval/công việc tri thức tổng quát ^[8].
Chọn Claude Opus 4.7 nếu bạn ưu tiên:
- Coding kiểu sửa issue phức tạp trên repo thật, nhất là SWE-Bench Pro ^[8].
- Reasoning khoa học khó, dù lợi thế trên GPQA Diamond là nhỏ ^[7]^[14].
- Agent tài chính hoặc orchestration theo các benchmark như FinanceAgent v1.1 và MCP Atlas ^[11].
Nếu dùng cho sản phẩm thật, nên chạy eval nội bộ trên dữ liệu của bạn; benchmark công khai hiện tại chưa đủ đồng nhất để thay thế A/B test thực tế.

Hỗ trợ hình ảnh

OpenAI logo illustration

Trao đổi nghiên cứu

Bạn26 thg 4, 2026

Câu hỏi nghiên cứu

So sánh một cách toàn diện benchmarks của GPT-5.5 vs Claude Opus 4.7

Studio Global AI26 thg 4, 202636 nguồn

Sources

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ... (mindstudio.ai)
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ... (help.apiyi.com)
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[3] Claude Opus 4.7 Benchmark: Memory & Effort Levels Tested (datacamp.com)
Note: Pricing is $5 per million input tokens and $25 per million output tokens which is identical to Opus 4.6. If you want to explore this model in depth, this article by DataCamp team is a good read. A few numbers worth knowing before we test it: Benchmark...
[4] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[5] Everything You Need to Know About GPT-5.5 - Vellum (vellum.ai)
Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- Terminal-Bench 2.0 82.7% — 75.1% 69.4% 68.5% SWE-Bench Pro 58.6% — 57.7% 64.3% 54.2% Expert-SWE (Internal) 73.1% — 68.5% — — GDPval 84.9% 82.3% 83.0% 80.3% 67.3% OSWorld-Verifi...
[6] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)
01 Which is better, GPT-5.5 or Claude Opus 4.7?On the 10 benchmarks both providers report,Opus 4.7 leads on 6 (GPQA, HLE no tools, HLE with tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1) andGPT-5.5 leads on 4 (Terminal-Bench 2.0, BrowseComp, OSWorld-Ve...
[7] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance ... (mindstudio.ai)
SWE-Bench and Coding Tasks On SWE-Bench Verified — the standard benchmark for evaluating real GitHub issue resolution — both models score competitively at the top of the 2026 leaderboard. GPT-5.5 holds a slight edge on problems requiring precise tool use an...
[8] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
Show 18 more Self-reported by the model provider. Score may not be independently verified. Similar Models How GPT-5.5 compares to models with the closest performance across key benchmarks. GPT-5.5GPT-5.4Gemini 3.1 ProClaude Opus 4.7GPT-5.2 ProClaude Mythos...
[9] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[10] OpenAI's GPT-5.5 masters agentic coding with 82.7% benchmark ... (interestingengineering.com)
GPT-5.5 crushes Claude Opus 4.7 in agentic coding with 82.7% terminal-bench score GPT-5.5 introduces smarter task...
[11] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[12] Deedy on X: "GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding. https://t.co/AiRKqgRjxS" / X (x.com)
Deedy on X: "GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding. / X Don’t mi...
[13] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAI (cdn.openai.com)
Kimi K2.5 Grok 4.20 (thinking) Qwen 3.6 MiMo V2.5 Pro GLM 5.1 Kimi K2.6 Gemini 3.1 Pro (high) none low med high none low med high xhigh none low med high xhigh none low med high xhigh 5.0 5.2 5.4 5.5 0% 10% 20% 30% 40% Mean pass rate across problems 1.6% 1....
[14] Building more with GPT-5.1-Codex-Max - OpenAI (openai.com)
Frontier coding capabilities GPT‑5.1‑Codex‑Max was trained on real-world software engineering tasks, like PR creation, code review, frontend coding, and Q&A and outperforms our previous models on many frontier coding evaluations. The model’s gains on benchm...
[15] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[16] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[17] Introducing GPT-5 - OpenAI (openai.com)
Evaluations GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 witho...
[18] Introducing GPT-5.2-Codex - OpenAI (openai.com)
GPT‑5.2‑Codex achieves state-of-the-art performance on SWE-Bench Pro and Terminal-Bench 2.0, benchmarks designed to test agentic performance on a wide variety of tasks in realistic terminal environments. It is also much more effective and reliable at agenti...
[19] Introducing GPT-5.3-Codex (openai.com)
Appendix GPT-5.3-Codex (xhigh)GPT-5.2-Codex (xhigh)GPT-5.2 (xhigh) SWE-Bench Pro (Public)56.8%56.4%55.6% Terminal-Bench 2.077.3%64.0%62.2% OSWorld-Verified64.7%38.2%37.9% GDPval (wins or ties)70.9%-70.9% (high) Cybersecurity Capture The Flag Challenges77.6%...
[20] Introducing GPT-5.4 - OpenAI (openai.com)
EvalGPT‑5.4GPT‑5.4 ProGPT‑5.3-CodexGPT‑5.2GPT‑5.2 Pro GDPval 83.0%82.0%70.9%70.9%74.1% FinanceAgent v1.1 56.0%61.5%54.0%59.5%— Investment Banking Modeling Tasks (Internal)87.3%83.6%79.3%68.4%71.7% OfficeQA 68.1%—65.1%63.1%— Coding EvalGPT‑5.4GPT‑5.4 ProGPT‑...
[21] Introducing GPT-5.4 mini and nano - OpenAI (openai.com)
Coding GPT-5.4 (xhigh) GPT-5.4 mini (xhigh) GPT-5.4 nano (xhigh) GPT-5 mini (high¹) --- --- SWE-bench Pro (Public) 57.7% 54.4% 52.4% 45.7% Terminal-Bench 2.0 75.1% 60.0% 46.3% 38.2% Tool-calling GPT-5.4 (xhigh) GPT-5.4 mini (xhigh) GPT-5.4 nano (xhigh) GPT-...
[22] Introducing GPT‑5 for developers - OpenAI (openai.com)
GPT‑5 is state-of-the-art (SOTA) across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot. We trained GPT‑5 to be a true coding collaborator. It excels at producing high-quality code and handling tasks such as fixing bugs,...
[23] Introducing GPT-5.5 - OpenAI (openai.com)
Agentic coding GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro,...
[24] [PDF] OpenAI o1 System Card (cdn.openai.com)
We also ran contextual evaluations not included here, including on GPQA biology, WMDP biology and chemistry splits, an organic chemistry molecular structure dataset, and a synthetic biology translation dataset. [...] o1 (Post-Mitigation) performs similarly...
[25] Claude Opus 4.7 - Anthropic (anthropic.com)
Image 15: logo In our evals, we saw a double digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[26] Introducing Claude 4 - Anthropic (anthropic.com)
Claude 4 Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps...
[27] Introducing Claude Opus 4.5 - Anthropic (anthropic.com)
Methodology All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), default sampling settings (temperature, top p), and averaged over 5 independent trials. Exceptions: SWE-bench Verified (no thinki...
[28] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Image 15: logo In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[29] Introducing Claude Sonnet 4.5 - Anthropic (anthropic.com)
Related content Introducing Claude Design by Anthropic Labs Today, we’re launching Claude Design, a new Anthropic Labs product that lets you collaborate with Claude to create polished visual work like designs, prototypes, slides, one-pagers, and more. R...
[30] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
2 SWE-bench results are averaged over 25 trials. 18 2.4 SWE-bench (Verified and Multilingual) SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[31] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
2 SWE-bench results are averaged over 25 trials. 17 2.4 SWE-bench (Verified and Multilingual) SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[32] [PDF] Claude Sonnet 4.6 System Card - Anthropic (www-cdn.anthropic.com)
reasoning or knowledge). For details on how we attempt to decontaminate our evaluations, see Section 2.2 of the Claude Opus 4.5 System Card . 14 Evaluation Claude family models Other models Claude Sonnet 4.6 Claude Opus 4....
[33] Newsroom - Anthropic (anthropic.com)
Skip to main contentSkip to footer []( Research Economic Futures Commitments Learn News Try Claude Newsroom Press inquirespress@anthropic.com Non-media inquiriessupport@anthropic.com Media assetsDownload press kit Image 1: Introducing Claude Opus 4.7 Introd...
[34] System Card: Claude Opus 4 & Claude Sonnet 4 (www-cdn.anthropic.com)
https:/ /arxiv.org/abs/2110.08193 16 Model Disambiguated bias (%) Ambiguous bias (%) Claude Opus 4 -0.60 0.21 Claude Sonnet 4 -1.16 0.61 Claude Sonnet 3.7 -0.98 0.89 Claude Opus 3 0.77 1.21 Table 2.6.A Bias scores on the Bias Benchmark for Question Answerin...
[35] Anthropic's Transparency Hub (anthropic.com)
Model descriptionClaude Opus 4 and Claude Sonnet 4 are two new hybrid reasoning large language models from Anthropic. They have advanced capabilities in reasoning, visual analysis, computer use, and tool use. They are particularly adept at complex computer...
[36] Claude Opus 4.5 System Card (anthropic.com)
5 With a 128k thinking budget; with a 64k thinking budget, the score is 57.8%. 4 Without extended thinking. 19 2.4 SWE-bench (Verified, Pro, and Multilingual) SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. F...

Nguồn

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ...help.apiyi.com
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[3] Claude Opus 4.7 Benchmark: Memory & Effort Levels Testeddatacamp.com
Note: Pricing is $5 per million input tokens and $25 per million output tokens which is identical to Opus 4.6. If you want to explore this model in depth, this article by DataCamp team is a good read. A few numbers worth knowing before we test it: Benchmark...
[4] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNWthenextweb.com
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[5] Everything You Need to Know About GPT-5.5 - Vellumvellum.ai
Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- Terminal-Bench 2.0 82.7% — 75.1% 69.4% 68.5% SWE-Bench Pro 58.6% — 57.7% 64.3% 54.2% Expert-SWE (Internal) 73.1% — 68.5% — — GDPval 84.9% 82.3% 83.0% 80.3% 67.3% OSWorld-Verifi...
[6] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Statsllm-stats.com
01 Which is better, GPT-5.5 or Claude Opus 4.7?On the 10 benchmarks both providers report,Opus 4.7 leads on 6 (GPQA, HLE no tools, HLE with tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1) andGPT-5.5 leads on 4 (Terminal-Bench 2.0, BrowseComp, OSWorld-Ve...
[7] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance ...mindstudio.ai
SWE-Bench and Coding Tasks On SWE-Bench Verified — the standard benchmark for evaluating real GitHub issue resolution — both models score competitively at the top of the 2026 leaderboard. GPT-5.5 holds a slight edge on problems requiring precise tool use an...
[8] GPT-5.5: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
Show 18 more Self-reported by the model provider. Score may not be independently verified. Similar Models How GPT-5.5 compares to models with the closest performance across key benchmarks. GPT-5.5GPT-5.4Gemini 3.1 ProClaude Opus 4.7GPT-5.2 ProClaude Mythos...
[9] GPT-5.5: The Complete Guide (2026) - o-mega | AIo-mega.ai
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[10] OpenAI's GPT-5.5 masters agentic coding with 82.7% benchmark ...interestingengineering.com
About UsAdvertise ContactFAQ Follow Us On LinkedInXInstagramFlipboardFacebookYouTubeTikTok All Rights Reserved, IE Media, Inc. AI and Robotics GPT-5.5 crushes Claude Opus 4.7 in agentic coding with 82.7% terminal-bench score GPT-5.5 introduces smarter task...
[11] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashablemashable.com
Thanks for signing up! SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent\ Humanity's...
[12] Deedy on X: "GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding. https://t.co/AiRKqgRjxS" / Xx.com
Deedy on X: "GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding. / X Don’t mi...
[13] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAIcdn.openai.com
Kimi K2.5 Grok 4.20 (thinking) Qwen 3.6 MiMo V2.5 Pro GLM 5.1 Kimi K2.6 Gemini 3.1 Pro (high) none low med high none low med high xhigh none low med high xhigh none low med high xhigh 5.0 5.2 5.4 5.5 0% 10% 20% 30% 40% Mean pass rate across problems 1.6% 1....
[14] Building more with GPT-5.1-Codex-Max - OpenAIopenai.com
Frontier coding capabilities GPT‑5.1‑Codex‑Max was trained on real-world software engineering tasks, like PR creation, code review, frontend coding, and Q&A and outperforms our previous models on many frontier coding evaluations. The model’s gains on benchm...
[15] GPT-5.5 System Card - OpenAI Deployment Safety Hubdeploymentsafety.openai.com
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[16] GPT-5.5 System Card - OpenAI Deployment Safety Hubdeploymentsafety.openai.com
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[17] Introducing GPT-5 - OpenAIopenai.com
Evaluations GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 witho...
[18] Introducing GPT-5.2-Codex - OpenAIopenai.com
GPT‑5.2‑Codex achieves state-of-the-art performance on SWE-Bench Pro and Terminal-Bench 2.0, benchmarks designed to test agentic performance on a wide variety of tasks in realistic terminal environments. It is also much more effective and reliable at agenti...
[19] Introducing GPT-5.3-Codex (openai.com)
Appendix (columns: GPT-5.3-Codex (xhigh), GPT-5.2-Codex (xhigh), GPT-5.2 (xhigh)): SWE-Bench Pro (Public) 56.8%, 56.4%, 55.6%; Terminal-Bench 2.0 77.3%, 64.0%, 62.2%; OSWorld-Verified 64.7%, 38.2%, 37.9%; GDPval (wins or ties) 70.9%, -, 70.9% (high); Cybersecurity Capture The Flag Challenges 77.6%...
[20] Introducing GPT-5.4 - OpenAI (openai.com)
Eval (columns: GPT‑5.4, GPT‑5.4 Pro, GPT‑5.3-Codex, GPT‑5.2, GPT‑5.2 Pro): GDPval 83.0%, 82.0%, 70.9%, 70.9%, 74.1%; FinanceAgent v1.1 56.0%, 61.5%, 54.0%, 59.5%, —; Investment Banking Modeling Tasks (Internal) 87.3%, 83.6%, 79.3%, 68.4%, 71.7%; OfficeQA 68.1%, —, 65.1%, 63.1%, —. Coding: Eval, GPT‑5.4, GPT‑5.4 Pro, GPT‑...
[21] Introducing GPT-5.4 mini and nano - OpenAI (openai.com)
Coding (columns: GPT-5.4 (xhigh), GPT-5.4 mini (xhigh), GPT-5.4 nano (xhigh), GPT-5 mini (high¹)): SWE-bench Pro (Public) 57.7%, 54.4%, 52.4%, 45.7%; Terminal-Bench 2.0 75.1%, 60.0%, 46.3%, 38.2%. Tool-calling: GPT-5.4 (xhigh), GPT-5.4 mini (xhigh), GPT-5.4 nano (xhigh), GPT-...
[22] Introducing GPT‑5 for developers - OpenAI (openai.com)
GPT‑5 is state-of-the-art (SOTA) across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot. We trained GPT‑5 to be a true coding collaborator. It excels at producing high-quality code and handling tasks such as fixing bugs,...
[23] Introducing GPT-5.5 - OpenAI (openai.com)
Agentic coding GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro,...
[24] [PDF] OpenAI o1 System Card (cdn.openai.com)
We also ran contextual evaluations not included here, including on GPQA biology, WMDP biology and chemistry splits, an organic chemistry molecular structure dataset, and a synthetic biology translation dataset. [...] o1 (Post-Mitigation) performs similarly...
[25] Claude Opus 4.7 - Anthropic (anthropic.com)
In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[26] Introducing Claude 4 - Anthropic (anthropic.com)
Claude 4 Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps...
[27] Introducing Claude Opus 4.5 - Anthropic (anthropic.com)
Methodology All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), default sampling settings (temperature, top p), and averaged over 5 independent trials. Exceptions: SWE-bench Verified (no thinki...
[28] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7...
[29] Introducing Claude Sonnet 4.5 - Anthropic (anthropic.com)
Related content: Introducing Claude Design by Anthropic Labs. Today, we’re launching Claude Design, a new Anthropic Labs product that lets you collaborate with Claude to create polished visual work like designs, prototypes, slides, one-pagers, and more. R...
[30] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
Footnote 2: SWE-bench results are averaged over 25 trials. Section 2.4, SWE-bench (Verified and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[31] [PDF] Claude Opus 4.6 System Card - Anthropic (www-cdn.anthropic.com)
Footnote 2: SWE-bench results are averaged over 25 trials. Section 2.4, SWE-bench (Verified and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. For the SWE-bench Verified variant, developed by OpenAI, models...
[32] [PDF] Claude Sonnet 4.6 System Card - Anthropic (www-cdn.anthropic.com)
reasoning or knowledge). For details on how we attempt to decontaminate our evaluations, see Section 2.2 of the Claude Opus 4.5 System Card. Table header: Evaluation, Claude family models, Other models, Claude Sonnet 4.6, Claude Opus 4....
[33] Newsroom - Anthropic (anthropic.com)
Newsroom. Press inquiries: press@anthropic.com. Non-media inquiries: support@anthropic.com. Media assets: Download press kit. Introducing Claude Opus 4.7 Introd...
[34] System Card: Claude Opus 4 & Claude Sonnet 4 (www-cdn.anthropic.com)
https://arxiv.org/abs/2110.08193. Model: Disambiguated bias (%), Ambiguous bias (%). Claude Opus 4: -0.60, 0.21; Claude Sonnet 4: -1.16, 0.61; Claude Sonnet 3.7: -0.98, 0.89; Claude Opus 3: 0.77, 1.21. Table 2.6.A Bias scores on the Bias Benchmark for Question Answerin...
[35] Anthropic's Transparency Hub (anthropic.com)
Model description: Claude Opus 4 and Claude Sonnet 4 are two new hybrid reasoning large language models from Anthropic. They have advanced capabilities in reasoning, visual analysis, computer use, and tool use. They are particularly adept at complex computer...
[36] Claude Opus 4.5 System Card (anthropic.com)
Footnote 5: With a 128k thinking budget; with a 64k thinking budget, the score is 57.8%. Footnote 4: Without extended thinking. Section 2.4, SWE-bench (Verified, Pro, and Multilingual): SWE-bench (Software Engineering Bench) tests AI models on real-world software engineering tasks. F...