Report · Published May 5, 2026 · Last edited May 6, 2026 · 20 sources

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: 2026 benchmark comparison

Claude Opus 4.7 is the option with the strongest public evidence for coding and agents: Vals AI records 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16]. GPT-5.5 is very strong at reasoning according to secondary sources: O-Mega reports 92.4% on MMLU, 93.6% on GPQA Diamond, 85.0% on ARC-AGI-2, and 95.0% on ARC-AGI-1...


Putting these four models into a single ranking sounds tidy, but it is easy to mislead. The public data is not equally deep: Claude Opus 4.7 has both official signals from Anthropic and external leaderboards; GPT-5.5 has very high reasoning scores, but mostly from secondary sources; DeepSeek V4/V4 Pro has many technical claims and community evaluations; and Kimi K2.6 so far has only a few isolated signals.

A more sensible reading separates two questions: which model scores well, and how far that score can be trusted.

Quick verdict

Model	Most cautious reading	Evidence reliability
Claude Opus 4.7	The candidate with the strongest public record for coding, agents, and multi-step tasks. Anthropic reports 0.715 on its internal research-agent benchmark, and Vals AI ranks Claude Opus 4.7 first on SWE-bench with 82.00% ^[16]^[17].	High to medium
GPT-5.5	Very strong at general reasoning: O-Mega reports 92.4% on MMLU, 93.6% on GPQA Diamond, 85.0% on ARC-AGI-2, and 95.0% on ARC-AGI-1 ^[3].	Medium
DeepSeek V4 / V4 Pro	Promising for coding and technical experimentation, but the source data mixes V4, V4 Pro, and V4 Pro High ^[25]^[27].	Medium to low
Kimi K2.6	Early signals exist (LLM Stats lists 0.91 on GPQA, and WhatLLM places it in its top 10 by Quality Index), but benchmark coverage is too thin for a like-for-like comparison ^[7]^[21].	Low

The benchmarks can be tabulated side by side, but should not be pooled mechanically

Benchmark or metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	How to read it
SWE-bench	82.00% on Vals AI, updated April 24, 2026 ^[17]	No comparable figure in the source set	81% stated by NxCode for DeepSeek V4 ^[26]	No comparable figure	The cleanest public signal currently favors Claude.
SWE-bench Verified	87.6% per Vellum; 83.5% ± 1.7 per LMCouncil ^[20]^[9]	No comparable figure	Hugging Face mentions SWE-bench Verified in a community evaluation for DeepSeek-V4-Pro, but the summary does not show a number ^[25]	No comparable figure	Scores vary by source, configuration, subset, and model variant.
SWE-bench Pro	64.3% per Vellum ^[20]	No comparable figure	Hugging Face mentions SWE-bench Pro in a community evaluation, but the summary does not show a number ^[25]	No comparable figure	A better fit when evaluating software agents on long-horizon work.
GPQA Diamond	94.2% per O-Mega, Vellum, and TNW ^[3]^[12]^[15]	93.6% per O-Mega and Vellum ^[3]^[12]	Included in community evaluation suites, but no comparable number appears in the summary ^[25]	0.91 on LLM Stats ^[7]	Claude and GPT-5.5 are too close to call on GPQA alone.
MMLU	No comparable figure in the source set	92.4% per O-Mega ^[3]	MMLU-Pro appears in a community evaluation, but no number is visible in the summary ^[25]	No comparable figure	Give it low weight, because MMLU is saturated among strong models.
ARC-AGI	No comparable figure	ARC-AGI-2: 85.0%; ARC-AGI-1: 95.0% per O-Mega ^[3]	No comparable figure	No comparable figure	Supports the case that GPT-5.5 is strong at reasoning, but the source still needs flagging.
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[16]	No comparable figure	BenchLM reports 83.8/100 in the Agentic category for DeepSeek V4 Pro High ^[27]	No comparable figure	Useful for orienting on capability, but the two scales are not equivalent.
Long context / Needle-in-a-Haystack	Anthropic says Opus 4.7 delivered the most consistent long-context performance among the models it tested ^[16]	No comparable figure	NxCode cites 97% at 1 million tokens, but its own wording frames this as a claim awaiting independent verification ^[26]	No comparable figure	DeepSeek has a notable claim, not yet a settled result.
LiveCodeBench / Codeforces	No comparable figure	No comparable figure	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]	No comparable figure	Good for pure coding, but it does not fully answer the long-horizon agentic question.

Do not let a single number drive the whole decision

These benchmarks do not measure the same thing. SWE-bench focuses on handling real software engineering tasks; Vals AI describes it as a benchmark for solving production software engineering tasks ^[17]. SWE-bench Pro needs to be kept separate: its paper introduces it as a substantially harder variant aimed at long-horizon software engineering tasks ^[38].

GPQA Diamond is useful for looking at scientific reasoning, but among frontier models it no longer separates them clearly. TNW notes that GPQA Diamond scores for models such as Opus 4.7, GPT-5.4 Pro, and Gemini 3.1 Pro have converged to the point where the differences are within measurement noise ^[15]. MMLU calls for even more caution: Nanonets argues that by 2026 the top models all exceed 88%, which leaves the benchmark too saturated to distinguish finely between strong models ^[1].
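
To make the "within noise" point concrete, here is a rough sketch of the sampling error on a single benchmark run, applied to the 94.2% and 93.6% GPQA Diamond figures cited in this article. It assumes each score is plain accuracy over the 198 questions of the Diamond subset and that the two runs are independent; the cited sources do not state run counts or prompting setups, and the function names are illustrative. Treat the result as an order-of-magnitude check, not a significance test of the published numbers.

```python
import math

# Assumption: GPQA Diamond has 198 questions and each score is single-run accuracy.
N_QUESTIONS = 198


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_in_standard_errors(acc_a: float, acc_b: float, n: int = N_QUESTIONS) -> float:
    """Gap between two accuracies, in units of the standard error of the difference."""
    se_diff = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return (acc_a - acc_b) / se_diff


claude = 0.942  # Claude Opus 4.7, as reported by the cited sources
gpt = 0.936     # GPT-5.5, as reported by the cited sources

print(f"standard error per model: about {standard_error(claude) * 100:.1f} points")
print(f"gap: {gap_in_standard_errors(claude, gpt):.2f} standard errors")
```

Under these assumptions the 0.6-point gap is well under one standard error, which is consistent with TNW's reading that GPQA Diamond no longer separates frontier models.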

The source of each number matters too. An official lab release, an independent leaderboard, an aggregator page, and a community discussion do not carry the same weight. Even BenchLM notes that its Claude Opus 4.7 profile is currently excluded from the public leaderboard because it still lacks enough non-generated public benchmark coverage to rank safely ^[14].

Claude Opus 4.7: the strongest case for coding and agents

Claude Opus 4.7 has the best public evidence base in this group. Anthropic says Opus 4.7 tied for the top overall score on its internal research-agent benchmark at 0.715, and delivered the most consistent long-context performance among the models it evaluated ^[16]. Because this is an internal benchmark, it should not be treated as independent validation; it is still an official signal of what the model is optimized for: multi-step work, long-context handling, and agent-style tasks.

Among external sources, SWE-bench is the clearest signal. Vals AI ranks Claude Opus 4.7 first with 82.00% on a page updated April 24, 2026 ^[17]. Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[20]. LMCouncil, by contrast, lists 83.5% ± 1.7 for Claude Opus 4.7 on SWE-bench Verified ^[9].

The right reading is not to pick one number and ignore the rest. It is more defensible to say that Claude appears in the leading group, or in first place, across several software engineering sources, while stating clearly that SWE-bench, SWE-bench Verified, and SWE-bench Pro are not the same test and that results can differ by methodology, update date, subset, or configuration ^[17]^[20]^[38].
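
As a small illustration of why the two SWE-bench Verified figures should be reported side by side, the sketch below checks whether Vellum's 87.6% falls inside the LMCouncil range of 83.5% ± 1.7. It assumes the ± value is a symmetric margin around the point estimate; the source does not say whether it is a standard error or a confidence interval. The `Estimate` class is a hypothetical helper, not part of any cited tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A reported score in percentage points, with an optional symmetric margin."""

    source: str
    value: float
    margin: float = 0.0  # 0.0 when the source gives no margin

    def bounds(self) -> tuple[float, float]:
        return (self.value - self.margin, self.value + self.margin)

    def covers(self, other_value: float) -> bool:
        low, high = self.bounds()
        return low <= other_value <= high


# Figures as cited in this article for Claude Opus 4.7 on SWE-bench Verified.
lmcouncil = Estimate("LMCouncil", 83.5, 1.7)
vellum = Estimate("Vellum", 87.6)

low, high = lmcouncil.bounds()
print(f"LMCouncil range: {low:.1f} to {high:.1f}")
print("Vellum inside LMCouncil range:", lmcouncil.covers(vellum.value))
```

The Vellum figure sits above the LMCouncil range, so the two sources are best shown as separate rows with their own labels, not averaged into one number.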

On scientific reasoning, O-Mega, Vellum, and TNW list Claude Opus 4.7 at 94.2% on GPQA Diamond ^[3]^[12]^[15]. TNW itself cautions, however, that GPQA scores are tightly compressed among frontier models, so GPQA alone should not be used to declare an outright winner ^[15].

GPT-5.5: very strong reasoning, but a thinner official trail

GPT-5.5 stands out in the reasoning figures collected. O-Mega reports 92.4% on MMLU, 93.6% on GPQA Diamond, 85.0% on ARC-AGI-2, and 95.0% on ARC-AGI-1 ^[3]. Vellum also lists GPT-5.5 at 93.6% on GPQA Diamond, below Claude Opus 4.7 in that particular table ^[12]. BenchLM places GPT-5.5 in the top tier, with 89/100 on its provisional leaderboard and a rank of 2 out of 16 on its verified leaderboard ^[6].

The caveat is traceability. In the source set used for this comparison, GPT-5.5 appears through articles, aggregator pages, and benchmark pages; there is no official OpenAI benchmark card with a full set of comparable figures equivalent to Anthropic's source for Claude Opus 4.7. Appwrite covered the GPT-5.5 launch in a post dated April 24, 2026, while Vals lists openai/gpt-5.5 with a release date of April 23, 2026 and a Vals Index of 67.76% ± 1.79; these sources are no substitute for an official benchmark card ^[2]^[11].

So in an executive report, GPT-5.5 should be positioned as a leading contender for general reasoning, especially on the strength of GPQA and ARC-AGI. But if the criterion is consistent public evidence across all models, it is too early to call GPT-5.5 the overall winner ^[3]^[6]^[12].

DeepSeek V4 / V4 Pro: worth attention, but the variants must be kept apart

DeepSeek is the case most prone to confusion. Sources use DeepSeek V4, DeepSeek V4 Pro, and DeepSeek V4 Pro High interchangeably, so a score for one variant should not be automatically assigned to another ^[25]^[26]^[27].

Hugging Face hosts a community discussion for DeepSeek-V4-Pro that adds evaluation results for GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, and Terminal-Bench 2.0 ^[25]. BenchLM reports that DeepSeek V4 Pro High scores 83.8/100 in Agentic, 88.8/100 in Coding, and 72.1/100 in Knowledge ^[27]. NxCode claims DeepSeek V4 reaches 81% on SWE-bench and 97% on Needle-in-a-Haystack at a 1-million-token context, but its own wording makes the 97% figure conditional on independent testing ^[26].

Redreamality adds a positive signal for pure coding: LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]. The same source also concludes, though, that closed frontier models still lead on long-horizon agentic work such as SWE-bench Pro and Terminal-Bench 2.0 ^[30].

In practice, DeepSeek V4/V4 Pro deserves an internal trial, especially when the engineering team wants to measure it on its own workload. With the current source set, however, DeepSeek does not match the public certainty Claude has from SWE-bench and from Anthropic's research-agent signal ^[16]^[17]^[25]^[27].

Kimi K2.6: signals exist, but not enough for a like-for-like ranking

Kimi K2.6 should not be left out of the discussion, but it should not be presented as if its benchmark coverage matched that of the other three models. LLM Stats lists Kimi K2.6 at 0.91 on GPQA, and WhatLLM places Kimi K2.6 in its top 10 models by Quality Index ^[7]^[21]. These signals show the model has entered the benchmark ecosystem, but they are not enough for a comprehensive comparison with Claude Opus 4.7, GPT-5.5, and DeepSeek V4/V4 Pro.

Silently substituting Kimi K2.5 for Kimi K2.6 should also be avoided. Simon Willison recorded a Kimi K2.5 result on SWE-bench Verified in February 2026, but that is a different version of the model ^[8]. In a serious comparison table, Kimi K2.6 should be marked as lacking evidence or pending multi-benchmark validation.

Recommendations by use case

Need	Preferred choice	Reliability	Rationale
Fixing real issues and agentic coding	Claude Opus 4.7	High to medium	Leads SWE-bench on Vals AI with 82.00% and scores strongly on SWE-bench Verified and SWE-bench Pro per Vellum ^[17]^[20].
Multi-step tasks, research-agent work	Claude Opus 4.7	Medium	Anthropic reports 0.715 on its internal benchmark and the most consistent long-context performance among the models it tested ^[16].
GPQA-style scientific reasoning	Claude Opus 4.7 or GPT-5.5	Medium	Claude at 94.2%, GPT-5.5 at 93.6%; the gap is small and GPQA is compressed among frontier models ^[3]^[12]^[15].
Broad general reasoning	GPT-5.5	Medium to low	Very strong MMLU, GPQA, and ARC-AGI scores, but they come mainly from O-Mega, Vellum, BenchLM, and other aggregator pages ^[3]^[6]^[12].
Technical experimentation, self-verification in your own environment	DeepSeek V4 / V4 Pro	Medium to low	Signals from Hugging Face, BenchLM, NxCode, and Redreamality, but variants are mixed and independent validation is needed ^[25]^[26]^[27]^[30].
Full quantitative ranking	Do not treat Kimi K2.6 as a fully benchmarked model	Low	Signals exist, such as GPQA 0.91 on LLM Stats, but equivalent benchmark coverage is missing ^[7]^[21].

How to present this without overpromising

For a slide deck or report, separate performance from evidence quality. One slide can hold the ranking by use case; a second, the score table; a third, the methodology notes.

The core message should be short: Claude Opus 4.7 has the best evidence for coding and agents; GPT-5.5 is a very strong contender on general reasoning; DeepSeek V4/V4 Pro is a technical option worth trying but needs internal benchmarking; Kimi K2.6 does not yet have enough comparable data.

Three methodological caveats should accompany any ranking table. First, do not blend SWE-bench, SWE-bench Verified, and SWE-bench Pro as if they were one test, because SWE-bench Pro is designed for harder, long-horizon software engineering tasks ^[38]. Second, do not decide on MMLU alone, because the top models have clustered above 88% ^[1]. Third, label every number with its source type: official, leaderboard, aggregator, community, or a claim from an analysis piece.
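
One way to apply the first and third caveats in a notebook or spreadsheet export is to store every figure with its exact benchmark variant, exact model variant, and source type, and to group only on exact benchmark names. The sketch below is illustrative: `Score`, `SourceType`, and `comparable_rows` are hypothetical names, the rows are a subset of the figures cited in this article, and the source-type label on each row follows the article's description of that source and is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    OFFICIAL = "official"
    LEADERBOARD = "leaderboard"
    AGGREGATOR = "aggregator"
    COMMUNITY = "community"
    ANALYSIS_CLAIM = "analysis claim"


@dataclass(frozen=True)
class Score:
    model: str       # exact variant name, never a family name
    benchmark: str   # exact benchmark variant
    value: float
    source: str
    source_type: SourceType


# Subset of figures cited in the article; source-type labels are assumptions.
SCORES = [
    Score("Claude Opus 4.7", "SWE-bench", 82.00, "Vals AI", SourceType.LEADERBOARD),
    Score("DeepSeek V4", "SWE-bench", 81.0, "NxCode", SourceType.ANALYSIS_CLAIM),
    Score("Claude Opus 4.7", "SWE-bench Verified", 87.6, "Vellum", SourceType.AGGREGATOR),
    Score("Claude Opus 4.7", "SWE-bench Pro", 64.3, "Vellum", SourceType.AGGREGATOR),
    Score("GPT-5.5", "MMLU", 92.4, "O-Mega", SourceType.AGGREGATOR),
]


def comparable_rows(scores):
    """Group scores by exact benchmark name so that variants are never pooled."""
    groups = defaultdict(list)
    for score in scores:
        groups[score.benchmark].append(score)
    return dict(groups)


for benchmark, rows in comparable_rows(SCORES).items():
    for row in rows:
        print(f"{benchmark}: {row.model} {row.value} ({row.source}, {row.source_type.value})")
```

Keeping the source type on each row means a slide can show, for example, that the two SWE-bench entries come from a leaderboard and an analysis claim respectively, instead of presenting them as equally firm.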

Conclusion

If the goal is to pick a model for a report that can be defended before an engineering team or executives, Claude Opus 4.7 should come first, thanks to the combination of an official source, first place on Vals SWE-bench, and strong third-party results on the SWE-bench variants ^[16]^[17]^[20]. GPT-5.5 should be presented as the leading contender on reasoning, with a clear note that the scores collected come mainly from secondary sources ^[3]^[6]^[12]. DeepSeek V4/V4 Pro merits an internal trial but should not yet be declared a leader ^[25]^[26]^[27]^[30]. Kimi K2.6 currently belongs in the insufficient-data group for a full comparison ^[7]^[21].

Key takeaways

Claude Opus 4.7 is the option with the strongest public evidence for coding and agents: Vals AI records 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16].
GPT-5.5 is very strong at reasoning according to secondary sources: O-Mega reports 92.4% on MMLU, 93.6% on GPQA Diamond, 85.0% on ARC-AGI-2, and 95.0% on ARC-AGI-1 [3].
DeepSeek V4/V4 Pro shows good signals for coding, but the figures still mix variants; Kimi K2.6 has only scattered signals so far, such as 0.91 on GPQA at LLM Stats and a place in WhatLLM's top 10 by Quality Index [25][27][7][21].

Research exchange

You · April 26, 2026

Research question

Research the benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6, and compare them as completely as possible in the format of a research report for presentation.

Studio Global AI · April 26, 2026 · 40 sources

Cited answer

The most honest comparison is a partial one: Claude Opus 4.7 is the model with the best public support for coding and agent benchmarks, GPT-5.5 looks strong at reasoning according to secondary sources, DeepSeek V4/V4 Pro shows competitive signals but with more community-based or provisional evidence, and for Kimi K2.6 I did not find sufficient citable evidence. For an executive presentation, I recommend showing Claude as the best-supported leader on SWE-bench, GPT-5.5 as a strong contender that is less publicly verifiable, DeepSeek as an open and promising alternative pending validation, and Kimi K2.6 as “Insufficient evidence”.

Key findings

Claude Opus 4.7 has the strongest evidence in the set because it appears in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as SWE-bench on Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated April 24, 2026 ^[2].
Sources differ on Claude's SWE-bench performance: Vals AI reports 82.00%, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that the differences between frontier models on that benchmark are within measurement noise ^[3].
GPT-5.5 has strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].
I did not find an official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as “provisional / secondary”.
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base “DeepSeek V4” ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines a Hugging Face community discussion, aggregators, and technical blogs, so its comparability with Claude and GPT-5.5 is lower ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as if it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, traceable technical discussions, and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four levels: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified, and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant oriented toward long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer indicates that in 2026 the top models exceed 88% and the benchmark is heavily saturated ^[12].

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	High to medium	Internal research-agent score of 0.715 and strong long-context performance per Anthropic; SWE-bench 82.00% per Vals AI; GPQA Diamond 94.2% per TNW ^[1]^[2]^[3]	Best candidate to present as the well-supported leader in coding and agents, with caution over differences between sources ^[2]^[15]
GPT-5.5	Medium to low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% per O-Mega ^[4]	Very strong at reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium to low	BenchLM reports DeepSeek V4 Pro High at Agentic 83.8/100 and Coding 88.8/100; NxCode cites 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as a claimed result ^[7]^[8]	A competitive alternative, especially if an open or local ecosystem is valued, but it requires independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficient citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numerical benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% per Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% per Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% per Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% per TNW and O-Mega ^[3]^[4]	93.6% per O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% per O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic states very consistent long-context performance ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, conditional on independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model in the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Anthropic states that Opus 4.7 tied for the top overall score on its internal research-agent benchmark at 0.715 and showed the most consistent long-context performance among the models evaluated ^[1].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated April 24, 2026 ^[2].

Vellum reports 87.6% for Claude on SWE-bench Verified, higher than the Vals AI figure, and 64.3% on SWE-bench Pro ^[15].

The gap between 82.00% and 87.6% should be treated as a difference in methodology, subset, or configuration, not as a single confirmed improvement ^[2]^[15].

On scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and notes that frontier models sit very close together on that benchmark ^[3].

GPT-5.5

GPT-5.5 looks very strong at general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].

Appwrite published an article titled “GPT-5.5 is here” on April 24, 2026, focused on benchmarks, pricing, and changes for developers, but it is a secondary source and not an official OpenAI benchmark card ^[5].

The retrieved evidence does not allow the GPT-5.5 benchmarks to be confirmed against an official OpenAI source, so its scores should be labeled “third-party / not officially verified”.

For a presentation, GPT-5.5 can be positioned as a very strong contender on reasoning, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

The retrieved evidence for DeepSeek concentrates on variants such as DeepSeek V4 Pro and DeepSeek V4 Pro High, which makes it unsafe to assume that the figures represent the base DeepSeek V4 model ^[6]^[8].

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

NxCode presents DeepSeek V4 as a 1T-parameter model with 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens, but the summary itself indicates that the 97% figure must hold up under independent testing to be conclusive ^[7].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, and at the same time states that closed models still lead on long-horizon agentic tasks ^[9].

For a presentation, DeepSeek should be shown as a “promising technical contender” and not as a validated leader, unless the team has reproducible internal benchmarks ^[6]^[7]^[8]^[9].

Kimi K2.6

I did not find sufficient citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and weaken the validity of the comparison.

The safest way to present it is as “pending validation” or “Insufficient evidence”.

Ranking by use case

Scenario	Recommended model	Justification
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader with 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude is listed at 94.2% on GPQA Diamond, while GPT-5.5 is listed at 93.6% by O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model in the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Including Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated April 24, 2026 ^[2].
Vellum adds detail on SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega supplies several useful figures for GPT-5.5, but it should be treated as a secondary source until an official benchmark card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro demonstrates community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainers indicate saturation among top models in 2026 ^[12].

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a GPT-5.5 benchmark card, so any GPT-5.5 figure in this report should be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r

Sources

[1] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elo (nanonets.com)
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[2] GPT-5.5 is here: benchmarks, pricing, and what changes ... - Appwrite (appwrite.io)
Blog/GPT-5.5 is here: benchmarks, pricing, and what changes for developers Apr 24, 202...
[3] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[6] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[7] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[8] SWE-bench February 2026 leaderboard update (simonwillison.net)
Here's how the top ten models performed: Image 1: Bar chart showing "% Resolved" by "Model". Bars in descending order: Claude 4.5 Opus (high reasoning) 76.8%, Gemini 3 Flash (high reasoning) 75.8%, MiniMax M2.5 (high reasoning) 75.8%, Claude Opus 4.6 75.6%,...
[9] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[11] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[12] LLM Leaderboard 2026 - Compare Top AI Models - Vellum (vellum.ai)
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[14] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance | BenchLM.ai (benchlm.ai)
BenchLM is tracking Claude Opus 4.7, but this profile is currently excluded from the public leaderboard because it still lacks enough non-generated benchmark cov...
[15] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[16] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[17] SWE-bench - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Coding SWE-bench SWE-bench Updated: 4/24/2026 Solving production software engineering tasks Key Takeaways Claude Opus 4.7 leads with a...
[20] Claude Opus 4.7 Benchmarks Explained - Vellumvellum.ai
Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both G...
[21] WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed — Live Rankingswhatllm.org
whatllm? whatllm.org WhatLLM.org - LLM Comparison Tool The ultimate LLM comparison tool Compare price, performance, and speed across the entire AI ecosystem. Updated daily with the latest benchmarks. Top 10 Models Ranked by Quality Index across all benchmar...
[25] Add community evaluation results for GPQA, GSM8K, HLE, MMLU ...huggingface.co
deepseek-ai/DeepSeek-V4-Pro · Add community evaluation results for GPQA, GSM8K, HLE, MMLU-PRO, SWE-BENCH PRO, SWE-BENCH VERIFIED, TERMINAL-BENCH-2.0 Image 1: Hugging Face's logoHugging Face Models Datasets Spaces Buckets new Docs Enterprise Pricing Log In S...
[26] DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench ... - NxCodenxcode.io
The claimed results: Metric Standard Attention Engram (DeepSeek V4) --- Needle-in-a-Haystack (1M tokens) 84.2% accuracy 97% accuracy Context Length Supported Varies (128K typical) 1M tokens If the 97% figure holds up under independent testing, this represen...
[27] DeepSeek V4 Pro (High) Benchmarks 2026 - BenchLM.aibenchlm.ai
Category Performance PNG Embed Share Scores across all benchmark categories (0-100 scale) Category Breakdown Agentic 83.8/ 100 Weight: 22%5 benchmark s Terminal-Bench 2.0 BrowseComp OSWorld-Verified GAIA TAU-bench WebArena Coding 7 88.8/ 100 Weight: 20%6 be...
[30] Mapping the DeepSeek V4 Evaluation Suite: A Field Guide to 2026 ...redreamality.com
The Takeaway The V4 scorecard confirms a pattern: for pure coding, open weights have caught up (LiveCodeBench 93.5, Codeforces 3206). For long-horizon agentic work (SWE-bench Pro, Terminal-Bench 2.0), closed frontier still leads. For frontier reasoning (HLE...
[38] Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arxiv.org
… PRO, a substantially more challenging benchmark that builds … In our evaluation of widely used coding models, under a unified … Towards this end, this paper is motivated to (1) mitigate … 2025


A more sensible reading separates two questions: which model posts good scores, and how far those scores can be trusted.

Quick conclusions

Model	Most cautious reading	Strength of evidence
Claude Opus 4.7	The candidate with the strongest public record for coding, agents, and multi-step tasks. Anthropic reports 0.715 on its internal research-agent benchmark, and Vals AI ranks Claude Opus 4.7 first on SWE-bench at 82.00% ^[16]^[17].	High-medium
GPT-5.5	Very strong at general reasoning: O-Mega reports 92.4% MMLU, 93.6% GPQA Diamond, 85.0% ARC-AGI-2, and 95.0% ARC-AGI-1 ^[3].	Medium
DeepSeek V4 / V4 Pro	Promising for coding and technical experimentation, but the source data mixes V4, V4 Pro, and V4 Pro High ^[25]^[27].	Medium-low
Kimi K2.6	Early signals exist (LLM Stats lists 0.91 on GPQA, and WhatLLM places it in the top 10 by Quality Index), but benchmark coverage is too thin for a like-for-like comparison ^[7]^[21].	Low

Benchmark table: comparable, but not to be merged mechanically

Benchmark or metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	How to read it
SWE-bench	82.00% on Vals AI, updated 24 April 2026 ^[17]	No comparable figure in the source set	81% cited by NxCode for DeepSeek V4 ^[26]	No comparable figure	The cleanest public signal currently favors Claude.
SWE-bench Verified	87.6% per Vellum; 83.5% ± 1.7 per LMCouncil ^[20]^[9]	No comparable figure	Hugging Face mentions SWE-bench Verified in the community evaluation for DeepSeek-V4-Pro, but the summary does not show a figure ^[25]	No comparable figure	Scores vary by source, configuration, subset, and model variant.
SWE-bench Pro	64.3% per Vellum ^[20]	No comparable figure	Hugging Face mentions SWE-bench Pro in the community evaluation, but the summary does not show a figure ^[25]	No comparable figure	A better fit when evaluating software agents on long-horizon work.
GPQA Diamond	94.2% per O-Mega, Vellum, and TNW ^[3]^[12]^[15]	93.6% per O-Mega and Vellum ^[3]^[12]	Appears in community evaluation suites, but no comparable figure is visible in the summary ^[25]	0.91 on LLM Stats ^[7]	Claude and GPT-5.5 are too close to call on GPQA alone.
MMLU	No comparable figure in the source set	92.4% per O-Mega ^[3]	MMLU-Pro appears in the community evaluation, but no figure is visible in the summary ^[25]	No comparable figure	Give it low weight, because MMLU is saturated among strong models.
ARC-AGI	No comparable figure	ARC-AGI-2: 85.0%; ARC-AGI-1: 95.0% per O-Mega ^[3]	No comparable figure	No comparable figure	Supports the case that GPT-5.5 is strong at reasoning, but the source caveat still applies.
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[16]	No comparable figure	BenchLM reports 83.8/100 in the Agentic category for DeepSeek V4 Pro High ^[27]	No comparable figure	Useful for gauging capability, but the two scales are not equivalent.
Long context / Needle-in-a-Haystack	Anthropic says Opus 4.7 has the most consistent long-context performance among the models it tested ^[16]	No comparable figure	NxCode cites 97% at 1 million tokens, but its own wording marks this as a claim awaiting independent verification ^[26]	No comparable figure	DeepSeek's claim is notable, not a settled conclusion.
LiveCodeBench / Codeforces	No comparable figure	No comparable figure	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]	No comparable figure	Good for pure coding, but it does not fully answer the long-horizon agentic question.
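
The table mixes scales (percentages, a 0.91 fraction, a 0.715 internal score, an 83.8/100 category score) and benchmark variants. One way to avoid merging them mechanically is sketched below. The `Score` record, the `scale` labels, and the `gap` helper are hypothetical names introduced for illustration; only the figures come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # exact benchmark label, e.g. "SWE-bench Verified"
    value: float
    scale: str      # assumed labels: "percent" (0-100) or "fraction" (0-1)
    source: str


def as_percent(score: Score) -> float:
    """Put a score on the 0-100 scale so that 0.91 and 94.2 are not mixed up."""
    if score.scale == "percent":
        return score.value
    if score.scale == "fraction":
        return score.value * 100.0
    raise ValueError(f"unknown scale: {score.scale}")


def gap(a: Score, b: Score) -> float:
    """Return a - b in percentage points, but only for the same benchmark label."""
    if a.benchmark != b.benchmark:
        raise ValueError(f"not comparable: {a.benchmark!r} vs {b.benchmark!r}")
    return round(as_percent(a) - as_percent(b), 2)


claude_gpqa = Score("Claude Opus 4.7", "GPQA Diamond", 94.2, "percent", "TNW")
gpt_gpqa = Score("GPT-5.5", "GPQA Diamond", 93.6, "percent", "O-Mega")
claude_swe = Score("Claude Opus 4.7", "SWE-bench", 82.00, "percent", "Vals AI")
claude_swe_verified = Score(
    "Claude Opus 4.7", "SWE-bench Verified", 87.6, "percent", "Vellum"
)

print(gap(claude_gpqa, gpt_gpqa))  # same benchmark, same scale: a valid gap

try:
    gap(claude_swe_verified, claude_swe)  # different variants: refused
except ValueError as err:
    print(err)
```

The guard is deliberately strict: it compares labels as exact strings, so SWE-bench, SWE-bench Verified, and SWE-bench Pro never produce a difference, which matches the reading in the last column of the table.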

Don't let a single number drive the whole decision

Claude Opus 4.7: the strongest case for coding and agents

GPT-5.5: very strong reasoning, but a thinner official trail

DeepSeek V4 / V4 Pro: noteworthy, but the variants must be separated

Kimi K2.6: signals exist, but not enough to rank it alongside the others

Recommendations by use case

Need	Prefer	Confidence	Reason
Fixing real-world issues and agentic coding	Claude Opus 4.7	High-medium	Leads SWE-bench on Vals AI at 82.00% and scores strongly on SWE-bench Verified and SWE-bench Pro per Vellum ^[17]^[20].
Multi-step tasks, research agents	Claude Opus 4.7	Medium	Anthropic reports 0.715 on its internal benchmark and the most consistent long-context performance among the models it tested ^[16].
GPQA-style scientific reasoning	Claude Opus 4.7 or GPT-5.5	Medium	Claude is at 94.2% and GPT-5.5 at 93.6%; the gap is small and GPQA scores are compressed among frontier models ^[3]^[12]^[15].
Broad general reasoning	GPT-5.5	Medium-low	MMLU, GPQA, and ARC-AGI scores are very strong, but they come mainly from O-Mega, Vellum, BenchLM, and aggregator sites ^[3]^[6]^[12].
Technical experimentation with self-verification in your own environment	DeepSeek V4 / V4 Pro	Medium-low	Signals come from Hugging Face, BenchLM, NxCode, and Redreamality, but the variants are mixed and independent validation is still needed ^[25]^[26]^[27]^[30].
Complete quantitative ranking	Do not treat Kimi K2.6 as a fully benchmarked model	Low	There are signals such as GPQA 0.91 on LLM Stats, but equivalent benchmark coverage is missing ^[7]^[21].

How to present this without overpromising

Conclusion


Key takeaways

Claude Opus 4.7 is the choice with the most solid public evidence for coding and agents: Vals AI records 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16].
GPT-5.5 is very strong at reasoning according to secondary sources: O-Mega reports 92.4% MMLU, 93.6% GPQA Diamond, 85.0% ARC-AGI-2, and 95.0% ARC-AGI-1 [3].
DeepSeek V4/V4 Pro shows good coding signals, but the figures still mix variants; Kimi K2.6 has only scattered signals, such as 0.91 GPQA on LLM Stats and a place in WhatLLM's top 10 by Quality Index [25][27][7][21].


Research exchange

You, 26 Apr 2026

Research question

Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible in research-report format for a presentation.

Studio Global AI, 26 Apr 2026, 40 sources

Cited answer

Key findings

Claude Opus 4.7 has the strongest evidence in the set because it appears both in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as SWE-bench/Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated on 24 April 2026 ^[2].
Sources conflict on Claude's SWE-bench performance: Vals AI reports 82.00%, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that the differences between frontier models on that benchmark are within measurement noise ^[3].
GPT-5.5 posts strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].
I found no official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as "provisional / secondary".
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base "DeepSeek V4" ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines Hugging Face community discussion, aggregators, and technical blogs, so it is less comparable with Claude and GPT-5.5 ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as though it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, traceable technical discussions, and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four tiers: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified, and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant aimed at long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer indicates that in 2026 the top models exceed 88% and the benchmark is heavily saturated ^[12].
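
The four-tier classification can be written down as a small lookup, which makes the labeling rule explicit. This is a minimal sketch, not the report's actual procedure: the `Tier` enum, the `SOURCE_TIER` assignment of each source to a tier, and the cut-off used in `presentation_label` are all assumptions made for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Evidence tiers, strongest first (lower value = stronger evidence)."""
    OFFICIAL = 1
    INDEPENDENT_BENCHMARK = 2
    AGGREGATOR_OR_COMMUNITY = 3
    BLOG_OR_SECONDARY = 4


# Assumed assignment of sources to tiers, for illustration only.
SOURCE_TIER = {
    "Anthropic": Tier.OFFICIAL,
    "Vals AI": Tier.INDEPENDENT_BENCHMARK,
    "Hugging Face": Tier.AGGREGATOR_OR_COMMUNITY,
    "BenchLM": Tier.AGGREGATOR_OR_COMMUNITY,
    "O-Mega": Tier.BLOG_OR_SECONDARY,
    "NxCode": Tier.BLOG_OR_SECONDARY,
}


def strongest_tier(sources: list[str]) -> Tier:
    """Return the best tier among the sources backing a model's figures."""
    return min(SOURCE_TIER[name] for name in sources)


def presentation_label(sources: list[str]) -> str:
    """Hypothetical rule: anything below an independent benchmark is provisional."""
    if strongest_tier(sources) <= Tier.INDEPENDENT_BENCHMARK:
        return "supported"
    return "provisional / secondary"


print(presentation_label(["Anthropic", "Vals AI"]))  # Claude Opus 4.7
print(presentation_label(["O-Mega"]))                # GPT-5.5
```

Under these assumptions, GPT-5.5's figures stay labeled as provisional until an official or independent source is added to its list, which is the behavior the key findings describe.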

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	High-medium	Internal research-agent benchmark 0.715 and strong long-context performance per Anthropic; SWE-bench 82.00% per Vals AI; GPQA Diamond 94.2% per TNW ^[1]^[2]^[3]	Best candidate to present as the evidence-backed leader in coding/agents, with caution over differences between sources ^[2]^[15]
GPT-5.5	Medium-low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% per O-Mega ^[4]	Very strong at reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium-low	BenchLM reports DeepSeek V4 Pro High at Agentic 83.8/100 and Coding 88.8/100; NxCode cites 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as claimed results ^[7]^[8]	A competitive alternative, especially if an open/local ecosystem matters, but it needs independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficiently citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numeric benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% per Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% per Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% per Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% per TNW and O-Mega ^[3]^[4]	93.6% per O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% per O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic claims highly consistent long-context performance ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, conditional on independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model in the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated on 24 April 2026 ^[2].

Vellum reports higher figures for Claude on a different variant, plus a harder one: 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[15].

The difference between 82.00% and 87.6% should be treated as a discrepancy in methodology, subset, or configuration, not as a single confirmed improvement ^[2]^[15].

On scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and notes that frontier models sit very close to one another on that benchmark ^[3].
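
A rough calculation shows why a gap of a few tenths of a point on GPQA Diamond is hard to interpret. The sketch below assumes each question is an independent pass/fail trial and uses a simple binomial standard error for a single run. The figure of 198 questions for the Diamond subset comes from the GPQA dataset itself, not from the sources retrieved for this report, and the function names are hypothetical.

```python
import math

GPQA_DIAMOND_QUESTIONS = 198  # size of the Diamond subset (not from this report's sources)


def standard_error_points(accuracy_percent: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


def gap_in_questions(a_percent: float, b_percent: float, n_questions: int) -> float:
    """How many questions a score gap corresponds to on a test of this size."""
    return (a_percent - b_percent) / 100.0 * n_questions


claude, gpt = 94.2, 93.6  # GPQA Diamond figures cited in this report
se = standard_error_points(claude, GPQA_DIAMOND_QUESTIONS)
questions = gap_in_questions(claude, gpt, GPQA_DIAMOND_QUESTIONS)

print(f"standard error: about {se:.2f} points")
print(f"gap of {claude - gpt:.1f} points is about {questions:.1f} questions")
```

Under these assumptions the standard error is larger than the 0.6-point gap, and the gap itself amounts to roughly one question, which is consistent with reading the two scores as effectively tied.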

GPT-5.5

GPT-5.5 looks very strong at general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].

For a presentation, GPT-5.5 can be positioned as a very strong competitor in reasoning, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, and at the same time states that closed models still lead on long-horizon agentic tasks ^[9].

Kimi K2.6

I did not find enough citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and weaken the validity of the comparison.

The safest way to present it is as "pending validation" or "Insufficient evidence".

Ranking by use case

Scenario	Recommended model	Justification
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader at 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude appears at 94.2% on GPQA Diamond, while GPT-5.5 appears at 93.6% in O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model in the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Including Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated on 24 April 2026 ^[2].
Vellum adds detail for SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega provides several useful figures for GPT-5.5, but it should be treated as a secondary source until an official card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro shows community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainers indicate saturation among top models in 2026 ^[12].

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a benchmark card for GPT-5.5, so any GPT-5.5 figure in this report should be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r...


Khám phá xu hướng

Báo cáoĐã xuất bản5 thg 5 2026Last edited 6 thg 5 202620 nguồn

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: so sánh benchmark 2026

Tìm kiếm và kiểm chứng sự thật với Studio Global AI Duyệt thêm từ Khám phá

3.8K0

Cách đọc hợp lý hơn là tách hai câu hỏi: mô hình nào có điểm tốt, và điểm đó đáng tin đến đâu.

Kết luận nhanh

Mô hình	Cách đọc thận trọng nhất	Độ tin cậy của bằng chứng
Claude Opus 4.7	Ứng viên có hồ sơ công khai mạnh nhất cho coding, agent và tác vụ nhiều bước. Anthropic báo 0,715 trong benchmark research-agent nội bộ, còn Vals AI xếp Claude Opus 4.7 đứng đầu SWE-bench với 82,00% ^[16]^[17].	Cao - trung bình
GPT-5.5	Rất mạnh ở reasoning tổng quát: O-Mega báo 92,4% MMLU, 93,6% GPQA Diamond, 85,0% ARC-AGI-2 và 95,0% ARC-AGI-1 ^[3].	Trung bình
DeepSeek V4 / V4 Pro	Hứa hẹn cho coding và thử nghiệm kỹ thuật, nhưng nguồn dữ liệu đang lẫn giữa V4, V4 Pro và V4 Pro High ^[25]^[27].	Trung bình - thấp
Kimi K2.6	Có tín hiệu ban đầu — LLM Stats ghi 0,91 GPQA và WhatLLM đưa vào top 10 theo Quality Index — nhưng chưa đủ phủ nhiều benchmark để so ngang hàng ^[7]^[21].	Thấp

Bảng benchmark đối sánh được, nhưng không nên gộp máy móc

Benchmark hoặc chỉ số	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	Nên hiểu thế nào
SWE-bench	82,00% trên Vals AI, cập nhật 24/4/2026 ^[17]	Chưa có số liệu đối sánh trong bộ nguồn	81% được NxCode nêu cho DeepSeek V4 ^[26]	Chưa có số liệu đối sánh	Tín hiệu công khai sạch nhất đang nghiêng về Claude.
SWE-bench Verified	87,6% theo Vellum; 83,5% ± 1,7 theo LMCouncil ^[20]^[9]	Chưa có số liệu đối sánh	Hugging Face có nhắc SWE-bench Verified trong đánh giá cộng đồng cho DeepSeek-V4-Pro, nhưng phần tóm tắt không hiển thị con số ^[25]	Chưa có số liệu đối sánh	Điểm thay đổi theo nguồn, cấu hình, tập con và biến thể mô hình.
SWE-bench Pro	64,3% theo Vellum ^[20]	Chưa có số liệu đối sánh	Hugging Face có nhắc SWE-bench Pro trong đánh giá cộng đồng, nhưng phần tóm tắt không hiển thị con số ^[25]	Chưa có số liệu đối sánh	Phù hợp hơn nếu đánh giá agent phần mềm làm việc dài hơi.
GPQA Diamond	94,2% theo O-Mega, Vellum và TNW ^[3]^[12]^[15]	93,6% theo O-Mega và Vellum ^[3]^[12]	Có trong các bộ đánh giá cộng đồng, nhưng chưa thấy con số đối sánh trong phần tóm tắt ^[25]	0,91 trên LLM Stats ^[7]	Claude và GPT-5.5 quá sát nhau để kết luận chỉ bằng GPQA.
MMLU	Chưa có số liệu đối sánh trong bộ nguồn	92,4% theo O-Mega ^[3]	MMLU-Pro xuất hiện trong đánh giá cộng đồng, nhưng chưa có số visible trong tóm tắt ^[25]	Chưa có số liệu đối sánh	Nên cho trọng số thấp vì MMLU đã bão hòa ở nhóm mô hình mạnh.
ARC-AGI	Chưa có số liệu đối sánh	ARC-AGI-2: 85,0%; ARC-AGI-1: 95,0% theo O-Mega ^[3]	Chưa có số liệu đối sánh	Chưa có số liệu đối sánh	Củng cố luận điểm GPT-5.5 mạnh về reasoning, nhưng vẫn cần lưu ý nguồn.
Research-agent / tác vụ nhiều bước	0,715 trong benchmark nội bộ của Anthropic ^[16]	Chưa có số liệu đối sánh	BenchLM báo 83,8/100 ở mục Agentic cho DeepSeek V4 Pro High ^[27]	Chưa có số liệu đối sánh	Có ích để định hướng năng lực, nhưng hai thang đo không tương đương.
Long context / Needle-in-a-Haystack	Anthropic nói Opus 4.7 có hiệu năng long-context ổn định nhất trong nhóm mô hình họ thử ^[16]	Chưa có số liệu đối sánh	NxCode nêu 97% ở 1 triệu token, nhưng chính cách diễn đạt cần đọc như claim chờ kiểm chứng độc lập ^[26]	Chưa có số liệu đối sánh	DeepSeek có claim đáng chú ý, chưa phải kết luận đóng.
LiveCodeBench / Codeforces	Chưa có số liệu đối sánh	Chưa có số liệu đối sánh	Redreamality báo LiveCodeBench 93,5 và Codeforces 3206 cho DeepSeek V4 ^[30]	Chưa có số liệu đối sánh	Tốt cho coding thuần, nhưng chưa trả lời hết bài toán agentic dài hơi.

Đừng để một con số dẫn dắt toàn bộ quyết định

Claude Opus 4.7: trường hợp mạnh nhất cho coding và agent

GPT-5.5: reasoning rất mạnh, nhưng dấu vết chính thức ít hơn

DeepSeek V4 / V4 Pro: đáng chú ý, nhưng phải tách biến thể

Kimi K2.6: có tín hiệu, chưa đủ để xếp hạng ngang hàng

Khuyến nghị theo tình huống sử dụng

Nhu cầu	Nên ưu tiên	Độ tin cậy	Lý do
Sửa issue thực tế và coding agentic	Claude Opus 4.7	Cao - trung bình	Dẫn đầu SWE-bench trên Vals AI với 82,00% và xuất hiện mạnh ở SWE-bench Verified, SWE-bench Pro theo Vellum ^[17]^[20].
Tác vụ nhiều bước, research-agent	Claude Opus 4.7	Trung bình	Anthropic báo 0,715 trong benchmark nội bộ và long-context ổn định nhất trong các mô hình họ thử ^[16].
Reasoning khoa học kiểu GPQA	Claude Opus 4.7 hoặc GPT-5.5	Trung bình	Claude ở 94,2%, GPT-5.5 ở 93,6%; chênh lệch nhỏ và GPQA đã nén ở nhóm mô hình frontier ^[3]^[12]^[15].
Reasoning tổng quát rộng	GPT-5.5	Trung bình - thấp	Điểm MMLU, GPQA và ARC-AGI rất mạnh, nhưng chủ yếu đến từ O-Mega, Vellum, BenchLM và các trang tổng hợp ^[3]^[6]^[12].
Thử nghiệm kỹ thuật, tự kiểm chứng trong môi trường riêng	DeepSeek V4 / V4 Pro	Trung bình - thấp	Có tín hiệu từ Hugging Face, BenchLM, NxCode và Redreamality, nhưng còn lẫn biến thể và cần xác thực độc lập ^[25]^[26]^[27]^[30].
Xếp hạng định lượng đầy đủ	Không dùng Kimi K2.6 như mô hình đã đối sánh đầy đủ	Thấp	Có tín hiệu như GPQA 0,91 trên LLM Stats, nhưng thiếu độ phủ benchmark tương đương ^[7]^[21].

How to present this without overpromising

Conclusion

Key takeaways

Claude Opus 4.7 has the strongest publicly available evidence for coding and agents: Vals AI records 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16].
GPT-5.5 is very strong on reasoning according to secondary sources: O-Mega reports 92.4% on MMLU, 93.6% on GPQA Diamond, 85.0% on ARC-AGI-2 and 95.0% on ARC-AGI-1 [3].
DeepSeek V4/V4 Pro shows good coding signals, but the figures still mix variants; Kimi K2.6 has only scattered signals so far, such as 0.91 on GPQA at LLM Stats and a place in WhatLLM's top 10 by Quality Index [25][27][7][21].

Research conversation

You, 26 Apr 2026

Research question

Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4 and Kimi K2.6, and compare them as thoroughly as possible in the form of a research report for presentation.

Studio Global AI, 26 Apr 2026, 40 sources

Cited answer

Key findings

Claude Opus 4.7 has the strongest evidence of the set because it appears in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as SWE-bench/Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated on 24 April 2026 ^[2].
Sources differ on Claude's SWE-bench performance: Vals AI reports 82.00%, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that the differences between frontier models on this benchmark are within measurement noise ^[3].
GPT-5.5 has strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].
I found no official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as "provisional / secondary".
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base "DeepSeek V4" ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines Hugging Face community discussion, aggregators and technical blogs, so it is less comparable with Claude and GPT-5.5 ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as if it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, technical discussions with traceability, and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four levels: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant aimed at long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer source indicates that in 2026 top models exceed 88% and the benchmark is heavily saturated ^[12].
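
The four-level classification and the rule against mixing SWE-bench variants can be sketched in a few lines of Python. This is only an illustration, not the tooling used for this report: the `EvidenceTier` ordering, the `Claim` structure, the helper names, and the tier assigned to each source are assumptions made for the example, while the scores themselves are the ones quoted in this report.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class EvidenceTier(IntEnum):
    """Four evidence levels; the numeric ordering is an assumption of this sketch."""
    SECONDARY_BLOG = 1
    AGGREGATOR_OR_COMMUNITY = 2
    INDEPENDENT_BENCHMARK = 3
    OFFICIAL = 4


@dataclass(frozen=True)
class Claim:
    model: str
    benchmark: str
    score: float
    source: str
    tier: EvidenceTier


def comparable(a: Claim, b: Claim) -> bool:
    """Treat two claims as comparable only when they name the same benchmark."""
    return a.benchmark == b.benchmark


def best_tier(claims: list, model: str) -> Optional[EvidenceTier]:
    """Return the strongest evidence tier found for a model, or None if absent."""
    tiers = [c.tier for c in claims if c.model == model]
    return max(tiers) if tiers else None


# Scores come from the report; the tier given to each source is a hypothetical reading.
CLAIMS = [
    Claim("Claude Opus 4.7", "internal research-agent", 0.715, "Anthropic",
          EvidenceTier.OFFICIAL),
    Claim("Claude Opus 4.7", "SWE-bench", 82.00, "Vals AI",
          EvidenceTier.INDEPENDENT_BENCHMARK),
    Claim("Claude Opus 4.7", "SWE-bench Verified", 87.6, "Vellum",
          EvidenceTier.AGGREGATOR_OR_COMMUNITY),
    Claim("GPT-5.5", "MMLU", 92.4, "O-Mega", EvidenceTier.SECONDARY_BLOG),
    Claim("DeepSeek V4 Pro High", "BenchLM Coding", 88.8, "BenchLM",
          EvidenceTier.AGGREGATOR_OR_COMMUNITY),
]

for name in ("Claude Opus 4.7", "GPT-5.5", "DeepSeek V4 Pro High", "Kimi K2.6"):
    tier = best_tier(CLAIMS, name)
    print(name, "->", tier.name if tier else "Insufficient evidence")
```

Keeping the benchmark name as part of each claim makes the Vals AI and Vellum figures non-comparable by construction, which mirrors how the report handles them.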

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	High to medium	Internal research-agent score of 0.715 and strong long-context performance according to Anthropic; SWE-bench 82.00% according to Vals AI; GPQA Diamond 94.2% according to TNW ^[1]^[2]^[3]	Best candidate to present as the evidence-backed leader in coding/agents, with caution about differences between sources ^[2]^[15]
GPT-5.5	Medium to low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% according to O-Mega ^[4]	Very strong on reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium to low	BenchLM reports DeepSeek V4 Pro High with Agentic 83.8/100 and Coding 88.8/100; NxCode mentions 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as a claimed result ^[7]^[8]	Competitive alternative, especially if an open/local ecosystem is valued, but it requires independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficiently citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numeric benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% according to Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% according to Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% according to Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% according to TNW and O-Mega ^[3]^[4]	93.6% according to O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% according to O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% according to O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% according to O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic states that long-context performance is highly consistent ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, conditional on independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model of the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated on 24 April 2026 ^[2].

Vellum reports different figures for Claude on the benchmark's variants: 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[15].

The gap between 82.00% and 87.6% should be treated as a discrepancy of methodology, subset or configuration, not as a single confirmed improvement ^[2]^[15].

On scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and notes that frontier models sit very close to each other on this benchmark ^[3].
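
A rough calculation shows why a gap this small is hard to interpret. The sketch below compares the 94.2% reported for Claude Opus 4.7 with the 93.6% that O-Mega reports for GPT-5.5, using a simple binomial standard error. It assumes the GPQA Diamond subset contains 198 questions, a single evaluation run, and independent questions; none of these details come from the cited sources, so the result is illustrative only.

```python
import math

# Assumption: GPQA Diamond has 198 questions (not stated in the cited sources).
GPQA_DIAMOND_QUESTIONS = 198


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


claude_accuracy = 0.942  # reported by TNW
gpt_accuracy = 0.936     # reported by O-Mega

gap = claude_accuracy - gpt_accuracy
se = standard_error(claude_accuracy, GPQA_DIAMOND_QUESTIONS)
questions_apart = gap * GPQA_DIAMOND_QUESTIONS

print(f"gap: {gap * 100:.1f} points")
print(f"one standard error: {se * 100:.1f} points")
print(f"gap in questions: {questions_apart:.1f}")
```

Under these assumptions the 0.6-point gap corresponds to little more than one question and is smaller than one standard error, which is consistent with reading the difference as noise.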

GPT-5.5

GPT-5.5 looks very strong on general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].

For a presentation, GPT-5.5 can be positioned as a very strong reasoning competitor, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, while also stating that closed models still lead on long-horizon agentic tasks ^[9].

Kimi K2.6

I found no sufficiently citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and weaken the validity of the comparison.

The safest way to present it is as "pending validation" or "Insufficient evidence".

Ranking by use case

Scenario	Recommended model	Justification
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader with 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude scores 94.2% on GPQA Diamond, while GPT-5.5 scores 93.6% according to O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model of the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Including Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated on 24 April 2026 ^[2].
Vellum adds more detail for SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega provides several useful figures for GPT-5.5, but it should be treated as a secondary source until an official card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro demonstrates community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainer sources indicate saturation among top models in 2026 ^[12].

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a benchmark card for GPT-5.5, so any GPT-5.5 figure in this report should be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r

Sources

[1] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elo (nanonets.com)
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[2] GPT-5.5 is here: benchmarks, pricing, and what changes ... - Appwrite (appwrite.io)
GPT-5.5 is here: benchmarks, pricing, and what changes for developers. Apr 24, 202...
[3] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[6] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[7] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9. GPT-5 mini 0.22; 10. o3 0.16. GPQA (4 of 10): A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[8] SWE-bench February 2026 leaderboard update (simonwillison.net)
Here's how the top ten models performed: Image 1: Bar chart showing "% Resolved" by "Model". Bars in descending order: Claude 4.5 Opus (high reasoning) 76.8%, Gemini 3 Flash (high reasoning) 75.8%, MiniMax M2.5 (high reasoning) 75.8%, Claude Opus 4.6 75.6%,...
[9] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[11] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[12] LLM Leaderboard 2026 — Compare Top AI Models - Vellum (vellum.ai)
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[14] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance | BenchLM.ai (benchlm.ai)
BenchLM is tracking Claude Opus 4.7, but this profile is currently excluded from the public leaderboard because it still lacks enough non-generated benchmark cov...
[15] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[17] SWE-bench - Vals AI (vals.ai)
SWE-bench. Updated: 4/24/2026. Solving production software engineering tasks. Key Takeaways: Claude Opus 4.7 leads with a...
[20] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both G...
[21] WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed — Live Rankings (whatllm.org)
The ultimate LLM comparison tool. Compare price, performance, and speed across the entire AI ecosystem. Updated daily with the latest benchmarks. Top 10 Models Ranked by Quality Index across all benchmar...
[25] Add community evaluation results for GPQA, GSM8K, HLE, MMLU ... (huggingface.co)
deepseek-ai/DeepSeek-V4-Pro · Add community evaluation results for GPQA, GSM8K, HLE, MMLU-PRO, SWE-BENCH PRO, SWE-BENCH VERIFIED, TERMINAL-BENCH-2.0
[26] DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench ... - NxCode (nxcode.io)
The claimed results: Metric Standard Attention Engram (DeepSeek V4) --- Needle-in-a-Haystack (1M tokens) 84.2% accuracy 97% accuracy Context Length Supported Varies (128K typical) 1M tokens If the 97% figure holds up under independent testing, this represen...
[27] DeepSeek V4 Pro (High) Benchmarks 2026 - BenchLM.ai (benchlm.ai)
Category Performance: scores across all benchmark categories (0-100 scale). Category Breakdown: Agentic 83.8/100, Weight: 22%, 5 benchmarks: Terminal-Bench 2.0 BrowseComp OSWorld-Verified GAIA TAU-bench WebArena. Coding 7 88.8/100, Weight: 20%, 6 be...
[30] Mapping the DeepSeek V4 Evaluation Suite: A Field Guide to 2026 ... (redreamality.com)
The Takeaway The V4 scorecard confirms a pattern: for pure coding, open weights have caught up (LiveCodeBench 93.5, Codeforces 3206). For long-horizon agentic work (SWE-bench Pro, Terminal-Bench 2.0), closed frontier still leads. For frontier reasoning (HLE...
[38] Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? (arxiv.org)
… PRO, a substantially more challenging benchmark that builds … In our evaluation of widely used coding models, under a unified … Towards this end, this paper is motivated to (1) mitigate … 2025