Report published 5 May 2026 · Last edited 6 May 2026 · 20 sources

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: 2026 benchmark summary and verdict

Claude Opus 4.7 has the strongest case in coding and agentic work: Vals AI gives it 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16]. GPT-5.5 looks very strong in reasoning, with O-Mega reporting MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0%, but most of the evidence found is still...


Lining up Claude Opus 4.7, GPT-5.5, DeepSeek V4 and Kimi K2.6 in a single ranking, as if they all competed on the same field, is easily misleading, because the public data is not equally dense for every model. Claude Opus 4.7 has fairly strong signals from both its developer and external leaderboards. GPT-5.5 stands out on reasoning numbers, but most of them come from secondary sources. DeepSeek V4/V4 Pro shows good signals in coding and in open, technical deployments, but its sources mix several versions. Kimi K2.6 does not yet have enough data for a full comparison.

Short answer for executives

Model	Safest reading of the results	Evidence confidence
Claude Opus 4.7	Strongest public case in coding, agentic and multi-step work. Anthropic reports 0.715 on its internal research-agent benchmark, and Vals AI ranks it first on SWE-bench at 82.00% ^[16]^[17]	High-medium
GPT-5.5	Very strong in general reasoning. O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[3]	Medium
DeepSeek V4 / V4 Pro	Worth watching in coding and technical experimentation, but the evidence mixes V4, V4 Pro and V4 Pro High, so a score for one version should not be carried over to another ^[25]^[27]	Medium-low
Kimi K2.6	Some signals exist, such as 0.91 on GPQA in LLM Stats and a top-10 place in WhatLLM's Quality Index, but not enough for a multi-benchmark comparison ^[7]^[21]	Low

Benchmark table: what can reasonably be compared

Benchmark or metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	How to read it
SWE-bench	82.00% on Vals AI, updated 24 April 2026 ^[17]	No usable directly comparable figure found in this dataset	NxCode cites 81% for DeepSeek V4 ^[26]	No directly comparable figure found	The cleanest signal leans toward Claude
SWE-bench Verified	Vellum reports 87.6%; LMCouncil reports 83.5% ± 1.7 ^[20]^[9]	No directly comparable figure found	Hugging Face lists it in the DeepSeek-V4-Pro community evaluation set, but the summary found shows no figure ^[25]	No directly comparable figure found	Figures swing with the source, run method and model version
SWE-bench Pro	Vellum reports 64.3% ^[20]	No directly comparable figure found	Hugging Face lists it in the community evaluation set, but the summary found shows no figure ^[25]	No directly comparable figure found	Very important for long-horizon software agent work
GPQA Diamond	94.2% per O-Mega, Vellum and TNW ^[3]^[12]^[15]	93.6% per O-Mega and Vellum ^[3]^[12]	Present in the community evaluation set, but no directly comparable figure found in the summary ^[25]	0.91 in LLM Stats ^[7]	Claude and GPT-5.5 are too close to pick a winner from GPQA alone
MMLU	No directly comparable figure found	92.4% per O-Mega ^[3]	MMLU-Pro is in the community evaluation set, but no figure found in the summary ^[25]	No directly comparable figure found	Give it low weight, because MMLU is already saturated among top models
ARC-AGI	No directly comparable figure found	ARC-AGI-2 85.0%; ARC-AGI-1 95.0% per O-Mega ^[3]	No directly comparable figure found	No directly comparable figure found	Reinforces the picture that GPT-5.5 is strong at reasoning, but the source needs caution
Research-agent / multi-step work	0.715 on Anthropic's internal benchmark ^[16]	No directly comparable figure found	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[27]	No directly comparable figure found	Useful for the direction of capability, but these are not the same metric
Long context / Needle-in-a-Haystack	Anthropic states that Opus 4.7 has the most consistent long-context performance among the models it tested ^[16]	No directly comparable figure found	NxCode reports 97% at 1 million tokens, and the source itself ties the figure to pending independent validation ^[26]	No directly comparable figure found	DeepSeek's claim is strong, but it is not yet a settled conclusion
LiveCodeBench / Codeforces	No directly comparable figure found	No directly comparable figure found	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]	No directly comparable figure found	A good signal for pure coding, but it does not answer the long-horizon agentic coding question

Why the aggregate score is not enough

Each benchmark measures something different. SWE-bench measures how well a model solves real software engineering problems, and Vals AI describes it as a benchmark for solving production software engineering tasks ^[17]. SWE-bench Pro has to be read separately, because its paper describes it as a substantially harder benchmark focused on long-horizon software tasks ^[38].

GPQA Diamond is useful for measuring scientific reasoning, but scores among frontier models have become tightly clustered. TNW notes that on GPQA Diamond, models such as Opus 4.7, GPT-5.4 Pro and Gemini 3.1 Pro sit so close together that the differences fall within measurement noise ^[15]. MMLU needs even more caution: Nanonets notes that top-tier models in 2026 already score above 88%, so it cannot separate the leaders with much resolution ^[1].
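To make the "within noise" point concrete, the sketch below estimates how large the sampling error on a GPQA Diamond score could be. It rests on assumptions that are not in the cited sources: that the Diamond subset has 198 questions, that each question is an independent trial, and that each reported score comes from a single run. The function names are illustrative only.

```python
import math

# Assumption (not from the article): GPQA Diamond has 198 questions.
N_GPQA_DIAMOND = 198


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Difference between two accuracies, in units of the standard error of that difference."""
    se_diff = math.sqrt(
        standard_error(acc_a, n_questions) ** 2
        + standard_error(acc_b, n_questions) ** 2
    )
    return (acc_a - acc_b) / se_diff


claude_gpqa = 0.942  # Claude Opus 4.7, as reported in the article
gpt_gpqa = 0.936     # GPT-5.5, as reported in the article

se = standard_error(claude_gpqa, N_GPQA_DIAMOND)
z = gap_in_standard_errors(claude_gpqa, gpt_gpqa, N_GPQA_DIAMOND)
print(f"standard error per model: about {se * 100:.1f} points")
print(f"gap between the two scores: {z:.2f} standard errors")
```

Under these assumptions the 0.6-point gap is well under one standard error of the difference, which is consistent with TNW's reading that GPQA Diamond alone cannot separate the two models.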

Most importantly, the numbers do not share the same provenance. Official developer sources, independent leaderboards, aggregators and community discussions do not carry equal weight. BenchLM itself states that its Claude Opus 4.7 profile is excluded from its public leaderboard because it still lacks enough non-generated public coverage to be ranked safely ^[14]. This is a good example of how each leaderboard has its own criteria and strengths.

Claude Opus 4.7: the strongest case in coding and agentic work

Claude Opus 4.7 has the strongest public evidence base in this set. A key source is Anthropic, which states that Opus 4.7 tied for the top overall score on its internal research-agent benchmark at 0.715 and delivered the most consistent long-context performance among the models the company tested ^[16]. Because this is internal testing, it should not be read as equivalent to an independent benchmark, but it is an official signal that the model is being pushed toward multi-step work.

The clearest external signal is in software engineering. Vals AI ranks Claude Opus 4.7 first on SWE-bench with 82.00% on a page updated 24 April 2026 ^[17]. Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[20], while LMCouncil gives 83.5% ± 1.7 on SWE-bench Verified ^[9].

The careful conclusion, then, is not to pick one number and discard the rest, but to say that Claude is in the leading group, or leads outright, across several software engineering sources, while keeping in mind that SWE-bench, SWE-bench Verified and SWE-bench Pro are not the same test and that results can vary with run method, date, subset or configuration ^[17]^[20]^[38].
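As a small worked check of why the two SWE-bench Verified figures cannot simply be averaged, the sketch below tests whether Vellum's 87.6% falls inside LMCouncil's reported 83.5% ± 1.7. The sources do not say what the ± represents, so treating it as a symmetric interval is an assumption, and the helper name is hypothetical.

```python
def within_interval(value: float, center: float, half_width: float) -> bool:
    """True if value lies inside [center - half_width, center + half_width]."""
    return abs(value - center) <= half_width


# Figures as reported in the article for Claude Opus 4.7 on SWE-bench Verified.
lmcouncil_center = 83.5
lmcouncil_half_width = 1.7  # assumption: read as a symmetric interval
vellum_value = 87.6

inside = within_interval(vellum_value, lmcouncil_center, lmcouncil_half_width)
excess = vellum_value - (lmcouncil_center + lmcouncil_half_width)

print(f"Vellum figure inside LMCouncil interval: {inside}")
print(f"points above the interval's upper bound: {excess:.1f}")
```

Read this way, the Vellum figure sits outside the LMCouncil interval, which points to a difference in run method, date, subset or configuration rather than to sampling noise alone.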

On scientific reasoning, Claude Opus 4.7 scores 94.2% on GPQA Diamond according to O-Mega, Vellum and TNW ^[3]^[12]^[15]. TNW cautions, however, that frontier-model GPQA scores are very close together, so GPQA alone should not be used to pick an overall winner ^[15].

GPT-5.5: very strong reasoning, but less official evidence found

GPT-5.5 stands out in the available reasoning data. O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[3]. Vellum also lists GPT-5.5 at 93.6% on GPQA Diamond, below Claude Opus 4.7 in the same table ^[12]. BenchLM places GPT-5.5 among the top-tier models, with 89/100 on its provisional leaderboard and rank 2 of 16 on its verified leaderboard ^[6].

The caveat is traceability, meaning the ability to follow a number back to an official source. In the dataset used for this article, GPT-5.5 appears in several articles, aggregators and benchmark pages, but no official OpenAI benchmark card was found that gives a complete set of figures directly comparable to Anthropic's official material for Claude Opus 4.7. Appwrite states that GPT-5.5 launched on 24 April 2026, while Vals lists openai/gpt-5.5 with a release date of 23 April 2026 and a Vals Index of 67.76% ± 1.79, but neither source is an official OpenAI benchmark card ^[2]^[11].

At executive-summary level, GPT-5.5 should be positioned as a leading reasoning competitor, especially on GPQA and ARC-AGI, but it should not be declared the overall winner if the criterion is public evidence that is dense and directly comparable across all models ^[3]^[6]^[12].

DeepSeek V4 / V4 Pro: genuinely worth trying, but keep the versions apart

DeepSeek is the case where version names need the most care. The sources found switch between DeepSeek V4, DeepSeek V4 Pro and DeepSeek V4 Pro High, so a score for one version should not automatically be cited for another ^[25]^[26]^[27].

Hugging Face hosts a community discussion for DeepSeek-V4-Pro that adds evaluation results for GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified and Terminal-Bench 2.0 ^[25]. BenchLM reports DeepSeek V4 Pro High at 83.8/100 in the Agentic category, 88.8/100 in Coding and 72.1/100 in Knowledge ^[27]. NxCode states that DeepSeek V4 reaches 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1 million tokens, but the same source adds the condition that the 97% figure should hold up under independent testing before it is read as a strong conclusion ^[26].

Redreamality adds another positive signal for pure coding, reporting LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]. The same source also concludes that closed frontier models still lead on long-horizon agentic work such as SWE-bench Pro and Terminal-Bench 2.0 ^[30].

In practical terms, DeepSeek V4/V4 Pro deserves an internal trial, especially if the team values technical control, cost, an open ecosystem or self-managed deployment testing. On this body of evidence, however, its case is not as strong as Claude's on SWE-bench and on the internal benchmark Anthropic disclosed ^[16]^[17]^[25]^[27].

Kimi K2.6: signals exist, but not enough for a full comparison

Kimi K2.6 should not be dropped from the conversation, but it also should not be presented as if its evidence were as complete as that of the other three models. LLM Stats lists Kimi K2.6 at 0.91 on GPQA, and WhatLLM includes Kimi K2.6 in its top 10 models by Quality Index ^[7]^[21]. These signals show that some benchmarking exists, but not enough for a multi-dimensional comparison across SWE-bench, GPQA, ARC-AGI, long context and agentic work.

Another thing to avoid is quietly substituting Kimi K2.5 for Kimi K2.6. Simon Willison recorded Kimi K2.5 results on SWE-bench Verified in February 2026, but that data belongs to a different model version ^[8]. For a rigorous presentation, Kimi K2.6 should be labeled "evidence insufficient" or as awaiting confirmation from multiple benchmarks.

Ranking by use case

Use case	Recommendation	Confidence	Rationale
Fixing real issues and agentic coding	Claude Opus 4.7	High-medium	Vals AI has Claude leading SWE-bench at 82.00%, and Vellum reports strong results on both SWE-bench Verified and SWE-bench Pro ^[17]^[20]
Research-agent and multi-step work	Claude Opus 4.7	Medium	Anthropic reports 0.715 on its internal benchmark and the best long-context consistency among the models it tested ^[16]
GPQA-style scientific reasoning	Claude Opus 4.7 or GPT-5.5	Medium	Claude is at 94.2% and GPT-5.5 at 93.6%; the gap is small, and GPQA scores are tightly clustered among frontier models ^[3]^[12]^[15]
General reasoning across many tasks	GPT-5.5	Medium-low	The MMLU, GPQA and ARC-AGI figures are very strong, but the main sources found are O-Mega, Vellum, BenchLM and other aggregators ^[3]^[6]^[12]
Technical experimentation, self-managed control, or exploring an open ecosystem	DeepSeek V4 / V4 Pro	Medium-low	Signals come from Hugging Face, BenchLM, NxCode and Redreamality, but versions are mixed and the team must validate results itself ^[25]^[26]^[27]^[30]
Full numeric ranking across every dimension	Do not yet use Kimi K2.6 as a confirmed comparable	Low	Only partial signals exist, such as GPQA 0.91 in LLM Stats, and coverage is not yet on par ^[7]^[21]

If you are building a slide deck, how should you tell the story?

The safe way to present this is to separate performance from evidence quality. Do not put every score on a single chart and declare an overall winner, because that squeezes different benchmarks and differently weighted sources into a false equivalence.

A good deck needs three slides. The first is a ranking by use case, such as coding, reasoning, agentic work and long context. The second is the table of figures with citations. The third covers the limitations of the measurement methods. The core message should be clear: Claude Opus 4.7 is the leader with the strongest evidence in coding and agentic work, GPT-5.5 is a very strong competitor in reasoning, DeepSeek V4/V4 Pro is a technical option worth trialing but one you must validate yourself, and Kimi K2.6 still needs more data.

The deck should carry at least three warnings. First, do not merge SWE-bench, SWE-bench Verified and SWE-bench Pro into one test, because SWE-bench Pro is designed to be harder and focuses on long-horizon software engineering work ^[38]. Second, do not decide primarily on MMLU, because top models in 2026 are already clustered above 88% ^[1]. Third, label every number with its source type, such as official, leaderboard, aggregator, community or claim.
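One way to enforce the first and third warnings when assembling the figures table is to store each number with its benchmark variant and source type, and to refuse comparisons across variants. The sketch below is illustrative only: the `Score` record, the `comparable` and `label` helpers, and the source-type assigned to each entry are assumptions made for this example, not a classification taken from the cited sources.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    OFFICIAL = "official"
    LEADERBOARD = "leaderboard"
    AGGREGATOR = "aggregator"
    COMMUNITY = "community"
    CLAIM = "claim"


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # exact variant name, e.g. "SWE-bench Verified"
    value: float    # percentage
    source: str
    source_type: SourceType


def comparable(a: Score, b: Score) -> bool:
    """Only scores on the exact same benchmark variant may be compared."""
    return a.benchmark == b.benchmark


def label(score: Score) -> str:
    """Render a number together with its provenance, as a slide footnote would."""
    return f"{score.value:.1f}% ({score.source}, {score.source_type.value})"


# Figures from the article; the source-type assignments are assumptions.
scores = [
    Score("Claude Opus 4.7", "SWE-bench", 82.0, "Vals AI", SourceType.LEADERBOARD),
    Score("Claude Opus 4.7", "SWE-bench Verified", 87.6, "Vellum", SourceType.AGGREGATOR),
    Score("Claude Opus 4.7", "SWE-bench Pro", 64.3, "Vellum", SourceType.AGGREGATOR),
    Score("DeepSeek V4", "SWE-bench", 81.0, "NxCode", SourceType.CLAIM),
]

for s in scores:
    print(f"{s.model} | {s.benchmark} | {label(s)}")

# Same variant: may be placed side by side. Different variants: may not.
print(comparable(scores[0], scores[3]))
print(comparable(scores[0], scores[1]))
```

Keeping the variant name as part of the record means a chart-building step can group by `benchmark` first and never plots a SWE-bench Pro number next to a SWE-bench Verified one by accident.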

Conclusion

If the goal is to choose a model for a report or presentation that has to stand up to questions, Claude Opus 4.7 should be placed first for coding and agentic work, because it has an official Anthropic source, the lead position on Vals SWE-bench, and strong third-party results across several SWE-bench variants ^[16]^[17]^[20].

GPT-5.5 should be positioned as a top-tier reasoning competitor with very strong GPQA, MMLU and ARC-AGI figures, but it must be made clear that most of the evidence found still comes from secondary sources or aggregators ^[3]^[6]^[12]. DeepSeek V4/V4 Pro should be taken into internal testing rather than declared a leader on the basis of this data ^[25]^[26]^[27]^[30]. For Kimi K2.6, the right label for now is that the evidence is not yet sufficient for a complete comparison ^[7]^[21].


Key takeaways

Claude Opus 4.7 has the strongest case in coding and agentic work: Vals AI gives it 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16]
GPT-5.5 looks very strong in reasoning, with O-Mega reporting MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0%, but most of the evidence found still comes from secondary sources or aggregators [3]
DeepSeek V4/V4 Pro suits technical experimentation, but watch for mixing of V4, V4 Pro and V4 Pro High. Kimi K2.6 has only partial signals so far, such as GPQA 0.91 in LLM Stats [25][27][7]


Research conversation

You · 26 Apr 2026

Research question

Research the benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4 and Kimi K2.6, and compare them as completely as possible in research-report format for a presentation.

Studio Global AI · 26 Apr 2026 · 40 sources

Cited answer

The most honest comparison is a partial one: Claude Opus 4.7 is the model with the best public support for coding and agent benchmarks, GPT-5.5 looks strong in reasoning according to secondary sources, DeepSeek V4/V4 Pro shows competitive signals but with more community-based, provisional evidence, and for Kimi K2.6 I found no sufficient citable evidence. For an executive presentation, I recommend showing Claude as the best-supported leader on SWE-bench, GPT-5.5 as a strong but less publicly verifiable competitor, DeepSeek as a promising open alternative pending validation, and Kimi K2.6 as "Insufficient evidence".

Key findings

Claude Opus 4.7 has the most solid evidence in the set because it appears in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as SWE-bench/Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated 24 April 2026 ^[2].
Sources conflict on Claude's SWE-bench performance: Vals AI reports 82.00%, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that differences between frontier models on that benchmark are within measurement noise ^[3].
GPT-5.5 has strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].
I found no official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as "provisional / secondary".
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base "DeepSeek V4" ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines a Hugging Face community discussion, aggregators and technical blogs, so its comparability with Claude and GPT-5.5 is lower ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as if it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, traceable technical discussions and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four levels: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant oriented toward long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer indicates that in 2026 top models exceed 88% and the benchmark is heavily saturated ^[12].

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	High-medium	Internal research-agent score of 0.715 and strong long-context performance per Anthropic; SWE-bench 82.00% per Vals AI; GPQA Diamond 94.2% per TNW ^[1]^[2]^[3]	Best candidate to present as the supported leader in coding/agent work, with caution over differences between sources ^[2]^[15]
GPT-5.5	Medium-low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% per O-Mega ^[4]	Very strong in reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium-low	BenchLM reports DeepSeek V4 Pro High with Agentic 83.8/100 and Coding 88.8/100; NxCode cites 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as a claimed result ^[7]^[8]	Competitive alternative, especially if an open/local ecosystem is valued, but it requires independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficient citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numeric benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% per Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% per Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% per Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% per TNW and O-Mega ^[3]^[4]	93.6% per O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% per O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic claims very consistent long-context performance ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, conditional on independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model in the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Anthropic states that Opus 4.7 tied for the best overall result on its internal research-agent benchmark at 0.715 and showed the most consistent long-context performance among the models evaluated ^[1].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated 24 April 2026 ^[2].

Vellum reports higher figures for Claude, with 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[15].

The difference between 82.00% and 87.6% should be treated as a discrepancy of methodology, subset or configuration, not as a single confirmed improvement ^[2]^[15].

On scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and adds the context that frontier models are very close to one another on that benchmark ^[3].

GPT-5.5

GPT-5.5 looks very strong in general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].

Appwrite published an article titled "GPT-5.5 is here", focused on benchmarks, pricing and changes for developers, on 24 April 2026, but it is a secondary source and not an official OpenAI card ^[5].

The retrieved evidence does not allow the GPT-5.5 benchmarks to be confirmed against an official OpenAI source, so its scores should be labeled "third-party / not officially verified".

For a presentation, GPT-5.5 can be positioned as a very strong competitor in reasoning, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

The retrieved evidence for DeepSeek is concentrated on variants such as DeepSeek V4 Pro and DeepSeek V4 Pro High, which prevents automatically assuming that the figures represent the base DeepSeek V4 model ^[6]^[8].

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

NxCode presents DeepSeek V4 as a 1T-parameter model with 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens, but the summary itself indicates that the 97% figure must hold up under independent testing to be conclusive ^[7].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, and at the same time states that closed models still lead on long-horizon agentic tasks ^[9].

For a presentation, DeepSeek should be shown as a "promising technical competitor" and not as a validated leader, unless the team has reproducible internal benchmarks ^[6]^[7]^[8]^[9].

Kimi K2.6

I found no sufficient citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and degrade the validity of the comparison.

The safest way to present it is as "pending validation" or "Insufficient evidence".

Ranking by usage scenario

Scenario	Recommended model	Justification
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader with 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude appears at 94.2% on GPQA Diamond, while GPT-5.5 appears at 93.6% in O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model in the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Inclusion of Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated 24 April 2026 ^[2].
Vellum provides more detail for SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega provides several useful figures for GPT-5.5, but it should be treated as a secondary source until an official card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro demonstrates community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainers indicate saturation among top models in 2026 ^[12].

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a GPT-5.5 benchmark card, so any GPT-5.5 figure in this report should be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r

Sources

[1] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elonanonets.com
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[2] GPT-5.5 is here: benchmarks, pricing, and what changes ... - Appwriteappwrite.io
Star on GitHub 55.8KGo to Console Start building for free Sign upGo to Console Start building for free Products Docs Pricing Customers Blog Changelog Star on GitHub 55.8K Blog/GPT-5.5 is here: benchmarks, pricing, and what changes for developers Apr 24, 202...
[3] GPT-5.5: The Complete Guide (2026) - o-mega | AIo-mega.ai
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[6] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[7] GPT-5.5: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[8] SWE-bench February 2026 leaderboard updatesimonwillison.net
Here's how the top ten models performed: Image 1: Bar chart showing "% Resolved" by "Model". Bars in descending order: Claude 4.5 Opus (high reasoning) 76.8%, Gemini 3 Flash (high reasoning) 75.8%, MiniMax M2.5 (high reasoning) 75.8%, Claude Opus 4.6 75.6%,...
[9] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ...lmcouncil.ai
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[11] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[12] LLM Leaderboard 2026 - Compare Top AI Models - Vellum (vellum.ai)
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[14] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance | BenchLM.ai (benchlm.ai)
BenchLM is tracking Claude Opus 4.7, but this profile is currently excluded from the public leaderboard because it still lacks enough non-generated benchmark cov...
[15] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[17] SWE-bench - Vals AI (vals.ai)
SWE-bench. Updated: 4/24/2026. Solving production software engineering tasks. Key Takeaways: Claude Opus 4.7 leads with a...
[20] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both G...
[21] WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed - Live Rankings (whatllm.org)
The ultimate LLM comparison tool. Compare price, performance, and speed across the entire AI ecosystem. Updated daily with the latest benchmarks. Top 10 Models Ranked by Quality Index across all benchmar...
[25] Add community evaluation results for GPQA, GSM8K, HLE, MMLU ... (huggingface.co)
deepseek-ai/DeepSeek-V4-Pro · Add community evaluation results for GPQA, GSM8K, HLE, MMLU-PRO, SWE-BENCH PRO, SWE-BENCH VERIFIED, TERMINAL-BENCH-2.0
[26] DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench ... - NxCode (nxcode.io)
The claimed results: Metric Standard Attention Engram (DeepSeek V4) --- Needle-in-a-Haystack (1M tokens) 84.2% accuracy 97% accuracy Context Length Supported Varies (128K typical) 1M tokens If the 97% figure holds up under independent testing, this represen...
[27] DeepSeek V4 Pro (High) Benchmarks 2026 - BenchLM.ai (benchlm.ai)
Scores across all benchmark categories (0-100 scale) Category Breakdown Agentic 83.8/100 Weight: 22% 5 benchmarks Terminal-Bench 2.0 BrowseComp OSWorld-Verified GAIA TAU-bench WebArena Coding 7 88.8/100 Weight: 20% 6 be...
[30] Mapping the DeepSeek V4 Evaluation Suite: A Field Guide to 2026 ... (redreamality.com)
The Takeaway The V4 scorecard confirms a pattern: for pure coding, open weights have caught up (LiveCodeBench 93.5, Codeforces 3206). For long-horizon agentic work (SWE-bench Pro, Terminal-Bench 2.0), closed frontier still leads. For frontier reasoning (HLE...
[38] Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? (arxiv.org)
… PRO, a substantially more challenging benchmark that builds … In our evaluation of widely used coding models, under a unified … Towards this end, this paper is motivated to (1) mitigate … 2025


Executive summary

Model	Safest reading of the results	Evidence confidence
Claude Opus 4.7	Strongest public case in coding, agentic, and multi-step work. Anthropic reports 0.715 on its internal research-agent benchmark, and Vals AI ranks it first on SWE-bench at 82.00% ^[16]^[17]	High-medium
GPT-5.5	Very strong in general reasoning. O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[3]	Medium
DeepSeek V4 / V4 Pro	Worth watching for coding and technical experimentation, but the evidence mixes V4, V4 Pro, and V4 Pro High, so a score for one version should not be substituted directly for another ^[25]^[27]	Medium-low
Kimi K2.6	Some signals exist, such as LLM Stats giving 0.91 on GPQA and WhatLLM placing it in the top 10 of its Quality Index, but not enough for a multi-benchmark comparison ^[7]^[21]	Low

Benchmarks that are reasonably comparable

Benchmark or metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	How to interpret
SWE-bench	82.00% on Vals AI, updated April 24, 2026 ^[17]	No usable directly comparable figure found in this dataset	NxCode cites 81% for DeepSeek V4 ^[26]	No directly comparable figure found	The cleanest signal leans toward Claude
SWE-bench Verified	Vellum reports 87.6%; LMCouncil reports 83.5% ± 1.7 ^[20]^[9]	No directly comparable figure found	Hugging Face lists it in the DeepSeek-V4-Pro community evaluation set, but the retrieved summary shows no figure ^[25]	No directly comparable figure found	Figures vary by source, run method, and model version
SWE-bench Pro	Vellum reports 64.3% ^[20]	No directly comparable figure found	Hugging Face lists it in the community evaluation set, but the retrieved summary shows no figure ^[25]	No directly comparable figure found	Very important for long-horizon software-agent work
GPQA Diamond	94.2% per O-Mega, Vellum, and TNW ^[3]^[12]^[15]	93.6% per O-Mega and Vellum ^[3]^[12]	Included in the community evaluation set, but the summary shows no directly comparable figure ^[25]	0.91 on LLM Stats ^[7]	Claude and GPT-5.5 are too close to pick a winner from GPQA alone
MMLU	No directly comparable figure found	92.4% per O-Mega ^[3]	MMLU-Pro is in the community evaluation set, but the summary shows no figure ^[25]	No directly comparable figure found	Give it little weight, because MMLU is already saturated among top models
ARC-AGI	No directly comparable figure found	ARC-AGI-2 85.0%; ARC-AGI-1 95.0% per O-Mega ^[3]	No directly comparable figure found	No directly comparable figure found	Reinforces the picture that GPT-5.5 is strong at reasoning, but treat the source with caution
Research-agent / multi-step work	0.715 on Anthropic's internal benchmark ^[16]	No directly comparable figure found	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[27]	No directly comparable figure found	Useful for gauging the direction of capability, but these are not the same metric
Long context / Needle-in-a-Haystack	Anthropic states that Opus 4.7 had the most consistent long-context performance among the models it tested ^[16]	No directly comparable figure found	NxCode reports 97% at 1 million tokens, and the source itself makes the figure conditional on independent validation ^[26]	No directly comparable figure found	DeepSeek's claim is strong but not yet conclusive
LiveCodeBench / Codeforces	No directly comparable figure found	No directly comparable figure found	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[30]	No directly comparable figure found	A good signal for pure coding, but it does not settle long-horizon agentic coding

Why the overall score alone is not enough

Claude Opus 4.7: the strongest case in coding and agentic work

GPT-5.5: very strong reasoning, but less official evidence found

DeepSeek V4 / V4 Pro: genuinely worth trying, but keep the versions separate

Kimi K2.6: some signals, but not enough for a full comparison

Ranking by use case

Use case	Recommendation	Confidence	Rationale
Fixing real issues and agentic coding	Claude Opus 4.7	High-medium	Vals AI has Claude leading SWE-bench at 82.00%, and Vellum reports strong results on both SWE-bench Verified and SWE-bench Pro ^[17]^[20]
Research-agent and multi-step work	Claude Opus 4.7	Medium	Anthropic reports 0.715 on its internal benchmark and the best long-context consistency among the models the company tested ^[16]
GPQA-style scientific reasoning	Claude Opus 4.7 or GPT-5.5	Medium	Claude is at 94.2% and GPT-5.5 at 93.6%; the gap is small, and frontier models are bunching up on GPQA ^[3]^[12]^[15]
General reasoning across many tasks	GPT-5.5	Medium-low	The MMLU, GPQA, and ARC-AGI figures are very strong, but the main sources found are O-Mega, Vellum, BenchLM, and other aggregators ^[3]^[6]^[12]
Technical experimentation, self-managed control, or exploring the open ecosystem	DeepSeek V4 / V4 Pro	Medium-low	Signals come from Hugging Face, BenchLM, NxCode, and Redreamality, but versions are mixed and you need to validate results yourself ^[25]^[26]^[27]^[30]
Complete numeric ranking across all dimensions	Kimi K2.6 should not yet be used as a confirmed comparable	Low	Only partial signals exist, such as GPQA 0.91 on LLM Stats, and coverage is not yet on par with the other models ^[7]^[21]

How to tell the story in a slide deck

Conclusion


Key takeaways

Claude Opus 4.7 has the strongest case in coding and agentic work: Vals AI gives it 82.00% on SWE-bench, and Anthropic reports 0.715 on its internal research-agent benchmark [17][16]
GPT-5.5 looks very strong in reasoning, with O-Mega reporting MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0%, but most of the evidence found is still from secondary sources or aggregators [3]
DeepSeek V4/V4 Pro suits technical experimentation, but watch for mixing of V4, V4 Pro, and V4 Pro High figures. Kimi K2.6 still has only partial signals, such as GPQA 0.91 on LLM Stats [25][27][7]


Research conversation

You · Apr 26, 2026

Research question

Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6, and compare them as thoroughly as possible in the form of a research report for presentation.

Studio Global AI · Apr 26, 2026 · 40 sources

Key findings

Claude Opus 4.7 has the strongest evidence of the set: it appears in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as the SWE-bench leaderboard on Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated April 24, 2026 ^[2].
Sources report different figures for Claude across the SWE-bench family: Vals AI reports 82.00% on SWE-bench, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that differences among frontier models on this benchmark are within measurement noise ^[3].
GPT-5.5 has strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].
I found no official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as "provisional / secondary".
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base "DeepSeek V4" ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines a Hugging Face community discussion, aggregators, and technical blogs, so it is less comparable with the evidence for Claude and GPT-5.5 ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as though it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, traceable technical discussions, and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four tiers: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified, and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant aimed at long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer indicates that in 2026 top models exceed 88% and the benchmark is heavily saturated ^[12].
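The four-tier classification above can be made explicit when assembling a comparison table. The following is a minimal Python sketch, not the tooling used for this report: the `Tier`, `Claim`, and `strongest_tier_by_model` names are hypothetical, and the tier assigned to each source is an illustrative reading of how the report describes it.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Evidence tiers from the methodology; a lower value means stronger evidence."""
    OFFICIAL = 1
    INDEPENDENT_BENCHMARK = 2
    AGGREGATOR_OR_COMMUNITY = 3
    SECONDARY_BLOG = 4


@dataclass(frozen=True)
class Claim:
    model: str
    benchmark: str
    value: str  # kept as reported text, because scales differ between sources
    source: str
    tier: Tier


# Figures are the ones quoted in this report; tier assignments are illustrative.
CLAIMS = [
    Claim("Claude Opus 4.7", "Research-agent (internal)", "0.715", "Anthropic", Tier.OFFICIAL),
    Claim("Claude Opus 4.7", "SWE-bench", "82.00%", "Vals AI", Tier.INDEPENDENT_BENCHMARK),
    Claim("GPT-5.5", "GPQA Diamond", "93.6%", "O-Mega", Tier.SECONDARY_BLOG),
    Claim("DeepSeek V4 Pro High", "Agentic category", "83.8/100", "BenchLM", Tier.AGGREGATOR_OR_COMMUNITY),
    Claim("DeepSeek V4", "SWE-bench", "81%", "NxCode", Tier.SECONDARY_BLOG),
]


def strongest_tier_by_model(claims):
    """Return the strongest (lowest-valued) tier seen for each model name."""
    best = {}
    for claim in claims:
        if claim.model not in best or claim.tier < best[claim.model]:
            best[claim.model] = claim.tier
    return best


for model, tier in strongest_tier_by_model(CLAIMS).items():
    print(f"{model}: {tier.name}")
```

Keying on the exact model name keeps DeepSeek V4 and DeepSeek V4 Pro High as separate rows, which matches the report's warning against carrying a score from one version over to another.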

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	High-medium	Internal research-agent score of 0.715 and strong long-context performance per Anthropic; SWE-bench 82.00% per Vals AI; GPQA Diamond 94.2% per TNW ^[1]^[2]^[3]	Best candidate to present as the evidence-backed leader in coding and agentic work, with caution over differences between sources ^[2]^[15]
GPT-5.5	Medium-low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% per O-Mega ^[4]	Very strong in reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium-low	BenchLM reports DeepSeek V4 Pro High at Agentic 83.8/100 and Coding 88.8/100; NxCode cites 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as claimed results ^[7]^[8]	Competitive alternative, especially if an open or local ecosystem matters, but it needs independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficiently citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numeric benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% per Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% per Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% per Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% per TNW and O-Mega ^[3]^[4]	93.6% per O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% per O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% per O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic states very consistent long-context performance ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, conditional on independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model of the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated April 24, 2026 ^[2].

Vellum reports higher figures for Claude, with 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[15].

The difference between 82.00% and 87.6% should be treated as a discrepancy in methodology, subset, or configuration, not as a single confirmed improvement ^[2]^[15].

In scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and notes that frontier models are very close to one another on that benchmark ^[3].
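A rough calculation shows why a gap of this size is hard to read as a win. The sketch below uses Python and the GPQA Diamond figures quoted in this report (94.2% for Claude Opus 4.7, 93.6% for GPT-5.5). It assumes a 198-question test set, a figure that does not appear in the retrieved sources, and it treats the two scores as independent binomial samples, which is a simplification because both models answer the same questions.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent questions."""
    return math.sqrt(p * (1.0 - p) / n)


N_QUESTIONS = 198  # assumed size of the GPQA Diamond subset; not stated in the sources

claude, gpt = 0.942, 0.936  # figures quoted in this report
gap = claude - gpt

# Standard error of the difference, treating the two runs as independent samples.
se_gap = math.sqrt(
    binomial_se(claude, N_QUESTIONS) ** 2 + binomial_se(gpt, N_QUESTIONS) ** 2
)

print(f"gap = {gap * 100:.1f} pp, standard error of the gap = {se_gap * 100:.1f} pp")
```

Under these assumptions the 0.6-point gap is about a quarter of its own standard error (roughly 2.4 points), which is consistent with TNW's reading that the differences are within noise.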

GPT-5.5

GPT-5.5 looks very strong in general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0%, and ARC-AGI-1 95.0% ^[4].

For a presentation, GPT-5.5 can be positioned as a very strong competitor in reasoning, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, and at the same time states that closed models still lead on long-horizon agentic tasks ^[9].

Kimi K2.6

I found no sufficiently citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and weaken the validity of the comparison.

The safest way to present it is as "pending validation" or "Insufficient evidence".

Ranking by use scenario

Scenario	Recommended model	Justification
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader at 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude appears at 94.2% on GPQA Diamond, while GPT-5.5 appears at 93.6% in O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model of the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Including Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated April 24, 2026 ^[2].
Vellum adds more detail on SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega supplies several useful figures for GPT-5.5, but it should be treated as a secondary source until an official card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro demonstrates community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainers indicate saturation among top models in 2026 ^[12].

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a benchmark card for GPT-5.5, so any GPT-5.5 figure in this report should be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r…


ค้นพบเทรนด์

รายงานเผยแพร่แล้ว5 พ.ค. 2026Last edited 6 พ.ค. 202620 แหล่งที่มา

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: สรุปเบนช์มาร์กปี 2026 และคำตัดสิน

ค้นหาและตรวจสอบข้อเท็จจริงด้วย Studio Global AI เรียกดูเพิ่มเติมจาก Discover

3.8K0

คำตอบสั้นสำหรับผู้บริหาร

โมเดล	อ่านผลอย่างไรจึงจะปลอดภัยที่สุด	ความมั่นใจของหลักฐาน
Claude Opus 4.7	เคสสาธารณะที่แข็งที่สุดใน coding, agentic และงานหลายขั้นตอน Anthropic รายงาน 0.715 ใน benchmark ภายในแบบ research-agent และ Vals AI จัดให้เป็นอันดับหนึ่งใน SWE-bench ที่ 82.00% ^[16]^[17]	สูง-กลาง
GPT-5.5	แข็งมากใน reasoning ทั่วไป O-Mega รายงาน MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% และ ARC-AGI-1 95.0% ^[3]	กลาง
DeepSeek V4 / V4 Pro	น่าจับตาใน coding และการทดลองเชิงเทคนิค แต่หลักฐานปนระหว่าง V4, V4 Pro และ V4 Pro High จึงไม่ควรยกคะแนนของเวอร์ชันหนึ่งไปแทนอีกเวอร์ชันโดยตรง ^[25]^[27]	กลาง-ต่ำ
Kimi K2.6	มีสัญญาณบางส่วน เช่น LLM Stats ให้ 0.91 ใน GPQA และ WhatLLM นำไปไว้ใน top 10 ของ Quality Index แต่ยังไม่พอสำหรับการเทียบหลาย benchmark ^[7]^[21]	ต่ำ

ตาราง benchmark ที่พอเทียบกันได้

Benchmark หรือเมตริก	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6	ควรตีความอย่างไร
SWE-bench	82.00% ใน Vals AI อัปเดต 24 เมษายน 2026 ^[17]	ไม่พบตัวเลขเทียบตรงที่น่าใช้ในชุดข้อมูลนี้	NxCode อ้าง 81% สำหรับ DeepSeek V4 ^[26]	ไม่พบตัวเลขเทียบตรง	สัญญาณที่สะอาดที่สุดเอนเข้าหา Claude
SWE-bench Verified	Vellum รายงาน 87.6%; LMCouncil รายงาน 83.5% ± 1.7 ^[20]^[9]	ไม่พบตัวเลขเทียบตรง	Hugging Face ระบุว่าอยู่ในชุดประเมินชุมชนของ DeepSeek-V4-Pro แต่สรุปที่พบไม่โชว์ตัวเลข ^[25]	ไม่พบตัวเลขเทียบตรง	ตัวเลขแกว่งตามแหล่งข้อมูล วิธีรัน และเวอร์ชันโมเดล
SWE-bench Pro	Vellum รายงาน 64.3% ^[20]	ไม่พบตัวเลขเทียบตรง	Hugging Face ระบุว่าอยู่ในชุดประเมินชุมชน แต่สรุปที่พบไม่โชว์ตัวเลข ^[25]	ไม่พบตัวเลขเทียบตรง	สำคัญมากสำหรับงาน software agent ระยะยาว
GPQA Diamond	94.2% ตาม O-Mega, Vellum และ TNW ^[3]^[12]^[15]	93.6% ตาม O-Mega และ Vellum ^[3]^[12]	มีอยู่ในชุดประเมินชุมชน แต่ไม่พบตัวเลขเทียบตรงในสรุป ^[25]	0.91 ใน LLM Stats ^[7]	Claude กับ GPT-5.5 ใกล้กันมากเกินกว่าจะตัดสินผู้ชนะจาก GPQA อย่างเดียว
MMLU	ไม่พบตัวเลขเทียบตรง	92.4% ตาม O-Mega ^[3]	MMLU-Pro อยู่ในชุดประเมินชุมชน แต่ไม่พบตัวเลขในสรุป ^[25]	ไม่พบตัวเลขเทียบตรง	ควรให้น้ำหนักน้อย เพราะ MMLU อิ่มตัวในกลุ่มโมเดลบนสุดแล้ว
ARC-AGI	ไม่พบตัวเลขเทียบตรง	ARC-AGI-2 85.0%; ARC-AGI-1 95.0% ตาม O-Mega ^[3]	ไม่พบตัวเลขเทียบตรง	ไม่พบตัวเลขเทียบตรง	เสริมภาพว่า GPT-5.5 แข็งด้าน reasoning แต่ต้องระวังแหล่งข้อมูล
Research-agent / งานหลายขั้นตอน	0.715 ใน benchmark ภายในของ Anthropic ^[16]	ไม่พบตัวเลขเทียบตรง	BenchLM รายงานหมวด Agentic 83.8/100 สำหรับ DeepSeek V4 Pro High ^[27]	ไม่พบตัวเลขเทียบตรง	ใช้ดูทิศทางความสามารถได้ แต่ไม่ใช่เมตริกเดียวกัน
Long context / Needle-in-a-Haystack	Anthropic ระบุว่า Opus 4.7 มี long-context ที่สม่ำเสมอที่สุดในกลุ่มโมเดลที่ทดสอบ ^[16]	ไม่พบตัวเลขเทียบตรง	NxCode รายงาน 97% ที่ 1 ล้านโทเคน โดยแหล่งข้อมูลเองยังผูกกับการรอ validation อิสระ ^[26]	ไม่พบตัวเลขเทียบตรง	DeepSeek มี claim ที่แรง แต่ยังไม่ใช่ข้อสรุปปิดเกม
LiveCodeBench / Codeforces	ไม่พบตัวเลขเทียบตรง	ไม่พบตัวเลขเทียบตรง	Redreamality รายงาน LiveCodeBench 93.5 และ Codeforces 3206 สำหรับ DeepSeek V4 ^[30]	ไม่พบตัวเลขเทียบตรง	เป็นสัญญาณดีด้าน coding ล้วน แต่ยังไม่ตอบเรื่อง agentic coding ระยะยาว

ทำไมไม่ควรดูแค่คะแนนรวม

Claude Opus 4.7: เคสที่แน่นที่สุดใน coding และงานแบบเอเจนต์

GPT-5.5: reasoning แข็งมาก แต่หลักฐานทางการที่พบยังน้อยกว่า

DeepSeek V4 / V4 Pro: น่าลองจริง แต่ต้องแยกเวอร์ชันให้ชัด

Kimi K2.6: มีสัญญาณ แต่ยังไม่พอสำหรับการเทียบเต็มรูปแบบ

จัดอันดับตามกรณีใช้งาน

กรณีใช้งาน	แนะนำ	ความมั่นใจ	เหตุผล
แก้ issue จริงและ coding แบบ agentic	Claude Opus 4.7	สูง-กลาง	Vals AI ให้ Claude นำ SWE-bench ที่ 82.00% และ Vellum รายงานว่าแข็งทั้ง SWE-bench Verified และ SWE-bench Pro ^[17]^[20]
งาน research-agent และงานหลายขั้นตอน	Claude Opus 4.7	กลาง	Anthropic รายงาน 0.715 ใน benchmark ภายใน และ long-context consistency ดีที่สุดในกลุ่มที่บริษัททดสอบ ^[16]
reasoning วิทยาศาสตร์แบบ GPQA	Claude Opus 4.7 หรือ GPT-5.5	กลาง	Claude อยู่ที่ 94.2% ส่วน GPT-5.5 อยู่ที่ 93.6%; ความต่างเล็ก และ GPQA เริ่มเบียดกันมากในกลุ่ม frontier ^[3]^[12]^[15]
reasoning ทั่วไปหลายโจทย์	GPT-5.5	กลาง-ต่ำ	ตัวเลข MMLU, GPQA และ ARC-AGI แข็งมาก แต่แหล่งที่พบหลัก ๆ คือ O-Mega, Vellum, BenchLM และ aggregator อื่น ^[3]^[6]^[12]
ทดลองเชิงเทคนิค ควบคุมเอง หรือสำรวจ ecosystem เปิด	DeepSeek V4 / V4 Pro	กลาง-ต่ำ	มีสัญญาณจาก Hugging Face, BenchLM, NxCode และ Redreamality แต่ยังปนเวอร์ชันและต้อง validation เอง ^[25]^[26]^[27]^[30]
ranking เชิงตัวเลขครบทุกมิติ	ยังไม่ควรใช้ Kimi K2.6 เป็น comparable ที่ยืนยันแล้ว	ต่ำ	มีเพียงสัญญาณบางส่วน เช่น GPQA 0.91 ใน LLM Stats แต่ coverage ยังไม่เทียบเท่า ^[7]^[21]

ถ้าจะทำสไลด์เสนอ ควรเล่าแบบไหน

บทสรุป

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

ค้นหาและตรวจสอบข้อเท็จจริงด้วย Studio Global AI

ประเด็นสำคัญ

Claude Opus 4.7 มีเคสที่แน่นที่สุดในงาน coding และ agentic: Vals AI ให้ 82.00% บน SWE bench ส่วน Anthropic รายงาน 0.715 ใน benchmark ภายในด้าน research agent [17][16]
GPT 5.5 ดูแข็งมากใน reasoning โดย O Mega รายงาน MMLU 92.4%, GPQA Diamond 93.6%, ARC AGI 2 85.0% และ ARC AGI 1 95.0% แต่หลักฐานที่พบส่วนใหญ่ยังเป็นแหล่งรองหรือ aggregator [3]
DeepSeek V4/V4 Pro เหมาะกับการทดลองเชิงเทคนิค แต่ต้องระวังการปนกันของ V4, V4 Pro และ V4 Pro High ส่วน Kimi K2.6 ยังมีเพียงสัญญาณบางส่วน เช่น GPQA 0.91 ใน LLM Stats [25][27][7]

คนยังถาม

คำตอบสั้น ๆ สำหรับ "Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: สรุปเบนช์มาร์กปี 2026 และคำตัดสิน" คืออะไร

ประเด็นสำคัญที่ต้องตรวจสอบก่อนคืออะไร?

ฉันควรทำอย่างไรต่อไปในทางปฏิบัติ?

ฉันควรสำรวจหัวข้อที่เกี่ยวข้องใดต่อไป

เปิดหน้าที่เกี่ยวข้อง

ฉันควรเปรียบเทียบสิ่งนี้กับอะไร?

ตรวจสอบคำตอบนี้กับ "โจทย์ความยืดหยุ่น Northwest vs. Southeast Timber: ทำไมคำตอบคือ larger; larger"

เปิดหน้าที่เกี่ยวข้อง

ทำการวิจัยต่อ

DeepSeek V4: 1M context, MoE และเช็กลิสต์ย้าย API สำหรับนักพัฒนา

DeepSeek V4 ไม่ใช่แค่โมเดล 1M context

โจทย์ความยืดหยุ่น Northwest vs. Southeast Timber: ทำไมคำตอบคือ larger; larger

Northwest กับ Southeast Timber: ทำไมแทนกันได้มากขึ้นแล้วต้องตอบ larger; larger

DeepSeek จะโค่น OpenAI, Claude, Gemini และ Grok ได้ไหม? เกมจริงอยู่ที่ต้นทุนและความไว้วางใจ

กรณี PocketOS: ไม่ใช่แค่ “AI ลบฐานข้อมูล” แต่คือบทเรียนเรื่องสิทธิ์และแบ็กอัพ

กรณี PocketOS: สิ่งที่รายงาน Claude/Cursor บอกเราจริง ๆ

การสนทนาการวิจัย

คุณ26 เม.ย. 2026

คำถามวิจัย

Investiga los benchmarks de Claude Opus 4.7, GPT-5.5, DeepSeek V4 y Kimi K2.6, y compáralos de la forma más completa posible en formato de informe de investigación para su presentación.

Studio Global AI26 เม.ย. 202640 แหล่งที่มา

อ้างถึงคำตอบ

Key findings

Claude Opus 4.7 has the strongest evidence in the set because it appears in an official Anthropic source with an internal research-agent benchmark and in external coding sources such as SWE-bench/Vals AI ^[1]^[2].
On SWE-bench, Vals AI reports that Claude Opus 4.7 leads with 82.00% on a page updated on April 24, 2026 ^[2].
Sources conflict on Claude's SWE-bench performance: Vals AI reports 82.00%, while Vellum reports 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[2]^[15].
On GPQA Diamond, TNW reports 94.2% for Claude Opus 4.7 and notes that the differences between frontier models on that benchmark are within measurement noise ^[3].
GPT-5.5 has strong metrics in secondary sources: O-Mega reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].
I found no official OpenAI source in the retrieved corpus to validate the GPT-5.5 benchmarks, so its results should be presented as "provisional / secondary".
DeepSeek appears mainly as DeepSeek V4 Pro or V4 Pro High in the retrieved sources, not always as the base "DeepSeek V4" ^[6]^[8].
For DeepSeek V4/V4 Pro, the evidence combines a Hugging Face community discussion, aggregators and technical blogs, so its comparability with Claude and GPT-5.5 is lower ^[6]^[7]^[8]^[9].
Kimi K2.6 lacks sufficient evidence in the retrieved sources; I do not recommend including it in a ranking table as if it had verified benchmarks.

Reading methodology

I prioritized official sources, specialized leaderboards, technical discussions with traceability, and academic sources on benchmarks ^[1]^[2]^[6]^[10]^[11].
I classified the evidence into four tiers: official, independent benchmark, aggregator/community, and blog or secondary analysis ^[1]^[2]^[4]^[6]^[8].
I did not treat SWE-bench, SWE-bench Verified and SWE-bench Pro results as equivalent, because SWE-bench Pro is defined as a more challenging variant aimed at long-horizon software engineering tasks ^[10].
I treated MMLU as a metric with low discriminative power for frontier models, since a benchmark explainer indicates that in 2026 top models exceed 88% and the benchmark is heavily saturated ^[12].
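
The four-tier evidence classification can be made explicit in a few lines. The sketch below is illustrative only: the `Tier` enum, the `SOURCES` mapping and the `strongest_tier` helper are hypothetical names, the ranking of tiers in the order they are listed is an assumption, and the tier assigned to each source is a reading of how the report describes it rather than a label published by the source.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Evidence tiers; a higher value means stronger evidence (assumed ranking)."""
    SECONDARY = 1    # blog or secondary analysis
    AGGREGATOR = 2   # aggregator or community evaluation
    INDEPENDENT = 3  # independent benchmark or leaderboard
    OFFICIAL = 4     # publication by the model's developer


# (model, source) -> tier. DeepSeek variants are kept under separate names
# so that V4 Pro and V4 Pro High are not silently merged.
SOURCES = {
    ("Claude Opus 4.7", "Anthropic"): Tier.OFFICIAL,
    ("Claude Opus 4.7", "Vals AI"): Tier.INDEPENDENT,
    ("GPT-5.5", "O-Mega"): Tier.SECONDARY,
    ("DeepSeek V4 Pro", "Hugging Face discussion"): Tier.AGGREGATOR,
    ("DeepSeek V4 Pro High", "BenchLM"): Tier.AGGREGATOR,
}


def strongest_tier(model: str) -> Optional[Tier]:
    """Return the highest evidence tier recorded for a model, or None."""
    tiers = [tier for (name, _source), tier in SOURCES.items() if name == model]
    return max(tiers, default=None)


for model in ("Claude Opus 4.7", "GPT-5.5", "DeepSeek V4 Pro", "Kimi K2.6"):
    tier = strongest_tier(model)
    label = tier.name if tier is not None else "insufficient evidence"
    print(f"{model}: {label}")
```

A model with no recorded source returns `None` instead of a default tier, which mirrors the report's choice to mark Kimi K2.6 as "Insufficient evidence" rather than rank it.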

Executive comparison matrix

Model	Evidence status	Most relevant benchmarks retrieved	Executive reading
Claude Opus 4.7	Medium-high	Internal research-agent score of 0.715 and strong long-context performance according to Anthropic; SWE-bench 82.00% according to Vals AI; GPQA Diamond 94.2% according to TNW ^[1]^[2]^[3]	Best candidate to present as the supported leader in coding/agentic work, with caution about differences between sources ^[2]^[15]
GPT-5.5	Medium-low	MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% according to O-Mega ^[4]	Very strong in reasoning according to secondary sources, but official validation is missing from the retrieved corpus ^[4]^[5]
DeepSeek V4 / V4 Pro	Medium-low	BenchLM reports DeepSeek V4 Pro High with Agentic 83.8/100 and Coding 88.8/100; NxCode cites 81% on SWE-bench and 97% on Needle-in-a-Haystack at 1M tokens as claimed results ^[7]^[8]	Competitive alternative, especially if an open/local ecosystem is valued, but it requires independent validation before an executive decision ^[6]^[8]^[9]
Kimi K2.6	Insufficient evidence	No sufficiently citable benchmark in the retrieved sources	Do not include as a verified comparable; request an official source or leaderboard before presenting it

Numeric benchmarks retrieved

Benchmark / metric	Claude Opus 4.7	GPT-5.5	DeepSeek V4 / V4 Pro	Kimi K2.6
SWE-bench	82.00% according to Vals AI ^[2]	Not retrieved from a sufficiently comparable source	81% claimed in a secondary source on DeepSeek V4 ^[7]	Insufficient evidence
SWE-bench Verified	87.6% according to Vellum ^[15]	Not retrieved	Listed as an evaluated benchmark in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
SWE-bench Pro	64.3% according to Vellum ^[15]	Not retrieved	Listed in the DeepSeek-V4-Pro community discussion, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
GPQA Diamond	94.2% according to TNW and O-Mega ^[3]^[4]	93.6% according to O-Mega ^[4]	Mentioned within community suites, with no figure visible in the retrieved summary ^[6]^[9]	Insufficient evidence
MMLU	Not retrieved with a comparable figure	92.4% according to O-Mega ^[4]	MMLU-Pro appears as a community evaluation, with no figure visible in the retrieved summary ^[6]	Insufficient evidence
ARC-AGI-2	Not retrieved	85.0% according to O-Mega ^[4]	Not retrieved	Insufficient evidence
ARC-AGI-1	Not retrieved	95.0% according to O-Mega ^[4]	Not retrieved	Insufficient evidence
Research-agent / multi-step tasks	0.715 on Anthropic's internal benchmark ^[1]	Not retrieved	BenchLM reports an Agentic category score of 83.8/100 for DeepSeek V4 Pro High ^[8]	Insufficient evidence
Long-context / Needle-in-a-Haystack	Anthropic states very consistent long-context performance ^[1]	Not retrieved	NxCode reports 97% at 1M tokens as a claimed result, pending independent validation ^[7]	Insufficient evidence
LiveCodeBench / Codeforces	Not retrieved	Not retrieved	Redreamality reports LiveCodeBench 93.5 and Codeforces 3206 for DeepSeek V4 ^[9]	Insufficient evidence
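
For further analysis, the table is easier to handle as a sparse structure in which a missing figure stays missing rather than becoming zero. The following is a minimal sketch: the `CELLS` layout and the `coverage` and `lookup` helpers are hypothetical, only cells with a numeric figure are included, the LiveCodeBench / Codeforces row is split in two, and the DeepSeek column mixes V4, V4 Pro and V4 Pro High exactly as the table does.

```python
from collections import Counter

MODELS = ["Claude Opus 4.7", "GPT-5.5", "DeepSeek V4 / V4 Pro", "Kimi K2.6"]

# (benchmark, model, value). Units differ by row (percent, 0-1 score,
# 0-100 category score, rating), so values are not comparable across rows,
# and the research-agent row mixes two different scales.
CELLS = [
    ("SWE-bench", "Claude Opus 4.7", 82.00),
    ("SWE-bench", "DeepSeek V4 / V4 Pro", 81.0),  # claimed, secondary source
    ("SWE-bench Verified", "Claude Opus 4.7", 87.6),
    ("SWE-bench Pro", "Claude Opus 4.7", 64.3),
    ("GPQA Diamond", "Claude Opus 4.7", 94.2),
    ("GPQA Diamond", "GPT-5.5", 93.6),
    ("MMLU", "GPT-5.5", 92.4),
    ("ARC-AGI-2", "GPT-5.5", 85.0),
    ("ARC-AGI-1", "GPT-5.5", 95.0),
    ("Research-agent", "Claude Opus 4.7", 0.715),      # internal, 0-1 scale
    ("Research-agent", "DeepSeek V4 / V4 Pro", 83.8),  # BenchLM Agentic, 0-100
    ("Needle-in-a-Haystack 1M", "DeepSeek V4 / V4 Pro", 97.0),  # claimed
    ("LiveCodeBench", "DeepSeek V4 / V4 Pro", 93.5),
    ("Codeforces", "DeepSeek V4 / V4 Pro", 3206),
]


def coverage(cells):
    """Count numeric cells per model; models with no cells count as 0."""
    counts = Counter(model for _bench, model, _value in cells)
    return {model: counts[model] for model in MODELS}


def lookup(cells, benchmark, model):
    """Return the recorded value, or None when the table has no figure."""
    for bench, name, value in cells:
        if bench == benchmark and name == model:
            return value
    return None


print(coverage(CELLS))
print(lookup(CELLS, "SWE-bench", "GPT-5.5"))  # missing stays None, not 0
```

The coverage count measures only how many numeric cells exist per model. It says nothing about evidence quality, so a claimed figure from a secondary source counts the same as a leaderboard entry.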

Analysis by model

Claude Opus 4.7

Claude Opus 4.7 is the best-supported model in the set because it has an official Anthropic page and external SWE-bench results ^[1]^[2].

Vals AI reports that Claude Opus 4.7 leads SWE-bench with 82.00% on a page updated on April 24, 2026 ^[2].

Vellum reports different figures for Claude on other variants: 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro ^[15].

The gap between 82.00% and 87.6% should be treated as a discrepancy in methodology, subset or configuration, not as a single confirmed improvement ^[2]^[15].

In scientific reasoning, TNW reports 94.2% on GPQA Diamond for Claude Opus 4.7 and adds the context that frontier models are very close to one another on that benchmark ^[3].
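
A rough calculation illustrates why sub-point gaps on GPQA Diamond are hard to interpret. The sketch below assumes the Diamond subset has 198 questions, treats each question as an independent pass/fail trial, and treats the two models' scores as independent. None of these assumptions comes from the cited sources, and the `standard_error` helper is illustrative.

```python
import math

N_QUESTIONS = 198  # assumed size of the GPQA Diamond subset


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


claude = 0.942  # Claude Opus 4.7, as reported by TNW
gpt = 0.936     # GPT-5.5, as reported by O-Mega

se_claude = standard_error(claude, N_QUESTIONS)
se_gpt = standard_error(gpt, N_QUESTIONS)

# Standard error of the difference between two independent accuracies.
se_diff = math.sqrt(se_claude**2 + se_gpt**2)
gap = claude - gpt

print(f"gap = {gap * 100:.1f} points")
print(f"standard error of the gap = {se_diff * 100:.1f} points")
print(f"gap smaller than one standard error: {abs(gap) < se_diff}")
```

Under these assumptions the 0.6-point gap is well below the standard error of the difference, which is roughly 2.4 points, so the ordering of the two models on this benchmark should not carry much weight.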

GPT-5.5

GPT-5.5 looks very strong in general reasoning according to O-Mega, which reports MMLU 92.4%, GPQA Diamond 93.6%, ARC-AGI-2 85.0% and ARC-AGI-1 95.0% ^[4].

For a presentation, GPT-5.5 can be positioned as a very strong competitor in reasoning, but not as the overall winner if official traceability comparable to Claude's is required ^[1]^[4]^[5].

DeepSeek V4 / V4 Pro

Hugging Face shows a community discussion for DeepSeek-V4-Pro with evaluations on GPQA, GSM8K, HLE, MMLU-Pro, SWE-bench Pro, SWE-bench Verified and Terminal-Bench 2.0 ^[6].

BenchLM reports an Agentic category score of 83.8/100 and a Coding category score of 88.8/100 for DeepSeek V4 Pro High ^[8].

Redreamality reports that DeepSeek V4 reaches LiveCodeBench 93.5 and Codeforces 3206, while also stating that closed models still lead on long-horizon agentic tasks ^[9].

Kimi K2.6

I found no sufficiently citable benchmarks for Kimi K2.6 in the retrieved sources.

I do not recommend substituting Kimi K2.5 or another variant for Kimi K2.6, because that would mix different models and weaken the validity of the comparison.

The safest way to present it is as "pending validation" or "Insufficient evidence".

Ranking by use case

Scenario	Recommended model	Rationale
Coding and resolving real issues	Claude Opus 4.7	Vals AI reports it as the SWE-bench leader with 82.00%, and Vellum also places it strongly on SWE-bench Verified and SWE-bench Pro ^[2]^[15]
General reasoning / science	Claude Opus 4.7 or GPT-5.5	Claude appears with 94.2% on GPQA Diamond, while GPT-5.5 appears with 93.6% in O-Mega; the difference is small and should be read with caution ^[3]^[4]
Agents and multi-step tasks	Claude Opus 4.7	Anthropic reports 0.715 on its internal research-agent benchmark and strong long-context consistency ^[1]
Open alternative / technical ecosystem	DeepSeek V4 Pro	BenchLM and Hugging Face show broad evaluation suites for DeepSeek V4 Pro, although the evidence is less official than Claude's ^[6]^[8]
Presentation with verifiable evidence	Claude Opus 4.7	It is the only model in the set with a clear official source and external benchmark signals in the retrieved corpus ^[1]^[2]
Including Kimi K2.6	Not recommended	Insufficient evidence

Evidence notes

The strongest source for Claude is official, but its research-agent benchmark is internal and not necessarily reproducible by third parties ^[1].
Vals AI is the cleanest source for a SWE-bench comparison because it directly reports a SWE-bench leaderboard updated on April 24, 2026 ^[2].
Vellum adds more detail for SWE-bench Verified and SWE-bench Pro, but its figures do not match Vals AI and should be presented as an alternative source ^[15].
O-Mega provides several useful figures for GPT-5.5, but it should be treated as a secondary source until an official card or a reproducible independent leaderboard exists ^[4].
The Hugging Face discussion for DeepSeek-V4-Pro demonstrates community evaluation activity, but on its own it is not equivalent to an official publication or an audited benchmark ^[6].
SWE-bench Pro is especially relevant for comparing software agents because it evaluates harder, longer-horizon tasks than simpler coding benchmarks ^[10].
MMLU should carry less weight in the decision because benchmark explainers indicate saturation among top models in 2026 ^[12].
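
One way to avoid comparing figures across SWE-bench variants is to key every result by the benchmark name exactly as the source reports it, and to refuse to compute a gap when the names differ. This is a sketch under that assumption; `Result` and `gap` are hypothetical names, not part of any leaderboard's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str  # exact name used by the source, never a family name
    score: float    # percent, as reported
    source: str


RESULTS = [
    Result("Claude Opus 4.7", "SWE-bench", 82.00, "Vals AI"),
    Result("Claude Opus 4.7", "SWE-bench Verified", 87.6, "Vellum"),
    Result("Claude Opus 4.7", "SWE-bench Pro", 64.3, "Vellum"),
]


def gap(a: Result, b: Result) -> Optional[float]:
    """Score difference in points, or None when the benchmark names differ."""
    if a.benchmark != b.benchmark:
        return None
    return a.score - b.score


vals, vellum_verified, vellum_pro = RESULTS
print(gap(vals, vellum_verified))        # None: different benchmark names
print(gap(vellum_verified, vellum_pro))  # None: different benchmark names
```

Returning `None` forces the caller to handle the mismatch explicitly, so a figure such as "87.6 minus 82.00" never appears in a slide as if it were a measured improvement. Matching names is only a minimum condition: two sources can still differ in harness or configuration.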

Limitations / uncertainty

Insufficient evidence for Kimi K2.6.
I did not retrieve an official OpenAI source with a GPT-5.5 benchmark card, so any GPT-5.5 figure in this report must be labeled as secondary.
I did not retrieve an official DeepSeek source that consolidates the r

Sources

[1] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elo (nanonets.com)
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[2] GPT-5.5 is here: benchmarks, pricing, and what changes ... - Appwrite (appwrite.io)
GPT-5.5 is here: benchmarks, pricing, and what changes for developers Apr 24, 202...
[3] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Reasoning, Math, and Science Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- MMLU 92.4% - - GPQA Diamond 93.6% 92.8% 94.2% 94.3% ARC-AGI-2 85.0% 73.3% 77.1% ARC-AGI-1 95.0% 93.7% - FrontierMath T1-3 51.7% 52.4% 47.6% 43.8% F...
[6] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[7] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9. GPT-5 mini 0.22; 10. o3 0.16. GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[8] SWE-bench February 2026 leaderboard update (simonwillison.net)
Here's how the top ten models performed: Image 1: Bar chart showing "% Resolved" by "Model". Bars in descending order: Claude 4.5 Opus (high reasoning) 76.8%, Gemini 3 Flash (high reasoning) 75.8%, MiniMax M2.5 (high reasoning) 75.8%, Claude Opus 4.6 75.6%,...
[9] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[11] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[12] LLM Leaderboard 2026 - Compare Top AI Models - Vellum (vellum.ai)
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[14] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance | BenchLM.ai (benchlm.ai)
BenchLM is tracking Claude Opus 4.7, but this profile is currently excluded from the public leaderboard because it still lacks enough non-generated benchmark cov...
[15] Claude Opus 4.7 leads on SWE-bench and agentic ... - TNW (thenextweb.com)
On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark...
[16] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[17] SWE-bench - Vals AI (vals.ai)
SWE-bench. Updated: 4/24/2026. Solving production software engineering tasks. Key Takeaways: Claude Opus 4.7 leads with a...
[20] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both G...
[21] WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed - Live Rankings (whatllm.org)
The ultimate LLM comparison tool. Compare price, performance, and speed across the entire AI ecosystem. Updated daily with the latest benchmarks. Top 10 Models Ranked by Quality Index across all benchmar...
[25] Add community evaluation results for GPQA, GSM8K, HLE, MMLU ... (huggingface.co)
deepseek-ai/DeepSeek-V4-Pro · Add community evaluation results for GPQA, GSM8K, HLE, MMLU-PRO, SWE-BENCH PRO, SWE-BENCH VERIFIED, TERMINAL-BENCH-2.0
[26] DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench ... - NxCode (nxcode.io)
The claimed results, Standard Attention vs. Engram (DeepSeek V4): Needle-in-a-Haystack (1M tokens) 84.2% accuracy vs. 97% accuracy; Context Length Supported: Varies (128K typical) vs. 1M tokens. If the 97% figure holds up under independent testing, this represen...
[27] DeepSeek V4 Pro (High) Benchmarks 2026 - BenchLM.ai (benchlm.ai)
Category Performance: scores across all benchmark categories (0-100 scale). Category Breakdown: Agentic 83.8/100 Weight: 22% 5 benchmarks Terminal-Bench 2.0 BrowseComp OSWorld-Verified GAIA TAU-bench WebArena Coding 7 88.8/100 Weight: 20% 6 be...
[30] Mapping the DeepSeek V4 Evaluation Suite: A Field Guide to 2026 ... (redreamality.com)
The Takeaway The V4 scorecard confirms a pattern: for pure coding, open weights have caught up (LiveCodeBench 93.5, Codeforces 3206). For long-horizon agentic work (SWE-bench Pro, Terminal-Bench 2.0), closed frontier still leads. For frontier reasoning (HLE...
[38] SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (arxiv.org)
… PRO, a substantially more challenging benchmark that builds … In our evaluation of widely used coding models, under a unified … Towards this end, this paper is motivated to (1) mitigate … 2025