Report published 29 Apr 2026 · Last edited 6 May 2026 · 12 sources

GPT-5.5, Claude Opus 4.7, Kimi K2.6 and DeepSeek V4: A Practical Benchmark Comparison for Choosing a Model

If the job is a terminal-based coding agent, start with GPT-5.5; for benchmark-style software repair, look at Claude Opus 4.7; for open weights, look at Kimi K2.6; for cost control, test DeepSeek-V4-Pro-Max [1][18][24]. Do not lump GPT-5.5 Pro together with base GPT-5.5: in the table that reports Pro separately, it scores 90.1% on BrowseComp and Humanity...

Benchmark tables make model comparison look like a contest with a single winner. In practice, the more important question is not which model wins everything but which one you should test first on your own work. The closest thing to a shared comparison in the cited sources covers GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max, while the Kimi K2.6 figures come from separate Kimi-specific sources such as a launch report, the model card and a leaderboard ^[1]^[6]^[24].

A note on model names: this article uses DeepSeek-V4-Pro-Max to stand for DeepSeek V4, because that is the variant with benchmark rows and cost figures in the cited sources ^[18]^[24]. GPT-5.5 Pro is kept separate from base GPT-5.5 wherever the sources report different scores for the two ^[24].

Quick summary by task type

Terminal-heavy coding agents: GPT-5.5 has the highest Terminal-Bench 2.0 score in the cited shared comparison table, at 82.7% ^[24].
Benchmark-style software repair: Claude Opus 4.7 leads the SWE-Bench Pro row at 64.3% and the SWE-Bench Verified row at 87.6% in the cited data ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison table ^[24].
Hard reasoning with tools and web search: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1%, in the rows where the Pro variant is reported separately ^[24].
Open-weight deployment: Kimi K2.6 is the clearest open-weight option in the cited sources, described as a 1T-parameter MoE model with 32B active parameters and a 256K context window ^[1].
Cost-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value option worth testing; LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

A dash means no score for that model was found in the sources used; it does not mean a score of zero. Most of the rows for GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max come from the shared comparison table, while the Kimi K2.6 numbers come from separate Kimi sources ^[1]^[6]^[24].

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
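
As a rough way to read the table, the sketch below transcribes the rows into Python and picks the highest reported score per benchmark. The names `MODELS`, `SCORES` and `leader` are illustrative, the Kimi K2.6 GPQA Diamond entry (listed as ≈91%) is entered as 91.0, and dashes are stored as `None` so that a missing score is never treated as zero. Because the Kimi K2.6 numbers come from separate sources, the output is a reading aid and not a ranking.

```python
# Scores (percent) transcribed from the comparison table; None marks a dash.
MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

SCORES = {
    "GPQA Diamond":       [93.6, None, 94.2, 91.0, 90.1],  # Kimi value is approximate
    "HLE, no tools":      [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools":    [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro":      [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp":         [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas":          [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leader(benchmark):
    """Return (model, score) for the highest reported score, skipping dashes."""
    reported = [(s, m) for s, m in zip(SCORES[benchmark], MODELS) if s is not None]
    score, model = max(reported)
    return model, score


if __name__ == "__main__":
    for bench in SCORES:
        model, score = leader(bench)
        print(f"{bench}: {model} ({score}%)")
```

Keeping the missing cells as `None` matters: a model with no reported score on a benchmark is excluded from that row, not ranked last.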

Which model to test first

Main need	Start with	Why
Terminal-heavy coding agents	GPT-5.5	Highest Terminal-Bench 2.0 score in the shared comparison table, at 82.7% ^[24].
Code repair and software problem solving	Claude Opus 4.7	Leads both the SWE-Bench Pro and SWE-Bench Verified rows in the cited data ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison table ^[24].
Tool-assisted reasoning or browsing	GPT-5.5 Pro	Leads Humanity’s Last Exam with tools and BrowseComp in the rows where GPT-5.5 Pro is reported separately ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight MoE with 1T parameters, and the Hugging Face model card reports strong coding benchmark scores ^[1]^[6].
Cost-sensitive API or hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified and cost columns below the Claude Opus 4.7 row on the same leaderboard ^[18].
Long-context work	GPT-5.5, Claude Opus 4.7 or DeepSeek-V4-Pro-Max	The cited sources list a 1M context for GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max, while Kimi K2.6 sits at roughly 256K–262K ^[1]^[11]^[16]^[18]^[27].

Reading each model for practical use

GPT-5.5

OpenAI describes GPT-5.5 as built for complex tasks such as coding, research and data analysis ^[38]. In VentureBeat's shared comparison table, GPT-5.5 scores 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[24]. The same table reports GPT-5.5 at 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro and 84.4% on BrowseComp ^[24].

One caution: GPT-5.5 Pro is a separate comparison point, and its scores should not be folded into the base model's. In the same table GPT-5.5 Pro scores 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but those figures should be kept apart from base GPT-5.5 when comparing cost, latency or model configuration ^[24].

For procurement or budgeting, BenchLM lists a 1M-token context window for GPT-5.5, and one pricing report puts GPT-5.5 at $5 per 1 million input tokens and $30 per 1 million output tokens ^[27]^[36]. Treat these prices as a preliminary signal and confirm current pricing with the provider before building a real budget.

Claude Opus 4.7

Claude Opus 4.7 shows the strongest software-repair signal in this group according to the cited sources: LLM Stats lists 87.6% on SWE-Bench Verified, and the shared comparison table reports 64.3% on SWE-Bench Pro ^[18]^[24]. It also leads GPQA Diamond at 94.2%, Humanity’s Last Exam without tools at 46.9% and MCP Atlas at 79.1% in the shared comparison table ^[24].

LLM Stats reports a 1M-token context window and pricing of $5/$25 per 1 million tokens for Claude Opus 4.7 ^[16]. Benchmark comparability matters a great deal here, though: Anthropic states that some results used an internal implementation or adjusted harness parameters, so some scores cannot be compared directly with public leaderboards ^[17].

Kimi K2.6

Kimi K2.6 is the most prominent open-weight option in the cited data. The launch report describes it as an open-weight MoE with 1T parameters, 32B active parameters, 384 experts, native multimodality, INT4 quantization and a 256K context ^[1]. The Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0 and 89.6 on LiveCodeBench v6 ^[6].

The same launch report lists Kimi K2.6 at 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp ^[1]. LLM Stats lists Kimi K2.6 with a 262K context, cost columns of $0.95/$4.00 and an Open Source label ^[11]. The limitation is that Kimi's scores do not come from the same shared comparison table as GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max, so small score differences should be a reason to test further, not grounds for declaring a winner ^[1]^[6]^[24].

DeepSeek-V4-Pro-Max

DeepSeek-V4-Pro-Max looks more like the value option than a winner in every arena. LLM Stats lists a size of 1.6T, a 1M context, 80.6% on SWE-Bench Verified and cost columns of $1.74/$3.48 ^[18]. In the shared comparison table it scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp and 73.6% on MCP Atlas ^[24].

These numbers make DeepSeek-V4-Pro-Max worth trying where cost is a major constraint, but the same table shows GPT-5.5, GPT-5.5 Pro or Claude Opus 4.7 leading most of the reported benchmark rows. Before using it to replace a premium model in production, test it thoroughly against your own work ^[24].

Context and pricing signals

Prices and context windows do not always come from the same source and can vary by provider or plan. Use this table as a signal for building a shortlist, not as a final quote.

Model	Context and pricing signals in the cited sources	Practical reading
GPT-5.5	BenchLM lists a 1M context; one pricing report lists $5 input and $30 output per 1 million tokens ^[27]^[36].	Premium hosted option; recheck current pricing.
Claude Opus 4.7	LLM Stats reports a 1M context and pricing of $5/$25 per 1 million tokens ^[16].	Premium option for coding, reasoning and long context.
Kimi K2.6	The launch report lists a 256K context; LLM Stats lists a 262K context and cost columns of $0.95/$4.00 ^[1]^[11].	Strong open-weight option; prices on hosted providers may differ.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a size of 1.6T, 80.6% on SWE-Bench Verified and cost columns of $1.74/$3.48 ^[18].	Value option if quality still holds up when tested on real workloads.
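
To turn these listed prices into a rough budget signal, a minimal sketch could multiply token volumes by the per-million rates. It assumes the two "cost columns" for Kimi K2.6 and DeepSeek-V4-Pro-Max are input and output prices in USD per 1 million tokens, which the cited leaderboards do not spell out in the excerpts quoted here. `PRICES`, `estimate_cost` and `compare` are illustrative names, and the figures are the ones in the table, not current provider quotes.

```python
# (input, output) price in USD per 1M tokens, as listed in the table.
# Assumption: the Kimi and DeepSeek "cost columns" are input/output prices.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost for one workload at the listed prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def compare(input_tokens, output_tokens):
    """Return {model: cost} sorted from cheapest to most expensive."""
    costs = {m: round(estimate_cost(m, input_tokens, output_tokens), 2) for m in PRICES}
    return dict(sorted(costs.items(), key=lambda kv: kv[1]))


if __name__ == "__main__":
    # Hypothetical monthly workload: 2M input tokens, 0.5M output tokens.
    for model, cost in compare(2_000_000, 500_000).items():
        print(f"{model}: ${cost:.2f}")
```

For that hypothetical workload of 2M input and 0.5M output tokens, the listed prices work out to about $25.00 for GPT-5.5, $22.50 for Claude Opus 4.7, $5.22 for DeepSeek-V4-Pro-Max and $3.90 for Kimi K2.6. Output-heavy workloads widen the gap, since the output rates differ more than the input rates.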

Why rankings differ from table to table

Each benchmark measures a different skill. GPQA Diamond and Humanity’s Last Exam emphasize hard reasoning, Terminal-Bench 2.0 and the SWE-Bench family emphasize coding and agentic software work, and BrowseComp measures search and browsing ability in the shared comparison table ^[24]. A model can therefore win some rows and trail in others, because the tasks, the tools provided and the harness used for measurement all differ.

Even a benchmark with the same name can differ by how it is run. LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, while LMCouncil lists Claude Opus 4.7 at 83.5% ± 1.7 under its own setup ^[18]^[30]. Anthropic itself also states that some results used an internal implementation or adjusted harness parameters, which prevents direct comparison with some public leaderboards ^[17].

A gap of 1–2 points should therefore not be the only reason to switch models in production. Use public benchmarks to narrow the list, then let tests on real work decide.
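
One way to see why a 1–2 point gap is weak evidence is to estimate sampling noise from the size of the task set. The sketch below uses a plain binomial standard error and a two-sample z-score; this is a simplification introduced here, not a method taken from any of the cited leaderboards. It assumes a 500-task benchmark (the size assumed here for SWE-Bench Verified; substitute the real count for whichever benchmark you check) and treats the two runs as independent, which ignores harness differences entirely.

```python
import math


def standard_error(score_pct, n_tasks):
    """Binomial standard error of a pass rate, in percentage points."""
    p = score_pct / 100.0
    return math.sqrt(p * (1.0 - p) / n_tasks) * 100.0


def gap_z_score(score_a, score_b, n_tasks):
    """Gap between two scores divided by the combined standard error."""
    combined = math.hypot(standard_error(score_a, n_tasks),
                          standard_error(score_b, n_tasks))
    return abs(score_a - score_b) / combined


def looks_meaningful(score_a, score_b, n_tasks, z_threshold=1.96):
    """True when the gap exceeds roughly a 95% noise band (simplified)."""
    return gap_z_score(score_a, score_b, n_tasks) > z_threshold


if __name__ == "__main__":
    n = 500  # assumed task count
    print(looks_meaningful(80.6, 80.2, n))  # DeepSeek-V4-Pro-Max vs Kimi K2.6
    print(looks_meaningful(87.6, 80.6, n))  # Claude Opus 4.7 vs DeepSeek-V4-Pro-Max
```

Under these assumptions a score near 80% carries a standard error of roughly 1.8 points, so the 0.4-point gap between 80.6% and 80.2% falls well inside the noise, while the 7-point gap between 87.6% and 80.6% does not. Differences in harness and tool access can still outweigh either result.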

How to evaluate before deciding

Before settling on a single model, test 2–3 finalists on work that is as close to the real thing as possible.

Use real prompts, files and repositories. Benchmark prompts rarely reflect your codebase, documents, policies or user behavior.
Set up the environment to match real work. Coding agent results can change substantially depending on whether terminal access, browsing, retrieval, repo context or internal APIs are available.
Measure cost and latency under the same settings. Pro mode or high effort may improve quality but may also increase token usage and response time.
Have people review failures. For code, check tests, diffs, maintainability, security regressions and dependencies the model invented.
Include at least one cheaper or open-weight challenger. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max belong in the test set ^[1]^[18].
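
The checklist above can be sketched as a tiny harness that runs every candidate on the same tasks and records pass rate and latency side by side. Everything here is hypothetical scaffolding: `Task`, `Result`, `evaluate` and the two stand-in "models" are invented for illustration, and a real run would replace the stand-ins with calls to each provider's client and add token and cost accounting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


@dataclass
class Result:
    model: str
    pass_rate: float
    mean_latency_s: float


def evaluate(models: Dict[str, Callable[[str], str]], tasks: List[Task]) -> List[Result]:
    """Run every model on every task under identical conditions."""
    results = []
    for name, call in models.items():
        passed, elapsed = 0, 0.0
        for task in tasks:
            start = time.perf_counter()
            output = call(task.prompt)
            elapsed += time.perf_counter() - start
            passed += int(task.check(output))
        results.append(Result(name, passed / len(tasks), elapsed / len(tasks)))
    # Best pass rate first; inspect the failures by hand afterwards.
    return sorted(results, key=lambda r: r.pass_rate, reverse=True)


# Stand-in "models" so the sketch runs offline; swap in real API clients.
def echo_model(prompt: str) -> str:
    return prompt


def upper_model(prompt: str) -> str:
    return prompt.upper()


TASKS = [
    Task("fix the bug", lambda out: "bug" in out),
    Task("add a test", lambda out: "test" in out),
]

if __name__ == "__main__":
    for r in evaluate({"echo": echo_model, "upper": upper_model}, TASKS):
        print(f"{r.model}: pass_rate={r.pass_rate:.2f}")
```

Passing the models in as plain callables keeps the prompts, tools and scoring identical across candidates, which is the point of the same-settings step in the checklist.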

Conclusion

For a top-tier shortlist, test GPT-5.5 and Claude Opus 4.7 side by side: GPT-5.5 has the standout Terminal-Bench 2.0 score in the cited data, while Claude Opus 4.7 has the strongest SWE-Bench Pro and SWE-Bench Verified scores in this data set ^[18]^[24]. If you need open weights, start with Kimi K2.6 ^[1]^[6]. If cost is the main constraint, add DeepSeek-V4-Pro-Max to the tests, but prove it on your own workload before assuming it can replace a premium model outright ^[18]^[24].

Key points

If the job is a terminal-based coding agent, start with GPT-5.5; for benchmark-style software repair, look at Claude Opus 4.7; for open weights, look at Kimi K2.6; for cost control, test DeepSeek-V4-Pro-Max [1][18][24].
Do not lump GPT-5.5 Pro together with base GPT-5.5: in the table that reports Pro separately, it scores 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools [24].
Kimi K2.6 is described as an open-weight MoE with 1T parameters and 32B active parameters, while LLM Stats lists DeepSeek-V4-Pro-Max with a 1M context and cost columns of $1.74/$3.48 [1][18].

Research conversation

You · 25 Apr 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI · 25 Apr 2026 · 40 sources

Cited answer

I found public benchmark data for all four, but not one fully independent, apples-to-apples leaderboard covering GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 under the same harness. The practical read is: GPT-5.5 looks strongest for broad agentic coding/computer-use tasks, Claude Opus 4.7 is extremely strong on coding and hard reasoning, Kimi K2.6 is the strongest open-weight option in this group, and DeepSeek V4 looks like the best price/performance play if its reported results hold up across independent testing ^[4]^[5]^[6]^[13].

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

If you want the safest high-end general choice, start with GPT-5.5. If your priority is top-tier coding reliability and long-context enterprise work, test Claude Opus 4.7. If you need open weights and strong frontier-adjacent performance, Kimi K2.6 is the most compelling open option. If cost is the constraint and you can tolerate some validation risk, DeepSeek V4 is the one to benchmark first.

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ...
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ...

DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.
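To make the warning about small score differences concrete, here is a minimal sketch of how wide the uncertainty is on a 20 to 50 task eval set. It uses the normal (Wald) approximation for a binomial proportion, which is a simplification chosen for illustration; the function name `pass_rate_interval` is hypothetical and not taken from any cited source.

```python
import math


def pass_rate_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate (normal approximation)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = pass_rate_interval(passed=40, total=50)
print(f"40/50 passed: roughly {low:.0%} to {high:.0%}")
```

With 50 tasks and an 80% pass rate, the interval spans roughly 11 points on either side, so two models that differ by a few points on a set this small are not clearly separated.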

Summary

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
LLM Stats Logo Make AI phone calls with one API call Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

Quick summary by task type

Terminal-heavy coding agents: GPT-5.5 has the highest Terminal-Bench 2.0 score in the cited shared comparison table, at 82.7% ^[24].
Benchmark-style software repair: Claude Opus 4.7 leads the SWE-Bench Pro row at 64.3% and the SWE-Bench Verified row at 87.6% in the cited data ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison table ^[24].
Hard reasoning with tools and web search: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1%, in the rows where Pro is reported separately ^[24].
Open-weight deployment: Kimi K2.6 is the clearest open-weight option in the cited sources, described as a 1T-parameter MoE model with 32B active parameters and a 256K context window ^[1].
Cost-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value option worth testing; LLM Stats lists a 1M context, an 80.6% SWE-Bench Verified score, and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
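The per-row leaders quoted in this article can be rechecked mechanically. The sketch below transcribes the table into a Python dictionary and picks the highest reported score in each row. The names `MODELS`, `TABLE`, and `leader` are illustrative, and because the rows mix sources and evaluation settings, "leader" here only means the highest reported figure, not a controlled head-to-head result.

```python
from typing import Optional

# Column order matches the comparison table.
MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, transcribed from the table; None marks an empty cell.
TABLE: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],  # Kimi value is approximate in the table
    "HLE, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leader(benchmark: str) -> tuple[str, float]:
    """Return (model, score) for the highest reported score in one row."""
    row = TABLE[benchmark]
    reported = [(score, model) for model, score in zip(MODELS, row) if score is not None]
    score, model = max(reported)
    return model, score


for name in TABLE:
    model, score = leader(name)
    print(f"{name}: {model} ({score}%)")
```

Empty cells are skipped rather than treated as zero, so a model is never ranked on a benchmark it has no reported score for.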

Which model to test first

Main task	Start with	Reason
Terminal-heavy coding agents	GPT-5.5	Highest Terminal-Bench 2.0 score in the shared comparison table, at 82.7% ^[24].
Code repair and software problem solving	Claude Opus 4.7	Leads both the SWE-Bench Pro and SWE-Bench Verified rows in the cited data ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison table ^[24].
Tool-assisted reasoning or browsing	GPT-5.5 Pro	Leads Humanity’s Last Exam with tools and BrowseComp in the rows where GPT-5.5 Pro is reported separately ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight MoE with 1T parameters, and its Hugging Face model card reports strong coding benchmark scores ^[1]^[6].
Cost-sensitive API or hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified, and cost columns lower than the Claude Opus 4.7 row on the same leaderboard ^[18].
Long-context work	GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max	The cited sources list a 1M context for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, while Kimi K2.6 sits at roughly 256K to 262K ^[1]^[11]^[16]^[18]^[27].

Reading each model in practice

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

Context and pricing signals

Model	Context and pricing signals in the cited sources	Practical reading
GPT-5.5	BenchLM lists a 1M context; one pricing report gives $5 input and $30 output per 1M tokens ^[27]^[36].	Premium hosted option; recheck current pricing.
Claude Opus 4.7	LLM Stats reports a 1M context and pricing of $5/$25 per 1M tokens ^[16].	Premium option for coding, reasoning, and long-context work.
Kimi K2.6	The launch coverage reports a 256K context; LLM Stats lists a 262K context and cost columns of $0.95/$4.00 ^[1]^[11].	Strong open-weight option; prices on hosted providers may differ.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a 1.6T size, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].	Good value if quality still holds up when tested on your real workload.
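To turn these price signals into a rough budget figure, the sketch below multiplies token volumes by the listed rates. It assumes each price pair is USD per 1M input tokens and per 1M output tokens, which the cited snippets do not state explicitly for the LLM Stats "cost columns". The workload of 10M input and 2M output tokens is invented for illustration, and `PRICES` and `estimate_cost` are hypothetical names.

```python
# (input price, output price), assumed to be USD per 1M tokens, from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one workload, rounded to cents."""
    input_price, output_price = PRICES[model]
    cost = input_tokens / 1_000_000 * input_price + output_tokens / 1_000_000 * output_price
    return round(cost, 2)


# Hypothetical workload: 10M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000_000, 2_000_000):.2f}")
```

The ranking can change with the input-to-output ratio, and higher-effort modes may produce more output tokens per task, so it is worth rerunning the estimate with token counts measured on your own workload.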

Why the rankings differ from table to table

How to evaluate before deciding

Use real prompts, files, and repositories, because benchmark prompts often do not reflect your codebase, documents, policies, or user behavior.
Match the environment to real work. Coding-agent results can change substantially depending on whether terminal access, browsing, retrieval, repo context, or internal APIs are available.
Measure cost and latency under the same settings. Pro modes or high effort may raise quality, but they may also increase token usage and response time.
Have a person review failures. For code, look at tests, diffs, maintainability, security regressions, and dependencies the model invented.
Include at least one cheaper or open-weight challenger. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max belong in the test set ^[1]^[18].
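The checklist above can be wired into a very small harness. The sketch below is one possible shape, not a reference implementation: `Task`, `Result`, `evaluate`, and the stand-in `echo_model` are all hypothetical names, and a real run would replace `echo_model` with a call to whichever provider client you use, under identical settings for every model.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


@dataclass
class Result:
    passed: int
    total: int
    seconds: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def evaluate(model: Callable[[str], str], tasks: list[Task]) -> Result:
    """Run every task through one model and record passes and wall-clock time."""
    passed = 0
    start = time.perf_counter()
    for task in tasks:
        try:
            output = model(task.prompt)
        except Exception:
            continue  # an exception counts as a failed task
        if task.check(output):
            passed += 1
    return Result(passed, len(tasks), time.perf_counter() - start)


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call; it only upper-cases the prompt."""
    return prompt.upper()


tasks = [
    Task("hello", lambda out: out == "HELLO"),
    Task("abc", lambda out: out == "xyz"),
]
result = evaluate(echo_model, tasks)
print(f"{result.passed}/{result.total} passed in {result.seconds:.3f}s")
```

Automated checks like these only cover what can be asserted in code; failed outputs still need the human review described in the list.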

Conclusion
