Report published 28 Apr 2026 · Last edited 6 May 2026 · 13 sources

Benchmark comparison of GPT‑5.5 vs Claude Opus 4.7 vs Kimi K2.6 vs DeepSeek V4: which model fits which job

No model wins every task: GPT‑5.5 shows the strongest signals in agentic computer use and tool workflows, Claude Opus 4.7 stands out in repo-level code repair, Kimi K2.6 is strong in open-weights coding, and DeepSeek V4 belongs on the list for long cont... Headline numbers: GPT‑5.5 scores 82.7% on Terminal‑Bench 2.0 and 84.4% on BrowseComp; Claude Opus 4.7 scores SWE...


Based on public data available through April 2026, a comparison of GPT‑5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 should not be read as a league table that crowns one overall winner. Read it instead as a map for choosing a model by workload: agent tasks that need a browser and a terminal, code fixes in real repos, work that requires open weights, and long-context experiments each have a different front-runner.

The key caveat is that these scores do not all come from the same test setup. The lab, tool access, inference effort, and evaluation harness may all differ. LM Council also warns that independently run benchmarks may not match the scores self-reported by model developers. ^[12]

The short answer

Agentic computer use, browser workflows, and terminal-heavy agents: GPT‑5.5 shows the strongest public signals in this dataset. OpenAI reports 82.7% on Terminal‑Bench 2.0, 78.7% on OSWorld‑Verified, 84.4% on BrowseComp, and 55.6% on Toolathlon. ^[5]
Production codebase repair and SWE‑Bench-style benchmarks: Claude Opus 4.7 is the strongest shortlist candidate, with a reported 87.6% on SWE‑Bench Verified and 64.3% on SWE‑Bench Pro. ^[17]
Coding that requires open weights: Kimi K2.6 is highly competitive. Kimi's documentation lists Terminal‑Bench 2.0 at 66.7%, SWE‑Bench Pro at 58.6%, SWE‑Bench Verified at 80.2%, and LiveCodeBench v6 at 89.6. ^[29]
Long-context open-source/open-weights experiments: DeepSeek V4 is worth testing, but check exactly which variant a score refers to, because DeepSeek states that V4 Preview went live and was open-sourced on 24 April 2026. ^[42]
Scientific reasoning: Claude Opus 4.7 reports GPQA Diamond 94.2%; Kimi K2.6 reports GPQA-Diamond 90.5% and AIME 2026 96.4%; DeepSeek V4-Pro/Pro-Max reports GPQA Diamond 90.1. ^[19]^[27]^[29]^[37]

Before reading the benchmarks: 3 things to remember

Different benchmark families measure different skills. Terminal‑Bench, SWE‑Bench, BrowseComp, OSWorld, GPQA, and HLE do not ask the same questions. A model that excels at fixing repo issues may not be the best choice for a web research agent or computer-use automation. ^[5]^[17]^[29]
Tool access and inference effort change the results. The OpenAI system card states that GPT‑5.5 Pro uses the same underlying model as GPT‑5.5 but in a setting that uses parallel test-time compute, so GPT‑5.5 Pro scores should not be read as scores for standard GPT‑5.5 under the same compute budget. ^[3]
Public benchmarks are good for building a shortlist, not for making the final procurement decision, because independent runs may not match self-reported scores. Teams making a real choice should run an internal eval with the same prompts, tool budget, timeout, and scoring criteria for every model. ^[12]

Overview of each model

Model	Positioning in public documents	Standout signals	Caveats
GPT‑5.5	OpenAI's launch material emphasizes computer use, tool use, and agentic workflows ^[5]	Terminal‑Bench 2.0 82.7%, OSWorld‑Verified 78.7%, BrowseComp 84.4%; GPT‑5.5 Pro scores 90.1% on BrowseComp ^[5]	Do not compare Pro scores directly with standard GPT‑5.5, because Pro uses a parallel test-time compute setting ^[3]
Claude Opus 4.7	Anthropic positions it as a hybrid reasoning model for coding and AI agents, with a 1M context window ^[14]	SWE‑Bench Verified 87.6% and SWE‑Bench Pro 64.3% ^[17]	A 1M context window is useful, but window size does not always equal recall quality; the StationX summary includes a caveat about recall at the far end of 1M tokens ^[17]
Kimi K2.6	Open-source/open-weights model from Moonshot/Kimi focused on coding ^[29]^[34]	Terminal‑Bench 2.0 66.7%, SWE‑Bench Pro 58.6%, SWE‑Bench Verified 80.2%, LiveCodeBench v6 89.6 ^[29]	Artificial Analysis reports that Kimi K2.6 natively supports image/video input and has a max context length of 256k; real-world performance still depends on the deployment ^[32]
DeepSeek V4-Pro / Pro-Max	DeepSeek states that V4 Preview is live and open-sourced; the Hugging Face card presents the V4 series as MoE language models ^[37]^[42]	SWE Verified 80.6, SWE Pro 55.4, Terminal Bench 2.0 67.9, and GPQA Diamond 90.1 ^[37]	The DeepSeek V4 name covers several variants, so Flash, Pro, and Pro-Max scores should not be merged into one ^[37]^[42]

Head-to-head benchmark table

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro / Pro-Max	How to read it
Terminal‑Bench 2.0	82.7% ^[5]	69.4% reported ^[16]	66.7% ^[29]	67.9% ^[37]	Command-line and autonomous-coding work: GPT‑5.5 has the clearest lead
SWE‑Bench Pro	58.6% ^[5]	64.3% ^[17]	58.6% ^[29]	55.4% ^[37]	On this harder software engineering benchmark, Claude Opus 4.7 leads
SWE‑Bench Verified	No clearly comparable figure found in this source set	87.6% ^[17]	80.2% ^[29]	80.6% ^[37]	Fixing issues in real repos: Claude has the strongest reported signal
OSWorld‑Verified	78.7% ^[5]	78.0% ^[17]	73.1% ^[29]	No clearly comparable figure found	Computer-use tasks: GPT‑5.5 and Claude Opus 4.7 are very close
BrowseComp	84.4%; GPT‑5.5 Pro 90.1% ^[5]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[34]	No clearly comparable figure found	Browser-agent and web-research work: strong signals from GPT‑5.5 Pro and Kimi Agent Swarm
GPQA Diamond	No clearly comparable official figure found in this source set	94.2% ^[19]	90.5% ^[27]	90.1% ^[37]	Advanced science reasoning: Claude has the highest reported score
HLE / hard reasoning	No directly comparable figure found	HLE no-tools 46.9%, with-tools 54.7% ^[16]	HLE-Full 34.7%; with-tools 54.0% ^[29]^[34]	HLE 37.7% ^[37]	With tools, Claude and Kimi are close; DeepSeek is lower on the listed figures
Long context	No clear public context spec found in the launch-document excerpt used	1M context window ^[14]	256k max context length ^[32]	V4 documentation positions the series for long context ^[37]^[42]	Claude and DeepSeek are positioned more clearly for long context, but real recall must be tested separately
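The "who leads where" reading of the table can be reproduced mechanically. The following is a minimal sketch, not part of the original report: the scores are copied from the numeric rows of the head-to-head table, `None` marks cells where the sources give no comparable figure, and the names `SCORES` and `leader` are illustrative. Only base settings are included, so the GPT‑5.5 Pro and Kimi Agent Swarm BrowseComp figures are deliberately left out.

```python
from typing import Optional

# Column order matches the table; scores are percentages as reported by the cited sources.
MODELS = ["GPT-5.5", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek V4-Pro/Pro-Max"]

# None = no comparable figure found in the source set.
SCORES: dict[str, list[Optional[float]]] = {
    "Terminal-Bench 2.0": [82.7, 69.4, 66.7, 67.9],
    "SWE-Bench Pro":      [58.6, 64.3, 58.6, 55.4],
    "SWE-Bench Verified": [None, 87.6, 80.2, 80.6],
    "OSWorld-Verified":   [78.7, 78.0, 73.1, None],
    "BrowseComp":         [84.4, 79.3, 83.2, None],  # base settings only
    "GPQA Diamond":       [None, 94.2, 90.5, 90.1],
}


def leader(benchmark: str) -> tuple[str, float, float]:
    """Return (model, score, margin over the next reported score) for one benchmark."""
    reported = [(s, m) for s, m in zip(SCORES[benchmark], MODELS) if s is not None]
    reported.sort(reverse=True)
    (top, model), (second, _) = reported[0], reported[1]
    return model, top, round(top - second, 1)


if __name__ == "__main__":
    for name in SCORES:
        model, score, margin = leader(name)
        print(f"{name}: {model} at {score}% (+{margin} pts over the next reported score)")
```

A margin computed this way only compares reported numbers; it says nothing about whether two scores came from the same harness, tool access, or effort setting.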

Choosing by workload: which model fits what

1. Agents that use a terminal, a browser, and multi-step tools: GPT‑5.5

If your workload has the model open web pages, call tools, run terminal commands, edit files, and loop through many steps in an agentic workflow, GPT‑5.5 is the standout in this dataset. The figures OpenAI reports include Terminal‑Bench 2.0 at 82.7%, OSWorld‑Verified at 78.7%, BrowseComp at 84.4%, and Toolathlon at 55.6%. ^[5]

GPT‑5.5 Pro, however, has to be read separately from standard GPT‑5.5. GPT‑5.5 Pro scores 90.1% on BrowseComp, but the OpenAI system card states that Pro is a setting of the same underlying model that uses parallel test-time compute. ^[3]^[5]

Best for: coding agents, browser research agents, computer-use automation, and enterprise assistants that call many kinds of tools.

2. Production codebase repair: Claude Opus 4.7

If the main KPI is fixing bugs in real repositories, preparing pull requests, getting tests to pass, and understanding a large codebase, Claude Opus 4.7 is the strongest shortlist candidate in this data, with SWE‑Bench Verified at 87.6% and SWE‑Bench Pro at 64.3%. ^[17]

Anthropic describes Claude Opus 4.7 as a hybrid reasoning model for coding and AI agents with a 1M context window, so it is worth testing in workflows that have to read large codebases or large volumes of documents. ^[14]

Best for: repo maintenance, code review, complex refactors, developer copilots, and engineering agents.

3. Coding stacks that require open weights: Kimi K2.6

If the key requirement is self-hosting, data control, or open-weights usage, Kimi K2.6 is one of the most notable options. Kimi's official table lists Terminal‑Bench 2.0 at 66.7%, SWE‑Bench Pro at 58.6%, SWE‑Bench Verified at 80.2%, SciCode at 52.2%, and LiveCodeBench v6 at 89.6. ^[29]

Kimi K2.6 also shows good signals in agentic and search work, with BrowseComp at 83.2% and Agent Swarm BrowseComp at 86.3%. ^[34] Artificial Analysis reports that the model natively supports image/video input and has a maximum context length of 256k. ^[32]

Best for: open-model deployments, coding agents, research agents, and teams that want more control over hosting than a hosted frontier model alone allows.

4. Long-context open-source/open-weights experimentation: DeepSeek V4

DeepSeek states that DeepSeek V4 Preview went live and was open-sourced on 24 April 2026. ^[42] The DeepSeek-V4-Pro model card presents the V4 series as MoE language models. ^[37]

The scores reported for DeepSeek V4-Pro/Pro-Max include Terminal Bench 2.0 67.9, SWE Verified 80.6, SWE Pro 55.4, and GPQA Diamond 90.1. ^[37] That makes DeepSeek V4 a strategic option for teams that want to compare hosted frontier models with deployable open-weights models, but scores must always be read per variant. ^[37]^[42]

Best for: long-context applications, open-source/open-weights experiments, and teams that want to weigh the trade-offs among capability, deployment, and system control.

5. Science and math: Claude leads on GPQA, but one benchmark does not settle the picture

On the available figures, Claude Opus 4.7 reports GPQA Diamond 94.2%. ^[19] Kimi K2.6 reports GPQA-Diamond 90.5% and AIME 2026 96.4%, ^[27]^[29] while DeepSeek V4-Pro/Pro-Max reports GPQA Diamond 90.1. ^[37]

Claude is therefore a strong shortlist candidate for science reasoning, but a decision on math or science workloads should not rest on a single benchmark, because results shift with setup, tool access, and effort mode. ^[12]

Checklist before committing to a model

Do not decide from a single benchmark. Use public benchmarks to build a shortlist, then run an internal eval with the same prompts, tool budget, timeout, and scoring rubric, because independently run scores may not match self-reported scores. ^[12]
Treat GPT‑5.5 and GPT‑5.5 Pro as separate tracks. Pro uses a parallel test-time compute setting, so it should not be treated as comparable to standard GPT‑5.5 under the same compute budget. ^[3]
Settle the open-weights requirement first. If data control, self-hosting, or deployment customization is mandatory, put Kimi K2.6 and DeepSeek V4 in their own evaluation lane. ^[29]^[34]^[37]^[42]
Long context needs more testing than a look at window size. Claude Opus 4.7 is positioned with a 1M context, Kimi K2.6 has a 256k max context, and DeepSeek V4 is positioned for long context, but recall, instruction following, and cost have to be tested on your own documents. ^[14]^[17]^[32]^[37]^[42]
Coding agents must be run against your team's real repos. SWE‑Bench-style scores are useful, but production repos tend to have their own dependency setup, flaky tests, coding style, and review constraints. ^[17]
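The first checklist item, running every candidate under the same prompts, tool budget, timeout, and rubric, can be sketched as a small harness. This is an illustrative outline under stated assumptions, not a reference implementation: `EvalConfig`, `Task`, `Runner`, `pass_rate`, and `compare` are hypothetical names, each runner is assumed to be an adapter you write around your own model client, and each runner is assumed to enforce the tool budget and timeout itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalConfig:
    """Settings held constant across every model under test."""
    prompt_template: str   # must contain a {task} placeholder
    tool_budget: int       # max tool calls a runner may make per task
    timeout_s: float       # wall-clock limit a runner must enforce per task


@dataclass(frozen=True)
class Task:
    task_id: str
    input_text: str
    expected: str


# Hypothetical adapter type: one callable per model, wrapping whatever client you use.
Runner = Callable[[str, EvalConfig], str]
Rubric = Callable[[str, str], bool]


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible rubric; real evals usually need something richer."""
    return output.strip() == expected.strip()


def pass_rate(runner: Runner, tasks: Sequence[Task], config: EvalConfig,
              rubric: Rubric = exact_match) -> float:
    """Fraction of tasks passed; a runner exception (crash, timeout) counts as a failure."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    passed = 0
    for task in tasks:
        prompt = config.prompt_template.format(task=task.input_text)
        try:
            output = runner(prompt, config)
        except Exception:
            continue
        if rubric(output, task.expected):
            passed += 1
    return passed / len(tasks)


def compare(runners: dict[str, Runner], tasks: Sequence[Task],
            config: EvalConfig) -> dict[str, float]:
    """Score every runner on the same tasks with the same config and rubric."""
    return {name: pass_rate(run, tasks, config) for name, run in runners.items()}
```

Because the config object is frozen and shared, a difference in pass rate between two runners cannot come from a different prompt template or budget, which is the point of the checklist item.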

Limitations of this comparison

No public comparison was found in which a single independent lab tests all 4 models with the same harness, the same tool access, and the same effort setting throughout. LM Council also warns about discrepancies between independent and self-reported benchmarks. ^[12]
GPT‑5.5 Pro should not be read as standard GPT‑5.5, because the OpenAI system card states that Pro is a setting of the same underlying model that uses parallel test-time compute. ^[3]
DeepSeek V4 scores are variant-specific, so V4 Preview, V4-Pro, and Pro-Max style naming should not be merged into one score. ^[37]^[42]
For Kimi K2.6 and DeepSeek V4, which sit in the open-weights/deployable group, real-world performance may depend on the serving stack, hardware, quantization, and context settings, so test your own deployment alongside the published benchmarks. ^[29]^[34]^[37]
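Testing long-context recall on your own deployment, rather than trusting the advertised window size, can start from a simple "needle in a haystack" probe. The sketch below is one generic way to do that and is not described in the sources: `ask` is a hypothetical callable that sends a prompt to whichever deployment you are testing and returns its text reply, and the filler sentence, needle, and depths are placeholders you would replace with material from your real documents.

```python
from typing import Callable, Iterable

# Placeholder filler; real tests should use text resembling your own documents.
FILLER = "The quarterly report lists routine maintenance with no incidents."


def build_haystack(needle: str, depth: float, total_sentences: int = 1000,
                   filler: str = FILLER) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def recall_by_depth(ask: Callable[[str], str], needle: str, question: str,
                    answer: str, depths: Iterable[float],
                    total_sentences: int = 1000) -> dict[float, bool]:
    """For each depth, record whether the reply contains the expected answer string."""
    results: dict[float, bool] = {}
    for depth in depths:
        document = build_haystack(needle, depth, total_sentences)
        reply = ask(f"{document}\n\nQuestion: {question}")
        results[depth] = answer.lower() in reply.lower()
    return results
```

Running the probe at several depths and several document lengths, with the same quantization and context settings you plan to serve, gives a first view of where recall starts to degrade; substring matching is a crude scorer and would need replacing for free-form answers.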

Closing summary

Choose GPT‑5.5 if the main work is agentic computer use, browsing, tool orchestration, and terminal-heavy coding. ^[5]

Prioritize Claude Opus 4.7 if the core value of your product is repo-level bug fixing, codebase repair, and SWE‑Bench-style software engineering. ^[14]^[17]

Evaluate Kimi K2.6 if you need an open-weights coding model with strong signals across SWE‑Bench, Terminal‑Bench, and agentic search. ^[29]^[34]

Put DeepSeek V4-Pro/Pro-Max on the shortlist if the goal is long-context open-source/open-weights experimentation and deployability, but always check the variant against the benchmark setup. ^[37]^[42]

The safest path is to use public benchmarks to build the shortlist, then pick the final model using your team's real tasks together with latency, cost, privacy constraints, and your system's own failure-mode tests. ^[12]
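The final selection step can be written down as "hard constraints first, internal results second". The sketch below is only an illustration of that ordering: the `Candidate` fields, the thresholds, and the function name `rank_candidates` are assumptions, and every number you feed in is expected to come from your own measurements, not from public leaderboards.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    open_weights: bool
    internal_pass_rate: float   # from your own eval, 0.0 to 1.0
    p95_latency_s: float        # measured on your own workload
    cost_per_task_usd: float    # measured on your own workload


def rank_candidates(candidates: Iterable[Candidate], *, require_open_weights: bool,
                    max_latency_s: float, max_cost_usd: float) -> list[Candidate]:
    """Drop candidates that violate a hard constraint, then sort the rest by internal pass rate."""
    eligible = [
        c for c in candidates
        if (c.open_weights or not require_open_weights)
        and c.p95_latency_s <= max_latency_s
        and c.cost_per_task_usd <= max_cost_usd
    ]
    return sorted(eligible, key=lambda c: c.internal_pass_rate, reverse=True)
```

Keeping privacy or hosting requirements as filters rather than as weighted terms means a model that fails one of them can never win on raw score alone.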


Key takeaways

No model wins every task: GPT‑5.5 shows the strongest signals in agentic computer use and tool workflows, Claude Opus 4.7 stands out in repo-level code repair, Kimi K2.6 is strong in open-weights coding, and DeepSeek V4 belongs on the list for long cont...
Headline numbers: GPT‑5.5 scores 82.7% on Terminal‑Bench 2.0 and 84.4% on BrowseComp; Claude Opus 4.7 scores 87.6% on SWE‑Bench Verified and 64.3% on SWE‑Bench Pro; Kimi K2.6 scores 80.2% on SWE‑Bench Verified; DeepSeek V4 Pro/Pro Max reports SWE...
Do not rely on leaderboards alone for purchase or deployment decisions: differing labs, tool access, and effort settings rule out direct comparison. Test against your team's real workload. ^[12]


Research conversation

You · 26 Apr 2026

Research question

Research the benchmarks for GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 and prepare a solid research report.

Studio Global AI · 26 Apr 2026 · 40 sources

Cited answer

As of 26 April 2026, based on the available official and major sources, no single one of these four models is the "winner on every benchmark." GPT‑5.5 looks very strong in agentic coding and computer use, Claude Opus 4.7 leads on SWE-Bench-style real-world coding benchmarks, Kimi K2.6 is highly competitive as an open-weights coding/agent model, and DeepSeek V4-Pro/Pro-Max delivers strong reasoning and coding performance as a 1M-context open-weights option. For DeepSeek V4, I treated Pro/Pro-Max as the main comparison, because the official materials describe V4-Flash as the speed/cost variant. ^[1]^[8]^[9]

Key findings

In OpenAI's launch benchmarks, GPT‑5.5 scored 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, 51.7% on FrontierMath Tier 1–3, and 35.4% on FrontierMath Tier 4; GPT‑5.5 Pro showed 90.1% on BrowseComp and 39.6% on FrontierMath Tier 4. ^[1]
Claude Opus 4.7 is especially strong on coding benchmarks: Vellum's breakdown of the Anthropic-reported benchmarks gives SWE-Bench Verified 87.6%, SWE-Bench Pro 64.3%, MCP-Atlas 77.3%, OSWorld-Verified 78.0%, and GPQA Diamond 94.2%. ^[5]
Kimi K2.6 is among the strongest open-weights coding contenders: its official Hugging Face model card gives SWE-Bench Pro 58.6%, Terminal-Bench 2.0 66.7%, SWE-Bench Verified 80.2%, BrowseComp 83.2%, BrowseComp Agent Swarm 86.3%, and GPQA-Diamond 90.5%. ^[6]
The official DeepSeek V4-Pro release lists 1.6T total / 49B active parameters and a 1M context; DeepSeek-V4-Flash is the faster, more economical variant with 284B total / 13B active parameters. ^[8]^[9]
DeepSeek-V4-Pro-Max reports, on its Hugging Face model card, LiveCodeBench 93.5, a Codeforces rating of 3206, GPQA Diamond 90.1, Terminal Bench 2.0 67.9, SWE Verified 80.6, and SWE Pro 55.4. ^[9]
The cross-model comparisons in the available evidence are not fully apples-to-apples, because many results are vendor-reported, effort settings differ, tools and harnesses may differ, and some competitor scores are re-evaluated or self-reported. ^[5]^[6]^[9]

Model profiles

Model	Status / release	Key specs	Primary strengths
GPT‑5.5	OpenAI released GPT‑5.5 on 23 April 2026 and added API availability in a 24 April 2026 update. ^[1]	The public page does not disclose a parameter count; GPT‑5.5 Pro is described as a parallel test-time compute setting of the same underlying model. ^[2]	Agentic coding, computer use, tool use, long-horizon work. ^[1]
Claude Opus 4.7	Anthropic's page shows the Claude Opus 4.7 announcement dated 16 April 2026. ^[3]	1M context window, 128k max output tokens, adaptive thinking, high-resolution image support. ^[4]	Real-world coding, tool-calling agents, professional knowledge work. ^[3]^[5]
Kimi K2.6	Moonshot AI's open-source, natively multimodal agentic model. ^[6]	MoE architecture, 1T total parameters, 32B active parameters, 256K context, Modified MIT license. ^[6]	Open-weights coding, agent swarm, multimodal coding-driven design. ^[6]
DeepSeek V4-Pro / Flash	DeepSeek-V4 Preview was announced as live and open-sourced on 24 April 2026. ^[8]	V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active; both support 1M context. ^[8]^[9]	Long-context open-weights reasoning, coding, cost-efficient deployment. ^[8]^[9]
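For the three MoE models whose parameter counts are listed, the share of parameters active per token follows directly from the table. The short calculation below is added for illustration; it uses only the total and active counts quoted above (in billions, with 1T taken as 1000B), and the names `MOE_SPECS` and `active_fraction` are hypothetical.

```python
# (total, active) parameters in billions, as listed in the model profile table.
MOE_SPECS = {
    "Kimi K2.6": (1000, 32),
    "DeepSeek V4-Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters active per token for an MoE model."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total_b, active_b) in MOE_SPECS.items():
        share = active_fraction(total_b, active_b)
        print(f"{name}: {active_b}B of {total_b}B active ({share:.1%})")
```

The active share is only a rough indicator of per-token compute; serving cost also depends on hardware, quantization, and context length, none of which these figures capture.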

Benchmark comparison

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro/Pro-Max	How to read it
Terminal-Bench 2.0	82.7% ^[1]	69.4% ^[1]^[5]	66.7% ^[6]	67.9% ^[9]	GPT‑5.5 is clearly ahead on this command-line/agentic coding benchmark. ^[1]
SWE-Bench Pro	58.6% ^[1]	64.3% ^[5]	58.6% ^[6]	55.4% ^[9]	Claude Opus 4.7 leads on this hard software-engineering benchmark. ^[5]
SWE-Bench Verified	No comparable GPT‑5.5 score found in the available source. ^[1]	87.6% ^[5]	80.2% ^[6]	80.6% ^[9]	Claude Opus 4.7 is the strongest in the reported results. ^[5]
OSWorld-Verified	78.7% ^[1]	78.0% ^[1]^[5]	73.1% ^[6]	Insufficient evidence	GPT‑5.5 and Claude Opus 4.7 are very close on computer-use tasks. ^[1]^[5]
BrowseComp	84.4%; Pro 90.1% ^[1]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[6]	Insufficient evidence	GPT‑5.5 Pro and Kimi Agent Swarm look strong in web research and agentic search. ^[1]^[6]
GPQA Diamond	No comparable score found in the available OpenAI launch excerpt. ^[1]	94.2% ^[5]	90.5% ^[6]	90.1% ^[9]	Claude Opus 4.7 leads in science reasoning on the reported scores. ^[5]
HLE / hard reasoning	No comparable HLE score found in the available OpenAI launch excerpt. ^[1]	HLE no-tools 46.9%, with-tools 54.7% ^[5]	HLE-Full 34.7%, with-tools 54.0% ^[6]	HLE 37.7% ^[9]	On tool-augmented HLE, Claude and Kimi are close; DeepSeek's listed HLE score is lower. ^[5]^[6]^[9]
Long context	Public specs not disclosed in the retrieved source	1M context ^[4]	256K context ^[6]	1M context ^[8]^[9]	Claude Opus 4.7 and DeepSeek V4 are more clearly positioned for long-context deployment. ^[4]^[8]^[9]

Conclusions by use case

If your workload is terminal-heavy autonomous coding, computer use, tool-driven workflows, and general frontier-agent work, GPT‑5.5 looks like the strongest candidate, especially given Terminal-Bench 2.0 82.7%, OSWorld-Verified 78.7%, Toolathlon 55.6%, and BrowseComp 84.4%. ^[1]
If your goal is GitHub issue resolution, production codebase repair, and SWE-Bench-style software engineering, Claude Opus 4.7 looks strongest, with SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. ^[5]
If you need an open-weights, self-hostable model and both coding and agentic research matter, Kimi K2.6 is a very strong option: it is a 1T/32B-active MoE model and reports SWE-Bench Pro 58.6%, BrowseComp 83.2%, and Agent Swarm BrowseComp 86.3%. ^[6]
If you need 1M context, open weights, and cost-efficient deployment, DeepSeek V4-Pro/Flash is strategically important; V4-Pro is 1.6T/49B-active, and V4-Flash is the faster, more economical 284B/13B-active variant. ^[8]^[9]
If pure reasoning or frontier math is your main goal, the picture in this dataset is mixed: Claude Opus 4.7 is at 94.2% on GPQA Diamond, Kimi K2.6 gives GPQA-Diamond 90.5% and AIME 2026 96.4%, and DeepSeek-V4-Pro-Max shows GPQA Diamond 90.1%, HMMT 2026 Feb 95.2%, and IMOAnswerBench 89.8%. ^[5]^[6]^[9]

Evidence notes

For GPT‑5.5, the strongest evidence is OpenAI's official launch post and system card, but this is vendor-reported data. ^[1]^[2]
For Claude Opus 4.7, Anthropic's official product and docs pages give capabilities and specs, while Vellum provides a readable breakdown of the Anthropic-reported tables for benchmark values. ^[3]^[4]^[5]
For Kimi K2.6, the official Hugging Face model card is the most useful benchmark source, because it includes architecture, evaluation settings, and footnotes. ^[6]
For DeepSeek V4, the DeepSeek API Docs release page covers availability and specs, and the DeepSeek Hugging Face model card provides the detailed evaluation table. ^[8]^[9]
"Thinking effort," tools, max generation length, and harness differ across many benchmarks; the Kimi K2.6 card itself notes that some competitor scores were re-evaluated and marked with an asterisk where they were not publicly available. ^[6]

Limitations / uncertainty

Insufficient evidence: no complete public benchmark was found that evaluates all four models in the same independent lab, with the same harness, the same tool budget, and the same inference-effort setting. ^[5]^[6]^[9]
GPT‑5.5 and Claude Opus 4.7 are closed models, so parameter count, training data, and exact inference configuration are only partly available for public comparison. ^[1]^[3]
The DeepSeek V4 name covers variants such as Flash, Pro, and Pro-Max/effort modes, so any "DeepSeek V4" benchmark score is variant-specific. ^[8]^[9]
Some official benchmark charts are images or only partially present in the extracted text, so I included only numbers that were clearly available in the source text. ^[1]^[8]^[9]

Summary

Overall, GPT‑5.5 looks like the strongest all-around frontier agent model, especially on Terminal-Bench 2.0, OSWorld, and BrowseComp. ^[1] Claude Opus 4.7 looks the most dependable for coding agents, particularly on SWE-Bench Verified and SWE-Bench Pro. ^[5] Kimi K2.6 is among the strongest contenders in the open-weights ecosystem for coding and agentic workflows. ^[6] DeepSeek V4-Pro/Pro-Max stands out as a 1M-context open-weights model and is competitive in coding and reasoning, but direct cross-model comparison still rests on limited evidence. ^[8]^[9]

Source map

^[1] OpenAI — “Introducing GPT‑5.5”
^[2] OpenAI — “GPT‑5.5 System Card”
^[3] Anthropic — “Claude Opus 4.7”
^[4] Anthropic Docs — “What’s new in Claude Opus 4.7”
^[5] Vellum — “Claude Opus 4.7 Benchmarks Explained”
^[6] Moonshot AI — “Kimi K2.6” Hugging Face model card
^[7] GMI Cloud — “Kimi K2.6: Architecture, Benchmarks, and What It Means for Production AI”
^[8] DeepSeek API Docs — “DeepSeek-V4 Preview Release”
^[9] DeepSeek AI — “DeepSeek-V4-Pro” Hugging Face model card

Sources

[3] GPT-5.5 System Card - OpenAI (openai.com)
We generally treat GPT‑5.5’s safety results as strong proxies for GPT‑5.5 Pro, which is the same underlying model using a setting that makes use of parallel test time compute. As noted below, we separately evaluate GPT‑5.5 Pro in certain cases because we ju...
[5] Introducing GPT-5.5 - OpenAI (openai.com)
Computer use and vision. Eval | GPT-5.5 | GPT‑5.4 | GPT-5.5 Pro | GPT‑5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro. OSWorld-Verified: 78.7% | 75.0% | - | - | 78.0% | -. MMMU Pro (no tools): 81.2% | 81.2% | - | - | - | 80.5%. MMMU Pro (with tools): 83.2% | 82.1% | - | - | - | -. Tool use. Eval | GPT-5.5 | GPT‑5.4 | GPT-5.5 Pro | GPT‑5.4 Pro | Claud...
[12] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
AI Model Benchmarks Apr 2026 18 benchmarks - the world's most-followed benchmarks, curated by AI Explained, author of SimpleBench Independently-run benchmarks by Epoch, Scale and others, so may not match self-reported scores by AI orgs. Compare Models Human...
[14] Claude Opus 4.7 - Anthropic (anthropic.com)
Claude Opus 4.7: Hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 1M con...
[16] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Apr 16, 2026 • 16 min • By Nicolas Zeeb. Contents: Key observations of reported benchmarks; Coding capabilities; SWE-bench Verified; SWE-bench Pro; Terminal-Bench 2.0; Agentic capabilities; MCP-Atlas (Scaled tool use); Finance Agent v1.1; OSWorld-Verified (Computer...
[17] Claude Opus 4.7 Review: Everything New in 2026 (app.stationx.net)
Benchmark | Opus 4.6 | Opus 4.7 | Change. SWE-Bench Pro: 53.4% | 64.3% | +10.9. SWE-Bench Verified: 80.8% | 87.6% | +6.8. Graphwalks (multi-hop reasoning): 38.7% | 58.6% | +19.9. OSWorld-Verified (computer use): 72.7% | 78.0% | +5.3. CharXiv (vision...
[19] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ... (help.apiyi.com)
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[27] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Access (gmicloud.ai)
K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch evaluations. Reasoning and Knowledge K2.6 is competitive with closed-source models on math and science, though GPT-5.4 and...
[29] Kimi K2.6 Tech Blog: Advancing Open-Source Coding (kimi.com)
APEX-Agents 27.9 33.3 33.0 32.0 11.5 OSWorld-Verified 73.1 75.0 72.7 — 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 — 77.8 76.9 73.0 SWE-Bench Verified 80.2 — 80.8 80...
[32] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[34] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[37] deepseek-ai/DeepSeek-V4-Pro - Hugging Face (huggingface.co)
We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T ... 2 days ago
[42] DeepSeek V4 Preview Release (api-docs.deepseek.com)
News; DeepSeek-V4 Preview Release 2026/04/24. On this page. DeepSeek V4 Preview Release. DeepSeek-V4 Preview is officially live & open-sourced!

ค้นพบเทรนด์

รายงานเผยแพร่แล้ว28 เม.ย. 2026Last edited 6 พ.ค. 202613 แหล่งที่มา

เทียบ benchmark GPT‑5.5 vs Claude Opus 4.7 vs Kimi K2.6 vs DeepSeek V4: โมเดลไหนเหมาะกับงานอะไร

ค้นหาและตรวจสอบข้อเท็จจริงด้วย Studio Global AI เรียกดูเพิ่มเติมจาก Discover

17K0

คำตอบสั้น ๆ

งาน agentic computer-use, browser workflow และ terminal-heavy agents: GPT‑5.5 ให้สัญญาณสาธารณะที่แรงที่สุดในชุดข้อมูลนี้ โดย OpenAI รายงาน Terminal‑Bench 2.0 ที่ 82.7%, OSWorld‑Verified 78.7%, BrowseComp 84.4% และ Toolathlon 55.6% ^[5]
งานซ่อม codebase ระดับ production และ benchmark สาย SWE‑Bench: Claude Opus 4.7 เป็นตัวเลือก shortlist ที่แข็งแรงที่สุด ด้วย SWE‑Bench Verified 87.6% และ SWE‑Bench Pro 64.3% ที่มีรายงาน ^[17]
งาน coding ที่ต้องการ open weights: Kimi K2.6 แข่งขันได้มาก โดยเอกสารของ Kimi ระบุ Terminal‑Bench 2.0 66.7%, SWE‑Bench Pro 58.6%, SWE‑Bench Verified 80.2% และ LiveCodeBench v6 89.6 ^[29]
งานทดลอง open-source/open-weights แบบ long context: DeepSeek V4 ควรถูกนำมาทดสอบ แต่ต้องดูให้ชัดว่าเป็น variant ใด เพราะ DeepSeek ระบุว่า V4 Preview live และ open-sourced เมื่อ 24 เมษายน 2026 ^[42]
งาน reasoning ด้านวิทยาศาสตร์: Claude Opus 4.7 รายงาน GPQA Diamond 94.2%; Kimi K2.6 รายงาน GPQA-Diamond 90.5% และ AIME 2026 96.4%; DeepSeek V4-Pro/Pro-Max รายงาน GPQA Diamond 90.1 ^[19]^[27]^[29]^[37]

ก่อนอ่าน benchmark: 3 เรื่องที่ต้องจำ

benchmark คนละตระกูลวัดคนละทักษะ Terminal‑Bench, SWE‑Bench, BrowseComp, OSWorld, GPQA และ HLE ไม่ได้ถามคำถามเดียวกัน โมเดลที่เก่งแก้ issue ใน repo อาจไม่ใช่ตัวที่ดีที่สุดสำหรับ web research agent หรือ computer-use automation ^[5]^[17]^[29]
tool access และ inference effort เปลี่ยนผลได้ OpenAI system card ระบุว่า GPT‑5.5 Pro ใช้โมเดลพื้นฐานเดียวกับ GPT‑5.5 แต่เป็น setting ที่ใช้ parallel test-time compute ดังนั้นคะแนน GPT‑5.5 Pro ไม่ควรถูกอ่านเหมือนเป็นคะแนนของ GPT‑5.5 ปกติภายใต้ compute budget เดียวกัน ^[3]
public benchmark เหมาะสำหรับ shortlist ไม่ใช่คำตอบสุดท้ายของ procurement เพราะ independent runs อาจไม่ตรงกับ self-reported scores ทีมที่เลือกใช้จริงควรรัน eval ภายในด้วย prompt, tool budget, timeout และเกณฑ์ให้คะแนนเดียวกัน ^[12]

ภาพรวมแต่ละโมเดล

โมเดล	ภาพจำจากเอกสารสาธารณะ	สัญญาณที่เด่น	ข้อควรระวัง
GPT‑5.5	เอกสารเปิดตัวของ OpenAI เน้น computer-use, tool-use และ agentic workflows ^[5]	Terminal‑Bench 2.0 82.7%, OSWorld‑Verified 78.7%, BrowseComp 84.4%; ส่วน GPT‑5.5 Pro ได้ BrowseComp 90.1 ^[5]	อย่าเทียบคะแนน Pro กับ GPT‑5.5 ปกติโดยตรง เพราะ Pro ใช้ parallel test-time compute setting ^[3]
Claude Opus 4.7	Anthropic วางตำแหน่งเป็น hybrid reasoning model สำหรับ coding และ AI agents พร้อม context window 1M ^[14]	SWE‑Bench Verified 87.6% และ SWE‑Bench Pro 64.3% ^[17]	context window 1M มีประโยชน์ แต่ขนาดหน้าต่างไม่เท่ากับ recall quality เสมอ โดยสรุปของ StationX มี caveat เรื่อง recall ที่ปลายสุดของ 1M tokens ^[17]
Kimi K2.6	โมเดล open-source/open-weights จาก Moonshot/Kimi ที่เน้นงาน coding ^[29]^[34]	Terminal‑Bench 2.0 66.7%, SWE‑Bench Pro 58.6%, SWE‑Bench Verified 80.2%, LiveCodeBench v6 89.6 ^[29]	Artificial Analysis ระบุว่า Kimi K2.6 รองรับ image/video input แบบ native และมี max context length 256k; performance จริงยังขึ้นกับการ deploy ^[32]
DeepSeek V4-Pro / Pro-Max	DeepSeek ระบุว่า V4 Preview live และ open-sourced ส่วน Hugging Face card นำเสนอ V4 series เป็น MoE language models ^[37]^[42]	SWE Verified 80.6, SWE Pro 55.4, Terminal Bench 2.0 67.9 และ GPQA Diamond 90.1 ^[37]	ชื่อ DeepSeek V4 มีหลาย variant จึงไม่ควรรวม Flash, Pro และ Pro-Max เป็นคะแนนเดียวกัน ^[37]^[42]

Head-to-head benchmark table

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro / Pro-Max	How to read it
Terminal‑Bench 2.0	82.7% ^[5]	69.4% reported ^[16]	66.7% ^[29]	67.9% ^[37]	Command-line and autonomous-coding-style work: GPT‑5.5 has its clearest lead here
SWE‑Bench Pro	58.6% ^[5]	64.3% ^[17]	58.6% ^[29]	55.4% ^[37]	On this harder software engineering benchmark, Claude Opus 4.7 leads
SWE‑Bench Verified	No clear comparable value found in this source set	87.6% ^[17]	80.2% ^[29]	80.6% ^[37]	Fixing issues in real repos: Claude has the strongest reported signal
OSWorld‑Verified	78.7% ^[5]	78.0% ^[17]	73.1% ^[29]	No clearly comparable value found	Computer use: GPT‑5.5 and Claude Opus 4.7 are very close
BrowseComp	84.4%; GPT‑5.5 Pro 90.1% ^[5]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[34]	No clearly comparable value found	Browser-agent and web-research work: strong signals from GPT‑5.5 Pro and Kimi Agent Swarm
GPQA Diamond	No clearly comparable official value found in this source set	94.2% ^[19]	90.5% ^[27]	90.1% ^[37]	Advanced science reasoning: Claude has the highest reported score
HLE / hard reasoning	No directly comparable value found	HLE no-tools 46.9%, with-tools 54.7% ^[16]	HLE-Full 34.7%; with-tools 54.0% ^[29]^[34]	HLE 37.7% ^[37]	With tools, Claude and Kimi are close; DeepSeek is lower on the listed figure
Long context	No clear public context spec found in the launch excerpt used	1M context window ^[14]	256k max context length ^[32]	V4 documents position the series for long context ^[37]^[42]	Claude and DeepSeek are positioned more explicitly for long context, but real recall must be tested separately
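As a reading aid, the table above can be encoded as data and queried for the top reported score and its margin over the runner-up. This is a minimal sketch, not part of any cited source: `SCORES` and `leader` are names invented for this example, only rows with numeric percentages are included, `None` marks cells where no comparable value was found, and the GPT‑5.5 Pro and Kimi Agent Swarm figures are left out because they are separate settings.

```python
from typing import Optional

# Reported scores (percent) copied from the head-to-head table.
# None means no comparable value was found in this source set.
# GPT-5.5 Pro and Kimi Agent Swarm are excluded on purpose.
MODELS = ["GPT-5.5", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek V4-Pro/Pro-Max"]

RAW: dict[str, list[Optional[float]]] = {
    "Terminal-Bench 2.0": [82.7, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, 64.3, 58.6, 55.4],
    "SWE-Bench Verified": [None, 87.6, 80.2, 80.6],
    "OSWorld-Verified": [78.7, 78.0, 73.1, None],
    "BrowseComp": [84.4, 79.3, 83.2, None],
    "GPQA Diamond": [None, 94.2, 90.5, 90.1],
}

SCORES = {bench: dict(zip(MODELS, row)) for bench, row in RAW.items()}


def leader(benchmark: str) -> tuple[str, float, float]:
    """Return (model, score, margin over runner-up) among reported scores."""
    reported = {m: s for m, s in SCORES[benchmark].items() if s is not None}
    ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
    (top_model, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_model, top_score, round(top_score - second_score, 1)


if __name__ == "__main__":
    for name in SCORES:
        model, score, margin = leader(name)
        print(f"{name}: {model} {score}% (+{margin} pts)")
```

The margin column is the useful part: a 13.3-point gap on Terminal‑Bench 2.0 and a 0.7-point gap on OSWorld‑Verified call for different levels of confidence. Since the scores come from different reports and harnesses, the margins are indicative rather than controlled measurements.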

Choosing by task: which model fits what

1. Agents that need a terminal, a browser, and multi-step tools: GPT‑5.5

2. Repairing production-grade codebases: Claude Opus 4.7

Best for: repo maintenance, code review, complex refactors, developer copilots, and engineering agents

3. Coding stacks that require open weights: Kimi K2.6

4. Long-context open-source/open-weights experimentation: DeepSeek V4

5. Science and math: Claude leads on GPQA, but a single benchmark does not settle the overall picture

Checklist before real-world adoption

Do not decide from a single benchmark. Use public benchmarks to build a shortlist, then run an internal eval with the same prompts, tool budget, timeout, and scoring rubric, because independently run scores may not match self-reported scores. ^[12]
Treat GPT‑5.5 and GPT‑5.5 Pro as separate tracks. Pro uses a parallel test-time compute setting, so it should not be considered comparable to standard GPT‑5.5 under the same compute budget. ^[3]
Settle the open-weights requirement first. If data control, self-hosting, or deployment customization is mandatory, put Kimi K2.6 and DeepSeek V4 in their own evaluation lane. ^[29]^[34]^[37]^[42]
Long context needs more testing than a look at window size. Claude Opus 4.7 is positioned with a 1M context, Kimi K2.6 has a 256k max context, and DeepSeek V4 is positioned for long context, but recall, instruction following, and cost must be tested on your own documents. ^[14]^[17]^[32]^[37]^[42]
Coding agents must be run against your team's real repos. SWE‑Bench-style scores are useful, but production repos usually have their own dependency setup, flaky tests, coding style, and review constraints. ^[17]
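One way to enforce the "same prompts, tool budget, timeout, and scoring rubric" rule from the checklist is to derive every model's eval configuration from a single base and refuse to compare runs whose shared settings differ. The sketch below is illustrative only: `EvalConfig`, `build_configs`, `assert_comparable`, and all field values (tool budget of 50 calls, 1800-second timeout, the rubric name) are assumptions made up for this example, and the step that actually calls each model is deliberately left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    model: str
    track: str            # e.g. "standard" vs "pro"; compared separately
    prompt_template: str
    tool_budget: int      # max tool calls per task (hypothetical unit)
    timeout_s: int        # wall-clock limit per task, in seconds
    rubric: str           # identifier of the scoring rubric


def build_configs(base: EvalConfig, models: list[tuple[str, str]]) -> list[EvalConfig]:
    """Clone the base config per (model, track) so shared settings cannot drift."""
    return [replace(base, model=name, track=track) for name, track in models]


def shared_settings(cfg: EvalConfig) -> tuple[str, int, int, str]:
    return (cfg.prompt_template, cfg.tool_budget, cfg.timeout_s, cfg.rubric)


def assert_comparable(configs: list[EvalConfig]) -> None:
    """Raise if any run differs in prompt, tool budget, timeout, or rubric."""
    if len({shared_settings(c) for c in configs}) > 1:
        raise ValueError("eval configs differ in shared settings; scores are not comparable")


def lanes(configs: list[EvalConfig]) -> dict[str, list[str]]:
    """Group models by track so Pro-style settings are never ranked with standard ones."""
    grouped: dict[str, list[str]] = {}
    for cfg in configs:
        grouped.setdefault(cfg.track, []).append(cfg.model)
    return grouped


# Placeholder values; replace with your team's real prompt and limits.
BASE = EvalConfig(
    model="", track="", prompt_template="Fix the failing test in {repo}.",
    tool_budget=50, timeout_s=1800, rubric="tests-pass-v1",
)
CONFIGS = build_configs(BASE, [
    ("GPT-5.5", "standard"), ("GPT-5.5 Pro", "pro"),
    ("Claude Opus 4.7", "standard"), ("Kimi K2.6", "standard"),
    ("DeepSeek V4-Pro", "standard"),
])

if __name__ == "__main__":
    assert_comparable(CONFIGS)
    print(lanes(CONFIGS))
```

Keeping the config frozen and cloning it per model means a one-off tweak for a single model has to be made explicitly, at which point `assert_comparable` flags the run as not comparable.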

Limitations of this comparison

No public comparison was found in which all four models are tested by the same independent lab with the same harness, the same tool access, and the same effort setting throughout. LM Council also warns about discrepancies between independent and self-reported benchmarks. ^[12]
GPT‑5.5 Pro should not be read as standard GPT‑5.5, because the OpenAI system card states that Pro is a setting of the same underlying model that uses parallel test-time compute. ^[3]
DeepSeek V4 scores are variant-specific, so V4 Preview, V4-Pro, and Pro-Max style naming should not be merged into a single score. ^[37]^[42]
For Kimi K2.6 and DeepSeek V4, which sit in the open-weights/deployable group, real-world performance may depend on the serving stack, hardware, quantization, and context settings, so test your own deployment alongside the published benchmarks. ^[29]^[34]^[37]

Closing summary

Choose GPT‑5.5 if the main workload is agentic computer use, browsing, tool orchestration, and terminal-heavy coding. ^[5]

Key takeaways

No model wins every task: GPT‑5.5 shows the strongest signals in agentic computer use and tool workflows, Claude Opus 4.7 stands out for repo-level code repair, Kimi K2.6 is strong in open-weights coding, and DeepSeek V4 should be on the trial list for long cont...
Key numbers: GPT‑5.5 scores 82.7% on Terminal‑Bench 2.0 and 84.4% on BrowseComp; Claude Opus 4.7 scores 87.6% on SWE‑Bench Verified and 64.3% on SWE‑Bench Pro; Kimi K2.6 scores 80.2% on SWE‑Bench Verified; DeepSeek V4 Pro/Pro Max reports SWE...
Do not rely on leaderboards alone for purchase or deployment decisions: different labs, tool access, and effort settings make scores not directly comparable. Test against your team's real workload. ^[12]

Research conversation

Studio Global AI, 26 Apr 2026, 40 sources

Key findings

In OpenAI's launch benchmarks, GPT‑5.5 scored 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, 51.7% on FrontierMath Tier 1–3, and 35.4% on FrontierMath Tier 4; GPT‑5.5 Pro reached 90.1% on BrowseComp and 39.6% on FrontierMath Tier 4. ^[1]
Claude Opus 4.7 is especially strong on coding benchmarks: Vellum's breakdown of Anthropic-reported benchmarks lists SWE-Bench Verified 87.6%, SWE-Bench Pro 64.3%, MCP-Atlas 77.3%, OSWorld-Verified 78.0%, and GPQA Diamond 94.2%. ^[5]
Kimi K2.6 is among the strongest open-weights coding contenders: its official Hugging Face model card lists SWE-Bench Pro 58.6%, Terminal-Bench 2.0 66.7%, SWE-Bench Verified 80.2%, BrowseComp 83.2%, BrowseComp Agent Swarm 86.3%, and GPQA-Diamond 90.5%. ^[6]
The official DeepSeek V4-Pro release states 1.6T total / 49B active parameters and a 1M context; DeepSeek-V4-Flash is the faster, more economical variant with 284B total / 13B active parameters. ^[8]^[9]
On its Hugging Face model card, DeepSeek-V4-Pro-Max reports LiveCodeBench 93.5, Codeforces rating 3206, GPQA Diamond 90.1, Terminal Bench 2.0 67.9, SWE Verified 80.6, and SWE Pro 55.4. ^[9]
Cross-model comparisons in the available evidence are not fully apples-to-apples, because many results are vendor-reported, effort settings differ, tools and harnesses may differ, and some competitor scores are re-evaluated or self-reported. ^[5]^[6]^[9]

Model profiles

Model	Status / release	Key specs	Primary strengths
GPT‑5.5	OpenAI released GPT‑5.5 on 23 April 2026 and added API availability in a 24 April 2026 update. ^[1]	The public page does not disclose a parameter count; GPT‑5.5 Pro is described as a parallel test-time compute setting of the same underlying model. ^[2]	Agentic coding, computer use, tool use, long-horizon work. ^[1]
Claude Opus 4.7	The Anthropic page shows the Claude Opus 4.7 announcement dated 16 April 2026. ^[3]	1M context window, 128k max output tokens, adaptive thinking, high-resolution image support. ^[4]	Real-world coding, tool-calling agents, professional knowledge work. ^[3]^[5]
Kimi K2.6	Moonshot AI's open-source, native multimodal agentic model. ^[6]	MoE architecture, 1T total parameters, 32B active parameters, 256K context, Modified MIT license. ^[6]	Open-weights coding, agent swarm, multimodal coding-driven design. ^[6]
DeepSeek V4-Pro / Flash	DeepSeek-V4 Preview was announced as live and open-sourced on 24 April 2026. ^[8]	V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active; both support 1M context. ^[8]^[9]	Long-context open-weights reasoning, coding, cost-efficient deployment. ^[8]^[9]

Benchmark comparison

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro/Pro-Max	How to read it
Terminal-Bench 2.0	82.7% ^[1]	69.4% ^[1]^[5]	66.7% ^[6]	67.9% ^[9]	GPT‑5.5 is clearly ahead on this command-line/agentic coding benchmark. ^[1]
SWE-Bench Pro	58.6% ^[1]	64.3% ^[5]	58.6% ^[6]	55.4% ^[9]	Claude Opus 4.7 leads on this hard software-engineering benchmark. ^[5]
SWE-Bench Verified	No comparable GPT‑5.5 score found in the available source. ^[1]	87.6% ^[5]	80.2% ^[6]	80.6% ^[9]	Claude Opus 4.7 is strongest among the reported results. ^[5]
OSWorld-Verified	78.7% ^[1]	78.0% ^[1]^[5]	73.1% ^[6]	Insufficient evidence	GPT‑5.5 and Claude Opus 4.7 are very close on computer-use tasks. ^[1]^[5]
BrowseComp	84.4%; Pro 90.1% ^[1]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[6]	Insufficient evidence	GPT‑5.5 Pro and Kimi Agent Swarm look strong in web research/agentic search. ^[1]^[6]
GPQA Diamond	No comparable score found in the available OpenAI launch excerpt. ^[1]	94.2% ^[5]	90.5% ^[6]	90.1% ^[9]	Claude Opus 4.7 leads in science reasoning based on reported scores. ^[5]
HLE / hard reasoning	No comparable HLE score found in the available OpenAI launch excerpt. ^[1]	HLE no-tools 46.9%, with-tools 54.7% ^[5]	HLE-Full 34.7%, with-tools 54.0% ^[6]	HLE 37.7% ^[9]	On tool-augmented HLE, Claude and Kimi are close; DeepSeek's listed HLE score is lower. ^[5]^[6]^[9]
Long context	Public specs not disclosed in the retrieved source	1M context ^[4]	256K context ^[6]	1M context ^[8]^[9]	Claude Opus 4.7 and DeepSeek V4 are positioned more explicitly for long-context deployment. ^[4]^[8]^[9]

Conclusions by use case

If your workload is terminal-heavy autonomous coding, computer use, tool-driven workflows, and general frontier-agent work, GPT‑5.5 looks like the strongest candidate, especially given Terminal-Bench 2.0 82.7%, OSWorld-Verified 78.7%, Toolathlon 55.6%, and BrowseComp 84.4%. ^[1]
If your goal is GitHub issue resolution, production codebase repair, and SWE-Bench-style software engineering, Claude Opus 4.7 looks strongest, with SWE-Bench Verified 87.6% and SWE-Bench Pro 64.3%. ^[5]
If you need an open-weights/self-hostable model and both coding and agentic research matter, Kimi K2.6 is a very strong option: it is a 1T/32B-active MoE model and reports SWE-Bench Pro 58.6%, BrowseComp 83.2%, and Agent Swarm BrowseComp 86.3%. ^[6]
If you need 1M context, open weights, and cost-efficient deployment, DeepSeek V4-Pro/Flash is strategically important; V4-Pro is 1.6T/49B-active and V4-Flash is the faster, more economical 284B/13B-active variant. ^[8]^[9]
If the pure reasoning/math frontier is your main goal, the picture in this dataset is mixed: Claude Opus 4.7 scores 94.2% on GPQA Diamond, Kimi K2.6 reports GPQA-Diamond 90.5% and AIME 2026 96.4%, and DeepSeek-V4-Pro-Max shows GPQA Diamond 90.1%, HMMT 2026 Feb 95.2%, and IMOAnswerBench 89.8%. ^[5]^[6]^[9]

Evidence notes

For GPT‑5.5, the strongest evidence is OpenAI's official launch post and system card, but this is vendor-reported data. ^[1]^[2]
For Claude Opus 4.7, Anthropic's official product and docs pages give capabilities and specs, while Vellum provides a readable breakdown of the Anthropic-reported tables for benchmark values. ^[3]^[4]^[5]
For Kimi K2.6, the official Hugging Face model card is the most useful benchmark source, because it includes architecture, evaluation settings, and footnotes. ^[6]
For DeepSeek V4, the DeepSeek API Docs release page covers availability and specs, and the DeepSeek Hugging Face model card gives the detailed evaluation table. ^[8]^[9]
Across many benchmarks, "thinking effort," tools, max generation length, and harness differ; the Kimi K2.6 card itself notes that some competitor scores were re-evaluated and asterisk-marked where they were not publicly available. ^[6]

Limitations / uncertainty

Insufficient evidence: no complete public benchmark was found that evaluates all four models in the same independent lab, with the same harness, the same tool budget, and the same inference-effort setting. ^[5]^[6]^[9]
GPT‑5.5 and Claude Opus 4.7 are closed models, so parameter count, training data, and exact inference configuration are only partly available for public comparison. ^[1]^[3]
The DeepSeek V4 name covers variants such as Flash, Pro, and Pro-Max/effort-mode, so any "DeepSeek V4" benchmark score is variant-specific. ^[8]^[9]
Some official benchmark charts are images or only partially present in the extracted text, so I included only the numbers that were clearly available in the source text. ^[1]^[8]^[9]

Summary

Source map

Sources

[3] GPT-5.5 System Card - OpenAI (openai.com)
We generally treat GPT‑5.5’s safety results as strong proxies for GPT‑5.5 Pro, which is the same underlying model using a setting that makes use of parallel test time compute. As noted below, we separately evaluate GPT‑5.5 Pro in certain cases because we ju...
[5] Introducing GPT-5.5 - OpenAI (openai.com)
Computer use and vision. Columns: Eval, GPT-5.5, GPT‑5.4, GPT-5.5 Pro, GPT‑5.4 Pro, Claude Opus 4.7, Gemini 3.1 Pro. OSWorld-Verified: 78.7%, 75.0%, -, -, 78.0%, -. MMMU Pro (no tools): 81.2%, 81.2%, -, -, -, 80.5%. MMMU Pro (with tools): 83.2%, 82.1%, -, -, -, -. Tool use. Columns: Eval, GPT-5.5, GPT‑5.4, GPT-5.5 Pro, GPT‑5.4 Pro, Claud...
[12] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
AI Model Benchmarks Apr 2026. 18 benchmarks - the world's most-followed benchmarks, curated by AI Explained, author of SimpleBench. Independently-run benchmarks by Epoch, Scale and others, so may not match self-reported scores by AI orgs. Compare Models Human...
[14] Claude Opus 4.7 - Anthropic (anthropic.com)
Claude Opus 4.7: Hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 1M con...
[16] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Apr 16, 2026, 16 min, by Nicolas Zeeb. Contents: Key observations of reported benchmarks; Coding capabilities; SWE-bench Verified; SWE-bench Pro; Terminal-Bench 2.0; Agentic capabilities; MCP-Atlas (Scaled tool use); Finance Agent v1.1; OSWorld-Verified (Computer...
[17] Claude Opus 4.7 Review: Everything New in 2026 (app.stationx.net)
Benchmark (Opus 4.6, Opus 4.7, change): SWE-Bench Pro 53.4%, 64.3%, +10.9; SWE-Bench Verified 80.8%, 87.6%, +6.8; Graphwalks (multi-hop reasoning) 38.7%, 58.6%, +19.9; OSWorld-Verified (computer use) 72.7%, 78.0%, +5.3; CharXiv (vision...
[19] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ... (help.apiyi.com)
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[27] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Access (gmicloud.ai)
K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch evaluations. Reasoning and Knowledge: K2.6 is competitive with closed-source models on math and science, though GPT-5.4 and...
[29] Kimi K2.6 Tech Blog: Advancing Open-Source Coding (kimi.com)
APEX-Agents 27.9 33.3 33.0 32.0 11.5 OSWorld-Verified 73.1 75.0 72.7 — 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 — 77.8 76.9 73.0 SWE-Bench Verified 80.2 — 80.8 80...
[32] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[34] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[37] deepseek-ai/DeepSeek-V4-Pro - Hugging Face (huggingface.co)
We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T ...
[42] DeepSeek V4 Preview Release (api-docs.deepseek.com)
DeepSeek-V4 Preview Release, 2026/04/24. DeepSeek-V4 Preview is officially live & open-sourced!

tool access และ inference effort เปลี่ยนผลได้ OpenAI system card ระบุว่า GPT‑5.5 Pro ใช้โมเดลพื้นฐานเดียวกับ GPT‑5.5 แต่เป็น setting ที่ใช้ parallel test-time compute ดังนั้นคะแนน GPT‑5.5 Pro ไม่ควรถูกอ่านเหมือนเป็นคะแนนของ GPT‑5.5 ปกติภายใต้ compute budget เดียวกัน ^[3]
public benchmark เหมาะสำหรับ shortlist ไม่ใช่คำตอบสุดท้ายของ procurement เพราะ independent runs อาจไม่ตรงกับ self-reported scores ทีมที่เลือกใช้จริงควรรัน eval ภายในด้วย prompt, tool budget, timeout และเกณฑ์ให้คะแนนเดียวกัน ^[12]

ภาพรวมแต่ละโมเดล

โมเดล	ภาพจำจากเอกสารสาธารณะ	สัญญาณที่เด่น	ข้อควรระวัง
GPT‑5.5	เอกสารเปิดตัวของ OpenAI เน้น computer-use, tool-use และ agentic workflows ^[5]	Terminal‑Bench 2.0 82.7%, OSWorld‑Verified 78.7%, BrowseComp 84.4%; ส่วน GPT‑5.5 Pro ได้ BrowseComp 90.1 ^[5]	อย่าเทียบคะแนน Pro กับ GPT‑5.5 ปกติโดยตรง เพราะ Pro ใช้ parallel test-time compute setting ^[3]
Claude Opus 4.7	Anthropic วางตำแหน่งเป็น hybrid reasoning model สำหรับ coding และ AI agents พร้อม context window 1M ^[14]	SWE‑Bench Verified 87.6% และ SWE‑Bench Pro 64.3% ^[17]	context window 1M มีประโยชน์ แต่ขนาดหน้าต่างไม่เท่ากับ recall quality เสมอ โดยสรุปของ StationX มี caveat เรื่อง recall ที่ปลายสุดของ 1M tokens ^[17]
Kimi K2.6	โมเดล open-source/open-weights จาก Moonshot/Kimi ที่เน้นงาน coding ^[29]^[34]	Terminal‑Bench 2.0 66.7%, SWE‑Bench Pro 58.6%, SWE‑Bench Verified 80.2%, LiveCodeBench v6 89.6 ^[29]	Artificial Analysis ระบุว่า Kimi K2.6 รองรับ image/video input แบบ native และมี max context length 256k; performance จริงยังขึ้นกับการ deploy ^[32]
DeepSeek V4-Pro / Pro-Max	DeepSeek ระบุว่า V4 Preview live และ open-sourced ส่วน Hugging Face card นำเสนอ V4 series เป็น MoE language models ^[37]^[42]	SWE Verified 80.6, SWE Pro 55.4, Terminal Bench 2.0 67.9 และ GPQA Diamond 90.1 ^[37]	ชื่อ DeepSeek V4 มีหลาย variant จึงไม่ควรรวม Flash, Pro และ Pro-Max เป็นคะแนนเดียวกัน ^[37]^[42]

ตาราง benchmark เทียบหัวต่อหัว

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro / Pro-Max	อ่านอย่างไร
Terminal‑Bench 2.0	82.7% ^[5]	69.4% reported ^[16]	66.7% ^[29]	67.9% ^[37]	งาน command-line และ autonomous coding style เห็น lead ของ GPT‑5.5 ชัดที่สุด
SWE‑Bench Pro	58.6% ^[5]	64.3% ^[17]	58.6% ^[29]	55.4% ^[37]	benchmark software engineering ที่ยากขึ้น Claude Opus 4.7 นำ
SWE‑Bench Verified	ไม่พบค่าเปรียบเทียบที่ชัดในชุดแหล่งข้อมูลนี้	87.6% ^[17]	80.2% ^[29]	80.6% ^[37]	งานแนวแก้ issue ใน repo จริง Claude มีสัญญาณ reported ที่แข็งแรงที่สุด
OSWorld‑Verified	78.7% ^[5]	78.0% ^[17]	73.1% ^[29]	ไม่พบค่าที่เทียบได้ชัด	งาน computer-use GPT‑5.5 และ Claude Opus 4.7 อยู่ใกล้กันมาก
BrowseComp	84.4%; GPT‑5.5 Pro 90.1% ^[5]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[34]	ไม่พบค่าที่เทียบได้ชัด	งาน browser-agent และ web-research เห็นสัญญาณแรงจาก GPT‑5.5 Pro และ Kimi Agent Swarm
GPQA Diamond	ไม่พบค่า official ที่เทียบได้ชัดในชุดแหล่งข้อมูลนี้	94.2% ^[19]	90.5% ^[27]	90.1% ^[37]	งาน science reasoning ระดับสูง Claude มีคะแนน reported สูงสุด
HLE / hard reasoning	ไม่พบค่าที่เทียบตรงได้	HLE no-tools 46.9%, with-tools 54.7% ^[16]	HLE-Full 34.7%; with-tools 54.0% ^[29]^[34]	HLE 37.7% ^[37]	เมื่อมี tool ช่วย Claude และ Kimi ใกล้กัน; DeepSeek ต่ำกว่าตามตัวเลขที่ระบุ
Long context	ใน excerpt เอกสารเปิดตัวที่ใช้ ไม่พบ public context spec ที่ชัด	1M context window ^[14]	256k max context length ^[32]	เอกสาร V4 วางตำแหน่งด้าน long-context ^[37]^[42]	Claude และ DeepSeek ถูกวางตำแหน่งด้าน long context ชัดกว่า แต่ต้องทดสอบ recall จริงแยกต่างหาก

เลือกตามงาน: ตัวไหนเหมาะกับอะไร

1. เอเจนต์ที่ต้องใช้เทอร์มินัล เบราว์เซอร์ และเครื่องมือหลายขั้นตอน: GPT‑5.5

2. ซ่อม codebase ระดับ production: Claude Opus 4.7

เหมาะกับ: repo maintenance, code review, complex refactor, developer copilots และ engineering agents

3. coding stack ที่ต้องการ open weights: Kimi K2.6

4. long-context open-source/open-weights experimentation: DeepSeek V4

5. วิทยาศาสตร์และคณิตศาสตร์: Claude นำใน GPQA แต่ภาพรวมยังไม่จบใน benchmark เดียว

checklist ก่อนเลือกใช้จริง

อย่าตัดสินจาก benchmark เดียว ใช้ public benchmark เพื่อคัด shortlist แล้วรัน eval ภายในด้วย prompt, tool budget, timeout และ scoring rubric เดียวกัน เพราะคะแนนที่รันโดยอิสระอาจไม่ตรงกับ self-reported scores ^[12]
แยก GPT‑5.5 และ GPT‑5.5 Pro เป็นคนละ track Pro ใช้ parallel test-time compute setting จึงไม่ควรถือว่าเทียบได้ภายใต้ compute budget เดียวกับ GPT‑5.5 ปกติ ^[3]
กำหนด requirement เรื่อง open weights ก่อน ถ้า data control, self-hosting หรือการปรับแต่ง deployment เป็นข้อบังคับ ควรแยก Kimi K2.6 และ DeepSeek V4 ไว้ใน evaluation lane ของตนเอง ^[29]^[34]^[37]^[42]
long context ต้องทดสอบมากกว่าแค่ดู window size Claude Opus 4.7 มี positioning 1M context, Kimi K2.6 มี max context 256k และ DeepSeek V4 มี positioning ด้าน long-context แต่ recall, instruction following และ cost ต้องทดสอบกับเอกสารจริงของคุณ ^[14]^[17]^[32]^[37]^[42]
งาน coding agents ต้องรันกับ repo จริงของทีม คะแนนแบบ SWE‑Bench มีประโยชน์ แต่ production repo มักมี dependency setup, flaky tests, coding style และ review constraints เฉพาะตัว ^[17]

ข้อจำกัดของการเทียบครั้งนี้

ยังไม่พบ public comparison ที่นำทั้ง 4 โมเดลมาทดสอบโดย independent lab เดียวกัน ใช้ harness เดียวกัน tool access เดียวกัน และ effort setting เดียวกันทั้งหมด LM Council ก็เตือนเรื่องความคลาดเคลื่อนระหว่าง independent benchmark กับ self-reported benchmark ^[12]
GPT‑5.5 Pro ไม่ควรถูกอ่านเหมือน GPT‑5.5 ปกติ เพราะ OpenAI system card ระบุว่า Pro เป็น setting ของโมเดลพื้นฐานเดียวกันที่ใช้ parallel test-time compute ^[3]
คะแนนของ DeepSeek V4 เป็น variant-specific จึงไม่ควรรวม V4 Preview, V4-Pro และ Pro-Max style naming เป็นคะแนนเดียว ^[37]^[42]
สำหรับ Kimi K2.6 และ DeepSeek V4 ที่อยู่ในกลุ่ม open-weights/deployable performance ในโลกจริงอาจขึ้นกับ serving stack, hardware, quantization และ context settings จึงควรทดสอบ deployment ของตนเองคู่กับ benchmark ที่เผยแพร่ ^[29]^[34]^[37]

สรุปท้ายบท

เลือก GPT‑5.5 ถ้างานหลักคือ agentic computer-use, browsing, tool orchestration และ terminal-heavy coding ^[5]

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

ค้นหาและตรวจสอบข้อเท็จจริงด้วย Studio Global AI

ประเด็นสำคัญ

ไม่มีโมเดลที่ชนะทุกงาน: GPT‑5.5 ให้สัญญาณเด่นใน agentic computer use และ tool workflows, Claude Opus 4.7 เด่นงานซ่อมโค้ดระดับ repo, Kimi K2.6 แข็งแรงใน open weights coding และ DeepSeek V4 ควรอยู่ในลิสต์ทดลอง long cont...
ตัวเลขหลัก: GPT‑5.5 ทำ Terminal‑Bench 2.0 ได้ 82.7% และ BrowseComp 84.4%; Claude Opus 4.7 ทำ SWE‑Bench Verified 87.6% และ SWE‑Bench Pro 64.3%; Kimi K2.6 ทำ SWE‑Bench Verified 80.2%; DeepSeek V4 Pro/Pro Max รายงาน SWE...
อย่าใช้ leaderboard อย่างเดียวตัดสินซื้อหรือ deploy: benchmark คนละ lab, tool access และ effort setting ทำให้เทียบตรง ๆ ไม่ได้ ควรทดสอบกับ workload จริงของทีม [12]

คนยังถาม

คำตอบสั้น ๆ สำหรับ "เทียบ benchmark GPT‑5.5 vs Claude Opus 4.7 vs Kimi K2.6 vs DeepSeek V4: โมเดลไหนเหมาะกับงานอะไร" คืออะไร

ประเด็นสำคัญที่ต้องตรวจสอบก่อนคืออะไร?

ฉันควรทำอย่างไรต่อไปในทางปฏิบัติ?

ฉันควรสำรวจหัวข้อที่เกี่ยวข้องใดต่อไป

เปิดหน้าที่เกี่ยวข้อง

ฉันควรเปรียบเทียบสิ่งนี้กับอะไร?

ตรวจสอบคำตอบนี้กับ "DeepSeek V4: 1M context, MoE และเช็กลิสต์ย้าย API สำหรับนักพัฒนา"

เปิดหน้าที่เกี่ยวข้อง

ทำการวิจัยต่อ

การสนทนาการวิจัย

คุณ26 เม.ย. 2026

คำถามวิจัย

Studio Global AI26 เม.ย. 202640 แหล่งที่มา

อ้างถึงคำตอบ

Key findings

GPT‑5.5 ने OpenAI के launch benchmarks में Terminal-Bench 2.0 पर 82.7%, OSWorld-Verified पर 78.7%, BrowseComp पर 84.4%, FrontierMath Tier 1–3 पर 51.7%, और FrontierMath Tier 4 पर 35.4% स्कोर किया; GPT‑5.5 Pro ने BrowseComp पर 90.1% और FrontierMath Tier 4 पर 39.6% दिखाया। ^[1]
Claude Opus 4.7 coding benchmarks में विशेष रूप से मजबूत है: Vellum के Anthropic-reported benchmark breakdown में SWE-Bench Verified 87.6%, SWE-Bench Pro 64.3%, MCP-Atlas 77.3%, OSWorld-Verified 78.0%, और GPQA Diamond 94.2% दिए गए हैं। ^[5]
Kimi K2.6 सबसे मजबूत open-weights coding contenders में है: उसके official Hugging Face model card में SWE-Bench Pro 58.6%, Terminal-Bench 2.0 66.7%, SWE-Bench Verified 80.2%, BrowseComp 83.2%, BrowseComp Agent Swarm 86.3%, और GPQA-Diamond 90.5% दिए गए हैं। ^[6]
DeepSeek V4-Pro official release में 1.6T total / 49B active parameters और 1M context बताता है; DeepSeek-V4-Flash 284B total / 13B active parameters वाला faster/economical variant है। ^[8]^[9]
DeepSeek-V4-Pro-Max ने Hugging Face model card पर LiveCodeBench 93.5, Codeforces rating 3206, GPQA Diamond 90.1, Terminal Bench 2.0 67.9, SWE Verified 80.6, और SWE Pro 55.4 रिपोर्ट किया। ^[9]
उपलब्ध evidence में cross-model comparisons पूरी तरह apples-to-apples नहीं हैं, क्योंकि कई results vendor-reported हैं, effort settings अलग हैं, tools/harness अलग हो सकते हैं, और कुछ competitor scores re-evaluated या self-reported हैं। ^[5]^[6]^[9]

मॉडल प्रोफाइल

मॉडल	स्थिति / रिलीज	मुख्य स्पेक्स	प्राथमिक ताकत
GPT‑5.5	OpenAI ने 23 अप्रैल 2026 को GPT‑5.5 release किया और 24 अप्रैल 2026 update में API availability जोड़ी। ^[1]	Public page में parameter count disclosed नहीं है; GPT‑5.5 Pro same underlying model का parallel test-time compute setting बताया गया है। ^[2]	Agentic coding, computer use, tool use, long-horizon work। ^[1]
Claude Opus 4.7	Anthropic page पर Claude Opus 4.7 announcement 16 अप्रैल 2026 दिखता है। ^[3]	1M context window, 128k max output tokens, adaptive thinking, high-resolution image support। ^[4]	Real-world coding, tool-calling agents, professional knowledge work। ^[3]^[5]
Kimi K2.6	Moonshot AI का open-source native multimodal agentic model। ^[6]	MoE architecture, 1T total parameters, 32B active parameters, 256K context, Modified MIT license। ^[6]	Open-weights coding, agent swarm, multimodal coding-driven design। ^[6]
DeepSeek V4-Pro / Flash	DeepSeek-V4 Preview 24 अप्रैल 2026 को live और open-sourced बताया गया। ^[8]	V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active; दोनों 1M context support करते हैं। ^[8]^[9]	Long-context open-weights reasoning, coding, cost-efficient deployment। ^[8]^[9]

Benchmark तुलना

Benchmark	GPT‑5.5	Claude Opus 4.7	Kimi K2.6	DeepSeek V4-Pro/Pro-Max	पढ़ने का तरीका
Terminal-Bench 2.0	82.7% ^[1]	69.4% ^[1]^[5]	66.7% ^[6]	67.9% ^[9]	GPT‑5.5 इस command-line/agentic coding benchmark में स्पष्ट रूप से आगे दिखता है। ^[1]
SWE-Bench Pro	58.6% ^[1]	64.3% ^[5]	58.6% ^[6]	55.4% ^[9]	Claude Opus 4.7 इस hard software-engineering benchmark पर आगे है। ^[5]
SWE-Bench Verified	उपलब्ध स्रोत में GPT‑5.5 का comparable score नहीं मिला। ^[1]	87.6% ^[5]	80.2% ^[6]	80.6% ^[9]	Claude Opus 4.7 reported results में strongest है। ^[5]
OSWorld-Verified	78.7% ^[1]	78.0% ^[1]^[5]	73.1% ^[6]	Insufficient evidence	GPT‑5.5 और Claude Opus 4.7 computer-use tasks में बहुत करीब हैं। ^[1]^[5]
BrowseComp	84.4%; Pro 90.1% ^[1]	79.3% ^[5]	83.2%; Agent Swarm 86.3% ^[6]	Insufficient evidence	GPT‑5.5 Pro और Kimi Agent Swarm web-research/agentic search में मजबूत दिखते हैं। ^[1]^[6]
GPQA Diamond	उपलब्ध OpenAI launch excerpt में comparable score नहीं मिला। ^[1]	94.2% ^[5]	90.5% ^[6]	90.1% ^[9]	Claude Opus 4.7 science reasoning में reported scores के आधार पर आगे है। ^[5]
HLE / hard reasoning	उपलब्ध OpenAI launch excerpt में comparable HLE score नहीं मिला। ^[1]	HLE no-tools 46.9%, with-tools 54.7% ^[5]	HLE-Full 34.7%, with-tools 54.0% ^[6]	HLE 37.7% ^[9]	Tool-augmented HLE में Claude और Kimi करीब हैं; DeepSeek का listed HLE score lower है। ^[5]^[6]^[9]
Long context	public specs not disclosed in retrieved source	1M context ^[4]	256K context ^[6]	1M context ^[8]^[9]	Long-context deployment में Claude Opus 4.7 और DeepSeek V4 अधिक स्पष्ट रूप से positioned हैं। ^[4]^[8]^[9]

उपयोग-केस के अनुसार निष्कर्ष

अगर आपका workload terminal-heavy autonomous coding, computer-use, tool-driven workflows और general frontier-agent work है, तो GPT‑5.5 सबसे मजबूत candidate दिखता है, खासकर Terminal-Bench 2.0 82.7%, OSWorld-Verified 78.7%, Toolathlon 55.6%, और BrowseComp 84.4% के आधार पर। ^[1]
अगर आपका लक्ष्य GitHub issue resolution, production codebase repair, और SWE-Bench-style software engineering है, तो Claude Opus 4.7 सबसे मजबूत दिखता है, क्योंकि इसका SWE-Bench Verified 87.6% और SWE-Bench Pro 64.3% है। ^[5]
अगर आपको open-weights/self-hostable मॉडल चाहिए और coding + agentic research दोनों महत्वपूर्ण हैं, तो Kimi K2.6 बहुत मजबूत विकल्प है, क्योंकि यह 1T/32B-active MoE model है और SWE-Bench Pro 58.6%, BrowseComp 83.2%, तथा Agent Swarm BrowseComp 86.3% रिपोर्ट करता है। ^[6]
अगर आपको 1M context, open-weights, और cost-efficient deployment चाहिए, तो DeepSeek V4-Pro/Flash रणनीतिक रूप से महत्वपूर्ण है; V4-Pro 1.6T/49B-active है और V4-Flash 284B/13B-active faster/economical variant है। ^[8]^[9]
अगर pure reasoning/math frontier आपका मुख्य लक्ष्य है, तो इस dataset में picture mixed है: Claude Opus 4.7 GPQA Diamond पर 94.2% है, Kimi K2.6 GPQA-Diamond 90.5% और AIME 2026 96.4% देता है, और DeepSeek-V4-Pro-Max GPQA Diamond 90.1%, HMMT 2026 Feb 95.2%, तथा IMOAnswerBench 89.8% दिखाता है। ^[5]^[6]^[9]

Evidence notes

GPT‑5.5 के लिए strongest evidence OpenAI का official launch post और system card है, लेकिन यह vendor-reported data है। ^[1]^[2]
Claude Opus 4.7 के लिए Anthropic official product/docs pages capabilities और specs देते हैं, जबकि benchmark values के लिए Vellum ने Anthropic-reported tables का readable breakdown दिया है। ^[3]^[4]^[5]
Kimi K2.6 के लिए official Hugging Face model card सबसे उपयोगी benchmark source है, क्योंकि उसमें architecture, evaluation settings, और footnotes शामिल हैं। ^[6]
DeepSeek V4 के लिए DeepSeek API Docs release page availability/specs बताता है, और DeepSeek Hugging Face model card detailed evaluation table देता है। ^[8]^[9]
कई benchmarks में “thinking effort,” tools, max generation length, और harness अलग हैं; Kimi K2.6 card खुद बताता है कि कुछ competitor scores publicly available न होने पर re-evaluated और asterisk-marked हैं। ^[6]

Limitations / uncertainty

Insufficient evidence: no complete public benchmark was found that evaluates all four models in the same independent lab, with the same harness, the same tool budget, and the same inference-effort setting. ^[5]^[6]^[9]
GPT‑5.5 and Claude Opus 4.7 are closed models, so parameter count, training data, and exact inference configuration are only partly available for public comparison. ^[1]^[3]
The DeepSeek V4 name covers variants such as Flash, Pro, and Pro-Max/effort mode, so any "DeepSeek V4" benchmark score is variant-specific. ^[8]^[9]
Some official benchmark charts are images or appear only partially in the extracted text, so I included only the numbers that were clearly present in the source text. ^[1]^[8]^[9]

Sources

[3] GPT-5.5 System Card - OpenAI (openai.com)
We generally treat GPT‑5.5’s safety results as strong proxies for GPT‑5.5 Pro, which is the same underlying model using a setting that makes use of parallel test time compute. As noted below, we separately evaluate GPT‑5.5 Pro in certain cases because we ju...
[5] Introducing GPT-5.5 - OpenAI (openai.com)
Computer use and vision. Eval | GPT-5.5 | GPT‑5.4 | GPT-5.5 Pro | GPT‑5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro. OSWorld-Verified: 78.7% | 75.0% | - | - | 78.0% | -. MMMU Pro (no tools): 81.2% | 81.2% | - | - | - | 80.5%. MMMU Pro (with tools): 83.2% | 82.1% | - | - | - | -. Tool use. Eval | GPT-5.5 | GPT‑5.4 | GPT-5.5 Pro | GPT‑5.4 Pro | Claud...
[12] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude 4.5 ... (lmcouncil.ai)
AI Model Benchmarks Apr 2026 18 benchmarks - the world's most-followed benchmarks, curated by AI Explained, author of SimpleBench Independently-run benchmarks by Epoch, Scale and others, so may not match self-reported scores by AI orgs. Compare Models Human...
[14] Claude Opus 4.7 - Anthropic (anthropic.com)
Hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 1M con...
[16] Claude Opus 4.7 Benchmarks Explained - Vellum (vellum.ai)
Apr 16, 2026 • 16 min • By Nicolas Zeeb. Contents: Key observations of reported benchmarks; Coding capabilities; SWE-bench Verified; SWE-bench Pro; Terminal-Bench 2.0; Agentic capabilities; MCP-Atlas (Scaled tool use); Finance Agent v1.1; OSWorld-Verified (Computer...
[17] Claude Opus 4.7 Review: Everything New in 2026 (app.stationx.net)
Benchmark | Opus 4.6 | Opus 4.7 | Change. SWE-Bench Pro: 53.4% | 64.3% | +10.9. SWE-Bench Verified: 80.8% | 87.6% | +6.8. Graphwalks (multi-hop reasoning): 38.7% | 58.6% | +19.9. OSWorld-Verified (computer use): 72.7% | 78.0% | +5.3. CharXiv (vision...
[19] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ... (help.apiyi.com)
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[27] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Access (gmicloud.ai)
K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch evaluations. Reasoning and Knowledge: K2.6 is competitive with closed-source models on math and science, though GPT-5.4 and...
[29] Kimi K2.6 Tech Blog: Advancing Open-Source Coding (kimi.com)
APEX-Agents 27.9 33.3 33.0 32.0 11.5 OSWorld-Verified 73.1 75.0 72.7 — 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 — 77.8 76.9 73.0 SWE-Bench Verified 80.2 — 80.8 80...
[32] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[34] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[37] deepseek-ai/DeepSeek-V4-Pro - Hugging Face (huggingface.co)
We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T ...
[42] DeepSeek V4 Preview Release (api-docs.deepseek.com)
News; DeepSeek-V4 Preview Release 2026/04/24. On this page. DeepSeek V4 Preview Release. DeepSeek-V4 Preview is officially live & open-sourced!