Reports · Published 29 April 2026 · Last edited 6 May 2026 · 12 sources

GPT-5.5 vs Claude Opus 4.7 vs Kimi K2.6 vs DeepSeek V4: Benchmarks Compared

For agents that work heavily inside the terminal, start with GPT-5.5; for software repair, start with Claude Opus 4.7; for open-weight deployment, test Kimi K2.6; and DeepSeek V4 Pro Max is worth trying when cost is sensitive [1][18][24]. Do not conflate GPT-5.5 Pro with base GPT-5.5: in the rows where it appears separately, it leads BrowseComp at 90.1%...


Benchmark tables can look like a single race: the highest number means the best model. For this comparison in particular, that reading is incomplete. The closest shared comparison in the sources covers GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, while the Kimi K2.6 figures come from separate sources: Kimi launch coverage, the model card, and leaderboards ^[1]^[6]^[24]. So the practical question is not which model is the absolute winner, but which model to put first in your internal testing.

An important naming note: when discussing DeepSeek V4, this article uses the name DeepSeek-V4-Pro-Max because it is the variant for which benchmark and cost rows appear in the available sources ^[18]^[24]. Likewise, the article does not merge GPT-5.5 Pro figures with base GPT-5.5 where the sources separate them ^[24].

Quick takeaways by type of work

Coding agents in the terminal and on the command line: GPT-5.5 has the strongest reported Terminal-Bench 2.0 result in the shared comparison, at 82.7% ^[24].
Software repair and repository tasks: Claude Opus 4.7 leads the reported SWE-Bench Pro row at 64.3% and the reported SWE-Bench Verified row at 87.6% ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads on GPQA Diamond and on Humanity’s Last Exam without tools in the shared comparison ^[24].
Reasoning with tools and browsing: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1% in the rows where this variant appears separately ^[24].
Open-weight deployment: Kimi K2.6 is the clearest candidate in the available sources; it is described as an open-weight mixture-of-experts (MoE) model with 1T total parameters, 32B active parameters, and a 256K context window ^[1].
Cost-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value option worth verifying; LLM Stats lists it with a 1M context, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

A dash (—) means the score is absent from the cited sources for that model, not that the model scored zero. The rows for GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max come mostly from a single shared comparison, whereas the Kimi K2.6 figures come from separate Kimi sources ^[1]^[6]^[24].

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
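The dash convention matters when the table is processed programmatically. The following is a minimal Python sketch, not part of the cited sources, that stores each dash as `None` so that a missing score is never treated as zero, then reports the leader per row. The names `MODELS`, `SCORES`, and `leader` are illustrative, and the Kimi K2.6 GPQA Diamond cell is entered as 91.0 even though the table gives it only approximately.

```python
from typing import Optional

# Column order matches the comparison table.
MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, copied from the table. None marks a dash (score not reported).
# Assumption: the approximate Kimi K2.6 GPQA Diamond figure is entered as 91.0.
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],
    "HLE, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leader(row: list[Optional[float]]) -> tuple[str, float, int]:
    """Return (model, score, number of reported scores), skipping missing cells."""
    present = [(score, model) for model, score in zip(MODELS, row) if score is not None]
    best_score, best_model = max(present)
    return best_model, best_score, len(present)


LEADERS = {bench: leader(row) for bench, row in SCORES.items()}

for bench, (model, score, reported) in LEADERS.items():
    print(f"{bench}: {model} at {score}% ({reported} of {len(MODELS)} models reported)")
```

Printing how many models reported a score keeps the coverage gap visible: a row led by one model out of three reported scores is weaker evidence than a row where all five are present.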

Which model should you test first?

Priority	Start with	Practical reason
Coding agents that work through the terminal	GPT-5.5	It has the highest Terminal-Bench 2.0 result in the shared comparison, at 82.7% ^[24].
Software repair	Claude Opus 4.7	It leads the reported SWE-Bench Pro and SWE-Bench Verified rows among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	It leads on GPQA Diamond and on Humanity’s Last Exam without tools in the shared comparison ^[24].
Reasoning with tools or browsing	GPT-5.5 Pro	It leads Humanity’s Last Exam with tools and BrowseComp where the Pro variant appears separately ^[24].
Open-weight deployment	Kimi K2.6	It is described as an open-weight MoE model with 1T total parameters, and its Hugging Face card shows strong rows on coding benchmarks ^[1]^[6].
Lower-cost hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists it with a 1M context, 80.6% on SWE-Bench Verified, and cost columns lower than the Claude Opus 4.7 row on the same leaderboard ^[18].
Long-context needs	GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max	The sources list a 1M context for these three models, while Kimi K2.6 appears at around 256K to 262K ^[1]^[11]^[16]^[18]^[27].

Notes on each model

GPT-5.5

OpenAI describes GPT-5.5 as built for complex tasks such as coding, research, and data analysis ^[38]. In the shared comparison published by VentureBeat, GPT-5.5 scores 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[24]. It also scores 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro, and 84.4% on BrowseComp in the same table ^[24].

The important caveat is that GPT-5.5 Pro is a different comparison point. In the same table, the Pro variant reaches 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but these results should not be merged with base GPT-5.5 when comparing cost, latency, or model settings ^[24].

For procurement and planning, BenchLM lists a 1M context window for GPT-5.5, while a single pricing report gives $5 per million input tokens and $30 per million output tokens ^[27]^[36]. Treat these figures as an initial signal, not a final quote.

Claude Opus 4.7

The strongest signals for Claude Opus 4.7 in this group come from software-repair benchmarks. LLM Stats lists it at 87.6% on SWE-Bench Verified, while the shared comparison shows 64.3% on SWE-Bench Pro ^[18]^[24]. It also leads GPQA Diamond at 94.2%, Humanity’s Last Exam without tools at 46.9%, and MCP Atlas at 79.1% in the shared comparison ^[24].

LLM Stats lists a 1M context window and pricing of $5/$25 per million tokens for Claude Opus 4.7 ^[16]. Comparability needs care, however: Anthropic notes that some benchmark results used internal implementations or updated harness parameters, and that some scores are not directly comparable with public leaderboards ^[17].

Kimi K2.6

If open weights are your primary requirement, Kimi K2.6 is the most prominent candidate in the cited material. Launch coverage describes it as an open-weight MoE model with 1T total parameters, 32B active parameters, 384 experts, native multimodality, INT4 quantization, and a 256K context ^[1]. Its Hugging Face card shows 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, and 89.6 on LiveCodeBench v6 ^[6].

Launch coverage also reports 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp for Kimi K2.6 ^[1]. LLM Stats lists it with a 262K context, price columns of $0.95/$4.00, and an Open Source classification ^[11]. Small differences should not be over-interpreted, though, because the Kimi figures do not come from the same shared table that covers GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[1]^[6]^[24].

DeepSeek-V4-Pro-Max

DeepSeek-V4-Pro-Max appears as a value candidate rather than the overall benchmark winner. LLM Stats lists it at a size of 1.6T, with a 1M context, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18]. In the shared comparison it scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools and 48.2% with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas ^[24].

These figures make it worth testing when the budget is tight. But the same table puts GPT-5.5, GPT-5.5 Pro, or Claude Opus 4.7 in the lead on most of the listed rows, so do not treat DeepSeek as a drop-in replacement for the higher-cost models before testing it on your actual tasks ^[24].

Context and cost signals

Model prices and context windows do not always come from the same source or the same provider. Use the following table as a planning signal, not as a final purchase quote.

Model	Context and cost signal in the sources	Practical reading
GPT-5.5	BenchLM lists a 1M context; a single pricing report gives $5 input and $30 output per million tokens ^[27]^[36].	A top-tier hosted option; check the live price before budgeting.
Claude Opus 4.7	LLM Stats lists a 1M context and pricing of $5/$25 per million tokens ^[16].	A strong option for coding, reasoning, and long-context tasks.
Kimi K2.6	Launch coverage lists a 256K context; LLM Stats lists 262K and price columns of $0.95/$4.00 ^[1]^[11].	A strong open-weight candidate; compare it according to the runtime environment you will use.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a size of 1.6T, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].	A strong value candidate if it holds its quality on your workload.
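To turn the price columns into a rough per-request figure, a small calculation like the one below may help. It is a sketch under two stated assumptions: each slash-separated pair is read as input price / output price in USD per million tokens (the sources state this explicitly for GPT-5.5 and Claude Opus 4.7 only), and the workload of 20,000 input tokens and 2,000 output tokens per request is hypothetical. `PRICES` and `request_cost` are illustrative names.

```python
# (input, output) prices in USD per million tokens, copied from the table.
# Assumption: for Kimi K2.6 and DeepSeek-V4-Pro-Max the slash-separated pair
# is also input/output, as it is for the other two models.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one request at the listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 20,000 input tokens and 2,000 output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f} per request")
```

The estimate ignores caching discounts, reasoning-effort settings, and provider differences, so it is only a first-pass filter before checking live pricing.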

Why do the rankings differ?

Because each row measures a different skill. GPQA Diamond and Humanity’s Last Exam focus on hard reasoning, Terminal-Bench 2.0 and the SWE-Bench variants measure coding and agentic software work, and BrowseComp measures performance closer to search and browsing in the shared comparison ^[24]. It is therefore natural for a model to lead in one row and trail in another.

Even a benchmark with the same name can give different results depending on how it is run. For example, LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, while LMCouncil puts it at 83.5% ± 1.7 under its own setup ^[18]^[30]. Anthropic also says that some of its results used internal implementations or updated harness parameters, which limits direct comparison with public leaderboards ^[17].

For this reason, a difference of one or two points should not decide a production deployment. Public benchmarks are good for narrowing the shortlist; the final decision should come from your own testing.
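One rough way to apply this caution is to compare each gap between models against the spread already observed for a single model across two harnesses. The Python sketch below uses the 87.6% and 83.5% SWE-Bench Verified figures for Claude Opus 4.7 as that noise floor. Treating this spread as a threshold is an illustrative assumption, not a statistical test and not a method from the cited sources; `HARNESS_SPREAD` and `gap_is_decisive` are hypothetical names.

```python
# Spread between two harnesses for the same model and benchmark name:
# Claude Opus 4.7 on SWE-Bench Verified, 87.6% (LLM Stats) vs 83.5% (LMCouncil).
HARNESS_SPREAD = round(87.6 - 83.5, 1)  # 4.1 points


def gap_is_decisive(score_a: float, score_b: float, noise: float = HARNESS_SPREAD) -> bool:
    """True when the gap between two scores exceeds the assumed noise floor."""
    return round(abs(score_a - score_b), 1) > noise


# Gaps taken from the comparison table, in percentage points.
print(gap_is_decisive(82.7, 69.4))  # Terminal-Bench 2.0: GPT-5.5 vs Claude Opus 4.7
print(gap_is_decisive(80.6, 80.2))  # SWE-Bench Verified: DeepSeek-V4-Pro-Max vs Kimi K2.6
print(gap_is_decisive(94.2, 93.6))  # GPQA Diamond: Claude Opus 4.7 vs GPT-5.5
```

Under this assumption only the Terminal-Bench 2.0 gap stands out; the other two pairs are closer together than the same model is to itself across harnesses.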

How do you test the shortlist?

Before committing to a single model, test the top two or three candidates on tasks that resemble your real work.

Use realistic prompts, files, and repositories. Benchmark prompts do not always represent your codebase, your documents, your policies, or your users' behavior.
Standardize the tool environment. Coding-agent results can change when the model has access to a terminal, browsing, retrieval, repository context, or internal APIs.
Measure cost and latency under the same settings. Pro modes and higher effort levels can change quality, token consumption, and duration.
Review failures manually. For coding tasks, inspect the tests, the code diffs, maintainability, vulnerabilities, and hallucinated dependencies.
Include a lower-cost competitor in the test. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max deserve a place on the shortlist ^[1]^[18].
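The checklist above can be sketched as a very small evaluation loop: the same tasks go to every candidate, and pass count and wall-clock time are recorded under identical conditions. Everything in this Python sketch is hypothetical, including the `Task`, `Result`, and `evaluate` names and the two stub functions that stand in for real model clients; a real harness would call each provider's API and add token accounting.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer is acceptable


@dataclass
class Result:
    model: str
    passed: int
    total: int
    seconds: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def evaluate(name: str, ask: Callable[[str], str], tasks: list[Task]) -> Result:
    """Run every task through one model and record pass count and elapsed time."""
    start = time.perf_counter()
    passed = sum(task.check(ask(task.prompt)) for task in tasks)
    return Result(name, passed, len(tasks), time.perf_counter() - start)


# Placeholder tasks; replace with prompts drawn from your own repositories.
TASKS = [
    Task("Return the word OK.", lambda out: out.strip() == "OK"),
    Task("What is 2 + 2?", lambda out: "4" in out),
]


# Stubs standing in for real model clients (hypothetical behavior).
def stub_model_a(prompt: str) -> str:
    return "OK" if "OK" in prompt else "4"


def stub_model_b(prompt: str) -> str:
    return "OK"


RESULTS = [
    evaluate("model-a", stub_model_a, TASKS),
    evaluate("model-b", stub_model_b, TASKS),
]

for r in RESULTS:
    print(f"{r.model}: {r.passed}/{r.total} passed in {r.seconds:.3f}s")
```

Keeping the model behind a plain `Callable[[str], str]` means the same task list and checks can be reused for each candidate, which is the point of standardizing the environment. Failed tasks should still be reviewed by hand, since a pass/fail check does not capture maintainability or hallucinated dependencies.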

Final takeaway

If you want a top-tier shortlist, test GPT-5.5 and Claude Opus 4.7 side by side: the first has the strongest reported Terminal-Bench 2.0 result, and the second has the strongest reported SWE-Bench Pro and SWE-Bench Verified results ^[18]^[24]. If open weights are your requirement, start with Kimi K2.6 ^[1]^[6]. If cost is the main constraint, include DeepSeek-V4-Pro-Max in the test, but do not treat it as a direct substitute for the premium models before measuring it on your own tasks ^[18]^[24].


Key takeaways

For agents that work heavily inside the terminal, start with GPT-5.5; for software repair, start with Claude Opus 4.7; for open-weight deployment, test Kimi K2.6; and DeepSeek V4 Pro Max is worth trying when cost is sensitive [1][18][24].
Do not conflate GPT-5.5 Pro with base GPT-5.5: in the rows where it appears separately, it leads BrowseComp at 90.1% and Humanity’s Last Exam with tools at 57.2% [24].
Kimi K2.6 is described as an open-weight MoE model with 1T total parameters and 32B active parameters, while LLM Stats lists DeepSeek V4 Pro Max with a 1M context and a cost of $1.74/$3.48 [1][18].


Research conversation

You · 25 April 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI · 25 April 2026 · 40 sources

Cited answer

I found public benchmark data for all four, but not one fully independent, apples-to-apples leaderboard covering GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 under the same harness. The practical read is: GPT-5.5 looks strongest for broad agentic coding/computer-use tasks, Claude Opus 4.7 is extremely strong on coding and hard reasoning, Kimi K2.6 is the strongest open-weight option in this group, and DeepSeek V4 looks like the best price/performance play if its reported results hold up across independent testing ^[4]^[5]^[6]^[13].

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

If you want the safest high-end general choice, start with GPT-5.5. If your priority is top-tier coding reliability and long-context enterprise work, test Claude Opus 4.7. If you need open weights and strong frontier-adjacent performance, Kimi K2.6 is the most compelling open option. If cost is the constraint and you can tolerate some validation risk, DeepSeek V4 is the one to benchmark first.

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Moonshot AI Kimi K2.6 NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ...
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ...

For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.
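To make the point about small score differences concrete, here is a minimal sketch of how wide the uncertainty is on an eval set of this size. It uses the standard Wilson score interval; the model names and pass counts are hypothetical and do not come from any cited source.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical pass counts on a 50-task internal eval set (illustrative only).
results = {"model_a": 41, "model_b": 38}
for name, passes in results.items():
    low, high = wilson_interval(passes, 50)
    print(f"{name}: {passes}/50 passed, approx. 95% interval {low:.2f} to {high:.2f}")
```

With 50 tasks, the two intervals in this example overlap, so a three-task gap on its own would not separate the models. Adding tasks or repeating runs narrows the interval.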

Summary

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
LLM Stats Logo Make AI phone calls with one API call Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago


Quick summary by type of work

Terminal and command-line coding agents: GPT-5.5 has the strongest Terminal-Bench 2.0 result in the shared comparison, at 82.7% ^[24].
Software repair and repository tasks: Claude Opus 4.7 leads the reported SWE-Bench Pro row at 64.3% and the reported SWE-Bench Verified row at 87.6% ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads on GPQA Diamond (94.2%) and on Humanity’s Last Exam without tools (46.9%) in the shared comparison ^[24].
Reasoning with tools and browsing: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1%, in the rows where this variant is listed separately ^[24].
Open-weight deployment: Kimi K2.6 is the clearest candidate in the available sources; it is described as an open-weight 1T-parameter mixture-of-experts (MoE) model with 32B active parameters and a 256K context window ^[1].
Cost-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value option worth verifying; LLM Stats lists it with a 1M context, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
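The table above can be read row by row rather than as one overall ranking. The following sketch copies the table's values into a small structure and reports the leader for each benchmark, along with how many models have a reported score in that row. The function name `row_leader` is illustrative, and the Kimi K2.6 GPQA Diamond cell, given only approximately in the table, is entered as 91.0 as an assumption.

```python
from typing import Optional

MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, copied from the table above; None marks an empty cell.
# Assumption: the approximate Kimi K2.6 GPQA Diamond cell is entered as 91.0.
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],
    "Humanity's Last Exam, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "Humanity's Last Exam, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas / MCPAtlas Public": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def row_leader(row: list[Optional[float]]) -> tuple[list[str], float, int]:
    """Return (leading models, best score, number of models with a reported score)."""
    reported = [(score, model) for score, model in zip(row, MODELS) if score is not None]
    best = max(score for score, _ in reported)
    leaders = [model for score, model in reported if score == best]
    return leaders, best, len(reported)


for benchmark, row in SCORES.items():
    leaders, best, count = row_leader(row)
    print(f"{benchmark}: {', '.join(leaders)} at {best}% ({count} of {len(MODELS)} models reported)")
```

The reported-count column matters as much as the leader: a row with only three reported scores says nothing about the models that are missing from it.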

Which model should you test first?

Priority	Start with	Practical reason
Terminal-based coding agents	GPT-5.5	Has the highest Terminal-Bench 2.0 result in the shared comparison, at 82.7% ^[24].
Software repair	Claude Opus 4.7	Leads the reported SWE-Bench Pro and SWE-Bench Verified rows among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads on GPQA Diamond and on Humanity’s Last Exam without tools in the shared comparison ^[24].
Reasoning with tools or browsing	GPT-5.5 Pro	Leads Humanity’s Last Exam with tools and BrowseComp where the Pro variant is listed separately ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight 1T-parameter MoE model, and its Hugging Face model card shows strong rows on coding benchmarks ^[1]^[6].
Lower-cost hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists it with a 1M context, 80.6% on SWE-Bench Verified, and cost columns lower than the Claude Opus 4.7 row on the same leaderboard ^[18].
Long-context needs	GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max	The sources list a 1M context for these three models, while Kimi K2.6 appears at around 256K to 262K ^[1]^[11]^[16]^[18]^[27].

Notes on each model

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

Context and cost signals

Model	Context and cost signal in the sources	Practical reading
GPT-5.5	BenchLM lists a 1M context; a single pricing report gives $5 input and $30 output per million tokens ^[27]^[36].	A top-tier hosted option; check live pricing before budgeting.
Claude Opus 4.7	LLM Stats lists a 1M context and pricing of $5/$25 per million tokens ^[16].	A strong option for coding, reasoning, and long-context tasks.
Kimi K2.6	Launch coverage lists a 256K context; LLM Stats lists 262K and price columns of $0.95/$4.00 ^[1]^[11].	A strong open-weight candidate; compare it in the runtime environment you will actually use.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a 1.6T size, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].	A strong value candidate if it holds quality on your workload.
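The price pairs in the table only become comparable once they are applied to a token mix. Here is a rough sketch of that arithmetic. It assumes each pair is an input/output price in USD per million tokens, which the sources state explicitly only for GPT-5.5 and Claude Opus 4.7; the workload size is hypothetical.

```python
# (input, output) prices in USD per million tokens, taken from the table above.
# Assumption: the Kimi K2.6 and DeepSeek-V4-Pro-Max pairs follow the same
# input/output convention as the other two rows.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 2M input tokens and 0.5M output tokens.
for model in PRICES:
    cost = workload_cost(model, 2_000_000, 500_000)
    print(f"{model}: ${cost:.2f}")
```

Because input and output prices differ by model, the ordering of the two lower-cost rows depends on the input/output mix, so it is worth plugging in your own token counts rather than comparing the headline prices.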

Why do the rankings differ?

How to test the shortlist

Before committing to one model, test your top two or three candidates on tasks that resemble your real work.

Use realistic prompts, files, and repositories. Benchmark prompts do not always represent your codebase, documents, policies, or user behavior.
Standardize the tool environment. Coding-agent results can change when the model has access to a terminal, browsing, retrieval, repository context, or internal APIs.
Measure cost and latency under the same settings. Pro modes and higher effort levels can change quality, token consumption, and run time.
Review failures manually. For coding tasks, inspect the tests, code diffs, maintainability, vulnerabilities, and hallucinated dependencies.
Include a lower-cost challenger in the test. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max deserve a place on the shortlist ^[1]^[18].
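The checklist above could be sketched as a small harness that runs every candidate on the same tasks and records pass rate, wall-clock time, token usage, and the failures to review by hand. Everything here is hypothetical: `Task`, `Reply`, `Report`, `run_eval`, and the stub `echo_model` are illustrative names, and a real setup would replace the stub with a call to each provider's client under identical tool and effort settings.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


@dataclass
class Reply:
    text: str
    input_tokens: int
    output_tokens: int


@dataclass
class Report:
    passed: int = 0
    total: int = 0
    seconds: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    failures: list[str] = field(default_factory=list)


def run_eval(model: Callable[[str], Reply], tasks: list[Task]) -> Report:
    """Run one model over all tasks, accumulating quality, latency, and token usage."""
    report = Report()
    for task in tasks:
        start = time.perf_counter()
        reply = model(task.prompt)
        report.seconds += time.perf_counter() - start
        report.total += 1
        report.input_tokens += reply.input_tokens
        report.output_tokens += reply.output_tokens
        if task.check(reply.text):
            report.passed += 1
        else:
            report.failures.append(task.prompt)  # kept for manual review
    return report


def echo_model(prompt: str) -> Reply:
    """Stub standing in for a real model client; it just upper-cases the prompt."""
    words = len(prompt.split())
    return Reply(text=prompt.upper(), input_tokens=words, output_tokens=words)


tasks = [
    Task("fix the failing test", lambda out: "FIX" in out),
    Task("rename the module", lambda out: "done" in out),
]
report = run_eval(echo_model, tasks)
print(f"passed {report.passed}/{report.total}, failures to review: {report.failures}")
```

Keeping the failed prompts in the report, rather than only a score, supports the manual review step, and the token totals can be priced afterwards at each provider's current rates.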

Final takeaway


Key takeaways

For agents that work heavily in the terminal, start with GPT-5.5; for software repair, start with Claude Opus 4.7; for open-weight deployment, test Kimi K2.6; and DeepSeek-V4-Pro-Max is worth trying when cost is a concern ^[1]^[18]^[24].
Do not conflate GPT-5.5 Pro with base GPT-5.5: in the rows where it is listed separately, it leads BrowseComp at 90.1% and Humanity’s Last Exam with tools at 57.2% ^[24].
Kimi K2.6 is described as an open-weight 1T-parameter MoE model with 32B active parameters, while LLM Stats lists DeepSeek-V4-Pro-Max with a 1M context and cost columns of $1.74/$3.48 ^[1]^[18].

