Answer | Published 28 April 2026 | Last edited 6 May 2026 | 7 sources

Claude Mythos Benchmarks: 93.9% on SWE-bench and the Real Limits of the Score

The most discussed reported score for Claude Mythos Preview is 93.9% on SWE-bench Verified, but Anthropic describes it as an invitation-only research preview under Project Glasswing, not a self-serve model.[6][13] Other reported scores include 87.3% on SWE-bench Multilingual, 59.0% on an internal multimodal evaluation, 0.83 on CyberGym and...


[Image: abstract AI benchmark graphic for Claude Mythos Preview with code and score elements. Caption: symbolic image; the debate around Claude Mythos turns less on a single score than on the comparability of evaluation conditions.]

It would be hasty to treat Claude Mythos Preview as just another Claude model. Anthropic's Claude API docs describe it as a separate research-preview model for defensive cybersecurity workflows under Project Glasswing; access is invitation-only and no self-serve sign-up is available.^[13] Its benchmark scores therefore draw attention, but they should not be read like the public leaderboard score of an ordinary, generally available product model.

The concrete figures below come mainly from third-party reports (W&B, R&D World, llm-stats and Authmind), which cite Anthropic data, system-card evaluations or their own summaries.^[6]^[9]^[25]^[27]

Key benchmark scores

Area	Benchmark	Reported score for Claude Mythos Preview	Context
Coding	SWE-bench Verified	93.9%^[6]	W&B reported 80.8% for Claude Opus 4.6.^[6]
Multilingual coding	SWE-bench Multilingual	87.3%^[6]	W&B reported 77.8% for Opus 4.6.^[6]
Multimodal	Internal multimodal evaluation	59.0%^[6]	W&B reported 27.1% for Opus 4.6.^[6]
Cybersecurity	Cybench	pass@1 = 1.00^[27]	According to Authmind, Cybench is a public benchmark of 40 CTF challenges; Opus 4.6 reportedly scored 0.89.^[27]
Cybersecurity	CyberGym	0.83^[27]	Authmind describes CyberGym as an evaluation of targeted vulnerability reproduction on 1,507 real open-source tasks; Opus 4.6 reportedly scored 0.67.^[27]
Reasoning	GPQA Diamond	94.6%^[25]	llm-stats reported 91.3% for Opus 4.6.^[25]
Reasoning	Humanity’s Last Exam	56.8% without tools / 64.7% with tools^[25]	llm-stats reported 40.0% and 53.1% for Opus 4.6.^[25]
Terminal agents	Terminal-Bench configuration	92.1%^[25]	llm-stats ties this score to the Terminus-2 harness, maximum adaptive thinking, a 1M-token budget per task, 4-hour timeouts and Terminal-Bench 2.1 updates.^[25]
Multitask knowledge	MMMLU	92.7^[9]	R&D World wrote that this overlaps with Gemini 3.1 Pro's 92.6–93.6 range and was the lone outlier in the 17-of-18 leadership claim.^[9]
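
To make the size of the reported gaps easier to compare, the percent-scale rows of the table can be reduced to percentage-point differences. The following is a minimal sketch; the names `REPORTED_SCORES` and `point_gaps` are illustrative, and the numbers are simply the third-party figures quoted in the table, not independently verified results.

```python
# Reported scores in percent, copied from the table above:
# (Claude Mythos Preview, Claude Opus 4.6). Third-party figures, not verified.
REPORTED_SCORES = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Multilingual": (87.3, 77.8),
    "Internal multimodal evaluation": (59.0, 27.1),
    "GPQA Diamond": (94.6, 91.3),
    "Humanity's Last Exam (no tools)": (56.8, 40.0),
    "Humanity's Last Exam (with tools)": (64.7, 53.1),
}


def point_gaps(scores):
    """Return (benchmark, gap) pairs, largest percentage-point gap first."""
    gaps = {name: round(new - old, 1) for name, (new, old) in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, gap in point_gaps(REPORTED_SCORES):
        print(f"{name}: +{gap:.1f} points")
```

Cybench and CyberGym are left out because they are reported on a 0 to 1 scale; mixing them with percent-scale rows would make the ordering misleading.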

What is officially clear

The strongest official basis is the model's status: according to Anthropic, Claude Mythos Preview is a separate research-preview model for defensive cybersecurity workflows under Project Glasswing, with invitation-only access and no self-serve sign-up.^[13]

Anthropic's Project Glasswing page describes Mythos Preview as a general-purpose frontier model and as its most capable model for coding and agentic tasks.^[16] On the same page, Anthropic ties cybersecurity strength to broader software capability: a model that can deeply understand and modify complex software can also be useful for finding and fixing vulnerabilities.^[16]

The System Card likewise describes Claude Mythos Preview as Anthropic's new large language model, that is, a frontier AI model, with capabilities reported in areas such as software engineering, reasoning, computer use, knowledge work and research assistance.^[18]

In other words, the official sources support the model's positioning and capability areas, but many of the exact scores given above appear in this source set mainly through third-party reports.^[6]^[9]^[25]^[27]

Why 93.9% on SWE-bench makes the headline

The most eye-catching single number is 93.9% on SWE-bench Verified. W&B reported this score for Claude Mythos Preview and, in the same context, gave a comparison score of 80.8% for Claude Opus 4.6.^[6] For software teams this is a natural headline, because coding-oriented AI agents are often judged on exactly this kind of repair and engineering-style task.

The reported number on SWE-bench Multilingual is also high: W&B gave 87.3% for Mythos Preview and 77.8% for Opus 4.6.^[6] This suggests the strength is not confined to a single-language coding setup.

Even so, 93.9% does not mean the model will perform the same way on every repository, every toolchain and every review process. In Claude Mythos's case there is a further practical problem: according to Anthropic, external teams cannot test it through the normal self-serve route.^[13]

Cybersecurity numbers are strong, but the context is specific

On the cybersecurity side the numbers are even more striking. Authmind reported a perfect pass@1 = 1.00 on Cybench for Claude Mythos Preview, and describes Cybench as a public benchmark of 40 CTF challenges.^[27] CTF stands for Capture the Flag: security-competition-style tasks that require understanding weaknesses in systems or code.
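
For readers unfamiliar with the metric, pass@1 is conventionally the probability that a single attempt solves a task, averaged over tasks. A minimal sketch of the widely used unbiased pass@k estimator follows. The cited reports do not say how many attempts were run per challenge, so the attempt counts and the function names `pass_at_k` and `benchmark_pass_at_k` are assumptions made purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts has a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: 40 challenges, 10 attempts each, every attempt correct.
perfect = [(10, 10)] * 40
# Hypothetical data: one of the 40 challenges solved in only 5 of 10 attempts.
one_flaky = [(10, 10)] * 39 + [(10, 5)]

if __name__ == "__main__":
    print(benchmark_pass_at_k(perfect))
    print(benchmark_pass_at_k(one_flaky))
```

Under this definition, a score of exactly 1.00 requires every sampled attempt on every challenge to succeed; a single unreliable challenge is enough to pull the average below 1.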

On CyberGym, Authmind reported a score of 0.83 for Mythos Preview and described the benchmark as an evaluation of targeted vulnerability reproduction on 1,507 real open-source software tasks.^[27]

These scores are consistent with Anthropic's official positioning: the API docs call Mythos Preview a research preview for defensive cybersecurity workflows under Project Glasswing.^[13] The Project Glasswing page likewise links cybersecurity performance to the broader ability to understand and change complex software and to find and fix vulnerabilities.^[16]

But it is important to keep the task format of each benchmark in mind. CTF challenges and vulnerability reproduction are very useful signals, but they cannot substitute for an organization's own security policies, tool restrictions, audit requirements and accountability framework.

How to read the reasoning, multimodal and terminal-agent scores

The reported reasoning scores are strong as well. llm-stats gave 94.6% on GPQA Diamond and, on Humanity’s Last Exam (HLE), 56.8% without tools and 64.7% with tools.^[25] Reporting HLE separately with and without tools matters, because tool access can change benchmark comparisons considerably.

For Terminal-Bench, the configuration is as important as the score itself. llm-stats reported 92.1%, but noted that the setup included the Terminus-2 harness, maximum adaptive thinking, a 1M-token budget per task, extended 4-hour timeouts and Terminal-Bench 2.1 updates.^[25] In agent benchmarks, time, context window, tools and token budget are not minor technical details; they can materially affect the result.
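
One way to keep such conditions attached to a score is to record them next to it and diff them before comparing two results. The sketch below is hypothetical throughout: `EvalConfig`, its field names and the second configuration are invented for illustration, and only the values in `reported` are taken from the llm-stats description above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the conditions behind one benchmark score."""
    harness: str
    thinking: str
    token_budget_per_task: int
    timeout_hours: float
    benchmark_version: str


def config_differences(a: EvalConfig, b: EvalConfig) -> dict:
    """Return {field: (a_value, b_value)} for every setting that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Settings that llm-stats associates with the reported 92.1% score.
reported = EvalConfig(
    "Terminus-2", "adaptive, maximum", 1_000_000, 4.0, "Terminal-Bench 2.1 updates"
)
# Entirely hypothetical second run, used only to illustrate the comparison.
other = EvalConfig(
    "Terminus-2", "adaptive, maximum", 200_000, 1.0, "Terminal-Bench 2.1 updates"
)

if __name__ == "__main__":
    for name, (left, right) in config_differences(reported, other).items():
        print(f"{name}: {left} vs {right}")
```

If the difference dictionary is non-empty, the two scores were produced under different conditions and a direct comparison needs a caveat.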

The multimodal score also needs careful reading. W&B gave 59.0% for Mythos Preview and 27.1% for Opus 4.6 on an internal multimodal evaluation.^[6] llm-stats also notes that SWE-bench Multimodal uses an internal implementation and that its scores are not directly comparable to public leaderboard results.^[25]

Why these scores are not like a normal leaderboard

Four reasons matter most:

Limited access: According to Anthropic, Claude Mythos Preview is an invitation-only research-preview model with no self-serve sign-up.^[13] This makes independent reproduction difficult for ordinary developer teams.
Mixed sources: In this source set, official sources clarify the model's status, positioning and capability areas.^[13]^[16]^[18] Many of the exact scores come from third-party reports.^[6]^[9]^[25]^[27]
Internal or special configurations: The multimodal score was reported as an internal evaluation.^[6] The Terminal-Bench score is tied to a specific harness, maximum thinking, a large token budget and long timeouts.^[25]
Task-specific meaning: According to Authmind, Cybench is based on 40 CTF challenges, while CyberGym evaluates vulnerability reproduction on 1,507 real open-source tasks.^[27] These are important, but they are limited task classes.

Conclusion

On the reported benchmarks, Claude Mythos Preview looks exceptionally strong: 93.9% on SWE-bench Verified, 87.3% on SWE-bench Multilingual, 59.0% on an internal multimodal evaluation, 0.83 on CyberGym and pass@1 = 1.00 on Cybench.^[6]^[27]

But the main takeaway is not just how high the scores are. Anthropic describes it as an invitation-only research preview for Project Glasswing, not an ordinary self-serve product model.^[13] These numbers should therefore be read as a strong capability signal in coding, agentic workflows and defensive cybersecurity, not as a fully public, independently reproducible leaderboard ranking.


Key takeaways

The most discussed reported score for Claude Mythos Preview is 93.9% on SWE-bench Verified, but Anthropic describes it as an invitation-only research preview under Project Glasswing, not a self-serve model.[6][13]
Other reported scores include 87.3% on SWE-bench Multilingual, 59.0% on an internal multimodal evaluation, 0.83 on CyberGym and pass@1 = 1.00 on Cybench.[6][27]
These figures are a strong signal in coding, agentic tasks and defensive cybersecurity, but because of third-party reporting, internal configurations and limited access they should not be read like an ordinary public leaderboard...


Research conversation

You, 25 April 2026

Research question

What are Claude Mythos's benchmarks?

Studio Global AI, 25 April 2026, 30 sources

Cited answer

According to the available evidence, the benchmark figures publicly discussed for Claude Mythos mainly cover coding, multimodal and multilingual tasks. The most reliable information is limited, however: Anthropic describes Claude Mythos Preview as a separate, invitation-only research-preview model for defensive cybersecurity workflows, not as a generally available standard model.^[1]

Available benchmark figures

SWE-bench: 93.9% is cited in several third-party sources as the benchmark score for Claude Mythos.^[3]^[5]
Multimodal: 59.0% is cited as the internal multimodal evaluation score for Mythos Preview, against 27.1% for Opus 4.6.^[4]
SWE-bench Multilingual: 87.3% is cited for Mythos Preview.^[4]
Terminal-Bench 2.0: One third-party source mentions that Terminal-Bench 2.0 was run with the Terminus-2 harness, maximum adaptive thinking and a 1M-token budget per task, but the available excerpt gives no concrete score.^[7]

Assessment and uncertainty

In the available evidence, Anthropic itself confirms the existence and restricted access of Claude Mythos Preview, but the official excerpt provided contains no concrete benchmark figures.^[1]
In the available evidence, the concrete scores come predominantly from third-party sources or snippets, not from fully citable official benchmark tables.^[3]^[4]^[5]
Therefore: there is insufficient evidence in the provided sources for a fully verified, official benchmark list for Claude Mythos.

Sources

[6] Claude Mythos Benchmark Scores | ml-news – Weights & Biases - Wandb (wandb.ai)
Multimodal and multilingual capabilities The model’s improvements extend to multimodal and multilingual tasks. Internal multimodal evaluations show Mythos Preview scoring 59.0% compared to Opus 4.6’s 27.1%, indicating superior handling of diverse i...
[9] Claude Mythos leads 17 of 18 benchmarks Anthropic ... - R&D World (rdworldonline.com)
Claude Mythos leads 17 of 18 benchmarks Anthropic measured. Muse Spark put Meta back in the frontier club, and OpenAI’s ‘Spud’ model is reportedly near launch. By Brian Buntz. Anthropic is not planning on publicly releasing it, bu...
[13] Models overview - Claude API Docs (docs.anthropic.com)
Claude Mythos Preview is offered separately as a research preview model for defensive cybersecurity workflows as part of Project Glasswing. Access is invitation-only and there is no self-serve sign-up. Models with the sam...
[16] Project Glasswing - Anthropic (anthropic.com)
Claude Mythos Preview is a general-purpose frontier model from Anthropic, our most capable yet for coding and agentic tasks. Its strength in cybersecurity is a direct result of that broader capability: a model that can deeply un...
[18] [PDF] Claude Mythos Preview System Card - Anthropic (www-cdn.anthropic.com)
[25] Claude Mythos Preview: Benchmarks, Pricing & Project Glasswing (llm-stats.com)
SWE-bench Multimodal uses an internal implementation; scores are not directly comparable to public leaderboard results. Terminal-Bench 2.0 was run with the Terminus-2 harness, adaptive thinking at maximum effort, and a 1M token budget per task. With extend...
[27] When a Lab Withholds Its Best Model: What the Claude Mythos System Card Signals for Cybersecurity (authmind.com)
On Cybench (a public benchmark drawing from 40 CTF challenges across four major competitions), Claude Mythos Preview achieved a perfect pass@1 score of 1.00. Claude Opus 4.6, the prior generation, scored 0.89. On CyberGym, which evaluates AI agents on targe...
