Answer published 3 months ago · Last edited 2 months ago · 17 sources

What does Claude Mythos Preview's 93.9% on SWE-bench actually mean?

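The benchmark figure most often cited for Claude Mythos Preview is 93.9% on SWE-bench. It mainly reflects performance on software engineering and coding tasks; it is not an overall score for the model [1][2]. High SWE-bench scores usually come from an agent-style setup in which the model reads files, runs code, checks test results and revises repeatedly, so when comparing models, make sure the evaluation conditions are comparable [1].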


[Image: AI-generated editorial illustration of a benchmark dashboard with code and metrics for Claude Mythos Preview.]

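The number most often quoted for Claude Mythos Preview lately is its 93.9% on SWE-bench. The score is certainly eye-catching, but the cardinal mistake when reading benchmarks is treating the grade for one subject as the whole report card. SWE-bench mainly evaluates coding and software engineering tasks; it does not measure every capability of an AI model at once.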

The short answer: 93.9% is a strong signal for software tasks

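The most central and clearest figure currently attached to Claude Mythos Preview is 93.9% on SWE-bench. If what you care about is editing code, fixing bugs, working across a repository, or building an AI agent that carries out development tasks on its own, this score is a meaningful reference point.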
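To make the arithmetic behind a percentage like this concrete, here is a minimal Python sketch of how a resolve rate can be computed. It assumes the criterion used by the public SWE-bench harness as we understand it: an instance counts as resolved only if the tests the fix is supposed to make pass (FAIL_TO_PASS) now pass and the previously passing tests (PASS_TO_PASS) still pass. The `InstanceResult` record, its field names and the toy instance IDs are hypothetical, and nothing here reproduces how the 93.9% figure itself was produced.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome for one task instance (hypothetical record layout)."""
    instance_id: str
    fail_to_pass_ok: bool  # tests the fix should make pass now pass
    pass_to_pass_ok: bool  # tests that passed before still pass (no regressions)


def is_resolved(r: InstanceResult) -> bool:
    # Both test groups must succeed for the instance to count.
    return r.fail_to_pass_ok and r.pass_to_pass_ok


def resolve_rate(results: list[InstanceResult]) -> float:
    """Share of resolved instances as a percentage (93.9 means 93.9%)."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


if __name__ == "__main__":
    # Toy data, not real SWE-bench results.
    demo = [
        InstanceResult("toy-1", True, True),
        InstanceResult("toy-2", True, False),  # fixed the bug but broke another test
        InstanceResult("toy-3", False, True),  # bug not fixed
        InstanceResult("toy-4", True, True),
    ]
    print(resolve_rate(demo))  # 50.0
```

The second toy instance shows why fixing the reported bug is not enough on its own: a patch that breaks existing tests does not count toward the score.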

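But look at the test setup as well. A high SWE-bench score is usually not produced by the model in a single response; the model runs as an agent that can read files, execute code, inspect test results and revise its work based on what it sees. So 93.9% reflects not only the model itself but also the tool environment and evaluation pipeline it runs in.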
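The agent-style setup described above can be sketched as a simple loop: the model proposes a patch, the harness runs the tests, and the failure log is fed back for another attempt. This is a minimal illustration under stated assumptions, not Anthropic's or SWE-bench's actual harness; `toy_model` and `toy_tests` are hypothetical stand-ins, and the step budget `max_steps` is an illustrative parameter.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins, not a real SWE-bench or Anthropic API:
# a "model" turns feedback text into a candidate patch, and a
# suite runner applies the patch and reports (all_passed, log).
Model = Callable[[str], str]
SuiteRunner = Callable[[str], Tuple[bool, str]]


def agent_loop(model: Model, run_tests: SuiteRunner, task: str,
               max_steps: int = 5) -> Tuple[Optional[str], int]:
    """Propose, test and revise until the tests pass or the budget runs out.

    Returns (passing_patch, steps_used); passing_patch is None on failure.
    """
    feedback = task
    for step in range(1, max_steps + 1):
        patch = model(feedback)          # draft or revise a patch
        passed, log = run_tests(patch)   # run the test suite on it
        if passed:
            return patch, step
        # Feed the failure log back so the next attempt can react to it.
        feedback = f"{task}\nPrevious patch failed:\n{log}"
    return None, max_steps


def toy_model(feedback: str) -> str:
    # Toy behaviour: wrong at first, "fixes" it once it sees a failure log.
    return "fixed-patch" if "failed" in feedback else "first-attempt"


def toy_tests(patch: str) -> Tuple[bool, str]:
    # Toy test suite: only "fixed-patch" passes.
    return patch == "fixed-patch", "AssertionError in parser test"


if __name__ == "__main__":
    print(agent_loop(toy_model, toy_tests, "Fix the parser bug"))               # ('fixed-patch', 2)
    print(agent_loop(toy_model, toy_tests, "Fix the parser bug", max_steps=1))  # (None, 1)
```

With `max_steps=1`, the same toy model fails, which is the point of this section: the harness's step budget and tool access are part of what a score measures, not just the model.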

93.9% does not mean "better at everything"

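93.9% should not be read as Claude Mythos Preview's overall score. SWE-bench is a software-task benchmark; a high result there does not tell you how the model fares on general reasoning or safety, what it costs, how usable it is, or how well it handles work that has nothing to do with writing code.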
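One way to keep this distinction explicit when tabulating results is to attach the benchmark and the evaluation setup to every score and refuse to compare scores whose benchmark or setup differs. The sketch below assumes a free-text `harness` label is enough to describe the setup; real comparisons would need to spell it out in more detail (tools, step limits, test access). The `Score` type, the model names and the values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # e.g. "SWE-bench"
    harness: str    # hypothetical free-text label for the evaluation setup
    value: float    # percentage


def comparable(a: Score, b: Score) -> bool:
    # Same benchmark AND same setup; otherwise the numbers measure different things.
    return a.benchmark == b.benchmark and a.harness == b.harness


def higher_scoring(a: Score, b: Score) -> str:
    """Return the name of the higher-scoring model, refusing mismatched setups."""
    if not comparable(a, b):
        raise ValueError(
            f"not comparable: {a.benchmark}/{a.harness} vs {b.benchmark}/{b.harness}"
        )
    return a.model if a.value >= b.value else b.model


if __name__ == "__main__":
    # Made-up models and numbers, for illustration only.
    x = Score("model-x", "SWE-bench", "agent-with-tools", 90.0)
    y = Score("model-y", "SWE-bench", "agent-with-tools", 85.0)
    z = Score("model-z", "SWE-bench", "single-response", 70.0)
    print(higher_scoring(x, y))  # model-x
    print(comparable(x, z))      # False
```

The same check also rejects pairing a SWE-bench score with a score from an unrelated benchmark, mirroring the point that a coding result says nothing direct about other abilities.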
