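Answer published April 28, 2026. Last edited May 6, 2026. 6 sources.

Claude Opus 4.7 benchmarks: three public scores and how far to trust them

The three figures most often cited for Claude Opus 4.7 in public material are 87.6% on SWE-bench Verified, 94.2% on GPQA, and 80.5% on SWE-bench Multilingual. Of these, SWE-bench Verified has the strongest source support. GPQA and SWE-bench Multilingual are useful supplementary signals, but in this set of sources they are not cross-confirmed as widely as SWE-bench Verified.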

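Going only by what can be found publicly today, the Claude Opus 4.7 benchmarks come down to three numbers: 87.6% on SWE-bench Verified, 94.2% on GPQA, and 80.5% on SWE-bench Multilingual. The three do not carry the same level of confidence, though. The best anchor is SWE-bench Verified, because the same score is stated explicitly in multiple sources.^[4]^[5]

Conclusion first: three scores, different levels of confidence

Benchmark	Public figure for Claude Opus 4.7	How to read the sourcing
SWE-bench Verified	87.6%	The most solid coding-evaluation anchor in this set of sources; the same figure appears in multiple sources.^[4]^[5]
GPQA	94.2%	LLM-Stats lists this figure explicitly, but the visible excerpt of the official Anthropic page does not show a complete benchmark table that can be cited directly.^[5]^[7]
SWE-bench Multilingual	80.5%	Another source gives this figure and compares it with 77.8% for Opus 4.6, but source coverage is thin, so treat it conservatively.^[9]

This table takes a conservative reading: it includes only figures that appear explicitly in the current sources. For procurement, product migration, or formal model selection, you should still run your own tests against your own codebase, toolchain, and acceptance criteria.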
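One way to make this conservative reading concrete is to count how many sources actually state each figure. The sketch below is illustrative only: `CLAIMS` and `support_level` are hypothetical names, the source IDs are taken from the citations in the table, and the two-source threshold is an assumption, not an established rule.

```python
# Sources that explicitly state each number, per the table above.
# Source [7] is cited for GPQA but its excerpt does not state the figure,
# so it is left out of the count.
CLAIMS = {
    "SWE-bench Verified": {"score": 87.6, "stated_by": [4, 5]},
    "GPQA": {"score": 94.2, "stated_by": [5]},
    "SWE-bench Multilingual": {"score": 80.5, "stated_by": [9]},
}


def support_level(stated_by, threshold=2):
    """Label a claim by how many distinct sources state it.

    The threshold is an assumption chosen for illustration.
    """
    if len(set(stated_by)) >= threshold:
        return "cross-confirmed"
    return "single-source"


for name, claim in CLAIMS.items():
    print(f"{name}: {claim['score']}% ({support_level(claim['stated_by'])})")
```

Counting sources says nothing about whether they are independent of each other; it only records how often a figure is repeated.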

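Why SWE-bench Verified is the most reliable reference point

The 87.6% that Claude Opus 4.7 scores on SWE-bench Verified is the best-supported benchmark figure in this set of public sources. Rabinarayan Patra's migration and benchmark article and LLM-Stats both list the same 87.6%.^[4]^[5]

LLM-Stats also presents this score as a gain of 6.8 percentage points over Opus 4.6.^[5] ALM Corp, for its part, describes Opus 4.7 as performing more strongly on hard coding and agentic workflows.^[6]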
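A gain quoted in percentage points is easy to misread as a relative percentage. The short calculation below, a sketch using only the two figures reported above, works out what the 6.8-point gain implies. The Opus 4.6 baseline it derives is computed here and is not quoted from any of the sources.

```python
opus_47 = 87.6  # SWE-bench Verified score, as reported in [4] and [5]
gain_pp = 6.8   # percentage-point gain over Opus 4.6, as reported in [5]

# Implied Opus 4.6 score: derived by subtraction, not quoted from a source.
implied_opus_46 = round(opus_47 - gain_pp, 1)

# Relative increase in the share of resolved tasks.
relative_gain = gain_pp / implied_opus_46

# Relative decrease in the share of unresolved tasks.
unresolved_drop = gain_pp / (100 - implied_opus_46)

print(implied_opus_46)
print(f"{relative_gain:.1%}")
print(f"{unresolved_drop:.1%}")
```

Under these figures, the same 6.8 points corresponds to a relative gain of roughly 8% in resolved tasks and a drop of roughly 35% in unresolved ones, which is why it matters to state which measure a comparison uses.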

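For software engineering teams, SWE-bench Verified can serve as the first public point of comparison. It should not be the final answer, however. What really matters is whether the model can reliably deliver acceptable results in your repository, with your CI/CD pipeline, internal frameworks, testing conventions, and code review standards.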
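A minimal sketch of what measuring on your own tasks could look like follows. Everything in it is hypothetical: `Task`, `pass_rate`, and `fake_model` are illustrative names, and the stub stands in for whatever client call and acceptance checks (unit tests, linters, review rules) a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Task:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def pass_rate(tasks: Iterable[Task], generate: Callable[[str], str]) -> float:
    """Fraction of tasks whose generated output passes its own check."""
    tasks = list(tasks)
    if not tasks:
        raise ValueError("no tasks to evaluate")
    passed = sum(1 for t in tasks if t.check(generate(t.prompt)))
    return passed / len(tasks)


# Stub model for illustration only; replace with a real client call.
def fake_model(prompt: str) -> str:
    return "def add(a, b):\n    return a + b" if "add" in prompt else ""


tasks = [
    Task("t1", "Write add(a, b)", lambda out: "return a + b" in out),
    Task("t2", "Write sub(a, b)", lambda out: "return a - b" in out),
]
print(pass_rate(tasks, fake_model))
```

Keeping the check function attached to each task lets the same harness run against two models, so the comparison reflects your acceptance criteria and not a public leaderboard's.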

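GPQA: an impressive score that needs one more layer of confirmation

The 94.2% on GPQA is clearly listed by LLM-Stats.^[5] Anthropic's official release page is of course an important primary source. Based on the excerpt currently visible, however, what that page mainly confirms is that developers can use claude-opus-4-7 via the Claude API; it does not provide a complete table of benchmark figures that can be cited directly.^[7]

GPQA can therefore be treated here as a valuable supplementary signal, particularly for gauging general reasoning ability. If you intend to use 94.2% as a core basis for procurement or migration, it is best to check it against the complete official data or run your own evaluation.^[5]^[7]

SWE-bench Multilingual: a useful reference for multilingual codebases

For international teams, mixed-language stacks, or development environments that are not entirely in English, the 80.5% on SWE-bench Multilingual deserves particular attention. One source reports that Claude Opus 4.7 reaches 80.5% on this evaluation, up from 77.8% for Opus 4.6.^[9]

The limitation needs to be stated clearly as well: in the current sources this figure is not confirmed repeatedly by multiple sources the way SWE-bench Verified is. It can serve as an initial lead, but it cannot replace hands-on validation against your own multilingual code, documentation, and test cases.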

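Beyond the scores: what affects real-world adoption more

Claude Opus 4.7 is not positioned on benchmark scores alone. VentureBeat describes the release as Anthropic's most powerful publicly released large language model at that time.^[1] ALM Corp likewise describes Opus 4.7 as a general-purpose Opus model aimed at advanced coding, long-running agentic tasks, document-heavy reasoning, high-resolution visual understanding, and professional workflows.^[6]

In actual model selection, the following product characteristics can matter as much as any single score, or more:

Context window: LLM-Stats reports that Claude Opus 4.7 has a 1-million-token context window.^[5]
Vision: LLM-Stats reports 3.3x higher-resolution vision processing.^[5]
Effort level: LLM-Stats and ALM Corp both mention the new xhigh effort level.^[5]^[6]
Tokenizer: ALM Corp notes that Opus 4.7 uses an updated tokenizer, so the same input may produce a higher token count.^[6]

These factors directly affect cost, latency, and output quality. The tokenizer change in particular can throw off existing token budgets and cost estimates if it is not measured before migration.^[6]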
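The effect of a tokenizer change on a budget can be estimated before migrating by counting tokens for the same inputs under both models. The sketch below assumes you have already collected those counts; `measured_inflation` and `projected_cost` are hypothetical helpers, and every number in it is a placeholder, not a measurement or a published price.

```python
def measured_inflation(old_counts, new_counts) -> float:
    """Ratio of total tokens under the new tokenizer to the old one.

    Both lists must hold counts for the same inputs, in the same order.
    """
    if len(old_counts) != len(new_counts) or not old_counts:
        raise ValueError("need matching, non-empty count lists")
    return sum(new_counts) / sum(old_counts)


def projected_cost(tokens_old: int, inflation: float, price_per_mtok: float) -> float:
    """Cost after the tokenizer change, at a given price per million tokens."""
    if inflation <= 0:
        raise ValueError("inflation must be positive")
    return tokens_old * inflation / 1_000_000 * price_per_mtok


# Placeholder counts and price, for illustration only.
ratio = measured_inflation([1000, 2500, 400], [1150, 2900, 440])
print(round(ratio, 3))
print(round(projected_cost(50_000_000, ratio, price_per_mtok=10.0), 2))
```

Measuring the ratio on a sample of real prompts matters because inflation can differ between prose, code, and non-English text.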

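How teams should use these numbers

If the focus is coding: start with SWE-bench Verified. 87.6% is the most solid figure in the current sources and the easiest to use for public comparison.^[4]^[5]

If the focus is agent workflows: in addition to SWE-bench, build Opus 4.7's product positioning for hard coding and agentic tasks, as well as the new xhigh effort level, into your test design.^[5]^[6]

If the focus is general reasoning: GPQA 94.2% is a useful reference, but in this set of sources it is less well cross-confirmed than SWE-bench Verified.^[5]^[7]

If the focus is multilingual codebases: SWE-bench Multilingual 80.5% is a helpful hint, but because the sources are few, validate it with your own multilingual tasks.^[9]

If you are preparing to launch or migrate: do not test only benchmark-like problems. Also test long context, tool use, visual input, token consumption, latency, and error recovery. The context window, vision processing, xhigh effort, and the tokenizer change can all alter the real-world experience.^[5]^[6]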
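A migration check along these lines can be reduced to comparing a few measured metrics against agreed limits. The following is a sketch under stated assumptions: the metric names, the `MAX_REGRESSION` thresholds, and the sample measurements are all hypothetical, and every metric is treated as "lower is better".

```python
# Hypothetical acceptance limits: the largest relative increase over the
# baseline that is tolerated for each "lower is better" metric.
MAX_REGRESSION = {
    "p95_latency_s": 0.20,
    "tokens_per_task": 0.25,
    "error_rate": 0.0,
}


def regressions(baseline: dict, candidate: dict, limits: dict = MAX_REGRESSION) -> list:
    """Return the metrics where the candidate exceeds its allowed increase."""
    failed = []
    for metric, limit in limits.items():
        allowed = baseline[metric] * (1 + limit)
        if candidate[metric] > allowed:
            failed.append(metric)
    return failed


# Placeholder measurements, for illustration only.
baseline = {"p95_latency_s": 10.0, "tokens_per_task": 8000, "error_rate": 0.05}
candidate = {"p95_latency_s": 11.0, "tokens_per_task": 10500, "error_rate": 0.04}
print(regressions(baseline, candidate))
```

An empty list means the candidate stayed within every limit; any entry names a dimension to investigate before switching traffic.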

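Summary

The most concise and reasonably robust reading is this: the core publicly visible figures for Claude Opus 4.7 are 87.6% on SWE-bench Verified, 94.2% on GPQA, and 80.5% on SWE-bench Multilingual.^[4]^[5]^[9] Of these, SWE-bench Verified is the most reliable anchor, because multiple sources state it consistently.^[4]^[5]

GPQA and SWE-bench Multilingual are important supplements, but the current sources do not cross-confirm them to the same degree. For real model decisions, public benchmarks are well suited to narrowing the shortlist; the final validation still has to come from your own workflows, data types, and cost constraints.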

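Key takeaways

The three figures most often cited for Claude Opus 4.7 in public material are 87.6% on SWE-bench Verified, 94.2% on GPQA, and 80.5% on SWE-bench Multilingual; SWE-bench Verified has the strongest source support.
GPQA and SWE-bench Multilingual are useful supplementary signals, but in this set of sources they are not cross-confirmed as widely as SWE-bench Verified.
For actual model selection or migration, also test the 1-million-token context window, vision processing, xhigh effort, the tokenizer change, and real cost and latency.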

Research conversation

You, April 25, 2026

Research question

What are the benchmarks for Claude Opus 4.7?

Studio Global AI, April 25, 2026, 18 sources

Answer with citations

According to the sources found, the publicly available benchmark figures for Claude Opus 4.7 are mainly 87.6% on SWE-bench Verified and 94.2% on GPQA. Anthropic also describes Opus 4.7 as its most capable generally available model; another result cites 80.5% on SWE-bench Multilingual. ^[5]^[6]^[1]^[15]

SWE-bench Verified: 87.6% ^[5]^[6]
GPQA: 94.2% ^[6]
SWE-bench Multilingual: 80.5% ^[15]

Important:

The SWE-bench Verified figure is cited consistently across several results. ^[5]^[6]
I saw GPQA 94.2% stated clearly in only one benchmark summary article in the search results, not directly in the snippet of the official Anthropic page. ^[6]
Beyond the search snippets, I do not have fully extracted primary data here for a complete benchmark table taken directly from the official release page. If you like, I can next put these figures in context against GPT, Gemini, and other Claude models.

Sources

[1] Anthropic releases Claude Opus 4.7, narrowly retaking lead for most ... (venturebeat.com)
Anthropic is publicly releasing its most powerful large language model yet, Claude Opus 4.7, today — as it continues to keep an even more powerful successor, Mythos, restricted to a small number of external enterprise partners for cybersecurity testing and...
[4] Claude Opus 4.7: Benchmarks, Breaking Changes, Migration Guide | Rabinarayan Patra (rabinarayanpatra.com)
Claude Opus 4.7 ships 87.6% on SWE-bench Verified, a new tokenizer, xhigh effort, and four API breaking changes. create( model="claude-opus-4-7", max tokens=64000, output config={"effort": "xhigh"}, output con...
[5] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7: Benchmarks, Pricing, Context & What's New. Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. Claude Opus 4.7 is a direct upgrade to Opus 4.6 at the sa...
[6] Claude Opus 4.7: Pricing, Benchmarks & Context Window - ALM Corp (almcorp.com)
Claude Opus 4.7 is Anthropic’s latest generally available Opus model, and the release matters for a simple reason: it is not just another benchmark update. Opus 4.7 keeps the same list price as Opus 4.6, adds stronger performance on hard coding and agentic...
[7] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[9] Anthropic Launches Claude Opus 4.7 With Higher ... (binance.com)
Anthropic launched Claude Opus 4.7, with SWE-bench Multilingual rising to 80.5% from 77.8% for Opus 4.6. Anthropic said the updated
