Answer published April 29, 2026. Last edited May 6, 2026. 4 sources.

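How strong is Claude Opus 4.7 at coding? SWE-bench numbers, debugging ability, and refactoring limits

Claude Opus 4.7 was released in April 2026 and is available through the Claude API. TNW reports 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified, which points to strong coding and real repo issue fixing, but large-scale refactoring still lacks an independent, dedicated benchmark.[2][3][5] The strongest public evidence is concentrated in real issue fixing and agentic coding: TNW reports that CursorBench rose from 58% for Opus 4.6 to 70% for Opus 4.7, multi-step agentic reasoning improved by 14%, and tool errors fell to roughly one third.[3] If you plan to adopt it in an IDE, Claude...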


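Judging Claude Opus 4.7's coding ability takes more than checking whether it can generate a function. What matters more is whether, once placed in an existing repository, it can read the context, fix real issues, use tools correctly, and keep error rates low across a multi-step workflow. Anthropic has released Claude Opus 4.7, and its official page states that developers can use claude-opus-4-7 via the Claude API; CNBC also reported on the launch.^[5]^[2]

The public material supports a fairly clear conclusion, with a boundary: the evidence for Opus 4.7 on coding and debugging-related tasks is strong, while large-scale refactoring still lacks an independent, dedicated, standardized public benchmark.^[3]^[5]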

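Core conclusion: strong at coding and debugging, be conservative about refactoring

TNW describes Claude Opus 4.7 as Anthropic's most capable generally available model and lists gains on SWE-bench Pro, SWE-bench Verified, CursorBench, and multi-step agentic reasoning.^[3] Those numbers are enough to support a practical judgment: if you need to build features, fix bugs, or have a coding agent complete work across a multi-file project, Opus 4.7 deserves to be evaluated first.^[3]

If the question is instead "how much better is it than other models at refactoring a large project", the answer should be more conservative. The sources available for this article emphasize software engineering, SWE-bench, agentic workflows, and long-running tasks, but none provides an independent benchmark that clearly isolates the quality of large-scale refactoring.^[3]^[5]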

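Coding, debugging, and refactoring are not the same thing

When evaluating a coding model, it helps to look at three abilities separately. Being able to write new code does not mean the model can correctly fix an existing bug, and being able to fix bugs does not mean it can produce a large refactor that reviewers are willing to accept.

Ability	What you actually want to know	Current public evidence
Coding	Can it understand requirements, produce working features, and fit existing APIs and project structure	Strong evidence: TNW reports Opus 4.7 scoring above Opus 4.6 on several coding and agentic benchmarks.^[3]
Debugging	Can it read error messages, logs, traces, and failing tests, find the root cause, and fix real issues	Fairly strong evidence: SWE-bench Pro is described as testing a model's ability to resolve real software problems in open-source projects; Anthropic's official page also includes positive early-user feedback on bug finding and fix proposals.^[3]^[5]
Refactoring	Can it improve structure, naming, abstraction boundaries, and maintainability without changing behavior	Inconclusive: the sources available for this article list no independent public benchmark dedicated to refactoring quality.^[3]^[5]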

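The hardest public numbers: SWE-bench and CursorBench

The benchmark numbers reported by TNW are among the most concrete public material for judging Opus 4.7's coding ability.^[3]

Metric	Claude Opus 4.7	Comparison figures	How to read it
SWE-bench Pro	64.3%	Opus 4.6: 53.4%; GPT-5.4: 57.7%; Gemini 3.1 Pro: 54.2%	SWE-bench Pro is described as measuring a model's ability to resolve real software problems in open-source projects, so it is closer to everyday issue fixing than pure algorithm puzzles.^[3]
SWE-bench Verified	87.6%	Opus 4.6: 80.8%; Gemini 3.1 Pro: 80.6%	On the verified software engineering tasks reported by TNW, Opus 4.7 scores clearly above its predecessor and the main comparison models listed.^[3]
CursorBench	70%	Opus 4.6: 58%	A clear gain on agentic coding workflows, not just single-turn code completion.^[3]
Multi-step agentic reasoning	14% improvement over Opus 4.6	Tool errors at roughly one third of the Opus 4.6 level	More relevant for scenarios that involve tool calls, cross-step operations, and long-running engineering tasks.^[3]

What these numbers mean: Opus 4.7's strength is not only that it "can write code" but that it can handle issues, tools, and multi-step processes in tasks closer to a real engineering environment.^[3] Benchmark scores, however, do not guarantee the same efficiency gain for your team; datasets, tool permissions, test coverage, project size, and reviewer standards can all change the actual result.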
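When reading the table, it helps to separate percentage-point gains from relative gains, since the two can look very different. The following is a minimal sketch that recomputes both from the TNW-reported scores quoted above. The `REPORTED` dictionary and the `gains` helper are illustrative names, not part of any benchmark tooling. The "14%" agentic reasoning figure is left out because the sources quoted here do not say whether it is a point gain or a relative gain.

```python
# Scores in percent, copied from the TNW-reported figures in the table above.
REPORTED = {
    "SWE-bench Pro": {"Opus 4.7": 64.3, "Opus 4.6": 53.4},
    "SWE-bench Verified": {"Opus 4.7": 87.6, "Opus 4.6": 80.8},
    "CursorBench": {"Opus 4.7": 70.0, "Opus 4.6": 58.0},
}


def gains(benchmark, new="Opus 4.7", old="Opus 4.6"):
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    scores = REPORTED[benchmark]
    point_gain = scores[new] - scores[old]
    relative_gain = point_gain / scores[old] * 100
    return round(point_gain, 1), round(relative_gain, 1)


for name in REPORTED:
    points, relative = gains(name)
    print(f"{name}: +{points} points ({relative}% relative to Opus 4.6)")
```

On these figures, the SWE-bench Pro gain is 10.9 points, or about 20.4% relative to Opus 4.6, while the SWE-bench Verified gain is 6.8 points, or about 8.4% relative. Neither number says anything about how the gain transfers to a specific codebase.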

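Debugging ability: better evidenced than refactoring

The core of debugging is not handing an error message to a model and getting back a plausible-looking patch. It is whether the model can locate the right file, understand the code path, change only the minimum necessary scope, and avoid introducing regressions. Tasks like SWE-bench Pro, which are based on real problems from open-source projects, therefore reflect bug-fixing ability better than typical coding puzzles.^[3]

Anthropic's official release page also frames Opus 4.7 in terms of advanced software engineering and complex, long-running tasks, and states that developers can use the model via the Claude API.^[5] The early-user feedback included in the official material features Replit's comment that it is more efficient and precise at analyzing logs and traces, finding bugs, and proposing fixes.^[5]

The nature of the source matters here: early-user feedback comes from official launch material and is not equivalent to an independent third-party blind test.^[5] The safer statement is that the evidence for Opus 4.7 "producing fixes from real repo issues" is fairly strong; if you care about live debugging, framework-specific edge cases, or cross-service errors in a large monorepo, you should still validate with your own task set.^[3]^[5]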

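Refactoring ability: well worth trying, but public data does not yet prove it is the best

Large-scale refactoring is harder to measure than bug fixing. Passing tests only shows that behavior is broadly intact; it does not guarantee better abstraction boundaries, lower coupling, more consistent naming, or a diff that reviewers are more willing to accept.

As far as the sources available for this article go, both Anthropic's official release and TNW's report focus on coding, SWE-bench, agentic workflows, and long-running multi-step tasks, but neither provides a clear, independent public benchmark that specifically isolates large-scale refactoring quality.^[3]^[5]

The most responsible judgment on refactoring is therefore this: Opus 4.7 is very likely worth testing first, because its underlying abilities in real issue fixing, tool use, and multi-step workflows have clearly improved, but that remains indirect evidence.^[3] If large-scale refactoring is a core requirement, measure behavior preservation, test pass rate, diff reviewability, naming consistency, and long-term maintainability directly rather than relying on general coding leaderboards.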
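One way to check behavior preservation directly is a differential test: run the pre-refactor and post-refactor code on the same inputs and compare the outcomes, including raised exceptions. The sketch below is only an illustration under stated assumptions. `legacy_average` and `refactored_average` are hypothetical stand-ins for code before and after a model-generated refactor, and `outcome` and `behavior_diffs` are helper names invented for this example.

```python
def outcome(fn, args):
    """Capture the return value or the exception type, so that errors are compared too."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def behavior_diffs(legacy, refactored, cases):
    """Return the argument tuples on which the two implementations disagree."""
    return [args for args in cases if outcome(legacy, args) != outcome(refactored, args)]


# Hypothetical "before" code: raises ZeroDivisionError on an empty list.
def legacy_average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)


# Hypothetical "after" code: cleaner, but it silently changes the empty-list behavior.
def refactored_average(values):
    return sum(values) / len(values) if values else 0.0


cases = [([1, 2, 3],), ([10],), ([],)]
print(behavior_diffs(legacy_average, refactored_average, cases))
```

Here the only disagreement is the empty-list case, which a happy-path test suite could easily miss. A check like this covers only behavior preservation; reviewability and naming consistency still need human judgment.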

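A strong generally available model is not the same as the absolute strongest of all Anthropic models

TNW calls Opus 4.7 Anthropic's most capable generally available model, and Anthropic's official page states that claude-opus-4-7 is available via the Claude API.^[3]^[5] But "generally available" does not mean "the most capable among all of Anthropic's internal or restricted models".

Alpha Spread reports that Anthropic describes Opus 4.7 as still broadly less capable than Claude Mythos Preview; CNBC also made the difference between Opus 4.7 and Mythos a focus of its coverage.^[1]^[2] In other words, if you are asking "should Opus 4.7 be evaluated first among generally available Anthropic coding models", the public evidence supports ranking it near the top; if you are asking "is it the absolute strongest of all Anthropic models", the current sources do not support that claim.^[1]^[2]^[3]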

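Before adopting it, run an A/B test like this

Public benchmarks can help you decide whether the model is worth trying, but they cannot prove that it is the best choice for your codebase. If you plan to put Opus 4.7 into an IDE, an internal coding agent, or a Claude API workflow, run a controlled comparison against the same repository snapshot.

Test three categories of tasks:

Feature development: give the same requirements and the same project state, and assess whether the model produces a mergeable diff.
Debugging and fixes: provide a failing test, an error log, or an issue description, and assess root-cause localization, the scope of the fix, and regression risk.
Refactoring tasks: ask the model to improve structure while preserving behavior, and have engineers assess readability, test pass rate, diff reviewability, and maintainability.

When scoring, record at least whether tests pass, whether a manual rollback was needed, whether tool-call errors occurred, whether reviewers accepted the change, and whether the model can explain its design trade-offs. This comes closer to real adoption results than a single demo.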
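The scoring criteria above can be recorded per run and aggregated per model and task category. The following is a minimal sketch, not a prescribed harness: the `TaskResult` fields, the `summarize` function, and the sample data (including the placeholder model names `model-a` and `model-b`) are all assumptions made up for illustration. Whether the model explained its design trade-offs is left out because it needs a human rating scale.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    model: str               # placeholder label for the model under test
    category: str            # "feature", "debug", or "refactor"
    tests_passed: bool
    needed_revert: bool      # an engineer had to roll the change back
    tool_errors: int         # tool-call errors observed during the run
    reviewer_accepted: bool


def summarize(results):
    """Aggregate run records into per-(model, category) rates."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.model, r.category)].append(r)
    summary = {}
    for key, rows in groups.items():
        n = len(rows)
        summary[key] = {
            "runs": n,
            "test_pass_rate": sum(r.tests_passed for r in rows) / n,
            "revert_rate": sum(r.needed_revert for r in rows) / n,
            "reviewer_accept_rate": sum(r.reviewer_accepted for r in rows) / n,
            "mean_tool_errors": sum(r.tool_errors for r in rows) / n,
        }
    return summary


# Made-up records, only to show the shape of the data.
results = [
    TaskResult("model-a", "refactor", True, False, 0, True),
    TaskResult("model-a", "refactor", True, True, 2, False),
    TaskResult("model-b", "refactor", False, True, 1, False),
    TaskResult("model-b", "refactor", True, False, 1, True),
]

for key, stats in sorted(summarize(results).items()):
    print(key, stats)
```

Keeping the categories separate matters: a model can have a high test pass rate on refactoring tasks and still a low reviewer acceptance rate, which is the gap the article warns about.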

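Final verdict

The public evidence for Claude Opus 4.7 on coding and fixing real repo problems is strong: the SWE-bench Pro, SWE-bench Verified, CursorBench, and multi-step agentic reasoning figures reported by TNW all support a clear improvement over Opus 4.6 and show it to be competitive among the main comparison models in the report.^[3]

For debugging, the evidence can be called fairly strong, because both SWE-bench-style tasks and official early-user feedback point to better bug fixing and engineering workflow ability.^[3]^[5] For refactoring, stay conservative: the available sources provide no independent, dedicated, standardized refactoring benchmark, and if large-scale refactoring is your core work, you should still A/B test on your own codebase before deciding whether to adopt it.^[3]^[5]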


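Key takeaways

Claude Opus 4.7 was released in April 2026 and is available through the Claude API. TNW reports 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified, which points to strong coding and real repo issue fixing, but large-scale refactoring still lacks an independent, dedicated benchmark.[2][3][5]
The strongest public evidence is concentrated in real issue fixing and agentic coding: TNW reports that CursorBench rose from 58% for Opus 4.6 to 70% for Opus 4.7, multi-step agentic reasoning improved by 14%, and tool errors fell to roughly one third.[3]
If you plan to adopt it in an IDE, the Claude API, or an internal agent, do not rely on leaderboards alone; test feature development, debugging, and refactoring against the same repository snapshot to find out whether it actually improves your team's efficiency.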


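Research conversation

You, April 19, 2026

Research question

Search and fact-check: how strong is Claude Opus 4.7 really at writing, debugging, and refactoring code?

Studio Global AI, April 19, 2026, 12 sources

Cited answer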

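In short: Claude Opus 4.7 is not a rumor. It is a model that Anthropic officially released on 2026-04-16 and that is generally available.^[5]^[2] On overall software engineering ability, the public evidence currently available is strong, and on the SWE-bench family of benchmarks, which come closest to real debugging and bug fixing, it sits in the top tier. But on the question of how much better it is than competitors at refactoring, the public, independent evidence that isolates refactoring is still insufficient; I can only say the evidence leans strong, not that it supports an absolute conclusion.^[3]^[5]

Anthropic officially states that Opus 4.7 shows a significant improvement over Opus 4.6 in "advanced software engineering", particularly on the hardest tasks, and that it handles complex, long-running work better.^[5]
Among public benchmarks, the most representative evidence is SWE-bench. The figures cited by TNW show Opus 4.7 at 64.3% on SWE-bench Pro, above Opus 4.6 at 53.4%, GPT-5.4 at 57.7%, and Gemini 3.1 Pro at 54.2%; on SWE-bench Verified it scores 87.6%, also above Opus 4.6 at 80.8% and Gemini 3.1 Pro at 80.6%.^[3] TNW also explains that SWE-bench Pro measures whether a model can resolve real software problems in open-source projects, which puts it closer to real debugging and repair work than typical programming puzzles.^[3]
For agentic coding, TNW also reports that CursorBench rose from 58% for Opus 4.6 to 70% for Opus 4.7.^[3] This suggests it is strong not only at single-turn code generation but also at coding workflows that require multi-step operation.^[3]
Anthropic also claims that Opus 4.7 improves on Opus 4.6 by 14% on complex multi-step workflows and that tool-use errors dropped to roughly one third.^[3] If that figure holds in your IDE, CLI agent, or CI/CD pipeline, the effect on debugging and large-scale refactoring would be noticeable.^[3]
A caveat: for more practice-oriented descriptions such as "finding bugs, analyzing logs, proposing fixes", much of the evidence you can find comes from early-tester feedback selected by Anthropic, not from fully independent third-party controlled experiments.^[5] For example, the official page includes Replit's comment that it is more efficient and precise at "analyzing logs and traces, finding bugs, proposing fixes", and several partners say it is better suited to long-running, multi-step engineering work.^[5]
On refactoring itself, I have not seen a public, independent, standardized refactoring benchmark clearly listed.^[5]^[3] The stricter statement is that the evidence on "issue fixing, long-running engineering tasks, and agentic coding" is strong and therefore strongly suggests it is also good at refactoring, but public evidence is not yet sufficient to quantify refactoring ability on its own.^[5]^[3]
Another important fact: Anthropic and CNBC both say that although Opus 4.7 is Anthropic's "most capable generally available model", its overall capability is still broadly below the higher-tier, limited-release Claude Mythos Preview.^[5]^[2] So if you are asking "is it Anthropic's strongest coding model that the public can buy today?", the answer leans yes; if you are asking "is it the absolute strongest of all Anthropic models?", the answer is no.^[5]^[2]

My fact-check conclusion: if your standard is "writing code + fixing real repo problems + long-running agent coding", Claude Opus 4.7 is a choice backed by very strong public evidence and is close to the top.^[3]^[5] If you care specifically about the "interactive debugging experience" or "refactoring quality on large projects", the existing public evidence is still mostly official benchmarks and partner case studies, with little independent verification. The most accurate judgment is: very strong, but its practical advantage in debugging and refactoring is only settled once you A/B test it on your own codebase.^[5]^[3]

If you like, my next step can be to put together a coding-ability comparison table for Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro, or to design a debugging and refactoring A/B test checklist that your own team can run.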

Sources

[1] Anthropic Releases Claude Opus 4.7 and Signals a Push Into Visual Productivity Tools. Alpha Spread, alphaspread.com
[2] Anthropic rolls out Claude Opus 4.7, an AI model that is less risky than Mythos. cnbc.com
[3] Claude Opus 4.7 leads on SWE-bench and agentic reasoning ... thenextweb.com
[5] Introducing Claude Opus 4.7. anthropic.com
