Answer published April 29, 2026 · Last edited May 6, 2026 · 4 sources

How Good Is Claude Opus 4.7 at Coding? SWE-bench, Debugging, and the Limits on Refactoring

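Claude Opus 4.7 is now available through the Claude API. TNW reports that it scores 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified, so the evidence for coding and for fixing issues in real repositories is strong, but there is no independent benchmark dedicated to large-scale refactoring.[3][5] On agentic coding, TNW reports that CursorBench rose from 58% with Opus 4.6 to 70% with Opus 4.7, that multi-step agentic reasoning improved by 14%, and that tool errors fell to roughly one third.[3] Before adopting it in an IDE, the Claude API, or an internal coding agent, it is best to use the same...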

[Figure: AI-generated editorial illustration showing Claude Opus 4.7, coding benchmarks, and a software engineering workflow.]

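To judge how good Claude Opus 4.7 is at coding, it is not enough to ask whether it can generate a function. The questions closer to real engineering work are these: once it is dropped into an existing repo, can it understand the context, can it fix real issues, do its tool calls go wrong, and does a long workflow drift further off course the longer it runs? Anthropic has released Claude Opus 4.7, and its official page states that developers can use claude-opus-4-7 via the Claude API; CNBC has also reported on the launch.^[5]^[2]

The answer from the public material available is fairly clear, but it has limits. The evidence for Opus 4.7 on coding and debugging tasks is strong. For large-scale refactoring, there is so far no independent, dedicated, standardized public benchmark that can show on its own how strong it is.^[3]^[5]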

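The conclusion first: coding and debugging are worth trying first; refactoring you need to verify yourself

TNW reports that Claude Opus 4.7 is Anthropic's most capable generally available model, and lists improvements on SWE-bench Pro, SWE-bench Verified, CursorBench, and multi-step agentic reasoning.^[3] These figures are enough to support a practical judgment: if you want a model to help write features, fix bugs, or drive a coding agent through a complete workflow in a multi-file project, Opus 4.7 is an option worth putting near the front of the queue.^[3]

But if you ask how much better it is than other models at refactoring a large codebase, the claim has to be narrowed. The available sources mainly discuss software engineering, SWE-bench, agentic workflows, and long-running tasks; none provides an independent benchmark that clearly isolates the quality of large-scale refactoring.^[3]^[5]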

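Coding, debugging, refactoring: do not lump them together

A coding model that can write a new function cannot necessarily fix an existing bug, and one that can fix bugs will not necessarily produce a large refactor that a reviewer is comfortable merging.

Capability	The question you really need to ask	Current public evidence
Coding	Can it understand the requirements, follow the existing APIs, project structure, and coding style, and produce a mergeable feature?	Strong: TNW reports that Opus 4.7 scores higher than Opus 4.6 on several coding and agentic benchmarks.^[3]
Debugging	Can it read error messages, logs, traces, and failing tests, find the root cause, and fix a real issue?	Fairly strong: SWE-bench Pro is described as testing a model's ability to resolve real software problems in open-source projects; Anthropic's official page also includes positive comments from early users on bug finding and fix proposals.^[3]^[5]
Refactoring	Can it improve structure, naming, abstraction boundaries, and maintainability without changing behavior?	Unsettled: the available sources list no independent public benchmark that specifically measures refactoring quality.^[3]^[5]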

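The hardest public numbers: SWE-bench and CursorBench

The benchmark figures reported by TNW are among the most concrete public material currently available for judging Opus 4.7's coding ability.^[3]

Metric	Claude Opus 4.7	Comparison figures	How to read it
SWE-bench Pro	64.3%	Opus 4.6: 53.4%; GPT-5.4: 57.7%; Gemini 3.1 Pro: 54.2%	SWE-bench Pro is described as testing a model's ability to resolve real software problems in open-source projects, which is closer to everyday issue fixing than pure algorithm puzzles.^[3]
SWE-bench Verified	87.6%	Opus 4.6: 80.8%; Gemini 3.1 Pro: 80.6%	On the verified software engineering tasks TNW reports, Opus 4.7 scores clearly higher than its predecessor and the main comparison models listed in the article.^[3]
CursorBench	70%	Opus 4.6: 58%	A clear gain for agentic coding workflows, not just single-turn completion of a few lines of code.^[3]
Multi-step agentic reasoning	14% improvement over Opus 4.6	Tool errors fall to roughly one third of Opus 4.6's	More informative for scenarios that need tool calls, cross-step operations, and long-running engineering tasks.^[3]

The point of these figures is that Opus 4.7's strength is not just that it can write code; in tasks closer to a real engineering environment, it is also more competitive at handling issues, tools, and multi-step processes.^[3] However, a benchmark score does not mean your team will see an equivalent efficiency gain: datasets, tool permissions, test coverage, project size, and reviewer standards all change the actual result.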
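To make the comparison column easier to read, the gaps can be worked out directly from the reported scores. The following is a minimal sketch that uses only the percentages quoted from TNW above; the `REPORTED` dictionary and the `gaps` helper are illustrative names, not part of any benchmark tooling. The 14% agentic-reasoning figure is left out because the sources do not say whether it is a relative gain or a percentage-point gain.

```python
# Scores in percent, exactly as quoted from TNW in the table above.
REPORTED = {
    "SWE-bench Pro": {
        "Opus 4.7": 64.3, "Opus 4.6": 53.4, "GPT-5.4": 57.7, "Gemini 3.1 Pro": 54.2,
    },
    "SWE-bench Verified": {
        "Opus 4.7": 87.6, "Opus 4.6": 80.8, "Gemini 3.1 Pro": 80.6,
    },
    "CursorBench": {"Opus 4.7": 70.0, "Opus 4.6": 58.0},
}


def gaps(benchmark, target="Opus 4.7"):
    """Return {model: (percentage-point gap, relative gain in %)} versus target."""
    scores = REPORTED[benchmark]
    base = scores[target]
    result = {}
    for model, score in scores.items():
        if model == target:
            continue
        point_gap = round(base - score, 1)                   # absolute difference
        relative = round((base - score) / score * 100, 1)    # gain relative to the other model
        result[model] = (point_gap, relative)
    return result


if __name__ == "__main__":
    for name in REPORTED:
        for model, (point_gap, relative) in gaps(name).items():
            print(f"{name}: Opus 4.7 vs {model}: +{point_gap} points ({relative}% relative)")
```

Keeping percentage points and relative gain apart matters when quoting these numbers: a 12-point rise on CursorBench and a roughly 21% relative gain describe the same two scores.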

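Debugging: the evidence is more solid than for refactoring

The core of debugging is not handing an error message to the model and getting back a plausible-looking patch. What matters more is whether the model can locate the right file, understand the code path, change only the minimum necessary scope, and avoid introducing regressions. That is why tasks like SWE-bench Pro, which are based on problems from real open-source projects, reflect bug-fixing ability better than ordinary coding puzzles.^[3]

Anthropic's official release page also presents Opus 4.7 in the context of advanced software engineering and complex long-running tasks, and states that developers can use the model via the Claude API.^[5] The early-user comments included in the official material include Replit noting that it is more efficient and accurate at analyzing logs and traces, finding bugs, and proposing fixes.^[5]

The nature of the source matters here: the early-user comments come from official release material and are not equivalent to independent third-party blind tests.^[5] So the safer statement is that the evidence for Opus 4.7 producing fixes from real repo issues is fairly strong; but if you care about live debugging, hard problems in one particular framework, or cross-service errors in a large monorepo, you still need to validate it with your own task set.^[3]^[5]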
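When validating on your own task set, one simple proxy for "changed only the minimum necessary scope" is the size of the patch a model returns. The sketch below counts the files and lines touched in a unified diff. The `patch_scope` helper and `SAMPLE_DIFF` are hypothetical names introduced for illustration, and the parser is deliberately simplified: it assumes git-style `a/` and `b/` path prefixes and would miscount a changed line whose own content begins with `-- ` or `++ `.

```python
def _strip_prefix(path):
    """Drop the git-style 'a/' or 'b/' prefix from a diff path, if present."""
    return path[2:] if path.startswith(("a/", "b/")) else path


def patch_scope(unified_diff):
    """Count files touched and lines added/removed in a unified diff string."""
    files = set()
    added = removed = 0
    old_path = None
    for line in unified_diff.splitlines():
        if line.startswith("--- "):
            old_path = line[4:].strip()          # remember the old path for deleted files
        elif line.startswith("+++ "):
            new_path = line[4:].strip()
            path = old_path if new_path == "/dev/null" else new_path
            if path:
                files.add(_strip_prefix(path))
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {"files": len(files), "added": added, "removed": removed}


# Made-up example patch, not taken from any real repository.
SAMPLE_DIFF = """\
--- a/app/cart.py
+++ b/app/cart.py
@@ -10,3 +10,3 @@
 def total(items):
-    return sum(i.price for i in items)
+    return sum(i.price * i.qty for i in items)
"""

if __name__ == "__main__":
    print(patch_scope(SAMPLE_DIFF))
```

A small patch is not automatically a correct one, so this number is best read alongside test results and reviewer judgment rather than on its own.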

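Refactoring: worth trying, but not yet something you can confidently call the best

Large refactors are harder to measure than bug fixes. Passing tests only shows that behavior is largely unbroken; it does not show that abstraction boundaries are clearer, coupling is lower, or naming is more consistent, nor that a reviewer will find the diff easy to read and easy to accept.

As far as the available sources go, both Anthropic's official release and TNW's report focus on coding, SWE-bench, agentic workflows, and long-running multi-step tasks; neither provides a clear, independent public benchmark that specifically isolates the quality of large-scale refactoring.^[3]^[5]

The most responsible judgment on refactoring ability is therefore this: Opus 4.7 is well worth trying first, because its underlying abilities in real issue fixing, tool use, and multi-step workflows have clearly improved, but that is still indirect evidence.^[3] If large-scale refactoring is core work for your team, you should directly measure behavior preservation, test pass rate, diff reviewability, naming consistency, and subsequent maintainability, rather than looking only at general coding leaderboards.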
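Of the criteria listed above, behavior preservation is the easiest to check mechanically. One common approach is a characterization test: run the original and the refactored implementation on the same inputs and report any disagreement. The sketch below assumes pure functions with comparable return values; `behavior_diffs`, `legacy_discount`, and `refactored_discount` are made-up names for illustration only.

```python
def _outcome(fn, args):
    """Run fn(*args) and capture either its return value or the exception type."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # an exception counts as part of observable behavior
        return ("raised", type(exc).__name__)


def behavior_diffs(before, after, cases):
    """Return the argument tuples on which the two implementations disagree."""
    return [args for args in cases if _outcome(before, args) != _outcome(after, args)]


# Hypothetical "before" code: branching on a customer tier.
def legacy_discount(price, tier):
    if tier == "gold":
        return price * 0.8
    elif tier == "silver":
        return price * 0.9
    else:
        return price


# Hypothetical "after" code: the same rule expressed as a lookup table.
RATES = {"gold": 0.8, "silver": 0.9}


def refactored_discount(price, tier):
    return price * RATES.get(tier, 1.0)


if __name__ == "__main__":
    cases = [(100, "gold"), (100, "silver"), (100, "bronze")]
    print(behavior_diffs(legacy_discount, refactored_discount, cases))
```

An empty result only shows agreement on the inputs you supplied. Reviewability, naming, and maintainability still need a human reviewer.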

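Strongest generally available does not mean strongest of all Anthropic models

TNW calls Opus 4.7 Anthropic's most capable generally available model, and Anthropic's official page lists claude-opus-4-7 as available via the Claude API.^[3]^[5] But generally available does not mean the most capable among all of Anthropic's internal or restricted models.

Alpha Spread reports that Anthropic says Opus 4.7 is still broadly less capable than Claude Mythos Preview; CNBC also frames its coverage around the trade-off between Opus 4.7 and Mythos.^[1]^[2] In other words, if you ask whether Opus 4.7 should be evaluated first among Anthropic's generally available coding models, the public evidence supports ranking it very high; but if you ask whether it is the absolute strongest of all Anthropic's models, the existing sources do not support that.^[1]^[2]^[3]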

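If you adopt it: how to run an A/B test

Public benchmarks can help you decide whether it is worth trying, but they cannot prove it will be the best on your own codebase. If you plan to put Opus 4.7 into an IDE, an internal coding agent, or a Claude API workflow, it is best to run comparison tests against the same repository snapshot.

Test three categories of task:

Feature development: give the same requirements and the same project state, and see whether the model can produce a mergeable diff.
Debugging fixes: provide a failing test, an error log, or an issue description, and assess its ability to find the root cause, limit the scope of the fix, and avoid regressions.
Refactoring tasks: ask the model to improve structure while keeping behavior unchanged, then have engineers assess readability, test pass rate, diff reviewability, and maintainability.

When scoring, record at least whether the tests passed, whether a manual rollback was needed, whether there were tool-call errors, whether the reviewer would accept the change, and whether the model can clearly explain its design trade-offs. These are closer to the real effect of adoption than a one-off demo.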
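The scoring fields above can be captured in a small record per trial and aggregated per model and task category. The following is a minimal sketch under stated assumptions: the `Trial` fields mirror the checklist in this section, the model names are placeholders, and the sample rows are invented purely to show the shape of the output, not measured results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    model: str               # placeholder label for the model under test
    category: str            # "feature", "debug", or "refactor"
    tests_passed: bool
    needed_rollback: bool
    tool_errors: int
    reviewer_accepted: bool


def summarize(trials):
    """Group trials by (model, category) and compute simple per-group rates."""
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial.model, trial.category)].append(trial)

    summary = {}
    for key, rows in groups.items():
        n = len(rows)
        summary[key] = {
            "n": n,
            "test_pass_rate": sum(r.tests_passed for r in rows) / n,
            "rollback_rate": sum(r.needed_rollback for r in rows) / n,
            "accept_rate": sum(r.reviewer_accepted for r in rows) / n,
            "tool_errors_per_task": sum(r.tool_errors for r in rows) / n,
        }
    return summary


if __name__ == "__main__":
    # Invented placeholder rows; replace with trials run on your own repository snapshot.
    sample = [
        Trial("model-a", "debug", True, False, 0, True),
        Trial("model-a", "debug", False, True, 2, False),
        Trial("model-b", "debug", True, False, 1, True),
    ]
    for key, stats in sorted(summarize(sample).items()):
        print(key, stats)
```

Whether the model explained its design trade-offs clearly is a qualitative judgment; it could be added as a reviewer-assigned score, but it is left out here to keep the sketch small.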

Bottom line

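On writing code and fixing problems in real repositories, the public evidence for Claude Opus 4.7 is quite strong: the SWE-bench Pro, SWE-bench Verified, CursorBench, and multi-step agentic reasoning figures reported by TNW all support a clear improvement over Opus 4.6, and show it to be competitive among the main comparison models listed in the report.^[3]

For debugging, the evidence can be called fairly strong, because both SWE-bench-style tasks and the early-user comments in the official material point to better bug fixing and engineering workflow ability.^[3]^[5] For refactoring, stay conservative: the available sources do not provide an independent, dedicated, standardized refactoring benchmark. If large-scale refactoring is your core work, you still need to run an A/B test on your own codebase before deciding whether to adopt it.^[3]^[5]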

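Key points

Claude Opus 4.7 is now available through the Claude API. TNW reports that it scores 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified, so the evidence for coding and for fixing issues in real repositories is strong, but there is no independent benchmark dedicated to large-scale refactoring.[3][5]
On agentic coding, TNW reports that CursorBench rose from 58% with Opus 4.6 to 70% with Opus 4.7, that multi-step agentic reasoning improved by 14%, and that tool errors fell to roughly one third.[3]
Before adopting it in an IDE, the Claude API, or an internal coding agent, it is best to use the same repo snapshot to test feature development, debugging, and refactoring, and to look at test results, review outcomes, and the cost of manual rework.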

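Research conversation

You · April 19, 2026

Research question

Search and fact-check: how good is Claude Opus 4.7 really at writing code, debugging, and refactoring code?

Studio Global AI · April 19, 2026 · 12 sources

Cited answer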

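In short: Claude Opus 4.7 is not a rumor. It is a model that Anthropic formally released on 2026-04-16 and that is already generally available.^[5]^[2] In terms of overall software engineering ability, the public evidence currently available is strong, and on SWE-bench-style benchmarks, which are closest to real debugging and bug fixing, it sits in the top tier. But if you ask exactly how much better it is than competitors at refactoring code, public, independent evidence that specifically isolates refactoring is still insufficient, so I can only say the evidence is fairly strong but not enough for a definitive conclusion.^[3]^[5]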

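Anthropic states explicitly that, compared with Opus 4.6, Opus 4.7 is significantly better at "advanced software engineering", especially on the hardest tasks, and better at handling complex, long-running work.^[5]
On public benchmarks, the most representative evidence is SWE-bench. The figures cited by TNW show Opus 4.7 at 64.3% on SWE-bench Pro, above Opus 4.6 at 53.4%, GPT-5.4 at 57.7%, and Gemini 3.1 Pro at 54.2%; on SWE-bench Verified it scores 87.6%, also above Opus 4.6 at 80.8% and Gemini 3.1 Pro at 80.6%.^[3] TNW also explains that SWE-bench Pro is a benchmark testing whether a model can resolve real software problems in open-source projects, which makes it closer to real debugging and repair work than ordinary programming puzzles.^[3]
For agentic coding, TNW also writes that CursorBench rose from 58% with Opus 4.6 to 70% with Opus 4.7.^[3] This indicates that it is strong not only at single-turn code generation but also at coding workflows that need multi-step operations.^[3]
Anthropic also claims that Opus 4.7 improves on Opus 4.6 by 14% on complex multi-step workflows, with tool-use errors reduced to roughly one third.^[3] If that figure also holds in your IDE, CLI agent, or CI/CD pipeline, the effect on debugging and large-scale refactoring would be noticeable.^[3]
Note, however, that for the more practice-oriented descriptions such as "finding bugs, analyzing logs, proposing fixes", much of the evidence you can currently find comes from early-tester feedback selected by Anthropic, not from fully independent third-party controlled experiments.^[5] For example, the official page includes Replit's comment that it is more efficient and precise at "analyzing logs and traces, finding bugs, proposing fixes", and several partners say it is better suited to long-running, multi-step engineering work.^[5]
On refactoring itself, I have not seen a public, independent, standardized refactoring benchmark clearly listed.^[5]^[3] So the stricter statement is: the evidence for "fixing issues, long-running engineering tasks, and agentic coding" is strong, which strongly suggests it is also good at refactoring; but the public evidence is not yet enough to quantify refactoring ability precisely on its own.^[5]^[3]
Another important fact: Anthropic and CNBC both say that although Opus 4.7 is Anthropic's "most capable generally available model", its overall broad capability is still below that of the higher-tier, restricted-release Claude Mythos Preview.^[5]^[2] So if you ask "is it Anthropic's strongest coding model that can currently be bought publicly?", the answer leans toward yes; but if you ask "is it the absolute strongest of all Anthropic's models?", the answer is no.^[5]^[2]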

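My fact-check conclusion: if your criteria are "writing code, fixing real repo problems, and long-running agentic coding", Claude Opus 4.7 is a choice with very strong public evidence behind it and is close to the top.^[3]^[5] But if you care specifically about the interactive debugging experience or the quality of refactoring in large projects, the existing public evidence is still mostly official benchmarks and partner case studies, and independent verification remains scarce. The most accurate judgment is therefore: very strong, but its practical advantage in debugging and refactoring is only settled once you have run an A/B test on your own codebase.^[5]^[3]

If you like, as a next step I can put together a coding-ability comparison table for Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro, or design a debugging and refactoring A/B test checklist that your own team can run.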

Sources

[1] Anthropic Releases Claude Opus 4.7 and Signals a Push Into Visual Productivity Tools. Alpha Spread, alphaspread.com
Anthropic has announced Claude Opus 4.7, an updated artificial intelligence model that the company says is better at software engineering and difficult coding tasks. The r...
[2] Anthropic rolls out Claude Opus 4.7, an AI model that is less risky than Mythos. CNBC, cnbc.com
[3] Claude Opus 4.7 leads on SWE-bench and agentic reasoning ... The Next Web, thenextweb.com
Anthropic releases Claude Opus 4.7 with benchmark-leading coding and agentic performance. In short: Anthropic has released Claude Opus 4.7, its most capable generally...
[5] Introducing Claude Opus 4.7. Anthropic, anthropic.com
Developers can use claude-opus-4-7 via the Claude API.
