
Picking among four big AI models by benchmark: don't rank an overall champion, match the model to the task

Public benchmarks do not support crowning a single overall champion: GPT-5.5 scores 82.7% on Terminal-Bench 2.0, while Claude Opus 4.7 scores 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified; but the four models have not been independently compared under one shared evaluation framework [19][27][5]. Kimi K2.6 has public numbers such as SWE-Bench Pro 58.6%, SWE-Bench Verified 80.2%, and Terminal-Bench 2.0 66.7%, but some come from the model card or Moonshot's in-house harness and should not be read as strictly like-for-like against the GPT-5.5 and Claude scores [1][6].

[AI-generated hero image: abstract dashboard comparing benchmark scores of GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6; winners differ by task.]

If you are a product or engineering team choosing a model, the most dangerous move is to ask only: which one is strongest? There are plenty of public benchmark numbers, but GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 have not all gone head-to-head under one set of prompts, tool permissions, reasoning settings, and graders.

A more reliable reading: start from the task, then shortlist candidates. GPT-5.5 and Claude Opus 4.7 appear together in the most same-table comparisons; Kimi K2.6's scores mix model-card figures with individual harnesses; and DeepSeek V4 lacks shared numbers on several major coding benchmarks [1][2][5][6].

Bottom line first: what to try for each kind of work

  • Terminal-style agentic coding: try GPT-5.5 first. OpenAI reports 82.7% on Terminal-Bench 2.0; in public comparison tables, Claude Opus 4.7 sits at 69.4% and Kimi K2.6 at 66.7% [19][8][13][6].
  • Real GitHub issue fixing and code maintenance: Claude Opus 4.7 is the stronger first-line candidate. Public sources report 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified, above GPT-5.5's 58.6% on SWE-Bench Pro [27][19].
  • Long-context, multimodal input: Kimi K2.6 belongs on the shortlist. It is described as supporting text, image, and video input, with a 256k-context route [7].
  • Cost-sensitive, high-volume API calls: DeepSeek V4's pricing stands out. Per Mashable's API pricing roundup, per 1 million tokens DeepSeek V4 costs US$1.74 input and US$3.48 output; GPT-5.5 is US$5 input and US$30 output; Claude Opus 4.7 is US$5 input and US$25 output (see the cost sketch after this list) [3].
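
To make those price gaps concrete, here is a minimal sketch that turns the per-million-token prices above into workload-level spend. The prices come from Mashable's roundup [3]; the token volumes are hypothetical placeholders, not figures from any source.

```python
# Spend estimate from per-1M-token API prices (Mashable roundup [3]).
PRICES_USD_PER_1M = {            # model: (input price, output price)
    "DeepSeek V4":     (1.74, 3.48),
    "GPT-5.5":         (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}

def spend(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given raw token volume."""
    in_price, out_price = PRICES_USD_PER_1M[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical monthly workload: 200M input tokens, 40M output tokens.
for model in PRICES_USD_PER_1M:
    print(f"{model:16s} ${spend(model, 200_000_000, 40_000_000):,.2f}")
```

On this made-up workload the gap is roughly 4x: about US$487 for DeepSeek V4 versus US$2,000 for Claude Opus 4.7 and US$2,200 for GPT-5.5; output-heavy workloads feel the output price most.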

Core benchmark comparison

In the table below, "—" only means that no directly comparable number for that benchmark appeared in the provided public sources; it does not mean the model cannot do that kind of work.

| Benchmark | GPT-5.5 | Claude Opus 4.7 | Kimi K2.6 | DeepSeek V4 | How to read it |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7% [19] | 69.4% [8][13] | 66.7% [6] | — | GPT-5.5 has the highest public value on terminal and command-line workflows. |
| SWE-Bench Pro | 58.6% [19] | 64.3% [27] | 58.6% [1][6] | — | Claude Opus 4.7 leads on code fixing and GitHub issue resolution. |
| SWE-Bench Verified | — | 87.6% [27] | 80.2% [1][6] | — | In the provided sources, only Claude Opus 4.7 and Kimi K2.6 have visible numbers. |
| GPQA Diamond | 93.6% [8][13] | 94.2% [8][13] | — | — | Very close; Claude's public value is slightly higher. |
| HLE with tools | 52.2% [8] | 54.7% [8][29] | 54.0% [6] | — | Claude and Kimi post higher numbers, but Kimi's may come from a different comparison setup [6]. |
| BrowseComp | 84.4% [8][13] | 79.3% [8][13] | — | — | GPT-5.5's public value is higher on web-browsing and search-style evals. |
| OSWorld-Verified | 78.7% [13] | 78.0% [13] | — | — | The gap is narrow. |
| MCP Atlas | 75.3% [13] | 79.1% [13] | — | — | Claude Opus 4.7 is higher on MCP and tool-integration evals. |
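
One way to operationalize "per-task winners, not an overall champion" is to keep these scores in a structure that records the harness caveat next to each number, so any cross-harness comparison gets flagged instead of silently ranked. A minimal sketch: the scores mirror the table above, and the caveat strings are our own annotations, not labels from the sources.

```python
# Pick a per-benchmark leader while carrying harness caveats along.
SCORES = {
    "Terminal-Bench 2.0": {
        "GPT-5.5": (82.7, None),
        "Claude Opus 4.7": (69.4, None),
        "Kimi K2.6": (66.7, "separate comparison table"),
    },
    "SWE-Bench Pro": {
        "GPT-5.5": (58.6, None),
        "Claude Opus 4.7": (64.3, None),
        "Kimi K2.6": (58.6, "Moonshot in-house harness"),
    },
}

for bench, rows in SCORES.items():
    leader, (score, _) = max(rows.items(), key=lambda kv: kv[1][0])
    mixed = any(caveat for _, (_, caveat) in rows.items())
    flag = "  [mixed harnesses: verify before trusting]" if mixed else ""
    print(f"{bench}: {leader} at {score}%{flag}")
```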

GPT-5.5: a strong candidate for terminal-style agentic coding

OpenAI reports GPT-5.5 at 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro [19]. By OpenAI's description, Terminal-Bench 2.0 tests complex command-line workflows that require planning, iteration, and tool coordination, while SWE-Bench Pro evaluates a model's ability to resolve real GitHub issues [19].

So if your product workflow involves sandboxed execution, shell commands, CI reproduction, creating and editing files, or even leaving the model to explore a terminal on its own for long stretches, GPT-5.5 is worth testing first; a toy harness in that spirit is sketched below. On SWE-Bench Pro, however, Claude Opus 4.7's 64.3% beats GPT-5.5's 58.6%, so don't conclude that GPT-5.5 wins all coding work [19][27].
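
If you want to probe this class of task on your own workload before committing, a tiny Terminal-Bench-style check goes a long way: hand the model a shell task in a scratch directory and grade the end state, not the transcript. Everything below is a toy sketch; ask_model_for_commands is a hypothetical stub you would wire to your provider's API.

```python
import os
import subprocess
import tempfile

def ask_model_for_commands(task: str) -> list[str]:
    """Hypothetical stub: replace with a real API call returning shell commands."""
    return ["mkdir -p src", "printf 'hello' > src/app.txt"]

def run_terminal_task(task: str, check) -> bool:
    """Run model-proposed commands in a throwaway dir, then grade the end state."""
    with tempfile.TemporaryDirectory() as workdir:
        for cmd in ask_model_for_commands(task):
            subprocess.run(cmd, shell=True, cwd=workdir, timeout=30)
        return check(workdir)

ok = run_terminal_task(
    "Create src/app.txt containing 'hello'",
    check=lambda d: open(os.path.join(d, "src", "app.txt")).read() == "hello",
)
print("pass" if ok else "fail")
```

Grading the end state (does the file exist with the right contents?) rather than the command sequence keeps the check robust to different but equally valid command choices.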

Claude Opus 4.7: better suited to code patching and review-style work

Claude Opus 4.7 is reported at 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified [27]. DataCamp also notes that Opus 4.7 was evaluated across 14 benchmarks spanning coding, reasoning, tool use, computer use, and visual reasoning [27].

In shared comparisons with GPT-5.5, Claude Opus 4.7 posts 94.2% on GPQA Diamond versus GPT-5.5's 93.6%, and 79.1% on MCP Atlas versus 75.3% [8][13]. Conversely, the public values on Terminal-Bench 2.0 and BrowseComp favor GPT-5.5 [8][13][19].

In other words, Claude Opus 4.7 may not be the outright king of every terminal-automation scenario; but if the task is fixing issues, editing code, reviewing, refactoring, or checking logical consistency, it should be in the first batch of models you try.

Kimi K2.6: long multimodal input is attractive, but check the benchmark conditions

Kimi K2.6 is listed at 58.6% on SWE-Bench Pro and 80.2% on SWE-Bench Verified; another guide lists Terminal-Bench 2.0 at 66.7% and HLE with tools at 54.0% [1][6]. That same guide, however, notes that the K2.6 numbers come from Moonshot AI's official model card and footnotes the SWE-Bench Pro figure as produced by Moonshot's in-house harness [6].

So even though Kimi K2.6's 58.6% on SWE-Bench Pro matches GPT-5.5's 58.6% on paper, it should not be read as a like-for-like tie [1][6][19]. What makes Kimi K2.6 more interesting is that it is described as supporting text, image, and video input, along with a 256k-context route; if your application handles very long documents, screenshots, video, or cross-modal material, it deserves its own hands-on test [7].

DeepSeek V4: attractive pricing, but verify accuracy yourself

On the benchmarks this table covers (Terminal-Bench, SWE-Bench Pro, SWE-Bench Verified, GPQA Diamond, and so on), the provided sources do not contain enough public numbers to slot DeepSeek V4 into the same rows. Separately, Artificial Analysis reports DeepSeek V4 Pro Max at -10 on AA-Omniscience, an 11-point improvement over V3.2, with V4 Flash Max at -23 [2]. The same source reports hallucination rates of 94% for V4 Pro and 96% for V4 Flash, reading this as the model answering almost every time even when it does not know [2].

On architecture and cost, DeepSeek V4 still merits attention. DataCamp notes that DeepSeek V4 uses a Mixture of Experts architecture; the Pro model has 1.6 trillion total parameters with 49 billion active, and the Flash model has 284 billion total with 13 billion active [4]. Mashable's API pricing roundup also shows DeepSeek V4 is cheaper than GPT-5.5 and Claude Opus 4.7 [3].

So DeepSeek V4 fits best in cost-sensitive bulk processing, internally verifiable pipelines, and open-weights evaluations. But if your product depends on correctness, pair it with your own evals, post-processing, failure detection, and human spot checks; one lightweight pattern is sketched below. Neither the high hallucination-rate reports nor the gaps in shared benchmarks should be ignored [2][3][4].
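
Given those hallucination-rate reports, one cheap failure-detection pattern is self-consistency sampling: ask the same question several times and only accept the answer when the samples agree, routing the rest to human review. This is a generic sketch, not anything from the cited sources; call_model is a hypothetical stub for your provider's API, and the threshold is an arbitrary placeholder.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stub: replace with a real API call (DeepSeek V4 or other)."""
    raise NotImplementedError

def answer_with_agreement(prompt: str, n: int = 5, min_agree: float = 0.6):
    """Sample n answers; return the majority answer only if agreement is high."""
    answers = [call_model(prompt).strip() for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / n >= min_agree:
        return top_answer        # confident enough to use downstream
    return None                  # low agreement: flag for human spot check
```

Note this multiplies API cost by n, which is exactly where DeepSeek V4's lower per-token price can offset the overhead.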

Use-case selection guide

| Use case | Test first | Main reason |
| --- | --- | --- |
| Long-running terminal automation, shell-based agents, CI reproduction | GPT-5.5 | Public Terminal-Bench 2.0 numbers: GPT-5.5 82.7%, Claude Opus 4.7 69.4%, Kimi K2.6 66.7% [19][8][13][6] |
| Real GitHub issue resolution, code patching, SWE-Bench-style work | Claude Opus 4.7 | Reported at 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified [27] |
| Web-browsing and search-style tasks | GPT-5.5 | BrowseComp reports GPT-5.5 at 84.4%, Claude Opus 4.7 at 79.3% [8][13] |
| MCP and tool-integration tasks | Claude Opus 4.7 | MCP Atlas reports Claude Opus 4.7 at 79.1%, GPT-5.5 at 75.3% [13] |
| Long multimodal context | Kimi K2.6 | Described as supporting text, image, and video input, with a 256k-context route [7] |
| Cost-sensitive, high-volume API calls | DeepSeek V4 | Mashable pricing puts DeepSeek V4's token costs below GPT-5.5 and Claude Opus 4.7; weigh this against Artificial Analysis's high hallucination-rate report [2][3] |

Why not rush to crown an overall champion?

First, the provided sources contain no evidence that the four models have completed a full independent comparison under the same prompts, tool access, reasoning budget, and grader. GPT-5.5 and Claude Opus 4.7 share relatively many comparisons; Kimi K2.6 mixes model-card and in-house harness numbers; DeepSeek V4 leaves blanks across several shared benchmark rows [1][2][5][6].

Second, the same benchmark name can hide different execution conditions. One roundup notes that while GPT-5.5 and Claude Opus 4.7's public scores are formally comparable, that does not mean the methodologies match [5]. Anthropic likewise discloses that its Terminal-Bench 2.0 evaluation uses the Terminus-2 harness under specific resource conditions [31].

Third, benchmark scores are only one slice of product quality. In a real deployment, beyond accuracy you also care about failure modes, hallucination rate, latency, cost, tool-call stability, safety policy, and log reproducibility. ExplainX offers the same caution: leaderboard definitions, prompts, and tool policies can all move scores, so treat public benchmarks as a snapshot, not a substitute for your own eval harness [28].

So how do you choose?

On the current public evidence, the pragmatic strategy is: try GPT-5.5 first for terminal-style agentic coding; try Claude Opus 4.7 first for SWE-Bench-style code patching; shortlist Kimi K2.6 for long multimodal context; and evaluate DeepSeek V4 for cost-sensitive bulk calls [19][27][7][3].

But if you are shipping a model in a real product, the safest move is still to run small regression tests with your own real prompts, real data, and real tool permissions; a minimal harness sketch follows. Public benchmarks can narrow the shortlist, but the final call should rest on your own task performance, cost, and risk requirements [5][28][31].
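
As a starting point for that regression suite, here is a minimal harness sketch: a handful of prompt/check pairs drawn from your own workload, run against each candidate, with one pass rate per model. call_model is a hypothetical stub for your provider SDKs, and the two cases are placeholders to replace with real ones.

```python
# Minimal model-regression harness: your prompts, your checks, one
# pass rate per candidate. call_model is a stub to wire to each SDK.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in each provider's SDK here

CASES = [  # (prompt, check) pairs from your actual workload
    ("Summarize this ticket: ...", lambda out: len(out) < 500),
    ("Extract the invoice total from: ...", lambda out: "$" in out),
]

def pass_rate(model: str) -> float:
    passed = 0
    for prompt, check in CASES:
        try:
            passed += bool(check(call_model(model, prompt)))
        except Exception:   # API or check errors count as failures
            pass
    return passed / len(CASES)

for model in ["GPT-5.5", "Claude Opus 4.7", "DeepSeek V4", "Kimi K2.6"]:
    print(model, f"{pass_rate(model):.0%}")
```

Keep the suite small enough to run on every prompt or model change; even ten well-chosen cases will surface regressions that leaderboard deltas never show.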


Key takeaways

  • Public benchmarks do not support crowning a single overall champion: GPT-5.5 has 82.7% on Terminal-Bench 2.0, and Claude Opus 4.7 has 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified; but the four models lack an independent comparison under one shared evaluation framework [19][27][5].
  • Kimi K2.6 has public numbers such as SWE-Bench Pro 58.6%, SWE-Bench Verified 80.2%, and Terminal-Bench 2.0 66.7%, but some come from the model card or Moonshot's in-house harness and should not be treated as strictly like-for-like against the GPT-5.5 and Claude scores [1][6].
  • DeepSeek V4's API price per 1 million tokens is lower, but public data is thin on the shared coding benchmarks in this comparison; Artificial Analysis also reports very high hallucination rates for V4 Pro and V4 Flash [2][3].


Sources

  • [1] Claude Opus 4.7 vs Kimi K2.6 - Detailed Performance & Feature Comparison (docsbot.ai)

    SWE-Bench Verified Evaluates software engineering capabilities through verified code modifications and custom agent setups Not available 80.2% SWE-Bench Verified, thinking mode Source SWE-Bench Pro Evaluates software engineering on multi-language SWE-Bench...

  • [2] DeepSeek is back among the leading open weights models with V4 Pro ... (artificialanalysis.ai)

    Gains in knowledge but an increase in hallucination rate: DeepSeek V4 Pro (Max) scores -10 on AA-Omniscience, an 11 point improvement over V3.2 (Reasoning, -21), driven primarily by higher accuracy. V4 Flash (Max) scores -23, broadly in line with V3.2. V4 P...

  • [3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)

    Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...

  • [4] DeepSeek V4: Features, Benchmarks, and Comparisons - DataCamp (datacamp.com)

    How large are the DeepSeek V4 models? DeepSeek uses a Mixture of Experts (MoE) architecture. The Pro model contains 1.6 trillion total parameters (49 billion active) and requires an 865GB download. The Flash model contains 284 billion parameters (13 billion...

  • [5] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks (llm-stats.com)

    The Verdict On the 10 benchmarks both providers report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The leads cluster by category, not by overall quality: Opus 4.7 is ahead on the reasoning-heavy and review-grade tests (GPQA Diamond, HLE with and without to...

  • [6] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)

    Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

  • [7] Kimi K2.6 vs DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7 (blog.laozhang.ai)

    Official Contract Lanes Official rows keep the comparison honest. Kimi's K2.6 pricing page says K2.6 is the latest and smartest Kimi model, supports text, image, and video input, and has a 256k context route. DeepSeek's pricing page lists deepseek-v4-flash...

  • [8] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)

SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...

  • [13] Everything You Need to Know About GPT-5.5 - Vellum (vellum.ai)

    Benchmark GPT-5.5 GPT-5.5 Pro GPT-5.4 Claude Opus 4.7 Gemini 3.1 Pro --- --- --- Terminal-Bench 2.0 82.7% — 75.1% 69.4% 68.5% SWE-Bench Pro 58.6% — 57.7% 64.3% 54.2% Expert-SWE (Internal) 73.1% — 68.5% — — GDPval 84.9% 82.3% 83.0% 80.3% 67.3% OSWorld-Verifi...

  • [19] Introducing GPT-5.5 - OpenAI (openai.com)

    Agentic coding GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro,...

  • [27] Claude Opus 4.7: Anthropic's New Best (Available) Model - DataCamp (datacamp.com)

    Claude Opus 4.7 Benchmarks Opus 4.7 was evaluated across 14 benchmarks covering coding, reasoning, tool use, computer use, and visual reasoning. The table below shows the full comparison with Opus 4.6, GPT 5.4, Gemini 3.1 Pro, and the not-yet-published Myth...

  • [28] Claude Opus 4.7: Anthropic's new flagship, benchmarks, and how it compares to Sonnet & Haiku | explainx.ai Blog (explainx.ai)

Percentages are as printed on Anthropic's benchmark figure; leaderboard definitions, prompts, and tool policies can move scores over time; treat this as a snapshot, not a substitute for your eval harness. Reading the table pragmatically Agentic coding (SWE-...

  • [29] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safety (mashable.com)

    Claude Mythos scored 56.8 percent on HLE Claude Opus 4.7 scored 46.9 percent Gemini 3.1 Pro scored 44.4 percent GPT-5-4 Pro scored 42.7 percent Claude Opus 4.6 scored 40.0 percent With tools, GPT-5-4-Pro scored 58.7 percent compared to Opus 4.7’s 54.7 perce...

  • [31] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)

    For GPT-5.4 and Gemini 3.1 Pro, we compared against the best reported model version available via API in the charts and table. MCP-Atlas: The Opus 4.6 score has been updated to reflect revised grading methodology from Scale AI. SWE-bench Verified, Pro, and...