Report published April 28, 2026 · Last edited May 6, 2026 · 14 sources

Claude Opus 4.7 vs GPT-5.5 Spud: What Benchmarks Actually Tell You

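No clear winner can be named at this point. Claude Opus 4.7 is confirmed by official Anthropic documentation, but GPT-5.5 Spud has not been verified here against a primary OpenAI source. A credible benchmark needs recent or private tasks, a published methodology, objective scoring, and independent replication.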




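At first glance, Claude Opus 4.7 vs GPT-5.5 Spud looks like a simple contest over which model is stronger. In the material available here, though, the real question comes before any score: are the two models verified to the same standard?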

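Anthropic's official documentation states that developers can use claude-opus-4-7 via the Claude API.^[8] VentureBeat has also reported the public release of Claude Opus 4.7.^[1] For GPT-5.5 Spud, by contrast, the sources cited here are not an OpenAI model card, system card, release notes, or API documentation. They are third-party pages about future or possible OpenAI models.^[19]^[20]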

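The conclusion is therefore asymmetric. Within this set of sources, Claude Opus 4.7 can be treated as a real model that is available for evaluation. GPT-5.5 Spud cannot, at least here, be treated as a model that OpenAI is confirmed to have released. There is not yet any basis for naming a winner in a head-to-head comparison.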

Facts that can be confirmed

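Question	What the sources support	Why it matters
Is Claude Opus 4.7 confirmed as an Anthropic model?	Yes. Anthropic lists `claude-opus-4-7` for the Claude API.^[8]	It can be added to internal evaluations as a candidate.
Has Claude Opus 4.7 been reported as publicly released?	Yes. VentureBeat reports the public release.^[1]	Claims traceable to official documentation or credible reporting carry more weight than rumor.
Is GPT-5.5 Spud confirmed as a released OpenAI model?	No. The Spud-related sources here are third-party pages about future or possible models.^[19]^[20]	Direct performance claims should be treated as unverified.
Is there an independent benchmark comparing Claude Opus 4.7 and GPT-5.5 Spud under the same conditions?	None appears in this set of sources.	Publishing a direct ranking would overstate the evidence.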

What benchmarks can and cannot prove

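A benchmark shows how a model behaved under a specific task set, execution harness, scoring method, tool-use conditions, and access conditions. On its own, it does not prove universal superiority across every workload or use case.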
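One way to make this concrete is to record the conditions of each run and decline to compare scores when they differ. The sketch below is only an illustration: the `RunConfig` schema and its field names are hypothetical and are not taken from any published harness.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Conditions under which a benchmark score was produced (hypothetical schema)."""
    task_set: str      # benchmark name plus version or snapshot date
    harness: str       # code that drives the model through the tasks
    scoring: str       # how a response is judged correct
    tools: tuple       # tools the model was allowed to call
    max_attempts: int  # attempts allowed per task


def config_differences(a: RunConfig, b: RunConfig) -> list:
    """Return the names of the fields on which two runs differ."""
    return [f.name for f in fields(RunConfig) if getattr(a, f.name) != getattr(b, f.name)]


def comparable(a: RunConfig, b: RunConfig) -> bool:
    """Treat two scores as directly comparable only when every condition matches."""
    return not config_differences(a, b)
```

For two runs that share a task set, harness, and scoring method but differ in tool access and attempt limits, `config_differences` returns `['tools', 'max_attempts']`, which is a signal to rerun one side before ranking the models.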

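Research on LLM evaluation makes this point repeatedly. Static benchmarks can suffer from score saturation, data contamination, and a lack of independent replication.^[26] The caveat matters even more in a comparison like this one, where one model is officially confirmed and the other cannot be confirmed from primary sources.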

A credible claim about Claude Opus 4.7 vs GPT-5.5 Spud would need at least the following:

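Spud is confirmed by a primary OpenAI source.
Spud has a stable model ID.
Both models are accessible under reproducible conditions.
Settings such as prompts, tools, retries, and scoring method are disclosed.
An independent replication has been run under equivalent conditions.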

The Spud-related sources here do not meet that bar.^[19]^[20]

Data contamination can change the rankings

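Benchmark contamination and leakage matter because a high score may reflect exposure to test data, solution patterns, or published evaluation artifacts rather than general capability. Recent benchmark research flags this risk especially for static or public datasets.^[25]^[26]^[45]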

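A survey of LLM benchmarks states that dynamic benchmark designs such as LiveBench reduce the risk of data leakage.^[25] Being dynamic does not make any single leaderboard the final answer. Even so, evaluations that are updated frequently and designed to limit contamination are more informative about frontier models than older static benchmarks.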

LiveBench is a strong signal, but not the final word

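Among the sources here, LiveBench is one of the stronger public benchmarks. Its defining features are contamination-limited tasks, frequent question updates drawn from recent information sources, procedural question generation, and scoring against objective ground-truth answers.^[37] The official site also links to the leaderboard, details, code, data, and paper, which makes it easier to verify than one-off launch material.^[36]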
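To illustrate what scoring against objective ground truth means in practice, here is a minimal exact-match scorer. It is a generic sketch, not LiveBench's actual scoring code; the function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions made for the example.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not penalized."""
    return " ".join(answer.lower().split())


def score_exact_match(predictions: dict, ground_truth: dict) -> float:
    """Fraction of questions whose prediction matches the reference answer.

    Questions with no prediction count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1
        for qid, expected in ground_truth.items()
        if normalize(predictions.get(qid, "")) == normalize(expected)
    )
    return correct / len(ground_truth)
```

Because the score depends only on a fixed answer key, anyone with the same questions and predictions can recompute it, which is the property that makes this kind of result easier to audit than a judgment-based rating.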

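Still, LiveBench is not something to hand an adoption decision to. Public benchmarks help narrow the candidates, but they do not replace your own prompts, codebase, latency requirements, cost constraints, and tolerance for failure.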

SWE-bench variants are useful, but do not read them by name alone

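SWE-bench-style evaluations help compare agents for coding and software engineering. A headline score on "SWE-bench" is not enough, though. Results change with the variant, the harness, tool access, repository state, whether retries were allowed, and how scoring was done.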

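SWE-bench Live is described as restricting its tasks to issues created between January 1, 2024 and April 20, 2025 in order to reduce contamination from pretraining data. Its authors also note that leaderboard setups can differ substantially.^[43] SWE-bench Pro is presented as a harder, contamination-resistant benchmark for longer-horizon software engineering tasks.^[44]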
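The date restriction described for SWE-bench Live can be pictured as a simple filter over issue creation dates. This is a sketch of the idea only, not the benchmark's own pipeline: the issue dictionaries with `id` and `created` keys are hypothetical, and treating both endpoints as inclusive is an assumption.

```python
from datetime import date

# Window described for SWE-bench Live in the text above.
WINDOW_START = date(2024, 1, 1)
WINDOW_END = date(2025, 4, 20)


def in_window(created: date, start: date = WINDOW_START, end: date = WINDOW_END) -> bool:
    """True when the issue was created inside the window (endpoints assumed inclusive)."""
    return start <= created <= end


def filter_issues(issues: list) -> list:
    """Keep only issues whose 'created' date falls inside the window.

    Each issue is a dict with hypothetical keys 'id' and 'created'.
    """
    return [issue for issue in issues if in_window(issue["created"])]
```

A filter like this lowers the chance that a task appeared in pretraining data only for models trained before the window. It says nothing about models trained later, which is one reason a date cutoff alone does not settle the contamination question.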

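The caveats are significant as well. SWE-Bench++ points out that benchmarks built on open-source software carry a critical data contamination risk and that solution leakage can skew rankings.^[45] A 2026 analysis of the SWE-bench leaderboards also reports data contamination in recent SWE-bench Verified submissions.^[47]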

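Saturation is a further problem. A paper on benchmarking infrastructure reports cases where results on SWE-bench Verified fall to 23% on SWE-bench Pro.^[46] SWE-ABS likewise argues that the SWE-bench Verified leaderboard is approaching saturation and that success rates can look inflated until tasks are adversarially strengthened.^[49]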

A rough guide to trust when reading benchmarks

In practice, public benchmarks work best as a filter, not a final verdict.

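Type of evidence	How much to trust it	Main caveat
Private evaluation on your own workload	Highest practical value, because it matches your actual prompts, tools, code, and constraints.	Needs a reproducible harness and careful scoring.
Dynamic or contamination-limited public benchmarks	Stronger than static tests, because task refreshes reduce the effect of leakage.^[25]^[37]	Still may not match production work.
SWE-bench Live, SWE-bench Pro	Useful for comparing software engineering agents.^[43]^[44]	Rankings can shift with differences in harness and tool settings.^[43]
Leaderboards such as SWE-bench Verified	Useful as a rough market-wide signal.	Contamination, leakage, and saturation can distort raw scores.^[45]^[47]^[49]
Vendor launch charts	Show what the model developer sees as its strengths.	Important decisions need independent replication.^[26]
Rumor articles and SEO comparison pieces	Can be a starting point for research.	Not primary evidence for an unverified model.^[19]^[20]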

What to check before switching models

When comparing Claude Opus 4.7 with models from OpenAI, Google, Anthropic, or the open-model ecosystem, check benchmark credibility first and finish by testing on your own work.

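Confirm the exact model ID. For Claude Opus 4.7, Anthropic lists claude-opus-4-7 for the Claude API.^[8] For GPT-5.5 Spud, this set of sources contains no primary OpenAI model ID.^[19]^[20]
Use the same harness for every model. SWE-bench Live states explicitly that leaderboard setups can differ substantially. Different conditions produce different apparent rankings.^[43]
Prioritize recent, private, and contamination-resistant tasks. Dynamic benchmarks and contamination-aware software engineering evaluations are designed to reduce the effect of leakage.^[25]^[37]^[44]
Record practical constraints. Log retry counts, latency, cost, tool permissions, failure modes, and whether a task was solved only after expensive repeated attempts.
Repeat the evaluation. It is safest to treat a single leaderboard result as a hypothesis until internal tests or third-party replication back it up.^[26]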
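The checklist above can be sketched as a small in-house harness that runs every task several times through the same code path, records latency, and reports a pass rate with an uncertainty range. Everything here is an assumption for illustration: `solve` stands in for whatever function calls the model under test, `check` for your own grading logic, and the Wilson score interval is one common choice for a proportion, not something the cited sources prescribe.

```python
import math
import time
from typing import Callable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def evaluate(solve: Callable[[str], str], check: Callable[[str, str], bool],
             tasks: list, repeats: int = 3) -> dict:
    """Run every task `repeats` times with the same harness and record outcomes."""
    passes, attempts, latencies = 0, 0, []
    for task in tasks:
        for _ in range(repeats):
            start = time.perf_counter()
            output = solve(task)  # hypothetical call into the model under test
            latencies.append(time.perf_counter() - start)
            attempts += 1
            passes += bool(check(task, output))
    low, high = wilson_interval(passes, attempts)
    return {
        "pass_rate": passes / attempts,
        "interval": (low, high),
        "attempts": attempts,
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

Passing the same `tasks`, `check`, and `repeats` for each candidate model keeps the comparison on equal terms. Cost, tool permissions, and failure modes would need their own fields; they are left out to keep the sketch short. If two models' intervals overlap heavily, the run does not support declaring one of them better.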

What would change the conclusion

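The assessment would change if GPT-5.5 Spud were confirmed by a primary OpenAI announcement, model card, system card, or API documentation, and if an independent benchmark appeared with a stable model ID, reproducible access, and a comparable harness and tool permissions.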

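The evidence would be stronger still if those results appeared on contamination-limited or contamination-aware evaluations such as LiveBench, SWE-bench Live, or SWE-bench Pro and an independent team could reproduce them.^[37]^[43]^[44]^[26]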

Important limitations

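This analysis is limited to the sources referenced here. The absence of a primary OpenAI source in this set does not prove that GPT-5.5 Spud does not exist. It means only that the model cannot be confirmed from these sources.^[19]^[20]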

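The benchmark-methodology sources cited here include arXiv, OpenReview, and SSRN records, and some are not final peer-reviewed journal papers. They are useful for understanding evaluation design, contamination risk, and reproducibility, but should be read with their publication status in mind.^[25]^[26]^[37]^[43]^[44]^[45]^[46]^[47]^[49]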

Bottom line

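Within this set of sources, Claude Opus 4.7 is confirmed by official Anthropic documentation and by press coverage. GPT-5.5 Spud is not confirmed here by a primary OpenAI source.^[8]^[1]^[19]^[20] Naming a winner between Claude Opus 4.7 and GPT-5.5 Spud therefore requires confirmation of Spud, a stable model ID, reproducible access, and evaluation under equivalent conditions.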

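For model selection, prioritize contamination-limited or contamination-aware benchmarks, verifiable methods, and repeated evaluation. LiveBench, SWE-bench Live, and SWE-bench Pro are more informative than static benchmarks or vendor-only charts. None of them, however, replaces a controlled evaluation on your own workload.^[37]^[25]^[43]^[44]^[26]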




Research conversation

You · April 21, 2026

Research question

Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Benchmark credibility, contamination risk, and independent replication.

Studio Global AI · April 21, 2026 · 40 sources

Cited answer

Claude Opus 4.7 is verified: Anthropic officially announced it on Apr. 16, 2026 and said it is generally available.^[1] In the retrieved evidence, “GPT-5.5 Spud” appeared on third-party pages about upcoming OpenAI models rather than in a primary OpenAI release document, so the evidence does not support a clean Claude-vs-Spud head-to-head yet.^[2]^[3] On benchmark credibility, the strongest signals come from contamination-limited or contamination-resistant public benchmarks such as LiveBench and newer SWE-bench variants, not from release-day charts alone.^[4]^[5]^[6]^[7]^[8]

Key findings

Model status: Claude Opus 4.7 is a real, released Anthropic model.^[1] In the retrieved evidence, mentions of “GPT-5.5 Spud” were on third-party pages describing “next” or unreleased OpenAI models.^[2]^[3]
Benchmark credibility: LiveBench was explicitly designed to resist contamination by using recent-source questions, objective ground-truth scoring, and monthly updates.^[4] A later benchmark survey says dynamic benchmark designs like LiveBench reduce data-leakage risk.^[5]
SWE-bench is useful, but raw leaderboard scores need caution: SWE-bench Live restricts tasks to issues created between Jan. 1, 2024 and Apr. 20, 2025 to reduce pretraining contamination, and its authors note that leaderboard setups can differ substantially.^[6] SWE-bench Pro is presented as a more contamination-resistant benchmark for longer-horizon software-engineering tasks.^[7]
Contamination risk remains material: SWE-Bench++ argues that public GitHub-based software benchmarks face critical contamination risk and that solution leakage can skew rankings.^[9] A 2026 analysis of SWE-Bench leaderboards also reports recent Verified submissions with data contamination.^[10]
Saturation and benchmark gaming are also real risks: one 2026 benchmarking paper says results that look strong on SWE-bench Verified can drop to 23% on SWE-bench Pro.^[11] SWE-ABS separately argues that Verified is approaching saturation and can show inflated success rates until tasks are adversarially strengthened.^[12]
Independent replication is still thin: a broader 2025 assessment of major LLM evaluations says many claims now have clearer methods but still limited independent replication.^[8] Inference: that caution applies to fresh Opus 4.7 launch claims until stronger third-party replications appear.^[1]^[8]
Practical takeaway: if you want the most trustworthy performance signal today, put the most weight on contamination-limited/resistant, publicly inspectable benchmarks and repeated third-party tests, and less weight on vendor launch charts.^[4]^[5]^[7]^[8]^[9]^[10]^[11]^[12]

Evidence notes

I did not retrieve a primary OpenAI announcement, model card, or system card for “GPT-5.5 Spud.”
I also did not retrieve a strong, independent, apples-to-apples replication of Claude Opus 4.7 versus a verified GPT-5.5 Spud model.
Sources
- ^[1] Anthropic, “Introducing Claude Opus 4.7” — https://www.anthropic.com/news/claude-opus-4-7
- ^[2] Pasquale Pillitteri, “GPT-5.5 Spud and GPT Image 2: Complete Guide to OpenAI Next Models in 2026” — https://pasqualepillitteri.it/en/news/741/gpt-5-5-spud-gpt-image-2-openai-complete-guide-2026
- ^[3] Idlen, “OpenAI Spud: GPT-6 Release Between April 14 and May 5, 2026” — https://www.idlen.io/news/openai-spud-gpt6-release-april-14-may-5-super-app-ambient-computing
- ^[4] White et al., “LiveBench: A Challenging, Contamination-Limited LLM Benchmark” — https://arxiv.org/abs/2406.19314
- ^[5] “A survey on large language model benchmarks” — https://arxiv.org/abs/2508.15361
- ^[6] “Swe-bench goes live!” — https://arxiv.org/abs/2505.23419
- ^[7] “Swe-bench pro: Can AI agents solve long-horizon software engineering tasks?” — https://arxiv.org/abs/2509.16941
- ^[8] “From Benchmarks to Real-World Performance: A Data-Driven Assessment of Large Language Models in 2025” — https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5991894
- ^[9] “SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories” — https://arxiv.org/abs/2512.17419
- ^[10] “What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair” — https://arxiv.org/abs/2602.04449
- ^[11] “Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering” — https://arxiv.org/abs/2601.21070
- ^[12] “SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark” — https://arxiv.org/abs/2603.00520

Limitations / uncertainty

Because “GPT-5.5 Spud” was not verified from primary-source material in the retrieved evidence, any direct Opus-4.7-vs-Spud ranking claim should be treated as provisional.^[2]^[3]
Several benchmark-methodology sources here are recent arXiv or SSRN papers rather than final peer-reviewed journal versions.^[4]^[5]^[6]^[7]^[8]^[9]^[10]^[11]^[12]

Summary

The fact-checked conclusion is asymmetric: Claude Opus 4.7 is real and released, while “GPT-5.5 Spud” is not verified from primary-source material in the evidence I retrieved.^[1]^[2]^[3] For benchmark credibility, the best current evidence favors contamination-limited or contamination-resistant public evals such as LiveBench and SWE-bench Pro over static or vendor-only charts.^[4]^[5]^[6]^[7]^[8]^[9]^[10]^[11]^[12] If you want, I can turn this into a stricter buyer’s guide with a simple trust ranking for specific benchmarks like LiveBench, SWE-bench Verified, SWE-bench Pro, HumanEval, and vendor internal evals.

Sources

[1] Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful generally available LLM - VentureBeat (venturebeat.com)
Anthropic is publicly releasing its most powerful large language model yet, Claude Opus 4.7, today - as it continues to keep an even more powerful successor, Mythos, restricted to a small number of external enterprise partners for cybersecurity testing and pa...
[8] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Developers can use claude-opus-4-7 via the Claude API.
[19] GPT-5.5 Spud and GPT Image 2: Complete Guide to OpenAI Next Models in 2026 (pasqualepillitteri.it)
GPT-5.5 Spud and GPT Image 2: Complete Guide to OpenAI Next Models in 2026. Complete guide to GPT-5.5 Spud and GPT Image 2: everything about release date (ChatGPT 5.5 release date), capabilities, benchmarks, competitor comparison and how to test upcoming Op...
[20] OpenAI Spud: GPT-6 Release Between April 14 and May 5, 2026 | Idlen (idlen.io)
2. OpenAI Spud Drops Between April 14 and May 5 — 78% Polymarket, Greg Brockman Says 'Not Incremental': GPT-5.5 or GPT-6? OpenAI Spud Drops Between April 14 and May 5 — 78% Polymarket, Greg Brockman Says 'Not Incremental': GPT-5.5 or GPT-6? Spud, OpenAI's n...
[25] A survey on large language model benchmarks (arxiv.org)
… In this survey, we present a comprehensive review of LLM … The creation of dynamic, non-public benchmarks like LiveBench [100] … of the dataset but also reduces the risk of data leakage. … 2025
[26] From Benchmarks to Real-World Performance: A Data-Driven Assessment of Large Language Models in 2025 (papers.ssrn.com)
… -relevant outcomes across major 2025 LLM systems. … of static benchmarks, including saturation effects, data contamination, and … with clear methods but limited independent replication. … 5991
[36] LiveBench (livebench.ai)
Leaderboard, Details, Code, Data, Paper. GPT-5.4 Thinking xHigh Effort OpenAI 80.28 88.12 77.54 70.00 94.15 79.31 82.63 70.22 . Claude 4.6 Opus Thinking High Effort Anthropic 76.33 88.67 78.18 61.67 89.32 69.89 83.27 63.31 . [Claude 4.5 Opus Thinking High Effort](htt…
[37] LiveBench: A Challenging, Contamination-Limited LLM Benchmark (openreview.net)
TL;DR: LiveBench is a difficult LLM benchmark consisting of contamination-limited tasks that employ verifiable ground truth answers on frequently-updated questions from recent information sources and procedural question generation techniques. We release Liv...
[43] Swe-bench goes live! (arxiv.org)
… contamination from pretraining, we restrict the dataset to issues created between January 1, 2024, and April 20, 2025. … setups on the SWE-bench leaderboard often involve dramatically … 2025
[44] Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? (arxiv.org)
… PRO, a substantially more challenging benchmark that … Overall, SWE-BENCH PRO provides a contamination-resistant … publicly in this paper and will update in the leaderboard. This is … 2025
[45] SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories (arxiv.org)
… benchmarks introduces a critical data contamination risk: most … SWE-bench and its manually curated variant SWE-bench … rather than reasoning, further skewing leaderboard rankings. … 2025
[46] Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering (arxiv.org)
… context, and widespread contamination issues. To understand … on SWE-bench Verified drop to just 23% on SWE-bench Pro, … evaluation methods or reusing existing but often inadequate … 2026
[47] What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair (arxiv.org)
… To carry out our study, we examine each entry in the SWE-Bench leaderboards. … We also observed in Verified several recent submissions (August 2025) with … Data Contamination. Some … 2602
[49] SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark (arxiv.org)
… The SWE-Bench Verified leaderboard is approaching saturation, with the … 2025) pioneered test augmentation for SWE-Bench, … effectiveness on contamination-resistant SWE-Bench Pro … 2026
