Research and fact-check: Claude Opus 4.7 vs GPT-5.5 Spud, Benchmark credibility, contamination risk, and independent replication.
Research answer
Claude Opus 4.7 is verified: Anthropic officially announced it on Apr. 16, 2026 and said it is generally available.[1] In the retrieved evidence, “GPT-5.5 Spud” appeared on third-party pages about upcoming OpenAI models rather than in a primary OpenAI release document, so the evidence does not support a clean Claude-vs-Spud head-to-head yet.[2][3] On benchmark credibility, the strongest signals come from contamination-limited or contamination-resistant public benchmarks such as LiveBench and newer SWE-bench variants, not from release-day charts alone.[4][5][6][7][8]
Key findings
Model status: Claude Opus 4.7 is a real, released Anthropic model.[1] In the retrieved evidence, mentions of “GPT-5.5 Spud” were on third-party pages describing “next” or unreleased OpenAI models.[2][3]
Benchmark credibility: LiveBench was explicitly designed to resist contamination by using recent-source questions, objective ground-truth scoring, and monthly updates.[4] A later benchmark survey says dynamic benchmark designs like LiveBench reduce data-leakage risk.[5]
SWE-bench is useful, but raw leaderboard scores need caution: SWE-bench Live restricts tasks to issues created between Jan. 1, 2024 and Apr. 20, 2025 to reduce pretraining contamination, and its authors note that leaderboard setups can differ substantially.[6] SWE-bench Pro is presented as a more contamination-resistant benchmark for longer-horizon software-engineering tasks.[7] (A minimal sketch of this date-window filtering idea appears after this list.)
Contamination risk remains material: SWE-Bench++ argues that public GitHub-based software benchmarks face critical contamination risk and that solution leakage can skew rankings.[9] A 2026 analysis of SWE-Bench leaderboards also reports recent Verified submissions with data contamination.[10]
Saturation and benchmark gaming are also real risks: one 2026 benchmarking paper says results that look strong on SWE-bench Verified can drop to 23% on SWE-bench Pro.[11] SWE-ABS separately argues that Verified is approaching saturation and can show inflated success rates until tasks are adversarially strengthened.[12]
Independent replication is still thin: a broader 2025 assessment of major LLM evaluations says many claims now have clearer methods but still limited independent replication.[8] Inference: that caution applies to fresh Opus 4.7 launch claims until stronger third-party replications appear.[1][8]
Practical takeaway: if you want the most trustworthy performance signal today, put the most weight on contamination-limited/resistant, publicly inspectable benchmarks and repeated third-party tests, and less weight on vendor launch charts.[4][5][7][8][9][10][11][12]
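To make the date-window idea from the SWE-bench Live finding concrete, here is a minimal, illustrative Python sketch of filtering tasks by issue-creation date so that evaluation items post-date a model's presumed pretraining data. The task records and helper below are placeholders for illustration; this is not SWE-bench Live's actual tooling.

```python
from datetime import datetime, timezone

# Hypothetical task records: in SWE-bench-style datasets each task is tied to a
# GitHub issue with a creation timestamp.
tasks = [
    {"repo": "example/project-a", "issue_id": 101, "created_at": "2023-11-02T10:00:00Z"},
    {"repo": "example/project-b", "issue_id": 202, "created_at": "2024-06-15T08:30:00Z"},
    {"repo": "example/project-c", "issue_id": 303, "created_at": "2025-03-01T12:00:00Z"},
]

# Window mirroring the cited SWE-bench Live description: issues created between
# Jan. 1, 2024 and Apr. 20, 2025, chosen to limit overlap with pretraining data.
WINDOW_START = datetime(2024, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 4, 20, tzinfo=timezone.utc)

def in_window(task: dict) -> bool:
    """Keep only tasks whose source issue was created inside the window."""
    created = datetime.fromisoformat(task["created_at"].replace("Z", "+00:00"))
    return WINDOW_START <= created <= WINDOW_END

filtered = [t for t in tasks if in_window(t)]
print(f"{len(filtered)} of {len(tasks)} tasks fall inside the contamination-limited window")
```

The value is in the design choice the filter encodes, not the code itself: tasks whose solutions could plausibly have appeared in pretraining data are excluded up front, which is what makes date-windowed benchmarks harder to contaminate.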
Evidence notes
I did not retrieve a primary OpenAI announcement, model card, or system card for “GPT-5.5 Spud.”
I also did not retrieve a strong, independent, apples-to-apples replication of Claude Opus 4.7 versus a verified GPT-5.5 Spud model.
[9] “SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories” — https://arxiv.org/abs/2512.17419
Because “GPT-5.5 Spud” was not verified from primary-source material in the retrieved evidence, any direct Opus-4.7-vs-Spud ranking claim should be treated as provisional.[2][3]
Several benchmark-methodology sources here are recent arXiv or SSRN papers rather than final peer-reviewed journal versions.[4][5][6][7][8][9][10][11][12]
Summary
The fact-checked conclusion is asymmetric: Claude Opus 4.7 is real and released, while “GPT-5.5 Spud” is not verified from primary-source material in the evidence I retrieved.[1][2][3] For benchmark credibility, the best current evidence favors contamination-limited or contamination-resistant public evals such as LiveBench and SWE-bench Pro over static or vendor-only charts.[4][5][6][7][8][9][10][11][12] If you want, I can turn this into a stricter buyer’s guide with a simple trust ranking for specific benchmarks like LiveBench, SWE-bench Verified, SWE-bench Pro, HumanEval, and vendor internal evals.
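As a rough illustration of the trust-ranking idea mentioned above, the sketch below combines benchmark scores with hand-assigned trust weights that favor contamination-resistant, publicly inspectable evals over static or vendor-internal ones. All weights and scores are invented placeholders, not measured values; a real buyer's guide would justify each weight from the methodology sources cited above.

```python
# Illustrative only: trust weights and scores are placeholders, not real results.
# The weights encode this report's heuristic: contamination-resistant public
# evals count for more than static or vendor-internal charts.
trust_weights = {
    "LiveBench": 0.9,             # dynamic, contamination-limited
    "SWE-bench Pro": 0.85,        # presented as contamination-resistant
    "SWE-bench Verified": 0.6,    # public, but reportedly approaching saturation
    "HumanEval": 0.4,             # static and widely circulated
    "Vendor internal eval": 0.2,  # not independently inspectable
}

def weighted_signal(scores: dict) -> float:
    """Trust-weighted average of benchmark scores expressed as fractions in [0, 1]."""
    num = sum(trust_weights[b] * s for b, s in scores.items() if b in trust_weights)
    den = sum(trust_weights[b] for b in scores if b in trust_weights)
    return num / den if den else 0.0

# Hypothetical model scorecard (fractions of tasks solved).
example_scores = {"LiveBench": 0.62, "SWE-bench Verified": 0.87, "Vendor internal eval": 0.95}
print(f"trust-weighted signal: {weighted_signal(example_scores):.2f}")
```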
Supporting visuals
Claude Opus 4.7 benchmark full analysis (empirical data): one chart shows benchmark results for Claude Opus 4.7 (released April 2026), highlighting leading performance in seven major rankings, including SWE-bench, SWE-bench Pro, GPQA Diamond reasoning, and Vision multimodal, with notable accuracy percentages and performance improvements. A second comparative bar chart shows Claude Opus 4.7, Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across evaluation benchmarks, with Opus 4.7 leading in several categories.
Source excerpts
Anthropic has announced its latest AI model with Claude Opus 4.7. Claude Opus 4.7 is the latest generally available version of Anthropic’s main AI model with a focus on advanced software development. However, Mythos isn’t generally available like Opus 4.7 since Anthropic is only sharing it with key software platform vendors like A…
The release closes the computer use gap that gave OpenAI its main differentiator in March, layers new multi agent coordination features on top of the existing Agent Teams architecture, and keeps the 1M token context window that Opus 4.6 introduced. The underlying architectures differ (GPT-5.4 treats computer use as a first class capability baked into the model, Opus 4.7 routes it through an integrated tool surface), but the production reliability is now comparable. * GPT-5.4 for rapid prototyping, Opus 4.7 for production code and architectural work. We route to GPT-5.4 for three specific case…
Claude Opus 4.7 launched April 18 2026 but developers are already posting backlash on Reddit and X — arguing nonstop, hallucination loops, safety overfit. Within 24 hours, developer threads on Reddit and X were calling it "legendarily bad." The complaints are specific: the model argues with users to the point of hallucination, fights back against corrections, and produces worse code output than Opus 4.6 on tasks where earlier versions worked cleanly. It is the first major post-training regression backlash Anthropic has faced since the Claude 3 series, and it arrives at the worst possible mome…
— though Anthropic's migration guide flags two breaking changes worth checking before you flip the switch in production (more below). Anthropic's April 16 release reports the following benchmark shifts — all Anthropic-conducted unless otherwise noted:. * Claude Managed Agents Pricing: What You Actually Pay — How Opus 4.7's tokenizer change interacts with session-hour billing in Claude Managed Agents. * [Claude Code vs Verdent: Multi-Ag…
5 days ago - Opus 4.7 didn't just match it — it cleared the score by 6.6 points and pushed SWE-bench Verified past 87% . OpenAI's next move is widely expected to be GPT-5.5 or a Codex-specific variant, but as of this writing (April 17, 2026) GPT-5.4 is
The new AI model promises sharper coding skills, faster performance, and expanded enterprise use as it rolls out to developers and businesses worldwide. According to Anthropic’s official announcement, Opus 4.7 “handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.” Users have reported that they can now hand off their most challenging coding assignments—the kind that once demanded close human supervision—to Opus 4.7 with newfound confidence. For now, Opus 4.7 serves as the flagsh…
Claude 4 Opus vs GPT-5: The Ultimate Developer Benchmark. # Claude 4 Opus vs GPT-5: The Ultimate Developer Benchmark. We tested Claude 4 Opus and GPT-5 across 15 real-world coding tasks. Two titans now dominate the developer AI landscape: Anthropic's Claude 4 Opus and OpenAI's GPT-5. | Task Category | Claude 4 Opus | GPT-5 | Notes |. | Code Refactoring | 4.9 | 4.5 | Claude excels here significantly |. Overall Average: Claude 4 Opus: 4.61 | GPT-5: 4.55. Claude's 200,000 token context window is genuinely useful for:. When I fed both models ~50,000 tokens of codebase context, Claude maintain…
Claude Opus 4.7. Release date: 2026-05-14. Updated: 2026-04-16 16:38:05. Claude Opus 4.7 is an AI model published by Anthropic, released on 2026-05-14, categorized as a reasoning model, with 0.0B parameters and a 1000K-token context length, under a closed-source license. Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators. Learn about our data methodology. ## Model basics. ## API details. No public API pricing yet. ## Benchmark Results. No benchmark data to show. ## Model Overview. Claude Opus 4.7 is the next-generation flagship large language model that Anthropic is preparing to launch, expected in 2026…
Claude Opus 4.7 model announcement by Anthropic, showing the Opus tier with improved reasoning and agentic capabilities. Claude Opus 4.7 is Anthropic's new flagship model, released April 16, 2026. The model string is claude-opus-4-7-20260416.
GPT-5.5 Spud and GPT Image 2: Complete Guide to OpenAI Next Models in 2026. ##### Complete guide to GPT-5.5 Spud and GPT Image 2: everything about release date (ChatGPT 5.5 release date), capabilities, benchmarks, competitor comparison and how to test upcoming OpenAI models early. OpenAI is preparing two major releases for 2026: GPT-5.5 Spud, the successor to GPT-5 with evolved agentic capabilities, and GPT Image 2, the new image generation model that appeared on Chatbot Arena before the official announcement. If you are searching for gpt 5.5, chatgpt 5.5 release date or **g…
OpenAI Spud Drops Between April 14 and May 5 — 78% Polymarket, Greg Brockman Says 'Not Incremental': GPT-5.5 or GPT-6? # OpenAI Spud Drops Between April 14 and May 5 — 78% Polymarket, Greg Brockman Says 'Not Incremental': GPT-5.5 or GPT-6? Spud, OpenAI's next flagship model, launches between April 14 and May 5, 2026. | Spud / GPT-6 | OpenAI | 🔜 April 14 → May 5 | "Not incremental" |. * Spud (OpenAI's next flagship model) drops between April 14 and May 5, 2026 — calculated from pre-training completion on March 24 + standard 3-to-6-week post-training cycle. * Spud is…
OpenAI's GPT-5.5 'Spud' Finishes Pretraining — Greg Brockman Calls It a Massive Leap Toward AGI. OpenAI has completed pretraining on GPT-5.5, internally codenamed 'Spud.' President Greg Brockman says it represents two years of research and a massive qualitative leap in reasoning, coding, and agentic capabilities. | GPT-5.5 "Spud" | OpenAI | Pretraining complete | 47% | Reasoning, coding, agentic AI |. GPT-5.5, codenamed "Spud" internally at OpenAI, is the next frontier model that completed pretraining in late March/early April 2026. The engineering resources and compute freed by sunsetting…
Anthropic Launches Claude Opus 4.7 (Best AI Model Yet). Anthropic just released Claude Opus 4.7, the latest flagship model in the Claude family. ## What Is Claude Opus 4.7?. Claude Opus 4.7 is Anthropic’s most capable model to date, launched on April 16, 2026. The model ID is claude-opus-4-7. ## Claude Opus 4.7 Benchmarks and Performance. One of the most practical improvements is how Opus 4.7 handles sustained coding tasks at different effort levels. Claude Opus 4.7 agentic coding performance by effort level chart showing improvement over Opus 4.6. Opus 4.7 outperforms Opus 4.…
GPT-5.5 Review (Spud) 2026: Everything We Know About OpenAI’s Most Powerful Model Yet. On March 24, 2026, The Information broke a story that reset the entire AI landscape: OpenAI had completed pre-training on a new model internally codenamed “Spud.” CEO Sam Altman told employees it was a “very strong model” that could “really accelerate the economy.” OpenAI President Greg Brockman went further, describing it on the Big Technology podcast as the result of “two years worth of research” that would set a new benchmark for AI models — coining the evocative phrase “big model smell” to cap…
… In this survey, we present a comprehensive review of LLM … The creation of dynamic, non-public benchmarks like LiveBench [100] … of the dataset but also reduces the risk of data leakage. … 2025
… -relevant outcomes across major 2025 LLM systems. … of static benchmarks, including saturation effects, data contamination, and … with clear methods but limited independent replication. … 5991
… SWE-bench-verified benchmark. Recent works extend the … with Claude Sonnet 4.5 as the core LLM, as OpenHands is a … using synthetic or replicated biometric data was expected to take … 2025
… problems of data leakage and contamination in evaluation … , is a contamination-free version of SWE-Bench which evaluates … to measure LLM data contamination for each benchmark. In … 2025
… single-cell study, where we use an off-the-shelf LLM (gemini-2.5-… development of systematic replication benchmarks, where … models o3-mini-2025-01-31 and GPT-4o-2024-08-06 for o3-… 2026
… comparing different approaches of LLM learning as well as … better than Gemini 2.5 Pro on SWE-bench verified [48]. … was LiveBench, a benchmark which limits data contamination … 2025
… code reproduction as a test-time adaptation problem for LLM … We introduce SARE a framework for adapting LLM agents to … it encounters on the SUPER benchmark and the strategies it …
**TL;DR:** LiveBench is a difficult LLM benchmark consisting of contamination-limited tasks that employ verifiable ground truth answers on frequently-updated questions from recent information sources and procedural question generation techniques. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. The noticeable…
When a model’s output unexpectedly includes these tokens, it strongly indicates that the model has memorized … [Flattened table from the source, listing static vs. dynamic benchmarks by task type:] Math: static GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), AIME 2024 (of America, 2024), CNMO 2024 (Society, 2024); dynamic LiveBench (White et al., 2024), UGMathBench (Xu et al., 2025), Mathador-LM (Kurtic et al., 2024). Language: static GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), CLUE (Xu et al., 2020); dynamic LiveBench (White et al., 2024), C2LEVA (Li et al., 2025), ITD (Zhu et al., 2024c). Coding: static HumanEval (Chen et al., 2021), …
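The excerpt above describes a memorization signal: if a model reproduces marker tokens that exist only in a benchmark's held-out material, the benchmark has likely leaked into training data. Below is a minimal, hypothetical sketch of that kind of canary check; the canary strings and sample output are invented for illustration and do not come from any cited paper.

```python
# Hypothetical canary check: benchmark authors embed unique marker strings in
# held-out material; seeing one verbatim in model output suggests memorization.
CANARIES = {
    "canary-3f9a71d2-do-not-train",
    "canary-b04c88ee-do-not-train",
}

def leaked_canaries(model_output: str) -> set:
    """Return any canary strings reproduced verbatim in the model output."""
    return {c for c in CANARIES if c in model_output}

sample_output = "...the answer is 42. canary-3f9a71d2-do-not-train"
hits = leaked_canaries(sample_output)
if hits:
    print(f"possible contamination, canaries found: {sorted(hits)}")
```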
Introducing LiveBench: a benchmark for LLMs designed with test set contamination and objective evaluation in mind. LiveBench has the following properties:.
… contamination from pretraining, we restrict the dataset to issues created between January 1, 2024, and April 20, 2025. … setups on the SWE-bench leaderboard often involve dramatically … 2025
… PRO, a substantially more challenging benchmark that … Overall, SWE-BENCH PRO provides a contamination-resistant … publicly in this paper and will update in the leaderboard. This is … 2025
… benchmarks introduces a critical data contamination risk: most … SWE-bench and its manually curated variant SWE-bench … rather than reasoning, further skewing leaderboard rankings. … 2025
… context, and widespread contamination issues. To understand … on SWE-bench Verified drop to just 23% on SWE-bench Pro, … evaluation methods or reusing existing but often inadequate … 2026
… To carry out our study, we examine each entry in the SWE-Bench leaderboards. … We also observed in Verified several recent submissions (August 2025) with … Data Contamination. Some … 2602
… from the full benchmark; and (2) SWE-Bench Verified, which … 2 Methodology In this section, we describe the methods used to … 2024, SWE-Bench Verified experienced a sudden growth … 2506
… The SWE-Bench Verified leaderboard is approaching saturation, with the … 2025) pioneered test augmentation for SWE-Bench, … effectiveness on contamination-resistant SWE-Bench Pro … 2026
… RQ1: How can we design a contamination-free dataset for … Both agents were listed on the SWE-bench Lite leaderboard, … We observed that 27 out of the 29 tasks from 2024 belonged to … 2025
… based on Aider coding agent1 and a dynamic user leaderboard 23… , highlighting the risks of data contamination, as most issues … We slightly modify Aider 7 and Aider-SWE-bench8 … 2025
SWE-bench Explained: Complete Guide to AI Coding Benchmarks 2025. # SWE-bench Explained: Complete Guide to AI Coding Benchmarks 2026. Unlike simple "write a function" tests, SWE-bench throws AI models into the deep end—real GitHub issues from production codebases with thousands of files, complex dependencies, and ambiguous requirements. Here's your complete guide to understanding SWE-bench, HumanEval, and the benchmarks that determine which AI models truly deliver for software development. ### Top AI Models Ranked by SWE-bench Verified Score. ## How to Test AI Coding Models Yourself. ## Loc…
SWE-bench February 2026 leaderboard update (via) SWE-bench is one of the benchmarks that the labs love to list in their model releases. …
Over time, SWE-Bench leaderboards have expanded beyond Python and now encompass multilingual and stateful agent benchmarks, advanced multi-resource effectiveness metrics, and rigorous validations against data contamination and test insufficiency. * SWE-bench-java-verified: The first officially supported non-Python leaderboard, evaluating 91 curated Java issue-patch pairs with Dockerized build/test harnesses (Zan et al., 2024). * Pass@k: For code-generation benchmarks sampling k completions per issue, the proportion of tasks for which any candidate passes the entire test suite (e.g., p…
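The Pass@k metric in the excerpt above is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n candidates per task, count the c that pass, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A small sketch, assuming per-task (n, c) counts are already available:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of k
    samples drawn from n generated candidates passes, given c passing candidates."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results: (candidates generated, candidates that passed).
task_results = [(10, 3), (10, 0), (10, 7)]
scores = [pass_at_k(n, c, k=1) for n, c in task_results]
print(f"mean pass@1 over {len(task_results)} tasks: {sum(scores) / len(scores):.2f}")
```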
SWE-Bench Live Q2 2026 leaderboard analysis — what the scores actually predict, delivery velocity vs test pass rate, and why some top models underperform. Scores Predict Scores: A model that posts 70 percent on SWE-Bench Verified predicts another benchmark score, not how quickly your agency ships billable features. This guide walks through what SWE-Bench Live actually measures, how the Live, Verified, and Pro variants differ, what the Q2 2026 numbers show, and where benchmark position genuinely predicts real-world behaviour versus where it breaks down. * Frontier closed models — Claude Op…
SWE-Bench Verified Leaderboard: April 2026. SWE-bench Verified tests AI models on 500 real GitHub issues from popular Python repositories.