보고서공개 URL2026년 4월 26일43 소스

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

제공된 근거만으로는 네 모델을 동등한 수준으로 “종합 벤치마크 비교”하기에는 자료가 크게 부족합니다. 확인 가능한 정량 벤치마크는 사실상 Claude Opus 4.7에 집중되어 있고, GPT 5.5·DeepSeek V4·Kimi K2.6은 제공 근거 안에서 직접 비교 가능한 공식 점수나 일관된 벤치마크 결과가 없습니다. 따라서 아래 비교는 “확인 가능한 것”과 “증거 부족”을 명확히 나...

주요 시사점

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다.
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다.
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다.
제공된 근거만으로는 네 모델을 동등한 수준으로 “종합 벤치마크 비교”하기에는 자료가 크게 부족합니다. 확인 가능한 정량 벤치마크는 사실상 Claude Opus 4.7에 집중되어 있고, GPT-5.5·DeepSeek V4·Kimi K2.6은 제공 근거 안에서 직접 비교 가능한 공식 점수나 일관된 벤치마크 결과가 없습니다. 따라서 아래 비교는 “확인 가능한 것”과 “증거 부족”을 명확히 나누어 정리합니다.
## Key findings

연구 답변

제공된 근거만으로는 네 모델을 동등한 수준으로 “종합 벤치마크 비교”하기에는 자료가 크게 부족합니다. 확인 가능한 정량 벤치마크는 사실상 Claude Opus 4.7에 집중되어 있고, GPT-5.5·DeepSeek V4·Kimi K2.6은 제공 근거 안에서 직접 비교 가능한 공식 점수나 일관된 벤치마크 결과가 없습니다. 따라서 아래 비교는 “확인 가능한 것”과 “증거 부족”을 명확히 나누어 정리합니다.

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

현재 제공된 증거만 놓고 보면 Claude Opus 4.7이 네 모델 중 유일하게 구체적이고 비교 가능한 성능 자료를 갖고 있습니다. Claude Opus 4.7은 SWE-bench Verified 82.4%, BenchLM provisional 2위/110개 모델, overall score 97/100으로 제시되어 코딩과 종합 평가에서 매우 강한 모델로 볼 수 있습니다 ^[6]^[7].

반면 GPT-5.5, DeepSeek V4, Kimi K2.6은 제공 근거 안에서 직접 벤치마크 점수가 없으므로 Claude Opus 4.7보다 높다거나 낮다고 결론낼 수 없습니다. Insufficient evidence.

시각적 지원

Google Gemini in 2026: Model Overview, NotebookLM, and Workspace Integration

Kimi K2.5 by Moonshot AI: The Open Chinese Model That Broke Into the Elite

claude code buddy terminal pet companion activation guide en image 0 图示

openclaw openai compatible vs claude native config guide en image 0 图示

GLM-5 by Z.ai in 2026: The Chinese Model That Pretends to Be Claude

google gemini api free tier changes april 2026 guide en image 0 图示

DeepSeek in 2026: A Review of the Budget Flagship Among AI Models

Grok by xAI in 2026: Elon Musk's AI with X and Tesla Integration

YandexGPT in 2026: A Review of Russia's AI Platform for Business

GenAI Tools Comparison 2026: Which AI Should a Manager ChooseGenAI Tools Comparison 2026: Which AI Should a Manager Choose?

ChatGPT in 2026: What Changed and Where Managers Should Start

GigaChat in 2026: Honest Review – Is It Worth Using for WorkGigaChat in 2026: Honest Review – Is It Worth Using for Work?

claude opus 4 6 vs sonnet 4 6 comparison guide en image 0 图示

Qwen by Alibaba in 2026: Free Open-Source AI for Business

everything claude code plugin guide en image 0 图示

claude 4 6 fast mode guide en image 0 图示

The content presents a series of bar graphs comparing the performance metrics, such as accuracy and rating, of various models including DeepThe content presents a series of bar graphs comparing the performance metrics, such as accuracy and rating, of various models including DeepSeek-K, GPT-4, and Qwen across multiple benchmarks, specifically AIME 2024, Math, Codeforces, LiveCodeBench, GPQA Diamond, and ZebraLogic.

연구 대화

당신2026년 4월 26일

연구문제

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

Studio Global AI2026년 4월 26일40 소스

인용 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

사람들은 또한 묻습니다.

"Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요."에 대한 짧은 대답은 무엇입니까?

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다.

먼저 검증할 핵심 포인트는 무엇인가요?

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다. Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다.

실무에서는 다음으로 무엇을 해야 합니까?

Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.

다음에는 어떤 관련 주제를 탐구해야 할까요?

다른 각도와 추가 인용을 보려면 "DeepSeek V4와 Kimi K2.6의 벤치마크를 조사해줘. 코딩, 콘텐츠 제작, 번역에서는 누가 이길까?"으로 계속하세요.

관련 페이지 열기

이것을 무엇과 비교해야 합니까?

"GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。"에 대해 이 답변을 대조 확인하세요.

관련 페이지 열기

연구를 계속하세요

DeepSeek V4와 Kimi K2.6의 벤치마크를 조사해줘. 코딩, 콘텐츠 제작, 번역에서는 누가 이길까?

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。

Deep research & compare GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducing-gpt-5-5 請再搜尋更多相關資料，並就

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducin...

출처

[1] Claude Opus 4.7 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude Opus 4.7 4/8/2026 Meta Muse Spark 4/2/2026 Google Gemma...
[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[4] Claude Opus 4.7 Benchmarks Explained - Vellumvellum.ai
Apr 16, 2026•16 min•ByNicolas Zeeb Guides CONTENTS Key observations of reported benchmarks Coding capabilities SWE-bench Verified SWE-bench Pro Terminal-Bench 2.0 Agentic capabilities MCP-Atlas (Scaled tool use) Finance Agent v1.1 OSWorld-Verified (Computer...
[5] Claude Opus 4.7: Complete Guide to Features, Benchmarks ...nxcode.io
April 16, 2026 — Anthropic released Claude Opus 4.7 today, and the story is not about a new pricing tier or a dramatic architectural overhaul. It is about targeted, measurable improvements in the two areas that matter most for production use: coding and vis...
[6] How to Compare AI Models in 2026: Benchmarks for Summarization ...mysummit.school
1. Academic breadth (MMLU and GPQA Diamond) Everyone used to look at MMLU (tests across 57 disciplines). But in 2026, this test has become a “basic hygiene minimum.” If a model scores below 85–90%, it simply doesn’t belong in the top tier. Today the gold st...
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limitskili-technology.com
Image 2: Kili Technology.png) Kili Technology · Apr 13, 2026 Image 3: AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough Table of contents Introduction What Are the Most Important AI Benchmarks in 2026? General knowledge and reasoni...
[9] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ...help.apiyi.com
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[10] The Definitive LLM Selection & Benchmarks Guideiternal.ai
Model Provider Parameters License Key Strengths Best Benchmarks --- --- --- Qwen 3.5 (397B-A17B) Alibaba 397B (17B active MoE) Apache 2.0 Reasoning, math, multilingual (201 langs), native vision, 256K context; 19x faster decoding GPQA 88.4%; AIME 91.3%; SWE...
[11] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safetymashable.com
Claude Opus 4.7 is available now. Credit: Samuel Boivin/NurPhoto via Getty Images Anthropic has been shipping products and making news at a blistering pace in 2026, and on Thursday, the AI company announced the launch of Claude Opus 4.7. Claude Opus 4.7 is...
[12] Claude Opus 4.7 results: early benchmarks, real-world feedback ...boringbot.substack.com
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[13] [PDF] Technical Performance - Stanford HAIhai.stanford.edu
Technical Performance Benchmarks vs. Human Performance 76 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0% 20% 40% 60% 80% 100% 120% Image classiǇcation (ImageNet Top-5) Visual reasoning (VQA) English language understanding (SuperGLU...
[14] AIME-2026 | EvalScope - Read the Docsevalscope.readthedocs.io
Skip to content Logo LogoEvalScope AIME-2026 Overview AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the...
[15] DeepSeek V4: Features, Benchmarks, and Comparisonsdatacamp.com
DeepSeek V4 Benchmarks According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Facehuggingface.co
Opus-4.6 Max GPT-5.4 xHigh Gemini-3.1-Pro High K2.6 Thinking GLM-5.1 Thinking DS-V4-Pro Max :---: :---: :---: Knowledge & Reasoning MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5 SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9 Chinese-SimpleQA (Pass@1...
[17] DeepSeek-V4-Pro-Max: Pricing, Benchmarks & Performancellm-stats.com
Benchmarks GPQA MMLU MMLU-Pro AIME 2025 MATH HumanEval MMMU LiveCodeBench IFEval GSM8K SWE-Bench Verified Models Gemini 3 Pro Grok-4 Heavy GPT-5.1 Grok-4 Qwen3-235B-A22B-Thinking DeepSeek-R1-0528 GLM-4.6 GPT OSS 120B Resources Playground Blog News Community...
[18] What MMLU, ARC-AGI, SWE-bench, and GPQA Actually Mean in 2026lumichats.com
MATH and AIME: For JEE and Competitive Exam Students MATH benchmark scores (2026): GPT-5.4 approximately 95%. o3 approximately 97%. Claude Sonnet 4.6 approximately 93%. AIME 2025: o3 solved 23.3/30 problems — a level that would qualify for USAMO. Genuine co...
[19] DeepSeek V4 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/23/2026 DeepSeek DeepSeek V4 4/23/2026 OpenAI GPT 5.5 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude...
[20] DeepSeek V4: Architecture, Benchmarks, and API Guide (2026)morphllm.com
Benchmark DeepSeek V4 (claimed) Claude Opus 4.6 GPT-5.3 Codex DeepSeek V3.2 --- --- SWE-bench Verified 80-85% 80.8% 77.3% 65% HumanEval 90% 88% 85% 80% Context window 1M tokens 1M tokens (beta) 128K tokens 128K tokens Active params 32B N/A (dense) N/A (dens...
[21] Deepseek v4 models are out and here are benchmarks !( 4 ...reddit.com
Local hosting needs planning but pays off for privacy and removing token limits. Start by testing a compact quantized model on the target hardware, pick a backend that matches your team needs (easy UX vs deep control), and design predictable latency and mod...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elonanonets.com
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[23] GPT-5.5 System Card - Deployment Safety Hub - OpenAIdeploymentsafety.openai.com
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[24] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Statsllm-stats.com
Leaderboards AI Leaderboards LLM Leaderboard Open LLM Leaderboard Best AI for Coding Best AI for Math Best AI for Image Generation Best AI for Writing Arenas All Arenas Chat Arena Coding Arena Image Arena Video Arena Audio Arena Trading Arena AI Image Gener...
[25] GPT-5.5: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
GPT-5.5: Pricing, Benchmarks & Performance Image 1: LLM Stats LogoLLM Stats Leaderboards Benchmarks Compare Playground Arenas Gateway Services Search⌘K Sign in Toggle theme NEW•NEW•NEW•NEW• Make AI phone calls with one API call CallingBox Start for free 1....
[26] GPT-5.5: The Complete Guide (2026) - o-mega | AIo-mega.ai
The Intelligence score (25%) reflects composite reasoning across MMLU, GPQA Diamond, Humanity's Last Exam, FrontierMath, and the Artificial Analysis Intelligence Index. This is the broadest measure of raw model capability across knowledge domains. Coding (2...
[27] What Is GPT-5.5 for Builders in 2026? | WaveSpeedAI Blogwavespeed.ai
Item Status --- Release date: April 23, 2026 Confirmed — OpenAI official Live in ChatGPT (Plus/Pro/Business/Enterprise) Confirmed — OpenAI official Live in Codex (Plus/Pro/Business/Enterprise/Edu/Go) Confirmed — OpenAI official 400K context in Codex Confirm...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAIopenai.com
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. OnGDPval⁠⁠, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[30] BenchLM.ai: LLM Leaderboard 2026 — Compare 220 AI Models ...benchlm.ai
The BenchLM LLM leaderboard 2026 provisionally ranks 115+ models and tracks 220+ large language models side by side across 178 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you nee...
[31] GPT 5.5 - Vals AIvals.ai
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[32] LLM Leaderboard 2026 — Compare Top AI Models - Vellumvellum.ai
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[33] Model Drop: GPT-5.5handyai.substack.com
Headline benchmarks: Terminal-Bench 2.0 at 82.7% (Opus 4.7: 69.4%, Gemini 3.1 Pro: 68.5%). SWE-Bench Pro at 58.6% (Opus 4.7 still leads at 64.3%). OpenAI’s internal Expert-SWE eval, where tasks have a 20-hour median human completion time, at 73.1% (up from...
[34] moonshotai/Kimi-K2.6 - Hugging Facehuggingface.co
Proactive & Open Orchestration: For autonomous tasks, K2.6 demonstrates strong performance in powering persistent, 24/7 background agents that ... 6 days ago
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Accessgmicloud.ai
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ... 3 days ago
[36] Kimi K2.6: The new leading open weights model - Artificial Analysisartificialanalysis.ai
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ... 5 days ago
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard ( 12 of 115) with 27 published benchmark scores currently shown ... 6 days ago
[38] Kimi K2.6: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical ... 6 days ago
[39] Kimi K2.6 - Vals AIvals.ai
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.
[40] Kimi K2 is the large language model series developed by Moonshot ...github.com
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters.
[41] Moonshot AI releases Kimi K2.6 with long-horizon coding and agent ...facebook.com
- Benchmarks: 54.0 on HLE-Full with tools (leads GPT- 5.4, Claude Opus 4.6, Gemini 3.1 Pro), 58.6 on SWE- Bench Pro, 89.6 on LiveCodeBench (v6), ... 5 days ago
[42] Meet Kimi K2.6: Advancing Open-Source Coding - Redditreddit.com
What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, ... 6 days ago
[43] Last night, Moonshot AI dropped Kimi K2.6. - Saurabh khansaurabhkhan.medium.com
But K2.6 matched or outperformed GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro across Humanity Last Exam, SWE-Bench Pro, and DeepSearchQA. Those ... 5 days ago

보고서공개 URL2026년 4월 26일43 소스

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

주요 시사점

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다.
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다.
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다.
제공된 근거만으로는 네 모델을 동등한 수준으로 “종합 벤치마크 비교”하기에는 자료가 크게 부족합니다. 확인 가능한 정량 벤치마크는 사실상 Claude Opus 4.7에 집중되어 있고, GPT-5.5·DeepSeek V4·Kimi K2.6은 제공 근거 안에서 직접 비교 가능한 공식 점수나 일관된 벤치마크 결과가 없습니다. 따라서 아래 비교는 “확인 가능한 것”과 “증거 부족”을 명확히 나누어 정리합니다.
## Key findings

연구 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

시각적 지원

연구 대화

당신2026년 4월 26일

연구문제

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

Studio Global AI2026년 4월 26일40 소스

인용 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

사람들은 또한 묻습니다.

"Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요."에 대한 짧은 대답은 무엇입니까?

먼저 검증할 핵심 포인트는 무엇인가요?

실무에서는 다음으로 무엇을 해야 합니까?

Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.

다음에는 어떤 관련 주제를 탐구해야 할까요?

다른 각도와 추가 인용을 보려면 "DeepSeek V4와 Kimi K2.6의 벤치마크를 조사해줘. 코딩, 콘텐츠 제작, 번역에서는 누가 이길까?"으로 계속하세요.

관련 페이지 열기

이것을 무엇과 비교해야 합니까?

"GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。"에 대해 이 답변을 대조 확인하세요.

관련 페이지 열기

연구를 계속하세요

DeepSeek V4와 Kimi K2.6의 벤치마크를 조사해줘. 코딩, 콘텐츠 제작, 번역에서는 누가 이길까?

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。

Deep research & compare GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducing-gpt-5-5 請再搜尋更多相關資料，並就

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducin...

출처

[1] Claude Opus 4.7 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude Opus 4.7 4/8/2026 Meta Muse Spark 4/2/2026 Google Gemma...
[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[4] Claude Opus 4.7 Benchmarks Explained - Vellumvellum.ai
Apr 16, 2026•16 min•ByNicolas Zeeb Guides CONTENTS Key observations of reported benchmarks Coding capabilities SWE-bench Verified SWE-bench Pro Terminal-Bench 2.0 Agentic capabilities MCP-Atlas (Scaled tool use) Finance Agent v1.1 OSWorld-Verified (Computer...
[5] Claude Opus 4.7: Complete Guide to Features, Benchmarks ...nxcode.io
April 16, 2026 — Anthropic released Claude Opus 4.7 today, and the story is not about a new pricing tier or a dramatic architectural overhaul. It is about targeted, measurable improvements in the two areas that matter most for production use: coding and vis...
[6] How to Compare AI Models in 2026: Benchmarks for Summarization ...mysummit.school
1. Academic breadth (MMLU and GPQA Diamond) Everyone used to look at MMLU (tests across 57 disciplines). But in 2026, this test has become a “basic hygiene minimum.” If a model scores below 85–90%, it simply doesn’t belong in the top tier. Today the gold st...
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limitskili-technology.com
Image 2: Kili Technology.png) Kili Technology · Apr 13, 2026 Image 3: AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough Table of contents Introduction What Are the Most Important AI Benchmarks in 2026? General knowledge and reasoni...
[9] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ...help.apiyi.com
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[10] The Definitive LLM Selection & Benchmarks Guideiternal.ai
Model Provider Parameters License Key Strengths Best Benchmarks --- --- --- Qwen 3.5 (397B-A17B) Alibaba 397B (17B active MoE) Apache 2.0 Reasoning, math, multilingual (201 langs), native vision, 256K context; 19x faster decoding GPQA 88.4%; AIME 91.3%; SWE...
[11] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safetymashable.com
Claude Opus 4.7 is available now. Credit: Samuel Boivin/NurPhoto via Getty Images Anthropic has been shipping products and making news at a blistering pace in 2026, and on Thursday, the AI company announced the launch of Claude Opus 4.7. Claude Opus 4.7 is...
[12] Claude Opus 4.7 results: early benchmarks, real-world feedback ...boringbot.substack.com
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[13] [PDF] Technical Performance - Stanford HAIhai.stanford.edu
Technical Performance Benchmarks vs. Human Performance 76 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0% 20% 40% 60% 80% 100% 120% Image classiǇcation (ImageNet Top-5) Visual reasoning (VQA) English language understanding (SuperGLU...
[14] AIME-2026 | EvalScope - Read the Docsevalscope.readthedocs.io
Skip to content Logo LogoEvalScope AIME-2026 Overview AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the...
[15] DeepSeek V4: Features, Benchmarks, and Comparisonsdatacamp.com
DeepSeek V4 Benchmarks According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Facehuggingface.co
Opus-4.6 Max GPT-5.4 xHigh Gemini-3.1-Pro High K2.6 Thinking GLM-5.1 Thinking DS-V4-Pro Max :---: :---: :---: Knowledge & Reasoning MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5 SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9 Chinese-SimpleQA (Pass@1...
[17] DeepSeek-V4-Pro-Max: Pricing, Benchmarks & Performancellm-stats.com
Benchmarks GPQA MMLU MMLU-Pro AIME 2025 MATH HumanEval MMMU LiveCodeBench IFEval GSM8K SWE-Bench Verified Models Gemini 3 Pro Grok-4 Heavy GPT-5.1 Grok-4 Qwen3-235B-A22B-Thinking DeepSeek-R1-0528 GLM-4.6 GPT OSS 120B Resources Playground Blog News Community...
[18] What MMLU, ARC-AGI, SWE-bench, and GPQA Actually Mean in 2026lumichats.com
MATH and AIME: For JEE and Competitive Exam Students MATH benchmark scores (2026): GPT-5.4 approximately 95%. o3 approximately 97%. Claude Sonnet 4.6 approximately 93%. AIME 2025: o3 solved 23.3/30 problems — a level that would qualify for USAMO. Genuine co...
[19] DeepSeek V4 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/23/2026 DeepSeek DeepSeek V4 4/23/2026 OpenAI GPT 5.5 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude...
[20] DeepSeek V4: Architecture, Benchmarks, and API Guide (2026)morphllm.com
Benchmark DeepSeek V4 (claimed) Claude Opus 4.6 GPT-5.3 Codex DeepSeek V3.2 --- --- SWE-bench Verified 80-85% 80.8% 77.3% 65% HumanEval 90% 88% 85% 80% Context window 1M tokens 1M tokens (beta) 128K tokens 128K tokens Active params 32B N/A (dense) N/A (dens...
[21] Deepseek v4 models are out and here are benchmarks !( 4 ...reddit.com
Local hosting needs planning but pays off for privacy and removing token limits. Start by testing a compact quantized model on the target hardware, pick a backend that matches your team needs (easy UX vs deep control), and design predictable latency and mod...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elonanonets.com
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[23] GPT-5.5 System Card - Deployment Safety Hub - OpenAIdeploymentsafety.openai.com
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[24] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Statsllm-stats.com
Leaderboards AI Leaderboards LLM Leaderboard Open LLM Leaderboard Best AI for Coding Best AI for Math Best AI for Image Generation Best AI for Writing Arenas All Arenas Chat Arena Coding Arena Image Arena Video Arena Audio Arena Trading Arena AI Image Gener...
[25] GPT-5.5: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
GPT-5.5: Pricing, Benchmarks & Performance Image 1: LLM Stats LogoLLM Stats Leaderboards Benchmarks Compare Playground Arenas Gateway Services Search⌘K Sign in Toggle theme NEW•NEW•NEW•NEW• Make AI phone calls with one API call CallingBox Start for free 1....
[26] GPT-5.5: The Complete Guide (2026) - o-mega | AIo-mega.ai
The Intelligence score (25%) reflects composite reasoning across MMLU, GPQA Diamond, Humanity's Last Exam, FrontierMath, and the Artificial Analysis Intelligence Index. This is the broadest measure of raw model capability across knowledge domains. Coding (2...
[27] What Is GPT-5.5 for Builders in 2026? | WaveSpeedAI Blogwavespeed.ai
Item Status --- Release date: April 23, 2026 Confirmed — OpenAI official Live in ChatGPT (Plus/Pro/Business/Enterprise) Confirmed — OpenAI official Live in Codex (Plus/Pro/Business/Enterprise/Edu/Go) Confirmed — OpenAI official 400K context in Codex Confirm...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAIopenai.com
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. OnGDPval⁠⁠, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[30] BenchLM.ai: LLM Leaderboard 2026 — Compare 220 AI Models ...benchlm.ai
The BenchLM LLM leaderboard 2026 provisionally ranks 115+ models and tracks 220+ large language models side by side across 178 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you nee...
[31] GPT 5.5 - Vals AIvals.ai
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[32] LLM Leaderboard 2026 — Compare Top AI Models - Vellumvellum.ai
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[33] Model Drop: GPT-5.5handyai.substack.com
Headline benchmarks: Terminal-Bench 2.0 at 82.7% (Opus 4.7: 69.4%, Gemini 3.1 Pro: 68.5%). SWE-Bench Pro at 58.6% (Opus 4.7 still leads at 64.3%). OpenAI’s internal Expert-SWE eval, where tasks have a 20-hour median human completion time, at 73.1% (up from...
[34] moonshotai/Kimi-K2.6 - Hugging Facehuggingface.co
Proactive & Open Orchestration: For autonomous tasks, K2.6 demonstrates strong performance in powering persistent, 24/7 background agents that ... 6 days ago
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Accessgmicloud.ai
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ... 3 days ago
[36] Kimi K2.6: The new leading open weights model - Artificial Analysisartificialanalysis.ai
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ... 5 days ago
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard ( 12 of 115) with 27 published benchmark scores currently shown ... 6 days ago
[38] Kimi K2.6: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical ... 6 days ago
[39] Kimi K2.6 - Vals AIvals.ai
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.
[40] Kimi K2 is the large language model series developed by Moonshot ...github.com
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters.
[41] Moonshot AI releases Kimi K2.6 with long-horizon coding and agent ...facebook.com
- Benchmarks: 54.0 on HLE-Full with tools (leads GPT- 5.4, Claude Opus 4.6, Gemini 3.1 Pro), 58.6 on SWE- Bench Pro, 89.6 on LiveCodeBench (v6), ... 5 days ago
[42] Meet Kimi K2.6: Advancing Open-Source Coding - Redditreddit.com
What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, ... 6 days ago
[43] Last night, Moonshot AI dropped Kimi K2.6. - Saurabh khansaurabhkhan.medium.com
But K2.6 matched or outperformed GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro across Humanity Last Exam, SWE-Bench Pro, and DeepSearchQA. Those ... 5 days ago

보고서공개 URL2026년 4월 26일43 소스

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

주요 시사점

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다.
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다.
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다.
제공된 근거만으로는 네 모델을 동등한 수준으로 “종합 벤치마크 비교”하기에는 자료가 크게 부족합니다. 확인 가능한 정량 벤치마크는 사실상 Claude Opus 4.7에 집중되어 있고, GPT-5.5·DeepSeek V4·Kimi K2.6은 제공 근거 안에서 직접 비교 가능한 공식 점수나 일관된 벤치마크 결과가 없습니다. 따라서 아래 비교는 “확인 가능한 것”과 “증거 부족”을 명확히 나누어 정리합니다.
## Key findings

연구 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

시각적 지원

연구 대화

당신2026년 4월 26일

연구문제

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

Studio Global AI2026년 4월 26일40 소스

인용 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

사람들은 또한 묻습니다.

"Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요."에 대한 짧은 대답은 무엇입니까?

먼저 검증할 핵심 포인트는 무엇인가요?

실무에서는 다음으로 무엇을 해야 합니까?

Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다.

다음에는 어떤 관련 주제를 탐구해야 할까요?

다른 각도와 추가 인용을 보려면 "DeepSeek V4와 Kimi K2.6의 벤치마크를 조사해줘. 코딩, 콘텐츠 제작, 번역에서는 누가 이길까?"으로 계속하세요.

관련 페이지 열기

이것을 무엇과 비교해야 합니까?

"GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。"에 대해 이 답변을 대조 확인하세요.

관련 페이지 열기

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducing-gpt-5-5 請再搜尋更多相關資料，並就

請閱讀 Claude Opus 4.7 同 GPT 5.5 嘅介紹： https://www.anthropic.com/news/claude-opus-4-7 https://openai.com/index/introducin...

출처

[1] Claude Opus 4.7 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude Opus 4.7 4/8/2026 Meta Muse Spark 4/2/2026 Google Gemma...
[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[4] Claude Opus 4.7 Benchmarks Explained - Vellumvellum.ai
Apr 16, 2026•16 min•ByNicolas Zeeb Guides CONTENTS Key observations of reported benchmarks Coding capabilities SWE-bench Verified SWE-bench Pro Terminal-Bench 2.0 Agentic capabilities MCP-Atlas (Scaled tool use) Finance Agent v1.1 OSWorld-Verified (Computer...
[5] Claude Opus 4.7: Complete Guide to Features, Benchmarks ...nxcode.io
April 16, 2026 — Anthropic released Claude Opus 4.7 today, and the story is not about a new pricing tier or a dramatic architectural overhaul. It is about targeted, measurable improvements in the two areas that matter most for production use: coding and vis...
[6] How to Compare AI Models in 2026: Benchmarks for Summarization ...mysummit.school
1. Academic breadth (MMLU and GPQA Diamond) Everyone used to look at MMLU (tests across 57 disciplines). But in 2026, this test has become a “basic hygiene minimum.” If a model scores below 85–90%, it simply doesn’t belong in the top tier. Today the gold st...
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limitskili-technology.com
Image 2: Kili Technology.png) Kili Technology · Apr 13, 2026 Image 3: AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough Table of contents Introduction What Are the Most Important AI Benchmarks in 2026? General knowledge and reasoni...
[9] Claude Opus 4.7 Benchmark Full Analysis: Empirical Data Leading ...help.apiyi.com
Q1: What is Claude Opus 4.7? Claude Opus 4.7 is the flagship Large Language Model released by Anthropic on April 16, 2026. It leads in multiple benchmarks, including coding (SWE-bench Verified 87.6%), Agent tool invocation, and scientific reasoning (GPQA Di...
[10] The Definitive LLM Selection & Benchmarks Guideiternal.ai
Model Provider Parameters License Key Strengths Best Benchmarks --- --- --- Qwen 3.5 (397B-A17B) Alibaba 397B (17B active MoE) Apache 2.0 Reasoning, math, multilingual (201 langs), native vision, 256K context; 19x faster decoding GPQA 88.4%; AIME 91.3%; SWE...
[11] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safetymashable.com
Claude Opus 4.7 is available now. Credit: Samuel Boivin/NurPhoto via Getty Images Anthropic has been shipping products and making news at a blistering pace in 2026, and on Thursday, the AI company announced the launch of Claude Opus 4.7. Claude Opus 4.7 is...
[12] Claude Opus 4.7 results: early benchmarks, real-world feedback ...boringbot.substack.com
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[13] [PDF] Technical Performance - Stanford HAIhai.stanford.edu
Technical Performance Benchmarks vs. Human Performance 76 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0% 20% 40% 60% 80% 100% 120% Image classiǇcation (ImageNet Top-5) Visual reasoning (VQA) English language understanding (SuperGLU...
[14] AIME-2026 | EvalScope - Read the Docsevalscope.readthedocs.io
Skip to content Logo LogoEvalScope AIME-2026 Overview AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the...
[15] DeepSeek V4: Features, Benchmarks, and Comparisonsdatacamp.com
DeepSeek V4 Benchmarks According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Facehuggingface.co
Opus-4.6 Max GPT-5.4 xHigh Gemini-3.1-Pro High K2.6 Thinking GLM-5.1 Thinking DS-V4-Pro Max :---: :---: :---: Knowledge & Reasoning MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5 SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9 Chinese-SimpleQA (Pass@1...
[17] DeepSeek-V4-Pro-Max: Pricing, Benchmarks & Performancellm-stats.com
Benchmarks GPQA MMLU MMLU-Pro AIME 2025 MATH HumanEval MMMU LiveCodeBench IFEval GSM8K SWE-Bench Verified Models Gemini 3 Pro Grok-4 Heavy GPT-5.1 Grok-4 Qwen3-235B-A22B-Thinking DeepSeek-R1-0528 GLM-4.6 GPT OSS 120B Resources Playground Blog News Community...
[18] What MMLU, ARC-AGI, SWE-bench, and GPQA Actually Mean in 2026lumichats.com
MATH and AIME: For JEE and Competitive Exam Students MATH benchmark scores (2026): GPT-5.4 approximately 95%. o3 approximately 97%. Claude Sonnet 4.6 approximately 93%. AIME 2025: o3 solved 23.3/30 problems — a level that would qualify for USAMO. Genuine co...
[19] DeepSeek V4 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/23/2026 DeepSeek DeepSeek V4 4/23/2026 OpenAI GPT 5.5 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude...
[20] DeepSeek V4: Architecture, Benchmarks, and API Guide (2026)morphllm.com
Benchmark DeepSeek V4 (claimed) Claude Opus 4.6 GPT-5.3 Codex DeepSeek V3.2 --- --- SWE-bench Verified 80-85% 80.8% 77.3% 65% HumanEval 90% 88% 85% 80% Context window 1M tokens 1M tokens (beta) 128K tokens 128K tokens Active params 32B N/A (dense) N/A (dens...
[21] Deepseek v4 models are out and here are benchmarks !( 4 ...reddit.com
Local hosting needs planning but pays off for privacy and removing token limits. Start by testing a compact quantized model on the target hardware, pick a backend that matches your team needs (easy UX vs deep control), and design predictable latency and mod...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elonanonets.com
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[23] GPT-5.5 System Card - Deployment Safety Hub - OpenAIdeploymentsafety.openai.com
We measure GPT-5.5’s controllability by running CoT-Control, an evaluation suite described in (Yueh-Han, 2026 ) that tracks the model’s ability to follow user instructions about their CoT. CoT-Control includes over 13,000 tasks built from established benchm...
[24] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Statsllm-stats.com
Leaderboards AI Leaderboards LLM Leaderboard Open LLM Leaderboard Best AI for Coding Best AI for Math Best AI for Image Generation Best AI for Writing Arenas All Arenas Chat Arena Coding Arena Image Arena Video Arena Audio Arena Trading Arena AI Image Gener...
[25] GPT-5.5: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
GPT-5.5: Pricing, Benchmarks & Performance Image 1: LLM Stats LogoLLM Stats Leaderboards Benchmarks Compare Playground Arenas Gateway Services Search⌘K Sign in Toggle theme NEW•NEW•NEW•NEW• Make AI phone calls with one API call CallingBox Start for free 1....
[26] GPT-5.5: The Complete Guide (2026) - o-mega | AIo-mega.ai
The Intelligence score (25%) reflects composite reasoning across MMLU, GPQA Diamond, Humanity's Last Exam, FrontierMath, and the Artificial Analysis Intelligence Index. This is the broadest measure of raw model capability across knowledge domains. Coding (2...
[27] What Is GPT-5.5 for Builders in 2026? | WaveSpeedAI Blogwavespeed.ai
Item Status --- Release date: April 23, 2026 Confirmed — OpenAI official Live in ChatGPT (Plus/Pro/Business/Enterprise) Confirmed — OpenAI official Live in Codex (Plus/Pro/Business/Enterprise/Edu/Go) Confirmed — OpenAI official 400K context in Codex Confirm...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAIopenai.com
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. OnGDPval⁠⁠, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[30] BenchLM.ai: LLM Leaderboard 2026 — Compare 220 AI Models ...benchlm.ai
The BenchLM LLM leaderboard 2026 provisionally ranks 115+ models and tracks 220+ large language models side by side across 178 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you nee...
[31] GPT 5.5 - Vals AIvals.ai
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[32] LLM Leaderboard 2026 — Compare Top AI Models - Vellumvellum.ai
93.6% GPT-5.5 92.4% GPT 5.2 91.9% Gemini 3 Pro Best in Reasoning (GPQA Diamond) Model Score --- Claude 3 Opus 95.4% Claude Opus 4.7 94.2% GPT-5.5 93.6% GPT 5.2 92.4% Gemini 3 Pro 91.9% Best in High School Math (AIME 2025) 100%96%93%89%86% 100% Gemini 3 Pro...
[33] Model Drop: GPT-5.5handyai.substack.com
Headline benchmarks: Terminal-Bench 2.0 at 82.7% (Opus 4.7: 69.4%, Gemini 3.1 Pro: 68.5%). SWE-Bench Pro at 58.6% (Opus 4.7 still leads at 64.3%). OpenAI’s internal Expert-SWE eval, where tasks have a 20-hour median human completion time, at 73.1% (up from...
[34] moonshotai/Kimi-K2.6 - Hugging Facehuggingface.co
Proactive & Open Orchestration: For autonomous tasks, K2.6 demonstrates strong performance in powering persistent, 24/7 background agents that ... 6 days ago
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Accessgmicloud.ai
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ... 3 days ago
[36] Kimi K2.6: The new leading open weights model - Artificial Analysisartificialanalysis.ai
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ... 5 days ago
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard ( 12 of 115) with 27 published benchmark scores currently shown ... 6 days ago
[38] Kimi K2.6: Pricing, Benchmarks & Performance - LLM Statsllm-stats.com
Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical ... 6 days ago
[39] Kimi K2.6 - Vals AIvals.ai
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.
[40] Kimi K2 is the large language model series developed by Moonshot ...github.com
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters.
[41] Moonshot AI releases Kimi K2.6 with long-horizon coding and agent ...facebook.com
- Benchmarks: 54.0 on HLE-Full with tools (leads GPT- 5.4, Claude Opus 4.6, Gemini 3.1 Pro), 58.6 on SWE- Bench Pro, 89.6 on LiveCodeBench (v6), ... 5 days ago
[42] Meet Kimi K2.6: Advancing Open-Source Coding - Redditreddit.com
What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, ... 6 days ago
[43] Last night, Moonshot AI dropped Kimi K2.6. - Saurabh khansaurabhkhan.medium.com
But K2.6 matched or outperformed GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro across Humanity Last Exam, SWE-Bench Pro, and DeepSearchQA. Those ... 5 days ago