Report published April 29, 2026. Last edited May 6, 2026. 12 sources.

GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4: Benchmark Comparison

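GPT-5.5 is the leading candidate for terminal-driven coding agents, and Claude Opus 4.7 for software-repair benchmarks ^[18]^[24]. GPT-5.5 Pro is treated separately from base GPT-5.5; where it is reported, it leads with 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools ^[24].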


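Reading the benchmark tables alone, the four models appear to be running on the same track. In practice, no single public table evaluates GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 side by side under identical conditions. The closest citable comparison lines up GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max. Kimi K2.6 figures have to be taken separately from Kimi release coverage, its model card, and leaderboard entries ^[1]^[6]^[24].

For DeepSeek V4, this article uses DeepSeek-V4-Pro-Max as the comparison target, because that is the variant with confirmed benchmark and cost rows ^[18]^[24]. GPT-5.5 and GPT-5.5 Pro are kept separate wherever their numbers are reported separately ^[24].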

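The short answer: which model to try first, by use case

Terminal-heavy coding agents: GPT-5.5. Its 82.7% on Terminal-Bench 2.0 is the highest figure in the shared comparison table ^[24].
Software repair and real-repository evaluations: Claude Opus 4.7. It scores 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified, strong results within this comparison ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7. It leads on GPQA Diamond and on Humanity’s Last Exam without tools ^[24].
Tool-assisted reasoning and browsing tasks: GPT-5.5 Pro. It is reported at 57.2% on Humanity’s Last Exam with tools and 90.1% on BrowseComp ^[24].
Trying an open-weight model in your own environment: Kimi K2.6. It is described as an open-weight MoE model with 1T parameters, 32B active parameters, and a 256K context window ^[1].
Keeping API costs as low as possible: DeepSeek-V4-Pro-Max. LLM Stats lists a 1M context window, 80.6% on SWE-Bench Verified, and a price column of $1.74/$3.48 ^[18].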

Key benchmark comparison

A "—" means the score for that model could not be confirmed in the cited sources. It does not mean a score of zero. The figures for GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, and DeepSeek-V4-Pro-Max come mainly from the same comparison table, while the Kimi K2.6 figures come from separate Kimi-related sources ^[1]^[6]^[24].

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	about 91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
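
The rule that a missing cell is not a zero matters as soon as the table is processed programmatically. The following is a minimal Python sketch, not part of the cited sources, that stores the table with `None` for unreported scores and picks the best reported figure per row. The names `MODELS`, `SCORES`, `leader`, and `leaders` are illustrative, and Kimi K2.6's GPQA Diamond value is entered as 91.0 only because the table gives it as an approximation.

```python
from typing import Optional

# Column order matches the table above.
MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, copied from the table. None marks a missing-score cell
# (not reported in the cited sources); it must never be treated as 0.
# Kimi K2.6 on GPQA Diamond is approximate ("about 91%") in the table.
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],
    "HLE, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leader(row: list[Optional[float]]) -> tuple[str, float, int]:
    """Return (model, score, number of models reported) for one benchmark row."""
    reported = [(score, name) for score, name in zip(row, MODELS) if score is not None]
    best_score, best_name = max(reported)
    return best_name, best_score, len(reported)


def leaders() -> dict[str, tuple[str, float, int]]:
    """Best reported score per benchmark, with how many models were compared."""
    return {bench: leader(row) for bench, row in SCORES.items()}


if __name__ == "__main__":
    for bench, (name, score, n) in leaders().items():
        print(f"{bench}: {name} at {score}% ({n} of {len(MODELS)} models reported)")
```

Reporting the number of models compared alongside each leader keeps rows such as SWE-Bench Verified, where only three of five columns are filled, from reading as a full five-way win.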

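Candidates by priority

Priority	First candidate to try	Reason
Terminal-style coding agents	GPT-5.5	Its 82.7% on Terminal-Bench 2.0 is the highest figure in the shared comparison ^[24].
Software-repair benchmarks	Claude Opus 4.7	Its cited SWE-Bench Pro and SWE-Bench Verified figures are the strongest in this group ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	It leads on GPQA Diamond and on Humanity’s Last Exam without tools ^[24].
Tool-assisted reasoning and browsing	GPT-5.5 Pro	Where GPT-5.5 Pro is reported separately, it leads on HLE with tools and on BrowseComp ^[24].
Open-weight deployment	Kimi K2.6	It is described as an open-weight 1T-parameter MoE model, and its Hugging Face model card also shows strong coding figures ^[1]^[6].
Cost-focused hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context window, 80.6% on SWE-Bench Verified, and a lower price column than Claude Opus 4.7 ^[18].
Long context	GPT-5.5, Claude Opus 4.7, DeepSeek-V4-Pro-Max	The cited sources list a 1M context window for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, while Kimi K2.6 is at roughly 256K to 262K ^[1]^[11]^[16]^[18]^[27].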

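How to read each model

GPT-5.5

OpenAI describes GPT-5.5 as a model built for complex tasks such as coding, research, and data analysis ^[38]. In the shared comparison it scores 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[24]. The same table reports 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro, and 84.4% on BrowseComp ^[24].

The main caution is not to conflate GPT-5.5 Pro with base GPT-5.5. GPT-5.5 Pro is reported at 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but it should be treated as a separate model when comparing cost, latency, and configuration ^[24].

On the procurement side, BenchLM lists GPT-5.5 with a 1M-token context window, and a separate pricing report puts it at $5 per million input tokens and $30 per million output tokens ^[27]^[36]. Prices change often, so check the provider’s current pricing before building an estimate.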

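Claude Opus 4.7

Among the four candidates, Claude Opus 4.7 shows a particularly strong signal on software-repair benchmarks. LLM Stats lists it at 87.6% on SWE-Bench Verified, and the shared comparison reports 64.3% on SWE-Bench Pro ^[18]^[24]. It also leads on GPQA Diamond (94.2%), Humanity’s Last Exam without tools (46.9%), and MCP Atlas (79.1%) ^[24].

LLM Stats lists Claude Opus 4.7 with a 1M-token context window and pricing of $5/$25 per million tokens ^[16]. Anthropic notes, however, that some of its benchmark results use internal implementations or updated evaluation harnesses and may not be directly comparable to public leaderboard scores ^[17].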

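Kimi K2.6

Kimi K2.6’s strength is not only its performance but also the fact that it is an open-weight option. Release coverage describes it as a 1T-parameter MoE with 32B active parameters, 384 experts, native multimodality, INT4 quantization, and a 256K context window ^[1]. The Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, and 89.6 on LiveCodeBench v6 ^[6].

The same release coverage also gives Kimi K2.6 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp ^[1]. LLM Stats lists a 262K context window, a price column of $0.95/$4.00, and an Open Source label ^[11]. Because the Kimi numbers do not come from the same comparison table as those for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, narrow wins or losses should not be taken as conclusions; read them on the assumption that you will verify on your own tasks ^[1]^[6]^[24].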

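DeepSeek-V4-Pro-Max

DeepSeek-V4-Pro-Max is less an overall leader than a candidate for testing price-performance. LLM Stats lists a 1.6T size, a 1M context window, 80.6% on SWE-Bench Verified, and a price column of $1.74/$3.48 ^[18]. The shared comparison reports 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas ^[24].

That makes DeepSeek-V4-Pro-Max worth trying for cost-constrained workloads. In the same comparison table, though, GPT-5.5, GPT-5.5 Pro, or Claude Opus 4.7 comes out ahead on most rows. Before replacing a premium model in production, confirm quality on your own real tasks ^[24].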

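Treat prices and context lengths as reference values

Prices and context lengths are not always reported by the same provider or under the same conditions. The figures below are safest read as signals for narrowing candidates, not as final numbers for an estimate.

Model	Context and price signals confirmed in the cited sources	Practical reading
GPT-5.5	BenchLM lists a 1M context window. Pricing reports give $5 input and $30 output per million tokens ^[27]^[36].	Premium hosted candidate. Checking current pricing is essential.
Claude Opus 4.7	LLM Stats lists a 1M context window and $5/$25 per million tokens ^[16].	High-performance candidate for coding, reasoning, and long-context use.
Kimi K2.6	Release coverage gives a 256K context window; LLM Stats lists a 262K context window and a price column of $0.95/$4.00 ^[1]^[11].	Strong choice if open weights matter. Hosted pricing can vary by provider.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context window, a 1.6T size, 80.6% on SWE-Bench Verified, and a price column of $1.74/$3.48 ^[18].	Strong cost-focused candidate if the quality fits the use case.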
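
To turn the price signals into a rough per-request figure, the arithmetic is input tokens times the input price plus output tokens times the output price, divided by one million. The sketch below is illustrative only. It assumes that each price pair in the table is (input, output) in US dollars per million tokens, which the sources state explicitly only for GPT-5.5, and the 20,000-input, 2,000-output request size is a made-up example.

```python
# (input, output) USD per 1M tokens, copied from the table above.
# Assumption: the LLM Stats "price column" pairs are input/output prices.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 20,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f} per request")
```

Because output tokens are priced several times higher than input tokens for most of these models, a setting that lengthens responses can change the ranking by cost even when the listed prices stay the same.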

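Why the rankings disagree

Each benchmark measures a different capability. GPQA Diamond and Humanity’s Last Exam are used for hard reasoning, Terminal-Bench 2.0 and the SWE-Bench family for coding and agentic software work, and BrowseComp for browsing-style search and cross-checking ^[24]. A model that wins one row can therefore lose another.

Scores also shift when the evaluation environment differs, even for a benchmark with the same name. For Claude Opus 4.7 on SWE-Bench Verified, for example, LLM Stats lists 87.6%, while LMCouncil reports 83.5% ± 1.7 under its own settings ^[18]^[30]. Anthropic likewise states that some results use internal implementations or updated harnesses and may not be directly comparable to public leaderboards ^[17].

Choosing a model on a gap of one or two points alone is therefore risky. Public benchmarks are material for narrowing candidates; the final decision should rest on evaluation against your own workload.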
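
One way to see why a gap of one or two points is weak evidence is to estimate the sampling noise of a pass rate. The sketch below uses the binomial standard error and a simple two-proportion check. It rests on several assumptions that the cited sources do not state: that each task is an independent pass/fail trial, that the two runs are independent of each other, and that the task set has 500 items (the published size of SWE-bench Verified). The function names are illustrative.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured on n_tasks pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def gap_exceeds_noise(rate_a: float, rate_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap between two pass rates is larger than z combined standard errors.

    Assumes both rates were measured on n_tasks tasks in independent runs.
    """
    se_a = binomial_standard_error(rate_a, n_tasks)
    se_b = binomial_standard_error(rate_b, n_tasks)
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return abs(rate_a - rate_b) > z * se_diff


if __name__ == "__main__":
    n = 500  # assumed task count
    print(f"SE at 83.5% on {n} tasks: {binomial_standard_error(0.835, n):.4f}")
    print("80.6% vs 80.2% beyond noise:", gap_exceeds_noise(0.806, 0.802, n))
    print("87.6% vs 80.6% beyond noise:", gap_exceeds_noise(0.876, 0.806, n))
```

Under these assumptions, one standard error at 83.5% over 500 tasks is about 1.7 points, the same size as the LMCouncil margin quoted above, although the source does not say how that margin was computed. By the same measure, the 0.4-point gap between 80.6% and 80.2% is well inside the noise, while a 7-point gap is not.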

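Checklist before adoption

Test with your real prompts, files, and repositories. Benchmark problems do not fully reproduce your codebase, documents, business rules, or user behavior.
Match the tool environment. Coding-agent results change substantially depending on whether terminal access, browsing, search, repository access, and internal APIs are available.
Measure cost and latency under the same settings. Pro modes and high-reasoning settings change not only quality but also output token counts and response time.
Have a human review the failures. For code generation, check not only whether tests pass but also the maintainability of the diff, security, and any nonexistent dependencies that slipped in.
Include at least one low-cost candidate. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max are worth including in the evaluation set ^[1]^[18].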
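
The checklist above can be sketched as a small harness that runs the same task set against each candidate and records pass rate, latency, cost, and the failures to hand to a human reviewer. Everything here is hypothetical scaffolding: `run_task` stands for whatever function calls a given model with your own tools and returns a `TaskResult`, and none of the names correspond to a real vendor SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    passed: bool          # did the task's own check (for example its tests) pass?
    input_tokens: int
    output_tokens: int


@dataclass
class ModelSummary:
    pass_rate: float
    mean_latency_s: float
    total_cost_usd: float
    failed_task_ids: list[str]  # hand these to a human reviewer


def evaluate(
    run_task: Callable[[str], TaskResult],  # hypothetical: runs one task on one model
    task_ids: list[str],
    price_in_per_m: float,
    price_out_per_m: float,
) -> ModelSummary:
    """Run every task once with identical settings and summarize the results."""
    if not task_ids:
        raise ValueError("task_ids must not be empty")
    latencies: list[float] = []
    failed: list[str] = []
    cost = 0.0
    for task_id in task_ids:
        start = time.perf_counter()
        result = run_task(task_id)
        latencies.append(time.perf_counter() - start)
        cost += (result.input_tokens * price_in_per_m
                 + result.output_tokens * price_out_per_m) / 1_000_000
        if not result.passed:
            failed.append(task_id)
    return ModelSummary(
        pass_rate=(len(task_ids) - len(failed)) / len(task_ids),
        mean_latency_s=sum(latencies) / len(latencies),
        total_cost_usd=cost,
        failed_task_ids=failed,
    )
```

Calling `evaluate` once per candidate with the same `task_ids` and the same tool setup inside `run_task` keeps the comparison on equal terms; the list of failed task IDs is kept so that failures are reviewed rather than only counted.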

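Conclusion

To narrow down the high-end models first, testing GPT-5.5 and Claude Opus 4.7 side by side is a clear starting point. GPT-5.5 has the strongest cited figure on Terminal-Bench 2.0, and Claude Opus 4.7 has the strongest cited figures on SWE-Bench Pro and SWE-Bench Verified ^[18]^[24]. If you need open weights, start from Kimi K2.6; if cost is the top priority, include DeepSeek-V4-Pro-Max in the evaluation ^[1]^[6]^[18]^[24].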

Research conversation

You, April 25, 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI, April 25, 2026, 40 sources

Cited answer

I found public benchmark data for all four, but not one fully independent, apples-to-apples leaderboard covering GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 under the same harness. The practical read is: GPT-5.5 looks strongest for broad agentic coding/computer-use tasks, Claude Opus 4.7 is extremely strong on coding and hard reasoning, Kimi K2.6 is the strongest open-weight option in this group, and DeepSeek V4 looks like the best price/performance play if its reported results hold up across independent testing ^[4]^[5]^[6]^[13].

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

If you want the safest high-end general choice, start with GPT-5.5. If your priority is top-tier coding reliability and long-context enterprise work, test Claude Opus 4.7. If you need open weights and strong frontier-adjacent performance, Kimi K2.6 is the most compelling open option. If cost is the constraint and you can tolerate some validation risk, DeepSeek V4 is the one to benchmark first.

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

トレンドを発見する

レポート公開済み2026年4月29日Last edited 2026年5月6日12 ソース

GPT-5.5、Claude Opus 4.7、Kimi K2.6、DeepSeek V4のベンチマーク比較

Studio Global AIで検索して事実確認 Discover からさらに閲覧する

17K0

まず結論：用途別にどれから試すべきか

ターミナル操作が多いコーディングエージェント：GPT-5.5。共同比較のTerminal-Bench 2.0で82.7%と最も高い数値です ^[24]。
ソフトウェア修正・実リポジトリ系の評価：Claude Opus 4.7。SWE-Bench Proで64.3%、SWE-Bench Verifiedで87.6%と、この比較範囲では強い結果です ^[18]^[24]。
ツールなしの難問推論：Claude Opus 4.7。GPQA DiamondとツールなしHumanity’s Last Examで首位です ^[24]。
ツールを使う推論やブラウジング系タスク：GPT-5.5 Pro。ツールありHumanity’s Last Examで57.2%、BrowseCompで90.1%と報告されています ^[24]。
重み公開モデルを自社環境で試したい場合：Kimi K2.6。1TパラメータのMoE、32Bアクティブパラメータ、256Kコンテキストのオープンウェイトモデルとして説明されています ^[1]。
APIコストを強く抑えたい場合：DeepSeek-V4-Pro-Max。LLM Statsでは1Mコンテキスト、SWE-Bench Verified 80.6%、価格欄は$1.74/$3.48とされています ^[18]。

主要ベンチマーク比較

ベンチマーク	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	約91% ^[28]	90.1% ^[24]
Humanity’s Last Exam、ツールなし	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam、ツールあり	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]

優先順位別の候補

重視すること	最初に試す候補	理由
ターミナル型のコーディングエージェント	GPT-5.5	共同比較のTerminal-Bench 2.0で82.7%と最も高い数値です ^[24]。
ソフトウェア修正ベンチマーク	Claude Opus 4.7	SWE-Bench ProとSWE-Bench Verifiedの引用値で、このグループの中では最も強い結果です ^[18]^[24]。
ツールなしの難問推論	Claude Opus 4.7	GPQA DiamondとツールなしHumanity’s Last Examで首位です ^[24]。
ツールあり推論・ブラウジング	GPT-5.5 Pro	GPT-5.5 Proが別枠で報告されている範囲では、ツールありHLEとBrowseCompで首位です ^[24]。
オープンウェイト展開	Kimi K2.6	1TパラメータMoEのオープンウェイトモデルとして説明され、Hugging Faceのモデルカードでもコーディング系の強い数値が示されています ^[1]^[6]。
コスト重視のホスト推論	DeepSeek-V4-Pro-Max	LLM Statsでは1Mコンテキスト、SWE-Bench Verified 80.6%、Claude Opus 4.7より低い価格欄が示されています ^[18]。
長いコンテキスト	GPT-5.5、Claude Opus 4.7、DeepSeek-V4-Pro-Max	引用元ではGPT-5.5、Claude Opus 4.7、DeepSeek-V4-Pro-Maxに1Mコンテキストが示され、Kimi K2.6はおおむね256K〜262Kです ^[1]^[11]^[16]^[18]^[27]。

モデル別の読み方

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

価格とコンテキスト長は「参考値」として見る

モデル	引用元で確認できるコンテキスト・価格シグナル	実務上の読み方
GPT-5.5	BenchLMは1Mコンテキストを掲載。価格報道では入力$5、出力$30／100万トークン ^[27]^[36]。	プレミアムなホスト型候補。最新価格の確認が必須。
Claude Opus 4.7	LLM Statsは1Mコンテキストと$5/$25／100万トークンを掲載 ^[16]。	コーディング、推論、長文コンテキスト用途の高性能候補。
Kimi K2.6	リリース情報では256Kコンテキスト、LLM Statsでは262Kコンテキストと$0.95/$4.00の価格欄 ^[1]^[11]。	オープンウェイト重視なら有力。ホスト価格は提供事業者により変わり得る。
DeepSeek-V4-Pro-Max	LLM Statsは1Mコンテキスト、1.6Tサイズ、SWE-Bench Verified 80.6%、価格欄$1.74/$3.48を掲載 ^[18]。	品質が用途に合えば、コスト重視の有力候補。

なぜランキングが食い違うのか

導入前に見るべきチェックリスト

実際のプロンプト、ファイル、リポジトリで試す。 ベンチマークの問題は、自社のコードベース、文書、業務ルール、ユーザー行動を完全には再現しません。
ツール環境をそろえる。 ターミナル、ブラウジング、検索、リポジトリ参照、社内APIの有無で、コーディングエージェントの結果は大きく変わります。
同じ設定でコストとレイテンシを測る。 Proモードや高推論設定は品質だけでなく、出力トークン数や応答時間も変えます。
失敗例を人間が確認する。 コード生成では、テスト通過だけでなく、差分の保守性、セキュリティ、存在しない依存関係の混入も確認すべきです。
低コスト候補を最低1つ入れる。 オープンウェイトや推論コストが重要なら、Kimi K2.6とDeepSeek-V4-Pro-Maxは評価セットに入れる価値があります ^[1]^[18]。

結論

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AIで検索して事実確認

重要なポイント

端末操作型のコーディングエージェントはGPT 5.5、ソフトウェア修正系ベンチマークはClaude Opus 4.7が有力です [18][24]。
GPT 5.5 ProはベースのGPT 5.5と別扱いです。報告されている範囲ではBrowseComp 90.1%、ツールありHumanity’s Last Exam 57.2%で首位です [24]。
オープンウェイト重視ならKimi K2.6、コスト重視のホスト推論ならDeepSeek V4 Pro Maxを実タスクで検証するのが現実的です [1][18]。

人々も尋ねます

「GPT-5.5、Claude Opus 4.7、Kimi K2.6、DeepSeek V4のベンチマーク比較」の短い答えは何ですか?

端末操作型のコーディングエージェントはGPT 5.5、ソフトウェア修正系ベンチマークはClaude Opus 4.7が有力です [18][24]。

最初に検証する重要なポイントは何ですか?

次の実践では何をすればいいでしょうか？

オープンウェイト重視ならKimi K2.6、コスト重視のホスト推論ならDeepSeek V4 Pro Maxを実タスクで検証するのが現実的です [1][18]。

次にどの関連トピックを検討すればよいでしょうか?

別の角度からの引用や追加の引用については、「Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論」に進みます。

関連ページを開く

これを何と比較すればいいでしょうか？

この回答を「DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点」と照合してください。

関連ページを開く

研究を続けてください

Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6：2026年ベンチマーク比較

DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点

Northwest Timber と Southeast Timber：なぜ答えは「larger; larger」なのか

Northwest Timber と Southeast Timber：代替性が高いと、なぜ「larger; larger」になるのか

DeepSeekはOpenAI、Claude、Gemini、Grokに勝てるのか　本当の焦点はコストと信頼

DeepSeekはOpenAI、Claude、Gemini、Grokに勝てるのか

研究に関する会話

あなた2026年4月25日

リサーチクエスチョン

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI2026年4月25日40 ソース

引用された回答

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
LLM Stats Logo Make AI phone calls with one API call Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

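Bottom line first: which model to try first for each use case

Terminal-heavy coding agents: GPT-5.5. It posts the highest score on Terminal-Bench 2.0 in the shared comparison, at 82.7% ^[24].
Software repair and real-repository evaluations: Claude Opus 4.7. It scores 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified, strong results within this comparison ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7. It leads on GPQA Diamond and on Humanity’s Last Exam without tools ^[24].
Tool-assisted reasoning and browsing tasks: GPT-5.5 Pro. It is reported at 57.2% on Humanity’s Last Exam with tools and 90.1% on BrowseComp ^[24].
Trying an open-weight model in your own environment: Kimi K2.6. It is described as an open-weight MoE model with 1T parameters, 32B active parameters, and a 256K context ^[1].
Keeping API costs down: DeepSeek-V4-Pro-Max. LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified, and a price column of $1.74/$3.48 ^[18].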

Key benchmark comparison

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	about 91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
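
To read the table above row by row, it can help to compute the leader and its margin over the runner-up for each benchmark. The following is a rough sketch only: the `leaders` function and the data layout are illustrative assumptions, Kimi K2.6's "about 91%" on GPQA Diamond is entered as 91.0, and because the cells come from different sources and settings, small margins should be treated as indicative rather than conclusive.

```python
from typing import Dict, List, Optional, Tuple

MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores copied from the comparison table; None marks a cell with no reported value.
SCORES: Dict[str, List[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],  # 91.0 stands in for "about 91%"
    "Humanity's Last Exam, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "Humanity's Last Exam, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leaders(scores: Dict[str, List[Optional[float]]]) -> Dict[str, Tuple[str, float, Optional[float]]]:
    """Return (leading model, score, margin over runner-up) per benchmark."""
    result = {}
    for benchmark, row in scores.items():
        reported = [(value, model) for value, model in zip(row, MODELS) if value is not None]
        reported.sort(key=lambda pair: pair[0], reverse=True)
        best_value, best_model = reported[0]
        # The margin is undefined when only one model reports a score.
        margin = round(best_value - reported[1][0], 1) if len(reported) > 1 else None
        result[benchmark] = (best_model, best_value, margin)
    return result


summary = leaders(SCORES)
for benchmark, (model, value, margin) in summary.items():
    print(f"{benchmark}: {model} at {value}% (margin {margin} points)")
```

Printing the margin next to the leader makes it easier to separate clear gaps, such as Terminal-Bench 2.0, from near-ties, such as GPQA Diamond.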

Candidates by priority

Priority	First candidate to try	Why
Terminal-style coding agents	GPT-5.5	Highest score on Terminal-Bench 2.0 in the shared comparison, at 82.7% ^[24].
Software repair benchmarks	Claude Opus 4.7	Strongest cited results in this group on SWE-Bench Pro and SWE-Bench Verified ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads on GPQA Diamond and Humanity’s Last Exam without tools ^[24].
Tool-assisted reasoning and browsing	GPT-5.5 Pro	Where GPT-5.5 Pro is reported separately, it leads on HLE with tools and BrowseComp ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight 1T-parameter MoE model, and its Hugging Face model card also shows strong coding numbers ^[1]^[6].
Cost-sensitive hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified, and a lower price column than Claude Opus 4.7 ^[18].
Long context	GPT-5.5, Claude Opus 4.7, DeepSeek-V4-Pro-Max	The cited sources show a 1M context for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, while Kimi K2.6 is roughly 256K to 262K ^[1]^[11]^[16]^[18]^[27].

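Treat price and context length as reference values

Model	Context and price signals in the cited sources	Practical reading
GPT-5.5	BenchLM lists a 1M context. Price reporting gives $5 input and $30 output per million tokens ^[27]^[36].	Premium hosted candidate. Check current pricing before committing.
Claude Opus 4.7	LLM Stats lists a 1M context and $5/$25 per million tokens ^[16].	High-performance candidate for coding, reasoning, and long-context work.
Kimi K2.6	Release coverage gives a 256K context; LLM Stats lists a 262K context and a price column of $0.95/$4.00 ^[1]^[11].	Strong if open weights matter. Hosted pricing can vary by provider.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a 1.6T size, 80.6% on SWE-Bench Verified, and a price column of $1.74/$3.48 ^[18].	A strong cost-focused candidate if quality fits the use case.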
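
The price signals in the table can be turned into a rough workload estimate. This is a sketch under explicit assumptions: each price pair is read as input/output dollars per million tokens (the sources state this outright only for GPT-5.5 and Claude Opus 4.7), the workload of 10,000 requests at 2,000 input and 500 output tokens each is hypothetical, and `Price` and `workload_cost` are illustrative names. Current prices should be confirmed with each provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # USD per 1M input tokens (assumed reading of the price column)
    output_per_million: float  # USD per 1M output tokens (assumed reading of the price column)


# Figures as cited in the table; treat them as reference values only.
PRICES = {
    "GPT-5.5": Price(5.00, 30.00),
    "Claude Opus 4.7": Price(5.00, 25.00),
    "Kimi K2.6": Price(0.95, 4.00),
    "DeepSeek-V4-Pro-Max": Price(1.74, 3.48),
}


def workload_cost(price: Price, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a workload with fixed token counts per request."""
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return (total_input * price.input_per_million
            + total_output * price.output_per_million) / 1_000_000


# Hypothetical workload: 10,000 requests, 2,000 input and 500 output tokens each.
costs = {name: workload_cost(p, 10_000, 2_000, 500) for name, p in PRICES.items()}
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.2f}")
```

Because output tokens are priced higher than input tokens in every row, the ranking can shift with the input/output mix, so the estimate is worth rerunning with token counts measured on your own tasks.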

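Checklist before adoption

Test with your actual prompts, files, and repositories. Benchmark problems do not fully reproduce your codebase, documents, business rules, or user behavior.
Align the tool environment. Coding-agent results change substantially depending on whether a terminal, browsing, search, repository access, and internal APIs are available.
Measure cost and latency under the same settings. Pro modes and high reasoning-effort settings change not only quality but also output token counts and response times.
Have a human review failures. For code generation, check not only that tests pass but also the maintainability of the diff, security, and whether nonexistent dependencies have crept in.
Include at least one low-cost candidate. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max are worth including in the eval set ^[1]^[18].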

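Key takeaways

GPT-5.5 is the favorite for terminal-driven coding agents, and Claude Opus 4.7 for software-repair benchmarks [18][24].
GPT-5.5 Pro is treated separately from base GPT-5.5. Where reported, it leads with 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools [24].
The realistic next step is to validate Kimi K2.6 on real tasks if open weights matter, and DeepSeek-V4-Pro-Max if cost-sensitive hosted inference matters [1][18].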

Research conversation

You, April 25, 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI, April 25, 2026, 40 sources

Cited answer

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].
