Report published April 28, 2026 · Last edited May 6, 2026 · 14 sources

GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6: A Benchmark Comparison


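The first thing to separate when comparing these four models is not overall rank but the job each model will do. Public benchmarks differ in reasoning settings, evaluation dates, and whether scores are self-reported or third-party, so collapsing them into a single ranking invites misreading.^[4]^[18]

For DeepSeek, this article focuses on DeepSeek V4 Pro (Reasoning, Max Effort), the variant for which figures could be confirmed. Artificial Analysis's open-model table lists Intelligence, context length, the Price column, and output speed for both Kimi K2.6 and DeepSeek V4 Pro.^[23]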

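The short answer: first pick by use case

Use case	First pick	Rationale
Overall performance and economically valuable tasks	GPT-5.5	GPT-5.5 (high) is reported at 59 on the Artificial Analysis Intelligence Index, and GPT-5.5 (xhigh) at Elo 1785 on GDPval-AA.^[26]^[27]
Deep reasoning, review, and specialist tasks	Claude Opus 4.7	LLM Stats tallies the 10 benchmarks shared with GPT-5.5 as 6 leads for Claude Opus 4.7 and 4 for GPT-5.5.^[4]
Terminal operation, browsing, and long-running tool use	GPT-5.5	LLM Stats lists GPT-5.5 as stronger on Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, and CyberGym.^[4]
Open-weight models where speed and price-performance matter	Kimi K2.6	In Artificial Analysis's open-model table, Kimi K2.6 shows Intelligence 54, 256k context, $1.7 in the Price column, and 112 tokens/s.^[23]
Long context and low API price	DeepSeek V4 Pro / DeepSeek V4 family	Artificial Analysis lists DeepSeek V4 Pro with a 1M context window, and Mashable reports DeepSeek V4 API pricing below that of GPT-5.5 and Claude Opus 4.7.^[3]^[23]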

Key signals for the four models

Model	Strengths visible in benchmarks	Pricing and operational characteristics
GPT-5.5	GPT-5.5 (high) scores 59 on the Artificial Analysis Intelligence Index. GPT-5.5 (xhigh) is reported at Elo 1785 on GDPval-AA, about 30 points ahead of Claude Opus 4.7 (max).^[26]^[27]	Mashable reports API pricing of $5 per 1 million input tokens and $30 per 1 million output tokens.^[3]
Claude Opus 4.7	6 leads to 4 in LLM Stats's tally of the 10 shared benchmarks. Mashable's table reports SWE-Bench Pro 64.3%, GPQA Diamond 94.2%, and HLE with tools 54.7%.^[4]^[9]	Mashable reports API pricing of $5 per 1 million input tokens and $25 per 1 million output tokens.^[3]
Kimi K2.6	Intelligence 54 in Artificial Analysis's open-model table. The Decoder reports, as figures published by Moonshot AI, HLE with Tools 54.0, SWE-Bench Pro 58.6, and BrowseComp 83.2.^[20]^[23]	The same Artificial Analysis table shows 256k context, $1.7 in the Price column, and 112 tokens/s.^[23]
DeepSeek V4 Pro	Intelligence 52 in Artificial Analysis's open-model table. DataCamp concludes that DeepSeek V4 does not surpass GPT-5.5 or Claude Opus 4.7 on raw capability.^[16]^[23]	The same Artificial Analysis table shows 1M context, $2.2 in the Price column, and 36 tokens/s. Mashable reports DeepSeek V4 API pricing of $1.74 per 1 million input tokens and $3.48 per 1 million output tokens.^[3]^[23]
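
To give a feel for what a gap of about 30 Elo points means, the sketch below converts a rating gap into an expected head-to-head score. It assumes the classic Elo logistic curve with a scale of 400; the sources cited here do not state which Elo variant GDPval-AA uses, so the resulting number is illustrative only. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the classic Elo curve.

    ASSUMPTION: scale=400 is the conventional chess-style Elo constant.
    It is not confirmed to be the scale used by GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# About 30 points separate GPT-5.5 (xhigh) and Claude Opus 4.7 (max) per [26].
gap = 30.0
print(f"Expected score for the higher-rated model: {elo_expected_score(gap):.3f}")
```

Under that assumption, a 30-point gap corresponds to only a modest preference for the higher-rated model, which is consistent with treating the two as close on this benchmark.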

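GPT-5.5 vs Claude Opus 4.7: the frontier models split by task

Between GPT-5.5 and Claude Opus 4.7, the leader changes from benchmark to benchmark. In the main figures Mashable reports, Claude Opus 4.7 leads on SWE-Bench Pro, GPQA Diamond, and Humanity's Last Exam with tools, while GPT-5.5 leads on Terminal-Bench 2.0, Humanity's Last Exam (without tools), BrowseComp, and ARC-AGI-1 Verified.^[9]

Benchmark	GPT-5.5	Claude Opus 4.7	Lead in Mashable's table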
SWE-Bench Pro	58.6%	64.3%	Claude Opus 4.7
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5
Humanity's Last Exam	40.6%	31.2%	GPT-5.5
Humanity's Last Exam with tools	52.2%	54.7%	Claude Opus 4.7
BrowseComp	84.4%	79.3%	GPT-5.5
GPQA Diamond	93.6%	94.2%	Claude Opus 4.7
ARC-AGI-1 Verified	94.5%	92.0%	GPT-5.5

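LLM Stats, for its part, tallies the 10 benchmarks both providers report and finds Claude Opus 4.7 ahead on 6 and GPT-5.5 ahead on 4. The site explains that Opus 4.7 is stronger on reasoning and review-grade tests, while GPT-5.5 is stronger on long-running tool use.^[4]

One important caveat applies. LLM Stats notes that these scores are self-reported by each provider at its high-reasoning tier, and describes them as comparable in form but not identical in methodology.^[4] In addition, on some items, such as Humanity's Last Exam, which model appears to lead depends on the source.^[4]^[9]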
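
The per-benchmark split can be tallied directly from the Mashable figures in the table above. The following is a minimal sketch: the scores are copied from that table, and the helper name `tally_leads` is hypothetical. It assumes no ties, which holds for these seven rows.

```python
from collections import Counter

# (benchmark, GPT-5.5 %, Claude Opus 4.7 %) as reported by Mashable [9]
MASHABLE_SCORES = [
    ("SWE-Bench Pro", 58.6, 64.3),
    ("Terminal-Bench 2.0", 82.7, 69.4),
    ("Humanity's Last Exam", 40.6, 31.2),
    ("Humanity's Last Exam with tools", 52.2, 54.7),
    ("BrowseComp", 84.4, 79.3),
    ("GPQA Diamond", 93.6, 94.2),
    ("ARC-AGI-1 Verified", 94.5, 92.0),
]


def tally_leads(rows):
    """Count leads per model and record the margin (percentage points) per benchmark."""
    leads = Counter()
    margins = {}
    for name, gpt, opus in rows:
        leader = "GPT-5.5" if gpt > opus else "Claude Opus 4.7"
        leads[leader] += 1
        margins[name] = (leader, round(abs(gpt - opus), 1))
    return leads, margins


leads, margins = tally_leads(MASHABLE_SCORES)
print(dict(leads))
for name, (leader, margin) in margins.items():
    print(f"{name}: {leader} by {margin} points")
```

Looking at the margins rather than only the win count helps here: some leads in this table are fractions of a point, while others exceed ten points.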

Kimi K2.6 vs DeepSeek V4 Pro: for open-weight models, speed or context length

Kimi K2.6 and DeepSeek V4 Pro are easier to judge as open-weight deployment candidates than as direct rivals to the closed frontier models on the same footing.

Metric	Kimi K2.6	DeepSeek V4 Pro
Artificial Analysis Intelligence	54	52
Context window	256k	1.00M
Price column	$1.7	$2.2
Output speed	112 tokens/s	36 tokens/s

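On this table alone, Kimi K2.6 has the edge on Intelligence and output speed, and DeepSeek V4 Pro has the edge with its 1M context window.^[23] The Decoder reports, as figures published by Moonshot AI, that Kimi K2.6 scored 54.0 on HLE with Tools, 58.6 on SWE-Bench Pro, and 83.2 on BrowseComp.^[20]

However, Kimi K2.6's published experiments are not a fully like-for-like comparison with GPT-5.5 or Claude Opus 4.7. According to the Hugging Face model card, Kimi K2.6 was evaluated with thinking mode, temperature 1.0, top-p 1.0, and a 262,144-token context length, among other settings, and the comparison models are mainly Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.^[18]

DeepSeek V4 Pro is better assessed on long context and cost than as the outright performance leader. DataCamp concludes that DeepSeek V4 does not surpass GPT-5.5 or Claude Opus 4.7 on raw capability, and positions it as a way to get near-frontier performance at low cost.^[16]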
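
The output-speed gap is easier to weigh once it is turned into wall-clock time. The sketch below is a rough estimate only: it uses the tokens/s values from the Artificial Analysis table [23], a hypothetical 10,000-token response, and ignores time to first token, provider differences, and any reasoning tokens. The helper `generation_seconds` is an illustrative name.

```python
# Output speeds in tokens per second, taken from the Artificial Analysis table [23].
OUTPUT_SPEED_TPS = {"Kimi K2.6": 112.0, "DeepSeek V4 Pro": 36.0}


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Streaming time for a response, ignoring latency before the first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# ASSUMPTION: a 10,000-token response is a made-up workload for illustration.
for model, tps in OUTPUT_SPEED_TPS.items():
    seconds = generation_seconds(10_000, tps)
    print(f"{model}: {seconds:.1f} s for 10,000 output tokens")
```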

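When comparing prices, do not mix different kinds of numbers

At least three kinds of price figures need to be kept apart.

The first is the API price per token. Mashable reports DeepSeek V4 at $1.74 per 1 million input tokens and $3.48 per 1 million output tokens, GPT-5.5 at $5/$30, and Claude Opus 4.7 at $5/$25.^[3]

The second is the Price column in Artificial Analysis's model table. It shows $1.7 for Kimi K2.6 and $2.2 for DeepSeek V4 Pro, but these should not be treated as the same metric as the API unit prices Mashable reports.^[23]

The third is the cost of running a benchmark. An Artificial Analysis article reports the cost to run the Intelligence Index as $1,071 for DeepSeek V4 Pro, $948 for Kimi K2.6, and $4,811 for Claude Opus 4.7.^[2]

Conclusions such as "DeepSeek is cheap", "Kimi is cheap", or "Claude is expensive" therefore need to state whether they refer to API unit price, evaluation run cost, or real-world operating cost including output token volume.^[2]^[3]^[23]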
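
As a worked example of the first kind of number only, the sketch below applies the per-token API prices Mashable reports [3] to a made-up workload. The workload size, the `TokenPrice` class, and the `request_cost` helper are assumptions for illustration. Kimi K2.6 is left out because the only price available for it here is the Artificial Analysis Price column, which is a different metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# API prices as reported by Mashable [3].
API_PRICES = {
    "DeepSeek V4": TokenPrice(1.74, 3.48),
    "GPT-5.5": TokenPrice(5.00, 30.00),
    "Claude Opus 4.7": TokenPrice(5.00, 25.00),
}


def request_cost(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at list price, with no caching or batch discounts."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# ASSUMPTION: 2M input tokens and 0.5M output tokens is a hypothetical workload.
for model, price in API_PRICES.items():
    print(f"{model}: ${request_cost(price, 2_000_000, 500_000):.2f}")
```

This treats every model as producing the same number of output tokens for the same task, which may not hold in practice, so the result is a list-price comparison rather than an operating-cost estimate.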

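Safety and reliability are a separate axis from benchmarks

For Claude Opus 4.7, Mashable reports Anthropic's claims of a 92% honesty rate and less sycophancy.^[15] Anthropic's own announcement states that on its internal research-agent benchmark Claude Opus 4.7 tied for the top overall score across six modules at 0.715, and improved on General Finance from 0.767 (Opus 4.6) to 0.813.^[17]

These are a different axis of evaluation from capability benchmarks such as SWE-Bench Pro, GPQA Diamond, and BrowseComp. For practical use, capability scores, cost, speed, hallucination risk, and auditability should each be assessed separately.^[15]^[17]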

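In production, routing is more realistic than a single fixed model

In production, routing by use case is more realistic than pinning every task to one model. MindStudio's coding comparison reports that GPT-5.5 used 72% fewer output tokens than Claude Opus 4.7 on the same coding tasks, while also noting that on complex, reasoning-heavy work in large codebases the thoroughness of Opus 4.7 can justify its cost.^[28]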
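
Token volume and unit price interact, so a back-of-the-envelope calculation is useful here. The sketch below combines two figures from different sources, the output prices from Mashable [3] and the 72% token reduction from MindStudio [28]; combining them this way is an assumption of this example, not a result either source reports. The helper `relative_output_cost` is a hypothetical name.

```python
def relative_output_cost(price_a: float, price_b: float, token_ratio_a_to_b: float) -> float:
    """Output spend of model A as a fraction of model B's for the same task.

    price_a, price_b: USD per 1M output tokens.
    token_ratio_a_to_b: output tokens A emits per token B emits.
    """
    if price_b <= 0 or token_ratio_a_to_b < 0:
        raise ValueError("price_b must be positive and the ratio non-negative")
    return (price_a * token_ratio_a_to_b) / price_b


# GPT-5.5: $30 per 1M output tokens; Claude Opus 4.7: $25 per 1M output tokens [3].
# ASSUMPTION: "72% fewer output tokens" [28] is read as a ratio of 0.28 and is
# taken to hold for the workload in question.
ratio = relative_output_cost(price_a=30.0, price_b=25.0, token_ratio_a_to_b=0.28)
print(f"GPT-5.5 output spend relative to Claude Opus 4.7: {ratio:.3f}")
```

Input-token cost and output quality are left out of this calculation, so it says nothing about which model is the better choice for a given task.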

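In practical terms, a natural starting point is GPT-5.5 for standard generation, edits, and terminal tasks, Claude Opus 4.7 for deep review and specialist judgment, Kimi K2.6 for low-cost open-weight experiments, and DeepSeek V4 Pro for long-context and bulk processing.^[3]^[4]^[23]^[28]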
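
That starting point can be written down as a small routing table. The following is a minimal sketch, not a production router: the `TaskKind` categories, the `pick_model` function, and the 256,000-token threshold are all assumptions for illustration, and the model names are display labels rather than real API identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    STANDARD_CODING = "standard generation, edits, and terminal tasks"
    DEEP_REVIEW = "deep review and specialist judgment"
    OPEN_WEIGHT_EXPERIMENT = "low-cost open-weight experiments"
    LONG_CONTEXT_BATCH = "long-context and bulk processing"


# Default first pick per task kind, following the article's suggestion.
DEFAULT_ROUTES = {
    TaskKind.STANDARD_CODING: "GPT-5.5",
    TaskKind.DEEP_REVIEW: "Claude Opus 4.7",
    TaskKind.OPEN_WEIGHT_EXPERIMENT: "Kimi K2.6",
    TaskKind.LONG_CONTEXT_BATCH: "DeepSeek V4 Pro",
}

# ASSUMPTION: a conservative cutoff derived from the 256k context listed in [23].
KIMI_CONTEXT_LIMIT = 256_000


def pick_model(kind: TaskKind, prompt_tokens: int = 0) -> str:
    """Return the first-pick model label, falling back when the prompt is too long."""
    model = DEFAULT_ROUTES[kind]
    if model == "Kimi K2.6" and prompt_tokens > KIMI_CONTEXT_LIMIT:
        # DeepSeek V4 Pro is listed with a 1M context window in [23].
        return "DeepSeek V4 Pro"
    return model


print(pick_model(TaskKind.DEEP_REVIEW))
print(pick_model(TaskKind.OPEN_WEIGHT_EXPERIMENT, prompt_tokens=300_000))
```

Keeping the routes in one dictionary makes it simple to revise the table as results from evaluation on real tasks come in.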

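Final verdict

On the public information available today, choosing by use case is safer than naming a single winner among GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro, and Kimi K2.6. The main strengths are overall and economically valuable tasks for GPT-5.5, reasoning and review for Claude Opus 4.7, open-weight speed and price-performance for Kimi K2.6, and long context with low API pricing for DeepSeek V4 Pro.^[3]^[4]^[23]^[26]^[27]

Even within Artificial Analysis, a model page gives GPT-5.5 (high) an Intelligence score of 59, while a listing page puts Claude Opus 4.7 (Adaptive Reasoning, Max Effort) in first place with 57, so the picture changes with page update time and reasoning settings.^[27]^[30] The most dependable approach is to use benchmarks as a starting point and then run a small side-by-side evaluation on your own tasks, budget, latency needs, and tolerance for failure.^[4]^[18]^[28]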


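Key takeaways

Do not read this as a definitive ranking of the four models under fully identical conditions; choosing by use case is safer. GPT-5.5 is the strong candidate for overall and economically valuable tasks, Claude Opus 4.7 for reasoning and review, Kimi K2.6 for open-weight speed, and DeepSeek V4 Pro for long context and low API pricing.[4][23][26][27]
GPT-5.5 is listed at Intelligence 59 on its Artificial Analysis model page and at Elo 1785 on GDPval-AA, and Claude Opus 4.7 leads 6 to 4 in LLM Stats's tally of the 10 shared benchmarks.[4][26][27]
In Artificial Analysis's open-model table, Kimi K2.6 shows Intelligence 54 and 112 tokens/s, and DeepSeek V4 Pro shows Intelligence 52 and 1M context. The former stands out for speed, the latter for context length and the low API pricing of the DeepSeek V4 family.[3][23]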


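Research conversation

You · April 25, 2026

Research question

Research and compare the benchmarks of GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6.

Studio Global AI · April 25, 2026 · 38 sources

Cited answer

The evidence provided is not sufficient for a rigorous side-by-side comparison of GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 on the same benchmarks. Within what can be confirmed, GPT-5.5 is described as the "new leading model", Claude Opus 4.7 is also suggested to be at or near the top in Artificial Analysis contexts, and Kimi K2.6 has confirmed figures only for some public model-comparison metrics.^[3]^[7]^[34]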

Key findings

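Model	Confirmed benchmark and performance information	Interpretation for comparison
GPT-5.5	A snippet from an Artificial Analysis article describes it as the "new leading AI model" as of April 24, 2026.^[3] A LinkedIn snippet also says "OpenAI's GPT-5.5 is the new leading AI model".^[34]	It may be very strong overall, but the evidence provided contains no specific scores, evaluation items, or gap to the competitor average, so a quantitative ranking cannot be established.^[3]^[34]
Claude Opus 4.7	A Mashable snippet says Anthropic claims a "92% honesty rate" and "less sycophancy" for Claude Opus 4.7.^[4] A SourceForge snippet describes Claude Opus 4.7 as Anthropic's latest model, aimed at improvements in advanced software engineering and complex problem solving.^[6]	There is a specific figure for safety and honesty metrics, but no numerical comparison on the same overall benchmark as GPT-5.5 or Kimi K2.6 is possible.^[4]^[6]
DeepSeek V4	A snippet of Artificial Analysis's public model comparison mentions "DeepSeek V4 Pro", but gives no figures such as score, speed, or price.^[7]	Insufficient evidence. The evidence provided does not allow a judgment on DeepSeek V4's benchmark rank, strengths, or weaknesses.^[7]
Kimi K2.6	A snippet of Artificial Analysis's public model comparison shows Kimi K2.6 with Intelligence 54, a 256k context length, a price of $1.7, and an output speed of 112 tokens/s.^[7]	It has the most specific public comparison figures of the four models, but there is no evidence that allows direct comparison with GPT-5.5 or Claude Opus 4.7 on the same metric.^[7]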

Evidence notes

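Stanford HAI's 2026 AI Index reports that, as of February 2026, the top models on SWE-bench Verified were clustered in the low-to-mid 70% range, with Claude 4.5 Opus (high reasoning) leading at about 76.8% and Kimi K2.5, GPT-5.2, and Gemini 3 Flash (high reasoning) falling between 70% and 76%.^[2]
That Stanford HAI passage, however, does not cover Claude Opus 4.7, GPT-5.5, Kimi K2.6, or DeepSeek V4 themselves; it is background on the previous generation or on other models.^[2]
The SciCode Bench snippet provided mentions Claude Sonnet 4 (high), Gemini 2.5 Pro, GPT-5-1-codex, Grok 4, and others, but direct scores for the models under comparison, GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, cannot be confirmed.^[1]
One description calls GPT-5.5 the "new leader", while another snippet can be read as placing Claude Opus 4.7 first in the Artificial Analysis Intelligence context, so there may be an inconsistency caused by differences in timing, evaluation set, or snippet context.^[3]^[34]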

Limitations / uncertainty

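Insufficient evidence. The evidence provided contains no table or scores comparing all four models on the same benchmark, on the same date, and with the same reasoning settings.^[1]^[3]^[5]^[7]^[34]
Information on DeepSeek V4 is especially thin: the evidence provided confirms no quantitative information beyond the appearance of the model name "DeepSeek V4 Pro".^[7]
The "92% honesty rate" for Claude Opus 4.7 is a safety and honesty metric and cannot be placed alongside coding, math, agentic performance, or overall intelligence scores.^[4]
Kimi K2.6's Intelligence 54, 256k context, $1.7, and 112 tokens/s are useful comparison points, but because the same columns are not provided for GPT-5.5 or Claude Opus 4.7, no relative ranking can be derived.^[7]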

Summary

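In the evidence currently provided, GPT-5.5 has the strongest overall performance signal, but no numerical score.^[3]^[34]
Claude Opus 4.7 has a specific figure of 92% for safety and honesty, and improvements in software engineering and complex problem solving are claimed, but no overall score comparison is possible.^[4]^[6]
Kimi K2.6 has specific public comparison values of Intelligence 54, 256k context, $1.7, and 112 tokens/s, but same-metric comparisons with the other three models are lacking.^[7]
DeepSeek V4 cannot be evaluated from the evidence provided alone.^[7]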

Sources

[2] DeepSeek is back among the leading open weights models with V4 Pro ... (artificialanalysis.ai)
Lower cost than frontier models, but high token usage keeps costs above most open weights peers: DeepSeek V4 Pro costs $1,071 to run the Artificial Analysis Intelligence Index, more than 4x cheaper than Claude Opus 4.7 ($4,811) but above several open weight...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks (llm-stats.com)
The Verdict On the 10 benchmarks both providers report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The leads cluster by category, not by overall quality: Opus 4.7 is ahead on the reasoning-heavy and review-grade tests (GPQA Diamond, HLE with and without to...
[9] OpenAI’s GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[15] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safety (mashable.com)
Anthropic says Claude Opus 4...
[16] DeepSeek V4: Features, Benchmarks, and Comparisons - DataCamp (datacamp.com)
How large are the DeepSeek V4 models? DeepSeek uses a Mixture of Experts (MoE) architecture. The Pro model contains 1.6 trillion total parameters (49 billion active) and requires an 865GB download. The Flash model contains 284 billion parameters (13 billion...
[17] Introducing Claude Opus 4.7 (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[18] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
Footnotes 1. General Testing Details We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level. Unless otherwise specified, a...
[20] Open-weight Kimi K2.6 takes on GPT-5.4 and Claude Opus 4.6 with agent swarms (the-decoder.com)
Moonshot AI has released Kimi K2.6 as an open-weight model. It's built to match GPT-5.4 and Claude Opus 4.6 on coding benchmarks, and...
[23] Comparison of Open Source AI Models across Intelligence, Performance, Price, Context Window, and more | Artificial Analysis (artificialanalysis.ai)
Model Name Intelligence Parameters Context Window Price Output Speed (t/s) Weights Providers Provider Benchmarks --- --- --- --- Kimi logo Kimi K2.6 Kimi 54 1.0KB (32B active at inference time) 256k $1.7 112 🤗 Novita Kimi SiliconFlow +6 more View DeepSeek...
[26] OpenAI's GPT-5.5 is the new leading AI model - Artificial Analysis (artificialanalysis.ai)
➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by 30 pts and Gemini 3.1 Pro Preview by 470 pts. GDPval-AA is Artificial Analysis' benchmark that leverages OpenAI's GDPval dataset to evaluate models on real-world e...
[27] GPT-5.5 (high) - Intelligence, Performance & Price Analysis (artificialanalysis.ai)
No, GPT-5.5 (high) is proprietary. The model weights are not publicly available. GPT-5.5 (high) is a proprietary model and OpenAI has not disclosed the model size or parameter count. GPT-5.5 (high) achieves a score of 59 on the Artificial Analysis Intellige...
[28] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance Compared (mindstudio.ai)
GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on the same coding tasks — a structural difference, not a minor gap. On raw benchmark quality, both models are competitive. Neither dominates on every task type. For high-volume agentic coding pipeli...
[30] Comparison of AI Models across Intelligence, Performance, and Price (artificialanalysis.ai)
Which is the most intelligent AI model? Claude Opus 4.7 (Adaptive Reasoning, Max Effort) currently leads the Artificial Analysis Intelligence Index with a score of 57, out of 347 models evaluated. What are the top AI models? The top AI models by Intelligenc...

トレンドを発見する

レポート公開済み2026年4月28日Last edited 2026年5月6日14 ソース

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6をベンチマークで比較

Studio Global AIで検索して事実確認 Discover からさらに閲覧する

16K0

まず結論：用途別の第一候補

用途	第一候補	根拠
総合性能・経済価値タスク	GPT-5.5	GPT-5.5 highはArtificial Analysis Intelligence Indexで59、GPT-5.5 xhighはGDPval-AAでElo 1785と報告されています。^[26]^[27]
深い推論、レビュー、専門タスク	Claude Opus 4.7	LLM Statsは、GPT-5.5との共通10ベンチマークでClaude Opus 4.7が6勝、GPT-5.5が4勝と整理しています。^[4]
端末操作、ブラウズ、長時間のツール利用	GPT-5.5	LLM Statsでは、Terminal-Bench 2.0、BrowseComp、OSWorld-Verified、CyberGymでGPT-5.5が強いと整理されています。^[4]
オープンウェイト系で速度と価格性能を重視	Kimi K2.6	Artificial Analysisのオープンモデル表では、Kimi K2.6がIntelligence 54、256k context、Price列$1.7、112 tokens/sです。^[23]
長文脈と低API価格を重視	DeepSeek V4 Pro / DeepSeek V4系	Artificial AnalysisではDeepSeek V4 Proが1M context、MashableではDeepSeek V4のAPI価格がGPT-5.5やClaude Opus 4.7より低い水準として報告されています。^[3]^[23]

4モデルの主要シグナル

モデル	ベンチマークで見える強み	価格・運用で見える特徴
GPT-5.5	GPT-5.5 highはArtificial Analysis Intelligence Indexで59。GPT-5.5 xhighはGDPval-AAでElo 1785とされ、Claude Opus 4.7 maxを約30ポイント上回ると報告されています。^[26]^[27]	MashableはAPI価格を100万入力トークンあたり$5、100万出力トークンあたり$30と報告しています。^[3]
Claude Opus 4.7	LLM Statsの共通10ベンチマーク整理では6勝4敗。Mashableの表ではSWE-Bench Pro 64.3%、GPQA Diamond 94.2%、HLE with tools 54.7%が報告されています。^[4]^[9]	MashableはAPI価格を100万入力トークンあたり$5、100万出力トークンあたり$25と報告しています。^[3]
Kimi K2.6	Artificial Analysisのオープンモデル表ではIntelligence 54。The DecoderはMoonshot AIの発表値として、HLE with Tools 54.0、SWE-Bench Pro 58.6、BrowseComp 83.2を報告しています。^[20]^[23]	Artificial Analysisの同表では256k context、Price列$1.7、112 tokens/sです。^[23]
DeepSeek V4 Pro	Artificial Analysisのオープンモデル表ではIntelligence 52。DataCampは、DeepSeek V4が純粋な能力ではGPT-5.5やClaude Opus 4.7を上回らないと整理しています。^[16]^[23]	Artificial Analysisの同表では1M context、Price列$2.2、36 tokens/s。MashableはDeepSeek V4のAPI価格を100万入力トークンあたり$1.74、100万出力トークンあたり$3.48と報告しています。^[3]^[23]

GPT-5.5 vs Claude Opus 4.7：フロンティア同士はタスクで分かれる

ベンチマーク	GPT-5.5	Claude Opus 4.7	Mashable表でのリード
SWE-Bench Pro	58.6%	64.3%	Claude Opus 4.7
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5
Humanity's Last Exam	40.6%	31.2%	GPT-5.5
Humanity's Last Exam with tools	52.2%	54.7%	Claude Opus 4.7
BrowseComp	84.4%	79.3%	GPT-5.5
GPQA Diamond	93.6%	94.2%	Claude Opus 4.7
ARC-AGI-1 Verified	94.5%	92.0%	GPT-5.5

Kimi K2.6 vs DeepSeek V4 Pro：オープンウェイト系は速度か文脈長か

指標	Kimi K2.6	DeepSeek V4 Pro
Artificial Analysis Intelligence	54	52
Context window	256k	1.00M
Price列	$1.7	$2.2
Output speed	112 tokens/s	36 tokens/s

価格比較では、数字の種類を混ぜない

価格を見るときは、少なくとも3種類の数字を分ける必要があります。

安全性・信頼性はベンチマークとは別軸

実運用では、1モデル固定よりルーティングが現実的

最終判断

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AIで検索して事実確認

重要なポイント

4モデルを完全同一条件で横比較した決定版ランキングとしては読まず、用途別に選ぶのが安全です。総合・経済タスクはGPT 5.5、推論・レビューはClaude Opus 4.7、オープンウェイトの速度はKimi K2.6、長文脈と低API価格はDeepSeek V4 Proが有力です。[4][23][26][27]
GPT 5.5はArtificial AnalysisのモデルページでIntelligence 59、GDPval AAでElo 1785とされ、Claude Opus 4.7はLLM Statsの共通10ベンチマーク整理で6勝4敗とされています。[4][26][27]
Kimi K2.6はArtificial Analysisのオープンモデル表でIntelligence 54・112 tokens/s、DeepSeek V4 ProはIntelligence 52・1M context。前者は速度、後者は文脈長とDeepSeek V4系の低API価格が目立ちます。[3][23]

人々も尋ねます

「GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6をベンチマークで比較」の短い答えは何ですか?

最初に検証する重要なポイントは何ですか?

次の実践では何をすればいいでしょうか？

次にどの関連トピックを検討すればよいでしょうか?

別の角度からの引用や追加の引用については、「Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論」に進みます。

関連ページを開く

これを何と比較すればいいでしょうか？

この回答を「DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点」と照合してください。

関連ページを開く

研究を続けてください

研究に関する会話

あなた2026年4月25日

リサーチクエスチョン

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。

Studio Global AI2026年4月25日38 ソース

引用された回答

Key findings

モデル	確認できたベンチマーク・性能情報	比較上の解釈
GPT-5.5	Artificial Analysis関連記事のスニペットでは、2026年4月24日時点で「新たな主要/首位AIモデル」とされている。^[3] LinkedInスニペットにも「OpenAIのGPT-5.5が新たな首位AIモデル」とある。^[34]	総合性能で非常に強い可能性はあるが、提供証拠には具体的なスコア、評価項目、競合平均との差がないため、定量順位は確定できない。^[3]^[34]
Claude Opus 4.7	Mashableスニペットでは、AnthropicがClaude Opus 4.7について「92%のhonesty rate」と「より少ないsycophancy」を主張している。^[4] SourceForgeスニペットでは、Claude Opus 4.7は高度なソフトウェアエンジニアリングと複雑な問題解決での改善を目的にしたAnthropicの最新モデルと説明されている。^[6]	安全性・誠実性系の指標では具体値があるが、GPT-5.5やKimi K2.6と同じ総合ベンチマークでの数値比較はできない。^[4]^[6]
DeepSeek V4	Artificial Analysisの公開モデル比較スニペットに「DeepSeek V4 Pro」の記載はあるが、スコアや速度、価格などの数値は提示されていない。^[7]	Insufficient evidence. 提供証拠だけではDeepSeek V4のベンチマーク順位・強み・弱みを判断できない。^[7]
Kimi K2.6	Artificial Analysisの公開モデル比較スニペットでは、Kimi K2.6のIntelligenceが54、コンテキスト長が256k、価格が$1.7、出力速度が112 tokens/sと示されている。^[7]	4モデル中で最も具体的な公開比較数値が確認できるが、GPT-5.5やClaude Opus 4.7と同一指標で直接比較できる証拠はない。^[7]

Evidence notes

Stanford HAIの2026年AI Indexでは、2026年2月時点のSWE-bench Verifiedで上位モデルが70%台前半〜中盤に密集し、Claude 4.5 Opus high reasoningが約76.8%で首位、KimiK2.5、GPT-5.2、Gemini 3 Flash high reasoningが70%〜76%の範囲に入っていたとされている。^[2]
ただし、このStanford HAIの記述はClaude Opus 4.7、GPT-5.5、Kimi K2.6、DeepSeek V4そのものではなく、直前世代または別モデルの状況を示す背景情報にとどまる。^[2]
SciCode Benchの提供スニペットにはClaude Sonnet 4 high、Gemini 2.5 Pro、GPT-5-1-codex、Grok 4などの記載があるが、比較対象であるGPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6の直接スコアは確認できない。^[1]
GPT-5.5については「新たな首位」とする記述がある一方、別スニペットではClaude Opus 4.7もArtificial Analysis Intelligence文脈で首位に位置するように読めるため、時点差、評価セット差、またはスニペットの文脈差による不整合がある可能性がある。^[3]^[34]

Limitations / uncertainty

Insufficient evidence. 4モデルすべてを同一ベンチマーク、同一日付、同一推論設定で比較した表やスコアは、提供証拠内にはありません。^[1]^[3]^[5]^[7]^[34]
DeepSeek V4は特に情報が不足しており、提供証拠では「DeepSeek V4 Pro」というモデル名の出現以上の定量情報を確認できません。^[7]
Claude Opus 4.7の「92% honesty rate」は安全性・誠実性に関する指標であり、コーディング、数学、エージェント性能、総合知能スコアと同列には扱えません。^[4]
Kimi K2.6のIntelligence 54、256k context、$1.7、112 tokens/sは有用な比較材料ですが、GPT-5.5やClaude Opus 4.7の同じ列の数値が提供されていないため、相対順位は出せません。^[7]

Summary

現時点の提供証拠で最も強い総合性能シグナルがあるのはGPT-5.5ですが、数値スコアはありません。^[3]^[34]
Claude Opus 4.7は安全性・誠実性で92%という具体値があり、ソフトウェアエンジニアリングや複雑問題解決の改善が主張されていますが、総合スコア比較はできません。^[4]^[6]
Kimi K2.6はIntelligence 54、256k context、$1.7、112 tokens/sという具体的な公開比較値がありますが、他3モデルとの同一指標比較は不足しています。^[7]
DeepSeek V4は提供証拠だけでは評価不能です。^[7]

情報源

[2] DeepSeek is back among the leading open weights models with V4 Pro ...artificialanalysis.ai
Lower cost than frontier models, but high token usage keeps costs above most open weights peers: DeepSeek V4 Pro costs $1,071 to run the Artificial Analysis Intelligence Index, more than 4x cheaper than Claude Opus 4.7 ($4,811) but above several open weight...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Geminimashable.com
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarksllm-stats.com
The Verdict On the 10 benchmarks both providers report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The leads cluster by category, not by overall quality: Opus 4.7 is ahead on the reasoning-heavy and review-grade tests (GPQA Diamond, HLE with and without to...
[9] OpenAI’s GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashablemashable.com
Thanks for signing up! SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent\ Humanity's...
[15] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safetymashable.com
Tim studied print journalism at the University of Southern California. He currently splits his time between Brooklyn, NY and Charleston, SC. He's currently working on his second novel, a science-fiction book. Recommended For You Anthropic says Claude Opus 4...
[16] DeepSeek V4: Features, Benchmarks, and Comparisons - DataCampdatacamp.com
How large are the DeepSeek V4 models? DeepSeek uses a Mixture of Experts (MoE) architecture. The Pro model contains 1.6 trillion total parameters (49 billion active) and requires an 865GB download. The Flash model contains 284 billion parameters (13 billion...
[17] Introducing Claude Opus 4.7anthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[18] moonshotai/Kimi-K2.6 - Hugging Facehuggingface.co
Footnotes 1. General Testing Details We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level. Unless otherwise specified, a...
[20] Open-weight Kimi K2.6 takes on GPT-5.4 and Claude Opus 4.6 with agent swarmsthe-decoder.com
The Decoder Open-weight Kimi K2.6 takes on GPT-5.4 and Claude Opus 4.6 with agent swarms Matthias Bastian Image description Moonshot AI has released Kimi K2.6 as an open-weight model. It's built to match GPT-5.4 and Claude Opus 4.6 on coding benchmarks, and...
[23] Comparison of Open Source AI Models across Intelligence, Performance, Price, Context Window, and more | Artificial Analysisartificialanalysis.ai
Model Name Intelligence Parameters Context Window Price Output Speed (t/s) Weights Providers Provider Benchmarks --- --- --- --- Kimi logo Kimi K2.6 Kimi 54 1.0KB (32B active at inference time) 256k $1.7 112 🤗 Novita Kimi SiliconFlow +6 more View DeepSeek...
[26] OpenAI's GPT-5.5 is the new leading AI model - Artificial Analysisartificialanalysis.ai
➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by 30 pts and Gemini 3.1 Pro Preview by 470 pts. GDPval-AA is Artificial Analysis' benchmark that leverages OpenAI's GDPval dataset to evaluate models on real-world e...
[27] GPT-5.5 (high) - Intelligence, Performance & Price Analysisartificialanalysis.ai
No, GPT-5.5 (high) is proprietary. The model weights are not publicly available. GPT-5.5 (high) is a proprietary model and OpenAI has not disclosed the model size or parameter count. GPT-5.5 (high) achieves a score of 59 on the Artificial Analysis Intellige...
[28] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance Comparedmindstudio.ai
GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on the same coding tasks — a structural difference, not a minor gap. On raw benchmark quality, both models are competitive. Neither dominates on every task type. For high-volume agentic coding pipeli...
[30] Comparison of AI Models across Intelligence, Performance, and Priceartificialanalysis.ai
Which is the most intelligent AI model? Claude Opus 4.7 (Adaptive Reasoning, Max Effort) currently leads the Artificial Analysis Intelligence Index with a score of 57, out of 347 models evaluated. What are the top AI models? The top AI models by Intelligenc...

トレンドを発見する

レポート公開済み2026年4月28日Last edited 2026年5月6日14 ソース

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6をベンチマークで比較

Studio Global AIで検索して事実確認 Discover からさらに閲覧する

16K0

まず結論：用途別の第一候補

用途	第一候補	根拠
総合性能・経済価値タスク	GPT-5.5	GPT-5.5 highはArtificial Analysis Intelligence Indexで59、GPT-5.5 xhighはGDPval-AAでElo 1785と報告されています。^[26]^[27]
深い推論、レビュー、専門タスク	Claude Opus 4.7	LLM Statsは、GPT-5.5との共通10ベンチマークでClaude Opus 4.7が6勝、GPT-5.5が4勝と整理しています。^[4]
端末操作、ブラウズ、長時間のツール利用	GPT-5.5	LLM Statsでは、Terminal-Bench 2.0、BrowseComp、OSWorld-Verified、CyberGymでGPT-5.5が強いと整理されています。^[4]
オープンウェイト系で速度と価格性能を重視	Kimi K2.6	Artificial Analysisのオープンモデル表では、Kimi K2.6がIntelligence 54、256k context、Price列$1.7、112 tokens/sです。^[23]
長文脈と低API価格を重視	DeepSeek V4 Pro / DeepSeek V4系	Artificial AnalysisではDeepSeek V4 Proが1M context、MashableではDeepSeek V4のAPI価格がGPT-5.5やClaude Opus 4.7より低い水準として報告されています。^[3]^[23]

4モデルの主要シグナル

モデル	ベンチマークで見える強み	価格・運用で見える特徴
GPT-5.5	GPT-5.5 highはArtificial Analysis Intelligence Indexで59。GPT-5.5 xhighはGDPval-AAでElo 1785とされ、Claude Opus 4.7 maxを約30ポイント上回ると報告されています。^[26]^[27]	MashableはAPI価格を100万入力トークンあたり$5、100万出力トークンあたり$30と報告しています。^[3]
Claude Opus 4.7	LLM Statsの共通10ベンチマーク整理では6勝4敗。Mashableの表ではSWE-Bench Pro 64.3%、GPQA Diamond 94.2%、HLE with tools 54.7%が報告されています。^[4]^[9]	MashableはAPI価格を100万入力トークンあたり$5、100万出力トークンあたり$25と報告しています。^[3]
Kimi K2.6	Artificial Analysisのオープンモデル表ではIntelligence 54。The DecoderはMoonshot AIの発表値として、HLE with Tools 54.0、SWE-Bench Pro 58.6、BrowseComp 83.2を報告しています。^[20]^[23]	Artificial Analysisの同表では256k context、Price列$1.7、112 tokens/sです。^[23]
DeepSeek V4 Pro	Artificial Analysisのオープンモデル表ではIntelligence 52。DataCampは、DeepSeek V4が純粋な能力ではGPT-5.5やClaude Opus 4.7を上回らないと整理しています。^[16]^[23]	Artificial Analysisの同表では1M context、Price列$2.2、36 tokens/s。MashableはDeepSeek V4のAPI価格を100万入力トークンあたり$1.74、100万出力トークンあたり$3.48と報告しています。^[3]^[23]

GPT-5.5 vs Claude Opus 4.7：フロンティア同士はタスクで分かれる

ベンチマーク	GPT-5.5	Claude Opus 4.7	Mashable表でのリード
SWE-Bench Pro	58.6%	64.3%	Claude Opus 4.7
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5
Humanity's Last Exam	40.6%	31.2%	GPT-5.5
Humanity's Last Exam with tools	52.2%	54.7%	Claude Opus 4.7
BrowseComp	84.4%	79.3%	GPT-5.5
GPQA Diamond	93.6%	94.2%	Claude Opus 4.7
ARC-AGI-1 Verified	94.5%	92.0%	GPT-5.5

Kimi K2.6 vs DeepSeek V4 Pro：オープンウェイト系は速度か文脈長か

指標	Kimi K2.6	DeepSeek V4 Pro
Artificial Analysis Intelligence	54	52
Context window	256k	1.00M
Price列	$1.7	$2.2
Output speed	112 tokens/s	36 tokens/s

価格比較では、数字の種類を混ぜない

価格を見るときは、少なくとも3種類の数字を分ける必要があります。

安全性・信頼性はベンチマークとは別軸

実運用では、1モデル固定よりルーティングが現実的

最終判断

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AIで検索して事実確認

重要なポイント

4モデルを完全同一条件で横比較した決定版ランキングとしては読まず、用途別に選ぶのが安全です。総合・経済タスクはGPT 5.5、推論・レビューはClaude Opus 4.7、オープンウェイトの速度はKimi K2.6、長文脈と低API価格はDeepSeek V4 Proが有力です。[4][23][26][27]
GPT 5.5はArtificial AnalysisのモデルページでIntelligence 59、GDPval AAでElo 1785とされ、Claude Opus 4.7はLLM Statsの共通10ベンチマーク整理で6勝4敗とされています。[4][26][27]
Kimi K2.6はArtificial Analysisのオープンモデル表でIntelligence 54・112 tokens/s、DeepSeek V4 ProはIntelligence 52・1M context。前者は速度、後者は文脈長とDeepSeek V4系の低API価格が目立ちます。[3][23]

人々も尋ねます

「GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6をベンチマークで比較」の短い答えは何ですか?

最初に検証する重要なポイントは何ですか?

次の実践では何をすればいいでしょうか？

次にどの関連トピックを検討すればよいでしょうか?

別の角度からの引用や追加の引用については、「Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6比較：2026年ベンチマークの結論」に進みます。

関連ページを開く

これを何と比較すればいいでしょうか？

この回答を「DeepSeek V4を技術面から読む：100万トークン、MoE、API移行の要点」と照合してください。

関連ページを開く

研究を続けてください

研究に関する会話

あなた2026年4月25日

リサーチクエスチョン

GPT-5.5、Claude Opus 4.7、DeepSeek V4、Kimi K2.6のベンチマークを調査し、比較してください。

Studio Global AI2026年4月25日38 ソース

引用された回答

Key findings

モデル	確認できたベンチマーク・性能情報	比較上の解釈
GPT-5.5	Artificial Analysis関連記事のスニペットでは、2026年4月24日時点で「新たな主要/首位AIモデル」とされている。^[3] LinkedInスニペットにも「OpenAIのGPT-5.5が新たな首位AIモデル」とある。^[34]	総合性能で非常に強い可能性はあるが、提供証拠には具体的なスコア、評価項目、競合平均との差がないため、定量順位は確定できない。^[3]^[34]
Claude Opus 4.7	Mashableスニペットでは、AnthropicがClaude Opus 4.7について「92%のhonesty rate」と「より少ないsycophancy」を主張している。^[4] SourceForgeスニペットでは、Claude Opus 4.7は高度なソフトウェアエンジニアリングと複雑な問題解決での改善を目的にしたAnthropicの最新モデルと説明されている。^[6]	安全性・誠実性系の指標では具体値があるが、GPT-5.5やKimi K2.6と同じ総合ベンチマークでの数値比較はできない。^[4]^[6]
DeepSeek V4	Artificial Analysisの公開モデル比較スニペットに「DeepSeek V4 Pro」の記載はあるが、スコアや速度、価格などの数値は提示されていない。^[7]	Insufficient evidence. 提供証拠だけではDeepSeek V4のベンチマーク順位・強み・弱みを判断できない。^[7]
Kimi K2.6	Artificial Analysisの公開モデル比較スニペットでは、Kimi K2.6のIntelligenceが54、コンテキスト長が256k、価格が$1.7、出力速度が112 tokens/sと示されている。^[7]	4モデル中で最も具体的な公開比較数値が確認できるが、GPT-5.5やClaude Opus 4.7と同一指標で直接比較できる証拠はない。^[7]

Evidence notes

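According to Stanford HAI's 2026 AI Index, as of February 2026 the top models on SWE-bench Verified were clustered in the low-to-mid 70% range: Claude 4.5 Opus high reasoning led at about 76.8%, and Kimi K2.5, GPT-5.2, and Gemini 3 Flash high reasoning fell within the 70% to 76% range.^[2]
However, this Stanford HAI passage does not cover Claude Opus 4.7, GPT-5.5, Kimi K2.6, or DeepSeek V4 themselves; it is only background on the preceding generation or on other models.^[2]
The provided SciCode Bench snippet mentions Claude Sonnet 4 high, Gemini 2.5 Pro, GPT-5-1-codex, Grok 4, and others, but no direct scores can be confirmed for the models under comparison: GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6.^[1]
GPT-5.5 is described as the "new leader," yet another snippet can be read as placing Claude Opus 4.7 first in an Artificial Analysis Intelligence context, so the evidence may be inconsistent because of differences in timing, evaluation set, or snippet context.^[3]^[34]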

Limitations / uncertainty

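Insufficient evidence. The provided evidence contains no table or scores comparing all four models on the same benchmark, on the same date, and with the same reasoning settings.^[1]^[3]^[5]^[7]^[34]
DeepSeek V4 is especially under-documented: the provided evidence confirms no quantitative information beyond the appearance of the model name "DeepSeek V4 Pro."^[7]
Claude Opus 4.7's "92% honesty rate" is a safety and honesty metric and cannot be treated on par with coding, math, agentic performance, or overall intelligence scores.^[4]
Kimi K2.6's Intelligence 54, 256k context, $1.7, and 112 tokens/s are useful comparison points, but because the same columns are not provided for GPT-5.5 or Claude Opus 4.7, no relative ranking can be derived.^[7]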
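The reasoning behind these limitations can be sketched in a few lines: a ranking on a metric is only meaningful when every model has a value in that column. The snippet below is an illustrative sketch, not part of the original research answer. The `EVIDENCE` table holds only the values this answer reports (with `None` for "not provided in the evidence"), and the helper names `missing_for` and `rank_on` are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Values as reported in this research answer; None means the evidence
# provided no figure for that model in that column.
EVIDENCE: Dict[str, Row] = {
    "GPT-5.5": {"intelligence": None, "output_tokens_per_sec": None},
    "Claude Opus 4.7": {"intelligence": None, "output_tokens_per_sec": None},
    "DeepSeek V4 Pro": {"intelligence": None, "output_tokens_per_sec": None},
    "Kimi K2.6": {"intelligence": 54, "output_tokens_per_sec": 112},
}


def missing_for(metric: str, evidence: Dict[str, Row]) -> List[str]:
    """Models that have no reported value for the metric."""
    return [m for m, row in evidence.items() if row.get(metric) is None]


def rank_on(metric: str, evidence: Dict[str, Row]) -> Optional[List[str]]:
    """Models sorted best-first, or None if any model lacks the metric."""
    if missing_for(metric, evidence):
        return None  # refuse to rank on incomplete columns
    return sorted(evidence, key=lambda m: evidence[m][metric], reverse=True)


if __name__ == "__main__":
    print(rank_on("intelligence", EVIDENCE))
    print(missing_for("intelligence", EVIDENCE))
```

Returning `None` instead of a partial ordering is a deliberate choice here: it mirrors the "insufficient evidence" conclusion rather than silently ranking the one model that happens to have numbers.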

Summary

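In the evidence provided so far, GPT-5.5 has the strongest signal for overall performance, but there are no numerical scores.^[3]^[34]
Claude Opus 4.7 has a specific value of 92% for safety and honesty, and improvements in software engineering and complex problem solving are claimed, but no overall score comparison is possible.^[4]^[6]
Kimi K2.6 has specific public comparison values (Intelligence 54, 256k context, $1.7, 112 tokens/s), but comparisons with the other three models on the same metrics are lacking.^[7]
DeepSeek V4 cannot be evaluated from the provided evidence alone.^[7]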

Sources

[2] DeepSeek is back among the leading open weights models with V4 Pro ... (artificialanalysis.ai)
Lower cost than frontier models, but high token usage keeps costs above most open weights peers: DeepSeek V4 Pro costs $1,071 to run the Artificial Analysis Intelligence Index, more than 4x cheaper than Claude Opus 4.7 ($4,811) but above several open weight...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks (llm-stats.com)
The Verdict On the 10 benchmarks both providers report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The leads cluster by category, not by overall quality: Opus 4.7 is ahead on the reasoning-heavy and review-grade tests (GPQA Diamond, HLE with and without to...
[9] OpenAI’s GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[15] Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safety (mashable.com)
Anthropic says Claude Opus 4...
[16] DeepSeek V4: Features, Benchmarks, and Comparisons - DataCamp (datacamp.com)
How large are the DeepSeek V4 models? DeepSeek uses a Mixture of Experts (MoE) architecture. The Pro model contains 1.6 trillion total parameters (49 billion active) and requires an 865GB download. The Flash model contains 284 billion parameters (13 billion...
[17] Introducing Claude Opus 4.7 (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[18] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
Footnotes 1. General Testing Details We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level. Unless otherwise specified, a...
[20] Open-weight Kimi K2.6 takes on GPT-5.4 and Claude Opus 4.6 with agent swarms (the-decoder.com)
Moonshot AI has released Kimi K2.6 as an open-weight model. It's built to match GPT-5.4 and Claude Opus 4.6 on coding benchmarks, and...
[23] Comparison of Open Source AI Models across Intelligence, Performance, Price, Context Window, and more | Artificial Analysis (artificialanalysis.ai)
Model Name Intelligence Parameters Context Window Price Output Speed (t/s) Weights Providers Provider Benchmarks --- --- --- --- Kimi logo Kimi K2.6 Kimi 54 1.0KB (32B active at inference time) 256k $1.7 112 🤗 Novita Kimi SiliconFlow +6 more View DeepSeek...
[26] OpenAI's GPT-5.5 is the new leading AI model - Artificial Analysis (artificialanalysis.ai)
Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by 30 pts and Gemini 3.1 Pro Preview by 470 pts. GDPval-AA is Artificial Analysis' benchmark that leverages OpenAI's GDPval dataset to evaluate models on real-world e...
[27] GPT-5.5 (high) - Intelligence, Performance & Price Analysis (artificialanalysis.ai)
No, GPT-5.5 (high) is proprietary. The model weights are not publicly available. GPT-5.5 (high) is a proprietary model and OpenAI has not disclosed the model size or parameter count. GPT-5.5 (high) achieves a score of 59 on the Artificial Analysis Intellige...
[28] GPT-5.5 vs Claude Opus 4.7: Real-World Coding Performance Compared (mindstudio.ai)
GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on the same coding tasks, a structural difference, not a minor gap. On raw benchmark quality, both models are competitive. Neither dominates on every task type. For high-volume agentic coding pipeli...
[30] Comparison of AI Models across Intelligence, Performance, and Price (artificialanalysis.ai)
Which is the most intelligent AI model? Claude Opus 4.7 (Adaptive Reasoning, Max Effort) currently leads the Artificial Analysis Intelligence Index with a score of 57, out of 347 models evaluated. What are the top AI models? The top AI models by Intelligenc...
