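Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 benchmark comparison: what should you actually look at?

No public source has scored all four models side by side on the same benchmarks, settings, and cost conditions, so they should not be forced into a 1-to-4 ranking. Claude Opus 4.7 has BenchLM 97/100 and SWE-bench Verified 82.4%, while GPT-5.5 has figures on different axes, such as GDPval 84.9% [2][3][29]. DeepSeek V4-Pro-Max shows MMLU-Pro 87.5% and GPQA Diamond 90.1%; Kimi K2.6 has BenchLM 85/100, Vals Accuracy 63.94% ± 1.97, and a cost of $0.21 per test, but the sources and settings have to be read separately [15][37][39].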

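Asking only "which one is the strongest" invites a misleading answer. The Vals AI listing shows DeepSeek V4 and GPT-5.5 as April 23, 2026 entries, Kimi K2.6 as April 20, and Claude Opus 4.7 as April 16 ^[19]. But the data currently available is scattered across different systems, including BenchLM, official announcements, DataCamp/Hugging Face, Vals, and Artificial Analysis, and none of them evaluates all four models with the same yardstick, the same settings, and the same cost conditions ^[2]^[3]^[15]^[16]^[28]^[29]^[36]^[37]^[39].

So the point of this comparison is not to force a 1-to-4 ranking but to break the question down: do you want to write code, do knowledge work, run agents, process financial documents, do scientific reasoning, or keep costs under control? The answer differs for each.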

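First things first: AI benchmarks are not a single exam

AI benchmarks in 2026 are better understood as a basket of capability tests. Kili Technology groups MMLU, MMLU-Pro, GPQA Diamond, SWE-Bench, Terminal-Bench, GAIA, WebArena, GDPval, safety evaluations, and others into separate capability axes ^[8]. Stanford HAI's AI Index likewise reports technical performance benchmark by benchmark, for example MMLU, MATH, GPQA Diamond, MMMU, OSWorld, AIME, and SWE-bench Verified ^[13].

General-knowledge tests such as MMLU in particular have lost much of their power to separate top models. Nanonets explains that MMLU is scored with 5-shot prompting and that by 2026 most top models are bunched above 88%, which makes it hard for any gap to open up ^[22]. In other words, a single overall score makes it easy to miss the point; before choosing a model, first be clear about what you actually need it to do ^[8]^[22].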
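The scoring rule Nanonets describes (correct answers divided by total questions) can be made concrete in a few lines of Python. This is a minimal sketch on made-up answer data; the function name `accuracy_percent` and the 25-question set are illustrative assumptions, not part of any benchmark harness.

```python
def accuracy_percent(predictions, answers):
    """Score = correct answers / total questions, expressed as a percentage."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Illustrative data only: 25 multiple-choice questions, 22 answered correctly.
answers = list("ABCDA" * 5)
predictions = answers.copy()
for i in (3, 11, 19):
    predictions[i] = "X"  # force three wrong answers

print(accuracy_percent(predictions, answers))  # 88.0
```

The 5-shot part only affects the prompt shown to the model before each question; the score itself is still this plain ratio.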

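Public figures at a glance

Model	Key figures visible in public sources	How to read them	Caveats
Claude Opus 4.7	BenchLM 97/100, provisional rank 2/110; SWE-bench Verified 82.4%; FinanceBench 82.7%; MathVista up 9.5 points ^[2]^[3]	Coding, composite leaderboard, financial document analysis, visual math reasoning	Anthropic's research-agent benchmark score of 0.715 is an internal evaluation and should not be measured on the same yardstick as GPT-5.5's GDPval and similar figures ^[7]^[29].
GPT-5.5	BenchLM 89/100, provisional rank 5/112; GDPval 84.9%; OSWorld-Verified 78.7%; Tau2-bench Telecom 98.0%; Vals Accuracy 67.76% ± 1.79 ^[28]^[29]^[31]	Knowledge work, computer operation, customer support workflows, agent-style tasks	OpenAI's official announcement, BenchLM, and the Vals Index are different evaluation systems; their numbers should not be added together or swapped for one another ^[28]^[29]^[31].
DeepSeek V4 / V4-Pro-Max	Vals AI entry dated April 23, 2026; V4-Pro-Max MMLU-Pro 87.5%, GPQA Diamond 90.1%, GSM8K 92.6% ^[15]^[19]	Candidate for scientific Q&A, math, and hard reasoning	DataCamp states that these figures are based on DeepSeek's internal results, so read them separately from independent leaderboards ^[15].
Kimi K2.6	BenchLM 85/100, provisional rank 12/115; Vals Accuracy 63.94% ± 1.97, Latency 373.57s, Cost/Test $0.21; Artificial Analysis Intelligence Index 54, 4th overall ^[36]^[37]^[39]	Open-weights route, cost and latency, operational efficiency	The sources use the names Kimi 2.6, Kimi K2.6, and K2.6 Thinking; confirm whether they refer to the same configuration ^[37]^[39].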

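Composite leaderboard: Claude ranks highest on BenchLM

Among the three models that have a BenchLM page in the available material, Claude Opus 4.7 ranks highest. BenchLM shows Claude Opus 4.7 in 2nd place among 110 models on the provisional leaderboard with an overall score of 97/100, and also 2nd among 14 models on the verified leaderboard ^[3].

GPT-5.5 is 5th among 112 models on the BenchLM provisional leaderboard with an overall score of 89/100, and 2nd among 16 models on the verified leaderboard ^[28]. Kimi 2.6 is 12th among 115 models on the BenchLM provisional leaderboard with an overall score of 85/100, with 27 published benchmark scores shown ^[37].

This ordering, however, reflects BenchLM alone. The three pages report pool sizes of 110, 112, and 115 models respectively, and the available material offers no equivalent BenchLM score for DeepSeek V4 to set beside them ^[3]^[28]^[37].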
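Because the three BenchLM pages quote different pool sizes, one way to read the ranks is as a share of each pool rather than as raw positions. The sketch below does that arithmetic on the figures quoted above; dividing rank by pool size is an illustrative assumption for this article, not BenchLM's own method, and `top_share` is a hypothetical helper.

```python
# Scores, ranks, and pool sizes as quoted from the three BenchLM provisional pages.
benchlm = {
    "Claude Opus 4.7": {"score": 97, "rank": 2, "pool": 110},
    "GPT-5.5": {"score": 89, "rank": 5, "pool": 112},
    "Kimi 2.6": {"score": 85, "rank": 12, "pool": 115},
}


def top_share(rank, pool):
    """Share of the pool at or above this rank, as a percentage."""
    return 100.0 * rank / pool


for model, row in benchlm.items():
    share = top_share(row["rank"], row["pool"])
    print(f"{model}: {row['score']}/100, rank {row['rank']}/{row['pool']}, top {share:.1f}%")
```

The order of the three models does not change under this reading; it only shows that the differing pool sizes are too similar to affect it.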

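Coding: Claude Opus 4.7 has the most direct public figures

If the focus is software engineering, bug fixing, and repository changes, Claude Opus 4.7's SWE-bench Verified figure is the clearest. MindStudio reports that Claude Opus 4.7 reached 82.4% on SWE-bench Verified, up roughly 11 percentage points from Opus 4.6 ^[2]. The same source lists Claude Opus 4.7's FinanceBench score as 82.7% and notes a 9.5-point rise on MathVista among its vision-related improvements ^[2].

For GPT-5.5, the metrics OpenAI's announcement highlights are not SWE-bench but work-oriented benchmarks such as GDPval, OSWorld-Verified, and Tau2-bench Telecom ^[29]. For Kimi K2.6, GMI Cloud claims a leading result on SWE-Bench Pro, but the public summary available is not enough to confirm an exact score or to show that the four models were compared under the same conditions ^[35]. For DeepSeek V4, the concrete figures in this material concern reasoning and math rather than a head-to-head coding comparison ^[15]^[16].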

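Work-oriented agents: GPT-5.5 has the most concrete official metrics

If what you care about is whether a model can complete knowledge work on its own, operate a computer environment, and handle customer support workflows, GPT-5.5's official figures are the most concentrated. OpenAI says GPT-5.5 scores 84.9% on GDPval, which tests agents' ability to produce well-specified knowledge work across 44 occupations ^[29]. OpenAI also lists 78.7% on OSWorld-Verified and 98.0% on Tau2-bench Telecom, which tests complex customer service workflows ^[29].

Claude Opus 4.7 has agent-style data as well. Anthropic reports that on its internal research-agent benchmark, Claude Opus 4.7 tied for the top overall score across six modules at 0.715, and that on the General Finance module it improved from Opus 4.6's 0.767 to 0.813 ^[7].

A caution, though: GPT-5.5's GDPval, OSWorld-Verified, and Tau2-bench results come from a different evaluation system than Claude Opus 4.7's internal Anthropic research-agent benchmark ^[7]^[29]. GPT-5.5's 84.9% and Claude's 0.715 cannot be compared as if they sat on the same score sheet ^[7]^[29].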

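Reasoning and knowledge: DeepSeek V4-Pro-Max and Kimi K2.6 Thinking share part of one table

The more concrete public figures for DeepSeek V4 come mainly from the V4-Pro-Max configuration. DataCamp reports that, according to DeepSeek's internal results, DeepSeek V4-Pro-Max scores 87.5% on MMLU-Pro, 90.1% on GPQA Diamond, and 92.6% on GSM8K math ^[15]. These figures are useful as a reference, but since DataCamp states explicitly that they are based on internal results, they should not carry the same evidential weight as an independent leaderboard ^[15].

The DeepSeek-V4-Pro page on Hugging Face places DeepSeek V4-Pro-Max and Kimi K2.6 Thinking in the same table for several knowledge and reasoning benchmarks ^[16]:

Benchmark	DeepSeek V4-Pro-Max	Kimi K2.6 Thinking	Higher in table
MMLU-Pro	87.5	87.1	DeepSeek V4-Pro-Max
SimpleQA-Verified	57.9	36.9	DeepSeek V4-Pro-Max
Chinese-SimpleQA	84.4	75.9	DeepSeek V4-Pro-Max
GPQA Diamond	90.1	90.5	Kimi K2.6 Thinking
HLE	37.7	36.4	DeepSeek V4-Pro-Max

On this table alone, DeepSeek V4-Pro-Max is higher on MMLU-Pro, SimpleQA-Verified, Chinese-SimpleQA, and HLE, while Kimi K2.6 Thinking is slightly higher on GPQA Diamond ^[16]. But the other models in that table are not Claude Opus 4.7 and GPT-5.5; they are Opus-4.6 Max, GPT-5.4 xHigh, and others, so the table is not enough to derive an overall ranking of the four models ^[16].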
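The per-benchmark margins behind the table above are easy to recompute, which helps separate near-ties from real gaps. The following is a small sketch that uses only the ten scores quoted in the table; the `compare` helper is hypothetical and written for this illustration.

```python
# Scores copied from the table above: (DeepSeek V4-Pro-Max, Kimi K2.6 Thinking).
scores = {
    "MMLU-Pro": (87.5, 87.1),
    "SimpleQA-Verified": (57.9, 36.9),
    "Chinese-SimpleQA": (84.4, 75.9),
    "GPQA Diamond": (90.1, 90.5),
    "HLE": (37.7, 36.4),
}


def compare(deepseek, kimi):
    """Return (name of the higher-scoring model, margin in points)."""
    margin = round(deepseek - kimi, 1)  # round away floating-point noise
    if margin > 0:
        return "DeepSeek V4-Pro-Max", margin
    if margin < 0:
        return "Kimi K2.6 Thinking", -margin
    return "tie", 0.0


for benchmark, (deepseek, kimi) in scores.items():
    leader, margin = compare(deepseek, kimi)
    print(f"{benchmark}: {leader} by {margin} points")
```

On these numbers the MMLU-Pro and GPQA Diamond margins are both 0.4 points in opposite directions, while the largest gap is the 21.0 points on SimpleQA-Verified.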

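Cost and latency: Kimi K2.6's operational metrics deserve a second look

Vals AI shows GPT-5.5 with an Accuracy of 67.76% ± 1.79, a Latency of 409.09s, and a Context Window of 1M ^[31]. Kimi K2.6 shows an Accuracy of 63.94% ± 1.97, a Latency of 373.57s, and a Cost/Test of $0.21 ^[39]. Comparing only these two Vals records, the displayed accuracy is higher for GPT-5.5 and the displayed latency is lower for Kimi K2.6 ^[31]^[39].

Kimi K2.6 also matters in another way for users who value open weights. Artificial Analysis describes Moonshot's Kimi K2.6 as the leading open weights model and lists it at 4th overall with an Artificial Analysis Intelligence Index of 54 ^[36]. But Artificial Analysis, Vals, and BenchLM are different evaluation systems; Kimi's 54, its Vals Accuracy of 63.94%, and its BenchLM 85/100 should not be added together into one total ^[36]^[37]^[39].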
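A quick check of whether the two Vals accuracy ranges overlap shows how much weight the displayed gap can bear. The sketch below assumes the ± value is a symmetric half-width around the reported accuracy; the Vals records quoted here do not say what the ± represents, so treat this as arithmetic on the displayed numbers rather than a significance test. The helper names are hypothetical.

```python
def interval(center, half_width):
    """Return (low, high) for a value reported as center ± half_width."""
    return (center - half_width, center + half_width)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Accuracy figures as quoted from the two Vals AI records.
gpt_55 = interval(67.76, 1.79)
kimi_k26 = interval(63.94, 1.97)

print(f"GPT-5.5:   {gpt_55[0]:.2f} to {gpt_55[1]:.2f}")
print(f"Kimi K2.6: {kimi_k26[0]:.2f} to {kimi_k26[1]:.2f}")
print("ranges overlap:", overlaps(gpt_55, kimi_k26))
```

With these inputs the ranges do not overlap, but the lower end for GPT-5.5 (65.97) sits only about 0.06 points above the upper end for Kimi K2.6 (65.91), so the displayed accuracy gap is narrow once the ± is taken into account.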

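Choosing a model in practice: break it down like this

Mainly coding or automated software fixes: Claude Opus 4.7 is worth trying first. In the public evidence available, SWE-bench Verified 82.4% and BenchLM 97/100 are its clearest strengths ^[2]^[3].
Mainly knowledge work, computer operation, or customer support workflows: GPT-5.5's GDPval 84.9%, OSWorld-Verified 78.7%, and Tau2-bench Telecom 98.0% are the more direct official metrics ^[29].
Mainly comparing candidates for scientific Q&A, math, and hard reasoning: look at DeepSeek V4-Pro-Max and Kimi K2.6 Thinking together on MMLU-Pro, GPQA Diamond, HLE, and similar benchmarks ^[15]^[16].
Mainly open weights, cost, and latency: Kimi K2.6's open-weights assessment from Artificial Analysis, together with the Vals figures of $0.21 per test and 373.57s latency, are practical reference points ^[36]^[39].
Do not rely on a single MMLU score: in 2026 the top models sit too close together at the high end of MMLU, and it no longer separates them well ^[22].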

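Final verdict

On the public evidence available, Claude Opus 4.7's advantage is clearest in coding and on the BenchLM composite leaderboard; GPT-5.5 stands out on knowledge work, computer use, and work-oriented agent metrics; DeepSeek V4-Pro-Max has the more concrete public figures for reasoning and math; and Kimi K2.6 is the one to watch on open weights, cost, and latency ^[2]^[3]^[15]^[16]^[28]^[29]^[36]^[37]^[39].

But there is still no public head-to-head evaluation of the four models that is fully fair, run under the same conditions, and at the same cost. For real deployment, use these benchmarks as a starting point and then test against your own workloads, for example coding, financial document analysis, browser or computer control, customer support, and long-running agent execution. Choosing a model this way is far safer than chasing a single "number one" ^[8]^[22].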

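Key takeaways

No public source has scored all four models side by side on the same benchmarks, settings, and cost conditions, so they should not be forced into a 1-to-4 ranking. Claude Opus 4.7 has BenchLM 97/100 and SWE-bench Verified 82.4%, while GPT-5.5 has figures on different axes, such as GDPval 84.9% [2][3][29].
DeepSeek V4-Pro-Max shows MMLU-Pro 87.5% and GPQA Diamond 90.1%; Kimi K2.6 has BenchLM 85/100, Vals Accuracy 63.94% ± 1.97, and a cost of $0.21 per test, but the sources and settings have to be read separately [15][37][39].
In practice, choose by task: look at Claude first for coding, GPT-5.5 first for knowledge work and computer operation, compare DeepSeek and Kimi as candidates for scientific reasoning, and keep an eye on Kimi if open weights and cost matter most [8][16][29][36].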

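Research conversation

You, April 26, 2026

Research question

Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6 and compare them as comprehensively as possible.

Studio Global AI, April 26, 2026, 40 sources

Cited answer

The evidence provided falls well short of what a "comprehensive benchmark comparison" treating the four models on an equal footing would need. The verifiable quantitative benchmarks are concentrated almost entirely on Claude Opus 4.7, and for GPT-5.5, DeepSeek V4, and Kimi K2.6 the evidence provided contains no directly comparable official scores or consistent benchmark results. The comparison below therefore separates what can be verified from what lacks evidence.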

Key findings

Claude Opus 4.7 is the only model in the material provided with concrete benchmark figures. It scored 82.4% on SWE-bench Verified, presented as a rise of roughly 11 points over Opus 4.6 ^[6].
Claude Opus 4.7 is listed by BenchLM at 2nd of 110 models on the provisional leaderboard with an overall score of 97/100, and is also reported as 2nd of 14 models on the verified leaderboard ^[7].
Claude Opus 4.7 showed large vision-related gains, and one source reports a 9.5-point rise on MathVista ^[6].
Kimi K2.6 appears in the Vals AI model list as a Moonshot AI model, shown in the snippet provided as an April 20, 2026 entry ^[5]. That evidence, however, contains no actual benchmark scores for Kimi K2.6 ^[5].
Claude Opus 4.7 likewise appears in the Vals AI list as an Anthropic model, shown in the snippet provided as an April 16, 2026 entry ^[5].
For GPT-5.5, the evidence provided contains no direct benchmark scores, official announcement, or leaderboard results. Insufficient evidence.
For DeepSeek V4, the evidence provided includes a Reddit snippet, but it is closer to advice on local hosting and operations and gives no comparable benchmark figures ^[40]. Insufficient evidence.

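Comparison table

Model	Verifiable benchmarks / rankings	Areas that can be read as strengths	Evidence level
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	Coding, visual math reasoning, composite leaderboard performance ^[6]^[7]	Relatively high
GPT-5.5	No direct scores in the evidence provided	Cannot be judged	Insufficient evidence
DeepSeek V4	No direct scores in the evidence provided	Cannot be judged	Insufficient evidence
Kimi K2.6	Model listing and date entry confirmed, but no scores ^[5]	Cannot be judged	Low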

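Interpretation by benchmark

On SWE-bench Verified, only Claude Opus 4.7 has a comparable figure. Its 82.4% is presented in the material provided as an improvement of roughly 11 points over Opus 4.6 ^[6].
On composite leaderboards, too, only Claude Opus 4.7 has a clear ranking. BenchLM lists Claude Opus 4.7 at 2nd of 110 models on the provisional leaderboard with an overall score of 97/100 ^[7].
In vision and math, one source reports that Claude Opus 4.7's MathVista score rose by 9.5 points, but no equivalent metric is provided for GPT-5.5, DeepSeek V4, or Kimi K2.6, so a relative comparison is not possible ^[6].
For general knowledge and reasoning, MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, and Arena Elo are cited as the main evaluations as of 2026 ^[2]. It has also been noted, however, that top models already score above 88% on MMLU, which weakens its discriminating power in 2026 ^[3].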

Evidence notes

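Stanford HAI's 2026 AI Index technical performance material covers long-run trends in AI performance relative to human performance across benchmarks such as ImageNet, VQA, SuperGLUE, and SQuAD 2.0 ^[1]. It serves as background for interpreting benchmarks rather than as a direct comparison of the four models ^[1].
The Kili Technology material names MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, and Arena Elo among the main benchmarks of 2026 and explains that benchmarks alone cannot fully describe model performance ^[2].
The Nanonets material points out that MMLU is scored with 5-shot prompting and that, with top models scoring above 88%, its discriminating power has dropped in 2026 ^[3].
The only evidence provided on DeepSeek V4 is a Reddit snippet, which is not a high-reliability official, academic, or independent leaderboard source ^[40].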

Limitations / uncertainty

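For GPT-5.5, DeepSeek V4, and Kimi K2.6, the evidence provided does not allow scores to be compared on shared benchmark axes such as SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, or Arena Elo. Insufficient evidence.
Claude Opus 4.7's figures may also rest on different evaluation methods and leaderboard formulas depending on the source, so BenchLM's 97/100 and ranking and the SWE-bench Verified 82.4% should be read as results from different evaluation systems ^[6]^[7].
Kimi K2.6's existence and date entry are confirmed, but the evidence provided contains no performance scores ^[5]. Insufficient evidence.
No benchmark figures for DeepSeek V4 can be confirmed in the evidence provided ^[40]. Insufficient evidence.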

Summary

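On the evidence currently provided, Claude Opus 4.7 is the only one of the four models with concrete, comparable performance data. With SWE-bench Verified 82.4%, a BenchLM provisional rank of 2nd of 110 models, and an overall score of 97/100, it can be regarded as a very strong model in coding and composite evaluation ^[6]^[7].

GPT-5.5, DeepSeek V4, and Kimi K2.6, by contrast, have no direct benchmark scores in the evidence provided, so no conclusion can be drawn that they are higher or lower than Claude Opus 4.7. Insufficient evidence.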

Sources

[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ... (mindstudio.ai)
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[7] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limits (kili-technology.com)
Kili Technology · Apr 13, 2026. AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough. What Are the Most Important AI Benchmarks in 2026? General knowledge and reasoni...
[13] [PDF] Technical Performance - Stanford HAI (hai.stanford.edu)
Technical Performance. Benchmarks vs. Human Performance: Image classification (ImageNet Top-5), Visual reasoning (VQA), English language understanding (SuperGLU...
[15] DeepSeek V4: Features, Benchmarks, and Comparisons (datacamp.com)
DeepSeek V4 Benchmarks. According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Face (huggingface.co)
Columns: Opus-4.6 Max, GPT-5.4 xHigh, Gemini-3.1-Pro High, K2.6 Thinking, GLM-5.1 Thinking, DS-V4-Pro Max. Knowledge & Reasoning: MMLU-Pro (EM) 89.1, 87.5, 91.0, 87.1, 86.0, 87.5; SimpleQA-Verified (Pass@1) 46.2, 45.3, 75.6, 36.9, 38.1, 57.9; Chinese-SimpleQA (Pass@1...
[19] DeepSeek V4 - Vals AI (vals.ai)
Release date, Models: 4/23/2026 DeepSeek DeepSeek V4; 4/23/2026 OpenAI GPT 5.5; 4/20/2026 Moonshot AI Kimi K2.6; 4/16/2026 Anthropic Claude...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elo (nanonets.com)
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAI (openai.com)
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[31] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6; 2/16/2026 Alibaba Qwen 3.5 Plus; 2/12/2026 MiniMax MiniMax-M2.5; 2/12/2026 MiniMax MiniMax-M2.5; 2/11/2026 zAI GLM 5; 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking); 2/5/2026 Anthropic Claude Opus 4.6 (Thinking); 1/26/2026...
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Access (gmicloud.ai)
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ... 3 days ago
[36] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ... 5 days ago
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard (12 of 115) with 27 published benchmark scores currently shown ... 6 days ago
[39] Kimi K2.6 - Vals AI (vals.ai)
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.

Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 benchmark 比較：真正要睇咩？

Studio Global AI로 검색 및 팩트체크 Discover에서 더 많은 것을 찾아보세요

16K0

所以，今次比較嘅重點唔係硬排 1 至 4，而係拆開睇：你要寫 code、做知識工作、跑 agent、處理金融文件、做科學推理，抑或想控制成本？答案會唔同。

先講清楚：AI benchmark 唔係一場單一考試

公開數字一覽

模型	公開資料中可見重點數字	可以點讀	小心位
Claude Opus 4.7	BenchLM 97/100，provisional 第 2/110；SWE-bench Verified 82.4%；FinanceBench 82.7%；MathVista 上升 9.5 分 ^[2]^[3]	coding、綜合 leaderboard、金融文件分析、視覺數學推理	Anthropic 的 research-agent benchmark 0.715 屬內部評估，唔應直接同 GPT-5.5 的 GDPval 等數字當同一把尺比較 ^[7]^[29]。
GPT-5.5	BenchLM 89/100，provisional 第 5/112；GDPval 84.9%；OSWorld-Verified 78.7%；Tau2-bench Telecom 98.0%；Vals Accuracy 67.76% ± 1.79 ^[28]^[29]^[31]	知識工作、電腦操作、客戶支援流程、agent 型任務	OpenAI 官方發布、BenchLM、Vals Index 係不同評估體系，唔應直接相加或互換 ^[28]^[29]^[31]。
DeepSeek V4 / V4-Pro-Max	Vals AI 2026 年 4 月 23 日項目；V4-Pro-Max MMLU-Pro 87.5%、GPQA Diamond 90.1%、GSM8K 92.6% ^[15]^[19]	科學問答、數學、高難度推理候選	DataCamp 說明相關數字基於 DeepSeek 內部結果，解讀時要同獨立 leaderboard 分開 ^[15]。
Kimi K2.6	BenchLM 85/100，provisional 第 12/115；Vals Accuracy 63.94% ± 1.97、Latency 373.57s、Cost/Test 0.21 美元；Artificial Analysis Intelligence Index 54、整體第 4 ^[36]^[37]^[39]	開放權重路線、成本與延遲、營運效率	資料中有 Kimi 2.6、Kimi K2.6、K2.6 Thinking 等叫法，要確認是否同一設定 ^[37]^[39]。

綜合 leaderboard：BenchLM 入面 Claude 較前

不過，這個排序只係 BenchLM 角度。三個頁面比較樣本數分別是 110、112、115，而且目前資料未能提供 DeepSeek V4 同等 BenchLM 分數放埋一齊比 ^[3]^[28]^[37]。

Coding：Claude Opus 4.7 的公開數字最直接

工作型 agent：GPT-5.5 的官方指標最具體

推理與知識：DeepSeek V4-Pro-Max 同 Kimi K2.6 Thinking 有部分同表資料

Hugging Face 的 DeepSeek-V4-Pro 資料，將 DeepSeek V4-Pro-Max 同 Kimi K2.6 Thinking 放在同一表格的部分知識／推理項目中比較 ^[16]：

Benchmark	DeepSeek V4-Pro-Max	Kimi K2.6 Thinking	表內較高者
MMLU-Pro	87.5	87.1	DeepSeek V4-Pro-Max
SimpleQA-Verified	57.9	36.9	DeepSeek V4-Pro-Max
Chinese-SimpleQA	84.4	75.9	DeepSeek V4-Pro-Max
GPQA Diamond	90.1	90.5	Kimi K2.6 Thinking
HLE	37.7	36.4	DeepSeek V4-Pro-Max

成本與延遲：Kimi K2.6 的營運指標值得望多眼

實務揀模型：可以咁樣拆開睇

主要做 coding／自動修補軟件問題：Claude Opus 4.7 值得先試。現有公開根據中，SWE-bench Verified 82.4% 同 BenchLM 97/100 是最清晰的強項數字 ^[2]^[3]。
主要做知識工作、電腦操作、客戶支援流程：GPT-5.5 的 GDPval 84.9%、OSWorld-Verified 78.7%、Tau2-bench Telecom 98.0% 是較直接的官方指標 ^[29]。
主要比較科學問答、數學、高難度推理候選：可一併看 DeepSeek V4-Pro-Max 與 Kimi K2.6 Thinking 的 MMLU-Pro、GPQA Diamond、HLE 等項目 ^[15]^[16]。
主要重視開放權重、成本與延遲：Kimi K2.6 的 Artificial Analysis open weights 評價，以及 Vals 的每次測試 0.21 美元、373.57s 延遲，會是實用參考點 ^[36]^[39]。
唔好只靠 MMLU 一個分數：2026 年頂級模型在 MMLU 高分區太接近，分辨力已下降 ^[22]。

最後判斷

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AI로 검색 및 팩트체크

주요 시사점

公開資料未有用同一套 benchmark、設定同成本條件把四個模型一次過橫向評分，所以唔應該硬排 1 至 4；Claude Opus 4.7 有 BenchLM 97/100、SWE bench Verified 82.4%，GPT 5.5 則有 GDPval 84.9% 等不同軸數字 [2][3][29]。
DeepSeek V4 Pro Max 顯示 MMLU Pro 87.5%、GPQA Diamond 90.1%；Kimi K2.6 有 BenchLM 85/100、Vals Accuracy 63.94% ± 1.97、每次測試 0.21 美元等資料，但來源與設定要分開睇 [15][37][39]。
實務揀模型應按任務拆開：coding 可先睇 Claude，知識工作／電腦操作先睇 GPT 5.5，科學推理候選比較 DeepSeek/Kimi，重視開放權重同成本則留意 Kimi [8][16][29][36]。

사람들은 또한 묻습니다.

"Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 benchmark 比較：真正要睇咩？"에 대한 짧은 대답은 무엇입니까?

먼저 검증할 핵심 포인트는 무엇인가요?

실무에서는 다음으로 무엇을 해야 합니까?

다음에는 어떤 관련 주제를 탐구해야 할까요?

다른 각도와 추가 인용을 보려면 "Claude Opus 4.7·GPT-5.5·DeepSeek V4·Kimi K2.6 벤치마크 비교: 2026년 판정"으로 계속하세요.

관련 페이지 열기

이것을 무엇과 비교해야 합니까?

"DeepSeek V4 엔지니어링 해설: 100만 토큰, MoE, API 이전 체크포인트"에 대해 이 답변을 대조 확인하세요.

관련 페이지 열기

연구를 계속하세요

Claude Opus 4.7·GPT-5.5·DeepSeek V4·Kimi K2.6 벤치마크 비교: 2026년 판정

Claude Opus 4.7 vs GPT-5.5 vs DeepSeek V4 vs Kimi K2.6: 2026 벤치마크 비교

DeepSeek V4 엔지니어링 해설: 100만 토큰, MoE, API 이전 체크포인트

Northwest vs. Southeast Timber 탄력성 풀이: 왜 답은 larger; larger일까

Northwest vs. Southeast Timber: 대체성이 높을수록 왜 larger; larger인가

DeepSeek가 OpenAI·Claude·Gemini·Grok을 이길까: 답은 비용과 신뢰에 있다

DeepSeek가 OpenAI·Claude·Gemini·Grok을 이길까

연구 대화

당신2026년 4월 26일

연구문제

Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6의 벤치마크를 조사하고 가능한 한 종합적으로 비교해 주세요.

Studio Global AI2026년 4월 26일40 소스

인용 답변

Key findings

Claude Opus 4.7은 제공된 자료 중 유일하게 구체적인 벤치마크 수치가 확인되는 모델입니다. SWE-bench Verified에서 82.4%를 기록했고, 이는 Opus 4.6 대비 약 11점 상승한 것으로 제시됩니다 ^[6].
Claude Opus 4.7은 BenchLM 기준 provisional leaderboard에서 110개 모델 중 2위, overall score 97/100으로 제시되며, verified leaderboard에서도 14개 모델 중 2위로 언급됩니다 ^[7].
Claude Opus 4.7은 비전 관련 개선 폭이 컸고, MathVista가 9.5점 상승했다는 자료가 있습니다 ^[6].
Kimi K2.6은 Vals AI의 모델 목록에서 Moonshot AI 모델로 확인되며, 제공된 스니펫에는 2026년 4월 20일 항목으로 표시됩니다 ^[5]. 다만 이 근거에는 Kimi K2.6의 실제 벤치마크 점수가 포함되어 있지 않습니다 ^[5].
Claude Opus 4.7 역시 Vals AI 목록에서 Anthropic 모델로 확인되며, 제공된 스니펫에는 2026년 4월 16일 항목으로 표시됩니다 ^[5].
GPT-5.5에 대해서는 제공된 근거 안에 직접적인 벤치마크 점수, 공식 발표, 리더보드 결과가 없습니다. Insufficient evidence.
DeepSeek V4에 대해서는 제공된 근거 중 Reddit 스니펫이 있으나, 해당 스니펫은 로컬 호스팅과 운영상 조언에 가까우며 비교 가능한 벤치마크 수치를 제공하지 않습니다 ^[40]. Insufficient evidence.

비교 표

모델	확인 가능한 벤치마크 / 순위	강점으로 볼 수 있는 영역	증거 수준
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	코딩, 비전 수학 추론, 종합 리더보드 성능 ^[6]^[7]	상대적으로 높음
GPT-5.5	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
DeepSeek V4	제공 근거 내 직접 점수 없음	판단 불가	Insufficient evidence
Kimi K2.6	모델 목록 및 날짜 항목은 확인되나 점수 없음 ^[5]	판단 불가	낮음

벤치마크별 해석

SWE-bench Verified 기준으로는 Claude Opus 4.7만 비교 가능한 수치가 있습니다. Claude Opus 4.7의 82.4%는 제공 자료상 Opus 4.6 대비 약 11점 개선된 결과로 제시됩니다 ^[6].
종합 리더보드 기준으로도 Claude Opus 4.7만 명확한 순위가 있습니다. BenchLM은 Claude Opus 4.7을 provisional leaderboard 110개 모델 중 2위, overall score 97/100으로 제시합니다 ^[7].
비전·수학 영역에서는 Claude Opus 4.7의 MathVista 점수가 9.5점 상승했다는 자료가 있으나, GPT-5.5·DeepSeek V4·Kimi K2.6의 동등 지표가 제공되지 않아 상대 비교는 불가능합니다 ^[6].
일반 지식·추론 벤치마크는 2026년 기준 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등이 주요 평가로 언급됩니다 ^[2]. 다만 MMLU는 상위 모델들이 이미 88% 이상을 기록해 2026년에는 변별력이 약하다는 지적이 있습니다 ^[3].

Evidence notes

Stanford HAI의 2026 AI Index 기술 성능 자료는 ImageNet, VQA, SuperGLUE, SQuAD 2.0 등 여러 벤치마크에서 AI 성능이 인간 성능과 비교되는 장기 추세를 다룹니다 ^[1]. 이는 개별 네 모델의 직접 비교 자료라기보다는 벤치마크 해석의 배경 자료에 가깝습니다 ^[1].
Kili Technology 자료는 2026년 주요 벤치마크로 MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, Arena Elo 등을 언급하면서 벤치마크만으로 모델 성능을 충분히 설명하기 어렵다고 설명합니다 ^[2].
Nanonets 자료는 MMLU가 5-shot 방식으로 계산된다는 점과, 상위 모델들이 88% 이상을 기록하면서 2026년에는 변별력이 낮아졌다는 점을 지적합니다 ^[3].
DeepSeek V4 관련 제공 근거는 Reddit 스니펫뿐이며, 신뢰도 높은 공식·학술·독립 리더보드 근거가 아닙니다 ^[40].

Limitations / uncertainty

GPT-5.5, DeepSeek V4, Kimi K2.6에 대해 제공된 근거만으로는 SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, Arena Elo 같은 동일 벤치마크 축에서 점수를 비교할 수 없습니다. Insufficient evidence.
Claude Opus 4.7의 수치도 출처별로 평가 방법과 리더보드 산식이 다를 수 있으므로, BenchLM의 97/100 및 순위와 SWE-bench Verified 82.4%는 서로 다른 평가 체계의 결과로 해석해야 합니다 ^[6]^[7].
Kimi K2.6은 모델 존재와 날짜 항목은 확인되지만, 제공 근거 안에는 성능 점수가 없습니다 ^[5]. Insufficient evidence.
DeepSeek V4는 제공 근거 안에서 벤치마크 수치가 확인되지 않습니다 ^[40]. Insufficient evidence.

Summary

출처

[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limitskili-technology.com
Image 2: Kili Technology.png) Kili Technology · Apr 13, 2026 Image 3: AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough Table of contents Introduction What Are the Most Important AI Benchmarks in 2026? General knowledge and reasoni...
[13] [PDF] Technical Performance - Stanford HAIhai.stanford.edu
Technical Performance Benchmarks vs. Human Performance 76 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0% 20% 40% 60% 80% 100% 120% Image classiǇcation (ImageNet Top-5) Visual reasoning (VQA) English language understanding (SuperGLU...
[15] DeepSeek V4: Features, Benchmarks, and Comparisonsdatacamp.com
DeepSeek V4 Benchmarks According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Facehuggingface.co
Opus-4.6 Max GPT-5.4 xHigh Gemini-3.1-Pro High K2.6 Thinking GLM-5.1 Thinking DS-V4-Pro Max :---: :---: :---: Knowledge & Reasoning MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5 SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9 Chinese-SimpleQA (Pass@1...
[19] DeepSeek V4 - Vals AIvals.ai
Benchmarks Models Comparison Model Guide App Reports News About Benchmarks Models Comparison Model Guide App Reports About Release date Models 4/23/2026 DeepSeek DeepSeek V4 4/23/2026 OpenAI GPT 5.5 4/20/2026 Moonshot AI Kimi K2.6 4/16/2026 Anthropic Claude...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elonanonets.com
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAIopenai.com
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. OnGDPval⁠⁠, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[31] GPT 5.5 - Vals AIvals.ai
2/17/2026 Anthropic Claude Sonnet 4.6 2/16/2026 Alibaba Qwen 3.5 Plus 2/12/2026 MiniMax MiniMax-M2.5 2/12/2026 MiniMax MiniMax-M2.5 2/11/2026 zAI GLM 5 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking) 2/5/2026 Anthropic Claude Opus 4.6 (Thinking) 1/26/2026...
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Accessgmicloud.ai
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ... 3 days ago
[36] Kimi K2.6: The new leading open weights model - Artificial Analysisartificialanalysis.ai
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ... 5 days ago
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard ( 12 of 115) with 27 published benchmark scores currently shown ... 6 days ago
[39] Kimi K2.6 - Vals AIvals.ai
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.

Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 benchmark 比較：真正要睇咩？

Studio Global AI로 검색 및 팩트체크 Discover에서 더 많은 것을 찾아보세요

16K0

所以，今次比較嘅重點唔係硬排 1 至 4，而係拆開睇：你要寫 code、做知識工作、跑 agent、處理金融文件、做科學推理，抑或想控制成本？答案會唔同。

先講清楚：AI benchmark 唔係一場單一考試

公開數字一覽

模型	公開資料中可見重點數字	可以點讀	小心位
Claude Opus 4.7	BenchLM 97/100，provisional 第 2/110；SWE-bench Verified 82.4%；FinanceBench 82.7%；MathVista 上升 9.5 分 ^[2]^[3]	coding、綜合 leaderboard、金融文件分析、視覺數學推理	Anthropic 的 research-agent benchmark 0.715 屬內部評估，唔應直接同 GPT-5.5 的 GDPval 等數字當同一把尺比較 ^[7]^[29]。
GPT-5.5	BenchLM 89/100，provisional 第 5/112；GDPval 84.9%；OSWorld-Verified 78.7%；Tau2-bench Telecom 98.0%；Vals Accuracy 67.76% ± 1.79 ^[28]^[29]^[31]	知識工作、電腦操作、客戶支援流程、agent 型任務	OpenAI 官方發布、BenchLM、Vals Index 係不同評估體系，唔應直接相加或互換 ^[28]^[29]^[31]。
DeepSeek V4 / V4-Pro-Max	Vals AI 2026 年 4 月 23 日項目；V4-Pro-Max MMLU-Pro 87.5%、GPQA Diamond 90.1%、GSM8K 92.6% ^[15]^[19]	科學問答、數學、高難度推理候選	DataCamp 說明相關數字基於 DeepSeek 內部結果，解讀時要同獨立 leaderboard 分開 ^[15]。
Kimi K2.6	BenchLM 85/100，provisional 第 12/115；Vals Accuracy 63.94% ± 1.97、Latency 373.57s、Cost/Test 0.21 美元；Artificial Analysis Intelligence Index 54、整體第 4 ^[36]^[37]^[39]	開放權重路線、成本與延遲、營運效率	資料中有 Kimi 2.6、Kimi K2.6、K2.6 Thinking 等叫法，要確認是否同一設定 ^[37]^[39]。

綜合 leaderboard：BenchLM 入面 Claude 較前

不過，這個排序只係 BenchLM 角度。三個頁面比較樣本數分別是 110、112、115，而且目前資料未能提供 DeepSeek V4 同等 BenchLM 分數放埋一齊比 ^[3]^[28]^[37]。

Coding：Claude Opus 4.7 的公開數字最直接

工作型 agent：GPT-5.5 的官方指標最具體

推理與知識：DeepSeek V4-Pro-Max 同 Kimi K2.6 Thinking 有部分同表資料

Hugging Face 的 DeepSeek-V4-Pro 資料，將 DeepSeek V4-Pro-Max 同 Kimi K2.6 Thinking 放在同一表格的部分知識／推理項目中比較 ^[16]：

Benchmark	DeepSeek V4-Pro-Max	Kimi K2.6 Thinking	表內較高者
MMLU-Pro	87.5	87.1	DeepSeek V4-Pro-Max
SimpleQA-Verified	57.9	36.9	DeepSeek V4-Pro-Max
Chinese-SimpleQA	84.4	75.9	DeepSeek V4-Pro-Max
GPQA Diamond	90.1	90.5	Kimi K2.6 Thinking
HLE	37.7	36.4	DeepSeek V4-Pro-Max

成本與延遲：Kimi K2.6 的營運指標值得望多眼

實務揀模型：可以咁樣拆開睇

主要做 coding／自動修補軟件問題：Claude Opus 4.7 值得先試。現有公開根據中，SWE-bench Verified 82.4% 同 BenchLM 97/100 是最清晰的強項數字 ^[2]^[3]。
主要做知識工作、電腦操作、客戶支援流程：GPT-5.5 的 GDPval 84.9%、OSWorld-Verified 78.7%、Tau2-bench Telecom 98.0% 是較直接的官方指標 ^[29]。
主要比較科學問答、數學、高難度推理候選：可一併看 DeepSeek V4-Pro-Max 與 Kimi K2.6 Thinking 的 MMLU-Pro、GPQA Diamond、HLE 等項目 ^[15]^[16]。
主要重視開放權重、成本與延遲：Kimi K2.6 的 Artificial Analysis open weights 評價，以及 Vals 的每次測試 0.21 美元、373.57s 延遲，會是實用參考點 ^[36]^[39]。
唔好只靠 MMLU 一個分數：2026 年頂級模型在 MMLU 高分區太接近，分辨力已下降 ^[22]。

最後判斷

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Studio Global AI로 검색 및 팩트체크

주요 시사점

公開資料未有用同一套 benchmark、設定同成本條件把四個模型一次過橫向評分，所以唔應該硬排 1 至 4；Claude Opus 4.7 有 BenchLM 97/100、SWE bench Verified 82.4%，GPT 5.5 則有 GDPval 84.9% 等不同軸數字 [2][3][29]。
DeepSeek V4 Pro Max 顯示 MMLU Pro 87.5%、GPQA Diamond 90.1%；Kimi K2.6 有 BenchLM 85/100、Vals Accuracy 63.94% ± 1.97、每次測試 0.21 美元等資料，但來源與設定要分開睇 [15][37][39]。
實務揀模型應按任務拆開：coding 可先睇 Claude，知識工作／電腦操作先睇 GPT 5.5，科學推理候選比較 DeepSeek/Kimi，重視開放權重同成本則留意 Kimi [8][16][29][36]。


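Research conversation

You · April 26, 2026

Research question

Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible.

Studio Global AI · April 26, 2026 · 40 sources

Cited answer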

Key findings

Claude Opus 4.7 is the only model in the provided material with concrete benchmark figures. It scored 82.4% on SWE-bench Verified, presented as a gain of roughly 11 points over Opus 4.6 ^[6].
Claude Opus 4.7 ranks 2nd of 110 models on BenchLM's provisional leaderboard with an overall score of 97/100, and is also listed 2nd of 14 models on the verified leaderboard ^[7].
Claude Opus 4.7 showed large vision-related gains; one source reports that MathVista rose 9.5 points ^[6].
Kimi K2.6 appears in Vals AI's model list as a Moonshot AI model, and the provided snippet shows it as an April 20, 2026 entry ^[5]. That evidence does not, however, include actual benchmark scores for Kimi K2.6 ^[5].
Claude Opus 4.7 likewise appears in the Vals AI list as an Anthropic model, shown in the provided snippet as an April 16, 2026 entry ^[5].
For GPT-5.5, the provided evidence contains no direct benchmark scores, official announcement, or leaderboard results. Insufficient evidence.
For DeepSeek V4, the provided evidence includes a Reddit snippet, but it is closer to local-hosting and operational advice and offers no comparable benchmark figures ^[40]. Insufficient evidence.

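Comparison table

Model	Verifiable benchmarks / rankings	Areas that look like strengths	Evidence level
Claude Opus 4.7	SWE-bench Verified 82.4% ^[6]; BenchLM provisional 2/110, 97/100 ^[7]; verified 2/14 ^[7]	Coding, visual math reasoning, composite leaderboard performance ^[6]^[7]	Relatively high
GPT-5.5	No direct scores in the provided evidence	Cannot be judged	Insufficient evidence
DeepSeek V4	No direct scores in the provided evidence	Cannot be judged	Insufficient evidence
Kimi K2.6	Model listing and date entry confirmed, but no scores ^[5]	Cannot be judged	Low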

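Interpretation by benchmark

On SWE-bench Verified, only Claude Opus 4.7 has a comparable figure. Its 82.4% is presented in the provided material as roughly an 11-point improvement over Opus 4.6 ^[6].
On composite leaderboards, too, only Claude Opus 4.7 has a clear ranking. BenchLM places it 2nd of 110 models on the provisional leaderboard with an overall score of 97/100 ^[7].
In vision and math, one source reports that Claude Opus 4.7's MathVista score rose 9.5 points, but no equivalent metric is provided for GPT-5.5, DeepSeek V4, or Kimi K2.6, so a relative comparison is not possible ^[6].
For general knowledge and reasoning, MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, and Arena Elo are cited as the main evaluations in 2026 ^[2]. However, top models already score 88% or higher on MMLU, so it is noted to have weak discriminating power in 2026 ^[3].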

Evidence notes

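Stanford HAI's 2026 AI Index technical performance material covers long-run trends comparing AI with human performance across benchmarks such as ImageNet, VQA, SuperGLUE, and SQuAD 2.0 ^[1]. It serves as background for interpreting benchmarks rather than as a direct comparison of the four models ^[1].
The Kili Technology material names MMLU, MMLU-Pro, GPQA Diamond, SWE-bench, and Arena Elo as major 2026 benchmarks, and explains that benchmarks alone cannot fully describe model performance ^[2].
The Nanonets material notes that MMLU is computed with 5-shot prompting and that its discriminating power has fallen in 2026 because top models score 88% or higher ^[3].
The only provided evidence on DeepSeek V4 is a Reddit snippet, which is not a reliable official, academic, or independent leaderboard source ^[40].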

Limitations / uncertainty

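For GPT-5.5, DeepSeek V4, and Kimi K2.6, the provided evidence alone does not allow score comparisons on shared benchmark axes such as SWE-bench, GPQA, MMLU-Pro, AIME, MathVista, or Arena Elo. Insufficient evidence.
Claude Opus 4.7's figures may also differ by source in evaluation method and leaderboard formula, so BenchLM's 97/100 and ranking should be read as results from a different evaluation system than the SWE-bench Verified 82.4% ^[6]^[7].
Kimi K2.6's existence and date entry are confirmed, but the provided evidence contains no performance scores ^[5]. Insufficient evidence.
No benchmark figures for DeepSeek V4 can be confirmed in the provided evidence ^[40]. Insufficient evidence.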

Summary

Sources

[2] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ... (mindstudio.ai)
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[3] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[7] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] AI Benchmarks 2026: Top Evaluations and Their Limits (kili-technology.com)
Kili Technology · Apr 13, 2026. AI Benchmarks Guide: The Top Evaluations in 2026 and Why They're Not Enough. Table of contents: Introduction; What Are the Most Important AI Benchmarks in 2026?; General knowledge and reasoni...
[13] [PDF] Technical Performance - Stanford HAI (hai.stanford.edu)
Technical Performance: Benchmarks vs. Human Performance. Image classification (ImageNet Top-5), Visual reasoning (VQA), English language understanding (SuperGLU...
[15] DeepSeek V4: Features, Benchmarks, and Comparisons (datacamp.com)
DeepSeek V4 Benchmarks According to DeepSeek’s internal results, DeepSeek V4 demonstrates impressive performance, particularly when pushed to its maximum reasoning limits (DeepSeek-V4-Pro-Max). According to the official release notes, here is how the model...
[16] deepseek-ai/DeepSeek-V4-Pro - Hugging Face (huggingface.co)
Columns: Opus-4.6 Max, GPT-5.4 xHigh, Gemini-3.1-Pro High, K2.6 Thinking, GLM-5.1 Thinking, DS-V4-Pro Max. Knowledge & Reasoning: MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5; SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9; Chinese-SimpleQA (Pass@1...
[19] DeepSeek V4 - Vals AI (vals.ai)
Release date, Models: 4/23/2026 DeepSeek DeepSeek V4; 4/23/2026 OpenAI GPT 5.5; 4/20/2026 Moonshot AI Kimi K2.6; 4/16/2026 Anthropic Claude...
[22] AI Benchmarks Explained: GPQA, SWE-bench & Arena Elo (nanonets.com)
How the score is calculated: Before each question, the model is shown 5 example questions with correct answers, this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage. Why it's nea...
[28] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[29] Introducing GPT-5.5 - OpenAI (openai.com)
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, whic...
[31] GPT 5.5 - Vals AI (vals.ai)
2/17/2026 Anthropic Claude Sonnet 4.6; 2/16/2026 Alibaba Qwen 3.5 Plus; 2/12/2026 MiniMax MiniMax-M2.5; 2/12/2026 MiniMax MiniMax-M2.5; 2/11/2026 zAI GLM 5; 2/5/2026 Anthropic Claude Opus 4.6 (Nonthinking); 2/5/2026 Anthropic Claude Opus 4.6 (Thinking); 1/26/2026...
[35] Kimi K2.6 on GMI Cloud: Architecture, Benchmarks & API Access (gmicloud.ai)
Kimi K2.6 tops SWE-Bench Pro and runs 300 parallel sub-agents on 4x H100S. Learn the full architecture, benchmark results, and how to run it ...
[36] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
Moonshot's Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at 4 on the Artificial Analysis Intelligence Index (54) behind ...
[37] Kimi 2.6 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Kimi 2.6 by Moonshot AI scores 85/100 on BenchLM's provisional leaderboard (12 of 115) with 27 published benchmark scores currently shown ...
[39] Kimi K2.6 - Vals AI (vals.ai)
Kimi K2.6. Release Date: 4/20/2026. Vals Index. Accuracy (Vals Index). 63.94% ± 1.97. Latency (Vals Index). 373.57s. Cost/Test (Vals Index). $0.21.