Reports · Public URL · Apr 26, 2026 · 42 sources

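Research the benchmarks for Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Present the results in the format of a research report.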


Research answer

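As of 2026-04-26, verifiable public sources indicate that GPT-5.5 is the strongest closed-source model of the four on aggregate benchmarks; Claude Opus 4.7 follows closely on long-horizon coding, data discipline, and multi-step tasks; Kimi K2.6 is the strongest open-weight model in the available evidence; and DeepSeek V4 Pro's main advantage is near-frontier capability at a markedly lower API cost.^[1]^[3]^[4]^[5]^[6] However, there is no complete official head-to-head evaluation of all four from a single source under identical settings. Item-by-item official comparisons for DeepSeek V4 and Kimi K2.6 are particularly thin, so some conclusions should be read as "best estimates from public data."^[4]^[5]^[6]^[7]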

Key findings

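Overall intelligence ranking: Artificial Analysis's public comparison places GPT-5.5 xHigh first and GPT-5.5 High second, with Claude Opus 4.7 and Gemini 3.1 Pro Preview tied at 57 points. Kimi K2.6 scores 54 on the same Intelligence Index, in the top tier of open-weight models.^[5]^[6]
Coding and agentic tasks: OpenAI's official table gives GPT-5.5 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, and describes it as OpenAI's strongest agentic coding model at the time of release.^[1] Anthropic's announcement does not provide a complete unified table in its crawlable text, but it states that Opus 4.7 improves significantly over Opus 4.6 in advanced software engineering, complex long-horizon tasks, and self-verification, and cites a 13% resolution gain on a customer's 93-task coding benchmark.^[3]
Reasoning and math: OpenAI's official table shows GPT-5.5 at 51.7% on FrontierMath Tier 1–3 and 35.4% on Tier 4, above the 43.8% and 22.9% listed for Claude Opus 4.7 in the same table.^[1] In the Humanity’s Last Exam figures compiled by VentureBeat, however, Claude Opus 4.7 scores 46.9% without tools, ahead of GPT-5.5 at 41.4% and DeepSeek V4 at 37.7%.^[4]
Browsing and research tasks: GPT-5.5 scores 84.4% on BrowseComp in the OpenAI table, against 79.3% for Claude Opus 4.7.^[1] VentureBeat reports 83.4% for DeepSeek V4 Pro-Max on BrowseComp and DocsBot lists 83.2% for Kimi K2.6, so GPT-5.5, DeepSeek V4, and Kimi K2.6 are very close on this benchmark, with Claude Opus 4.7 somewhat behind.^[4]^[7]
Cost: the OpenAI API documentation lists GPT-5.5 at $5 per million input tokens and $30 per million output tokens, with a 1M context window and a 128K maximum output.^[2] Anthropic states that Claude Opus 4.7 pricing remains at $5 per million input tokens and $25 per million output tokens.^[3] VentureBeat, citing DeepSeek API pricing, puts DeepSeek V4 Pro at $1.74 per million input tokens and $3.48 per million output tokens, markedly below GPT-5.5 and Claude Opus 4.7.^[4]
Open-weight ecosystem: Artificial Analysis calls Kimi K2.6 the "new leading open weights model". It is an MoE with 1T total parameters and 32B active parameters, and it supports a 256K context window, image and video input, and text output.^[5] DeepSeek V4 is reported to be a 1.6T-parameter MoE under the MIT License, available on Hugging Face and through the DeepSeek API, which makes it the more open and cheaper option in terms of deployability and cost.^[4]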
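
To make the price gap concrete, the list prices quoted above can be turned into a per-workload estimate. The following is a minimal sketch: the workload size (1,000 requests of 20K input and 2K output tokens each) is an assumed example rather than a figure from any source, and the names `PRICES` and `workload_cost` are hypothetical. It ignores caching, batch discounts, and differences in token usage between models.

```python
# List prices in USD per million tokens, as quoted in the cost finding above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4 Pro": (1.74, 3.48),
}

def workload_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at list price (no caching or discounts)."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assumed workload: 1,000 requests, each with 20K input and 2K output tokens.
costs = {m: workload_cost(m, 1_000 * 20_000, 1_000 * 2_000) for m in PRICES}
for model, usd in costs.items():
    print(f"{model}: ${usd:.2f}")
```

Under these assumptions the totals come to roughly $160 for GPT-5.5, $150 for Claude Opus 4.7, and $41.76 for DeepSeek V4 Pro. The ratio shifts with the input-to-output mix, since the output price gap is much larger than the input price gap.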

Benchmark comparison

Dimension / benchmark	GPT-5.5	Claude Opus 4.7	DeepSeek V4 / V4 Pro	Kimi K2.6	Interpretation
Artificial Analysis Intelligence Index	xHigh 60; High 59 ^[6]	57 ^[6]	V4 Pro is listed among the leading open-weight models, but the sources gathered here give no exact score in the same table ^[6]	54, top tier of open-weight models ^[5]	GPT-5.5 leads overall, Claude is close, and Kimi is among the strongest open-weight models.
Terminal-Bench 2.0	82.7% ^[1]	68.5% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 has a clear advantage in command-line agent workflows.
SWE-Bench Pro	58.6% ^[1]	Insufficient evidence	Insufficient evidence	Third-party sources place it in the same tier as GPT-5.4 / Opus 4.6, but no same-source comparison with GPT-5.5 / Opus 4.7 is available ^[5]	GPT-5.5 has an official score; the other models lack data under the same settings.
GDPval / GDPval-AA	Wins-or-ties 84.9% ^[1]	80.3% ^[1]	Insufficient evidence	GDPval-AA Elo 1520, up from 1309 for Kimi K2.5 ^[5]	GPT and Claude can be compared within the OpenAI table; Kimi's AA Elo should not be compared directly with OpenAI's percentage scores.
BrowseComp	84.4% ^[1]	79.3% ^[1]	83.4% ^[4]	83.2% ^[7]	GPT-5.5 is slightly ahead, DeepSeek and Kimi are very close, and Claude is somewhat weaker.
FrontierMath Tier 1–3	51.7% ^[1]	43.8% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 leads Claude in this official table.
FrontierMath Tier 4	35.4% ^[1]	22.9% ^[1]	Insufficient evidence	Insufficient evidence	The gap between GPT-5.5 and Claude widens on the hardest math problems.
Humanity’s Last Exam, no tools	41.4% ^[4]	46.9% ^[4]	37.7% ^[4]	Insufficient evidence	Claude is highest in the no-tools HLE setting reported here.
Humanity’s Last Exam, with tools	52.2%; GPT-5.5 Pro at 57.2% ^[4]	54.7% ^[4]	48.2% ^[4]	Insufficient evidence	With tools, Claude is above base GPT-5.5 but below GPT-5.5 Pro.
CyberGym	81.8% ^[1]	73.1% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 leads in the OpenAI table.
AA-Omniscience / hallucination rate	Insufficient evidence	Hallucination rate 36%, shown as a reference point ^[5]	Insufficient evidence	Hallucination rate 39%, down from 65% for Kimi K2.5 ^[5]	Kimi K2.6's hallucination rate is a clear improvement over K2.5 and close to the 36% shown for Claude.
τ²-Bench Telecom	Insufficient evidence	Insufficient evidence	Insufficient evidence	96% ^[5]	Kimi K2.6 performs strongly on tool use in the telecom task domain.
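
For the rows where both GPT-5.5 and Claude Opus 4.7 have scores from the same OpenAI table, the size of the lead can be computed directly. The sketch below copies those six rows from the comparison table; the names `SHARED` and `gaps` are illustrative, and the GDPval figure is a wins-or-ties rate rather than an accuracy, so it is included only because both values share one source.

```python
# (GPT-5.5, Claude Opus 4.7) scores in percent, both taken from the OpenAI table [1].
SHARED = {
    "Terminal-Bench 2.0": (82.7, 68.5),
    "GDPval (wins or ties)": (84.9, 80.3),
    "BrowseComp": (84.4, 79.3),
    "FrontierMath Tier 1-3": (51.7, 43.8),
    "FrontierMath Tier 4": (35.4, 22.9),
    "CyberGym": (81.8, 73.1),
}

def gaps(scores):
    """Return {benchmark: GPT-5.5 minus Claude Opus 4.7} in percentage points."""
    return {name: round(gpt - claude, 1) for name, (gpt, claude) in scores.items()}

# Print the rows from the largest lead to the smallest.
for name, gap in sorted(gaps(SHARED).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap:+.1f} pts")
```

The largest leads are on Terminal-Bench 2.0 (14.2 points) and FrontierMath Tier 4 (12.5 points), and the smallest is on GDPval (4.6 points). Because all six rows come from a single vendor's table, the gaps are internally comparable but may still reflect that vendor's choice of settings.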

Model-by-model assessment

GPT-5.5

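GPT-5.5's main strength is a balanced lead across overall capability, agentic coding, computer use, browsing and research, and math benchmarks.^[1]^[2]^[6] OpenAI states that GPT-5.5 excels at writing code, debugging, online research, data analysis, creating documents and spreadsheets, operating software, and completing tasks across tools.^[1]

GPT-5.5's main weakness is its relatively high cost. The API documentation lists an output price of $30 per million tokens, above Claude Opus 4.7's $25 and far above DeepSeek V4 Pro's $3.48.^[2]^[3]^[4] For very high task volumes where marginal cost matters, DeepSeek V4 Pro or Kimi K2.6 may be more attractive on cost-effectiveness.^[4]^[5]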

Claude Opus 4.7

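Claude Opus 4.7's strengths are long-horizon coding, complex task execution, improved vision, data discipline, and handling of uncertainty.^[3] Anthropic states that Opus 4.7 is clearly better than Opus 4.6 at advanced software engineering, and that on complex long-running tasks it is more rigorous, more consistent, and better at verifying its own output.^[3]

Claude Opus 4.7 does not always lead GPT-5.5 on hard metrics: in the OpenAI table, GPT-5.5 scores higher on Terminal-Bench 2.0, BrowseComp, FrontierMath, and CyberGym.^[1] In the HLE no-tools figures compiled by VentureBeat, however, Claude Opus 4.7 scores 46.9%, ahead of GPT-5.5's 41.4%.^[4]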

DeepSeek V4 / V4 Pro

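DeepSeek V4 Pro's core competitive edge is cost, open availability, and near-frontier capability.^[4] VentureBeat reports that DeepSeek V4 is a 1.6T-parameter MoE released under the MIT License and offered through Hugging Face and the DeepSeek API.^[4]

On the shared benchmarks seen so far, DeepSeek V4 Pro is generally close to GPT-5.5 and Claude Opus 4.7 without surpassing them across the board.^[4] VentureBeat's figures put DeepSeek at 37.7% on HLE without tools and 48.2% with tools, below the corresponding scores for GPT-5.5, GPT-5.5 Pro, and Claude Opus 4.7.^[4] Its API cost is far lower than that of GPT-5.5 or Claude Opus 4.7, however, which makes it a fit for high-throughput, cost-sensitive, self-hostable, or open-source-first scenarios.^[4]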

Kimi K2.6

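Kimi K2.6 is positioned as a strong open-weight model with multimodal input, long context, and a low hallucination rate.^[5] Artificial Analysis calls it the new leading open-weight model and reports an Intelligence Index of 54, a GDPval-AA Elo of 1520, 96% on τ²-Bench Telecom, a 39% hallucination rate, a 256K context window, and support for image and video input.^[5]

Directly comparable official data for Kimi K2.6 and GPT-5.5 is scarce, so it cannot simply be claimed that Kimi K2.6 surpasses GPT-5.5 or Claude Opus 4.7 across the board.^[5]^[7] On BrowseComp, DocsBot lists Kimi K2.6 at 83.2%, close to GPT-5.5's 84.4%, but the two figures come from different aggregation sources and should be compared with caution.^[1]^[7]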
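
The GDPval-AA Elo figures reported for Kimi K2.6 (1520) and Kimi K2.5 (1309) can be given a rough intuitive reading. The sketch below assumes the conventional logistic Elo model with a 400-point scale; the sources gathered here do not confirm that Artificial Analysis uses exactly this convention, so the result is an illustration of how Elo gaps are usually read, not a reported number. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: a 400-point scale, as in classical Elo; a different rating
    system would need a different formula.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Kimi K2.6 (1520) against Kimi K2.5 (1309), ratings as reported in [5].
p = elo_expected_score(1520, 1309)
print(f"{p:.2f}")
```

Under that assumption, a 211-point gap corresponds to an expected score of about 0.77 for K2.6 in a head-to-head comparison with K2.5. This also shows why an Elo rating cannot be placed next to a wins-or-ties percentage from a different evaluation: the rating is only meaningful relative to other ratings in the same pool.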

Practical recommendations

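If the goal is the highest overall intelligence, agentic coding, browsing and research, and tool-based tasks, choose GPT-5.5 first.^[1]^[2]^[6]
If the goal is long-horizon engineering work, complex document and visual understanding, stronger self-verification, and a slightly lower output price than GPT-5.5, choose Claude Opus 4.7 first.^[3]
If the goal is large-scale inference, a low-cost API, an open-source license, or deployment flexibility, evaluate DeepSeek V4 Pro first.^[4]
If the goal is open weights, multimodal input, a 256K long context, a low hallucination rate, and strong tool use, evaluate Kimi K2.6 first.^[5]
If the goal is web research or web retrieval tasks, the public BrowseComp scores for GPT-5.5, DeepSeek V4, and Kimi K2.6 are very close, so run a small-sample A/B test on your own real workflow.^[1]^[4]^[7]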
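
A small-sample A/B test of the kind recommended above needs some way to judge whether an observed difference is more than noise. One common choice, sketched here, is an exact sign test on paired pass/fail outcomes: both models run the same tasks, and only the tasks where exactly one of them succeeds are counted. This is a generic statistical technique rather than a method taken from the cited sources, and the counts in the example are hypothetical.

```python
from math import comb

def exact_sign_test(wins_a, wins_b):
    """Two-sided exact binomial p-value on discordant pairs (ties excluded).

    wins_a: tasks solved only by model A; wins_b: tasks solved only by model B.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: on 40 shared tasks, model A alone succeeds on 9 and
# model B alone on 3; the other 28 tasks are ties and carry no information.
print(f"p = {exact_sign_test(9, 3):.3f}")
```

With these hypothetical counts the p-value is about 0.146, so a 9-to-3 split on 40 tasks would not be strong evidence that one model is better. That is a useful reminder of how little a one-point gap on a public leaderboard says about a specific workload.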

Evidence notes

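This report mainly relies on the following numbered sources: the official OpenAI GPT-5.5 announcement ^[1], the OpenAI API model documentation ^[2], the official Anthropic Claude Opus 4.7 announcement ^[3], VentureBeat's coverage and benchmark roundup for DeepSeek V4 ^[4], Artificial Analysis's evaluation of Kimi K2.6 ^[5], Artificial Analysis's GPT-5.5 / DeepSeek V4 comparison page ^[6], DocsBot's Kimi K2.6 / GPT-5.5 comparison page ^[7], OpenRouter's DeepSeek V4 Pro / Kimi K2.6 provider comparison page ^[8], and AkitaOnRails's April 2026 hands-on coding test ^[9].

Official sources carry the most evidentiary weight, especially OpenAI's and Anthropic's own announcement pages and API documentation.^[1]^[2]^[3] Third-party benchmarks have the advantage of allowing cross-model comparison, but they are often affected by prompts, toolchains, reasoning effort, temperature, sample sets, and provider wrappers. This report therefore treats Artificial Analysis, VentureBeat, DocsBot, OpenRouter, and individual hands-on tests as supplementary evidence rather than a final verdict.^[4]^[5]^[6]^[7]^[8]^[9]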

Limitations / uncertainty

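Insufficient evidence: no complete public table was found that covers all key benchmarks for the four models from the same organization, at the same time, with the same reasoning budget and the same tool settings.^[1]^[3]^[4]^[5]^[6]^[7]

Insufficient evidence: official item-by-item benchmarks for DeepSeek V4 and Kimi K2.6, and in particular same-source comparisons with GPT-5.5 and Claude Opus 4.7, are incomplete in the material that could be verified for this report.^[4]^[5]^[6]^[7]

Some scores come from the model vendors' own announcement pages and may be skewed toward each model's strengths.^[1]^[3] Other scores come from third-party aggregators or media reports and may differ in model version, call parameters, tool permissions, or provider routing.^[4]^[5]^[6]^[7]^[8]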
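
One way to keep these caveats visible in practice is to carry the source of each score alongside the value and only treat same-source pairs as directly comparable. The sketch below uses the BrowseComp row from the comparison table; the data layout and the function name `same_source_pairs` are illustrative assumptions.

```python
# (score in percent, citation number) per model for the BrowseComp row.
BROWSECOMP = {
    "GPT-5.5": (84.4, 1),
    "Claude Opus 4.7": (79.3, 1),
    "DeepSeek V4 Pro-Max": (83.4, 4),
    "Kimi K2.6": (83.2, 7),
}

def same_source_pairs(row):
    """Return the model pairs whose scores come from the same cited source."""
    names = list(row)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if row[a][1] == row[b][1]
    ]

print(same_source_pairs(BROWSECOMP))
```

For this row only the GPT-5.5 and Claude Opus 4.7 pair shares a source, so the three-way closeness of GPT-5.5, DeepSeek V4, and Kimi K2.6 rests on figures from three different publications.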

Summary

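Taken together, the public benchmarks make GPT-5.5 the default choice for highest performance. Claude Opus 4.7 is the closest closed-source alternative and is strong on long-horizon engineering and data discipline, Kimi K2.6 is the strongest candidate in the open-weight camp, and DeepSeek V4 Pro stands out most for cost-effectiveness and open deployment.^[1]^[3]^[4]^[5]^[6]

If you are making a procurement or technology-selection decision, do not rely on leaderboards alone. Build a private evaluation set from 30–100 of your own real tasks, and record accuracy, failure modes, latency, token consumption, controllability, tool-call success rate, and total cost for every run.^[1]^[2]^[4]^[5]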
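
A private evaluation set of this kind mostly needs a consistent record per task run and a simple roll-up per model. The following is a minimal sketch of such a record; the field names, the `summarize` helper, and the sample rows are all hypothetical, and a real harness would add fields for controllability and tool-call success as well.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    model: str
    task_id: str
    passed: bool
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    failure_mode: str = ""  # free-text label, empty when the task passed

def summarize(results):
    """Aggregate per-model accuracy, mean latency, and total cost."""
    summary = {}
    for model in sorted({r.model for r in results}):
        rows = [r for r in results if r.model == model]
        summary[model] = {
            "accuracy": mean(1.0 if r.passed else 0.0 for r in rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            "total_cost_usd": sum(r.cost_usd for r in rows),
        }
    return summary

# Hypothetical sample rows, for illustration only.
results = [
    TaskResult("model-a", "t1", True, 12.0, 1000, 200, 0.01),
    TaskResult("model-a", "t2", False, 20.0, 1500, 400, 0.02, "wrong tool call"),
    TaskResult("model-b", "t1", True, 8.0, 1000, 150, 0.002),
]
summary = summarize(results)
print(summary)
```

Keeping the failure label on each row makes it possible to group failures by type later, which is often more informative for model selection than the headline accuracy.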

Supporting visuals

Comparison of Claude Opus 4.6 and GPT-5.3 Codex, highlighting differences in token limits, performance metrics, and core advantages such as reasoning speed and autonomous development.

Comparison chart of the improvements in Claude Opus 4.6 over 4.5, including a larger context window, higher output capacity, and enhanced thinking and multi-agent capabilities.

Comparison of the performance and cost of Kimi K2.5 and Claude Opus 4.5, highlighting efficiency, safety compliance, and price differences.

Radar charts comparing Kimi K2.5 and Claude Opus 4.5 on various benchmarks.

Image titled "Surface Intersection Lab" (credit: Bartosz Naskręcki), with a caption referring to GPT-5.5's autonomous development ability.

GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions. Comparison chart of DeepSeek 4.0, GPT-5.5, and Opus 4.7, with GPT-5.5 achieving the highest accuracy at 38.57%, and a "DEEPSEEK 4 VS. GPT-5.5" banner overlaid on a table of benchmark results.

Chart: Capture-the-Flag challenge tasks (Internal), GPT-5.5 vs GPT-5.4, score against output tokens. Caption: GPT-5.5 scores higher with only one third of the previous generation's tokens.

Promotional graphic: "The return of the king for OpenAI? GPT-5.5 officially launches and overtakes Opus 4.7 on the leaderboards."

Latest four-model comparison, April 2026: Kimi K2.6 vs Claude Opus 4.7 vs GPT-5.5. A chart comparing Kimi K2.6, Claude Opus 4.7, GPT-5.5, and DeepSeek V4-Pro across benchmarks, with Chinese text and numerical scores related to pricing or performance.

Kimi K2.6, DeepSeek V4, GPT-5.5, Claude Opus 4.7: which one to test first. A detailed infographic comparing official contract lanes for the four models, including features, pricing, availability, and usage guidelines.

Transportation problem optimal solution showing shipping plan, cost breakdown, and MODI optimality verification generated by Claude Opus 4.7

Optimal allocation table for transportation problem with routes, units, rates, and total cost solved by Kimi K2.6

Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4: Agentic Coding Benchmarks (2026)

Long-Horizon Execution and Agent Orchestration

Studio Global AI · Apr 26, 2026 · 40 sources


Sources

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Geminimashable.com
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] DeepSeek V4 Pro (Reasoning, High Effort) vs GPT-5.5 (medium)artificialanalysis.ai
What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...
[5] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouteropenrouter.ai
Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...
[6] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ...venturebeat.com
On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4....
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] Kimi K2.6 vs GPT-5.5 - Detailed Performance & Feature Comparisondocsbot.ai
BrowseComp Evaluates browsing and web comprehension capabilities 83.2% Thinking mode; context management Source 84.4% Source CharXiv Reasoning Evaluates information synthesis from complex charts and scientific visualizations 80.4% RQ; 86.7 with Python Sourc...
[9] Kimi K2.6: The new leading open weights model - Artificial Analysisartificialanalysis.ai
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[10] Claude Opus 4.7 results: early benchmarks, real-world feedback ...boringbot.substack.com
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[11] DeepSeek V4 vs Claude vs GPT-5.4: A 38-Task Benchmark ... - FundaAIfundaai.substack.com
As of time of publication, GPT-5.5 has not yet officially released its API. Testing solely through Codex 5.5 may not fully reflect the complete performance of the API. We have currently only conducted urgent testing on DeepSeek V4, and will include GPT-5.5...
[12] GPT 5.5 Vs Claude Opus 4.7 Proves Benchmarks Need Contextreddit.com
That means the benchmark story is not one-sided. A model can be better at running commands and still not be the best choice for careful code cleanup. Another model can be better at clean output and still feel slower when the task needs constant execution. T...
[13] Kimi K2.6 | 熱門關鍵字 - 鉅亨號hao.cnyes.com
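Summary: Two people from Guangdong are holding up half the sky of Chinese AI. After a long wait, on the morning of April 24 DeepSeek finally dropped a bombshell: the preview of its new model series, DeepSeek-V4, went live and was open-sourced at the same time, claiming leading results domestically and among open-source models in agent capability, world knowledge, and reasoning performance. Since the test version went online on February 8, it had kept a mysterious profile. Just four days earlier, on Monday evening, Moonshot AI likewise released and open-sourced Kimi K2.6, which focuses on long-horizon coding and agent-cluster capabilities and matches or even beats closed-source models such as GPT-5.4 and Claude Opus 4.6 on several benchmarks. According to a count by 鳳凰網科技, this is already Kimi and D...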
[14] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AIlorka.ai
Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...
[15] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AIverdent.ai
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...
[16] LLM Coding Benchmark (April 2026): GPT 5.5, DeepSeek v4, Kimi ...akitaonrails.com
Rank Model Score Tier RubyLLM OK Time Cost --- --- --- 1 Claude Opus 4.7 97 A ✅ 18m $1.10 1 GPT 5.4 xHigh (Codex) 97 A ✅ 22m $16 3 GPT 5.5 xHigh (Codex) 96 A ✅ 18m $10 4 Kimi K2.6 87 A ✅ 20m $0.30 5 Claude Opus 4.6 83 A ✅ 16m $1.10 6 Gemini 3.1 Pro 82 A ✅ 1...
[17] Usama Akram | AI Agents & AI Automation (@aiagently) on Threadsthreads.com
🚨 April was absolutely insane for AI releases... • Claude Opus 4.7 (Anthropic) • GPT-5.5 (OpenAI) • DeepSeek V4 (DeepSeek) • Xiaomi Mimi V2.5 • Qwen3.6-Plus (Alibaba) • GLM-5.1 (Zhipu) • Muse Spark (Meta) • Qwen3.6-35B-A3B (Alibaba) • Grok 4.3 Beta (xAI) •...
[18] 川弟 (@fatsol_erc) on Threadsthreads.com
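Big AI news of the past few days: DeepSeek officially released V4, and OpenAI released GPT 5.5. I currently use GPT more, and I hear that 5.5 outperforms Claude Opus 4.7 and has capabilities close to Mythos. This major upgrade also introduced a local-first agent for macOS, which is very friendly to Mac users. After trying it right away, I do find GPT 5.5 easier to use than Opus 4.7, especially since it does not ban accounts at random or get dumbed down. We really are living in the best of times, and AI will drive exponential growth in human productivity. On release it pushed everyone straight into the AI Superapp era. The wildest part this time is not...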
[19] We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow ...blog.kilo.ai
Where Open-Weight Models Stand Right Now This test sits inside a pattern we’ve been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6’s detection rate on our last three-part benchmark. GLM-5.1 scored five points behind Claude Opus 4.6 on our job qu...
[20] Bad Opus 4.7, Good Kimi K2.6, and Growing Codexaicodingdaily.substack.com
Simon Willison on X: “Opus 4.7 appears to use 1.46x times the tokens for text and up to 3x the tokens for images than Opus 4.6” x.com I upgraded my Claude token counter tool to compare different models and Opus 4.7 does appear to use 1.46x times the tokens...
[21] A bunch of model releases this week - Opus 4.7, GPT-5.5. And then there's also the second-tier stuff like Qwen3.6, Kimi K2.6, GLM-5.1, DeepSeek V4. Honestly, we keep obsessing over which frontier… | David Kaplanlinkedin.com
Looking at the months ahead, I think that the economics may start to shift. Right now teams are still throwing money at AI, exploring, willing to spend regardless of what it costs per inference. But I'd image that that's not going to last. Eventually someon...
[22] Introducing GPT-Rosalind for life sciences research - OpenAIopenai.com
OpenAI introduces GPT-Rosalind, a frontier reasoning model built to accelerate drug discovery, genomics analysis, protein reasoning, ... Apr 16, 2026
[23] Making ChatGPT better for clinicians - OpenAIopenai.com
Introducing GPT-5.5. Product Apr 23, 2026. OAI Blog Agents Hero 1x1. Introducing workspace agents in ChatGPT. Product Apr 22, 2026. Images 2.0 ... 4 days ago
[24] Semantic Alignment Before Acceleration - Promptingcommunity.openai.com
April 24, 2026. Feedback on GPT-5 Model Performance for Translation Tasks · Feedback. 19, 2756, August 19, 2025. Chatgpt API isn't good as it's ... 1 day ago
[25] Introducing GPT-5.5 - OpenAIopenai.com
Update on April 24, 2026: GPT‑5.5 and GPT‑5.5 Pro are now available ... GPT‑5.5 achieved leading performance among models with published scores. 3 days ago
[26] [PDF] HealthBench Professional: Evaluating Large Language Models on ...cdn.openai.com
An example is Codex (OpenAI, 2026c) using GPT-5.4, an AI system composed of Codex. (harness) and GPT-5.4 (model). Other examples include ... 4 days ago
[27] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAIcdn.openai.com
We introduce GeneBench, a benchmark for AI agents on realistic multi-stage scientific data analysis in genetics and quantitative biology. 3 days ago
[28] GPT-5.5 is here! Available in the API, Codex and ChatGPT todaycommunity.openai.com
Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, ... 3 days ago
[29] GPT-5.5 System Card - OpenAI Deployment Safety Hubdeploymentsafety.openai.com
GPT-5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, ... 3 days ago
[30] Introducing OpenAI Privacy Filteropenai.com
OpenAI Privacy Filter is an open-weight model for detecting and redacting personally identifiable information (PII) in text with ... 4 days ago
[31] Introducing workspace agents in ChatGPT - OpenAIopenai.com
Introducing GPT-5.5. Product Apr 23, 2026. Making ChatGPT free for clinicians. Making ChatGPT better for clinicians. Product Apr 22, 2026.
[32] Image generation | OpenAI APIdevelopers.openai.com
response = client.responses.create( model="gpt-5.5", input=[ { "role": "user", "content": [ {"type": "input text", "text": prompt}, { "type": "input image", "image url": f"data:image/jpeg;base64,{base64 image1}", }, { "type": "input image", "image url": f"d...
[33] OpenAI API Platform Documentationdevelopers.openai.com
Categories Demo apps Videos Topics Agents Audio & Voice Computer Use Codex Evals gpt-oss Fine-tuning Image generation Scaling Tools Video generation Community Programs Codex Ambassadors Codex for Students Codex for Open Source Events Meetups Hackathon Suppo...
[34] Using GPT-5.5 | OpenAI APIdevelopers.openai.com
For tool-heavy or long-running workflows, verify that your application handles phase , preambles, and assistant-item replay correctly. Benchmark against other models on accuracy, token consumption, and end-to-end latency. [...] More efficient reasoning: GPT...
[35] Models | OpenAI APIdevelopers.openai.com
GPT-5.5 New A new class of intelligence for coding and professional work. Model ID gpt-5.5 [Reasoning none low medium high xhigh Input price $5 / Input MTok Output price $30 / Output MTok Latency Fast Max output 128K tokens Context window 1M Tools Functions...
[36] ChatGPT — Release Notes - OpenAI Help Centerhelp.openai.com
March 5, 2026 GPT-5.4 Thinking in ChatGPT GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ while impr...
[37] GPT-4o System Card | OpenAIopenai.com
GPT DALL·E Whisper Generative Models Ethics & Safety Language System Cards Citations 1. 1 OpenAI, “Hello gpt-4o.” 2024. 2. 2 T. Stivers, N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. de Ruiter, K. E. Yoon, and...
[38] Harness engineering: leveraging Codex in an agent-first worldopenai.com
For Business Business Overview Solutions Contact Sales Company About Us Our Charter Foundation(opens in a new window) Careers Brand Support Help Center(opens in a new window) More News Stories Academy Livestreams Podcast RSS Terms & Policies Terms of Use Pr...
[39] Introducing deep research - OpenAIopenai.com
Leadership Akshay Nathan, Greg Brockman, Hannah Wong, Jakub Pachocki, Jerry Tworek, Johannes Heidecke, Josh Tobin, Liam Fedus, Mark Chen, Mia Glaese, Nick Turley, Sam Altman, Wojciech Zaremba Our Research Research Index Research Overview Research Residency...
[40] Introducing GPT-5.4 - OpenAIopenai.com
Evals without reasoning EvalGPT‑5.4 (none)GPT‑5.2 (none)GPT-4.1 OmniDocBench (normalized edit distance)0.109 0.140— Tau2-bench Telecom 64.3%57.2%43.6% Evals were run with reasoning effort set to xhigh, except where specified otherwise. Benchmarks were condu...
[41] Codex | AI Coding Partner from OpenAIopenai.com
OpenAI © 2015–2026 Manage Cookies English United States [...] “I needed to update another team’s codebase for a release. Codex handled the refactor and test generation, delivering fully tested code I handed back fast—keeping the feature on schedule without...
[42] Research - OpenAIopenai.com
GPT OpenAI’s GPT series models are fast, versatile, and cost-efficient AI systems designed to understand context, generate content, and reason across text, images, and more. Image 2: Hero Art Card SEO 1x1 A new class of intelligence for real work Release Ap...

ReportsPublic URLApr 26, 202642 sources

研究 Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 的基准测试，并尽可能全面地进行比较。请以研究报告的格式呈现。

Key takeaways

综合智能排名上，Artificial Analysis 的公开比较显示 GPT-5.5 xHigh 位列第一，GPT-5.5 High 第二，Claude Opus 4.7 与 Gemini 3.1 Pro Preview 同处 57 分档位，而 Kimi K2.6 在其 Intelligence Index 上为 54 分、位列开权重模型第一梯队。
编码与代理任务上，OpenAI 官方表格给出 GPT-5.5 在 Terminal-Bench 2.0 为 82.7%，在 SWE-Bench Pro 为 58.6%，并称其是 OpenAI 当时最强的 agentic coding 模型。 Anthropic 官方没有在可抓取文本中给出完整统一表格，但称 Opus 4.7 相比 Opus 4.6 在高级软件工程、复杂长程任务和自我验证方面显著提升，并引用客户 93-task coding benchmark 中 resolution 提升 13%。
推理与数学上，OpenAI 官方表格显示 GPT-5.5 在 FrontierMath Tier 1–3 为 51.7%、Tier 4 为 35.4%，高于表中 Claude Opus 4.7 的 43.8% 与 22.9%。但 VentureBeat 汇总的 Humanity’s Last Exam 数据显示，在无工具设置下 Claude Opus 4.7 为 46.9%，高于 GPT-5.5 的 41.4% 与 DeepSeek V4 的 37.7%。
浏览与研究型任务上，GPT-5.5 在 OpenAI 表格中的 BrowseComp 为 84.4%，Claude Opus 4.7 为 79.3%。 VentureBeat 报道 DeepSeek V4 Pro-Max 在 BrowseComp 为 83.4%，DocsBot 汇总 Kimi K2.6 为 83.2%，因此该项上 GPT-5.5、DeepSeek V4、Kimi K2.6 非常接近，Claude Opus 4.7 落后一些。
截至 2026-04-26，可核查的公开资料显示：GPT-5.5 是四者中综合基准最强的闭源模型，Claude Opus 4.7 在长程编码、数据纪律和多步任务上紧随其后，Kimi K2.6 是目前证据中最强的开权重模型，DeepSeek V4 Pro 的主要优势是接近前沿模型的能力与显著更低的 API 成本。[1][3][4][5][6] 但四者并没有完整、同源、同设置的官方横向评测；尤其 DeepSeek V4 与 Kimi K2.6 的逐项官方对比证据不足，因此部分结论应视为“公开资料下的最佳估计”。[4][5][6][7]
## Key findings

Research answer

Key findings

综合智能排名上，Artificial Analysis 的公开比较显示 GPT-5.5 xHigh 位列第一，GPT-5.5 High 第二，Claude Opus 4.7 与 Gemini 3.1 Pro Preview 同处 57 分档位，而 Kimi K2.6 在其 Intelligence Index 上为 54 分、位列开权重模型第一梯队。^[5]^[6]
编码与代理任务上，OpenAI 官方表格给出 GPT-5.5 在 Terminal-Bench 2.0 为 82.7%，在 SWE-Bench Pro 为 58.6%，并称其是 OpenAI 当时最强的 agentic coding 模型。^[1] Anthropic 官方没有在可抓取文本中给出完整统一表格，但称 Opus 4.7 相比 Opus 4.6 在高级软件工程、复杂长程任务和自我验证方面显著提升，并引用客户 93-task coding benchmark 中 resolution 提升 13%。^[3]
推理与数学上，OpenAI 官方表格显示 GPT-5.5 在 FrontierMath Tier 1–3 为 51.7%、Tier 4 为 35.4%，高于表中 Claude Opus 4.7 的 43.8% 与 22.9%。^[1] 但 VentureBeat 汇总的 Humanity’s Last Exam 数据显示，在无工具设置下 Claude Opus 4.7 为 46.9%，高于 GPT-5.5 的 41.4% 与 DeepSeek V4 的 37.7%。^[4]
浏览与研究型任务上，GPT-5.5 在 OpenAI 表格中的 BrowseComp 为 84.4%，Claude Opus 4.7 为 79.3%。^[1] VentureBeat 报道 DeepSeek V4 Pro-Max 在 BrowseComp 为 83.4%，DocsBot 汇总 Kimi K2.6 为 83.2%，因此该项上 GPT-5.5、DeepSeek V4、Kimi K2.6 非常接近，Claude Opus 4.7 落后一些。^[4]^[7]
成本上，OpenAI API 文档列出 GPT-5.5 为 $5 / 百万输入 token、$30 / 百万输出 token，并支持 1M 上下文与 128K 最大输出。^[2] Anthropic 官方称 Claude Opus 4.7 价格维持 $5 / 百万输入 token、$25 / 百万输出 token。^[3] VentureBeat 引用 DeepSeek API 定价称 DeepSeek V4 Pro 为 $1.74 / 百万输入 token、$3.48 / 百万输出 token，显著低于 GPT-5.5 与 Claude Opus 4.7。^[4]
开权重生态上，Kimi K2.6 被 Artificial Analysis 称为“new leading open weights model”，其模型为 1T 总参数、32B 激活参数的 MoE，支持 256K 上下文、图像和视频输入、文本输出。^[5] DeepSeek V4 被报道为 1.6T 参数 MoE、MIT License、可在 Hugging Face 与 DeepSeek API 获取，因此在可部署性和成本上更开放、更便宜。^[4]

Benchmark comparison

Dimension / benchmark	GPT-5.5	Claude Opus 4.7	DeepSeek V4 / V4 Pro	Kimi K2.6	Interpretation
Artificial Analysis Intelligence Index	xHigh 60; High 59 ^[6]	57 ^[6]	V4 Pro is listed among the leading open-weights models, but the sources reviewed give no exact score in the same table ^[6]	54, top tier of open-weights models ^[5]	GPT-5.5 leads overall, Claude is close, and Kimi is among the strongest open-weights models.
Terminal-Bench 2.0	82.7% ^[1]	68.5% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 has a clear lead on command-line agent workflows.
SWE-Bench Pro	58.6% ^[1]	Insufficient evidence	Insufficient evidence	Third-party sources place it in the same tier as GPT-5.4 / Opus 4.6, but there is no same-source comparison against GPT-5.5 / Opus 4.7 ^[5]	GPT-5.5 has an official score; the other models lack data under the same settings.
GDPval / GDPval-AA	GPT-5.5 wins-or-ties 84.9% ^[1]	80.3% ^[1]	Insufficient evidence	GDPval-AA Elo 1520, up from 1309 for Kimi K2.5 ^[5]	GPT and Claude can be compared within OpenAI's table; Kimi's AA Elo should not be compared directly with OpenAI's percentage scores.
BrowseComp	84.4% ^[1]	79.3% ^[1]	83.4% ^[4]	83.2% ^[7]	GPT-5.5 is slightly ahead, DeepSeek and Kimi are very close, and Claude is somewhat weaker.
FrontierMath Tier 1–3	51.7% ^[1]	43.8% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 leads Claude in this official table.
FrontierMath Tier 4	35.4% ^[1]	22.9% ^[1]	Insufficient evidence	Insufficient evidence	On the hardest math problems, the gap between GPT-5.5 and Claude widens.
Humanity’s Last Exam, no tools	41.4% ^[4]	46.9% ^[4]	37.7% ^[4]	Insufficient evidence	Claude is highest in the no-tools HLE setting reported.
Humanity’s Last Exam, with tools	52.2%; GPT-5.5 Pro at 57.2% ^[4]	54.7% ^[4]	48.2% ^[4]	Insufficient evidence	With tools, Claude is above base GPT-5.5 but below GPT-5.5 Pro.
CyberGym	81.8% ^[1]	73.1% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 leads in OpenAI's table.
AA-Omniscience / hallucination rate	Insufficient evidence	Hallucination rate 36%, shown as a reference point ^[5]	Insufficient evidence	Hallucination rate 39%, down from 65% for Kimi K2.5 ^[5]	Kimi K2.6's low hallucination rate is a clear strength.
τ²-Bench Telecom	Insufficient evidence	Insufficient evidence	Insufficient evidence	96% ^[5]	Kimi K2.6 is strong on tool use / telecom tasks.
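
The rows where GPT-5.5 and Claude Opus 4.7 are both scored in the OpenAI-sourced table can be reduced to percentage-point gaps, which makes it easier to see where the lead is large and where it is small. This is a minimal sketch: the row list is copied from the table above, and the names `SHARED_ROWS` and `point_gaps` are illustrative assumptions.

```python
# (benchmark, GPT-5.5 %, Claude Opus 4.7 %) for rows cited to the OpenAI table.
SHARED_ROWS = [
    ("Terminal-Bench 2.0", 82.7, 68.5),
    ("GDPval (wins or ties)", 84.9, 80.3),
    ("BrowseComp", 84.4, 79.3),
    ("FrontierMath Tier 1-3", 51.7, 43.8),
    ("FrontierMath Tier 4", 35.4, 22.9),
    ("CyberGym", 81.8, 73.1),
]


def point_gaps(rows):
    """Map each benchmark to the GPT-5.5 minus Claude gap in percentage points."""
    return {name: round(gpt - claude, 1) for name, gpt, claude in rows}


gaps = point_gaps(SHARED_ROWS)
# Print the largest gaps first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: GPT-5.5 ahead by {gap:.1f} points")
```

The gaps range from 4.6 points (GDPval) to 14.2 points (Terminal-Bench 2.0). The Humanity’s Last Exam rows come from a different source and point the other way in the no-tools setting, so they are deliberately left out of this same-source calculation.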

Model-by-model assessment

GPT-5.5

Claude Opus 4.7

DeepSeek V4 / V4 Pro

Kimi K2.6

Practical recommendations

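If the goal is the highest overall intelligence, agentic coding, browsing research, and tool tasks, choose GPT-5.5 first.^[1]^[2]^[6]
If the goal is long-horizon engineering tasks, complex document/visual understanding, stronger self-verification, and an output price slightly below GPT-5.5, choose Claude Opus 4.7 first.^[3]
If the goal is large-scale inference, a low-cost API, an open-source license, or deployment flexibility, evaluate DeepSeek V4 Pro first.^[4]
If the goal is open weights, multimodal input, a 256K long context, a low hallucination rate, and strong tool use, evaluate Kimi K2.6 first.^[5]
If the goal is web research or web-retrieval tasks, the public BrowseComp scores of GPT-5.5, DeepSeek V4, and Kimi K2.6 are very close, so run a small-sample A/B test on your own real workflow.^[1]^[4]^[7]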
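
One way to run the small-sample A/B test suggested above is to give two models the same tasks, record pass/fail per task, and apply an exact sign test to the tasks where they disagree. The sketch below assumes pass/fail outcomes have already been collected by hand or by a harness of your own; the function names and the example outcome lists are hypothetical, not results from any of the models discussed.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on paired outcomes; ties are excluded."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(results_a, results_b):
    """Compare per-task pass/fail lists given in the same task order."""
    wins_a = sum(a and not b for a, b in zip(results_a, results_b))
    wins_b = sum(b and not a for a, b in zip(results_a, results_b))
    return wins_a, wins_b, sign_test_p_value(wins_a, wins_b)


# Hypothetical outcomes on 10 tasks (True = task passed).
model_a = [True] * 8 + [False] * 2
model_b = [True] * 5 + [False] * 5
wins_a, wins_b, p = compare(model_a, model_b)
print(f"A-only wins: {wins_a}, B-only wins: {wins_b}, p = {p:.3f}")
```

With only ten tasks, even a 3-to-0 split on the disagreements gives p = 0.25, which illustrates why score differences of a point or two on public benchmarks should not decide the choice without a larger sample from your own workload.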

Evidence notes

Limitations / uncertainty

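Insufficient evidence: no complete public table was found that covers all key benchmarks for the four models from the same organization, at the same time, with the same reasoning budget and the same tool settings.^[1]^[3]^[4]^[5]^[6]^[7]

Insufficient evidence: official item-by-item benchmarks for DeepSeek V4 and Kimi K2.6, and especially same-source comparisons against GPT-5.5 and Claude Opus 4.7, are incomplete in the material that could be verified here.^[4]^[5]^[6]^[7]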
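
The coverage gap can be checked mechanically by recording, for each benchmark in the comparison table, which models have a numeric score on the same metric. The sketch below is one reading of that table: the `COVERAGE` flags are transcribed by hand, qualitative cells and Kimi's GDPval-AA Elo (a different metric) are treated as missing, and all names are illustrative assumptions.

```python
MODELS = ("GPT-5.5", "Claude Opus 4.7", "DeepSeek V4 / V4 Pro", "Kimi K2.6")

# True = the table reports a numeric score on the same metric for that model.
COVERAGE = {
    "AA Intelligence Index": (True, True, False, True),
    "Terminal-Bench 2.0": (True, True, False, False),
    "SWE-Bench Pro": (True, False, False, False),
    "GDPval (wins or ties)": (True, True, False, False),  # Kimi's Elo is a different metric
    "BrowseComp": (True, True, True, True),
    "FrontierMath Tier 1-3": (True, True, False, False),
    "FrontierMath Tier 4": (True, True, False, False),
    "HLE, no tools": (True, True, True, False),
    "HLE, with tools": (True, True, True, False),
    "CyberGym": (True, True, False, False),
    "AA-Omniscience hallucination rate": (False, True, False, True),
    "tau2-Bench Telecom": (False, False, False, True),
}


def fully_covered(coverage):
    """Benchmarks with a comparable numeric score for every model."""
    return [name for name, flags in coverage.items() if all(flags)]


def scores_per_model(coverage):
    """Number of benchmarks with a numeric score, per model."""
    return {
        model: sum(flags[i] for flags in coverage.values())
        for i, model in enumerate(MODELS)
    }


print("Covered for all four:", fully_covered(COVERAGE))
print("Scores per model:", scores_per_model(COVERAGE))
```

Under this reading only BrowseComp has a number for all four models, and even there the figures come from three different sources, which is the practical meaning of the two limitations stated above.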

Summary

Sources

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ... (mindstudio.ai)
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] DeepSeek V4 Pro (Reasoning, High Effort) vs GPT-5.5 (medium) (artificialanalysis.ai)
What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...
[5] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)
Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...
[6] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4....
[7] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] Kimi K2.6 vs GPT-5.5 - Detailed Performance & Feature Comparison (docsbot.ai)
BrowseComp Evaluates browsing and web comprehension capabilities 83.2% Thinking mode; context management Source 84.4% Source CharXiv Reasoning Evaluates information synthesis from complex charts and scientific visualizations 80.4% RQ; 86.7 with Python Sourc...
[9] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[10] Claude Opus 4.7 results: early benchmarks, real-world feedback ... (boringbot.substack.com)
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[11] DeepSeek V4 vs Claude vs GPT-5.4: A 38-Task Benchmark ... - FundaAI (fundaai.substack.com)
As of time of publication, GPT-5.5 has not yet officially released its API. Testing solely through Codex 5.5 may not fully reflect the complete performance of the API. We have currently only conducted urgent testing on DeepSeek V4, and will include GPT-5.5...
[12] GPT 5.5 Vs Claude Opus 4.7 Proves Benchmarks Need Context (reddit.com)
That means the benchmark story is not one-sided. A model can be better at running commands and still not be the best choice for careful code cleanup. Another model can be better at clean output and still feel slower when the task needs constant execution. T...
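[13] Kimi K2.6 | 熱門關鍵字 - 鉅亨號 (hao.cnyes.com)
Summary: Two people from Guangdong hold up half the sky of Chinese AI. After much anticipation, on the morning of April 24 DeepSeek finally dropped its bombshell: the preview of the all-new DeepSeek-V4 model series went live and was open-sourced at the same time, claiming leading performance among domestic and open-source models on three dimensions: agent capability, world knowledge, and reasoning performance. Since the test version went online on February 8, it had kept a mysterious profile. Just four days earlier, on Monday evening, Moonshot AI likewise released and open-sourced the Kimi K2.6 model, which focuses on long-horizon coding and agent-cluster capabilities and matches or even beats closed models such as GPT-5.4 and Claude Opus 4.6 on several benchmarks. A tally by Phoenix Technology found that this is already Kimi and D...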
[14] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)
Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...
[15] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...
[16] LLM Coding Benchmark (April 2026): GPT 5.5, DeepSeek v4, Kimi ... (akitaonrails.com)
Rank Model Score Tier RubyLLM OK Time Cost --- --- --- 1 Claude Opus 4.7 97 A ✅ 18m $1.10 1 GPT 5.4 xHigh (Codex) 97 A ✅ 22m $16 3 GPT 5.5 xHigh (Codex) 96 A ✅ 18m $10 4 Kimi K2.6 87 A ✅ 20m $0.30 5 Claude Opus 4.6 83 A ✅ 16m $1.10 6 Gemini 3.1 Pro 82 A ✅ 1...
[17] Usama Akram | AI Agents & AI Automation (@aiagently) on Threads (threads.com)
🚨 April was absolutely insane for AI releases... • Claude Opus 4.7 (Anthropic) • GPT-5.5 (OpenAI) • DeepSeek V4 (DeepSeek) • Xiaomi Mimi V2.5 • Qwen3.6-Plus (Alibaba) • GLM-5.1 (Zhipu) • Muse Spark (Meta) • Qwen3.6-35B-A3B (Alibaba) • Grok 4.3 Beta (xAI) •...
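[18] 川弟 (@fatsol_erc) on Threads (threads.com)
Big AI news of the past few days: DeepSeek officially launched V4 and OpenAI launched GPT 5.5. I currently use GPT more, and I hear 5.5 outperforms Claude Opus 4.7 and has capabilities on par with Mythos. This major upgrade also introduced a local-first agent for macOS, which is very friendly to Mac users. After trying it right away, I do find GPT 5.5 better to use than Opus 4.7, especially since it does not ban accounts arbitrarily or get dumbed down. We really are living in the best of times, and AI will drive exponential growth in human productivity. On release, it pushed everyone straight into the AI Superapp era. The wildest part this time is not...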
[19] We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow ... (blog.kilo.ai)
Where Open-Weight Models Stand Right Now This test sits inside a pattern we’ve been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6’s detection rate on our last three-part benchmark. GLM-5.1 scored five points behind Claude Opus 4.6 on our job qu...
[20] Bad Opus 4.7, Good Kimi K2.6, and Growing Codex (aicodingdaily.substack.com)
Simon Willison on X: “Opus 4.7 appears to use 1.46x times the tokens for text and up to 3x the tokens for images than Opus 4.6” x.com I upgraded my Claude token counter tool to compare different models and Opus 4.7 does appear to use 1.46x times the tokens...
[21] A bunch of model releases this week - Opus 4.7, GPT-5.5. And then there's also the second-tier stuff like Qwen3.6, Kimi K2.6, GLM-5.1, DeepSeek V4. Honestly, we keep obsessing over which frontier… | David Kaplan (linkedin.com)
Looking at the months ahead, I think that the economics may start to shift. Right now teams are still throwing money at AI, exploring, willing to spend regardless of what it costs per inference. But I'd image that that's not going to last. Eventually someon...
[22] Introducing GPT-Rosalind for life sciences research - OpenAI (openai.com)
OpenAI introduces GPT-Rosalind, a frontier reasoning model built to accelerate drug discovery, genomics analysis, protein reasoning, ... Apr 16, 2026
[23] Making ChatGPT better for clinicians - OpenAI (openai.com)
Introducing GPT-5.5. Product Apr 23, 2026. OAI Blog Agents Hero 1x1. Introducing workspace agents in ChatGPT. Product Apr 22, 2026. Images 2.0 ... 4 days ago
[24] Semantic Alignment Before Acceleration - Prompting (community.openai.com)
April 24, 2026. Feedback on GPT-5 Model Performance for Translation Tasks · Feedback. 19, 2756, August 19, 2025. Chatgpt API isn't good as it's ... 1 day ago
[25] Introducing GPT-5.5 - OpenAI (openai.com)
Update on April 24, 2026: GPT‑5.5 and GPT‑5.5 Pro are now available ... GPT‑5.5 achieved leading performance among models with published scores. 3 days ago
[26] [PDF] HealthBench Professional: Evaluating Large Language Models on ... (cdn.openai.com)
An example is Codex (OpenAI, 2026c) using GPT-5.4, an AI system composed of Codex. (harness) and GPT-5.4 (model). Other examples include ... 4 days ago
[27] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAI (cdn.openai.com)
We introduce GeneBench, a benchmark for AI agents on realistic multi-stage scientific data analysis in genetics and quantitative biology. 3 days ago
[28] GPT-5.5 is here! Available in the API, Codex and ChatGPT today (community.openai.com)
Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, ... 3 days ago
[29] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
GPT-5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, ... 3 days ago
[30] Introducing OpenAI Privacy Filter (openai.com)
OpenAI Privacy Filter is an open-weight model for detecting and redacting personally identifiable information (PII) in text with ... 4 days ago
[31] Introducing workspace agents in ChatGPT - OpenAI (openai.com)
Introducing GPT-5.5. Product Apr 23, 2026. Making ChatGPT free for clinicians. Making ChatGPT better for clinicians. Product Apr 22, 2026.
[32] Image generation | OpenAI API (developers.openai.com)
response = client.responses.create( model="gpt-5.5", input=[ { "role": "user", "content": [ {"type": "input text", "text": prompt}, { "type": "input image", "image url": f"data:image/jpeg;base64,{base64 image1}", }, { "type": "input image", "image url": f"d...
[33] OpenAI API Platform Documentation (developers.openai.com)
Categories Demo apps Videos Topics Agents Audio & Voice Computer Use Codex Evals gpt-oss Fine-tuning Image generation Scaling Tools Video generation Community Programs Codex Ambassadors Codex for Students Codex for Open Source Events Meetups Hackathon Suppo...
[34] Using GPT-5.5 | OpenAI API (developers.openai.com)
For tool-heavy or long-running workflows, verify that your application handles phase , preambles, and assistant-item replay correctly. Benchmark against other models on accuracy, token consumption, and end-to-end latency. [...] More efficient reasoning: GPT...
[35] Models | OpenAI API (developers.openai.com)
GPT-5.5 New A new class of intelligence for coding and professional work. Model ID gpt-5.5 [Reasoning none low medium high xhigh Input price $5 / Input MTok Output price $30 / Output MTok Latency Fast Max output 128K tokens Context window 1M Tools Functions...
[36] ChatGPT — Release Notes - OpenAI Help Center (help.openai.com)
March 5, 2026 GPT-5.4 Thinking in ChatGPT GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ while impr...
[37] GPT-4o System Card | OpenAI (openai.com)
GPT DALL·E Whisper Generative Models Ethics & Safety Language System Cards Citations 1. 1 OpenAI, “Hello gpt-4o.” 2024. 2. 2 T. Stivers, N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. de Ruiter, K. E. Yoon, and...
[38] Harness engineering: leveraging Codex in an agent-first world (openai.com)
For Business Business Overview Solutions Contact Sales Company About Us Our Charter Foundation(opens in a new window) Careers Brand Support Help Center(opens in a new window) More News Stories Academy Livestreams Podcast RSS Terms & Policies Terms of Use Pr...
[39] Introducing deep research - OpenAI (openai.com)
Leadership Akshay Nathan, Greg Brockman, Hannah Wong, Jakub Pachocki, Jerry Tworek, Johannes Heidecke, Josh Tobin, Liam Fedus, Mark Chen, Mia Glaese, Nick Turley, Sam Altman, Wojciech Zaremba Our Research Research Index Research Overview Research Residency...
[40] Introducing GPT-5.4 - OpenAI (openai.com)
Evals without reasoning EvalGPT‑5.4 (none)GPT‑5.2 (none)GPT-4.1 OmniDocBench (normalized edit distance)0.109 0.140— Tau2-bench Telecom 64.3%57.2%43.6% Evals were run with reasoning effort set to xhigh, except where specified otherwise. Benchmarks were condu...
[41] Codex | AI Coding Partner from OpenAI (openai.com)
OpenAI © 2015–2026 Manage Cookies English United States [...] “I needed to update another team’s codebase for a release. Codex handled the refactor and test generation, delivering fully tested code I handed back fast—keeping the feature on schedule without...
[42] Research - OpenAI (openai.com)
GPT OpenAI’s GPT series models are fast, versatile, and cost-efficient AI systems designed to understand context, generate content, and reason across text, images, and more. Image 2: Hero Art Card SEO 1x1 A new class of intelligence for real work Release Ap...

ReportsPublic URLApr 26, 202642 sources

研究 Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 的基准测试，并尽可能全面地进行比较。请以研究报告的格式呈现。

Key takeaways

综合智能排名上，Artificial Analysis 的公开比较显示 GPT-5.5 xHigh 位列第一，GPT-5.5 High 第二，Claude Opus 4.7 与 Gemini 3.1 Pro Preview 同处 57 分档位，而 Kimi K2.6 在其 Intelligence Index 上为 54 分、位列开权重模型第一梯队。
编码与代理任务上，OpenAI 官方表格给出 GPT-5.5 在 Terminal-Bench 2.0 为 82.7%，在 SWE-Bench Pro 为 58.6%，并称其是 OpenAI 当时最强的 agentic coding 模型。 Anthropic 官方没有在可抓取文本中给出完整统一表格，但称 Opus 4.7 相比 Opus 4.6 在高级软件工程、复杂长程任务和自我验证方面显著提升，并引用客户 93-task coding benchmark 中 resolution 提升 13%。
推理与数学上，OpenAI 官方表格显示 GPT-5.5 在 FrontierMath Tier 1–3 为 51.7%、Tier 4 为 35.4%，高于表中 Claude Opus 4.7 的 43.8% 与 22.9%。但 VentureBeat 汇总的 Humanity’s Last Exam 数据显示，在无工具设置下 Claude Opus 4.7 为 46.9%，高于 GPT-5.5 的 41.4% 与 DeepSeek V4 的 37.7%。
浏览与研究型任务上，GPT-5.5 在 OpenAI 表格中的 BrowseComp 为 84.4%，Claude Opus 4.7 为 79.3%。 VentureBeat 报道 DeepSeek V4 Pro-Max 在 BrowseComp 为 83.4%，DocsBot 汇总 Kimi K2.6 为 83.2%，因此该项上 GPT-5.5、DeepSeek V4、Kimi K2.6 非常接近，Claude Opus 4.7 落后一些。
截至 2026-04-26，可核查的公开资料显示：GPT-5.5 是四者中综合基准最强的闭源模型，Claude Opus 4.7 在长程编码、数据纪律和多步任务上紧随其后，Kimi K2.6 是目前证据中最强的开权重模型，DeepSeek V4 Pro 的主要优势是接近前沿模型的能力与显著更低的 API 成本。[1][3][4][5][6] 但四者并没有完整、同源、同设置的官方横向评测；尤其 DeepSeek V4 与 Kimi K2.6 的逐项官方对比证据不足，因此部分结论应视为“公开资料下的最佳估计”。[4][5][6][7]
## Key findings

Research answer

Key findings

综合智能排名上，Artificial Analysis 的公开比较显示 GPT-5.5 xHigh 位列第一，GPT-5.5 High 第二，Claude Opus 4.7 与 Gemini 3.1 Pro Preview 同处 57 分档位，而 Kimi K2.6 在其 Intelligence Index 上为 54 分、位列开权重模型第一梯队。^[5]^[6]
编码与代理任务上，OpenAI 官方表格给出 GPT-5.5 在 Terminal-Bench 2.0 为 82.7%，在 SWE-Bench Pro 为 58.6%，并称其是 OpenAI 当时最强的 agentic coding 模型。^[1] Anthropic 官方没有在可抓取文本中给出完整统一表格，但称 Opus 4.7 相比 Opus 4.6 在高级软件工程、复杂长程任务和自我验证方面显著提升，并引用客户 93-task coding benchmark 中 resolution 提升 13%。^[3]
推理与数学上，OpenAI 官方表格显示 GPT-5.5 在 FrontierMath Tier 1–3 为 51.7%、Tier 4 为 35.4%，高于表中 Claude Opus 4.7 的 43.8% 与 22.9%。^[1] 但 VentureBeat 汇总的 Humanity’s Last Exam 数据显示，在无工具设置下 Claude Opus 4.7 为 46.9%，高于 GPT-5.5 的 41.4% 与 DeepSeek V4 的 37.7%。^[4]
浏览与研究型任务上，GPT-5.5 在 OpenAI 表格中的 BrowseComp 为 84.4%，Claude Opus 4.7 为 79.3%。^[1] VentureBeat 报道 DeepSeek V4 Pro-Max 在 BrowseComp 为 83.4%，DocsBot 汇总 Kimi K2.6 为 83.2%，因此该项上 GPT-5.5、DeepSeek V4、Kimi K2.6 非常接近，Claude Opus 4.7 落后一些。^[4]^[7]
成本上，OpenAI API 文档列出 GPT-5.5 为 $5 / 百万输入 token、$30 / 百万输出 token，并支持 1M 上下文与 128K 最大输出。^[2] Anthropic 官方称 Claude Opus 4.7 价格维持 $5 / 百万输入 token、$25 / 百万输出 token。^[3] VentureBeat 引用 DeepSeek API 定价称 DeepSeek V4 Pro 为 $1.74 / 百万输入 token、$3.48 / 百万输出 token，显著低于 GPT-5.5 与 Claude Opus 4.7。^[4]
开权重生态上，Kimi K2.6 被 Artificial Analysis 称为“new leading open weights model”，其模型为 1T 总参数、32B 激活参数的 MoE，支持 256K 上下文、图像和视频输入、文本输出。^[5] DeepSeek V4 被报道为 1.6T 参数 MoE、MIT License、可在 Hugging Face 与 DeepSeek API 获取，因此在可部署性和成本上更开放、更便宜。^[4]

Benchmark comparison

维度 / 基准	GPT-5.5	Claude Opus 4.7	DeepSeek V4 / V4 Pro	Kimi K2.6	解读
Artificial Analysis Intelligence Index	xHigh 60；High 59 ^[6]	57 ^[6]	V4 Pro 被列为领先开权重模型之一，但本次资料未给出同表精确分数 ^[6]	54，开权重第一梯队 ^[5]	GPT-5.5 综合领先，Claude 接近，Kimi 是开权重最强代表之一。
Terminal-Bench 2.0	82.7% ^[1]	68.5% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在命令行代理工作流上优势明显。
SWE-Bench Pro	58.6% ^[1]	Insufficient evidence	Insufficient evidence	第三方资料显示可与 GPT-5.4 / Opus 4.6 同档，但缺少同源 GPT-5.5 / Opus 4.7 对比 ^[5]	GPT-5.5 有官方分数；其他模型缺少同设置数据。
GDPval / GDPval-AA	GPT-5.5 wins-or-ties 84.9% ^[1]	80.3% ^[1]	Insufficient evidence	GDPval-AA Elo 1520，高于 Kimi K2.5 的 1309 ^[5]	GPT 与 Claude 可在 OpenAI 表中横比；Kimi 的 AA Elo 不应与 OpenAI 的百分比分数直接相加比较。
BrowseComp	84.4% ^[1]	79.3% ^[1]	83.4% ^[4]	83.2% ^[7]	GPT-5.5 略高，DeepSeek 与 Kimi 非常接近，Claude 稍弱。
FrontierMath Tier 1–3	51.7% ^[1]	43.8% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在该官方表格中领先 Claude。
FrontierMath Tier 4	35.4% ^[1]	22.9% ^[1]	Insufficient evidence	Insufficient evidence	难题数学上 GPT-5.5 与 Claude 差距扩大。
Humanity’s Last Exam，无工具	41.4% ^[4]	46.9% ^[4]	37.7% ^[4]	Insufficient evidence	Claude 在该报道的无工具 HLE 设置中最高。
Humanity’s Last Exam，有工具	52.2%；GPT-5.5 Pro 为 57.2% ^[4]	54.7% ^[4]	48.2% ^[4]	Insufficient evidence	工具设置下 Claude 高于 GPT-5.5 base，但低于 GPT-5.5 Pro。
CyberGym	81.8% ^[1]	73.1% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在 OpenAI 表格中领先。
AA-Omniscience / 幻觉率	Insufficient evidence	幻觉率 36%，作为对照项出现 ^[5]	Insufficient evidence	幻觉率 39%，低于 Kimi K2.5 的 65% ^[5]	Kimi K2.6 的低幻觉率是其明显亮点。
τ²-Bench Telecom	Insufficient evidence	Insufficient evidence	Insufficient evidence	96% ^[5]	Kimi K2.6 在工具使用 / 电信任务中表现强。

Model-by-model assessment

GPT-5.5

Claude Opus 4.7

DeepSeek V4 / V4 Pro

Kimi K2.6

Practical recommendations

如果目标是最高综合智能、agentic coding、浏览研究和工具任务，优先选择 GPT-5.5。^[1]^[2]^[6]
如果目标是长程工程任务、复杂文档/视觉理解、较强自我校验和比 GPT-5.5 略低的输出价格，优先选择 Claude Opus 4.7。^[3]
如果目标是大规模推理、低成本 API、开源许可证或部署灵活性，优先评估 DeepSeek V4 Pro。^[4]
如果目标是开权重、多模态输入、256K 长上下文、低幻觉率和强工具使用，优先评估 Kimi K2.6。^[5]
如果目标是联网研究或网页检索类任务，GPT-5.5、DeepSeek V4 和 Kimi K2.6 的公开 BrowseComp 分数非常接近，建议用自己的真实工作流做小样本 A/B 测试。^[1]^[4]^[7]

Evidence notes

Limitations / uncertainty

Insufficient evidence：没有找到四个模型在同一组织、同一时间、同一推理预算、同一工具设置下覆盖所有关键基准的完整公开表格。^[1]^[3]^[4]^[5]^[6]^[7]

Insufficient evidence：DeepSeek V4 与 Kimi K2.6 的官方逐项基准、尤其是与 GPT-5.5 和 Claude Opus 4.7 的同源对比，在本次可核查材料中不完整。^[4]^[5]^[6]^[7]

Summary

Supporting visuals

thieu.nhattienzkxu.663102's profile picture

trendsdatanewsfr's profile picture

Research conversation

YouApr 26, 2026

Research question

研究 Claude Opus 4.7、GPT-5.5、DeepSeek V4、Kimi K2.6 的基准测试，并尽可能全面地进行比较。请以研究报告的格式呈现。

Studio Global AIApr 26, 202640 sources

Cited answer

Key findings

综合智能排名上，Artificial Analysis 的公开比较显示 GPT-5.5 xHigh 位列第一，GPT-5.5 High 第二，Claude Opus 4.7 与 Gemini 3.1 Pro Preview 同处 57 分档位，而 Kimi K2.6 在其 Intelligence Index 上为 54 分、位列开权重模型第一梯队。^[5]^[6]
编码与代理任务上，OpenAI 官方表格给出 GPT-5.5 在 Terminal-Bench 2.0 为 82.7%，在 SWE-Bench Pro 为 58.6%，并称其是 OpenAI 当时最强的 agentic coding 模型。^[1] Anthropic 官方没有在可抓取文本中给出完整统一表格，但称 Opus 4.7 相比 Opus 4.6 在高级软件工程、复杂长程任务和自我验证方面显著提升，并引用客户 93-task coding benchmark 中 resolution 提升 13%。^[3]
推理与数学上，OpenAI 官方表格显示 GPT-5.5 在 FrontierMath Tier 1–3 为 51.7%、Tier 4 为 35.4%，高于表中 Claude Opus 4.7 的 43.8% 与 22.9%。^[1] 但 VentureBeat 汇总的 Humanity’s Last Exam 数据显示，在无工具设置下 Claude Opus 4.7 为 46.9%，高于 GPT-5.5 的 41.4% 与 DeepSeek V4 的 37.7%。^[4]
浏览与研究型任务上，GPT-5.5 在 OpenAI 表格中的 BrowseComp 为 84.4%，Claude Opus 4.7 为 79.3%。^[1] VentureBeat 报道 DeepSeek V4 Pro-Max 在 BrowseComp 为 83.4%，DocsBot 汇总 Kimi K2.6 为 83.2%，因此该项上 GPT-5.5、DeepSeek V4、Kimi K2.6 非常接近，Claude Opus 4.7 落后一些。^[4]^[7]
成本上，OpenAI API 文档列出 GPT-5.5 为 $5 / 百万输入 token、$30 / 百万输出 token，并支持 1M 上下文与 128K 最大输出。^[2] Anthropic 官方称 Claude Opus 4.7 价格维持 $5 / 百万输入 token、$25 / 百万输出 token。^[3] VentureBeat 引用 DeepSeek API 定价称 DeepSeek V4 Pro 为 $1.74 / 百万输入 token、$3.48 / 百万输出 token，显著低于 GPT-5.5 与 Claude Opus 4.7。^[4]
开权重生态上，Kimi K2.6 被 Artificial Analysis 称为“new leading open weights model”，其模型为 1T 总参数、32B 激活参数的 MoE，支持 256K 上下文、图像和视频输入、文本输出。^[5] DeepSeek V4 被报道为 1.6T 参数 MoE、MIT License、可在 Hugging Face 与 DeepSeek API 获取，因此在可部署性和成本上更开放、更便宜。^[4]

Benchmark comparison

维度 / 基准	GPT-5.5	Claude Opus 4.7	DeepSeek V4 / V4 Pro	Kimi K2.6	解读
Artificial Analysis Intelligence Index	xHigh 60；High 59 ^[6]	57 ^[6]	V4 Pro 被列为领先开权重模型之一，但本次资料未给出同表精确分数 ^[6]	54，开权重第一梯队 ^[5]	GPT-5.5 综合领先，Claude 接近，Kimi 是开权重最强代表之一。
Terminal-Bench 2.0	82.7% ^[1]	68.5% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在命令行代理工作流上优势明显。
SWE-Bench Pro	58.6% ^[1]	Insufficient evidence	Insufficient evidence	第三方资料显示可与 GPT-5.4 / Opus 4.6 同档，但缺少同源 GPT-5.5 / Opus 4.7 对比 ^[5]	GPT-5.5 有官方分数；其他模型缺少同设置数据。
GDPval / GDPval-AA	GPT-5.5 wins-or-ties 84.9% ^[1]	80.3% ^[1]	Insufficient evidence	GDPval-AA Elo 1520，高于 Kimi K2.5 的 1309 ^[5]	GPT 与 Claude 可在 OpenAI 表中横比；Kimi 的 AA Elo 不应与 OpenAI 的百分比分数直接相加比较。
BrowseComp	84.4% ^[1]	79.3% ^[1]	83.4% ^[4]	83.2% ^[7]	GPT-5.5 略高，DeepSeek 与 Kimi 非常接近，Claude 稍弱。
FrontierMath Tier 1–3	51.7% ^[1]	43.8% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在该官方表格中领先 Claude。
FrontierMath Tier 4	35.4% ^[1]	22.9% ^[1]	Insufficient evidence	Insufficient evidence	难题数学上 GPT-5.5 与 Claude 差距扩大。
Humanity’s Last Exam，无工具	41.4% ^[4]	46.9% ^[4]	37.7% ^[4]	Insufficient evidence	Claude 在该报道的无工具 HLE 设置中最高。
Humanity’s Last Exam，有工具	52.2%；GPT-5.5 Pro 为 57.2% ^[4]	54.7% ^[4]	48.2% ^[4]	Insufficient evidence	工具设置下 Claude 高于 GPT-5.5 base，但低于 GPT-5.5 Pro。
CyberGym	81.8% ^[1]	73.1% ^[1]	Insufficient evidence	Insufficient evidence	GPT-5.5 在 OpenAI 表格中领先。
AA-Omniscience / 幻觉率	Insufficient evidence	幻觉率 36%，作为对照项出现 ^[5]	Insufficient evidence	幻觉率 39%，低于 Kimi K2.5 的 65% ^[5]	Kimi K2.6 的低幻觉率是其明显亮点。
τ²-Bench Telecom	Insufficient evidence	Insufficient evidence	Insufficient evidence	96% ^[5]	Kimi K2.6 在工具使用 / 电信任务中表现强。

Model-by-model assessment

GPT-5.5

Claude Opus 4.7

DeepSeek V4 / V4 Pro

Kimi K2.6

Practical recommendations

如果目标是最高综合智能、agentic coding、浏览研究和工具任务，优先选择 GPT-5.5。^[1]^[2]^[6]
如果目标是长程工程任务、复杂文档/视觉理解、较强自我校验和比 GPT-5.5 略低的输出价格，优先选择 Claude Opus 4.7。^[3]
如果目标是大规模推理、低成本 API、开源许可证或部署灵活性，优先评估 DeepSeek V4 Pro。^[4]
如果目标是开权重、多模态输入、256K 长上下文、低幻觉率和强工具使用，优先评估 Kimi K2.6。^[5]
如果目标是联网研究或网页检索类任务，GPT-5.5、DeepSeek V4 和 Kimi K2.6 的公开 BrowseComp 分数非常接近，建议用自己的真实工作流做小样本 A/B 测试。^[1]^[4]^[7]

Evidence notes

Limitations / uncertainty

Insufficient evidence：没有找到四个模型在同一组织、同一时间、同一推理预算、同一工具设置下覆盖所有关键基准的完整公开表格。^[1]^[3]^[4]^[5]^[6]^[7]

Insufficient evidence：DeepSeek V4 与 Kimi K2.6 的官方逐项基准、尤其是与 GPT-5.5 和 Claude Opus 4.7 的同源对比，在本次可核查材料中不完整。^[4]^[5]^[6]^[7]

Summary

Continue your research

Recherchieren Sie die Benchmarks von Claude Opus 4.7, GPT-5.5, DeepSeek V4 und Kimi K2.6 und vergleichen Sie sie so umfassend wie möglich im

Recherchieren Sie die Benchmarks von Claude Opus 4.7, GPT-5.5, DeepSeek V4 und Kimi K2.6 und vergleichen Sie sie so u...

Sources

[1] Claude Opus 4.7 Benchmark Breakdown: Vision, Coding, ...mindstudio.ai
Claude Opus 4.7 posted 82.4% on SWE-bench Verified, up roughly 11 points from Opus 4.6 — the most meaningful coding benchmark available. Vision improvements were the largest percentage gains: MathVista jumped 9.5 points, enabling reliable visual math reason...
[2] Claude Opus 4.7 Benchmarks 2026: Scores, Rankings & Performancebenchlm.ai
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools Claude Opus 4.7 According to BenchLM.ai, Claude Opus 4.7 ranks 2 out of 110 models on the provisional leaderboard with an overall score of 97/100. It also ranks 2 out of 14 on t...
[3] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Geminimashable.com
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[4] DeepSeek V4 Pro (Reasoning, High Effort) vs GPT-5.5 (medium)artificialanalysis.ai
What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...
[5] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouteropenrouter.ai
Ready Output will appear here... Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...
[6] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ...venturebeat.com
On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4....
[7] Introducing Claude Opus 4.7 - Anthropicanthropic.com
Image 7: logo Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[8] Kimi K2.6 vs GPT-5.5 - Detailed Performance & Feature Comparison (docsbot.ai)
Columns: Kimi K2.6 / GPT-5.5. BrowseComp (browsing and web comprehension capabilities): 83.2% (thinking mode; context management) / 84.4%. CharXiv Reasoning (information synthesis from complex charts and scientific visualizations): 80.4% (RQ; 86.7 with Python) ...
[9] Kimi K2.6: The new leading open weights model - Artificial Analysis (artificialanalysis.ai)
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Kimi K2.6 has significantly higher token usage than Kimi K2.5. Kimi K2.5 scores 6 on the AA-Omniscience Index, primarily driven...
[10] Claude Opus 4.7 results: early benchmarks, real-world feedback ... (boringbot.substack.com)
The Claude Opus 4.7 benchmarks on software engineering tasks show the clearest improvement. On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, wit...
[11] DeepSeek V4 vs Claude vs GPT-5.4: A 38-Task Benchmark ... - FundaAI (fundaai.substack.com)
As of time of publication, GPT-5.5 has not yet officially released its API. Testing solely through Codex 5.5 may not fully reflect the complete performance of the API. We have currently only conducted urgent testing on DeepSeek V4, and will include GPT-5.5...
[12] GPT 5.5 Vs Claude Opus 4.7 Proves Benchmarks Need Context (reddit.com)
That means the benchmark story is not one-sided. A model can be better at running commands and still not be the best choice for careful code cleanup. Another model can be better at clean output and still feel slower when the task needs constant execution. T...
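[13] Kimi K2.6 | Trending Keywords - 鉅亨號 (hao.cnyes.com)
Summary: Two people from Guangdong are holding up half of China's AI sky. After long anticipation, on the morning of April 24 DeepSeek finally dropped a bombshell: the preview of its all-new model series, DeepSeek-V4, went live and was open-sourced at the same time, claiming leading performance in China and in the open-source field across three dimensions: agent capability, world knowledge, and reasoning performance. Since the test version went live on February 8, it had kept an air of mystery. Just four days earlier, on Monday evening, Moonshot AI likewise released and open-sourced the Kimi K2.6 model, which focuses on long-horizon coding and agent-swarm capabilities and matches or even outperforms closed-source models such as GPT-5.4 and Claude Opus 4.6 on multiple benchmarks. A tally by ifeng Tech found that this is already Kimi and D...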
[14] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)
Columns: Kimi K2.6 / GPT-5.4 / Opus 4.6 / Gemini 3.1 Pro. HLE-Full (with tools; agentic reasoning with tool use): 54.0% / 52.1% / 53.0% / 51.4%. DeepSearchQA (F1; research retrieval and synthesis): 92.5% / 78.6% / 91.3% / 81.9%. SWE-Bench Pro (multi-file c...
[15] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Columns: K2.6 / Claude Opus 4.6 / GPT-5.4. SWE-Bench Pro: 58.60% / 53.40% / 57.70% (Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9%). SWE-Bench Verified: 80.20% / 80.80% / 80% (tight cluster; Opus 4.7 now leads at 87.6%). T...
[16] LLM Coding Benchmark (April 2026): GPT 5.5, DeepSeek v4, Kimi ... (akitaonrails.com)
Columns: rank, model, score, tier, RubyLLM OK, time, cost. 1: Claude Opus 4.7, 97, A, ✅, 18m, $1.10; 1: GPT 5.4 xHigh (Codex), 97, A, ✅, 22m, $16; 3: GPT 5.5 xHigh (Codex), 96, A, ✅, 18m, $10; 4: Kimi K2.6, 87, A, ✅, 20m, $0.30; 5: Claude Opus 4.6, 83, A, ✅, 16m, $1.10; 6: Gemini 3.1 Pro, 82, A, ✅, 1...
[17] Usama Akram | AI Agents & AI Automation (@aiagently) on Threads (threads.com)
🚨 April was absolutely insane for AI releases... • Claude Opus 4.7 (Anthropic) • GPT-5.5 (OpenAI) • DeepSeek V4 (DeepSeek) • Xiaomi Mimi V2.5 • Qwen3.6-Plus (Alibaba) • GLM-5.1 (Zhipu) • Muse Spark (Meta) • Qwen3.6-35B-A3B (Alibaba) • Grok 4.3 Beta (xAI) •...
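[18] 川弟 (@fatsol_erc) on Threads (threads.com)
Big AI news from the past few days: DeepSeek officially launched V4, and OpenAI launched GPT 5.5. I currently use GPT more, and I have heard that 5.5 outperforms Claude Opus 4.7 and has capabilities roughly on par with Mythos. This major upgrade also introduced a local-first agent for macOS, which is extremely friendly to Mac users. After trying it right away, I do find GPT 5.5 more usable than Opus 4.7, especially since it does not ban accounts arbitrarily or get dumbed down. We really are living in the best of times; AI will drive exponential growth in human productivity. The moment it was released, it pushed everyone straight into the AI Superapp era. The wildest part this time is not...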
[19] We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow ... (blog.kilo.ai)
Where Open-Weight Models Stand Right Now This test sits inside a pattern we’ve been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6’s detection rate on our last three-part benchmark. GLM-5.1 scored five points behind Claude Opus 4.6 on our job qu...
[20] Bad Opus 4.7, Good Kimi K2.6, and Growing Codex (aicodingdaily.substack.com)
Simon Willison on X: “Opus 4.7 appears to use 1.46x times the tokens for text and up to 3x the tokens for images than Opus 4.6” x.com I upgraded my Claude token counter tool to compare different models and Opus 4.7 does appear to use 1.46x times the tokens...
[21] A bunch of model releases this week - Opus 4.7, GPT-5.5. And then there's also the second-tier stuff like Qwen3.6, Kimi K2.6, GLM-5.1, DeepSeek V4. Honestly, we keep obsessing over which frontier… | David Kaplan (linkedin.com)
Looking at the months ahead, I think that the economics may start to shift. Right now teams are still throwing money at AI, exploring, willing to spend regardless of what it costs per inference. But I'd image that that's not going to last. Eventually someon...
[22] Introducing GPT-Rosalind for life sciences research - OpenAI (openai.com)
OpenAI introduces GPT-Rosalind, a frontier reasoning model built to accelerate drug discovery, genomics analysis, protein reasoning, ... Apr 16, 2026
[23] Making ChatGPT better for clinicians - OpenAI (openai.com)
Introducing GPT-5.5. Product Apr 23, 2026. OAI Blog Agents Hero 1x1. Introducing workspace agents in ChatGPT. Product Apr 22, 2026. Images 2.0 ... 4 days ago
[24] Semantic Alignment Before Acceleration - Prompting (community.openai.com)
April 24, 2026. Feedback on GPT-5 Model Performance for Translation Tasks · Feedback. 19, 2756, August 19, 2025. Chatgpt API isn't good as it's ... 1 day ago
[25] Introducing GPT-5.5 - OpenAI (openai.com)
Update on April 24, 2026: GPT‑5.5 and GPT‑5.5 Pro are now available ... GPT‑5.5 achieved leading performance among models with published scores. 3 days ago
[26] [PDF] HealthBench Professional: Evaluating Large Language Models on ... (cdn.openai.com)
An example is Codex (OpenAI, 2026c) using GPT-5.4, an AI system composed of Codex (harness) and GPT-5.4 (model). Other examples include ... 4 days ago
[27] [PDF] GeneBench: Assessing AI Agents for Multi-Stage Inference ... - OpenAI (cdn.openai.com)
We introduce GeneBench, a benchmark for AI agents on realistic multi-stage scientific data analysis in genetics and quantitative biology. 3 days ago
[28] GPT-5.5 is here! Available in the API, Codex and ChatGPT today (community.openai.com)
Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, ... 3 days ago
[29] GPT-5.5 System Card - OpenAI Deployment Safety Hub (deploymentsafety.openai.com)
GPT-5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, ... 3 days ago
[30] Introducing OpenAI Privacy Filter (openai.com)
OpenAI Privacy Filter is an open-weight model for detecting and redacting personally identifiable information (PII) in text with ... 4 days ago
[31] Introducing workspace agents in ChatGPT - OpenAI (openai.com)
Introducing GPT-5.5. Product Apr 23, 2026. Making ChatGPT free for clinicians. Making ChatGPT better for clinicians. Product Apr 22, 2026.
[32] Image generation | OpenAI API (developers.openai.com)
response = client.responses.create(model="gpt-5.5", input=[{"role": "user", "content": [{"type": "input_text", "text": prompt}, {"type": "input_image", "image_url": f"data:image/jpeg;base64,{base64_image1}"}, {"type": "input_image", "image_url": f"d...
[33] OpenAI API Platform Documentation (developers.openai.com)
[34] Using GPT-5.5 | OpenAI API (developers.openai.com)
For tool-heavy or long-running workflows, verify that your application handles phase, preambles, and assistant-item replay correctly. Benchmark against other models on accuracy, token consumption, and end-to-end latency. [...] More efficient reasoning: GPT...
[35] Models | OpenAI API (developers.openai.com)
GPT-5.5 (new): a new class of intelligence for coding and professional work. Model ID: gpt-5.5. Reasoning: none, low, medium, high, xhigh. Input price: $5 / MTok. Output price: $30 / MTok. Latency: fast. Max output: 128K tokens. Context window: 1M. Tools: Functions...
[36] ChatGPT - Release Notes - OpenAI Help Center (help.openai.com)
March 5, 2026 GPT-5.4 Thinking in ChatGPT GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ while impr...
[37] GPT-4o System Card | OpenAI (openai.com)
GPT DALL·E Whisper Generative Models Ethics & Safety Language System Cards Citations 1. 1 OpenAI, “Hello gpt-4o.” 2024. 2. 2 T. Stivers, N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. de Ruiter, K. E. Yoon, and...
[38] Harness engineering: leveraging Codex in an agent-first world (openai.com)
[39] Introducing deep research - OpenAI (openai.com)
[40] Introducing GPT-5.4 - OpenAI (openai.com)
Evals without reasoning. Columns: GPT-5.4 (none) / GPT-5.2 (none) / GPT-4.1. OmniDocBench (normalized edit distance): 0.109 / 0.140 / not reported. Tau2-bench Telecom: 64.3% / 57.2% / 43.6%. Evals were run with reasoning effort set to xhigh, except where specified otherwise. Benchmarks were condu...
[41] Codex | AI Coding Partner from OpenAI (openai.com)
OpenAI © 2015–2026 Manage Cookies English United States [...] “I needed to update another team’s codebase for a release. Codex handled the refactor and test generation, delivering fully tested code I handed back fast—keeping the feature on schedule without...
[42] Research - OpenAI (openai.com)
GPT: OpenAI’s GPT series models are fast, versatile, and cost-efficient AI systems designed to understand context, generate content, and reason across text, images, and more. A new class of intelligence for real work Release Ap...

