Reports · Published 29 Apr 2026 · Last edited 6 May 2026 · 12 sources

GPT-5.5 vs Claude Opus 4.7, Kimi K2.6 and DeepSeek V4: What the Benchmarks Show

[Illustration: abstract benchmark dashboard comparing GPT-5.5, Claude Opus 4.7, Kimi K2.6 and DeepSeek V4 (AI-generated editorial illustration).]

Benchmarks are easy to turn into a "who beat whom" table, but that approach would be too crude for this comparison. The closest thing to a shared comparison in the available sources covers GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; Kimi K2.6 appears in separate sources: release coverage, the model card and leaderboards ^[1]^[6]^[24]. So the practical question is not "which model is best overall" but "which model should you run on your own tasks first".

A caveat about naming: "DeepSeek V4" here means DeepSeek-V4-Pro-Max, because that is the version with benchmark and cost rows in the cited sources ^[18]^[24]. GPT-5.5 Pro should likewise be kept separate from base GPT-5.5 wherever a source reports different results for the two ^[24].

Quick takeaways by task type

Terminal coding agents: GPT-5.5 posts the strongest cited Terminal-Bench 2.0 result in the shared comparison, 82.7% ^[24].
Software repair and maintenance: Claude Opus 4.7 leads the cited rows for SWE-Bench Pro (64.3%) and SWE-Bench Verified (87.6%) ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads the shared rows for GPQA Diamond and Humanity’s Last Exam without tools ^[24].
Tool-assisted reasoning and browsing tasks: GPT-5.5 Pro leads Humanity’s Last Exam with tools (57.2%) and BrowseComp (90.1%) where the Pro version is listed separately ^[24].
Open-weight deployment: Kimi K2.6 is the most obvious open-weight candidate in these sources; it is described as a 1T-parameter MoE model with 32B active parameters and a 256K context ^[1].
Price-sensitive hosted inference: DeepSeek-V4-Pro-Max is the model to check for price/performance; LLM Stats lists it with a 1M context, 80.6% on SWE-Bench Verified and $1.74/$3.48 in its price columns ^[18].

Benchmark summary table

A dash means that no result for that model was found in the sources used, not that the model scored zero. The rows for GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max come mostly from a single shared comparison; the Kimi K2.6 figures come from separate Moonshot/Kimi sources and leaderboards ^[1]^[6]^[24].

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
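
The table above can also be read programmatically. The following is a minimal Python sketch, not part of any cited source: it copies the table values, represents each dash as `None` so that a missing result is never treated as zero, and reports the highest reported score per row. The names `SCORES`, `leaders` and `coverage` are hypothetical, and the Kimi K2.6 GPQA Diamond value is the approximate figure from the table. Because the Kimi K2.6 numbers come from different sources than the other columns, the per-row leader is only indicative.

```python
from typing import Optional

MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, copied from the table above. None stands for a dash
# (no result found in the sources), which is not the same as a score of zero.
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond":       [93.6, None, 94.2, 91.0, 90.1],  # Kimi value is approximate
    "HLE, no tools":      [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools":    [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro":      [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp":         [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas":          [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def leaders(row: list[Optional[float]]) -> tuple[list[str], Optional[float]]:
    """Return the model(s) with the highest reported score, skipping missing cells."""
    reported = [(score, model) for model, score in zip(MODELS, row) if score is not None]
    if not reported:
        return [], None
    best = max(score for score, _ in reported)
    return [model for score, model in reported if score == best], best


def coverage(model: str) -> int:
    """Count the benchmark rows in which the given model has a reported score."""
    idx = MODELS.index(model)
    return sum(1 for row in SCORES.values() if row[idx] is not None)


if __name__ == "__main__":
    for name, row in SCORES.items():
        print(name, leaders(row))
    for model in MODELS:
        print(model, coverage(model), "of", len(SCORES), "rows reported")
```

Counting coverage alongside the leader matters here: a model that appears in only a few rows cannot lead the rows where it has no reported result, so a leader count on its own would understate it.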

Which model to test first

Priority	Test first	Why
Terminal coding agents	GPT-5.5	Highest Terminal-Bench 2.0 score in the shared comparison, 82.7% ^[24].
Code repair and maintenance	Claude Opus 4.7	Leads the cited SWE-Bench Pro and SWE-Bench Verified rows among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison ^[24].
Tool-assisted reasoning and browsing	GPT-5.5 Pro	Leads Humanity’s Last Exam with tools and BrowseComp where GPT-5.5 Pro is listed separately ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight 1T-parameter MoE model; the Hugging Face card reports strong results on coding benchmarks ^[1]^[6].
Savings on hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, 80.6% on SWE-Bench Verified and lower price columns than Claude Opus 4.7 on the same leaderboard ^[18].
Long context	GPT-5.5, Claude Opus 4.7 or DeepSeek-V4-Pro-Max	The sources list a 1M context for GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; for Kimi K2.6 the figure is roughly 256K–262K ^[1]^[11]^[16]^[18]^[27].

Notes on each model

GPT-5.5

OpenAI describes GPT-5.5 as a model for complex tasks such as coding, research and data analysis ^[38]. In the shared comparison, GPT-5.5 scores 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[24]. The same table gives it 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro and 84.4% on BrowseComp ^[24].

The main caveat is that GPT-5.5 Pro exists as a separate comparison point. In the same shared table, GPT-5.5 Pro reaches 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but those figures should not be carried over automatically to base GPT-5.5 when assessing price, latency and model settings ^[24].

For procurement and budget planning there are only signals, not a final quote: BenchLM lists a 1M-token context window for GPT-5.5, and one pricing roundup cites $5 per million input tokens and $30 per million output tokens ^[27]^[36]. Check such figures against the provider's current pricing before budgeting.

Claude Opus 4.7

Claude Opus 4.7 gives the strongest cited signals on software-repair tasks in this group. LLM Stats lists 87.6% on SWE-Bench Verified, and the shared comparison gives 64.3% on SWE-Bench Pro ^[18]^[24]. The model also leads the shared rows for GPQA Diamond (94.2%), Humanity’s Last Exam without tools (46.9%) and MCP Atlas (79.1%) ^[24].

LLM Stats reports a 1M-token context window and a price of $5/$25 per million tokens for Claude Opus 4.7 ^[16]. Comparability calls for caution, though: Anthropic notes that some benchmarks used internal implementations or updated harness parameters, and that some scores are not directly comparable with public leaderboards ^[17].

Kimi K2.6

Kimi K2.6 is the strongest open-weight candidate in the cited material. Release coverage describes it as an open-weight MoE model with 1T parameters, 32B active parameters, 384 experts, native multimodality, INT4 quantization and a 256K context ^[1]. The Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0 and 89.6 on LiveCodeBench v6 ^[6].

The same release coverage gives Kimi K2.6 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp ^[1]. LLM Stats lists Kimi K2.6 with a 262K context, $0.95/$4.00 in its price columns and an Open Source label ^[11]. The limitation here is fundamental: the Kimi figures do not come from the same shared table as GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max, so small differences are better read as a prompt for your own testing than as a final verdict ^[1]^[6]^[24].

DeepSeek-V4-Pro-Max

DeepSeek-V4-Pro-Max looks more like a candidate for best price/performance than an outright benchmark leader. LLM Stats lists it with a size of 1.6T, a 1M context, 80.6% on SWE-Bench Verified and $1.74/$3.48 in its price columns ^[18]. In the shared comparison the model scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp and 73.6% on MCP Atlas ^[24].

These figures make DeepSeek-V4-Pro-Max a reasonable candidate for cost-sensitive scenarios. But the same table shows GPT-5.5, GPT-5.5 Pro or Claude Opus 4.7 leading most of the listed rows, so DeepSeek should be validated on your own tasks before it replaces a premium model in production ^[24].

Price and context: how to read the signals

Cost and context length are not always reported by the same source, or by the provider itself. Treat these rows as procurement guideposts, not as a final commercial quote.

Model	Context and price signal	Practical reading
GPT-5.5	BenchLM lists a 1M context; one pricing roundup cites $5 input and $30 output per million tokens ^[27]^[36].	Premium hosted option; be sure to check the current price.
Claude Opus 4.7	LLM Stats reports a 1M context and $5/$25 per million tokens ^[16].	Premium option for coding, reasoning and long-context tasks.
Kimi K2.6	Release coverage cites a 256K context; LLM Stats lists 262K and $0.95/$4.00 in its price columns ^[1]^[11].	Strong open-weight candidate; the hosted price may depend on the provider.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M context, a size of 1.6T, 80.6% on SWE-Bench Verified and $1.74/$3.48 in its price columns ^[18].	Strong value candidate if the quality holds up on your tasks.
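
To turn these price signals into a rough budget figure, the arithmetic is simply tokens times price per million. The sketch below is illustrative only. It assumes that each pair of price columns means USD per million input tokens and USD per million output tokens, which the sources state explicitly for GPT-5.5 but which is an assumption for the LLM Stats columns. The workload of 200M input and 40M output tokens per month is hypothetical, as are the names `PRICES_USD_PER_M` and `monthly_cost`.

```python
# (input price, output price) in USD per million tokens, taken from the table above.
# Assumption: the two LLM Stats price columns are input and output prices.
PRICES_USD_PER_M: dict[str, tuple[float, float]] = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD for a given token volume at the listed prices."""
    price_in, price_out = PRICES_USD_PER_M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200M input tokens and 40M output tokens per month.
    for model in PRICES_USD_PER_M:
        cost = monthly_cost(model, 200_000_000, 40_000_000)
        print(f"{model}: ${cost:,.2f}")
```

On that hypothetical workload the listed prices work out to about $2,200 for GPT-5.5, $2,000 for Claude Opus 4.7, $487 for DeepSeek-V4-Pro-Max and $350 for Kimi K2.6. The ratio between input and output volume shifts the ranking among the premium models, so it is worth plugging in your own token mix and the provider's current prices.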

Why the rankings diverge

Different rows measure different skills. GPQA Diamond and Humanity’s Last Exam emphasize hard reasoning, Terminal-Bench 2.0 and the SWE-Bench variants emphasize programming and agentic work with code, and BrowseComp in the shared comparison reflects browsing-style retrieval tasks ^[24]. A model can therefore lead one row and trail noticeably in another: the task, the tool access and the evaluation harness all change.

Even the same benchmark can differ by implementation. LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, whereas LMCouncil reports 83.5% ± 1.7 in its own setup ^[18]^[30]. Anthropic also writes that some results used internal implementations or updated harness parameters, which limits direct comparison with public leaderboards ^[17].

That is why a gap of one or two percentage points should not, on its own, decide a production rollout. Public benchmarks are good for building a shortlist; the final decision is better made on your own eval set.
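
To see why a gap of one or two points is weak evidence, one can estimate the sampling noise on a single benchmark score. The sketch below is illustrative only: it assumes a benchmark of 500 independently scored pass/fail tasks (the size is an assumption here, not a figure from the cited sources) and uses a standard Wilson score interval. It says nothing about how any leaderboard computes its own error bars, and it ignores harness differences entirely. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed example: 438 of 500 tasks solved, i.e. a reported score of 87.6%.
    low, high = wilson_interval(438, 500)
    print(f"87.6% on 500 tasks: {low:.1%} to {high:.1%}")
```

Under those assumptions a score of 87.6% carries an interval of roughly 84% to 90% from task sampling alone, before any difference in tools, prompts or harness is taken into account. A one or two point lead sits well inside that range.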

How to test the finalists

Before choosing a model, run the two or three best candidates on tasks that resemble your real scenarios.

Use real prompts, files and repositories. Benchmarks rarely reflect the specifics of your code, documents, policies and user behavior.
Replicate the tool environment. A coding agent's results can change depending on whether the model has a terminal, browsing, retrieval, repository context or internal APIs.
Compare price and latency at identical settings. Pro modes and higher reasoning effort can change quality, token consumption and response time.
Review failures by hand. For coding tasks, look at tests, the diff, maintainability, security regressions and hallucinated dependencies.
Include at least one cheaper contender. If open weights or inference cost matter, Kimi K2.6 and DeepSeek-V4-Pro-Max deserve a place in the test set ^[1]^[18].
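
The checklist above can be sketched as a very small evaluation loop. This is a minimal illustration under stated assumptions, not a real harness: `Task`, `Result` and `run_eval` are hypothetical names, and `call_model` is a placeholder for whatever client function you use to send a prompt to a given model and get text back. No real provider API is shown. Each task carries its own `check` function so that pass/fail is decided by your criteria instead of a public benchmark's.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the answer is acceptable


@dataclass
class Result:
    model: str
    passed: int
    total: int
    seconds: float


def run_eval(model: str, call_model: Callable[[str], str], tasks: list[Task]) -> Result:
    """Run every task through one model, counting passes and total wall-clock time."""
    passed = 0
    start = time.perf_counter()
    for task in tasks:
        try:
            answer = call_model(task.prompt)
        except Exception:
            continue  # a failed call counts as a failed task
        if task.check(answer):
            passed += 1
    return Result(model, passed, len(tasks), time.perf_counter() - start)


if __name__ == "__main__":
    # Stub model for demonstration only: it upper-cases the prompt.
    tasks = [
        Task("ping", lambda answer: answer == "PING"),
        Task("hello", lambda answer: "world" in answer),
    ]
    print(run_eval("stub-model", lambda prompt: prompt.upper(), tasks))
```

Running the same task list through each finalist with the same tool access and settings keeps the comparison fair. The pass counts tell you which failures to review by hand, and the timings give a first latency signal to set against price.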

Bottom line

If you need a shortlist of premium models, start with a side-by-side test of GPT-5.5 and Claude Opus 4.7: GPT-5.5 gives the strongest cited Terminal-Bench 2.0 result, and Claude Opus 4.7 gives the strongest cited SWE-Bench Pro and SWE-Bench Verified results ^[18]^[24]. If you need open weights, Kimi K2.6 looks like the first candidate ^[1]^[6]. If cost is the main constraint, include DeepSeek-V4-Pro-Max, but test it on your own tasks before treating it as a full replacement for the premium options ^[18]^[24].

Key takeaways

For terminal coding agents, GPT-5.5 looks strongest in the shared table with 82.7% on Terminal-Bench 2.0; for code repair tasks, Claude Opus 4.7 leads with 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified ^[18]^[24].
GPT-5.5 Pro should not be conflated with base GPT-5.5: where the Pro version is listed separately, it leads BrowseComp with 90.1% and Humanity’s Last Exam with tools with 57.2% ^[24].
Kimi K2.6 is the main open-weight candidate in these sources, while DeepSeek-V4-Pro-Max is best viewed as a cheaper hosted option that must be tested on your own tasks ^[1]^[18].

Research conversation

You · 25 Apr 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI · 25 Apr 2026 · 40 sources

Cited answer

I found public benchmark data for all four, but not one fully independent, apples-to-apples leaderboard covering GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 under the same harness. The practical read is: GPT-5.5 looks strongest for broad agentic coding/computer-use tasks, Claude Opus 4.7 is extremely strong on coding and hard reasoning, Kimi K2.6 is the strongest open-weight option in this group, and DeepSeek V4 looks like the best price/performance play if its reported results hold up across independent testing ^[4]^[5]^[6]^[13].

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

If you want the safest high-end general choice, start with GPT-5.5. If your priority is top-tier coding reliability and long-context enterprise work, test Claude Opus 4.7. If you need open weights and strong frontier-adjacent performance, Kimi K2.6 is the most compelling open option. If cost is the constraint and you can tolerate some validation risk, DeepSeek V4 is the one to benchmark first.

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
LLM Stats Logo Make AI phone calls with one API call Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

Популярное в «Открыть»

ОтчетыОпубликовано29 апр. 2026Last edited 6 мая 202612 источники

GPT-5.5 против Claude Opus 4.7, Kimi K2.6 и DeepSeek V4: что показывают бенчмарки

Искать и проверять факты с Studio Global AI Смотреть больше в «Открыть»

17K0

Короткий вывод по типу задачи

Терминальные coding-агенты: GPT-5.5 показывает самый сильный цитируемый результат Terminal-Bench 2.0 в общем сравнении — 82,7% ^[24].
Ремонт и сопровождение ПО: Claude Opus 4.7 лидирует в приведённых строках SWE-Bench Pro с 64,3% и SWE-Bench Verified с 87,6% ^[18]^[24].
Сложное рассуждение без инструментов: Claude Opus 4.7 лидирует в общих строках GPQA Diamond и Humanity’s Last Exam без инструментов ^[24].
Рассуждение с инструментами и browsing-задачи: GPT-5.5 Pro лидирует в Humanity’s Last Exam с инструментами — 57,2% — и BrowseComp — 90,1%, когда эта Pro-версия указана отдельно ^[24].
Open-weight развёртывание: Kimi K2.6 — самый очевидный кандидат среди моделей с открытыми весами в этих источниках: её описывают как MoE-модель на 1 трлн параметров с 32 млрд активных параметров и контекстом 256K ^[1].
Чувствительный к цене облачный инференс: DeepSeek-V4-Pro-Max — модель, которую стоит проверить на соотношение цены и качества: LLM Stats указывает для неё контекст 1M, 80,6% на SWE-Bench Verified и $1,74/$3,48 в ценовых колонках ^[18].

Сводная таблица бенчмарков

Бенчмарк	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93,6% ^[24]	—	94,2% ^[24]	≈91% ^[28]	90,1% ^[24]
Humanity’s Last Exam, без инструментов	41,4% ^[24]	43,1% ^[24]	46,9% ^[24]	—	37,7% ^[24]
Humanity’s Last Exam, с инструментами	52,2% ^[24]	57,2% ^[24]	54,7% ^[24]	54,0% ^[1]	48,2% ^[24]
Terminal-Bench 2.0	82,7% ^[24]	—	69,4% ^[24]	66,7% ^[6]	67,9% ^[24]
SWE-Bench Pro	58,6% ^[24]	—	64,3% ^[24]	58,6% ^[6]	55,4% ^[24]
BrowseComp	84,4% ^[24]	90,1% ^[24]	79,3% ^[24]	83,2% ^[1]	83,4% ^[24]
MCP Atlas / MCPAtlas Public	75,3% ^[24]	—	79,1% ^[24]	—	73,6% ^[24]
SWE-Bench Verified	—	—	87,6% ^[18]	80,2% ^[6]	80,6% ^[18]

С какой модели начинать тесты

Приоритет	Сначала тестировать	Почему
Терминальные coding-агенты	GPT-5.5	Самый высокий Terminal-Bench 2.0 в общем сравнении — 82,7% ^[24].
Ремонт и сопровождение кода	Claude Opus 4.7	Лидирует в приведённых строках SWE-Bench Pro и SWE-Bench Verified среди этих моделей ^[18]^[24].
Сложное рассуждение без инструментов	Claude Opus 4.7	Лидирует в GPQA Diamond и Humanity’s Last Exam без инструментов в общем сравнении ^[24].
Tool-assisted reasoning и browsing	GPT-5.5 Pro	Лидирует в Humanity’s Last Exam с инструментами и BrowseComp там, где GPT-5.5 Pro указана отдельно ^[24].
Open-weight развёртывание	Kimi K2.6	Описана как open-weight MoE-модель на 1 трлн параметров; карточка на Hugging Face приводит сильные результаты по coding-бенчмаркам ^[1]^[6].
Экономия на hosted inference	DeepSeek-V4-Pro-Max	LLM Stats указывает 1M контекста, 80,6% на SWE-Bench Verified и более низкие ценовые колонки, чем у Claude Opus 4.7 на том же лидерборде ^[18].
Длинный контекст	GPT-5.5, Claude Opus 4.7 или DeepSeek-V4-Pro-Max	Источники указывают 1M контекста для GPT-5.5, Claude Opus 4.7 и DeepSeek-V4-Pro-Max; для Kimi K2.6 фигурирует примерно 256K–262K ^[1]^[11]^[16]^[18]^[27].

Заметки по моделям

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

Цена и контекст: как читать сигналы

Модель	Сигнал по контексту и цене	Практическое чтение
GPT-5.5	BenchLM указывает 1M контекста; один ценовой обзор приводит $5 за вход и $30 за выход за миллион токенов ^[27]^[36].	Премиальный hosted-вариант; обязательно сверять актуальную цену.
Claude Opus 4.7	LLM Stats сообщает 1M контекста и $5/$25 за миллион токенов ^[16].	Премиальный вариант для coding, reasoning и long-context задач.
Kimi K2.6	Релизное освещение говорит о 256K контекста; LLM Stats указывает 262K и $0,95/$4,00 в ценовых колонках ^[1]^[11].	Сильный open-weight кандидат; hosted-цена может зависеть от провайдера.
DeepSeek-V4-Pro-Max	LLM Stats указывает 1M контекста, размер 1,6T, 80,6% на SWE-Bench Verified и $1,74/$3,48 в ценовых колонках ^[18].	Сильный value-кандидат, если качество подтвердится на ваших задачах.

Почему рейтинги расходятся

Как проверять финалистов

Перед выбором модели прогоните две-три лучшие кандидатуры на задачах, похожих на ваши реальные сценарии.

Используйте реальные промпты, файлы и репозитории. Бенчмарки редко отражают особенности вашего кода, документов, политик и поведения пользователей.
Повторите инструментальную среду. Результаты coding-агента могут меняться, если у модели есть терминал, browsing, retrieval, контекст репозитория или внутренние API.
Сравнивайте цену и задержку при одинаковых настройках. Pro-режимы и повышенный reasoning effort могут менять качество, расход токенов и время ответа.
Разбирайте ошибки вручную. Для coding-задач смотрите тесты, diff, поддерживаемость, регрессии безопасности и выдуманные зависимости.
Включите хотя бы одного более дешёвого претендента. Если важны открытые веса или стоимость инференса, Kimi K2.6 и DeepSeek-V4-Pro-Max заслуживают места в тестовом наборе ^[1]^[18].

Итог

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Искать и проверять факты с Studio Global AI

Ключевые выводы

Для терминальных coding агентов в общей таблице сильнее выглядит GPT 5.5 с 82,7% на Terminal Bench 2.0; для задач ремонта кода лидирует Claude Opus 4.7 — 64,3% на SWE Bench Pro и 87,6% на SWE Bench Verified [18][24].
GPT 5.5 Pro нельзя смешивать с базовой GPT 5.5: там, где Pro версия указана отдельно, она лидирует в BrowseComp с 90,1% и Humanity’s Last Exam с инструментами с 57,2% [24].
Kimi K2.6 — главный open weight кандидат в этих источниках, а DeepSeek V4 Pro Max стоит рассматривать как более дешёвый hosted вариант, который обязательно нужно проверить на своих задачах [1][18].


Research conversation

You, 25 Apr 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI, 25 Apr 2026, 40 sources

Cited answer

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Moonshot AI Kimi K2.6 NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9 GPT-5 mini 0.22 10 o3 0.16 GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago


Quick verdict by task type

Terminal coding agents: GPT-5.5 posts the strongest cited Terminal-Bench 2.0 result in the shared comparison, 82.7% ^[24].
Software repair and maintenance: Claude Opus 4.7 leads the cited SWE-Bench Pro rows with 64.3% and SWE-Bench Verified with 87.6% ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 leads the shared GPQA Diamond and Humanity’s Last Exam (no tools) rows ^[24].
Tool-assisted reasoning and browsing tasks: GPT-5.5 Pro leads Humanity’s Last Exam with tools (57.2%) and BrowseComp (90.1%) where the Pro version is listed separately ^[24].
Open-weight deployment: Kimi K2.6 is the most obvious open-weight candidate in these sources; it is described as a 1T-parameter MoE model with 32B active parameters and 256K context ^[1].
Price-sensitive cloud inference: DeepSeek-V4-Pro-Max is the model to test for price-performance: LLM Stats lists 1M context, 80.6% on SWE-Bench Verified, and $1.74/$3.48 in its price columns ^[18].

Benchmark summary table

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
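
To see how wide each lead in the table actually is, the rows can be reduced to a leader and its margin over the runner-up. The sketch below simply re-enters the table values by hand; the names `MODELS`, `TABLE`, and `leader_and_margin` are illustrative, empty cells are `None`, and the approximate Kimi K2.6 GPQA Diamond entry is entered as 91.0. It says nothing about whether the underlying runs are comparable.

```python
from typing import Dict, Optional, Tuple

MODELS = ("GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max")

# Scores in percent, in MODELS order; None marks an empty table cell.
TABLE: Dict[str, Tuple[Optional[float], ...]] = {
    "GPQA Diamond": (93.6, None, 94.2, 91.0, 90.1),  # 91.0 is approximate in the table
    "HLE, no tools": (41.4, 43.1, 46.9, None, 37.7),
    "HLE, with tools": (52.2, 57.2, 54.7, 54.0, 48.2),
    "Terminal-Bench 2.0": (82.7, None, 69.4, 66.7, 67.9),
    "SWE-Bench Pro": (58.6, None, 64.3, 58.6, 55.4),
    "BrowseComp": (84.4, 90.1, 79.3, 83.2, 83.4),
    "MCP Atlas": (75.3, None, 79.1, None, 73.6),
    "SWE-Bench Verified": (None, None, 87.6, 80.2, 80.6),
}


def leader_and_margin(row: Tuple[Optional[float], ...]) -> Tuple[str, float]:
    """Return the top-scoring model and its lead over the runner-up, in points."""
    scored = sorted(
        ((score, model) for score, model in zip(row, MODELS) if score is not None),
        reverse=True,
    )
    (best, model), (second, _) = scored[0], scored[1]
    return model, round(best - second, 1)


summary = {bench: leader_and_margin(row) for bench, row in TABLE.items()}
for bench, (model, margin) in summary.items():
    print(f"{bench}: {model} (+{margin} pts)")
```

A lead of well under one point, as on GPQA Diamond, is a much weaker signal than the double-digit gap on Terminal-Bench 2.0, especially when the scores come from different sources.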

Which model to test first

Priority	Test first	Why
Terminal coding agents	GPT-5.5	Highest Terminal-Bench 2.0 score in the shared comparison, 82.7% ^[24].
Code repair and maintenance	Claude Opus 4.7	Leads the cited SWE-Bench Pro and SWE-Bench Verified rows among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	Leads GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison ^[24].
Tool-assisted reasoning and browsing	GPT-5.5 Pro	Leads Humanity’s Last Exam with tools and BrowseComp where GPT-5.5 Pro is listed separately ^[24].
Open-weight deployment	Kimi K2.6	Described as an open-weight 1T-parameter MoE model; its Hugging Face model card reports strong coding-benchmark results ^[1]^[6].
Savings on hosted inference	DeepSeek-V4-Pro-Max	LLM Stats lists 1M context, 80.6% on SWE-Bench Verified, and lower price columns than Claude Opus 4.7 on the same leaderboard ^[18].
Long context	GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max	Sources list 1M context for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max; Kimi K2.6 is listed at roughly 256K–262K ^[1]^[11]^[16]^[18]^[27].

Model notes

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

Price and context: how to read the signals

Model	Context and price signal	Practical reading
GPT-5.5	BenchLM lists 1M context; one pricing roundup cites $5 input and $30 output per million tokens ^[27]^[36].	Premium hosted option; always check the current price.
Claude Opus 4.7	LLM Stats reports 1M context and $5/$25 per million tokens ^[16].	Premium option for coding, reasoning, and long-context tasks.
Kimi K2.6	Release coverage cites 256K context; LLM Stats lists 262K and $0.95/$4.00 in its price columns ^[1]^[11].	Strong open-weight candidate; hosted pricing may vary by provider.
DeepSeek-V4-Pro-Max	LLM Stats lists 1M context, a size of 1.6T, 80.6% on SWE-Bench Verified, and $1.74/$3.48 in its price columns ^[18].	Strong value candidate, provided the quality holds up on your tasks.
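
As a rough illustration of how the price columns translate into a bill, the sketch below treats each pair as USD per million input and output tokens. That reading is an assumption for the LLM Stats columns, and the workload size is hypothetical; actual hosted prices should be checked with the provider.

```python
from typing import Dict, Tuple

# (input, output) price in USD per million tokens, taken from the table above.
# Assumption: the LLM Stats price columns use the same per-million convention.
PRICES: Dict[str, Tuple[float, float]] = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def run_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a workload from per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical eval run: 200 tasks, 40k input and 5k output tokens per task.
TASKS, IN_PER_TASK, OUT_PER_TASK = 200, 40_000, 5_000
costs = {
    model: run_cost_usd(model, TASKS * IN_PER_TASK, TASKS * OUT_PER_TASK)
    for model in PRICES
}
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${cost:.2f}")
```

Under these assumptions the same run costs about $70 on GPT-5.5, $65 on Claude Opus 4.7, $17.40 on DeepSeek-V4-Pro-Max, and $11.60 on Kimi K2.6. Token usage often differs between models and reasoning settings, so measured token counts should replace the fixed per-task figures.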
