Reports | Published 28 Apr 2026 | Last edited 6 May 2026 | 12 sources

GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6: Which AI Model Should You Choose?

GPT-5.5 has the strongest aggregate signal: Artificial Analysis places GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 at 57.^[2] Claude Opus 4.7 wins several shared reasoning and software benchmarks; DeepSeek V4 offers the clearest API cost advantage; Kimi K2.6 is promising, but ...


[Image: editorial illustration comparing the GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 AI models. Caption: A practical comparison of leading AI models depends on the benchmark, variant, reasoning setting, and API price.]

Comparing frontier models on a single test is a recipe for drawing the wrong conclusions. For a team that is about to pay for API access or build workflows, the useful question is not which model is best in the abstract, but which model wins on its type of task, with which variant, and at what price. With the available sources, the map looks like this: GPT-5.5 offers the strongest aggregate signal, Claude Opus 4.7 wins several hard reasoning and software engineering rows, DeepSeek V4 has the clearest API cost advantage, and Kimi K2.6 looks solid for code and agents, but with less direct evidence against GPT-5.5 and Opus 4.7.^[2]^[16]^[15]^[18]^[19]

Quick verdict

If what matters most to you is…	Best-supported choice	Why
Aggregate intelligence signal	GPT-5.5	Artificial Analysis lists GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 Adaptive Reasoning Max Effort at 57.^[2]
Hard reasoning and software engineering	Claude Opus 4.7, with GPT-5.5 close behind	In the table shared by VentureBeat, Claude leads GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas; the base GPT-5.5 model leads Terminal-Bench 2.0 and BrowseComp, and GPT-5.5 Pro leads HLE with tools and BrowseComp where that variant is shown.^[16]
Lowest API cost among the flagship models in Mashable's comparison	DeepSeek V4	Mashable lists DeepSeek V4 at $1.74 per 1 million input tokens and $3.48 per 1 million output tokens, below GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25.^[15]
Published coding and competitive programming metrics	DeepSeek V4 Pro	Together AI lists DeepSeek V4 Pro at 93.5% on LiveCodeBench, Codeforces 3206, 80.6% on SWE-Bench Verified, and 76.2% on SWE-Bench Multilingual.^[25]
Evaluating Kimi K2.6	Promising, but not settled	Kimi K2.6 has useful numbers on code and agentic tasks, but much of the available evidence compares it with GPT-5.4 and Claude Opus 4.6, not with GPT-5.5 and Claude Opus 4.7.^[18]^[19]

First, the aggregate ranking: advantage GPT-5.5

The clearest aggregate signal in the available sources comes from Artificial Analysis. In its index, GPT-5.5 xhigh ranks first with 60 points and GPT-5.5 high second with 59; Claude Opus 4.7 Adaptive Reasoning Max Effort scores 57.^[2]

Kimi K2.6 sits below that GPT-5.5/Claude block in the available composite snippets. OpenRouter lists Kimi K2.6 at 53.9 in Intelligence, 47.1 in Coding, and 66.0 in Agentic; LLMBase also shows Kimi at 53.9 in Intelligence and 47.1 in Coding.^[3]^[1] In that same LLMBase comparison, DeepSeek V4 Flash High scores 44.9 in Intelligence and 39.8 in Coding, though this is the Flash variant, not DeepSeek V4 Pro or Pro-Max.^[1]

The caveat matters: the aggregate ranking gives a clear signal for GPT-5.5 over Claude Opus 4.7, but it does not offer a single complete table that puts GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro-Max, and Kimi K2.6 in the same comparison.^[2]

Shared benchmarks: a split decision

The VentureBeat table is the most useful reference for comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro (where shown), and Claude Opus 4.7 on the same tests.^[16]

Benchmark	DeepSeek-V4-Pro-Max	GPT-5.5	GPT-5.5 Pro, where shown	Claude Opus 4.7	Best result in that source
GPQA Diamond	90.1%	93.6%	n/a	94.2%	Claude Opus 4.7^[16]
Humanity’s Last Exam, no tools	37.7%	41.4%	43.1%	46.9%	Claude Opus 4.7^[16]
Humanity’s Last Exam, with tools	48.2%	52.2%	57.2%	54.7%	GPT-5.5 Pro^[16]
Terminal-Bench 2.0	67.9%	82.7%	n/a	69.4%	GPT-5.5^[16]
SWE-Bench Pro / SWE Pro	55.4%	58.6%	n/a	64.3%	Claude Opus 4.7^[16]
BrowseComp	83.4%	84.4%	90.1%	79.3%	GPT-5.5 Pro^[16]
MCP Atlas / MCPAtlas Public	73.6%	75.3%	n/a	79.1%	Claude Opus 4.7^[16]

The right reading is not that one model sweeps the field; the wins are split. Claude Opus 4.7 has the strongest case on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas.^[16] GPT-5.5, by contrast, posts the best base-model results on Terminal-Bench 2.0 and BrowseComp, while GPT-5.5 Pro comes out on top on HLE with tools and BrowseComp where VentureBeat includes that variant.^[16]

DeepSeek-V4-Pro-Max stays competitive on several rows but never tops one: in every row of that shared table, GPT-5.5 or Claude Opus 4.7 posts a higher score. It comes closest on BrowseComp, where its 83.4% trails GPT-5.5's 84.4% by one point and beats Claude Opus 4.7's 79.3%.^[16]
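The split verdict can be reproduced with a small tally. The following is a minimal sketch, not something VentureBeat publishes: it hard-codes the scores from the table above, marks cells the source leaves empty as `None`, and counts how many rows each model wins. The `ROWS` layout and the `row_winner` and `tally_wins` helpers are illustrative names.

```python
from collections import Counter

# Scores (percent) copied from the VentureBeat table above; None marks a cell
# the source leaves empty. The column order is an assumption of this sketch.
MODELS = ["DeepSeek-V4-Pro-Max", "GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7"]
ROWS = {
    "GPQA Diamond": [90.1, 93.6, None, 94.2],
    "HLE, no tools": [37.7, 41.4, 43.1, 46.9],
    "HLE, with tools": [48.2, 52.2, 57.2, 54.7],
    "Terminal-Bench 2.0": [67.9, 82.7, None, 69.4],
    "SWE-Bench Pro": [55.4, 58.6, None, 64.3],
    "BrowseComp": [83.4, 84.4, 90.1, 79.3],
    "MCP Atlas": [73.6, 75.3, None, 79.1],
}


def row_winner(scores, exclude=()):
    """Return the model with the highest available score in one row."""
    candidates = [
        (score, model)
        for model, score in zip(MODELS, scores)
        if score is not None and model not in exclude
    ]
    return max(candidates)[1]


def tally_wins(exclude=()):
    """Count row wins per model, optionally leaving some variants out."""
    return Counter(row_winner(scores, exclude) for scores in ROWS.values())


all_variants = tally_wins()
base_only = tally_wins(exclude=("GPT-5.5 Pro",))
print("All variants:", dict(all_variants))
print("Base models only:", dict(base_only))
```

With all variants included, the tally gives Claude Opus 4.7 four rows, GPT-5.5 Pro two, and base GPT-5.5 one. Leaving out the Pro variant moves HLE with tools to Claude Opus 4.7 and BrowseComp to base GPT-5.5, which shows how much the verdict depends on which variants are counted.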

Programming: a single coding score is not enough

For software engineering tasks over repositories, Claude Opus 4.7 has the strongest shared result on SWE-Bench Pro in the VentureBeat table: 64.3%, against 58.6% for GPT-5.5 and 55.4% for DeepSeek-V4-Pro-Max.^[16]

DeepSeek V4 Pro, however, has the most detailed programming profile in the available listings. Together AI lists DeepSeek V4 Pro at 93.5% on LiveCodeBench, Codeforces 3206, 80.6% on SWE-Bench Verified, and 76.2% on SWE-Bench Multilingual.^[25] NVIDIA's model card also breaks down the DeepSeek V4 Flash and V4 Pro variants on tests such as GPQA Diamond, HLE, LiveCodeBench, and Codeforces, with V4-Pro Max at 93.5 on LiveCodeBench and 3206 on Codeforces.^[31]

Kimi K2.6 also shows relevant programming signals, although the most Kimi-focused tables compare it mainly with previous-generation competitors. Lorka lists Kimi K2.6 at 58.6% on SWE-Bench Pro, 54.0% on HLE-Full with tools, 90.5% on GPQA-Diamond, and 79.4% on MMMU-Pro in a table against GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.^[18] Verdent lists Kimi K2.6 at 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on HLE with tools, and 89.6% on LiveCodeBench v6, and also notes that Opus 4.7 leads SWE-Bench Verified at 87.6%.^[19]

The practical conclusion: Kimi K2.6 deserves internal testing if the use case combines code, tools, and agents. What these sources do not support is the claim that it is the overall winner against GPT-5.5 or Claude Opus 4.7.^[18]^[19]

API pricing: DeepSeek changes the conversation

If cost per token weighs heavily in the decision, DeepSeek V4 has the strongest case among the available sources. Mashable lists DeepSeek V4 at $1.74 per 1 million input tokens and $3.48 per 1 million output tokens, against GPT-5.5 at $5 input and $30 output, and Claude Opus 4.7 at $5 input and $25 output, all per 1 million tokens.^[15]

Model or variant	Listed input price	Listed output price	Note
GPT-5.5	$5 per 1 million tokens	$30 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
Claude Opus 4.7	$5 per 1 million tokens	$25 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
DeepSeek V4	$1.74 per 1 million tokens	$3.48 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
DeepSeek V4 Flash	$0.14 per 1 million tokens	$0.28 per 1 million tokens	LLMBase lists a blended price of $0.18 in its DeepSeek V4 Flash High vs Kimi K2.6 comparison.^[1]
Kimi K2.6	$0.95 per 1 million tokens	$4.00 per 1 million tokens	LLMBase lists a blended price of $1.71 in the same comparison.^[1]

Do not assume that every endpoint has the same context window or the same limits. Mashable lists 1 million token context windows for DeepSeek V4, GPT-5.5, and Claude Opus 4.7 in its pricing comparison, while an OpenRouter listing for DeepSeek V4 Pro shows a 256K maximum for input plus output tokens and a 66K maximum for output tokens.^[15]^[3] Before going to production, confirm the provider, the exact variant, and the reasoning mode.
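To turn the listed prices into a budget figure, a team can multiply them by its expected token volume. The sketch below assumes the Mashable and LLMBase list prices from the table above and a hypothetical monthly workload (the token counts are made-up numbers). Real bills also depend on caching, batching, and provider-specific rates that these sources do not cover.

```python
# List prices in USD per 1 million tokens (input, output), from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Kimi K2.6": (0.95, 4.00),
}


def monthly_cost(model, input_tokens, output_tokens):
    """Estimate spend in USD for a token volume at the listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 200M input tokens and 50M output tokens per month.
workload = {"input_tokens": 200_000_000, "output_tokens": 50_000_000}
costs = {model: monthly_cost(model, **workload) for model in PRICES}

# Print the cheapest option first.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:<18} ${cost:,.2f}")
```

At this assumed volume the listed prices work out to about $2,500 for GPT-5.5, $2,250 for Claude Opus 4.7, $522 for DeepSeek V4, $390 for Kimi K2.6, and $42 for DeepSeek V4 Flash. Output-heavy workloads widen the gap further, because the output price is where GPT-5.5 and Claude Opus 4.7 differ most from DeepSeek V4.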

What to choose by use case

GPT-5.5: the best default if aggregate rankings drive the decision

GPT-5.5 is the safest option if the decision follows the available aggregate ranking. Artificial Analysis places GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, the top two spots on the Intelligence Index in the snippet provided.^[2]

It also stands out on two shared VentureBeat rows: 82.7% on Terminal-Bench 2.0 and 84.4% on BrowseComp for base GPT-5.5, with GPT-5.5 Pro at 90.1% on BrowseComp where that variant is shown.^[16]

Claude Opus 4.7: strong on hard reasoning and software

Claude Opus 4.7 sits close to GPT-5.5 in the aggregate ranking, at 57 on the Artificial Analysis Intelligence Index in Adaptive Reasoning Max Effort mode.^[2] In the shared VentureBeat table, it leads GPT-5.5 and DeepSeek-V4-Pro-Max on GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas.^[16]

Anthropic's launch material also reports internal research-agent results: a tie for the top overall score of 0.715 across six modules, and 0.813 on General Finance against 0.767 for Opus 4.6.^[17] Because these are internal metrics, they serve as context, not as a substitute for independent evaluation.^[17]

DeepSeek V4: the best value bet if the variant fits

DeepSeek V4's most obvious advantage is price. In Mashable's comparison, its input and output prices sit far below those of GPT-5.5 and Claude Opus 4.7: $1.74 and $3.48 per 1 million tokens, against $5/$30 for GPT-5.5 and $5/$25 for Claude Opus 4.7.^[15]

DeepSeek V4 Pro also posts strong coding metrics, including 93.5% on LiveCodeBench, Codeforces 3206, 80.6% on SWE-Bench Verified, and 76.2% on SWE-Bench Multilingual in the Together AI listing.^[25] The caveat is that DeepSeek-V4-Pro-Max trails the best GPT-5.5 or Claude Opus 4.7 result on every shared VentureBeat row, even where it comes close on BrowseComp.^[16]

Kimi K2.6: a serious candidate for code and agents, but less tested in this race

Kimi K2.6 is harder to place in a direct four-model ranking because the most Kimi-focused tables compare it mainly with GPT-5.4 and Claude Opus 4.6, not with GPT-5.5 and Claude Opus 4.7.^[18]^[19] Even so, the signals are not minor: OpenRouter lists Kimi K2.6 at 53.9 in Intelligence, 47.1 in Coding, and 66.0 in Agentic, while Verdent lists 80.2% on SWE-Bench Verified and 89.6% on LiveCodeBench v6.^[3]^[19]

The conclusion is not that Kimi K2.6 is ruled out, but that the direct evidence is thinner. If its price, deployment path, or agentic behavior fits your stack, it deserves evaluation; the available sources are not enough to name it the overall winner against GPT-5.5 or Claude Opus 4.7.^[18]^[19]

Caveats before deciding

Variants matter. DeepSeek V4 appears in the sources as V4, V4 Flash, V4 Pro, and DeepSeek-V4-Pro-Max; prices, limits, and results change with the variant and the reasoning mode.^[1]^[15]^[25]^[31]
Kimi comparisons are less direct. The strongest tables for Kimi K2.6 pit it against GPT-5.4 and Claude Opus 4.6, not GPT-5.5 and Claude Opus 4.7.^[18]^[19]
Humanity’s Last Exam without tools shows inconsistencies between sources. LLM Stats and VentureBeat report GPT-5.5 at 41.4% and Claude Opus 4.7 at 46.9%, while the Mashable snippet on GPT versus Claude reports GPT-5.5 at 40.6% and Opus 4.7 at 31.2%.^[7]^[16]^[9]
Internal benchmarks are not independent rankings. Anthropic's launch of Opus 4.7 includes improvements on an internal research-agent benchmark, but those figures should be read differently from a public cross-vendor comparison.^[17]
Price and context depend on the provider. The same model family can appear with different context windows, output limits, and configurations depending on the endpoint.^[3]^[15]
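One way to catch disagreements like the Humanity's Last Exam case is to compare every reported value for the same model and benchmark across sources and flag large spreads. This is a minimal sketch: the figures are the ones quoted above, while the `REPORTS` layout, the `flag_discrepancies` helper, and the 1-point tolerance are assumptions chosen for illustration.

```python
# (benchmark, model) -> {source: reported score in percent}, as quoted above.
REPORTS = {
    ("HLE, no tools", "GPT-5.5"): {
        "LLM Stats": 41.4,
        "VentureBeat": 41.4,
        "Mashable": 40.6,
    },
    ("HLE, no tools", "Claude Opus 4.7"): {
        "LLM Stats": 46.9,
        "VentureBeat": 46.9,
        "Mashable": 31.2,
    },
}


def flag_discrepancies(reports, tolerance=1.0):
    """Return entries whose max-min spread across sources exceeds tolerance."""
    flagged = {}
    for key, by_source in reports.items():
        spread = max(by_source.values()) - min(by_source.values())
        if spread > tolerance:
            flagged[key] = spread
    return flagged


flagged = flag_discrepancies(REPORTS)
for (benchmark, model), spread in flagged.items():
    print(f"{benchmark} / {model}: sources differ by {spread:.1f} points")
```

With a 1-point tolerance only the Claude Opus 4.7 entry is flagged, since its sources differ by more than 15 points, while the GPT-5.5 figures differ by less than a point. A flagged entry is a prompt to check the original pages for differences in variant, tool use, or reporting errors, not a verdict on which source is right.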

Conclusion

Choose GPT-5.5 if your main criterion is the best available aggregate intelligence signal.^[2] Choose Claude Opus 4.7 if your work resembles the hard reasoning and software engineering rows where it leads, such as GPQA Diamond, HLE without tools, SWE-Bench Pro, and MCP Atlas.^[16] Choose DeepSeek V4 if cost-performance is central and you can validate the specific variant you will use; its listed API price is far lower than that of GPT-5.5 and Claude Opus 4.7, and DeepSeek V4 Pro has strong coding metrics.^[15]^[25] Treat Kimi K2.6 as a credible candidate for coding and agents, but not as a proven overall winner against GPT-5.5 or Claude Opus 4.7 on the direct evidence available.^[18]^[19]


Key takeaways

GPT-5.5 has the strongest aggregate signal: Artificial Analysis places GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 at 57.^[2]
Claude Opus 4.7 wins several shared reasoning and software benchmarks; DeepSeek V4 offers the clearest API cost advantage; Kimi K2.6 is promising, but with less direct evidence against GPT-5.5 and Opus 4....
On programming, Claude leads the shared SWE-Bench Pro row from VentureBeat at 64.3%, while DeepSeek V4 Pro shows a very detailed coding profile with 93.5% on LiveCodeBench and Codeforces 3206.^[16]^[25]


Research conversation

You | 26 Apr 2026

Research question

Research benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 and compare as comprehensively as possible

Studio Global AI | 26 Apr 2026 | 32 sources

Cited answer

Using only the supplied evidence, the strongest benchmark signal is that GPT-5.5 leads the available aggregate Intelligence Index, with GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, while Claude Opus 4.7 Adaptive Reasoning Max Effort scores 57.^[5] DeepSeek V4 Pro has the richest disclosed task-level benchmark data in the evidence, including 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.^[6] For Kimi K2.6, the supplied evidence contains positioning and pricing but no comparable numeric benchmark scores, so a full apples-to-apples benchmark ranking is not supported.^[1]^[3]

Key findings

GPT-5.5 is the best-supported leader on the only cross-model aggregate benchmark available here: GPT-5.5 xhigh ranks first with an Intelligence Index of 60, and GPT-5.5 high ranks second with 59.^[5]
Claude Opus 4.7 is close behind on that same aggregate measure: Claude Opus 4.7 Adaptive Reasoning Max Effort ranks third with an Intelligence Index of 57.^[5]
DeepSeek V4 Pro has the clearest disclosed coding/SWE benchmark profile in the supplied evidence: 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.^[6]
Kimi K2.6 is described as an “open-source flagship” agent-coding model in the launch/news evidence, but the supplied evidence does not provide numeric benchmark scores for Kimi K2.6.^[1]
DeepSeek V4 is materially cheaper than GPT-5.5 in the pricing evidence: DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens, while GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens, with both shown as 1M-context models.^[4]
DeepSeek V4 Flash is far cheaper than Kimi K2.6 in the direct pricing comparison: DeepSeek V4 Flash is listed at $0.14 per 1M input tokens, $0.28 per 1M output tokens, and $0.18 blended, while Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.^[3]
The supplied evidence treats DeepSeek V4 as a family with multiple variants, including V4 Flash and V4 Pro, so comparisons depend on which variant is meant.^[2]^[3]^[6]

Benchmark and capability comparison

Area	GPT-5.5	Claude Opus 4.7	DeepSeek V4	Kimi K2.6
Aggregate Intelligence Index	60 for GPT-5.5 xhigh; 59 for GPT-5.5 high.^[5]	57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.^[5]	No numeric Intelligence Index for DeepSeek V4 is provided in the supplied evidence.	No numeric Intelligence Index for Kimi K2.6 is provided in the supplied evidence.
Coding / SWE benchmarks	No exact coding benchmark scores are provided in the supplied evidence.	The launch evidence says Claude Opus 4.7 has improved programming and a threefold vision upgrade, but no exact benchmark numbers are provided.^[1]	DeepSeek V4 Pro is listed with 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.^[6]	Kimi K2.6 is positioned as an open-source flagship agent-coding model, but no exact benchmark numbers are provided.^[1]
Context window evidence	GPT-5.5 is listed with a 1M context window in the pricing comparison.^[4]	The supplied pricing snippet does not clearly provide Claude Opus 4.7’s context window.	DeepSeek V4 is listed with a 1M context window in one comparison, and DeepSeek V4 Flash / Pro are also described as 1M-context options in enterprise guidance.^[2]^[4]	No context-window figure for Kimi K2.6 is provided in the supplied evidence.
Pricing evidence	$5 per 1M input tokens and $30 per 1M output tokens.^[4]	$5 per 1M input tokens is visible in the supplied snippet, but the output price is truncated.^[4]	DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens; DeepSeek V4 Flash is listed at $0.14 input, $0.28 output, and $0.18 blended.^[3]^[4]	Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.^[3]
Best-supported use case from evidence	Highest aggregate intelligence among the four where evidence is available.^[5]	High-end reasoning close to GPT-5.5 on the available Intelligence Index.^[5]	Strongest supplied coding/SWE evidence and strong cost positioning, especially for Flash and Pro variants.^[2]^[3]^[6]	Potentially strong open-source agent-coding positioning, but benchmark evidence is insufficient.^[1]

Pricing and value takeaways

On the available aggregate benchmark, GPT-5.5 leads Claude Opus 4.7 by 3 points at the xhigh setting and by 2 points at the high setting.^[5]
On the available API pricing comparison, GPT-5.5 costs about 2.9x DeepSeek V4 for input tokens and about 8.6x DeepSeek V4 for output tokens.^[4]
In the DeepSeek V4 Flash vs Kimi K2.6 pricing comparison, Kimi K2.6’s blended price of $1.71 per 1M tokens is about 9.5x DeepSeek V4 Flash’s $0.18 blended price.^[3]
One enterprise-oriented comparison recommends DeepSeek V4 Flash for high-volume text summarization, customer-service classification, and internal knowledge-base Q&A because of low cost and 1M context.^[2]
The same enterprise-oriented comparison recommends DeepSeek V4 Pro for complex document analysis, cross-department workflows, and automated agents because it is described as lower-cost than closed flagship models while offering high-end open-weight capability and 1M context.^[2]
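The ratios quoted above follow directly from the listed prices, and the blended figures are consistent with a 3:1 input-to-output weighting, which is how the LLMBase listing labels its blended column. The sketch below recomputes both. The `blended` helper and its default weights illustrate that assumed weighting and are not a documented formula from any provider.

```python
def blended(price_in, price_out, in_weight=3, out_weight=1):
    """Weighted average price per 1M tokens (assumed 3:1 input:output mix)."""
    total = in_weight + out_weight
    return (in_weight * price_in + out_weight * price_out) / total


# Listed prices in USD per 1M tokens: (input, output).
gpt55 = (5.00, 30.00)
deepseek_v4 = (1.74, 3.48)
flash = (0.14, 0.28)
kimi = (0.95, 4.00)

# GPT-5.5 relative to DeepSeek V4 on each side of the bill.
input_ratio = gpt55[0] / deepseek_v4[0]
output_ratio = gpt55[1] / deepseek_v4[1]

# Recomputed blended prices, to compare with the listed $0.18 and $1.71.
flash_blended = blended(*flash)
kimi_blended = blended(*kimi)

# The quoted blended ratio uses the listed (already rounded) blended prices.
blended_ratio = 1.71 / 0.18

print(f"GPT-5.5 vs DeepSeek V4, input:  {input_ratio:.1f}x")
print(f"GPT-5.5 vs DeepSeek V4, output: {output_ratio:.1f}x")
print(f"Flash blended: ${flash_blended:.3f}, Kimi blended: ${kimi_blended:.4f}")
print(f"Kimi vs Flash, listed blended:  {blended_ratio:.1f}x")
```

The recomputed blended prices come to about $0.175 for DeepSeek V4 Flash and $1.7125 for Kimi K2.6, which match the listed $0.18 and $1.71 after rounding to cents. A workload with a different input-to-output mix would produce a different blended figure, so the 3:1 weighting is worth replacing with measured traffic where it is available.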

Evidence notes

The strongest single benchmark source in the supplied evidence is the Intelligence Index listing, which ranks GPT-5.5 xhigh first at 60, GPT-5.5 high second at 59, and Claude Opus 4.7 Adaptive Reasoning Max Effort third at 57.^[5]
The strongest task-level benchmark evidence is for DeepSeek V4 Pro, because it includes exact coding and software-engineering metrics rather than only positioning language.^[6]
The Kimi K2.6 evidence is mostly launch positioning and pricing, not benchmark results.^[1]^[3]
A user post describes four one-shot head-to-head experiments comparing Codex on GPT-5.5 with Claude Code on Opus 4.7, but the supplied snippet does not provide the outcomes or benchmark scores.^[31]
There is conflicting or variant-specific DeepSeek context/pricing evidence: one comparison lists DeepSeek V4 with a 1M context window, while another provider listing for DeepSeek V4 Pro shows 256K max tokens and 66K max output tokens.^[4]^[7]

Limitations / uncertainty

Insufficient evidence for a full, comprehensive benchmark comparison across Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6.
Insufficient evidence to rank DeepSeek V4 or Kimi K2.6 against GPT-5.5 and Claude Opus 4.7 on the same aggregate Intelligence Index, because the supplied excerpt only provides numeric Intelligence Index scores for GPT-5.5 and Claude Opus 4.7 among the requested models.^[5]
Insufficient evidence to compare all four models on the same coding benchmarks, because only DeepSeek V4 Pro has detailed numeric coding/SWE benchmark scores in the supplied evidence.^[6]
Insufficient evidence to compare multimodal performance across all four models, because the supplied evidence only includes a launch-positioning claim that Claude Opus 4.7 has a threefold vision upgrade and does not provide comparable multimodal benchmark numbers for the others.^[1]

Summary

GPT-5.5 has the strongest supplied aggregate benchmark result, with Intelligence Index scores of 60 and 59 depending on reasoning setting.^[5] Claude Opus 4.7 is close behind at 57 and is the only other requested model with a numeric aggregate score in the supplied evidence.^[5] DeepSeek V4 Pro has the best-supported coding/SWE benchmark profile, while DeepSeek V4 Flash and V4 Pro appear especially strong on cost and long-context enterprise use cases.^[2]^[3]^[6] Kimi K2.6 may be competitively positioned as an open-source agent-coding model, but the supplied evidence does not include enough numeric benchmark data to verify how it compares against GPT-5.5, Claude Opus 4.7, or DeepSeek V4.^[1]^[3]

Sources

[1] Compare DeepSeek V4 Flash (Reasoning, High Effort) vs Kimi K2.6 | AI Model Comparison (llmbase.ai)
Metric DeepSeek logo De DeepSeek V4 Flash (Reasoning, High Effort) DeepSeek Kimi logo Ki Kimi K2.6 Kimi --- Pricing per 1M tokens Input Cost $0.14/1M $0.95/1M Output Cost $0.28/1M $4.00/1M Blended (3:1) $0.18/1M $1.71/1M Specifications Organization DeepSeek...
[2] DeepSeek V4 Pro (Reasoning, High Effort) vs Kimi K2.6: Model Comparison (artificialanalysis.ai)
What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...
[3] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)
Pricing Input$0.7448 / M tokens Output$4.655 / M tokens Images– – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output)256K Max Output Tokens 66K Stream cancellation Suppo...
[7] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)
Reasoning & knowledge Benchmark GPT-5.5 Opus 4.7 Lead --- --- GPQA Diamond 93.6% 94.2% Opus +0.6 HLE (no tools) 41.4% 46.9% Opus +5.5 HLE (with tools) 52.2% 54.7% Opus +2.5 The HLE no-tools margin (+5.5pp) is the most informative entry in the table because...
[9] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[15] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[16] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[18] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)
Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...
[19] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...
[25] DeepSeek V4 Pro API - Together AI (together.ai)
Coding & Software Engineering: • 93.5% LiveCodeBench and Codeforces 3206 for competitive and production code generation • 80.6% SWE-Bench Verified for autonomous software engineering across repositories • 76.2% SWE-Bench Multilingual for cross-language soft...
[31] deepseek-v4-pro Model by Deepseek-ai | NVIDIA NIM - NVIDIA Build (build.nvidia.com)
Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max --- --- --- Knowledge & Reasoning MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5 SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9 Chinese-SimpleQA...

Tendencias en Descubrir

InformesPublicado28 abr 2026Last edited 6 may 202612 fuentes

GPT-5.5, Claude Opus 4.7, DeepSeek V4 y Kimi K2.6: qué modelo de IA conviene elegir

Buscar y verificar hechos con Studio Global AI Explora más de Descubrir

17K0

Veredicto rápido

Si lo que más te importa es…	Elección mejor respaldada	Por qué
Señal agregada de inteligencia	GPT-5.5	Artificial Analysis lista GPT-5.5 xhigh con 60 y GPT-5.5 high con 59, por delante de Claude Opus 4.7 Adaptive Reasoning Max Effort con 57.^[2]
Razonamiento difícil e ingeniería de software	Claude Opus 4.7, con GPT-5.5 muy cerca	En la tabla compartida por VentureBeat, Claude lidera GPQA Diamond, HLE sin herramientas, SWE-Bench Pro y MCP Atlas; GPT-5.5 lidera Terminal-Bench 2.0 y BrowseComp en modelo base, y GPT-5.5 Pro lidera HLE con herramientas y BrowseComp cuando aparece esa variante.^[16]
Menor costo API entre los modelos insignia listados	DeepSeek V4	Mashable lista DeepSeek V4 a $1,74 por 1 millón de tokens de entrada y $3,48 por 1 millón de tokens de salida, por debajo de GPT-5.5 a $5/$30 y Claude Opus 4.7 a $5/$25.^[15]
Métricas publicadas de coding y programación competitiva	DeepSeek V4 Pro	Together AI lista DeepSeek V4 Pro con 93,5 % en LiveCodeBench, Codeforces 3206, 80,6 % en SWE-Bench Verified y 76,2 % en SWE-Bench Multilingual.^[25]
Evaluar Kimi K2.6	Prometedor, pero no cerrado	Kimi K2.6 tiene números útiles en código y tareas agénticas, pero buena parte de la evidencia disponible lo compara con GPT-5.4 y Claude Opus 4.6, no con GPT-5.5 y Claude Opus 4.7.^[18]^[19]

Primero, el ranking agregado: ventaja para GPT-5.5

En benchmarks compartidos: decisión dividida

La tabla de VentureBeat es la referencia más útil para comparar DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro cuando aparece y Claude Opus 4.7 en las mismas pruebas.^[16]

Benchmark	DeepSeek-V4-Pro-Max	GPT-5.5	GPT-5.5 Pro, cuando aparece	Claude Opus 4.7	Mejor resultado en esa fuente
GPQA Diamond	90,1 %	93,6 %	—	94,2 %	Claude Opus 4.7^[16]
Humanity’s Last Exam, sin herramientas	37,7 %	41,4 %	43,1 %	46,9 %	Claude Opus 4.7^[16]
Humanity’s Last Exam, con herramientas	48,2 %	52,2 %	57,2 %	54,7 %	GPT-5.5 Pro^[16]
Terminal-Bench 2.0	67,9 %	82,7 %	—	69,4 %	GPT-5.5^[16]
SWE-Bench Pro / SWE Pro	55,4 %	58,6 %	—	64,3 %	Claude Opus 4.7^[16]
BrowseComp	83,4 %	84,4 %	90,1 %	79,3 %	GPT-5.5 Pro^[16]
MCP Atlas / MCPAtlas Public	73,6 %	75,3 %	—	79,1 %	Claude Opus 4.7^[16]

Programming: a single coding score is not enough

API pricing: DeepSeek changes the conversation

Model or variant	Listed input price	Listed output price	Note
GPT-5.5	$5 per 1 million tokens	$30 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
Claude Opus 4.7	$5 per 1 million tokens	$25 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
DeepSeek V4	$1.74 per 1 million tokens	$3.48 per 1 million tokens	Mashable lists it with a 1 million token context window in this comparison.^[15]
DeepSeek V4 Flash	$0.14 per 1 million tokens	$0.28 per 1 million tokens	LLMBase lists a blended price of $0.18 in its DeepSeek V4 Flash High vs Kimi K2.6 comparison.^[1]
Kimi K2.6	$0.95 per 1 million tokens	$4.00 per 1 million tokens	LLMBase lists a blended price of $1.71 in the same comparison.^[1]
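
List prices only become comparable once they are applied to a workload. The following sketch assumes a hypothetical monthly volume of 10 million input tokens and 2 million output tokens (an illustrative figure, not one taken from any source) and uses the per-million prices from the table above; `PRICES` and `workload_cost` are names invented for the example.

```python
# USD per 1 million tokens as (input, output), as listed in the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Kimi K2.6": (0.95, 4.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of a workload at the listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: an assumption made for illustration only.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

COSTS = {m: workload_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}

if __name__ == "__main__":
    # Print cheapest first.
    for model, cost in sorted(COSTS.items(), key=lambda kv: kv[1]):
        print(f"{model}: ${cost:,.2f}")
```

Under that assumed mix the estimate is about $110.00 for GPT-5.5, $100.00 for Claude Opus 4.7, $24.36 for DeepSeek V4, $17.50 for Kimi K2.6, and $1.96 for DeepSeek V4 Flash. Because prices vary by provider and variant, this is a rough comparison rather than a quote.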

What to choose by use case

GPT-5.5: the best default choice if aggregate rankings matter most

Claude Opus 4.7: strong for hard reasoning and software

DeepSeek V4: the best value bet if the variant fits

Kimi K2.6: a serious candidate for coding and agents, but less proven in this race

Caveats before deciding

Variants matter. DeepSeek V4 appears in the sources as V4, V4 Flash, V4 Pro, and DeepSeek-V4-Pro-Max; prices, limits, and results change with the variant and the reasoning mode.^[1]^[15]^[25]^[31]
Kimi comparisons are less direct. The strongest tables for Kimi K2.6 pit it against GPT-5.4 and Claude Opus 4.6, not GPT-5.5 and Claude Opus 4.7.^[18]^[19]
Humanity’s Last Exam without tools shows inconsistencies between sources. LLM Stats and VentureBeat report GPT-5.5 at 41.4% and Claude Opus 4.7 at 46.9%, while the Mashable snippet on GPT versus Claude reports GPT-5.5 at 40.6% and Opus 4.7 at 31.2%.^[7]^[16]^[9]
Internal benchmarks are not equivalent to independent rankings. Anthropic's launch post for Opus 4.7 includes gains on an internal research-agent benchmark, but that data should be read differently from a public cross-vendor comparison.^[17]
Price and context depend on the provider. The same model family can appear with different context windows, output limits, and configurations depending on the endpoint.^[3]^[15]
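
The inconsistency flagged for Humanity's Last Exam can be caught mechanically by comparing what each source reports for the same model. This is a minimal sketch: the figures are the ones quoted above, while the 1-point tolerance and the names `REPORTED` and `spread` are assumptions made for the example.

```python
# HLE (no tools) scores in percent, per source, as quoted in the text above.
REPORTED = {
    "GPT-5.5": {"LLM Stats": 41.4, "VentureBeat": 41.4, "Mashable": 40.6},
    "Claude Opus 4.7": {"LLM Stats": 46.9, "VentureBeat": 46.9, "Mashable": 31.2},
}

TOLERANCE = 1.0  # percentage points; an arbitrary threshold for illustration


def spread(by_source):
    """Gap between the highest and lowest value reported for one model."""
    values = list(by_source.values())
    return max(values) - min(values)


# Models whose reported scores disagree by more than the tolerance.
FLAGGED = [m for m, by_source in REPORTED.items() if spread(by_source) > TOLERANCE]

if __name__ == "__main__":
    for model, by_source in REPORTED.items():
        print(f"{model}: spread {spread(by_source):.1f} points")
    print("Check against the primary source:", FLAGGED)
```

With these inputs only Claude Opus 4.7 is flagged: its reported values differ by about 15.7 points, whereas the GPT-5.5 figures differ by about 0.8.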

Conclusion

Key takeaways

GPT-5.5 has the strongest aggregate signal: Artificial Analysis places GPT-5.5 xhigh at 60 and GPT-5.5 high at 59, ahead of Claude Opus 4.7 at 57.^[2]
Claude Opus 4.7 wins several shared reasoning and software tests; DeepSeek V4 offers the clearest API cost advantage; Kimi K2.6 is promising, but with less direct evidence against GPT-5.5 and Opus 4....
In programming, Claude leads VentureBeat's shared SWE-Bench Pro row with 64.3%, while DeepSeek V4 Pro shows a very detailed coding profile, with 93.5% on LiveCodeBench and Codeforces 3206.^[16]^[25]

Research conversation

You, 26 Apr 2026

Research question

Research benchmarks of Claude Opus 4.7, GPT-5.5, DeepSeek V4, Kimi K2.6 and compare as comprehensively as possible

Studio Global AI, 26 Apr 2026, 32 sources

Cited answer

Key findings

GPT-5.5 is the best-supported leader on the only cross-model aggregate benchmark available here: GPT-5.5 xhigh ranks first with an Intelligence Index of 60, and GPT-5.5 high ranks second with 59.^[5]
Claude Opus 4.7 is close behind on that same aggregate measure: Claude Opus 4.7 Adaptive Reasoning Max Effort ranks third with an Intelligence Index of 57.^[5]
DeepSeek V4 Pro has the clearest disclosed coding/SWE benchmark profile in the supplied evidence: 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.^[6]
Kimi K2.6 is described as an “open-source flagship” agent-coding model in the launch/news evidence, but the supplied evidence does not provide numeric benchmark scores for Kimi K2.6.^[1]
DeepSeek V4 is materially cheaper than GPT-5.5 in the pricing evidence: DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens, while GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens, with both shown as 1M-context models.^[4]
DeepSeek V4 Flash is far cheaper than Kimi K2.6 in the direct pricing comparison: DeepSeek V4 Flash is listed at $0.14 per 1M input tokens, $0.28 per 1M output tokens, and $0.18 blended, while Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.^[3]
The supplied evidence treats DeepSeek V4 as a family with multiple variants, including V4 Flash and V4 Pro, so comparisons depend on which variant is meant.^[2]^[3]^[6]

Benchmark and capability comparison

Area	GPT-5.5	Claude Opus 4.7	DeepSeek V4	Kimi K2.6
Aggregate Intelligence Index	60 for GPT-5.5 xhigh; 59 for GPT-5.5 high.^[5]	57 for Claude Opus 4.7 Adaptive Reasoning Max Effort.^[5]	No numeric Intelligence Index for DeepSeek V4 is provided in the supplied evidence.	No numeric Intelligence Index for Kimi K2.6 is provided in the supplied evidence.
Coding / SWE benchmarks	No exact coding benchmark scores are provided in the supplied evidence.	The launch evidence says Claude Opus 4.7 has improved programming and a threefold vision upgrade, but no exact benchmark numbers are provided.^[1]	DeepSeek V4 Pro is listed with 93.5% LiveCodeBench, Codeforces 3206, 80.6% SWE-Bench Verified, and 76.2% SWE-Bench Multilingual.^[6]	Kimi K2.6 is positioned as an open-source flagship agent-coding model, but no exact benchmark numbers are provided.^[1]
Context window evidence	GPT-5.5 is listed with a 1M context window in the pricing comparison.^[4]	The supplied pricing snippet does not clearly provide Claude Opus 4.7’s context window.	DeepSeek V4 is listed with a 1M context window in one comparison, and DeepSeek V4 Flash / Pro are also described as 1M-context options in enterprise guidance.^[2]^[4]	No context-window figure for Kimi K2.6 is provided in the supplied evidence.
Pricing evidence	$5 per 1M input tokens and $30 per 1M output tokens.^[4]	$5 per 1M input tokens is visible in the supplied snippet, but the output price is truncated.^[4]	DeepSeek V4 is listed at $1.74 per 1M input tokens and $3.48 per 1M output tokens; DeepSeek V4 Flash is listed at $0.14 input, $0.28 output, and $0.18 blended.^[3]^[4]	Kimi K2.6 is listed at $0.95 per 1M input tokens, $4.00 per 1M output tokens, and $1.71 blended.^[3]
Best-supported use case from evidence	Highest aggregate intelligence among the four where evidence is available.^[5]	High-end reasoning close to GPT-5.5 on the available Intelligence Index.^[5]	Strongest supplied coding/SWE evidence and strong cost positioning, especially for Flash and Pro variants.^[2]^[3]^[6]	Potentially strong open-source agent-coding positioning, but benchmark evidence is insufficient.^[1]

Pricing and value takeaways

On the available aggregate benchmark, GPT-5.5 leads Claude Opus 4.7 by 3 points at the xhigh setting and by 2 points at the high setting.^[5]
On the available API pricing comparison, GPT-5.5 costs about 2.9x DeepSeek V4 for input tokens and about 8.6x DeepSeek V4 for output tokens.^[4]
In the DeepSeek V4 Flash vs Kimi K2.6 pricing comparison, Kimi K2.6’s blended price of $1.71 per 1M tokens is about 9.5x DeepSeek V4 Flash’s $0.18 blended price.^[3]
One enterprise-oriented comparison recommends DeepSeek V4 Flash for high-volume text summarization, customer-service classification, and internal knowledge-base Q&A because of low cost and 1M context.^[2]
The same enterprise-oriented comparison recommends DeepSeek V4 Pro for complex document analysis, cross-department workflows, and automated agents because it is described as lower-cost than closed flagship models while offering high-end open-weight capability and 1M context.^[2]
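
The multiples quoted in this list follow directly from the listed prices. The sketch below reproduces them; the 3:1 input-to-output weighting for the blended price is taken from the "Blended (3:1)" label in the LLMBase snippet, and the function name `blended` is a hypothetical helper.

```python
def blended(price_in, price_out, in_weight=3, out_weight=1):
    """Weighted average price per 1M tokens (3:1 input:output by default)."""
    return (in_weight * price_in + out_weight * price_out) / (in_weight + out_weight)


# GPT-5.5 vs DeepSeek V4, USD per 1M tokens.
INPUT_RATIO = 5.00 / 1.74
OUTPUT_RATIO = 30.00 / 3.48

# Blended prices recomputed from the listed input and output prices.
FLASH_BLENDED = blended(0.14, 0.28)
KIMI_BLENDED = blended(0.95, 4.00)

# Ratio of the blended prices as listed (the source rounds them to cents).
LISTED_BLENDED_RATIO = 1.71 / 0.18

if __name__ == "__main__":
    print(f"input: {INPUT_RATIO:.1f}x, output: {OUTPUT_RATIO:.1f}x")
    print(f"blended: Flash ${FLASH_BLENDED:.3f}, Kimi ${KIMI_BLENDED:.4f}")
    print(f"Kimi vs Flash (listed blended): {LISTED_BLENDED_RATIO:.1f}x")
```

The unrounded blended values, roughly $0.175 and $1.7125, are consistent with the listed $0.18 and $1.71, and the ratios round to the 2.9x, 8.6x, and 9.5x figures quoted above.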

Evidence notes

The strongest single benchmark source in the supplied evidence is the Intelligence Index listing, which ranks GPT-5.5 xhigh first at 60, GPT-5.5 high second at 59, and Claude Opus 4.7 Adaptive Reasoning Max Effort third at 57.^[5]
The strongest task-level benchmark evidence is for DeepSeek V4 Pro, because it includes exact coding and software-engineering metrics rather than only positioning language.^[6]
The Kimi K2.6 evidence is mostly launch positioning and pricing, not benchmark results.^[1]^[3]
A user post describes four one-shot head-to-head experiments comparing Codex on GPT-5.5 with Claude Code on Opus 4.7, but the supplied snippet does not provide the outcomes or benchmark scores.^[31]
There is conflicting or variant-specific DeepSeek context/pricing evidence: one comparison lists DeepSeek V4 with a 1M context window, while another provider listing for DeepSeek V4 Pro shows 256K max tokens and 66K max output tokens.^[4]^[7]

Limitations / uncertainty

Insufficient evidence for a full, comprehensive benchmark comparison across Claude Opus 4.7, GPT-5.5, DeepSeek V4, and Kimi K2.6.
Insufficient evidence to rank DeepSeek V4 or Kimi K2.6 against GPT-5.5 and Claude Opus 4.7 on the same aggregate Intelligence Index, because the supplied excerpt only provides numeric Intelligence Index scores for GPT-5.5 and Claude Opus 4.7 among the requested models.^[5]
Insufficient evidence to compare all four models on the same coding benchmarks, because only DeepSeek V4 Pro has detailed numeric coding/SWE benchmark scores in the supplied evidence.^[6]
Insufficient evidence to compare multimodal performance across all four models, because the supplied evidence only includes a launch-positioning claim that Claude Opus 4.7 has a threefold vision upgrade and does not provide comparable multimodal benchmark numbers for the others.^[1]

Summary

Sources

[1] Compare DeepSeek V4 Flash (Reasoning, High Effort) vs Kimi K2.6 | AI Model Comparison (llmbase.ai)
Metric DeepSeek logo De DeepSeek V4 Flash (Reasoning, High Effort) DeepSeek Kimi logo Ki Kimi K2.6 Kimi --- Pricing per 1M tokens Input Cost $0.14/1M $0.95/1M Output Cost $0.28/1M $4.00/1M Blended (3:1) $0.18/1M $1.71/1M Specifications Organization DeepSeek...
[2] DeepSeek V4 Pro (Reasoning, High Effort) vs Kimi K2.6: Model Comparison (artificialanalysis.ai)
What are the top AI models? The top AI models by Intelligence Index are: 1. GPT-5.5 (xhigh) (60), 2. GPT-5.5 (high) (59), 3. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) (57), 4. Gemini 3.1 Pro Preview (57), 5. GPT-5.4 (xhigh) (57). Which is the fastest...
[3] DeepSeek V4 Pro vs Kimi K2.6 - AI Model Comparison | OpenRouter (openrouter.ai)
Pricing Input $0.7448 / M tokens Output $4.655 / M tokens Images – – Features Input Modalities text, image Output Modalities text Quantization int4 Max Tokens (input + output) 256K Max Output Tokens 66K Stream cancellation Suppo...
[7] GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks - LLM Stats (llm-stats.com)
Reasoning & knowledge Benchmark GPT-5.5 Opus 4.7 Lead --- --- GPQA Diamond 93.6% 94.2% Opus +0.6 HLE (no tools) 41.4% 46.9% Opus +5.5 HLE (with tools) 52.2% 54.7% Opus +2.5 The HLE no-tools margin (+5.5pp) is the most informative entry in the table because...
[9] OpenAI's GPT-5.5 vs Claude Opus 4.7: Which is better? | Mashable (mashable.com)
SWE-Bench Pro: GPT-5.5 scored 58.6; Opus 4.7 scored 64.3 percent Terminal-Bench 2.0: GPT-5.5 scored 82.7 percent; Opus 4.7 scored 69.4 percent Humanity's Last Exam: GPT-5.5 scored 40.6 percent; Opus 4.7 scored 31.2 percent Humanity's...
[15] DeepSeek V4 is here: How it compares to ChatGPT, Claude, Gemini (mashable.com)
Here's how the API pricing compares: DeepSeek V4 costs $1.74 per 1 million input tokens and $3.48 per 1 million output tokens (1 million context window) GPT-5.5 costs at $5 per 1 million input tokens and $30 per 1 million output tokens (1 million context wi...
[16] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 48.2% 52.2% 57.2% 54...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-cont...
[18] Kimi K2.6 Tested: Does It Beat Claude and GPT-5? | Lorka AI (lorka.ai)
Benchmark What it tests Kimi K2.6 GPT-5.4 Opus 4.6 Gemini 3.1 Pro --- --- --- HLE-Full (with tools) Agentic reasoning with tool use 54.0% 52.1% 53.0% 51.4% DeepSearchQA (F1) Research retrieval and synthesis 92.5% 78.6% 91.3% 81.9% SWE-Bench Pro Multi-file c...
[19] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...
[25] DeepSeek V4 Pro API - Together AI (together.ai)
Coding & Software Engineering: • 93.5% LiveCodeBench and Codeforces 3206 for competitive and production code generation • 80.6% SWE-Bench Verified for autonomous software engineering across repositories • 76.2% SWE-Bench Multilingual for cross-language soft...
[31] deepseek-v4-pro Model by Deepseek-ai | NVIDIA NIM - NVIDIA Build (build.nvidia.com)
Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max --- --- --- Knowledge & Reasoning MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5 SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9 Chinese-SimpleQA...
