Reports · Published 29 Apr 2026 · Last edited 6 May 2026 · 12 sources

GPT-5.5, Claude Opus 4.7, Kimi K2.6 and DeepSeek V4: benchmark comparison


Benchmark charts make this comparison look like a race with a single finish line. It is not. The closest common comparison in the cited sources covers GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; Kimi K2.6 appears in separate Kimi-focused sources, its model card and rankings ^[1]^[6]^[24]. So the useful question is less "which one wins?" than "which one should I test first for my workload?".

A note on names: here I use DeepSeek-V4-Pro-Max to refer to DeepSeek V4, because that is the variant that appears with benchmark rows and cost in the cited sources ^[18]^[24]. I also separate GPT-5.5 Pro from base GPT-5.5 whenever the source reports distinct results ^[24].

Quick verdict by type of work

Coding agents that live in the terminal: GPT-5.5 has the best cited result on Terminal-Bench 2.0 within the shared comparison, at 82.7% ^[24].
Software repair: Claude Opus 4.7 leads the cited SWE-Bench Pro row with 64.3% and the SWE-Bench Verified row with 87.6% ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 tops the shared rows for GPQA Diamond and Humanity’s Last Exam without tools ^[24].
Reasoning with tools and browsing: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1%, wherever that Pro variant is reported ^[24].
Open-weight deployment: Kimi K2.6 is the clearest candidate in the cited sources: it is described as an MoE model with 1 trillion parameters, 32 billion active, and a 256K context window ^[1].
Price-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value candidate to validate: LLM Stats lists it with a 1M-token context, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

A dash (-) means I did not find that score in the cited sources for that model; it does not mean the model scored zero. The rows for GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7 and DeepSeek-V4-Pro-Max come mostly from one shared comparison; the Kimi K2.6 figures come from separate Kimi sources ^[1]^[6]^[24].

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	-	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, without tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	-	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	-	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	-	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	-	79.1% ^[24]	-	73.6% ^[24]
SWE-Bench Verified	-	-	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
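
To make the per-row reading of the table concrete, here is a minimal Python sketch that picks the highest cited score in each row while treating missing cells as unknown rather than zero. The variable and function names (`MODELS`, `SCORES`, `row_leader`) are illustrative, and the Kimi K2.6 GPQA Diamond value is entered as 91.0 only because the table gives an approximate figure.

```python
from typing import Optional

MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores copied from the comparison table; None marks a cell with no cited score.
# Kimi K2.6 on GPQA Diamond is approximate in the table (about 91%).
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, 91.0, 90.1],
    "HLE, without tools": [41.4, 43.1, 46.9, None, 37.7],
    "HLE, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def row_leader(row: list[Optional[float]]) -> tuple[str, float]:
    """Return (model, score) for the highest cited score, skipping missing cells."""
    cited = [(model, score) for model, score in zip(MODELS, row) if score is not None]
    return max(cited, key=lambda pair: pair[1])


leaders = {bench: row_leader(row) for bench, row in SCORES.items()}
for bench, (model, score) in leaders.items():
    print(f"{bench}: {model} ({score}%)")
```

Skipping `None` instead of substituting zero matters here: a missing cell only says the score was not found in the cited sources, and the Kimi K2.6 column still comes from different sources than the other four.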

Which model to test first

Priority	Start with	Why
Terminal-oriented coding agents	GPT-5.5	It has the best cited score on Terminal-Bench 2.0 within the shared comparison: 82.7% ^[24].
Software repair	Claude Opus 4.7	It leads the cited SWE-Bench Pro row and the SWE-Bench Verified row among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	It tops GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison ^[24].
Reasoning with tools or browsing	GPT-5.5 Pro	It leads Humanity’s Last Exam with tools and BrowseComp when GPT-5.5 Pro is reported separately ^[24].
Open-weight deployment	Kimi K2.6	It is described as an open-weight MoE with 1 trillion parameters, and its Hugging Face model card reports strong rows on coding benchmarks ^[1]^[6].
Hosted inference under cost pressure	DeepSeek-V4-Pro-Max	LLM Stats lists it with a 1M-token context, 80.6% on SWE-Bench Verified, and cost columns lower than those of Claude Opus 4.7 in the same ranking ^[18].
Long-context needs	GPT-5.5, Claude Opus 4.7 or DeepSeek-V4-Pro-Max	The cited sources list a 1M-token context for GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; Kimi K2.6 appears at around 256K to 262K ^[1]^[11]^[16]^[18]^[27].

Model notes

GPT-5.5

OpenAI describes GPT-5.5 as a model built for complex tasks like coding, research, and data analysis ^[38]. In the cited shared comparison, GPT-5.5 reaches 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[24]. The same table also shows it at 93.6% on GPQA Diamond, 58.6% on SWE-Bench Pro and 84.4% on BrowseComp ^[24].

The main caveat is that GPT-5.5 Pro works as a separate comparison point. In the same table, GPT-5.5 Pro reaches 90.1% on BrowseComp and 57.2% on Humanity’s Last Exam with tools, but those results should not be mixed with base GPT-5.5 when comparing cost, latency or model configuration ^[24].

For procurement or budget planning, BenchLM lists GPT-5.5 with a 1-million-token context window, while a pricing report puts it at $5 per million input tokens and $30 per million output tokens ^[27]^[36]. Treat that figure as a signal to verify, not as a final quote.

Claude Opus 4.7

Claude Opus 4.7 has the strongest software-repair signals in this group. LLM Stats lists it at 87.6% on SWE-Bench Verified, and the shared comparison reports 64.3% on SWE-Bench Pro ^[18]^[24]. It also leads the shared rows for GPQA Diamond at 94.2%, Humanity’s Last Exam without tools at 46.9% and MCP Atlas at 79.1% ^[24].

LLM Stats reports a 1-million-token context window and pricing of $5/$25 per million tokens for Claude Opus 4.7 ^[16]. The comparability caveat matters: Anthropic notes that some results used internal implementations or updated harness parameters, and that certain scores are not directly comparable with public rankings ^[17].

Kimi K2.6

Kimi K2.6 is the clearest open-weight candidate in the cited material. Launch coverage describes it as an open-weight MoE with 1 trillion parameters, 32 billion active, 384 experts, native multimodality, INT4 quantization and a 256K context ^[1]. Its Hugging Face model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0 and 89.6 on LiveCodeBench v6 ^[6].

The same launch coverage reports 54.0 on Humanity’s Last Exam with tools and 83.2 on BrowseComp for Kimi K2.6 ^[1]. LLM Stats lists it with a 262K context, price columns of $0.95/$4.00 and an Open Source label ^[11]. The limitation is that its figures do not come from the same shared table as GPT-5.5, Claude Opus 4.7 and DeepSeek-V4-Pro-Max; small differences should therefore help decide what to test, not declare a definitive winner ^[1]^[6]^[24].

DeepSeek-V4-Pro-Max

DeepSeek-V4-Pro-Max looks more like a value candidate than an outright benchmark leader. LLM Stats lists it with a size of 1.6T, a 1M-token context, 80.6% on SWE-Bench Verified and cost columns of $1.74/$3.48 ^[18]. In the shared comparison, it scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro, 83.4% on BrowseComp and 73.6% on MCP Atlas ^[24].

Those figures make it interesting for price-sensitive workloads. Even so, the same table shows GPT-5.5, GPT-5.5 Pro or Claude Opus 4.7 leading most of the reported rows, so DeepSeek should be validated on your own tasks before it replaces a premium model in production ^[24].

Context and price signals

Context windows and prices do not always come from the same source or directly from the provider. Use them as procurement signals, not as firm quotes.

Model	Cited context and price signal	Practical reading
GPT-5.5	BenchLM lists a 1M-token context; a pricing report lists $5 input and $30 output per million tokens ^[27]^[36].	Premium hosted option; verify current pricing.
Claude Opus 4.7	LLM Stats reports a 1M-token context and pricing of $5/$25 per million tokens ^[16].	Premium option for coding, reasoning and long context.
Kimi K2.6	Launch coverage reports a 256K context; LLM Stats lists 262K and $0.95/$4.00 in its price columns ^[1]^[11].	Strong open-weight candidate; hosted pricing may vary by provider.
DeepSeek-V4-Pro-Max	LLM Stats lists a 1M-token context, a size of 1.6T, 80.6% on SWE-Bench Verified and $1.74/$3.48 in its cost columns ^[18].	Good value candidate if it holds its quality on your real workload.
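
As a rough way to turn these price signals into a budget estimate, the sketch below multiplies token volumes by the listed prices. It assumes every price pair is USD per million input and output tokens; the sources state that explicitly for GPT-5.5 and Claude Opus 4.7, while reading the Kimi K2.6 and DeepSeek-V4-Pro-Max columns the same way is an assumption. The workload of 2 million input and 500,000 output tokens is hypothetical.

```python
# Price signals as (input, output) in USD per million tokens, taken from the table.
# Assumption: the Kimi K2.6 and DeepSeek-V4-Pro-Max columns follow the same
# input/output convention as the GPT-5.5 and Claude Opus 4.7 figures.
PRICES: dict[str, tuple[float, float]] = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.95, 4.00),
    "DeepSeek-V4-Pro-Max": (1.74, 3.48),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload from per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
for model in PRICES:
    cost = estimate_cost(model, 2_000_000, 500_000)
    print(f"{model}: ${cost:.2f}")
```

The estimate ignores caching discounts, reasoning-token overhead and provider-specific tiers, so it is only useful for ordering candidates before checking current pricing.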

Why rankings do not always agree

Each row measures a different skill. GPQA Diamond and Humanity’s Last Exam target hard reasoning; Terminal-Bench 2.0 and the SWE-Bench variants focus on coding and agentic software work; BrowseComp measures retrieval and browsing performance in the shared comparison ^[24]. A model can lead one row and trail in another because the task, the tool access and the evaluation environment all change.

Even a single benchmark can vary with the implementation. LLM Stats lists Claude Opus 4.7 at 87.6% on SWE-Bench Verified, while LMCouncil lists it at 83.5% ± 1.7 under its own configuration ^[18]^[30]. Anthropic also states that some results used internal implementations or updated harness parameters, which limits direct comparison with public rankings ^[17].

For that reason, a one- or two-point difference should not decide a production deployment on its own. Public benchmarks help narrow the list; your own evaluation should make the final call.
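
One way to see why small gaps carry little weight is to estimate the sampling error of a pass rate. The sketch below uses the normal approximation to a binomial proportion; the task count of 500 is an illustrative assumption rather than a figure from the cited sources, and the helper names (`score_margin`, `likely_noise`) are hypothetical. It only models sampling noise, not the harness differences described above.

```python
import math


def score_margin(score_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for a pass rate.

    Uses the normal approximation to a binomial proportion and assumes
    the tasks are independent.
    """
    p = score_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_tasks)
    return 100.0 * z * standard_error


def likely_noise(score_a: float, score_b: float, n_tasks: int) -> bool:
    """Rough heuristic: is the gap no larger than the wider of the two margins?"""
    gap = abs(score_a - score_b)
    widest = max(score_margin(score_a, n_tasks), score_margin(score_b, n_tasks))
    return gap <= widest


N_TASKS = 500  # illustrative assumption, not taken from the cited sources

print(round(score_margin(87.6, N_TASKS), 1))
print(likely_noise(80.6, 80.2, N_TASKS))  # the two SWE-Bench Verified rows 0.4 points apart
print(likely_noise(87.6, 80.6, N_TASKS))  # a 7-point gap
```

Under these assumptions the margin is a few percentage points, so a 0.4-point gap is indistinguishable from noise while a 7-point gap is not.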

How to evaluate the finalists

Before committing to a model, test the top two or three candidates on tasks that resemble your real workload.

Use real prompts, files and repositories. Benchmark prompts rarely capture your codebase, your documents, your internal policies or your users’ behavior.
Replicate the tool environment. Coding-agent results change when the model has a terminal, browsing, document retrieval, repository context or internal APIs.
Measure cost and latency under the same configuration. Pro modes and higher-effort settings can improve quality, but they also change token consumption and response time.
Review failures by hand. For coding, look at tests, diffs, maintainability, security regressions and invented dependencies.
Include at least one cheaper rival. Kimi K2.6 and DeepSeek-V4-Pro-Max deserve a place in the test if open weights or inference cost matter ^[1]^[18].
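
The checklist above can be sketched as a small harness that runs every candidate on the same tasks and records pass counts and wall-clock time. Everything here is hypothetical scaffolding: `Task`, `Result`, `evaluate` and the two stub models are illustrative names, and the stubs stand in for real API clients, which would also need to report token usage for the cost comparison.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer is acceptable


@dataclass
class Result:
    model: str
    passed: int
    total: int
    seconds: float


def evaluate(model_name: str, call_model: Callable[[str], str], tasks: list[Task]) -> Result:
    """Run one model over every task with identical prompts and checks."""
    passed = 0
    start = time.perf_counter()
    for task in tasks:
        answer = call_model(task.prompt)
        if task.check(answer):
            passed += 1
    return Result(model_name, passed, len(tasks), time.perf_counter() - start)


# Hypothetical stand-ins for real model clients.
def stub_model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else ""


def stub_model_b(prompt: str) -> str:
    return "unknown"


tasks = [
    Task("What is 2 + 2?", lambda answer: answer.strip() == "4"),
    Task("Name the capital of France.", lambda answer: "paris" in answer.lower()),
]

results = [
    evaluate("stub-a", stub_model_a, tasks),
    evaluate("stub-b", stub_model_b, tasks),
]
for result in results:
    print(f"{result.model}: {result.passed}/{result.total} in {result.seconds:.3f}s")
```

Automated checks only cover part of the list: failed tasks still need a manual read of the diffs and test output.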

Conclusion

If you want a high-end shortlist, test GPT-5.5 and Claude Opus 4.7 side by side: GPT-5.5 has the best cited result on Terminal-Bench 2.0, while Claude Opus 4.7 has the best cited results on SWE-Bench Pro and SWE-Bench Verified ^[18]^[24]. If you need open weights, start with Kimi K2.6 ^[1]^[6]. If cost is the main constraint, include DeepSeek-V4-Pro-Max, but validate it on your own workload before treating it as a drop-in replacement for the premium options ^[18]^[24].


Key takeaways

For terminal-heavy coding agents, GPT-5.5 starts with an edge; for software repair, Claude Opus 4.7 leads the cited SWE-Bench rows ^[18]^[24].
GPT-5.5 Pro should not be mixed with base GPT-5.5: where it is reported separately, it leads BrowseComp at 90.1% and Humanity’s Last Exam with tools at 57.2% ^[24].
Kimi K2.6 is the clearest open-weight candidate, while DeepSeek-V4-Pro-Max is worth testing if inference cost is the main constraint ^[1]^[18].


Research conversation

You · 25 Apr 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI · 25 Apr 2026 · 40 sources

Cited answer

I found public benchmark data for all four, but not one fully independent, apples-to-apples leaderboard covering GPT-5.5, Claude Opus 4.7, Kimi K2.6, and DeepSeek V4 under the same harness. The practical read is: GPT-5.5 looks strongest for broad agentic coding/computer-use tasks, Claude Opus 4.7 is extremely strong on coding and hard reasoning, Kimi K2.6 is the strongest open-weight option in this group, and DeepSeek V4 looks like the best price/performance play if its reported results hold up across independent testing ^[4]^[5]^[6]^[13].

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

Insufficient evidence for one single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.

Summary

If you want the safest high-end general choice, start with GPT-5.5. If your priority is top-tier coding reliability and long-context enterprise work, test Claude Opus 4.7. If you need open weights and strong frontier-adjacent performance, Kimi K2.6 is the most compelling open option. If cost is the constraint and you can tolerate some validation risk, DeepSeek V4 is the one to benchmark first.

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
LLM Stats Logo Make AI phone calls with one API call Claude Opus 4.7: Benchmarks, Pricing, Context & What's New Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

Tendencias en Descubrir

InformesPublicado29 abr 2026Last edited 6 may 202612 fuentes

GPT-5.5, Claude Opus 4.7, Kimi K2.6 y DeepSeek V4: comparativa de benchmarks

Buscar y verificar hechos con Studio Global AI Explora más de Descubrir

17K0

Veredicto rápido por tipo de trabajo

Agentes de programación que viven en la terminal: GPT-5.5 tiene el mejor resultado citado en Terminal-Bench 2.0 dentro de la comparación compartida, con 82,7% ^[24].
Reparación de software: Claude Opus 4.7 lidera la fila citada de SWE-Bench Pro con 64,3% y la de SWE-Bench Verified con 87,6% ^[18]^[24].
Razonamiento difícil sin herramientas: Claude Opus 4.7 encabeza las filas compartidas de GPQA Diamond y Humanity’s Last Exam sin herramientas ^[24].
Razonamiento con herramientas y navegación: GPT-5.5 Pro lidera Humanity’s Last Exam con herramientas con 57,2% y BrowseComp con 90,1%, allí donde se informa esa variante Pro ^[24].
Despliegue con pesos abiertos: Kimi K2.6 es el candidato más claro en las fuentes citadas: se describe como un modelo MoE de 1 billón de parámetros, con 32.000 millones activos y ventana de contexto de 256K ^[1].
Inferencia alojada sensible al precio: DeepSeek-V4-Pro-Max es el candidato de valor a validar: LLM Stats lo lista con contexto de 1 millón, 80,6% en SWE-Bench Verified y columnas de coste de $1.74/$3.48 ^[18].

Tabla comparativa de benchmarks

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93,6% ^[24]	—	94,2% ^[24]	≈91% ^[28]	90,1% ^[24]
Humanity’s Last Exam, sin herramientas	41,4% ^[24]	43,1% ^[24]	46,9% ^[24]	—	37,7% ^[24]
Humanity’s Last Exam, con herramientas	52,2% ^[24]	57,2% ^[24]	54,7% ^[24]	54,0% ^[1]	48,2% ^[24]
Terminal-Bench 2.0	82,7% ^[24]	—	69,4% ^[24]	66,7% ^[6]	67,9% ^[24]
SWE-Bench Pro	58,6% ^[24]	—	64,3% ^[24]	58,6% ^[6]	55,4% ^[24]
BrowseComp	84,4% ^[24]	90,1% ^[24]	79,3% ^[24]	83,2% ^[1]	83,4% ^[24]
MCP Atlas / MCPAtlas Public	75,3% ^[24]	—	79,1% ^[24]	—	73,6% ^[24]
SWE-Bench Verified	—	—	87,6% ^[18]	80,2% ^[6]	80,6% ^[18]

Qué modelo probar primero

Prioridad	Empieza por	Por qué
Agentes de código orientados a terminal	GPT-5.5	Tiene la mejor puntuación citada en Terminal-Bench 2.0 dentro de la comparación compartida: 82,7% ^[24].
Reparación de software	Claude Opus 4.7	Lidera la fila citada de SWE-Bench Pro y la de SWE-Bench Verified entre estos modelos ^[18]^[24].
Razonamiento difícil sin herramientas	Claude Opus 4.7	Encabeza GPQA Diamond y Humanity’s Last Exam sin herramientas en la comparación compartida ^[24].
Razonamiento con herramientas o navegación	GPT-5.5 Pro	Lidera Humanity’s Last Exam con herramientas y BrowseComp cuando GPT-5.5 Pro se informa por separado ^[24].
Despliegue con pesos abiertos	Kimi K2.6	Se describe como un MoE de pesos abiertos con 1 billón de parámetros, y su ficha en Hugging Face informa filas fuertes en benchmarks de programación ^[1]^[6].
Inferencia alojada con presión de costes	DeepSeek-V4-Pro-Max	LLM Stats lo lista con contexto de 1 millón, 80,6% en SWE-Bench Verified y columnas de coste inferiores a las de Claude Opus 4.7 en el mismo ranking ^[18].
Necesidades de contexto largo	GPT-5.5, Claude Opus 4.7 o DeepSeek-V4-Pro-Max	Las fuentes citadas listan contexto de 1 millón para GPT-5.5, Claude Opus 4.7 y DeepSeek-V4-Pro-Max; Kimi K2.6 aparece alrededor de 256K a 262K ^[1]^[11]^[16]^[18]^[27].

Notas por modelo

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max

Contexto y señales de precio

Las ventanas de contexto y los precios no siempre proceden de la misma fuente ni del proveedor directo. Úsalos como señales para compras, no como presupuestos cerrados.

Modelo	Señal citada de contexto y precio	Lectura práctica
GPT-5.5	BenchLM lista contexto de 1 millón; un informe de precios lista $5 de entrada y $30 de salida por millón de tokens ^[27]^[36].	Opción premium alojada; verificar precio vigente.
Claude Opus 4.7	LLM Stats informa contexto de 1 millón y precio de $5/$25 por millón de tokens ^[16].	Opción premium para programación, razonamiento y contexto largo.
Kimi K2.6	La cobertura de lanzamiento informa contexto de 256K; LLM Stats lista 262K y $0.95/$4.00 en sus columnas de precio ^[1]^[11].	Candidato fuerte de pesos abiertos; el precio alojado puede variar según proveedor.
DeepSeek-V4-Pro-Max	LLM Stats lista contexto de 1 millón, tamaño de 1,6T, 80,6% en SWE-Bench Verified y $1.74/$3.48 en columnas de coste ^[18].	Buen candidato de valor si mantiene calidad en tu carga real.

Por qué los rankings no siempre coinciden

Cómo evaluar a los finalistas

Antes de comprometerte con un modelo, prueba los dos o tres candidatos principales con tareas parecidas a tu carga real.

Usa prompts, archivos y repositorios reales. Los prompts de benchmark rara vez capturan tu base de código, tus documentos, tus políticas internas o el comportamiento de tus usuarios.
Replica el entorno de herramientas. Los resultados de agentes de programación cambian cuando el modelo tiene terminal, navegación, recuperación documental, contexto del repositorio o APIs internas.
Mide coste y latencia con la misma configuración. Los modos Pro y los ajustes de mayor esfuerzo pueden mejorar la calidad, pero también cambian consumo de tokens y tiempo de respuesta.
Revisa los fallos a mano. En programación, mira tests, diffs, mantenibilidad, regresiones de seguridad y dependencias inventadas.
Incluye al menos un rival más barato. Kimi K2.6 y DeepSeek-V4-Pro-Max merecen estar en la prueba si importan los pesos abiertos o el coste de inferencia ^[1]^[18].

Conclusión

Key takeaways

For terminal-heavy coding agents, GPT-5.5 starts with the advantage; for software repair, Claude Opus 4.7 leads the cited SWE-Bench rows [18][24].
GPT-5.5 Pro should not be conflated with base GPT-5.5: where it is reported separately, it leads BrowseComp at 90.1% and Humanity’s Last Exam with tools at 57.2% [24].
Kimi K2.6 is the clearest open-weight candidate, while DeepSeek-V4-Pro-Max is worth testing if inference cost is the main constraint [1][18].

Research conversation

You, 25 Apr 2026

Research question

Search for the benchmarks of GPT-5.5, Claude Opus 4.7, Kimi K2.6, DeepSeek V4. Compare them to help me understand more about them.

Studio Global AI, 25 Apr 2026, 40 sources

Cited answer

Key findings

GPT-5.5’s official benchmarks emphasize agentic coding and computer-use: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, and 81.8% on CyberGym ^[4].
GPT-5.5 Pro appears stronger on some harder research/search/math tasks, including 90.1% on BrowseComp, 52.4% on FrontierMath Tier 1–3, and 39.6% on FrontierMath Tier 4 ^[4].
Claude Opus 4.7 is positioned by Anthropic as a premium hybrid reasoning model for coding and agents, with a 1M-token context window and API pricing of $5 per million input tokens and $25 per million output tokens ^[2]. Third-party benchmark reporting lists Claude Opus 4.7 at 87.6% on SWE-bench Verified and 94.2% on GPQA Diamond ^[5].
Kimi K2.6 is an open-weight multimodal MoE model with 1T total parameters, 32B active parameters, and a 256K-token context window ^[13]. Its model card reports 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, 66.7% on Terminal-Bench 2.0, 90.5% on GPQA Diamond, and 54.0% on HLE-Full with tools ^[13].
DeepSeek V4-Pro-Max is reported as a 1.6T-parameter open model with much lower API pricing than GPT-5.5 and Claude Opus 4.7 ^[6]. Reported comparison data puts DeepSeek V4-Pro-Max at 90.1% on GPQA Diamond, 37.7% on HLE without tools, 48.2% on HLE with tools, and 67.9% on Terminal-Bench 2.0 ^[6].

Comparison

Model	Best fit	Notable reported benchmarks	Main caveat
GPT-5.5	Best general pick for agentic coding, computer use, research workflows, and tool-heavy tasks	Terminal-Bench 2.0: 82.7%; SWE-Bench Pro: 58.6%; OSWorld-Verified: 78.7%; BrowseComp: 84.4%; CyberGym: 81.8% ^[4]	OpenAI’s strongest numbers include internal or vendor-run evals, so independent confirmation matters ^[4].
GPT-5.5 Pro	Harder reasoning/search/math where cost is less important	BrowseComp: 90.1%; FrontierMath Tier 1–3: 52.4%; FrontierMath Tier 4: 39.6% ^[4]	Higher-tier “Pro” results are not directly comparable to base GPT-5.5 or open models unless the same effort/budget is used ^[4].
Claude Opus 4.7	Premium coding, long-context enterprise workflows, careful reasoning, and high-reliability agents	SWE-bench Verified: 87.6%; GPQA Diamond: 94.2%; 1M context window ^[2]^[5]	Some Anthropic benchmark results are presented in release materials or partner/internal evals, and not every benchmark is directly comparable to public leaderboard runs ^[8].
Kimi K2.6	Best open-weight option if you want strong coding/agentic performance and self-hostability	SWE-Bench Verified: 80.2%; SWE-Bench Pro: 58.6%; Terminal-Bench 2.0: 66.7%; GPQA Diamond: 90.5%; HLE-Full with tools: 54.0% ^[13]	Kimi’s own model card notes different settings and re-evaluations for some comparison scores, so treat cross-model deltas cautiously ^[13].
DeepSeek V4-Pro-Max	Best value candidate; strong open-model performance with much lower reported API cost	GPQA Diamond: 90.1%; HLE no tools: 37.7%; HLE with tools: 48.2%; Terminal-Bench 2.0: 67.9% ^[6]	Evidence is newer and less independently settled; I would wait for more third-party validation before treating it as equal to closed frontier models ^[6].

How to interpret the numbers

For software engineering, Claude Opus 4.7’s 87.6% SWE-bench Verified score is the standout among the reported figures I found, while GPT-5.5’s 82.7% Terminal-Bench 2.0 score is the strongest reported command-line/agentic coding number in this set ^[4]^[5].
For broad tool-using agent tasks, GPT-5.5 looks especially strong because its official release reports high scores across OSWorld-Verified, BrowseComp, Toolathlon, Terminal-Bench 2.0, and CyberGym rather than just one narrow benchmark ^[4].
For open-weight deployment, Kimi K2.6 and DeepSeek V4 are the two to watch: Kimi K2.6 has a strong official benchmark table across coding, reasoning, vision, and agentic tasks, while DeepSeek V4 emphasizes frontier-like performance at much lower cost ^[6]^[13].
For hard academic/scientific reasoning, Claude Opus 4.7’s reported 94.2% GPQA Diamond leads the figures I found, while GPT-5.5 and DeepSeek V4 are reported lower on GPQA Diamond in the comparison data I found ^[5]^[6].
For price/performance, DeepSeek V4 appears meaningfully cheaper than GPT-5.5 and Claude Opus 4.7 in reported API pricing, but cheaper inference only matters if its quality is sufficient for your workload ^[6].

Evidence notes

The strongest source for GPT-5.5 is OpenAI’s official GPT-5.5 release page, which includes the benchmark table for Terminal-Bench 2.0, SWE-Bench Pro, OSWorld-Verified, BrowseComp, FrontierMath, Toolathlon, and CyberGym ^[4].
The strongest source for Claude Opus 4.7 positioning, context window, availability, and pricing is Anthropic’s Opus 4.7 product/release material ^[2]^[8]. The cleanest text benchmark figures I found for Opus 4.7’s SWE-bench Verified and GPQA Diamond scores came from LLM Stats reporting ^[5].
The strongest source for Kimi K2.6 is Moonshot AI’s Hugging Face model card, which includes architecture details, context length, and the evaluation table ^[13].
The strongest text source I found for DeepSeek V4’s direct comparison against GPT-5.5 and Claude Opus 4.7 was VentureBeat’s launch/benchmark report ^[6].

Limitations / uncertainty

I found no single, independent, fully apples-to-apples benchmark suite that evaluates all four models with the same prompts, tools, budgets, reasoning effort, and scoring harness.
Some numbers are vendor-reported, some are third-party summaries, and some benchmarks use different tool access or reasoning-effort settings, so small score differences should not be overinterpreted ^[4]^[8]^[13].
For real buying/deployment decisions, I would run your own 20–50 task eval set with your actual prompts, files, tools, latency targets, and cost constraints.
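
To make the warning about small score differences concrete for a 20–50 task eval set, here is a minimal sketch that puts a 95% Wilson score interval around a pass rate. The pass counts (40 and 43 out of 50) are hypothetical, chosen only to show how wide the intervals are at this sample size.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical results for two candidates on the same 50-task eval set.
low_a, high_a = wilson_interval(40, 50)
low_b, high_b = wilson_interval(43, 50)

print(f"Model A: 40/50, 95% interval {low_a:.3f} to {high_a:.3f}")
print(f"Model B: 43/50, 95% interval {low_b:.3f} to {high_b:.3f}")
print("Intervals overlap:", low_b <= high_a and low_a <= high_b)
```

When the intervals overlap this much, a gap of a few tasks is weak evidence on its own; adding tasks or looking at which specific tasks fail is usually more informative than the headline percentage.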

Summary

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model ... (latent.space)
Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Clou...
[6] moonshotai/Kimi-K2.6 - Hugging Face (huggingface.co)
OSWorld-Verified 73.1 75.0 72.7 63.3 Coding Terminal-Bench 2.0 (Terminus-2) 66.7 65.4 65.4 68.5 50.8 SWE-Bench Pro 58.6 57.7 53.4 54.2 50.7 SWE-Bench Multilingual 76.7 77.8 76.9 73.0 SWE-Bench Verified 80.2 80.8 80.6 76.8 SciCode 52.2 56.6 51.9 58.9 48.7 OJ...
[11] AI Leaderboard 2026 - Compare Top AI Models & Rankings (llm-stats.com)
19 Image 20: Moonshot AI Kimi K2.6NEW Moonshot AI 1,157 — 90.5% 80.2% 262K $0.95 $4.00 Open Source 20 Image 21: OpenAI GPT-5.2 Codex OpenAI 1,148 812 — — 400K $1.75 $14.00 Proprietary [...] 6 Image 7: Anthropic Claude Opus 4.5 Anthropic 1,614 1,342 87.0% 80...
[16] Claude Opus 4.7: Benchmarks, Pricing, Context & What's New (llm-stats.com)
Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$2...
[17] Introducing Claude Opus 4.7 - Anthropic (anthropic.com)
CyberGym: Opus 4.6’s score has been updated from the originally reported 66.6 to 73.8, as we updated our harness parameters to better elicit cyber capability. SWE-bench Multimodal: We used an internal implementation for both Opus 4.7 and Opus 4.6. Scores ar...
[18] SWE-Bench Verified Leaderboard - LLM Stats (llm-stats.com)
Model Score Size Context Cost License --- --- --- 1 Anthropic Claude Mythos Preview Anthropic 0.939 — — $25.00 / $125.00 2 Anthropic Claude Opus 4.7 Anthropic 0.876 — 1.0M $5.00 / $25.00 3 Anthropic Claude Opus 4.5 Anthropic 0.809 — 200K $5.00 / $25.00 4 An...
[24] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 4...
[27] GPT-5.5 Benchmarks 2026: Scores, Rankings & Performance (benchlm.ai)
Core Rankings Specialized Use Cases Dashboards Directories Guides & Lists Tools GPT-5.5 According to BenchLM.ai, GPT-5.5 ranks 5 out of 112 models on the provisional leaderboard with an overall score of 89/100. It also ranks 2 out of 16 on the verified lead...
[28] GPT-5.5: Pricing, Benchmarks & Performance - LLM Stats (llm-stats.com)
9Image 42GPT-5 mini 0.22 10Image 43o3 0.16 GPQAView → 4 of 10 Image 44: LLM Stats Logo A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, w...
[30] AI Model Benchmarks Apr 2026 | Compare GPT-5, Claude ... (lmcouncil.ai)
METR Time Horizons Model Minutes --- 1 Claude Opus 4.6 (unknown thinking) 718.8 ±1815.2 2 GPT-5.2 (high) 352.2 ±335.5 3 GPT-5.3 Codex 349.5 ±333.1 4 Claude Opus 4.5 (no thinking) 293.0 ±239.0 5 Claude Opus 4.5 (16k thinking) 288.9 ±558.2 SWE-bench Verified...
[36] GPT-5.5 Doubles the Price, Google Goes Full Agent, DeepSeek V4 ... (thecreatorsai.com)
GPT-5.5 is out — $5 per million input, $30 per million output. That's exactly double GPT-5.4 and 20% more than Claude Opus 4.7. OpenAI released ... 21 hours ago
[38] Introducing GPT-5.5 - OpenAI (openai.com)
Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis ... 2 days ago

Quick verdict by type of work

Coding agents that live in the terminal: GPT-5.5 has the best cited Terminal-Bench 2.0 result within the shared comparison, at 82.7% ^[24].
Software repair: Claude Opus 4.7 leads the cited SWE-Bench Pro row at 64.3% and the SWE-Bench Verified row at 87.6% ^[18]^[24].
Hard reasoning without tools: Claude Opus 4.7 tops the shared GPQA Diamond and Humanity’s Last Exam (no tools) rows ^[24].
Tool-assisted reasoning and browsing: GPT-5.5 Pro leads Humanity’s Last Exam with tools at 57.2% and BrowseComp at 90.1%, where that Pro variant is reported ^[24].
Open-weight deployment: Kimi K2.6 is the clearest candidate in the cited sources. It is described as a 1-trillion-parameter MoE model with 32 billion active parameters and a 256K context window ^[1].
Price-sensitive hosted inference: DeepSeek-V4-Pro-Max is the value candidate to validate. LLM Stats lists it with a 1M-token context, 80.6% on SWE-Bench Verified, and cost columns of $1.74/$3.48 ^[18].

Benchmark comparison table

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Kimi K2.6	DeepSeek-V4-Pro-Max
GPQA Diamond	93.6% ^[24]	—	94.2% ^[24]	≈91% ^[28]	90.1% ^[24]
Humanity’s Last Exam, no tools	41.4% ^[24]	43.1% ^[24]	46.9% ^[24]	—	37.7% ^[24]
Humanity’s Last Exam, with tools	52.2% ^[24]	57.2% ^[24]	54.7% ^[24]	54.0% ^[1]	48.2% ^[24]
Terminal-Bench 2.0	82.7% ^[24]	—	69.4% ^[24]	66.7% ^[6]	67.9% ^[24]
SWE-Bench Pro	58.6% ^[24]	—	64.3% ^[24]	58.6% ^[6]	55.4% ^[24]
BrowseComp	84.4% ^[24]	90.1% ^[24]	79.3% ^[24]	83.2% ^[1]	83.4% ^[24]
MCP Atlas / MCPAtlas Public	75.3% ^[24]	—	79.1% ^[24]	—	73.6% ^[24]
SWE-Bench Verified	—	—	87.6% ^[18]	80.2% ^[6]	80.6% ^[18]
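
As a cross-check on the per-row leaders, the table above can be loaded into a small Python structure and scanned for the highest reported score in each row. This is only a sketch over the cited figures: missing cells are `None`, the approximate Kimi K2.6 GPQA Diamond value is left out as an assumption of convenience, and because rows mix sources and settings a "leader" here means nothing more than the best number in that row.

```python
from typing import Optional

MODELS = ["GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "Kimi K2.6", "DeepSeek-V4-Pro-Max"]

# Scores in percent, in MODELS order; None means the cell is not reported.
# The approximate Kimi K2.6 GPQA Diamond figure is omitted (treated as None).
SCORES: dict[str, list[Optional[float]]] = {
    "GPQA Diamond": [93.6, None, 94.2, None, 90.1],
    "Humanity's Last Exam, no tools": [41.4, 43.1, 46.9, None, 37.7],
    "Humanity's Last Exam, with tools": [52.2, 57.2, 54.7, 54.0, 48.2],
    "Terminal-Bench 2.0": [82.7, None, 69.4, 66.7, 67.9],
    "SWE-Bench Pro": [58.6, None, 64.3, 58.6, 55.4],
    "BrowseComp": [84.4, 90.1, 79.3, 83.2, 83.4],
    "MCP Atlas / MCPAtlas Public": [75.3, None, 79.1, None, 73.6],
    "SWE-Bench Verified": [None, None, 87.6, 80.2, 80.6],
}


def row_leaders(row: list[Optional[float]]) -> list[str]:
    """Return every model that shares the highest reported score in a row."""
    reported = [(score, model) for score, model in zip(row, MODELS) if score is not None]
    best = max(score for score, _ in reported)
    return [model for score, model in reported if score == best]


leaders = {benchmark: row_leaders(row) for benchmark, row in SCORES.items()}

for benchmark, models in leaders.items():
    print(f"{benchmark}: {', '.join(models)}")
```

Returning a list rather than a single name keeps ties visible instead of silently picking the first model.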

Which model to try first

Priority	Start with	Why
Terminal-oriented coding agents	GPT-5.5	It has the best cited Terminal-Bench 2.0 score within the shared comparison: 82.7% ^[24].
Software repair	Claude Opus 4.7	It leads the cited SWE-Bench Pro row and the SWE-Bench Verified row among these models ^[18]^[24].
Hard reasoning without tools	Claude Opus 4.7	It tops GPQA Diamond and Humanity’s Last Exam without tools in the shared comparison ^[24].
Tool-assisted reasoning or browsing	GPT-5.5 Pro	It leads Humanity’s Last Exam with tools and BrowseComp where GPT-5.5 Pro is reported separately ^[24].
Open-weight deployment	Kimi K2.6	It is described as an open-weight MoE with 1 trillion parameters, and its Hugging Face model card reports strong rows on coding benchmarks ^[1]^[6].
Hosted inference under cost pressure	DeepSeek-V4-Pro-Max	LLM Stats lists it with a 1M-token context, 80.6% on SWE-Bench Verified, and cost columns lower than those of Claude Opus 4.7 in the same ranking ^[18].
Long-context needs	GPT-5.5, Claude Opus 4.7, or DeepSeek-V4-Pro-Max	The cited sources list a 1M-token context for GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max; Kimi K2.6 appears at around 256K to 262K ^[1]^[11]^[16]^[18]^[27].

Notes by model

GPT-5.5

Claude Opus 4.7

Kimi K2.6

DeepSeek-V4-Pro-Max
