Reports · Published 28 Apr 2026 · Last edited 6 May 2026 · 8 sources

GPT-5.5 vs. Claude Opus 4.7 vs. DeepSeek V4 vs. Kimi K2.6: Which Model Wins Which Benchmark?

There is no clear overall winner: Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, GPT-5.5 leads Terminal-Bench 2.0 at 82.7%, and GPT-5.5 Pro leads HLE... DeepSeek V4 Pro Max is competitive in the shared table but leads none of the listed...


Benchmark tables often look more clear-cut than they are. For teams choosing a model, the decisive question is therefore not who wins the leaderboard overall. What matters more is which benchmark resembles your actual task.

The most reliable shared table compares GPT-5.5, GPT-5.5 Pro (where reported), Claude Opus 4.7, and DeepSeek-V4-Pro-Max. Kimi K2.6, by contrast, appears mainly in separate comparisons. That makes Kimi interesting for individual signals, but not cleanly comparable with the other three models in every category ^[4]^[11]^[13].

Winners by use case

Use case	Best-supported choice	Why
Scientific reasoning	Claude Opus 4.7	94.2% on GPQA Diamond, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4]
Expert reasoning without tools	Claude Opus 4.7	46.9% on Humanity’s Last Exam without tools, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4]
Tool-assisted exam reasoning	GPT-5.5 Pro	57.2% on Humanity’s Last Exam with tools, ahead of Claude Opus 4.7 at 54.7% ^[4]
Terminal and agentic computer tasks	GPT-5.5	82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5]
Operating-system environments	GPT-5.5	78.7% on OSWorld-Verified versus 78.0% for Claude Opus 4.7 ^[5]
Frontier mathematics	GPT-5.5	51.7% on FrontierMath Tiers 1–3 versus 43.8% for Claude Opus 4.7 ^[5]
Software engineering in the shared table	Claude Opus 4.7	64.3% on SWE-Bench Pro / SWE Pro, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% ^[4]
Browsing	GPT-5.5 Pro	90.1% on BrowseComp, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4]
MCP-style public tool workflows	Claude Opus 4.7	79.1% on MCP Atlas / MCPAtlas Public, ahead of GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% ^[4]
Vision and document analysis	Claude Opus 4.7	Reported as #1 in Vision & Document Arena, with wins in the diagram, homework, and OCR subcategories ^[1]
Cost-conscious selection	DeepSeek V4	VentureBeat describes DeepSeek V4 as near state-of-the-art at about one-sixth the cost of Opus 4.7 and GPT-5.5; this should still be verified on your own workload ^[4]
Least clean four-way comparison	Kimi K2.6	Kimi has usable reported scores, but the cited evidence comes mostly from separate comparisons rather than from the same table as GPT-5.5, Claude, and DeepSeek ^[11]^[13]

Benchmark table in detail

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Best-supported reading
GPQA Diamond	93.6% ^[4]	not reported	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, without tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	54.0% in a separate Kimi comparison ^[13]	GPT-5.5 Pro leads the shared table ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	not reported	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	66.7% in a separate Kimi comparison ^[13]	GPT-5.5 leads ^[4]^[5]
SWE-Bench Pro / SWE Pro	58.6% ^[4]	not reported	64.3% ^[4]	55.4% for DeepSeek-V4-Pro-Max ^[4]	58.6% in a separate Kimi comparison ^[13]	Claude leads the shared table ^[4]
BrowseComp	84.4% ^[4]	90.1% ^[4]	79.3% ^[4]	83.4% for DeepSeek-V4-Pro-Max ^[4]; 83.4% for DeepSeek-V4 Pro in a further comparison ^[11]	83.2% in a Kimi vs. DeepSeek comparison ^[11]	GPT-5.5 Pro leads the shared table ^[4]
MCP Atlas / MCPAtlas Public	75.3% ^[4]	not reported	79.1% ^[4]	73.6% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads ^[4]
OSWorld-Verified	78.7% ^[5]	not reported	78.0% ^[5]	not reported	not reported	GPT-5.5 is narrowly ahead of Claude ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	not reported	43.8% ^[5]	not reported	not reported	GPT-5.5 leads Claude ^[5]
Vision & Document Arena	not reported	not reported	Reported as #1 overall ^[1]	not reported	not reported	Claude has the only cited result here ^[1]
AIME 2026	not reported	not reported	not reported	not available in the cited Kimi vs. DeepSeek table ^[11]	96.4% in thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
APEX Agents	not reported	not reported	not reported	not available in the cited Kimi vs. DeepSeek table ^[11]	27.9% in thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
Context window	not reported	not reported	1,000k tokens in an Artificial Analysis comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in the same comparison ^[3]	not reported	Claude and DeepSeek V4 Pro are tied in this configuration ^[3]

Important: rows that mix several sources should be read with caution. A Kimi score from a Kimi-focused comparison is informative, but less reliable than a result from the same table and test setup as GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[4]^[11]^[13].
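As a minimal sketch of how the per-row leaders can be rechecked, the snippet below transcribes only the scores attributed to the shared table [4] and picks the highest reported value in each row. The names `MODELS`, `SHARED_TABLE`, `row_leader`, and `leaders` are illustrative assumptions, not part of any cited source, and Kimi K2.6 is left out on purpose because its scores come from separate comparisons.

```python
# Scores in percent, transcribed from the shared table cited as [4] above.
# None marks a cell that the table does not report.
MODELS = ("GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "DeepSeek-V4-Pro-Max")

SHARED_TABLE = {
    "GPQA Diamond": (93.6, None, 94.2, 90.1),
    "Humanity's Last Exam, no tools": (41.4, 43.1, 46.9, 37.7),
    "Humanity's Last Exam, with tools": (52.2, 57.2, 54.7, 48.2),
    "Terminal-Bench 2.0": (82.7, None, 69.4, 67.9),
    "SWE-Bench Pro": (58.6, None, 64.3, 55.4),
    "BrowseComp": (84.4, 90.1, 79.3, 83.4),
    "MCP Atlas": (75.3, None, 79.1, 73.6),
}


def row_leader(scores):
    """Return the model with the highest reported score in one row."""
    reported = [(s, m) for s, m in zip(scores, MODELS) if s is not None]
    return max(reported)[1]


def leaders(table):
    """Map each benchmark name to its leading model."""
    return {bench: row_leader(scores) for bench, scores in table.items()}


if __name__ == "__main__":
    for bench, model in leaders(SHARED_TABLE).items():
        print(f"{bench}: {model}")
```

Unreported cells are skipped rather than treated as zero, so a model is never ranked on a benchmark it was not scored on.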

GPT-5.5: strong in terminal, operating-system, math, and tool use

GPT-5.5’s clearest win is on Terminal-Bench 2.0: 82.7% versus 69.4% for Claude Opus 4.7 and 67.9% for DeepSeek-V4-Pro-Max in the shared table ^[4]^[5]. That is one of the largest gaps in the cited benchmark set.

GPT-5.5 also leads Claude Opus 4.7 on OSWorld-Verified, though only narrowly at 78.7% to 78.0% ^[5]. On FrontierMath Tiers 1–3 the lead is clearer: 51.7% for GPT-5.5 versus 43.8% for Claude ^[5].

GPT-5.5 Pro changes the picture once tools or browsing are central. On Humanity’s Last Exam with tools, GPT-5.5 Pro reaches 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4]. On BrowseComp, GPT-5.5 Pro leads at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4].

GPT-5.5 does not win every reasoning scenario, however. Claude Opus 4.7 is narrowly ahead on GPQA Diamond in the shared table, at 94.2% versus 93.6% for GPT-5.5 ^[4]. A separate GPT-5.5 guide also lists GPT-5.5-only scores such as 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench. These should not be read as four-way wins, because the cited excerpt reports no corresponding results for Claude Opus 4.7, DeepSeek V4, or Kimi K2.6 ^[7].

Claude Opus 4.7: strong in reasoning without tools and in documents

In the most important shared table, Claude Opus 4.7 has the best profile for reasoning without external tools. The model leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9% ^[4]. Claude is also ahead on SWE-Bench Pro / SWE Pro at 64.3%, and on MCP Atlas / MCPAtlas Public at 79.1% ^[4].

Claude looks weaker in the cited data on terminal-style tasks. GPT-5.5 is more than 13 points ahead of Claude on Terminal-Bench 2.0, 82.7% to 69.4%, and also leads on OSWorld-Verified and FrontierMath Tiers 1–3 ^[4]^[5].

The strongest documented multimodal signal, by contrast, comes from Claude. One source reports that Claude Opus 4.7 takes first place in Vision & Document Arena, improves by 4 points over Opus 4.6 in Document Arena, and wins the diagram, homework, and OCR subcategories ^[1]. The same source, however, provides no comparable numeric Vision & Document Arena scores for GPT-5.5, DeepSeek V4, or Kimi K2.6. It therefore supports Claude’s document strength, but not a complete four-way multimodal ranking ^[1].

DeepSeek V4: competitive, but the documented trump card is price-performance

With DeepSeek, the model label matters. The shared benchmark table reports DeepSeek-V4-Pro-Max, while the Artificial Analysis comparison lists DeepSeek V4 Pro with a 1,000k-token context window ^[4]^[3]. These labels should not automatically be treated as the same model.

In the shared table, DeepSeek-V4-Pro-Max is competitive but leads no row. The reported scores are 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public ^[4].

The most important documented product advantage is not a single category win but cost-performance. VentureBeat describes DeepSeek V4 as near state-of-the-art at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4]. That is a good reason to test DeepSeek on cost-critical workloads, but no substitute for measuring quality yourself.

For long-context screening, an Artificial Analysis comparison lists both DeepSeek V4 Pro and Claude Opus 4.7 with 1,000k-token context windows ^[3]. That indicates a tie in exactly these listed configurations, not automatically for every DeepSeek or Claude variant ^[3].

Kimi K2.6: interesting scores, but weaker direct comparability

Kimi K2.6 is the hardest model in this set to place cleanly, because it does not appear in the central shared table with GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[4]. A Kimi-focused comparison lists, for K2.6, 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 ^[13]. The source states that the K2.6 scores come from an official Moonshot AI model card, but it compares mainly against Claude Opus 4.6 and GPT-5.4 rather than against this article’s exact group of four ^[13].

A separate Kimi vs. DeepSeek comparison reports, for Kimi K2.6, 96.4% on AIME 2026 in thinking mode, 27.9% on APEX Agents in thinking mode, and 83.2% on BrowseComp with thinking mode and context management ^[11]. In the same source, DeepSeek-V4 Pro is at 83.4% on BrowseComp; no DeepSeek scores are available there for AIME 2026 or APEX Agents ^[11].

Kimi K2.6 therefore remains a model worth testing, especially for coding, agent, math, and browsing tasks. The available evidence is not enough, however, for a clean overall verdict against GPT-5.5 and Claude Opus 4.7 across the same benchmark suite ^[11]^[13].

Which models should you test first?

Test GPT-5.5 first for terminal-heavy agents, operating-system tasks, and FrontierMath-like work; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results ^[4]^[5].
Test GPT-5.5 Pro first when tool-assisted reasoning or browsing is central; it leads Humanity’s Last Exam with tools and BrowseComp in the shared table ^[4].
Test Claude Opus 4.7 first for GPQA-style scientific reasoning, expert questions without tools, SWE-Bench-Pro-like software engineering, MCP-style workflows, and document-heavy multimodal tasks ^[4]^[1].
Test DeepSeek V4 first when cost-performance is the binding constraint and you can run your own quality checks; the documented advantage is the reported near-frontier performance at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4].
Test Kimi K2.6 first when you specifically want to check the reported coding, agent, math, and browsing signals. Then compare it using the same prompts, tools, context limits, latency targets, and scoring rules as the other models ^[11]^[13].
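The advice to use the same prompts and scoring rules for every model can be sketched as a tiny harness. This is only an illustration under stated assumptions: `Model` is a hypothetical callable that takes a prompt and returns an answer (in practice it would wrap whichever API client you use), `exact_match` is one possible scoring rule, and tools, context limits, and latency targets are not modeled here.

```python
from typing import Callable

# Hypothetical interface: a model is any function from prompt to answer text.
Model = Callable[[str], str]
Task = tuple[str, str]  # (prompt, expected answer)


def exact_match(answer: str, expected: str) -> bool:
    """One possible scoring rule: whitespace-insensitive exact match."""
    return answer.strip() == expected.strip()


def pass_rate(model: Model, tasks: list[Task],
              score: Callable[[str, str], bool] = exact_match) -> float:
    """Percentage of tasks the model passes under the given scoring rule."""
    passed = sum(score(model(prompt), expected) for prompt, expected in tasks)
    return 100.0 * passed / len(tasks)


def compare(models: dict[str, Model], tasks: list[Task],
            score: Callable[[str, str], bool] = exact_match) -> dict[str, float]:
    """Run every model on the identical task list with the identical scorer."""
    return {name: pass_rate(model, tasks, score) for name, model in models.items()}
```

Because `compare` hands the same `tasks` and `score` to every model, differences in the resulting pass rates reflect the models rather than the test setup, which is the property the separate Kimi comparisons lack.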

Benchmark pitfalls that really matter

These numbers are not a universal ranking. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6 ^[3]^[4]^[11]^[13]. Some results are also vendor-reported; for its GPT evaluations on ARC-AGI, OpenAI notes that they were run with reasoning effort set to xhigh in a research environment that may differ in some cases from the production version of ChatGPT ^[5]^[8].

Narrower gaps are better read as directional signals. Claude’s lead over GPT-5.5 on GPQA Diamond is 0.6 points, and GPT-5.5 is only 0.7 points ahead of Claude on OSWorld-Verified ^[4]^[5]. Larger gaps are more actionable: GPT-5.5 is more than 13 points ahead of Claude on Terminal-Bench 2.0, and its lead over Claude on FrontierMath is 7.9 points ^[5].
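The gap arithmetic above can be reproduced with a short sketch. The scores are the ones quoted in this article; the 1.0-point cut-off between "narrow" and "large" is a hypothetical threshold chosen only for illustration and is not taken from any cited source.

```python
# (leader score, runner-up score) in percent, as quoted in this article.
HEAD_TO_HEAD = {
    "GPQA Diamond (Claude Opus 4.7 over GPT-5.5)": (94.2, 93.6),
    "OSWorld-Verified (GPT-5.5 over Claude Opus 4.7)": (78.7, 78.0),
    "Terminal-Bench 2.0 (GPT-5.5 over Claude Opus 4.7)": (82.7, 69.4),
    "FrontierMath Tiers 1-3 (GPT-5.5 over Claude Opus 4.7)": (51.7, 43.8),
}

# Hypothetical cut-off: gaps below one point count as directional only.
NARROW_POINTS = 1.0


def margin(leader: float, runner_up: float) -> float:
    """Gap in percentage points, rounded to one decimal place."""
    return round(leader - runner_up, 1)


def classify(gap: float) -> str:
    """Label a gap as 'narrow' or 'large' under the assumed cut-off."""
    return "narrow" if gap < NARROW_POINTS else "large"


if __name__ == "__main__":
    for label, (leader, runner_up) in HEAD_TO_HEAD.items():
        gap = margin(leader, runner_up)
        print(f"{label}: {gap} points ({classify(gap)})")
```

Rounding to one decimal keeps floating-point noise out of the reported gaps; the right threshold for your own decision depends on how noisy the benchmark is for your workload.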

The practical conclusion: among GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, there is no single winner for everything. First pick the benchmark category closest to your real workload, then retest the candidate models on your own tasks.


Key takeaways

There is no clear overall winner: Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, GPT-5.5 leads Terminal-Bench 2.0 at 82.7%, and GPT-5.5 Pro leads HLE...
DeepSeek V4 Pro Max is competitive in the shared table but leads none of the listed rows; the most important documented advantage is its description as near state-of-the-art at about one-sixth of the...
Kimi K2.6 delivers useful scores but is less cleanly comparable, because the cited Kimi data comes mostly from separate comparisons ^[11]^[13].


Research conversation

You · 27 Apr 2026

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI · 27 Apr 2026 · 21 sources

Cited answer

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena ^[4]^[1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools ^[4]^[5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison ^[4]^[18]^[1].

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited OS/world-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) · latent.space
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking #1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison · artificialanalysis.ai
Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... · venturebeat.com
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[5] Everything You Need to Know About GPT-5.5 · vellum.ai
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI · o-mega.ai
Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI · openai.com
Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI · docsbot.ai
Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI · verdent.ai
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

Trendthemen auf Entdecken

BerichteVeröffentlicht28. Apr. 2026Last edited 6. Mai 20268 Quellen

GPT-5.5 vs. Claude Opus 4.7 vs. DeepSeek V4 vs. Kimi K2.6: Wer gewinnt welchen Benchmark?

Suchen und Fakten prüfen mit Studio Global AI Mehr von Entdecken ansehen

15K0

Winners by use case

Use case	Best-supported choice	Why
Scientific reasoning	Claude Opus 4.7	94.2% on GPQA Diamond, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4]
Expert reasoning without tools	Claude Opus 4.7	46.9% on Humanity’s Last Exam without tools, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4]
Tool-assisted exam reasoning	GPT-5.5 Pro	57.2% on Humanity’s Last Exam with tools, ahead of Claude Opus 4.7 at 54.7% ^[4]
Terminal and agentic computing tasks	GPT-5.5	82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5]
Operating OS environments	GPT-5.5	78.7% on OSWorld-Verified versus 78.0% for Claude Opus 4.7 ^[5]
Frontier mathematics	GPT-5.5	51.7% on FrontierMath Tiers 1–3 versus 43.8% for Claude Opus 4.7 ^[5]
Software engineering in the shared table	Claude Opus 4.7	64.3% on SWE-Bench Pro / SWE Pro, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% ^[4]
Browsing	GPT-5.5 Pro	90.1% on BrowseComp, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4]
MCP-style public tool workflows	Claude Opus 4.7	79.1% on MCP Atlas / MCPAtlas Public, ahead of GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% ^[4]
Vision and document analysis	Claude Opus 4.7	Reported as #1 in Vision & Document Arena, with wins in the diagram, homework, and OCR subcategories ^[1]
Cost-conscious selection	DeepSeek V4	VentureBeat describes DeepSeek V4 as near state-of-the-art at about one-sixth the cost of Opus 4.7 and GPT-5.5; verify this against your own workload ^[4]
Least clean four-way comparison	Kimi K2.6	Kimi has usable reported scores, but the cited evidence comes mostly from separate comparisons rather than from the same table as GPT-5.5, Claude, and DeepSeek ^[11]^[13]

Benchmark table in detail

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Best-supported reading
GPQA Diamond	93.6% ^[4]	not reported	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	54.0% in a separate Kimi comparison ^[13]	GPT-5.5 Pro leads the shared table ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	not reported	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	66.7% in a separate Kimi comparison ^[13]	GPT-5.5 leads ^[4]^[5]
SWE-Bench Pro / SWE Pro	58.6% ^[4]	not reported	64.3% ^[4]	55.4% for DeepSeek-V4-Pro-Max ^[4]	58.6% in a separate Kimi comparison ^[13]	Claude leads the shared table ^[4]
BrowseComp	84.4% ^[4]	90.1% ^[4]	79.3% ^[4]	83.4% for DeepSeek-V4-Pro-Max ^[4]; 83.4% for DeepSeek-V4 Pro in another comparison ^[11]	83.2% in a Kimi vs. DeepSeek comparison ^[11]	GPT-5.5 Pro leads the shared table ^[4]
MCP Atlas / MCPAtlas Public	75.3% ^[4]	not reported	79.1% ^[4]	73.6% for DeepSeek-V4-Pro-Max ^[4]	not reported	Claude leads ^[4]
OSWorld-Verified	78.7% ^[5]	not reported	78.0% ^[5]	not reported	not reported	GPT-5.5 is narrowly ahead of Claude ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	not reported	43.8% ^[5]	not reported	not reported	GPT-5.5 leads Claude ^[5]
Vision & Document Arena	not reported	not reported	Reported as #1 overall ^[1]	not reported	not reported	Claude has the only cited result here ^[1]
AIME 2026	not reported	not reported	not reported	not available in the cited Kimi vs. DeepSeek table ^[11]	96.4% in thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
APEX Agents	not reported	not reported	not reported	not available in the cited Kimi vs. DeepSeek table ^[11]	27.9% in thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
Context window	not reported	not reported	1,000k tokens in an Artificial Analysis comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in the same comparison ^[3]	not reported	Claude and DeepSeek V4 Pro are tied in this configuration ^[3]
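The leader column can be re-derived from the shared-table rows (source [4]). The following is a minimal Python sketch, not part of the original report: the dictionary layout and the helper name `leader_and_margin` are illustrative assumptions, and only the scores themselves are copied from the table above. Kimi K2.6 is left out because its values come from separate comparisons.

```python
# Scores in percent, copied from the shared-table rows above (source [4]).
# None marks a cell the table lists as "not reported".
MODELS = ("GPT-5.5", "GPT-5.5 Pro", "Claude Opus 4.7", "DeepSeek-V4-Pro-Max")

SHARED_TABLE = {
    "GPQA Diamond": (93.6, None, 94.2, 90.1),
    "HLE, no tools": (41.4, 43.1, 46.9, 37.7),
    "HLE, with tools": (52.2, 57.2, 54.7, 48.2),
    "Terminal-Bench 2.0": (82.7, None, 69.4, 67.9),
    "SWE-Bench Pro": (58.6, None, 64.3, 55.4),
    "BrowseComp": (84.4, 90.1, 79.3, 83.4),
    "MCP Atlas": (75.3, None, 79.1, 73.6),
}


def leader_and_margin(row):
    """Return (leading model, lead over the runner-up in percentage points)."""
    reported = sorted(
        ((score, model) for model, score in zip(MODELS, row) if score is not None),
        reverse=True,
    )
    (best, model), (second, _) = reported[0], reported[1]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return model, round(best - second, 1)


for benchmark, row in SHARED_TABLE.items():
    model, margin = leader_and_margin(row)
    print(f"{benchmark}: {model} leads by {margin} points")
```

Printing the margin next to the leader makes it easier to tell a narrow lead, such as GPQA Diamond, from a wide one, such as Terminal-Bench 2.0.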

GPT-5.5: strong on terminal, operating-system, math, and tool-use tasks

Claude Opus 4.7: strong on reasoning without tools and on documents

DeepSeek V4: competitive, but the supported advantage is price-performance

Kimi K2.6: interesting scores, but weaker direct comparability

Which models should you test first?

Test GPT-5.5 first for terminal-heavy agents, operating-system tasks, and FrontierMath-style work; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results ^[4]^[5].
Test GPT-5.5 Pro first when tool-assisted reasoning or browsing is the focus; it leads Humanity’s Last Exam with tools and BrowseComp in the shared table ^[4].
Test Claude Opus 4.7 first for GPQA-style scientific reasoning, expert questions without tools, SWE-Bench-Pro-style software engineering, MCP-style workflows, and document-heavy multimodal tasks ^[4]^[1].
Test DeepSeek V4 first when cost-performance is the main constraint and you can run your own quality checks; the supported advantage is the reported near-frontier performance at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4].
Test Kimi K2.6 first if you specifically want to check its reported coding, agent, math, and browsing signals. Then compare it against the other models using the same prompts, tools, context limits, latency targets, and scoring rules ^[11]^[13].
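One way to keep prompts, tools, context limits, latency targets, and scoring rules identical across models is to freeze them in a single settings object and pass that object to every model call. The Python sketch below is illustrative only. `EvalSettings`, `compare`, `exact_match`, and the stub `fake_call_model` are hypothetical names, exact-match scoring is an assumed rule, and no real vendor API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class EvalSettings:
    """Everything that must stay identical across models in one comparison."""
    prompts: Tuple[str, ...]
    expected: Tuple[str, ...]
    tools: Tuple[str, ...]
    max_context_tokens: int
    latency_budget_s: float


# Hypothetical adapter: (model, prompt, settings) -> (answer, latency in seconds).
ModelCall = Callable[[str, str, EvalSettings], Tuple[str, float]]


def exact_match(answer: str, expected: str) -> bool:
    """Assumed scoring rule: case-insensitive exact match."""
    return answer.strip().lower() == expected.strip().lower()


def compare(models: Sequence[str], settings: EvalSettings,
            call_model: ModelCall) -> Dict[str, float]:
    """Score every model on the same prompts under the same settings."""
    results = {}
    for model in models:
        hits = 0
        for prompt, expected in zip(settings.prompts, settings.expected):
            answer, latency = call_model(model, prompt, settings)
            # An answer that misses the latency target counts as a failure.
            if latency <= settings.latency_budget_s and exact_match(answer, expected):
                hits += 1
        results[model] = hits / len(settings.prompts)
    return results


def fake_call_model(model: str, prompt: str, settings: EvalSettings) -> Tuple[str, float]:
    """Stand-in for a real client: only 'model-a' answers correctly."""
    return ("4" if model == "model-a" else "5", 0.2)


settings = EvalSettings(
    prompts=("What is 2 + 2?",),
    expected=("4",),
    tools=(),
    max_context_tokens=8000,
    latency_budget_s=5.0,
)
print(compare(["model-a", "model-b"], settings, fake_call_model))
```

Because the settings object is frozen, a run cannot silently change the context limit or tool list for one model and not the others.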

Benchmark pitfalls that really matter


Key takeaways

There is no clear overall winner: Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, GPT-5.5 leads Terminal-Bench 2.0 at 82.7%, and GPT-5.5 Pro leads HLE...
DeepSeek V4 Pro Max is competitive in the shared table but leads none of the listed rows; its most important supported advantage is its description as near state-of-the-art at about one-sixth of the...
Kimi K2.6 provides useful scores but is less cleanly comparable, because the cited Kimi data comes mostly from separate comparisons ^[11]^[13].


Research conversation

You, 27 Apr. 2026

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI, 27 Apr. 2026, 21 sources

Cited answer

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].
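A list-price ratio such as "1/6th the cost" can be normalized by dividing the cost of one attempt by the success rate, which gives a cost per solved task. The Python sketch below only illustrates that arithmetic. Every price, token count, and success rate in it is a made-up placeholder, not a figure from the cited sources, and the function name `cost_per_solved_task` is hypothetical.

```python
def cost_per_solved_task(price_in_per_m: float, price_out_per_m: float,
                         tokens_in: int, tokens_out: int,
                         success_rate: float) -> float:
    """Expected spend per successfully solved task.

    Prices are per million tokens; success_rate is the fraction of
    attempts that pass your own quality check.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (tokens_in * price_in_per_m
                        + tokens_out * price_out_per_m) / 1_000_000
    return cost_per_attempt / success_rate


# Placeholder inputs only: model B has one-sixth the list price of model A,
# but emits twice as many output tokens and solves fewer tasks.
model_a = cost_per_solved_task(6.0, 30.0, tokens_in=20_000, tokens_out=4_000,
                               success_rate=0.8)
model_b = cost_per_solved_task(1.0, 5.0, tokens_in=20_000, tokens_out=8_000,
                               success_rate=0.6)

print(f"A: ${model_a:.2f} per solved task")
print(f"B: ${model_b:.2f} per solved task")
print(f"Effective ratio A/B: {model_a / model_b:.1f}x")
```

With these placeholder inputs, the sixfold list-price gap shrinks to roughly threefold per solved task, which is why the claim needs token usage and success rates from your own workload before it can be checked.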

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited OS-operation and math results in the GPT-5.5 versus Claude Opus 4.7 comparison: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?). latent.space
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking 1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison. artificialanalysis.ai
Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... venturebeat.com
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[5] Everything You Need to Know About GPT-5.5. vellum.ai
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI. o-mega.ai
Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI. openai.com
Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI. docsbot.ai
Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI. verdent.ai
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

Trendthemen auf Entdecken

BerichteVeröffentlicht28. Apr. 2026Last edited 6. Mai 20268 Quellen

GPT-5.5 vs. Claude Opus 4.7 vs. DeepSeek V4 vs. Kimi K2.6: Wer gewinnt welchen Benchmark?

Suchen und Fakten prüfen mit Studio Global AI Mehr von Entdecken ansehen

15K0

Die Gewinner nach Einsatzgebiet

Einsatzgebiet	Am besten belegte Wahl	Warum
Wissenschaftliches Reasoning	Claude Opus 4.7	94,2 % auf GPQA Diamond, vor GPT-5.5 mit 93,6 % und DeepSeek-V4-Pro-Max mit 90,1 % ^[4]
Experten-Reasoning ohne Tools	Claude Opus 4.7	46,9 % auf Humanity’s Last Exam ohne Tools, vor GPT-5.5 Pro mit 43,1 %, GPT-5.5 mit 41,4 % und DeepSeek-V4-Pro-Max mit 37,7 % ^[4]
Toolgestütztes Prüfungs-Reasoning	GPT-5.5 Pro	57,2 % auf Humanity’s Last Exam mit Tools, vor Claude Opus 4.7 mit 54,7 % ^[4]
Terminal- und agentische Computeraufgaben	GPT-5.5	82,7 % auf Terminal-Bench 2.0, vor Claude Opus 4.7 mit 69,4 % und DeepSeek-V4-Pro-Max mit 67,9 % ^[4]^[5]
Bedienung von Betriebssystem-Umgebungen	GPT-5.5	78,7 % auf OSWorld-Verified gegenüber 78,0 % für Claude Opus 4.7 ^[5]
Frontier-Mathematik	GPT-5.5	51,7 % auf FrontierMath Tiers 1–3 gegenüber 43,8 % für Claude Opus 4.7 ^[5]
Software Engineering in der gemeinsamen Tabelle	Claude Opus 4.7	64,3 % auf SWE-Bench Pro / SWE Pro, vor GPT-5.5 mit 58,6 % und DeepSeek-V4-Pro-Max mit 55,4 % ^[4]
Browsing	GPT-5.5 Pro	90,1 % auf BrowseComp, vor GPT-5.5 mit 84,4 %, DeepSeek-V4-Pro-Max mit 83,4 % und Claude Opus 4.7 mit 79,3 % ^[4]
MCP-artige öffentliche Tool-Workflows	Claude Opus 4.7	79,1 % auf MCP Atlas / MCPAtlas Public, vor GPT-5.5 mit 75,3 % und DeepSeek-V4-Pro-Max mit 73,6 % ^[4]
Vision und Dokumentanalyse	Claude Opus 4.7	Als Nummer 1 in der Vision & Document Arena berichtet, mit Siegen in den Unterkategorien Diagramme, Hausaufgaben und OCR ^[1]
Preisbewusste Auswahl	DeepSeek V4	VentureBeat beschreibt DeepSeek V4 als nahezu State-of-the-Art bei etwa einem Sechstel der Kosten von Opus 4.7 und GPT-5.5; das sollte aber am eigenen Workload geprüft werden ^[4]
Am wenigsten sauberer Vierer-Vergleich	Kimi K2.6	Kimi hat brauchbare gemeldete Werte, die zitierten Belege stammen aber überwiegend aus separaten Vergleichen statt aus derselben GPT-5.5-, Claude- und DeepSeek-Tabelle ^[11]^[13]

Benchmark-Tabelle im Detail

Benchmark / Fähigkeit	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Am besten belegte Lesart
GPQA Diamond	93,6 % ^[4]	nicht berichtet	94,2 % ^[4]	90,1 % für DeepSeek-V4-Pro-Max ^[4]	nicht berichtet	Claude führt die gemeinsame Tabelle an ^[4]
Humanity’s Last Exam, ohne Tools	41,4 % ^[4]	43,1 % ^[4]	46,9 % ^[4]	37,7 % für DeepSeek-V4-Pro-Max ^[4]	nicht berichtet	Claude führt die gemeinsame Tabelle an ^[4]
Humanity’s Last Exam, mit Tools	52,2 % ^[4]	57,2 % ^[4]	54,7 % ^[4]	48,2 % für DeepSeek-V4-Pro-Max ^[4]	54,0 % in einem separaten Kimi-Vergleich ^[13]	GPT-5.5 Pro führt die gemeinsame Tabelle an ^[4]
Terminal-Bench 2.0	82,7 % ^[4]^[5]	nicht berichtet	69,4 % ^[4]^[5]	67,9 % für DeepSeek-V4-Pro-Max ^[4]	66,7 % in einem separaten Kimi-Vergleich ^[13]	GPT-5.5 führt ^[4]^[5]
SWE-Bench Pro / SWE Pro	58,6 % ^[4]	nicht berichtet	64,3 % ^[4]	55,4 % für DeepSeek-V4-Pro-Max ^[4]	58,6 % in einem separaten Kimi-Vergleich ^[13]	Claude führt die gemeinsame Tabelle an ^[4]
BrowseComp	84,4 % ^[4]	90,1 % ^[4]	79,3 % ^[4]	83,4 % für DeepSeek-V4-Pro-Max ^[4]; 83,4 % für DeepSeek-V4 Pro in einem weiteren Vergleich ^[11]	83,2 % in einem Kimi-vs.-DeepSeek-Vergleich ^[11]	GPT-5.5 Pro führt die gemeinsame Tabelle an ^[4]
MCP Atlas / MCPAtlas Public	75,3 % ^[4]	nicht berichtet	79,1 % ^[4]	73,6 % für DeepSeek-V4-Pro-Max ^[4]	nicht berichtet	Claude führt ^[4]
OSWorld-Verified	78,7 % ^[5]	nicht berichtet	78,0 % ^[5]	nicht berichtet	nicht berichtet	GPT-5.5 liegt knapp vor Claude ^[5]
FrontierMath Tiers 1–3	51,7 % ^[5]	nicht berichtet	43,8 % ^[5]	nicht berichtet	nicht berichtet	GPT-5.5 führt vor Claude ^[5]
Vision & Document Arena	nicht berichtet	nicht berichtet	Als Nummer 1 insgesamt berichtet ^[1]	nicht berichtet	nicht berichtet	Claude hat hier das einzige zitierte Ergebnis ^[1]
AIME 2026	nicht berichtet	nicht berichtet	nicht berichtet	in der zitierten Kimi-vs.-DeepSeek-Tabelle nicht verfügbar ^[11]	96,4 % im Thinking Mode ^[11]	Nützliches Kimi-Signal, kein Vierer-Ranking ^[11]
APEX Agents	nicht berichtet	nicht berichtet	nicht berichtet	in der zitierten Kimi-vs.-DeepSeek-Tabelle nicht verfügbar ^[11]	27,9 % im Thinking Mode ^[11]	Nützliches Kimi-Signal, kein Vierer-Ranking ^[11]
Kontextfenster	nicht berichtet	nicht berichtet	1.000k Tokens in einem Artificial-Analysis-Vergleich ^[3]	1.000k Tokens für DeepSeek V4 Pro im selben Vergleich ^[3]	nicht berichtet	Claude und DeepSeek V4 Pro liegen in dieser Konfiguration gleichauf ^[3]

GPT-5.5: stark bei Terminal, Betriebssystem, Mathematik und Tool-Nutzung

Claude Opus 4.7: stark bei Reasoning ohne Werkzeuge und bei Dokumenten

DeepSeek V4: konkurrenzfähig, aber der belegte Trumpf ist Preis-Leistung

Kimi K2.6: interessante Werte, aber schwächere direkte Vergleichbarkeit

Welche Modelle sollten Sie zuerst testen?

Testen Sie GPT-5.5 zuerst für terminal-lastige Agenten, Betriebssystem-Aufgaben und FrontierMath-ähnliche Arbeit; es führt in den zitierten Terminal-Bench-2.0-, OSWorld-Verified- und FrontierMath-Ergebnissen ^[4]^[5].
Testen Sie GPT-5.5 Pro zuerst, wenn toolgestütztes Reasoning oder Browsing im Mittelpunkt steht; es führt bei Humanity’s Last Exam mit Tools und BrowseComp in der gemeinsamen Tabelle ^[4].
Testen Sie Claude Opus 4.7 zuerst für GPQA-artiges Wissenschafts-Reasoning, Expertenfragen ohne Tools, SWE-Bench-Pro-ähnliches Software Engineering, MCP-artige Workflows und dokumentlastige multimodale Aufgaben ^[4]^[1].
Testen Sie DeepSeek V4 zuerst, wenn Kosten-Leistung die wichtigste Grenze ist und Sie eigene Qualitätsprüfungen durchführen können; der belegte Vorteil ist die berichtete Near-Frontier-Leistung bei etwa einem Sechstel der Kosten von Opus 4.7 und GPT-5.5 ^[4].
Testen Sie Kimi K2.6 zuerst, wenn Sie gezielt die gemeldeten Coding-, Agenten-, Mathematik- und Browsing-Signale prüfen wollen. Vergleichen Sie es dann mit denselben Prompts, Tools, Kontextgrenzen, Latenzzielen und Bewertungsregeln wie die anderen Modelle ^[11]^[13].

Benchmark-Fallstricke, die wirklich zählen

Studio Global AI

Search, cite, and publish your own answer

Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.

Suchen und Fakten prüfen mit Studio Global AI

Wichtige Erkenntnisse

Es gibt keinen klaren Gesamtsieger: Claude Opus 4.7 führt bei GPQA Diamond mit 94,2 % und bei Humanity’s Last Exam ohne Tools mit 46,9 %, GPT 5.5 führt bei Terminal Bench 2.0 mit 82,7 %, und GPT 5.5 Pro führt bei HLE...
DeepSeek V4 Pro Max ist in der gemeinsamen Tabelle konkurrenzfähig, führt aber keine der gelisteten Zeilen an; der wichtigste belegte Vorteil ist die Darstellung als nahezu State of the Art bei etwa einem Sechstel der...
Kimi K2.6 liefert nützliche Scores, ist aber weniger sauber vergleichbar, weil die zitierten Kimi Daten überwiegend aus separaten Vergleichen stammen [11][13].

Die Leute fragen auch

Wie lautet die kurze Antwort auf „GPT-5.5 vs. Claude Opus 4.7 vs. DeepSeek V4 vs. Kimi K2.6: Wer gewinnt welchen Benchmark?“?

Was sind die wichtigsten Punkte, die zuerst validiert werden müssen?

Was soll ich als nächstes in der Praxis tun?

Kimi K2.6 liefert nützliche Scores, ist aber weniger sauber vergleichbar, weil die zitierten Kimi Daten überwiegend aus separaten Vergleichen stammen [11][13].

Welches verwandte Thema sollte ich als nächstes untersuchen?

Fahren Sie mit „Claude Opus 4.7 vs. GPT-5.5 vs. DeepSeek V4 vs. Kimi K2.6: Benchmark-Vergleich 2026“ für einen anderen Blickwinkel und zusätzliche Zitate fort.

Zugehörige Seite öffnen

Womit soll ich das vergleichen?

Vergleichen Sie diese Antwort mit „DeepSeek V4: 1 Mio. Token, MoE und der API-Umstieg“.

Zugehörige Seite öffnen

Setzen Sie Ihre Recherche fort

Forschungsgespräch

Du27. Apr. 2026

Forschungsfrage

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI27. Apr. 202621 Quellen

Zitierte Antwort

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].
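To make "competitive but trailing" concrete, the following sketch computes how far DeepSeek-V4-Pro-Max sits behind the best cited score on each benchmark in the table from ^[4]. The structure `TABLE` and the helper `gap_to_leader` are illustrative names. Only values visible in the cited excerpt are included, so GPT-5.5 Pro appears only where the excerpt shows a score for it.

```python
# Scores in percent from the cited comparison table (^[4]).
# Cells that are blank or truncated in the excerpt are left out, not guessed.
TABLE = {
    "GPQA Diamond": {
        "DeepSeek-V4-Pro-Max": 90.1,
        "GPT-5.5": 93.6,
        "Claude Opus 4.7": 94.2,
    },
    "Humanity's Last Exam, no tools": {
        "DeepSeek-V4-Pro-Max": 37.7,
        "GPT-5.5": 41.4,
        "GPT-5.5 Pro": 43.1,
        "Claude Opus 4.7": 46.9,
    },
    "Humanity's Last Exam, with tools": {
        "DeepSeek-V4-Pro-Max": 48.2,
        "GPT-5.5": 52.2,
        "GPT-5.5 Pro": 57.2,
        "Claude Opus 4.7": 54.7,
    },
    "Terminal-Bench 2.0": {
        "DeepSeek-V4-Pro-Max": 67.9,
        "GPT-5.5": 82.7,
        "Claude Opus 4.7": 69.4,
    },
}


def gap_to_leader(table, model):
    """Map each benchmark to (leading model, points by which `model` trails it)."""
    result = {}
    for benchmark, scores in table.items():
        leader = max(scores, key=scores.get)
        result[benchmark] = (leader, round(scores[leader] - scores[model], 1))
    return result


deepseek_gaps = gap_to_leader(TABLE, "DeepSeek-V4-Pro-Max")
for benchmark, (leader, gap) in deepseek_gaps.items():
    print(f"{benchmark}: {gap:.1f} points behind {leader}")
```

On these figures the deficit is smallest on GPQA Diamond and largest on Terminal-Bench 2.0, which matches the ordering implied by the scores listed above.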

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].
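The normalization problem behind the "1/6th the cost" claim can be illustrated with a small sketch. It assumes one simple way to normalize, namely cost per solved task (cost per attempt divided by success rate), and every price, token count, and success rate below is a hypothetical placeholder rather than a figure from any cited source. The names `Pricing`, `Workload`, `cost_per_attempt`, and `cost_per_solved_task` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Token prices in USD per million tokens (hypothetical values only)."""
    usd_per_million_input: float
    usd_per_million_output: float


@dataclass(frozen=True)
class Workload:
    """Average tokens consumed per task attempt (hypothetical values only)."""
    input_tokens: int
    output_tokens: int


def cost_per_attempt(pricing, workload):
    """USD spent on one attempt at a task."""
    total = (workload.input_tokens * pricing.usd_per_million_input
             + workload.output_tokens * pricing.usd_per_million_output)
    return total / 1_000_000


def cost_per_solved_task(pricing, workload, success_rate):
    """Expected USD per solved task, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(pricing, workload) / success_rate


# Placeholder inputs: model B is priced at one-sixth of model A per token,
# but solves fewer tasks on this made-up workload.
workload = Workload(input_tokens=10_000, output_tokens=2_000)
model_a = Pricing(usd_per_million_input=6.0, usd_per_million_output=30.0)
model_b = Pricing(usd_per_million_input=1.0, usd_per_million_output=5.0)

solved_a = cost_per_solved_task(model_a, workload, success_rate=0.8)
solved_b = cost_per_solved_task(model_b, workload, success_rate=0.6)
ratio = solved_b / solved_a
print(f"Cost per solved task, B relative to A: {ratio:.3f}")
```

With these placeholder inputs, a one-sixth per-token price becomes roughly a 1/4.5 cost per solved task, because the cheaper model needs more attempts. The sketch also assumes both models use the same number of tokens per attempt, which need not hold for models run at different reasoning or effort settings.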

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited computer-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath Tiers 1–3 ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) (latent.space)
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking #1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison (artificialanalysis.ai)
Metric (DeepSeek V4 Pro, Reasoning, Max Effort / Claude Opus 4.7, Non-reasoning, High Effort): Creator: DeepSeek / Anthropic; Context Window: 1000k tokens (1500 A4 pages of size 12 Arial font) / 1000k tokens (1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark (DeepSeek-V4-Pro-Max / GPT-5.5 / GPT-5.5 Pro, where shown / Claude Opus 4.7 / Best result among these): GPQA Diamond: 90.1% / 93.6% / - / 94.2% / Claude Opus 4.7; Humanity’s Last Exam, no tools: 37.7% / 41.4% / 43.1% / 46.9% / Claude Opus 4.7; Humanity’s Last Exam, with tools: 48.2% / 52.2% / 57.2% / 54...
[5] Everything You Need to Know About GPT-5.5 (vellum.ai)
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Domain-Specific Benchmarks (Benchmark / GPT-5.5 / Notes): Harvey BigLaw Bench / 91.7% (43% perfect scores) / Legal reasoning, audience calibration; Internal Investment Banking / 88.5% / Financial analysis tasks; BixBench (bioinformatics) / 80.5% (up from 74.0%) / +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI (openai.com)
Abstract reasoning (Eval: GPT-5.5 / GPT-5.4 / GPT-5.5 Pro / GPT-5.4 Pro / Claude Opus 4.7 / Gemini 3.1 Pro): ARC-AGI-1 (Verified): 95.0% / 93.7% / - / 94.5% / 93.5% / 98.0%; ARC-AGI-2 (Verified): 85.0% / 73.3% / - / 83.3% / 75.8% / 77.1%. Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI (docsbot.ai)
Benchmark (Kimi K2.6 / DeepSeek-V4 Pro): AIME 2026 (American Invitational Mathematics Examination 2026; evaluates advanced mathematical problem-solving abilities, contest-level math): 96.4%, Thinking mode / Not available; APEX Agents: Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark (K2.6 / Claude Opus 4.6 / GPT-5.4 / Notes): SWE-Bench Pro: 58.60% / 53.40% / 57.70% / Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9%. SWE-Bench Verified: 80.20% / 80.80% / 80% / Tight cluster; Opus 4.7 now leads at 87.6%. T...