Report · Published 28 Apr 2026 · Last edited 6 May 2026 · 8 sources

GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6



If you look only at the benchmark tables, the contest between GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 looks like a race with a single winner. The reality is not that simple. The strongest shared table covers GPT-5.5, GPT-5.5 Pro where available, Claude Opus 4.7, and DeepSeek-V4-Pro-Max, while Kimi K2.6 appears more often in separate comparisons, so the comparison is not apples to apples in every category ^[4]^[11]^[13].

The safest way to read the results is to start from your type of work. For science and expert question answering without tools, Claude is stronger. For terminal work, OS operation, frontier math, and (in the Pro variant) browsing, GPT-5.5 has several clear wins. On cost, DeepSeek V4 is worth testing. For Kimi K2.6, the signal is interesting but needs retesting under the same harness.

Quick winners by category

Work need	Pick with the strongest data support	Why
Scientific reasoning	Claude Opus 4.7	94.2% on GPQA Diamond, ahead of GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4]
Expert reasoning without tools	Claude Opus 4.7	46.9% on Humanity’s Last Exam without tools, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4]
Exam reasoning with tools	GPT-5.5 Pro	57.2% on Humanity’s Last Exam with tools, ahead of Claude Opus 4.7 at 54.7% ^[4]
Terminal and agentic computing	GPT-5.5	82.7% on Terminal-Bench 2.0, well ahead of Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5]
OS operation	GPT-5.5	78.7% on OSWorld-Verified versus Claude Opus 4.7 at 78.0% ^[5]
Frontier math	GPT-5.5	51.7% on FrontierMath Tiers 1–3 versus Claude Opus 4.7 at 43.8% ^[5]
Software engineering in the shared table	Claude Opus 4.7	64.3% on SWE-Bench Pro / SWE Pro, ahead of GPT-5.5 at 58.6% and DeepSeek-V4-Pro-Max at 55.4% ^[4]
Browsing	GPT-5.5 Pro	90.1% on BrowseComp, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4]
MCP-style public tool workflows	Claude Opus 4.7	79.1% on MCP Atlas / MCPAtlas Public, ahead of GPT-5.5 at 75.3% and DeepSeek-V4-Pro-Max at 73.6% ^[4]
Vision and document analysis	Claude Opus 4.7	Reported #1 in Vision & Document Arena, including wins in the diagram, homework, and OCR subcategories ^[1]
Cost-sensitive evaluation	DeepSeek V4	VentureBeat describes DeepSeek V4 as near state-of-the-art at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the cost claim still needs validating on your own workload ^[4]
Least clean four-way comparison	Kimi K2.6	Kimi's scores are useful, but most of the cited evidence comes from separate comparisons rather than the same shared table ^[11]^[13]

Main benchmark table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Safest reading
GPQA Diamond	93.6% ^[4]	Not reported	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, without tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	54.0% in a separate Kimi comparison ^[13]	GPT-5.5 Pro leads the shared table ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Not reported	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	66.7% in a separate Kimi comparison ^[13]	GPT-5.5 leads ^[4]^[5]
SWE-Bench Pro / SWE Pro	58.6% ^[4]	Not reported	64.3% ^[4]	55.4% for DeepSeek-V4-Pro-Max ^[4]	58.6% in a separate Kimi comparison ^[13]	Claude leads the shared table ^[4]
BrowseComp	84.4% ^[4]	90.1% ^[4]	79.3% ^[4]	83.4% for DeepSeek-V4-Pro-Max ^[4]; 83.4% for DeepSeek-V4 Pro in another comparison ^[11]	83.2% in the Kimi vs DeepSeek comparison ^[11]	GPT-5.5 Pro leads the shared table ^[4]
MCP Atlas / MCPAtlas Public	75.3% ^[4]	Not reported	79.1% ^[4]	73.6% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads ^[4]
OSWorld-Verified	78.7% ^[5]	Not reported	78.0% ^[5]	Not reported	Not reported	GPT-5.5 narrowly ahead of Claude ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Not reported	43.8% ^[5]	Not reported	Not reported	GPT-5.5 leads Claude ^[5]
Vision & Document Arena	Not reported	Not reported	Reported #1 overall ^[1]	Not reported	Not reported	Only Claude has a cited result ^[1]
AIME 2026	Not reported	Not reported	Not reported	Not available in the cited Kimi vs DeepSeek table ^[11]	96.4% in Thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
APEX Agents	Not reported	Not reported	Not reported	Not available in the cited Kimi vs DeepSeek table ^[11]	27.9% in Thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
Context window	Not reported	Not reported	1,000k tokens in one Artificial Analysis comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in the same comparison ^[3]	Not reported	Claude and DeepSeek V4 Pro are level in that configuration ^[3]

Rows that mix sources should be read with care. Kimi scores that appear in Kimi-focused comparisons are still useful, but they carry less weight than results from a shared table and a shared harness with GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[4]^[11]^[13].
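
The per-benchmark leaders can be recomputed from the shared-table scores alone. The following is a minimal Python sketch and not part of the cited sources: the names `SHARED_TABLE` and `leader` are introduced here for illustration, and only the rows attributed to the shared table ^[4] are included, so Kimi K2.6 figures from separate comparisons are deliberately left out.

```python
# Scores in percent, copied from the rows attributed to the shared table [4].
# None marks a cell the table lists as not reported.
SHARED_TABLE = {
    "GPQA Diamond": {
        "GPT-5.5": 93.6, "GPT-5.5 Pro": None,
        "Claude Opus 4.7": 94.2, "DeepSeek-V4-Pro-Max": 90.1,
    },
    "Humanity's Last Exam, without tools": {
        "GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1,
        "Claude Opus 4.7": 46.9, "DeepSeek-V4-Pro-Max": 37.7,
    },
    "Humanity's Last Exam, with tools": {
        "GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2,
        "Claude Opus 4.7": 54.7, "DeepSeek-V4-Pro-Max": 48.2,
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "GPT-5.5 Pro": None,
        "Claude Opus 4.7": 69.4, "DeepSeek-V4-Pro-Max": 67.9,
    },
    "SWE-Bench Pro": {
        "GPT-5.5": 58.6, "GPT-5.5 Pro": None,
        "Claude Opus 4.7": 64.3, "DeepSeek-V4-Pro-Max": 55.4,
    },
    "BrowseComp": {
        "GPT-5.5": 84.4, "GPT-5.5 Pro": 90.1,
        "Claude Opus 4.7": 79.3, "DeepSeek-V4-Pro-Max": 83.4,
    },
    "MCP Atlas": {
        "GPT-5.5": 75.3, "GPT-5.5 Pro": None,
        "Claude Opus 4.7": 79.1, "DeepSeek-V4-Pro-Max": 73.6,
    },
}


def leader(scores):
    """Return the model with the highest reported score, ignoring missing cells."""
    reported = {model: s for model, s in scores.items() if s is not None}
    return max(reported, key=reported.get)


if __name__ == "__main__":
    for benchmark, scores in SHARED_TABLE.items():
        print(f"{benchmark}: {leader(scores)}")
```

Skipping unreported cells rather than treating them as zero keeps a missing GPT-5.5 Pro score from being read as a loss.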

GPT-5.5: strongest for terminal, OS, math, and tools

GPT-5.5's clearest win is on Terminal-Bench 2.0: 82.7% versus Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% in the shared table ^[4]^[5]. This gap is among the largest in the cited set of benchmarks.

GPT-5.5 also leads Claude Opus 4.7 on OSWorld-Verified, but narrowly: 78.7% to 78.0% ^[5]. On FrontierMath Tiers 1–3 the gap is more pronounced: 51.7% for GPT-5.5 versus 43.8% for Claude ^[5].

If tools and browsing are central to the work, GPT-5.5 Pro changes the picture. It leads Humanity’s Last Exam with tools at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4]. GPT-5.5 Pro also leads BrowseComp at 90.1%, ahead of GPT-5.5 at 84.4%, DeepSeek-V4-Pro-Max at 83.4%, and Claude Opus 4.7 at 79.3% ^[4].

GPT-5.5 is not the winner on every reasoning test, however. Claude Opus 4.7 narrowly leads on GPQA Diamond, 94.2% versus 93.6%, in the shared table ^[4]. There are also GPT-5.5 domain-specific results such as 91.7% on Harvey BigLaw Bench, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench; these should not be read as four-way wins, because the available citations do not report equivalent scores for Claude Opus 4.7, DeepSeek V4, and Kimi K2.6 ^[7].

Claude Opus 4.7: strong for tool-free reasoning and documents

Claude Opus 4.7 has the strongest tool-free reasoning profile in the main shared table. It leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9% ^[4]. In the same table, Claude also leads SWE-Bench Pro / SWE Pro at 64.3% and MCP Atlas / MCPAtlas Public at 79.1% ^[4].

The weaker area in the cited data is terminal-style operation. GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0, 82.7% versus 69.4%, and also leads on OSWorld-Verified and FrontierMath Tiers 1–3 ^[4]^[5].

For multimodal and document work, Claude has the strongest signal. One source reports Claude Opus 4.7 as #1 in Vision & Document Arena, up 4 points over Opus 4.6 in Document Arena, with wins in the diagram, homework, and OCR subcategories ^[1]. The same source, however, gives no comparable Vision & Document Arena scores for GPT-5.5, DeepSeek V4, or Kimi K2.6, so it supports Claude's strength on documents rather than a complete four-way multimodal ranking ^[1].

DeepSeek V4: competitive, but the main cited advantage is cost

The sources use several DeepSeek labels. The shared table uses DeepSeek-V4-Pro-Max, while the Artificial Analysis comparison refers to DeepSeek V4 Pro with a 1,000k-token context window ^[4]^[3]. These labels cannot automatically be treated as equivalent.

In the main shared table, DeepSeek-V4-Pro-Max is competitive but does not lead a single row. Its scores are 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, 67.9% on Terminal-Bench 2.0, 55.4% on SWE-Bench Pro / SWE Pro, 83.4% on BrowseComp, and 73.6% on MCP Atlas / MCPAtlas Public ^[4].

DeepSeek's strongest product claim in the sources is cost-performance. VentureBeat describes DeepSeek V4 as a model with near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4]. That is a good reason to test it on cost-sensitive workloads, but not a reason to skip quality validation on real work.

For long-context screening, one Artificial Analysis comparison lists DeepSeek V4 Pro and Claude Opus 4.7 as both having a 1,000k-token context window ^[3]. That means the two are level in that listed configuration; it is not a general claim about every DeepSeek or Claude mode ^[3].

Kimi K2.6: promising scores, but the direct comparison is less tidy

Kimi K2.6 is the hardest model to place cleanly in this comparison because it is not in the main shared table against GPT-5.5, Claude Opus 4.7, and DeepSeek-V4-Pro-Max ^[4]. One Kimi-focused comparison reports K2.6 at 58.6% on SWE-Bench Pro, 80.2% on SWE-Bench Verified, 66.7% on Terminal-Bench 2.0, 54.0% on Humanity’s Last Exam with tools, and 89.6% on LiveCodeBench v6 ^[13]. That source says the K2.6 figures come from Moonshot AI's official model card, but its main comparison set is Claude Opus 4.6 and GPT-5.4, not the exact four models in this article ^[13].

A separate Kimi vs DeepSeek comparison reports Kimi K2.6 at 96.4% on AIME 2026 in Thinking mode, 27.9% on APEX Agents in Thinking mode, and 83.2% on BrowseComp with Thinking mode and context management ^[11]. In the same source, DeepSeek-V4 Pro is listed at 83.4% on BrowseComp, while DeepSeek values for AIME 2026 and APEX Agents are not available ^[11].

So Kimi is worth testing, especially for coding, agentic tasks, math, and browsing. The cited material is not yet enough, however, to build a clean overall ranking against GPT-5.5 and Claude Opus 4.7 on the same benchmark suite ^[11]^[13].

Which model should you test first?

Test GPT-5.5 first for terminal-based agents, OS-operation tasks, and FrontierMath-like work; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results ^[4]^[5].
Test GPT-5.5 Pro first if reasoning with tools or browsing is the main need; it leads Humanity’s Last Exam with tools and BrowseComp in the shared table ^[4].
Test Claude Opus 4.7 first for GPQA-style scientific reasoning, expert question answering without tools, SWE-Bench Pro-style software engineering, MCP-style workflows, and document-heavy multimodal work ^[4]^[1].
Test DeepSeek V4 first if cost-performance is the main constraint and you can run your own quality checks; the cited advantage is near-frontier performance at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4].
Test Kimi K2.6 first if you specifically want to evaluate its reported coding, agentic, math, and browsing scores; compare it using the same prompts, tools, context limits, latency targets, and scoring rules as the other models ^[11]^[13].
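
Comparing models under the same prompts and scoring rules can be sketched as a small harness. This is an illustrative Python outline only, under several assumptions: `evaluate`, `exact_match`, `CASES`, and the two stub functions are hypothetical names invented here, the stubs stand in for real model clients, and exact-match scoring is just one possible rule. None of it reflects how the cited benchmarks were actually run.

```python
from typing import Callable, Dict, List, Tuple

# A "model" is any callable that maps a prompt to an answer string.
Model = Callable[[str], str]


def exact_match(answer: str, expected: str) -> bool:
    """One shared scoring rule: case-insensitive exact match."""
    return answer.strip().lower() == expected.strip().lower()


def evaluate(models: Dict[str, Model], cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run every model on the same cases and score each with the same rule."""
    results = {}
    for name, model in models.items():
        correct = sum(exact_match(model(prompt), expected) for prompt, expected in cases)
        results[name] = 100.0 * correct / len(cases)
    return results


# Hypothetical stand-ins; replace with clients for the models you can deploy.
def stub_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def stub_b(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


# Placeholder test cases; a real run would use prompts from your own workload.
CASES = [("What is 2 + 2?", "4"), ("Capital of France?", "paris")]

scores = evaluate({"stub_a": stub_a, "stub_b": stub_b}, CASES)

if __name__ == "__main__":
    for name, pct in scores.items():
        print(f"{name}: {pct:.1f}%")
```

Keeping the case list and the scoring function outside the per-model loop is what makes the resulting percentages comparable; tool access, context limits, and latency targets would need the same treatment in a real setup.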

Important caveats before using benchmark numbers

This is not a universal leaderboard. The sources mix base and Pro variants, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, Claude Opus 4.7, and Kimi K2.6 ^[3]^[4]^[11]^[13]. Some results are also vendor-reported; OpenAI notes that the GPT evaluations for ARC were run with reasoning effort set to xhigh in a research environment that may differ from production ChatGPT ^[5]^[8].

Thin margins are best treated as a direction, not a verdict. Claude's lead over GPT-5.5 on GPQA Diamond is only 0.6 points, and GPT-5.5's lead over Claude on OSWorld-Verified is only 0.7 points ^[4]^[5]. Large gaps are more useful for an initial decision: GPT-5.5 leads Claude by more than 13 points on Terminal-Bench 2.0 and by 7.9 points on FrontierMath ^[5].
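
The margins quoted above follow from simple subtraction of the cited scores. As a minimal Python sketch, assuming a hypothetical cutoff of 1.0 percentage point for what counts as "thin" (the sources name no threshold; `THIN_MARGIN`, `margin`, and `classify` are illustrative names):

```python
# (benchmark, leader, leader score, runner-up, runner-up score), from [4] and [5].
PAIRS = [
    ("GPQA Diamond", "Claude Opus 4.7", 94.2, "GPT-5.5", 93.6),
    ("OSWorld-Verified", "GPT-5.5", 78.7, "Claude Opus 4.7", 78.0),
    ("Terminal-Bench 2.0", "GPT-5.5", 82.7, "Claude Opus 4.7", 69.4),
    ("FrontierMath Tiers 1-3", "GPT-5.5", 51.7, "Claude Opus 4.7", 43.8),
]

# Hypothetical cutoff in percentage points; an assumption, not from the sources.
THIN_MARGIN = 1.0


def margin(leader_score, runner_up_score):
    """Difference in percentage points, rounded to avoid float noise."""
    return round(leader_score - runner_up_score, 1)


def classify(points):
    """Label a margin as 'thin' or 'large' relative to THIN_MARGIN."""
    return "thin" if points < THIN_MARGIN else "large"


if __name__ == "__main__":
    for bench, lead, lead_score, other, other_score in PAIRS:
        pts = margin(lead_score, other_score)
        print(f"{bench}: {lead} over {other} by {pts} points ({classify(pts)})")
```

Any fixed cutoff is a judgment call; without confidence intervals from the benchmark runs, it cannot say whether a given gap is statistically meaningful.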

The practical conclusion: there is no single winner among GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6. Pick the benchmark category that most closely resembles your real work, then rerun the evaluation on the models you can actually deploy.


Key points

Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, while GPT-5.5 leads Terminal-Bench 2.0 at 82.7% ^[4]^[5].
GPT-5.5 Pro is the strongest pick in the cited results for reasoning with tools and browsing: 57.2% on Humanity’s Last Exam with tools and 90.1% on BrowseComp ^[4].
DeepSeek V4 is the most attractive on cost-performance, while Kimi K2.6 has promising scores but is not yet as cleanly placed as the other three models in a like-for-like four-way comparison ^[4]^[11]^[13].


Research conversation

You, 27 Apr 2026

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI, 27 Apr 2026, 21 sources

Cited answer

The available evidence supports a clear but incomplete picture: Claude Opus 4.7 leads the cited GPQA Diamond and Humanity’s Last Exam no-tools results and is reported #1 in Vision & Document Arena ^[4]^[1]. GPT-5.5 leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results, while GPT-5.5 Pro leads Humanity’s Last Exam with tools ^[4]^[5]. DeepSeek V4 is reported as near-frontier at much lower cost and as the top open-weight model on one coding benchmark, but Kimi K2.6 has insufficient numeric evidence in the provided excerpts for a full benchmark comparison ^[4]^[18]^[1].

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited OS-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) (latent.space)
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking 1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison (artificialanalysis.ai)
Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[5] Everything You Need to Know About GPT-5.5 (vellum.ai)
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI (openai.com)
Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI (docsbot.ai)
Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...

Temukan yang Sedang Tren

LaporanDipublikasikan28 Apr 2026Last edited 6 Mei 20268 sumber

GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4 vs Kimi K2.6

Cari dan periksa fakta dengan Studio Global AI Jelajahi lebih banyak dari Discover

15K0

Pemenang cepat per kategori

Kebutuhan kerja	Pilihan dengan dukungan data terkuat	Alasannya
Reasoning sains	Claude Opus 4.7	94,2% di GPQA Diamond, di atas GPT-5.5 pada 93,6% dan DeepSeek-V4-Pro-Max pada 90,1% ^[4]
Reasoning ahli tanpa tool	Claude Opus 4.7	46,9% di Humanity’s Last Exam tanpa tool, di atas GPT-5.5 Pro 43,1%, GPT-5.5 41,4%, dan DeepSeek-V4-Pro-Max 37,7% ^[4]
Reasoning ujian dengan tool	GPT-5.5 Pro	57,2% di Humanity’s Last Exam with tools, di atas Claude Opus 4.7 pada 54,7% ^[4]
Terminal dan komputasi agentic	GPT-5.5	82,7% di Terminal-Bench 2.0, jauh di atas Claude Opus 4.7 69,4% dan DeepSeek-V4-Pro-Max 67,9% ^[4]^[5]
Operasi OS	GPT-5.5	78,7% di OSWorld-Verified versus Claude Opus 4.7 pada 78,0% ^[5]
Matematika frontier	GPT-5.5	51,7% di FrontierMath Tiers 1–3 versus Claude Opus 4.7 pada 43,8% ^[5]
Software engineering dalam tabel bersama	Claude Opus 4.7	64,3% di SWE-Bench Pro / SWE Pro, di atas GPT-5.5 58,6% dan DeepSeek-V4-Pro-Max 55,4% ^[4]
Browsing	GPT-5.5 Pro	90,1% di BrowseComp, di atas GPT-5.5 84,4%, DeepSeek-V4-Pro-Max 83,4%, dan Claude Opus 4.7 79,3% ^[4]
Workflow tool publik ala MCP	Claude Opus 4.7	79,1% di MCP Atlas / MCPAtlas Public, di atas GPT-5.5 75,3% dan DeepSeek-V4-Pro-Max 73,6% ^[4]
Vision dan analisis dokumen	Claude Opus 4.7	Dilaporkan nomor 1 di Vision & Document Arena, termasuk menang di subkategori diagram, homework, dan OCR ^[1]
Evaluasi sensitif biaya	DeepSeek V4	VentureBeat menyebut DeepSeek V4 mendekati state-of-the-art dengan biaya sekitar seperenam Opus 4.7 dan GPT-5.5, tetapi klaim biaya tetap perlu divalidasi pada workload sendiri ^[4]
Perbandingan empat arah paling tidak bersih	Kimi K2.6	Skor Kimi berguna, tetapi bukti yang dikutip sebagian besar berasal dari perbandingan terpisah, bukan tabel bersama yang sama ^[11]^[13]

Main benchmark table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4 / V4 Pro Max	Kimi K2.6	Safest reading
GPQA Diamond	93.6% ^[4]	Not reported	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, without tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads the shared table ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	54.0% in a separate Kimi comparison ^[13]	GPT-5.5 Pro leads the shared table ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Not reported	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	66.7% in a separate Kimi comparison ^[13]	GPT-5.5 leads ^[4]^[5]
SWE-Bench Pro / SWE Pro	58.6% ^[4]	Not reported	64.3% ^[4]	55.4% for DeepSeek-V4-Pro-Max ^[4]	58.6% in a separate Kimi comparison ^[13]	Claude leads the shared table ^[4]
BrowseComp	84.4% ^[4]	90.1% ^[4]	79.3% ^[4]	83.4% for DeepSeek-V4-Pro-Max ^[4]; 83.4% for DeepSeek-V4 Pro in another comparison ^[11]	83.2% in the Kimi vs DeepSeek comparison ^[11]	GPT-5.5 Pro leads the shared table ^[4]
MCP Atlas / MCPAtlas Public	75.3% ^[4]	Not reported	79.1% ^[4]	73.6% for DeepSeek-V4-Pro-Max ^[4]	Not reported	Claude leads ^[4]
OSWorld-Verified	78.7% ^[5]	Not reported	78.0% ^[5]	Not reported	Not reported	GPT-5.5 narrowly ahead of Claude ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Not reported	43.8% ^[5]	Not reported	Not reported	GPT-5.5 leads Claude ^[5]
Vision & Document Arena	Not reported	Not reported	Reported #1 overall ^[1]	Not reported	Not reported	Only Claude has a cited result ^[1]
AIME 2026	Not reported	Not reported	Not reported	Not available in the cited Kimi vs DeepSeek table ^[11]	96.4% in Thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
APEX Agents	Not reported	Not reported	Not reported	Not available in the cited Kimi vs DeepSeek table ^[11]	27.9% in Thinking mode ^[11]	Useful Kimi signal, not a four-way ranking ^[11]
Context window	Not reported	Not reported	1,000k tokens in one Artificial Analysis comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in the same comparison ^[3]	Not reported	Claude and DeepSeek V4 Pro are level in that configuration ^[3]
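
The "safest reading" column can be reproduced mechanically from the shared-table scores. The following is a minimal Python sketch, not part of the cited sources: it copies the percentages from the table above, deliberately leaves out Kimi K2.6 because its figures come from separate comparisons, and reports each benchmark's leader with its lead over the runner-up. The names `SHARED_SCORES`, `leader_and_margin`, and `LEADERS` are illustrative assumptions.

```python
# Scores copied from the shared-table columns above (sources [4] and [5]).
# Kimi K2.6 is excluded on purpose: its numbers come from separate comparisons.
SHARED_SCORES = {
    "GPQA Diamond": {"GPT-5.5": 93.6, "Claude Opus 4.7": 94.2, "DeepSeek-V4-Pro-Max": 90.1},
    "HLE, without tools": {"GPT-5.5": 41.4, "GPT-5.5 Pro": 43.1,
                           "Claude Opus 4.7": 46.9, "DeepSeek-V4-Pro-Max": 37.7},
    "HLE, with tools": {"GPT-5.5": 52.2, "GPT-5.5 Pro": 57.2,
                        "Claude Opus 4.7": 54.7, "DeepSeek-V4-Pro-Max": 48.2},
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Claude Opus 4.7": 69.4, "DeepSeek-V4-Pro-Max": 67.9},
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3, "DeepSeek-V4-Pro-Max": 55.4},
    "BrowseComp": {"GPT-5.5": 84.4, "GPT-5.5 Pro": 90.1,
                   "Claude Opus 4.7": 79.3, "DeepSeek-V4-Pro-Max": 83.4},
    "MCP Atlas": {"GPT-5.5": 75.3, "Claude Opus 4.7": 79.1, "DeepSeek-V4-Pro-Max": 73.6},
    "OSWorld-Verified": {"GPT-5.5": 78.7, "Claude Opus 4.7": 78.0},
    "FrontierMath Tiers 1-3": {"GPT-5.5": 51.7, "Claude Opus 4.7": 43.8},
}


def leader_and_margin(row):
    """Return (leading model, lead over the runner-up in percentage points)."""
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    (top_model, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_model, round(top_score - second_score, 1)


LEADERS = {name: leader_and_margin(row) for name, row in SHARED_SCORES.items()}

if __name__ == "__main__":
    for name, (model, margin) in LEADERS.items():
        print(f"{name}: {model} (+{margin} pts)")
```

Printing the margin next to the leader makes it easier to separate narrow leads, such as OSWorld-Verified, from wide ones, such as Terminal-Bench 2.0.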

GPT-5.5: strongest for terminal, OS, math, and tool use

Claude Opus 4.7: strong at tool-free reasoning and documents

DeepSeek V4: competitive, but the main cited advantage is cost

Kimi K2.6: promising scores, but the direct comparisons are less tidy

Which model should you test first?

Test GPT-5.5 first for terminal-based agents, OS operation tasks, and FrontierMath-like work; it leads the cited Terminal-Bench 2.0, OSWorld-Verified, and FrontierMath results ^[4]^[5].
Test GPT-5.5 Pro first if tool-assisted reasoning or browsing is the main requirement; it leads Humanity’s Last Exam with tools and BrowseComp in the shared table ^[4].
Test Claude Opus 4.7 first for GPQA-style scientific reasoning, tool-free expert Q&A, SWE-Bench Pro-style software engineering, MCP-style workflows, and document-heavy multimodal work ^[4]^[1].
Test DeepSeek V4 first if cost-performance is the main constraint and you can run your own quality checks; the cited advantage is near-frontier performance at about one-sixth the cost of Opus 4.7 and GPT-5.5 ^[4].
Test Kimi K2.6 first if you specifically want to evaluate its reported coding, agentic, math, and browsing scores; compare it using the same prompts, tools, context limits, latency targets, and scoring rules as the other models ^[11]^[13].
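
The advice to use the same prompts, tools, context limits, latency targets, and scoring rules for every model can be sketched as a tiny harness. This is a hedged illustration only: `EvalSettings`, `run_comparison`, and the two stub models are hypothetical names invented here, no real model API is called, and the settings values are placeholders. A real harness would also enforce the tool list, context limit, and latency budget rather than just carrying them along.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSettings:
    """Conditions held constant for every model (placeholder values)."""
    prompt_template: str
    tools: tuple
    max_context_tokens: int
    latency_budget_s: float  # carried along; not enforced in this sketch


def exact_match(answer, expected):
    """One scoring rule, applied identically to every model."""
    return answer.strip().lower() == expected.strip().lower()


def run_comparison(models, tasks, settings):
    """Return the pass rate per model under identical settings."""
    results = {}
    for name, model in models.items():
        passed = 0
        for question, expected in tasks:
            prompt = settings.prompt_template.format(question=question)
            if exact_match(model(prompt, settings), expected):
                passed += 1
        results[name] = passed / len(tasks)
    return results


# Stand-in models so the sketch runs without any API access.
def always_four(prompt, settings):
    return "4"


def echo_last_word(prompt, settings):
    return prompt.split()[-1]


SETTINGS = EvalSettings(
    prompt_template="Answer briefly: {question}",
    tools=(),
    max_context_tokens=8000,
    latency_budget_s=30.0,
)
TASKS = [("What is 2 + 2?", "4"), ("Reply with the word ok", "ok")]
RESULTS = run_comparison(
    {"stub-a": always_four, "stub-b": echo_last_word}, TASKS, SETTINGS
)
```

Freezing the settings object is a small design choice that makes it harder to change a condition for one model partway through a run.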

Important notes before using benchmark numbers

Key takeaways

Claude Opus 4.7 leads GPQA Diamond at 94.2% and Humanity’s Last Exam without tools at 46.9%, while GPT-5.5 leads Terminal-Bench 2.0 at 82.7% [4][5].
GPT-5.5 Pro is the strongest pick in the cited results for tool-assisted reasoning and browsing: 57.2% on Humanity’s Last Exam with tools and 90.1% on BrowseComp [4].
DeepSeek V4 is the most attractive on cost-performance, while Kimi K2.6 has promising scores but is not yet as cleanly placed as the other three models in the same four-way comparison [4][11][13].

Research conversation

You, 27 Apr 2026

Research question

Research benchmarks for GPT-5.5, Claude Opus 4.7, DeepSeek V4, and Kimi K2.6, and compare them as comprehensively as possible. Create a research report on this.

Studio Global AI, 27 Apr 2026, 21 sources

Cited answer

Key findings

Claude Opus 4.7 has the strongest cited results on GPQA Diamond at 94.2% versus GPT-5.5 at 93.6% and DeepSeek-V4-Pro-Max at 90.1% ^[4].
Claude Opus 4.7 also leads the cited Humanity’s Last Exam no-tools result at 46.9%, ahead of GPT-5.5 Pro at 43.1%, GPT-5.5 at 41.4%, and DeepSeek-V4-Pro-Max at 37.7% ^[4].
GPT-5.5 Pro leads the cited Humanity’s Last Exam with-tools result at 57.2%, ahead of Claude Opus 4.7 at 54.7%, GPT-5.5 at 52.2%, and DeepSeek-V4-Pro-Max at 48.2% ^[4].
GPT-5.5 is strongest in the cited terminal/agentic-computing benchmarks: it scores 82.7% on Terminal-Bench 2.0, compared with Claude Opus 4.7 at 69.4% and DeepSeek-V4-Pro-Max at 67.9% ^[4]^[5].
GPT-5.5 narrowly leads Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 leads Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena, with a +4 point improvement over Opus 4.6 in Document Arena and wins in diagram, homework, and OCR subcategories ^[1].
DeepSeek V4 is described as achieving near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the available evidence does not provide the underlying price schedule or methodology ^[4].
DeepSeek V4 is claimed to be the #1 open-weight model on a Vibe Code Benchmark, ahead of Kimi K2.6 at #2, but this evidence comes from a Reddit snippet rather than a full benchmark report ^[18].
Kimi K2.6 is described as a leading open-model refresh, but the provided evidence does not include enough numeric Kimi K2.6 scores to compare it comprehensively with GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].

Benchmark comparison table

Benchmark / capability	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	DeepSeek V4	Kimi K2.6	Leader in available evidence
GPQA Diamond	93.6% ^[4]	Insufficient evidence	94.2% ^[4]	90.1% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, no tools	41.4% ^[4]	43.1% ^[4]	46.9% ^[4]	37.7% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	Claude Opus 4.7 ^[4]
Humanity’s Last Exam, with tools	52.2% ^[4]	57.2% ^[4]	54.7% ^[4]	48.2% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 Pro ^[4]
Terminal-Bench 2.0	82.7% ^[4]^[5]	Insufficient evidence	69.4% ^[4]^[5]	67.9% for DeepSeek-V4-Pro-Max ^[4]	Insufficient evidence	GPT-5.5 ^[4]^[5]
OSWorld-Verified	78.7% ^[5]	Insufficient evidence	78.0% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
FrontierMath Tiers 1–3	51.7% ^[5]	Insufficient evidence	43.8% ^[5]	Insufficient evidence	Insufficient evidence	GPT-5.5 ^[5]
Vision & Document Arena	Insufficient evidence	Insufficient evidence	Reported #1 overall ^[1]	Insufficient evidence	Insufficient evidence	Claude Opus 4.7 ^[1]
Vibe Code Benchmark	Insufficient evidence	Insufficient evidence	Insufficient evidence	Claimed #1 open-weight model ^[18]	Claimed #2 open-weight model ^[18]	DeepSeek V4 among open-weight models, low-confidence evidence ^[18]
Context window	Insufficient evidence	Insufficient evidence	1,000k tokens in one cited comparison ^[3]	1,000k tokens for DeepSeek V4 Pro in one cited comparison ^[3]	Insufficient evidence	Tie between Claude Opus 4.7 and DeepSeek V4 Pro in available evidence ^[3]

Model-by-model assessment

GPT-5.5

GPT-5.5’s clearest advantage is agentic computing and operational task performance, led by its 82.7% Terminal-Bench 2.0 score ^[4]^[5].
GPT-5.5 also edges Claude Opus 4.7 on OSWorld-Verified, 78.7% versus 78.0% ^[5].
GPT-5.5 shows a larger advantage over Claude Opus 4.7 on FrontierMath Tiers 1–3, 51.7% versus 43.8% ^[5].
GPT-5.5 trails Claude Opus 4.7 on GPQA Diamond by 0.6 points, 93.6% versus 94.2% ^[4].
GPT-5.5 Pro is the best cited model on Humanity’s Last Exam with tools, scoring 57.2% versus Claude Opus 4.7 at 54.7% ^[4].
Additional GPT-5.5-only domain benchmarks include 91.7% on Harvey BigLaw Bench with 43% perfect scores, 88.5% on an internal investment-banking benchmark, and 80.5% on BixBench bioinformatics ^[7]. These results are not directly comparable to the other three models because the provided excerpt does not include their scores on those same benchmarks ^[7].

Claude Opus 4.7

Claude Opus 4.7 is the strongest cited model on GPQA Diamond, scoring 94.2% ^[4].
Claude Opus 4.7 is also the strongest cited model on Humanity’s Last Exam without tools, scoring 46.9% ^[4].
Claude Opus 4.7 ranks below GPT-5.5 Pro on Humanity’s Last Exam with tools, 54.7% versus 57.2% ^[4].
Claude Opus 4.7 trails GPT-5.5 on Terminal-Bench 2.0 by more than 13 points, 69.4% versus 82.7% ^[4]^[5].
Claude Opus 4.7 is reported #1 in Vision & Document Arena and is said to lead in diagram, homework, and OCR subcategories ^[1].
Claude Opus 4.7 has a cited 1,000k-token context window in an Artificial Analysis comparison with DeepSeek V4 Pro ^[3].

DeepSeek V4

DeepSeek-V4-Pro-Max is competitive but trails GPT-5.5 and Claude Opus 4.7 on the cited GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 results ^[4].
DeepSeek-V4-Pro-Max scores 90.1% on GPQA Diamond, 37.7% on Humanity’s Last Exam without tools, 48.2% on Humanity’s Last Exam with tools, and 67.9% on Terminal-Bench 2.0 ^[4].
DeepSeek V4 is described as delivering near state-of-the-art intelligence at about one-sixth the cost of Opus 4.7 and GPT-5.5, but the excerpt does not provide enough detail to verify cost normalization or workload assumptions ^[4].
DeepSeek V4 Pro is cited with a 1,000k-token context window in a comparison against Claude Opus 4.7 ^[3].
A Reddit snippet claims DeepSeek V4 is the #1 open-weight model on a Vibe Code Benchmark and ranks above Kimi K2.6, but this should be treated as low-confidence evidence because the provided excerpt lacks a full methodology or score table ^[18].

Kimi K2.6

Kimi K2.6 has the weakest quantitative coverage in the available evidence ^[1]^[18].
One source describes Kimi K2.6 as a leading open-model refresh, but the provided excerpt does not expose benchmark scores that can be compared against GPT-5.5, Claude Opus 4.7, or DeepSeek V4 ^[1].
The only direct Kimi ranking in the available evidence is a Reddit snippet claiming Kimi K2.6 is #2 behind DeepSeek V4 on a Vibe Code Benchmark among open-weight models ^[18].
Insufficient evidence: the provided material does not support a comprehensive numerical evaluation of Kimi K2.6 across reasoning, math, coding, agentic-computing, multimodal, or long-context benchmarks.

Evidence notes

The most usable quantitative cross-model evidence comes from the cited table comparing DeepSeek-V4-Pro-Max, GPT-5.5, GPT-5.5 Pro where available, and Claude Opus 4.7 across GPQA Diamond, Humanity’s Last Exam, and Terminal-Bench 2.0 ^[4].
GPT-5.5 versus Claude Opus 4.7 is additionally supported by a separate source reporting the same Terminal-Bench 2.0 figures and adding OSWorld-Verified and FrontierMath results ^[5].
The cited Artificial Analysis comparison provides context-window information for DeepSeek V4 Pro and Claude Opus 4.7, both listed at 1,000k tokens in that comparison ^[3].
The Vision & Document Arena evidence supports Claude Opus 4.7’s multimodal/document strength, but it does not provide numeric scores for GPT-5.5, DeepSeek V4, or Kimi K2.6 ^[1].
The Mashable and Design for Online snippets do not provide benchmark values in the supplied evidence, so they cannot support quantitative conclusions here ^[2]^[6].

Limitations / uncertainty

Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].
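
One way to normalize a "1/6th the cost" claim is to convert per-token prices into cost per solved task on your own workload. The sketch below is illustrative only: every price, token count, and pass rate in it is a made-up placeholder rather than a figure from the cited sources, and `cost_per_solved_task` is a hypothetical helper name.

```python
def cost_per_solved_task(price_in, price_out, tokens_in, tokens_out, pass_rate):
    """USD per solved task; prices are USD per million tokens."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    cost_per_attempt = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return cost_per_attempt / pass_rate


# Hypothetical inputs: the second model is priced at one-sixth per token,
# but emits more output tokens and solves fewer tasks.
baseline = cost_per_solved_task(
    price_in=6.0, price_out=30.0, tokens_in=20_000, tokens_out=5_000, pass_rate=0.80
)
cheaper = cost_per_solved_task(
    price_in=1.0, price_out=5.0, tokens_in=20_000, tokens_out=10_000, pass_rate=0.70
)
ratio = cheaper / baseline

if __name__ == "__main__":
    print(f"baseline: ${baseline:.4f} per solved task")
    print(f"cheaper:  ${cheaper:.4f} per solved task")
    print(f"effective cost ratio: {ratio:.2f}")
```

With these placeholder inputs the effective ratio is roughly 0.30 rather than one-sixth, which is why the per-token headline needs checking against real token usage and pass rates.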

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited OS-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) (latent.space)
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking 1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison (artificialanalysis.ai)
Metric DeepSeek logoDeepSeek V4 Pro (Reasoning, Max Effort) Anthropic logoClaude Opus 4.7 (Non-reasoning, High Effort) Analysis --- --- Creator DeepSeek Anthropic Context Window 1000k tokens ( 1500 A4 pages of size 12 Arial font) 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
BenchmarkDeepSeek-V4-Pro-MaxGPT-5.5GPT-5.5 Pro, where shownClaude Opus 4.7Best result among these GPQA Diamond90.1%93.6%—94.2%Claude Opus 4.7 Humanity’s Last Exam, no tools37.7%41.4%43.1%46.9%Claude Opus 4.7 Humanity’s Last Exam, with tools48.2%52.2%57.2%54...
[5] Everything You Need to Know About GPT-5.5 (vellum.ai)
The headline numbers GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Domain-Specific Benchmarks Benchmark GPT-5.5 Notes --- Harvey BigLaw Bench 91.7% (43% perfect scores) Legal reasoning, audience calibration Internal Investment Banking 88.5% Financial analysis tasks BixBench (bioinformatics) 80.5% (up from 74.0%) +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI (openai.com)
Abstract reasoning EvalGPT-5.5GPT‑5.4GPT-5.5 ProGPT‑5.4 ProClaude Opus 4.7Gemini 3.1 Pro ARC-AGI-1 (Verified)95.0%93.7%-94.5%93.5%98.0% ARC-AGI-2 (Verified)85.0%73.3%-83.3%75.8%77.1% Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI (docsbot.ai)
Benchmark Kimi K2.6 DeepSeek-V4 Pro --- AIME 2026 American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) 96.4% Thinking mode Source Not available APEX Agents Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark K2.6 Claude Opus 4.6 GPT-5.4 Notes --- --- SWE-Bench Pro 58.60% 53.40% 57.70% Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9% SWE-Bench Verified 80.20% 80.80% 80% Tight cluster; Opus 4.7 now leads at 87.6% T...


Insufficient evidence: no primary model cards, official benchmark reports, full prompt settings, confidence intervals, or independent reproducibility details are included in the provided evidence.
Insufficient evidence: Kimi K2.6 lacks numeric benchmark coverage in the provided excerpts, so any Kimi ranking beyond the Vibe Code claim is unsupported ^[1]^[18].
The benchmark comparisons mix model variants and modes, including GPT-5.5, GPT-5.5 Pro, DeepSeek-V4-Pro-Max, DeepSeek V4 Pro, and Claude Opus 4.7 with differing reasoning or effort settings ^[3]^[4].
The Vibe Code claim for DeepSeek V4 and Kimi K2.6 is low confidence because the available evidence is a Reddit snippet without benchmark scores or methodology ^[18].
The cost-performance claim for DeepSeek V4 is promising but under-specified because the provided evidence reports “1/6th the cost” without enough details to normalize by token pricing, latency, benchmark workload, or deployment assumptions ^[4].
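
To illustrate why a headline price ratio is not enough, the sketch below normalizes cost per task rather than per token. Every number in it is a hypothetical placeholder chosen for the arithmetic, not a real price or token count for any model named in this report; `Pricing` and `cost_per_task` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD prices per million tokens (placeholders, not real quotes)."""
    input_per_million: float
    output_per_million: float


def cost_per_task(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task given its input and output token counts."""
    total = input_tokens * p.input_per_million + output_tokens * p.output_per_million
    return total / 1_000_000


# Hypothetical: model B is listed at exactly one-sixth of model A's token prices.
model_a = Pricing(input_per_million=6.0, output_per_million=30.0)
model_b = Pricing(input_per_million=1.0, output_per_million=5.0)

# Hypothetical workload: same prompt, but model B emits three times the output tokens.
cost_a = cost_per_task(model_a, input_tokens=10_000, output_tokens=2_000)
cost_b = cost_per_task(model_b, input_tokens=10_000, output_tokens=6_000)

ratio = cost_b / cost_a
print(f"Per-task cost ratio: {ratio:.2f}")
```

Under these invented assumptions the per-task ratio is about one-third rather than one-sixth, which is why token usage, workload, and deployment details matter when reading the cost claim.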

Summary

Best cited reasoning/science model: Claude Opus 4.7, based on GPQA Diamond and Humanity’s Last Exam without tools ^[4].
Best cited tool-augmented exam result: GPT-5.5 Pro, based on Humanity’s Last Exam with tools ^[4].
Best cited terminal/agentic-computing model: GPT-5.5, based on Terminal-Bench 2.0 ^[4]^[5].
Best cited computer-operation and math results between GPT-5.5 and Claude Opus 4.7: GPT-5.5, based on OSWorld-Verified and FrontierMath ^[5].
Best cited multimodal/document model: Claude Opus 4.7, based on Vision & Document Arena reporting ^[1].
Best cited open-weight coding claim: DeepSeek V4 over Kimi K2.6 on Vibe Code, but this is low-confidence because the evidence is only a Reddit snippet ^[18].
Most under-evidenced model: Kimi K2.6, because the provided evidence does not include enough quantitative benchmark results for a comprehensive comparison ^[1]^[18].
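
The per-benchmark leaders in the shared table can be derived mechanically. This is a minimal sketch using the scores quoted in this report; `None` marks a value that is not quoted here (for example, no GPT-5.5 Pro score is given for GPQA Diamond or Terminal-Bench 2.0), and the names `table`, `best`, and `leaders` are illustrative.

```python
from typing import Optional

# Scores in percent as quoted in this report; None means "not quoted here".
table: dict[str, dict[str, Optional[float]]] = {
    "GPQA Diamond": {
        "DeepSeek-V4-Pro-Max": 90.1, "GPT-5.5": 93.6,
        "GPT-5.5 Pro": None, "Claude Opus 4.7": 94.2,
    },
    "Humanity's Last Exam, no tools": {
        "DeepSeek-V4-Pro-Max": 37.7, "GPT-5.5": 41.4,
        "GPT-5.5 Pro": 43.1, "Claude Opus 4.7": 46.9,
    },
    "Humanity's Last Exam, with tools": {
        "DeepSeek-V4-Pro-Max": 48.2, "GPT-5.5": 52.2,
        "GPT-5.5 Pro": 57.2, "Claude Opus 4.7": 54.7,
    },
    "Terminal-Bench 2.0": {
        "DeepSeek-V4-Pro-Max": 67.9, "GPT-5.5": 82.7,
        "GPT-5.5 Pro": None, "Claude Opus 4.7": 69.4,
    },
}


def best(row: dict[str, Optional[float]]) -> str:
    """Return the model with the highest reported score, ignoring missing values."""
    reported = {model: score for model, score in row.items() if score is not None}
    return max(reported, key=lambda model: reported[model])


leaders = {benchmark: best(row) for benchmark, row in table.items()}

for benchmark, model in leaders.items():
    print(f"{benchmark}: {model}")
```

Kimi K2.6 is left out of the table on purpose, since this report cites no score for it on these four benchmarks.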

Sources

[1] [AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?) (latent.space)
Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking #1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included dia...
[3] DeepSeek V4 Pro (Reasoning, Max Effort) vs Claude Opus 4.7 (Non-reasoning, High Effort): Model Comparison (artificialanalysis.ai)
Metric | DeepSeek V4 Pro (Reasoning, Max Effort) | Claude Opus 4.7 (Non-reasoning, High Effort) | Analysis
Creator | DeepSeek | Anthropic
Context Window | 1000k tokens ( 1500 A4 pages of size 12 Arial font) | 1000k tokens ( 1500 A4 page...
[4] DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th ... (venturebeat.com)
Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro, where shown | Claude Opus 4.7 | Best result among these
GPQA Diamond | 90.1% | 93.6% | — | 94.2% | Claude Opus 4.7
Humanity’s Last Exam, no tools | 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7
Humanity’s Last Exam, with tools | 48.2% | 52.2% | 57.2% | 54...
[5] Everything You Need to Know About GPT-5.5 (vellum.ai)
The headline numbers: GPT-5.5 achieves state-of-the-art on Terminal-Bench 2.0 at 82.7%, leading Claude Opus 4.7 (69.4%) by over 13 points. On OSWorld-Verified, which tests real computer environment operation, it edges out Claude at 78.7% vs 78.0%. On Frontie...
[7] GPT-5.5: The Complete Guide (2026) - o-mega | AI (o-mega.ai)
Domain-Specific Benchmarks
Benchmark | GPT-5.5 | Notes
Harvey BigLaw Bench | 91.7% (43% perfect scores) | Legal reasoning, audience calibration
Internal Investment Banking | 88.5% | Financial analysis tasks
BixBench (bioinformatics) | 80.5% (up from 74.0%) | +6.5pts ov...
[8] Introducing GPT-5.5 - OpenAI (openai.com)
Abstract reasoning
Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro
ARC-AGI-1 (Verified) | 95.0% | 93.7% | - | 94.5% | 93.5% | 98.0%
ARC-AGI-2 (Verified) | 85.0% | 73.3% | - | 83.3% | 75.8% | 77.1%
Evals of GPT were run with reasoning effort set to xhigh and were conducte...
[11] Kimi K2.6 vs DeepSeek-V4 Pro - DocsBot AI (docsbot.ai)
Benchmark | Kimi K2.6 | DeepSeek-V4 Pro
AIME 2026: American Invitational Mathematics Examination 2026 - Evaluates advanced mathematical problem-solving abilities (contest-level math) | 96.4% Thinking mode Source | Not available
APEX Agents: Evaluates long-horizon...
[13] Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4 - Verdent AI (verdent.ai)
Benchmark | K2.6 | Claude Opus 4.6 | GPT-5.4 | Notes
SWE-Bench Pro | 58.60% | 53.40% | 57.70% | Moonshot in-house harness; SEAL mini-swe-agent puts GPT-5.4 at 59.1%, Opus 4.6 at 51.9%
SWE-Bench Verified | 80.20% | 80.80% | 80% | Tight cluster; Opus 4.7 now leads at 87.6%
T...