This is where the models show their true colors. Qwen3.7-Max is a mathematical powerhouse, achieving the highest scores in the group on HMMT 2026 (97.1%) and GPQA Diamond (92.4%). It's a pure reasoning engine.
Kimi K2.6 takes the opposite approach, dominating benchmarks that measure reasoning with the aid of tools and research. It posts the top score on Humanity's Last Exam (HLE) with tools (54.0) and a commanding 92.5 F1 on DeepSearchQA, a benchmark designed to test web research synthesis. If your use case is "agentic research," Kimi K2.6 is the specialist.
DeepSeek V4 Pro is competitive in reasoning but rarely leads, trailing Qwen on math and Kimi on tool use, though it handles Chinese-language factual questions impressively, scoring 84.4 on Chinese SimpleQA.
The price gap is enormous and will be the deciding factor for many builders. All prices below are per 1 million tokens in USD.
| Model | Input (Cache Miss) | Output | Cached Input | Context Window | Open Weights |
|---|---|---|---|---|---|
| DeepSeek V4 Pro | $0.435 | $0.87 | $0.0036 | 1M tokens | Yes |
| Qwen3.7-Max | $2.50 | $7.50 | $0.25 | 1M tokens | No |
| Kimi K2.6 | $0.95 | $4.00 | $0.16 | 256K tokens | Yes |
DeepSeek Pricing Note: DeepSeek's initial 75% launch discount became permanent in late May 2026. The prices above reflect this permanent rate, making it the most cost-effective frontier model on the market by a wide margin.
The value equation is clear:
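One way to make the comparison concrete is to price a sample workload against the table above. The sketch below is illustrative only: the per-token rates are copied from the table, while the function name `workload_cost`, the `cache_hit_ratio` parameter, and the example token counts are assumptions introduced here, not part of any provider's SDK or billing API.

```python
# Prices in USD per 1 million tokens, copied from the pricing table above.
PRICES = {
    "DeepSeek V4 Pro": {"input": 0.435, "cached_input": 0.0036, "output": 0.87},
    "Qwen3.7-Max": {"input": 2.50, "cached_input": 0.25, "output": 7.50},
    "Kimi K2.6": {"input": 0.95, "cached_input": 0.16, "output": 4.00},
}


def workload_cost(model, input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate the USD cost of a workload (hypothetical helper).

    cache_hit_ratio is the assumed fraction of input tokens billed at the
    cached-input rate; the remainder is billed at the cache-miss rate.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    price = PRICES[model]
    cached_tokens = input_tokens * cache_hit_ratio
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000  # rates are quoted per 1M tokens


if __name__ == "__main__":
    # Example workload (assumed): 1M input tokens, 1M output tokens, no caching.
    for name in PRICES:
        print(f"{name}: ${workload_cost(name, 1_000_000, 1_000_000):.3f}")
```

Under these assumed token counts, with no cache hits, the workload comes to about $1.31 on DeepSeek V4 Pro, $4.95 on Kimi K2.6, and $10.00 on Qwen3.7-Max. Real bills depend on your actual input/output mix and cache hit rate, so treat the numbers as a rough guide.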
A significant red flag was raised for DeepSeek V4 Pro in a May 2026 evaluation by NIST's CAISI program. The government body's independent tests, which include non-public benchmarks, found that DeepSeek V4 Pro's performance was closer to that of GPT-5 (released in August 2025) than to the more recent models it was benchmarked against in its own reports.
This means the impressive self-reported benchmarks may overstate its real-world edge, and developers should factor this discrepancy into their evaluation.