Answers · Published 3 days ago · 24 sources

DeepSeek V4 Pro: 43-Day GPU Showdown — Nvidia, AMD, Huawei, and the 110x Software Miracle

Nvidia's GB300 NVL72 leads with up to 11,474 tok/s/GPU, while Huawei's Ascend 950DT achieved historic Day 0 production support, matching Nvidia's CUDA ecosystem for the first time on a frontier-scale model. AMD's MI355X started at a catastrophic 20.4 tok/s/GPU but recovered to 2,256 tok/s/GPU in 26 days, a 110.5x t...


[Figure: DeepSeek V4 Pro's 43-day inference performance timeline across Nvidia, AMD, and Huawei accelerators]

When DeepSeek dropped the 1.6-trillion-parameter V4 Pro on April 24, 2026, the real test wasn't the model's benchmark scores — it was whether any shipping hardware could serve it at million-token contexts without bankrupting the operator. SemiAnalysis fired up InferenceX and tracked every major AI accelerator for 43 straight days. The results reveal less about the chips themselves and more about the state of the software ecosystems that control them.

The Great Software Divide: Day 0 Readiness

Launch-day support split into three tiers. Nvidia's CUDA ecosystem, running on Blackwell B300 and the larger GB300 NVL72 rack-scale system, delivered full readiness immediately. vLLM and SGLang both worked out of the box, with the B300 pushing 8,075 tok/s/GPU at launch.

Huawei pulled off something genuinely historic. The Ascend 950DT with the CANN framework achieved the same Day-0 full-stack support as Nvidia, the first time a non-CUDA platform has matched Nvidia on launch-day readiness for a frontier-scale model. Independent benchmarks on the shipping Ascend 950PR showed FP4 compute at 2.87x the H20's rate and MoE inference up to 1.96x faster in RL rollouts. The catch: China's hardware performance ceiling remains constrained by fabrication limits, so absolute throughput still trails Nvidia's best.


Platform	Throughput (tok/s/GPU)	Cost ($/M tok)	Day-0 Support	Key Limitation
Nvidia GB300 NVL72	8,075–11,474	$0.064–$0.166	Full (CUDA, vLLM, SGLang)	Availability
AMD MI355X (Day 26+)	1,411–2,256	$0.291–$0.585	Broken (FP8 only, fixed later)	ROCm maturity
Nvidia B200	1,411	$0.387	Full (CUDA via Dynamo TRT)	Mid-range positioning
Huawei Ascend 950DT	No public tok/s/GPU yet	N/A	Full (CANN, Day-0)	Fabrication-constrained peak perf
Nvidia H200	186	N/A	Full (CUDA)	Previous-gen Hopper
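
The throughput and cost columns above are linked by simple arithmetic, and a short sketch can make that link explicit. The Python below assumes that cost per million tokens is just the GPU's hourly price divided by the tokens it generates in an hour; the article does not state SemiAnalysis's actual cost methodology, so treat this as an illustration only. The function names and the example hourly rate are hypothetical, not taken from the source.

```python
# A minimal sketch, assuming cost is GPU rental price divided by throughput.
# This is NOT confirmed to be SemiAnalysis's cost model.

SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(gpu_hourly_usd: float, tok_per_s_per_gpu: float) -> float:
    """USD per one million generated tokens on a single GPU."""
    tokens_per_hour = tok_per_s_per_gpu * SECONDS_PER_HOUR
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def implied_hourly_rate(usd_per_m_tok: float, tok_per_s_per_gpu: float) -> float:
    """Invert the formula: the hourly GPU price a table row would imply."""
    return usd_per_m_tok * tok_per_s_per_gpu * SECONDS_PER_HOUR / 1_000_000


def speedup(after_tok_s: float, before_tok_s: float) -> float:
    """Throughput ratio between two measurements of the same hardware."""
    return after_tok_s / before_tok_s


if __name__ == "__main__":
    # Nvidia B200 row: 1,411 tok/s/GPU at $0.387 per million tokens.
    print(f"B200 implied $/GPU-hour: {implied_hourly_rate(0.387, 1411):.2f}")

    # AMD MI355X: 20.4 tok/s/GPU at launch, 2,256 tok/s/GPU after 26 days.
    print(f"MI355X recovery factor: {speedup(2256, 20.4):.1f}x")

    # Hypothetical hourly price, chosen only to show the forward direction.
    hypothetical_rate = 2.00
    print(f"$/M tok at 2,256 tok/s: {cost_per_million_tokens(hypothetical_rate, 2256):.3f}")
```

Under this simplified model, the same hourly price yields a lower cost per token as throughput rises, which is why software fixes that raise tok/s/GPU translate directly into cheaper serving on unchanged hardware.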


AMD's 110.5x Recovery: The Software Sprint

The Full Leaderboard at 43 Days

The Real Enabler: DeepSeek V4's Hybrid Attention Architecture

What This Means