AMD's launch was, bluntly, a disaster. On Day 0 the MI355X could only run FP8 (the native FP4+FP8 checkpoint was unusable) and delivered a paltry 20.4 tok/s/GPU at 2.4 tok/s/user on the 8K/1K workload. The ROCm stack wasn't optimized; it was a placeholder.
What followed is the most dramatic software-driven performance recovery in recent AI infrastructure memory. Over 26 days, the AMD SGLang engineering team under HaiShaw merged 31 performance optimization pull requests on the amd/deepseek_v4 side branch. Each PR addressed a layer of the ROCm stack that had been essentially untested with a MoE model of this scale.
By Day 26 (May 20, 2026), the MI355X hit 2,256 tok/s/GPU at 9.4 tok/s/user on the 8K/1K workload: a 110.5x improvement in throughput per GPU, while simultaneously improving interactivity by 3.85x. This is the rare optimization result where both axes climb together; you don't normally get faster per-user response times and higher aggregate throughput from the same hardware.
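As a quick sanity check, the two multipliers can be recomputed from the Day 0 and Day 26 figures quoted here. This is only back-of-the-envelope arithmetic on the rounded numbers in this article: it gives roughly 110.6x and 3.9x, close to the quoted 110.5x and 3.85x. The small gap presumably comes from rounding in the published figures, which is an assumption on our part. The `speedup` helper is a name chosen for illustration.

```python
# Figures quoted in the article for the 8K/1K workload (rounded as published).
day0 = {"tok_s_gpu": 20.4, "tok_s_user": 2.4}
day26 = {"tok_s_gpu": 2256.0, "tok_s_user": 9.4}


def speedup(before: float, after: float) -> float:
    """Return the multiplicative improvement going from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return after / before


# Throughput per GPU and per-user interactivity, compared Day 26 vs Day 0.
throughput_gain = speedup(day0["tok_s_gpu"], day26["tok_s_gpu"])
interactivity_gain = speedup(day0["tok_s_user"], day26["tok_s_user"])

print(f"throughput gain:    {throughput_gain:.1f}x")
print(f"interactivity gain: {interactivity_gain:.2f}x")
```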
At comparable interactivity points on InferenceX, the Day-26 MI355X achieves roughly 1,411 tok/s/GPU, placing it in the same competitive tier as Nvidia's B200 at low interactivity settings. It still trails the Nvidia GB300 NVL72 (which reaches up to 11,474 tok/s/GPU under optimal conditions), but the trajectory matters: the MI355X went from a non-functional launch to a viable inference SKU.
Cost-per-million-tokens tells the economic story. At low interactivity, GB300 NVL72 delivers tokens at $0.064/M tok, roughly 4.5x cheaper than the MI355X at $0.291/M tok. At higher interactivity (lower concurrency, higher per-user speed), the MI355X becomes cost-competitive against the B200, with both hovering around $1.22–$1.45/M tok at 23–24 concurrent users.
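For readers who want to see how a cost-per-million-tokens figure relates to throughput, here is a minimal sketch. It assumes cost is simply an hourly GPU price divided by tokens produced per hour; that is a common simplification, not necessarily the methodology behind the numbers above. The $2.00/hour price is purely hypothetical and chosen for illustration. Only the $0.064 and $0.291 figures come from the article.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tok_s_gpu: float) -> float:
    """USD per one million tokens for one GPU at a sustained throughput.

    Assumes cost = hourly price / tokens generated per hour (a simplification).
    """
    if tok_s_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tok_s_gpu * 3600.0
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Figures quoted in the article at low interactivity (USD per million tokens).
gb300_cost = 0.064
mi355x_cost = 0.291
cost_ratio = mi355x_cost / gb300_cost  # how many times cheaper GB300 is

# Hypothetical example: a $2.00/hour GPU price (NOT from the article)
# combined with the Day-26 MI355X peak throughput quoted earlier.
example_cost = cost_per_million_tokens(gpu_hour_usd=2.00, tok_s_gpu=2256.0)

print(f"GB300 advantage: {cost_ratio:.2f}x")
print(f"example cost:    ${example_cost:.3f}/M tok")
```

The formula makes the lever obvious: at a fixed hourly price, doubling tok/s/GPU halves the cost per token, which is why a software-only throughput recovery translates directly into economics.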
The benchmark numbers are dramatic, but they'd be economically meaningless without DeepSeek's architectural breakthrough on the model side. Standard Transformer attention scales quadratically with context length. At 1 million tokens, the compute and memory costs make serving economically infeasible on any hardware. DeepSeek V4 replaces standard attention with a hybrid of two complementary mechanisms:
Compressed Sparse Attention (CSA) operates at the fine-grained level. It groups roughly 4 tokens into a single learned KV entry using data-dependent weighting with overlapping windows, then applies sparse selection: each query only attends to the top-k most relevant compressed entries, indexed by a component called the Lightning Indexer. A sliding window attention branch runs alongside to preserve local precision.
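To make the compress-then-select idea concrete, here is a toy sketch in plain Python. It is not DeepSeek's implementation: the `gate` vector stands in for learned compression parameters, a plain dot product stands in for the Lightning Indexer's scoring, the window and stride sizes are illustrative, and the sliding-window branch is omitted. Every function name here is hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty sequence."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def compress(tokens: List[Vector], gate: Vector,
             stride: int = 4, window: int = 8) -> List[Vector]:
    """Pool overlapping windows of tokens into one entry per `stride` tokens.

    Weights are data dependent: a softmax over each token's score against
    `gate` (a stand-in for learned parameters, assumed for illustration).
    """
    entries = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + window]  # window > stride gives overlap
        weights = softmax([dot(t, gate) for t in chunk])
        dim = len(chunk[0])
        entries.append([sum(w * t[d] for w, t in zip(weights, chunk))
                        for d in range(dim)])
    return entries


def top_k_indices(query: Vector, entries: List[Vector], k: int) -> List[int]:
    """Indices of the k entries scoring highest against the query."""
    scores = [dot(query, e) for e in entries]
    ranked = sorted(range(len(entries)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attend(query: Vector, entries: List[Vector], k: int) -> Vector:
    """Softmax attention restricted to the top-k compressed entries."""
    idx = top_k_indices(query, entries, k)
    weights = softmax([dot(query, entries[i]) for i in idx])
    dim = len(entries[0])
    return [sum(w * entries[i][d] for w, i in zip(weights, idx))
            for d in range(dim)]


# Toy data: 16 two-dimensional "tokens", compressed 4:1 into 4 entries.
tokens = [[math.sin(i), math.cos(i)] for i in range(16)]
entries = compress(tokens, gate=[1.0, 0.0])
selected = top_k_indices([0.0, 1.0], entries, k=2)
output = sparse_attend([0.0, 1.0], entries, k=2)
print(len(entries), selected, output)
```

The point of the sketch is the shape of the computation: the query scores only the compressed entries, then runs softmax attention over just the k it selected, rather than over every original token.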
Heavily Compressed Attention (HCA) takes a coarser bet. It compresses up to 128 tokens into a single KV entry and applies dense attention over the heavily compressed stream. This trades retrieval precision for comprehensive global coverage at near-zero marginal cost. HCA runs in early layers to provide a cheap contextual panorama; the final layer retains full attention for maximum precision.
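A little arithmetic shows why these compression ratios matter at long context. The sketch below counts how many KV entries a single query would face per layer under each scheme, assuming "1M tokens" means exactly 1,000,000 and that the 4:1 and 128:1 ratios apply uniformly (the text says "roughly 4" and "up to 128", so treat these as illustrative bounds rather than measured values).

```python
import math


def compressed_entries(context_tokens: int, tokens_per_entry: int) -> int:
    """Number of KV entries after grouping tokens, rounding up the last group."""
    if tokens_per_entry <= 0:
        raise ValueError("tokens_per_entry must be positive")
    return math.ceil(context_tokens / tokens_per_entry)


context = 1_000_000  # assumption: "1M tokens" taken as exactly one million

full_entries = compressed_entries(context, 1)    # standard attention: one per token
csa_entries = compressed_entries(context, 4)     # CSA pool, before top-k selection
hca_entries = compressed_entries(context, 128)   # HCA, attended densely

print(f"full attention: {full_entries:,} entries per query")
print(f"CSA pool:       {csa_entries:,} entries (query attends to top-k only)")
print(f"HCA stream:     {hca_entries:,} entries")
print(f"HCA reduction:  {full_entries / hca_entries:.0f}x fewer entries")
```

Under these assumptions, dense attention over the HCA stream touches about 128x fewer entries than full attention would, which is what makes it affordable to keep dense.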
The combined efficiency gains are staggering compared to DeepSeek V3.2 at 1M tokens:
Without CSA and HCA, the 1.6-trillion-parameter Pro model at 1M context would be a science project — interesting but commercially unservable. With them, it's a product.
The 43-day SemiAnalysis data tells three stories in parallel. First, Nvidia's ecosystem advantage remains structurally enormous: CUDA works, Day 0, every time, and the hardware is genuinely ahead. Second, Huawei has closed the software readiness gap entirely — a historic shift with geopolitical implications that will play out over the next hardware generation. Third, AMD's hardware is capable but the ROCm software stack is still a bottleneck that requires heroic engineering effort to overcome for each new frontier model — a 110.5x improvement in 26 days is impressive, but it shouldn't be necessary.
The bigger picture belongs to the architects. DeepSeek's CSA + HCA hybrid attention doesn't just improve efficiency — it fundamentally changes what's commercially viable to serve. When a 1.6-trillion-parameter model at million-token context costs single-digit cents per million tokens on the right hardware, the conversation shifts from "can we do this?" to "what do we build with it?"