NVIDIA swept every benchmark in MLPerf Training v6.0, achieving the fastest time to train at scale and the highest per-accelerator performance across all seven workloads, while CoreWeave trained the demanding DeepSeek... The GB300 NVL72 (Blackwell Ultra) delivered up to 1.6x faster training and up to 2.77x faster inference token throughput compared to the previous GB200 NVL72.

The MLPerf Training v6.0 results, released by the MLCommons consortium, underscore a pivotal shift in the AI infrastructure landscape: the industry is moving decisively toward massive-scale, mixture-of-experts (MoE) models, and NVIDIA's Blackwell platform has set a formidable new performance bar. NVIDIA achieved a clean sweep across every training benchmark, but the headline story belongs to the debut of new, real-world MoE workloads and the staggering speed at which they can now be trained on production cloud infrastructure.
The most dramatic result of this round came on the newly introduced DeepSeek-V3 benchmark. DeepSeek-V3 is a large-scale pretraining workload built on an MoE architecture with 671 billion total parameters, of which 37 billion are activated per token. Running on the same CoreWeave Cloud infrastructure available to customers, CoreWeave trained the model to the benchmark's quality target in just 2.02 minutes.
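To make the sparsity concrete, the short sketch below works out what share of DeepSeek-V3's parameters is active for each token, using only the two parameter counts quoted above. The FLOPs estimate relies on the common rule of thumb of roughly 6 floating-point operations per active parameter per training token. That rule is an assumption added here for illustration, not a figure from the MLPerf results, and it ignores attention and routing overhead.

```python
# Parameter counts quoted in the article for DeepSeek-V3.
TOTAL_PARAMS = 671e9   # total parameters
ACTIVE_PARAMS = 37e9   # parameters activated per token

# Share of the model that actually runs for any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# ASSUMPTION (not from the article): ~6 FLOPs per active parameter per
# training token (forward + backward), a rough rule of thumb only.
FLOPS_PER_PARAM_PER_TOKEN = 6
approx_flops_per_token = FLOPS_PER_PARAM_PER_TOKEN * ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")                  # 5.5%
print(f"approx. training FLOPs per token: {approx_flops_per_token:.2e}")  # 2.22e+11
```

Under that assumption, per-token compute tracks the 37 billion active parameters rather than the full 671 billion, which is why an MoE model of this size can be trained at all on a benchmark timescale.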
This record-setting run used 8,192 GPUs in NVIDIA GB300 NVL72 systems across 2,048 nodes, the largest GB300 cluster submitted in the round. CoreWeave's results were the fastest across all Closed/Available-cloud submissions, achieved through full-stack optimizations across the networking, orchestration, and storage layers. The workload scaled well, though short of linearly: it completed in 5.54 minutes on 2,048 GPUs, 3.09 minutes on 4,096 GPUs, and 2.02 minutes on the full 8,192.
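The scaling behavior can be checked directly from the three reported times. The sketch below computes speedup and strong-scaling efficiency relative to the 2,048-GPU run; the GPU counts and minutes come from the text above, while the function name and the choice of baseline are illustrative.

```python
# (GPU count, time to train in minutes) as reported in the article.
runs = [(2048, 5.54), (4096, 3.09), (8192, 2.02)]

def scaling_efficiency(base, other):
    """Return (speedup, efficiency) of `other` relative to `base`.

    Efficiency is the measured speedup divided by the ideal speedup,
    where ideal speedup equals the ratio of GPU counts.
    """
    base_gpus, base_minutes = base
    gpus, minutes = other
    speedup = base_minutes / minutes
    ideal = gpus / base_gpus
    return speedup, speedup / ideal

baseline = runs[0]
for run in runs[1:]:
    speedup, efficiency = scaling_efficiency(baseline, run)
    print(f"{run[0]} GPUs: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
# 4096 GPUs: 1.79x speedup, 90% efficiency
# 8192 GPUs: 2.74x speedup, 69% efficiency
```

Runs this short are likely to be sensitive to fixed costs such as startup and synchronization, so efficiency falling as the cluster grows is expected rather than a sign of a poorly tuned system.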
NVIDIA was the only platform to submit results on all seven benchmarks in this round, and it won them all, delivering both the fastest time to train at scale and the highest per-accelerator performance. This continues a streak that has given NVIDIA nine times more cumulative MLPerf wins than all other submitters combined.
NVIDIA's GB300 NVL72 (Blackwell Ultra) system, which benefits from larger memory and power budgets, delivered substantially higher training throughput than the previous GB200 NVL72: up to 1.6x faster training at the same scale. Software improvements also played a critical part. Training throughput for DeepSeek-V3 improved by 1.3x in just three months on identical hardware, driven by optimizations such as full-iteration CUDA graphs and CuTe DSL fusions.
The combination of the NVLink-switch domain and the scale-out Spectrum-X Ethernet fabric enabled NVIDIA's partners to achieve record times across all workloads.
The generational uplift from Blackwell to Blackwell Ultra is pronounced. In inference workloads, the GB300 NVL72 delivers up to 2.77x higher tokens-per-second throughput than the GB200 NVL72. This leap is attributed not only to silicon-level architectural gains but also to the aggressive software-hardware co-design that has become a hallmark of the Blackwell platform. Having both systems available lets data center operators choose configurations optimized for specific cost, throughput, and latency requirements.
For the first time, MLPerf Training introduced benchmarks that mirror the architecture of the most advanced frontier models in production today. Alongside the massive DeepSeek-V3 671B workload, a smaller GPT-OSS-20B MoE pretraining benchmark was added. NVIDIA was the only platform to submit results on both new MoE workloads, using the GB300 NVL72 system with custom software stacks, CUDA graphs, and advanced MoE routing to optimize the sparse, bursty communication patterns inherent to these models.
The v6.0 round saw the broadest participation yet, with 24 organizations submitting results across 95 distinct systems using 13 different hardware accelerators. The technical diversity extended beyond hardware to include multiple precision recipes, such as NVFP4 from NVIDIA and MXFP4 from AMD. AMD, in particular, demonstrated competitive performance with its Instinct MI355X GPU, coming within 5% of NVIDIA's B200 platform on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pretraining, an important sign of a developing competitive landscape.
The ability to scale a single training job to thousands of GPUs is critical for MoE models, which generate bursty all-to-all communication as tokens are dynamically routed to different experts. To handle this, NVIDIA's partners used Spectrum-X Ethernet with Adaptive Routing and Congestion Control, sustaining near-theoretical fabric bandwidth across 8,192 Blackwell GPUs in hyperscale clusters. This networking stack is as vital as raw GPU compute in turning hardware specifications into benchmark-topping results and, ultimately, into faster training for real-world AI factories.
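A toy example helps show why dynamic routing produces uneven, bursty traffic. The sketch below assigns each token to its top-k experts and counts how many tokens land on each one. Everything in it is hypothetical: random scores stand in for a learned gating network, and the expert count, top-k value, and token count are small illustrative numbers, not the configuration of DeepSeek-V3 or of any benchmark submission.

```python
import random
from collections import Counter

def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Toy top-k router: each token picks its top_k highest-scoring experts."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(num_tokens):
        # Random scores stand in for a learned gating network's logits.
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        assignments.append(ranked[:top_k])
    return assignments

def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts.get(e, 0) for e in range(num_experts)]

# Illustrative sizes only (assumptions, not benchmark settings).
NUM_TOKENS, NUM_EXPERTS, TOP_K = 1024, 8, 2

assignments = route_tokens(NUM_TOKENS, NUM_EXPERTS, TOP_K)
load = expert_load(assignments, NUM_EXPERTS)
mean_load = sum(load) / len(load)
imbalance = max(load) / mean_load  # 1.0 would be perfectly balanced

print("tokens per expert:", load)
print(f"max/mean load: {imbalance:.2f}")
```

When experts live on different GPUs, each of these per-expert counts becomes a message of a different size in the all-to-all exchange, and the sizes change from batch to batch. That variability is the traffic pattern the fabric has to absorb.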