Answers · Published 2 weeks ago · Last edited 2 weeks ago · 21 sources

NVIDIA Blackwell Sweeps MLPerf Training v6.0 as CoreWeave Trains DeepSeek-V3 671B in 2 Minutes



Image: NVIDIA Blackwell Ultra GPUs powering record-breaking MLPerf Training v6.0 results for massive AI models.

The MLPerf Training v6.0 results, released by the MLCommons consortium, underscore a pivotal shift in the AI infrastructure landscape: the industry is moving decisively toward massive-scale, mixture-of-experts (MoE) models, and NVIDIA's Blackwell platform has set a formidable new performance bar. NVIDIA achieved a clean sweep across every training benchmark, but the headline story belongs to the debut of new, real-world MoE workloads and the staggering speed at which they can now be trained on production cloud infrastructure.

The DeepSeek-V3 Moment: CoreWeave's 2.02-Minute Training Run

The most dramatic result from this round was the training performance on the newly introduced DeepSeek-V3 benchmark. DeepSeek-V3 is a large-scale MoE pretraining model with 671 billion total parameters, of which 37 billion (about 5.5%) are activated per token. Running on the same CoreWeave Cloud infrastructure available to customers, CoreWeave trained the model to the benchmark's quality target in just 2.02 minutes.

This record-setting run used 8,192 NVIDIA GB300 NVL72 GPUs across 2,048 nodes, the largest GB300 cluster submitted in the round. CoreWeave's results were the fastest across all Closed/Available-cloud submissions, achieved through full-stack optimizations across networking, orchestration, and storage layers. The workload scaled well, though short of linearly: it completed in 5.54 minutes on 2,048 GPUs, 3.09 minutes on 4,096 GPUs, and 2.02 minutes on 8,192 GPUs, so each doubling of GPU count cut time to train by roughly 1.5x to 1.8x.
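The three DeepSeek-V3 times quoted above can be turned into speedup and scaling-efficiency figures. The following is a minimal sketch, not part of the MLPerf results: the function name `scaling_efficiency` is hypothetical, and it assumes efficiency is simply measured speedup divided by the ideal speedup from adding GPUs. Time to train also includes fixed startup costs and convergence effects, so these numbers are only a rough proxy for throughput scaling.

```python
# Times to train (minutes) quoted in the article, keyed by GPU count.
runs = {2048: 5.54, 4096: 3.09, 8192: 2.02}


def scaling_efficiency(base_gpus, base_minutes, gpus, minutes):
    """Return (speedup, efficiency) of a run relative to a base run.

    speedup    = base time / run time
    efficiency = speedup / (gpus / base_gpus); 1.0 would be perfectly linear.
    """
    speedup = base_minutes / minutes
    ideal = gpus / base_gpus
    return speedup, speedup / ideal


base_gpus = min(runs)
for gpus in sorted(runs):
    speedup, eff = scaling_efficiency(base_gpus, runs[base_gpus], gpus, runs[gpus])
    print(f"{gpus:>5} GPUs: {runs[gpus]:.2f} min, "
          f"speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Relative to the 2,048-GPU run, this gives a speedup of about 1.79x at 4,096 GPUs (roughly 90% efficiency) and about 2.74x at 8,192 GPUs (roughly 69% efficiency).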

NVIDIA's Clean Sweep on Every Benchmark

NVIDIA was the only platform to submit results on every one of the seven benchmarks in this round, and it won them all, delivering the fastest time to train at scale and the highest performance on a per-accelerator basis. This continues a streak that has given NVIDIA nine times more cumulative MLPerf wins than all other submitters combined.

NVIDIA's GB300 NVL72 (Blackwell Ultra) system, which benefits from larger memory and power budgets, demonstrated substantially higher training throughput over the previous GB200 NVL72. At the same scale, the GB300 NVL72 delivered up to 1.6x faster training than its predecessor. Beyond raw hardware, NVIDIA's continuous software improvements played a critical part: training throughput for DeepSeek-V3 improved by 1.3x in just three months on identical hardware, driven by innovations like full-iteration CUDA graphs and CuTe DSL fusions.
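To read the 1.6x and 1.3x factors as wall-clock savings, a speedup of s on the same amount of work removes a fraction 1 - 1/s of the time. The sketch below applies that conversion. The helper name `time_saved_fraction` is hypothetical, and it assumes each factor is a ratio for identical work; the 1.6x figure is quoted as an upper bound, so the corresponding saving is a best case.

```python
def time_saved_fraction(speedup):
    """Fraction of wall-clock time removed by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Factors quoted in the article; labels are descriptive only.
factors = [
    ("GB300 NVL72 vs. GB200 NVL72 (up to)", 1.6),
    ("DeepSeek-V3 software gain over three months", 1.3),
]

for label, speedup in factors:
    print(f"{label}: {speedup}x -> {time_saved_fraction(speedup):.1%} less time")
```

Under these assumptions, 1.6x corresponds to 37.5% less time and 1.3x to about 23.1% less time for the same work.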

Key Training Times Across Benchmarks

The combination of the NVLink-switch domain and the scale-out Spectrum-X Ethernet fabric enabled NVIDIA's partners to achieve record times across all workloads:

Llama 3.1 8B Pretraining: 5.2 minutes
Llama 2 70B LoRA Fine-Tuning: 0.40 minutes
Image Generation (FLUX.1): 12.5 minutes
Recommender (DLRM-DCNv2): 0.71 minutes
Graph Neural Network (R-GAT): 0.84 minutes
Object Detection (RetinaNet): 1.4 minutes

GB300 NVL72 vs. GB200 NVL72: The Performance Leap

The generational uplift from Blackwell to Blackwell Ultra is pronounced. In inference workloads, the GB300 NVL72 delivers up to 2.77x higher token-per-second throughput compared to the GB200 NVL72. This leap is attributed not only to architectural silicon gains but also to the aggressive software-hardware co-design that has become a hallmark of the Blackwell platform. The combination of these two systems allows data center operators to choose configurations optimized for specific cost, throughput, and latency requirements.

New MoE Workloads Reflect Industry Reality

For the first time, MLPerf Training introduced benchmarks that mirror the architecture of the most advanced frontier models in production today. Alongside the massive DeepSeek-V3 671B workload, a smaller GPT-OSS-20B MoE pretraining benchmark was added. NVIDIA was the only platform to submit results on both new MoE workloads, using the GB300 NVL72 system with custom software stacks, CUDA graphs, and advanced MoE routing to optimize the sparse, bursty communication patterns inherent to these models.

Record Participation and a More Diverse Ecosystem

The v6.0 round saw its broadest participation yet, with 24 organizations submitting results across 95 distinct systems using 13 different hardware accelerators. The technical diversity extended beyond hardware to include multiple precision recipes, such as NVFP4 from NVIDIA and MXFP4 from AMD. AMD, in particular, demonstrated competitive performance with its Instinct MI355X GPU, coming within 5% of NVIDIA's B200 platform on Llama 2 70B fine-tuning and within 6% on Llama 3.1 8B pretraining, an important sign of a developing competitive landscape.

The Underlying Network: Scaling to 8,192 GPUs

The ability to scale a single training job to thousands of GPUs is critical for MoE models, which feature bursty all-to-all communication as tokens are dynamically routed to different experts. To handle this, NVIDIA's partners leveraged Spectrum-X Ethernet with Adaptive Routing and Congestion Control, sustaining near-theoretical fabric bandwidth across 8,192 Blackwell GPUs in hyperscale clusters. This networking stack is as vital as raw GPU compute in turning hardware specifications into benchmark-topping results and, ultimately, into faster training for real-world AI factories.
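To illustrate why dynamic routing produces uneven, bursty traffic, here is a toy top-k routing sketch in plain Python. It is an illustration under loud assumptions, not the router used in any submission: the function names, the random scores standing in for learned router logits, and the sizes (64 tokens, 8 experts, top 2) are all hypothetical and far smaller than a real MoE model.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, rng):
    """Assign each token to its top_k highest-scoring experts.

    Scores are random stand-ins for a learned router's logits.
    """
    assignments = []
    for _ in range(num_tokens):
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        assignments.append(ranked[:top_k])
    return assignments


def expert_load(assignments, num_experts):
    """Count how many token copies each expert receives."""
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts.get(e, 0) for e in range(num_experts)]


rng = random.Random(0)  # fixed seed so repeated runs print the same loads
for step in range(3):
    load = expert_load(route_tokens(64, 8, 2, rng), 8)
    mean = sum(load) / len(load)
    print(f"step {step}: tokens per expert = {load}, max/mean = {max(load) / mean:.2f}")
```

Because each token picks its own experts, the per-expert counts differ from step to step. When experts live on different GPUs, that imbalance becomes uneven all-to-all traffic, which is the pattern adaptive routing and congestion control are meant to absorb.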
