The UltraSpeed mode is not a new model class but an engineering-driven serving mode layered on top of MiMo-V2.5-Pro, a 1.02-trillion-parameter Mixture-of-Experts architecture with 42 billion active parameters and a 1-million-token context window.
Xiaomi's official documentation describes a full-stack model-system co-design that combines three coordinated techniques to push throughput past 1,000 tokens/s: FP4 quantization of the expert layers, DFlash speculative decoding, and the TileRT runtime.
Only the Mixture-of-Experts (MoE) expert layers are quantized to FP4 precision, while all other layers retain their original precision. The FP4 weights reduce the model's memory footprint and bandwidth pressure, and quantization-aware training (QAT) is used with the goal of keeping quality near-lossless. This selective approach avoids degrading non-expert components that are more sensitive to precision loss.
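The selective scheme can be illustrated with a toy sketch. It assumes the E2M1 value grid (a common FP4 layout; the source does not name the format), a single per-tensor scale instead of the block scales real kernels typically use, and a hypothetical `.experts.` naming convention for expert weights. It is not Xiaomi's implementation, and it shows only the rounding step, not QAT itself.

```python
# ASSUMPTION: E2M1 magnitudes. The source only says "FP4", not which format.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(values):
    """Round each value to the nearest FP4 grid point under a per-tensor scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    # Map the largest magnitude in the tensor onto the largest grid value.
    scale = max_abs / FP4_GRID[-1]
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def quantize_experts_only(params, expert_marker=".experts."):
    """Quantize tensors whose name contains expert_marker; copy the rest as-is.

    `expert_marker` is a hypothetical naming convention used for illustration.
    """
    return {
        name: fake_quant_fp4(w) if expert_marker in name else list(w)
        for name, w in params.items()
    }


params = {
    "layers.0.attn.q_proj": [0.1, -0.2, 0.33],
    "layers.0.mlp.experts.3.up_proj": [6.0, -3.0, 0.4, 1.2],
}
quantized = quantize_experts_only(params)
```

In this example the attention weights pass through unchanged, while the expert weights snap to the nearest representable FP4 values.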
DFlash, the speculative decoding component, replaces traditional autoregressive draft generation with block-level masked parallel prediction. The draft model uses sliding-window attention (SWA) to keep prediction cost near constant rather than scaling with sequence length. The Muon optimizer and self-distillation are used to improve acceptance rates, which directly boosts inference throughput. In coding scenarios, reports indicate an average accepted length of around 6.30 tokens per verification step.
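To see why accepted length matters, here is a rough sketch of generic greedy verification in speculative decoding plus the resulting throughput arithmetic. It assumes throughput is approximately accepted tokens per step divided by time per step, and it does not model DFlash's masked block prediction. Whether the reported 6.30 figure includes the extra token the target model emits at each step is not stated, so the numbers are illustrative only.

```python
def accepted_prefix_len(draft_tokens, target_tokens):
    """Greedy verification: count leading draft tokens the target model agrees with."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def tokens_per_second(accepted_len, step_ms):
    """Throughput if every verification step yields accepted_len tokens."""
    return accepted_len / (step_ms / 1000.0)


def max_step_ms(accepted_len, target_tps):
    """Longest verification step (ms) that still reaches target_tps."""
    return accepted_len / target_tps * 1000.0


# A drafted block of 8 tokens where the target disagrees at position 6.
draft = [11, 42, 7, 7, 19, 3, 88, 5]
target = [11, 42, 7, 7, 19, 3, 90, 5]
accepted = accepted_prefix_len(draft, target)

# Step-time budget implied by the two figures quoted in the text.
budget_ms = max_step_ms(6.30, 1000.0)
```

Under these assumptions, an average of 6.30 accepted tokens per step leaves a budget of roughly 6.3 ms per draft-plus-verify step to sustain 1,000 tokens/s.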
The TileRT system abandons the conventional per-operator kernel launch model in favor of a persistent kernel engine in which the compute pipeline stays resident on the GPU. Full-pipeline prefetching overlaps data movement with computation, reducing idle GPU cycles. The system also assigns communication, data movement, and tensor computation to different warps with dedicated roles, effectively turning the GPU into a continuously flowing, heterogeneous execution system.
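The benefit of overlapping data movement with computation can be approximated with a simple timing model. This is a back-of-the-envelope sketch, not a description of TileRT internals: it assumes a fixed sequence of stages, one stage of prefetch lookahead, and made-up per-stage durations.

```python
def serial_time(load, compute):
    """Total time when each stage loads its data and then computes, with no overlap."""
    return sum(load) + sum(compute)


def prefetched_time(load, compute):
    """Total time when stage i+1's load runs concurrently with stage i's compute."""
    if len(load) != len(compute):
        raise ValueError("load and compute must have the same number of stages")
    if not load:
        return 0.0
    total = load[0]  # the first load cannot be hidden behind anything
    for i in range(len(compute) - 1):
        # The next stage starts once both the current compute and its own load finish.
        total += max(compute[i], load[i + 1])
    return total + compute[-1]


# Hypothetical stage durations in arbitrary time units.
load = [1, 1, 1, 1]
compute = [2, 2, 2, 2]
no_overlap = serial_time(load, compute)
with_prefetch = prefetched_time(load, compute)
```

When every load is shorter than the compute it overlaps with, only the first load remains visible in the total, which is the effect prefetching aims for.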
Input pricing follows the same 3× multiplier, with cache-hit input at $0.0108 per million tokens and cache-miss input at $1.305 per million tokens. Xiaomi markets this as "3× the price, 10× the output experience," emphasizing the roughly 10× throughput gain for 3× the token cost.
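As a quick worked example, the input prices above can be turned into a cost estimate. The sketch uses only the two input prices quoted in the text; output-token pricing is left out, and the base-mode prices are derived by dividing by the stated 3× multiplier rather than quoted from Xiaomi. The function names are illustrative.

```python
# UltraSpeed input prices in USD per million tokens, as quoted in the text.
ULTRASPEED_INPUT_USD_PER_M = {"cache_hit": 0.0108, "cache_miss": 1.305}
PRICE_MULTIPLIER = 3.0  # "3x the price", per the text


def input_cost_usd(cache_hit_tokens, cache_miss_tokens):
    """Estimated input cost for one workload; output tokens are not included."""
    prices = ULTRASPEED_INPUT_USD_PER_M
    return (
        cache_hit_tokens * prices["cache_hit"]
        + cache_miss_tokens * prices["cache_miss"]
    ) / 1_000_000


def implied_base_price(ultraspeed_price):
    """Base-mode price implied by the 3x multiplier (derived, not quoted)."""
    return ultraspeed_price / PRICE_MULTIPLIER


# Example: 800k cached input tokens plus 200k uncached input tokens.
example_cost = input_cost_usd(800_000, 200_000)
implied_miss = implied_base_price(ULTRASPEED_INPUT_USD_PER_M["cache_miss"])
```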
The UltraSpeed trial period is explicitly time-boxed: it runs from June 9 through 23:59 on June 23, 2026. Access is application-based due to limited high-speed inference resources, with priority given to enterprise and professional developer use cases.
Approved users receive a free chat experience during the two-week window, subject to fairness rules: a maximum of 10 successful queue entries per account per day, a 30-minute session limit, and automatic resource release after 5 minutes of idle time. Xiaomi does not guarantee review timeliness or approval rates.
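For client code that needs to anticipate these limits, the three fairness rules can be expressed as a small policy check. This is a hypothetical sketch: the constants come from the text, but the function names and the exact boundary behavior (for example, whether release happens at exactly 5 minutes) are assumptions, not Xiaomi's implementation.

```python
MAX_QUEUE_ENTRIES_PER_DAY = 10  # successful queue entries per account per day
SESSION_LIMIT_S = 30 * 60       # 30-minute session limit
IDLE_LIMIT_S = 5 * 60           # resources released after 5 idle minutes


def can_enter_queue(successful_entries_today):
    """True if the account still has queue entries left for the day."""
    return successful_entries_today < MAX_QUEUE_ENTRIES_PER_DAY


def should_release(now_s, session_start_s, last_activity_s):
    """True once the session limit or the idle limit has been reached.

    All arguments are timestamps in seconds on the same clock.
    """
    session_expired = now_s - session_start_s >= SESSION_LIMIT_S
    idle_expired = now_s - last_activity_s >= IDLE_LIMIT_S
    return session_expired or idle_expired
```

A client could call `should_release` before sending a request to decide whether it needs to rejoin the queue first.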
The underlying model, referred to as MiMo-V2.5-Pro-FP4-DFlash, was released as open source alongside the UltraSpeed announcement. The FP4-quantized weights and DFlash model checkpoints are available on HuggingFace, consistent with Xiaomi's documentation identifying FP4 quantization and DFlash speculative decoding as core system components.
The UltraSpeed mode demonstrates that trillion-parameter inference at interactive speeds can run on commodity infrastructure without custom silicon, a departure from the specialized-hardware approach seen elsewhere in the industry. For developers building latency-sensitive agentic applications, tool-calling pipelines, or real-time code generation, the combination of high throughput and a 1-million-token context window signals a practical path toward faster, more capable production systems, provided they can gain access during the limited trial window.