MiMo-V2.5-Pro-FP4-DFlash Tames Massive MoE Models With Block Drafting

    
By vramkickedin | June 24, 2026 at 6:02 pm | 2 min read

MiMo-V2.5-Pro-FP4-DFlash is a new release that combines expert-only 4-bit quantization with block-diffusion speculative decoding to shrink both the memory footprint and the number of backbone forward passes needed during decoding. The model applies MXFP4 quantization exclusively to the mixture-of-experts (MoE) expert weights while keeping attention and other modules at higher precision, which keeps output quality nearly lossless. A lightweight DFlash drafter fills an entire block of tokens in one shot, removing the token-by-token drafting loop that slows conventional speculative decoding.
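For readers unfamiliar with the format, here is a minimal pure-Python sketch of MXFP4 block quantization. It assumes the standard Microscaling layout (32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number). The function names are hypothetical, and the rounding and kernels actually used in this release are not described here.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # values that share one scale in MXFP4

def quantize_block(block):
    """Quantize one block of BLOCK floats; returns (codes, exponent)."""
    amax = max(abs(v) for v in block)
    # Pick the shared exponent so the block maximum lands in [4, 8),
    # the top binade of E2M1. An all-zero block gets exponent 0.
    exponent = math.floor(math.log2(amax)) - 2 if amax > 0 else 0
    scale = 2.0 ** exponent
    codes = []
    for v in block:
        # Round the scaled magnitude to the nearest grid point;
        # anything above 6 saturates at 6.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return codes, exponent

def dequantize_block(codes, exponent):
    """Rebuild approximate floats from grid codes and the shared exponent."""
    return [c * 2.0 ** exponent for c in codes]
```

In an expert-only scheme like the one described, a routine of this kind would be applied to the expert weight matrices alone, while attention weights would skip it entirely.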

The Xiaomi MiMo Team (who previously released MiMo-V2.5-Pro) built this version to tackle the severe memory-bandwidth and compute costs that plague trillion-parameter inference. They trained the drafter using the Muon optimizer and self-distillation so it stays fast and accurate even with small block sizes. The FP4 backbone and drafter work together to cut the two dominant cost factors of large-model serving: per-parameter bit width and repeated backbone calls.

Expert-only quantization and block drafting

Key Features

Only MoE experts quantized to FP4 precision.
Lightweight DFlash drafter for block decoding.
Acceptance lengths up to 6.30 on WebDev.
Sliding window attention reduces draft compute cost.
Custom Muon optimizer and self-distillation training.
Supported in SGLang with speculative decoding flags.
Maintains near-lossless output versus FP8 baseline.
Designed for million-token contexts with constant cost.
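To put rough numbers on those two savings, the back-of-the-envelope sketch below makes two assumptions: MXFP4 stores one 8-bit shared scale per 32 four-bit values, and "acceptance length" means the average number of tokens committed per backbone verification pass. The helper names are hypothetical, and the 1,000-token workload is only an illustration.

```python
def mxfp4_bits_per_weight(block_size=32, element_bits=4, scale_bits=8):
    # Each block carries one shared scale on top of its 4-bit elements.
    return element_bits + scale_bits / block_size

def backbone_passes(new_tokens, acceptance_length):
    # Assumes every verification pass commits `acceptance_length`
    # tokens on average; plain decoding would need one pass per token.
    return new_tokens / acceptance_length

# Expert weight storage relative to an 8-bit format.
expert_ratio = mxfp4_bits_per_weight() / 8

# Verification passes for 1,000 new tokens at the reported WebDev figure.
passes = backbone_passes(1000, 6.30)

print(f"expert storage vs 8-bit: {expert_ratio:.3f}")
print(f"backbone passes per 1,000 tokens: {passes:.1f}")
```

Under these assumptions the expert weights take a little over half the space of an 8-bit copy, and the backbone runs roughly one pass for every six generated tokens on WebDev-style workloads. Real speedups also depend on drafter cost and batch size, which this estimate ignores.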

Teams running very large MoE models for code generation, general reasoning, or long-context tasks can expect lower latency and reduced hardware pressure. The FP4 backbone typically matches its FP8 predecessor and even slightly improves on it in some agent and code benchmarks. Deployment is straightforward through SGLang, where the drafter inherits the backbone’s parallel topology.

Deployment notes and accuracy trade-offs

The drafter relies entirely on sliding window attention and caps its mask block size at eight tokens, keeping verification overhead small while maintaining strong acceptance rates across programming and math tasks. Benchmarks show the FP4 model stays close to the FP8 baseline, with only minor regressions on challenging exams such as Humanity’s Last Exam. The team welcomes community feedback and provides a full deployment example using standard SGLang launch flags.
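The verification step that produces an acceptance length can be sketched as follows. This is the generic greedy rule used in speculative decoding, not necessarily the exact procedure DFlash or SGLang implements, and the function and argument names are hypothetical.

```python
def accept_block(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  tokens proposed by the drafter (eight at most here).
    target_tokens: the backbone's greedy choice at each position given the
                   prefix plus draft_tokens[:i]; one entry longer than the
                   draft, because the backbone also predicts the token
                   that follows the block.
    Returns the tokens committed by this single backbone pass.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)
    # The backbone's own token at the stopping position is always kept,
    # so each pass commits at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted

# The draft matches on two positions, then the backbone corrects the third.
print(accept_block([1, 2, 3, 4], [1, 2, 9, 4, 5]))
```

Averaging the length of the returned list over many passes gives a figure comparable to the acceptance lengths quoted above, assuming the reported numbers count the backbone's extra token as well.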