MiMo-V2-Flash: Sparse MoE Transformer
- The paper introduces MiMo-V2-Flash, a 309B-parameter sparse MoE transformer achieving fast reasoning and efficient inference with 15B active parameters per token.
- It employs a hybrid attention mechanism that interleaves sliding-window and global attention, optimizing long-context processing while reducing computational overhead.
- Advanced pretraining with Multi-Token Prediction and Multi-Teacher On-Policy Distillation underpins its competitive performance on reasoning benchmarks.
MiMo-V2-Flash is a large Mixture-of-Experts (MoE) transformer model focused on fast reasoning, efficient inference, and agentic capabilities, featuring 309 billion total parameters with 15 billion active parameters per token. Its architecture integrates a high degree of sparsity and a hybrid attention scheme interleaving sliding-window and global attention to enable efficient operation on long contexts. The model is pretrained on 27 trillion tokens, utilizes Multi-Token Prediction (MTP) objectives, and employs Multi-Teacher On-Policy Distillation (MOPD) post-training to consolidate expertise across diverse domains. Empirical benchmarks show MiMo-V2-Flash matching or surpassing models with much higher parameter counts, with significant throughput advantages enabled by speculative decoding using the MTP subnetwork (Xiao et al., 6 Jan 2026).
1. Architectural Design and MoE Composition
MiMo-V2-Flash is structured around a 48-layer Transformer backbone. The architecture is augmented with a sparsely-gated Mixture-of-Experts (MoE) feed-forward network in all layers except the first. Each MoE layer contains 256 experts, of which 8 are activated per token, yielding 15B active parameters per token out of 309B total model parameters.
The routing mechanism selects an expert subset per input position. For each input hidden state $h_t$, the router determines expert probabilities via a learned routing matrix $W_r$ held in FP32:

$$p_{t,e} = \mathrm{softmax}(W_r h_t)_e, \qquad e = 1, \dots, 256$$

The top-8 experts by $p_{t,e}$ are selected, with auxiliary losses ensuring load balancing across experts.
Each expert employs a dense feed-forward module. Hybrid block organization consists of eight major blocks, with each block composed of five SWA+MoE layers followed by a single GA+MoE layer, except the initial block, which uses GA plus a dense FFN. Table 1 from the report summarizes this as 39 SWA and 9 GA layers in total, with MoE present in all except the first dense-only layer (Xiao et al., 6 Jan 2026).
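The top-8-of-256 routing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names, the softmax-then-top-k ordering, and the renormalization of gate weights are assumptions.

```python
import numpy as np

def route_top8(hidden, w_router, k=8):
    """Illustrative top-k MoE routing.

    hidden: (d,) token hidden state; w_router: (num_experts, d).
    The router matmul is done in FP32, as the report specifies.
    """
    logits = w_router.astype(np.float32) @ hidden.astype(np.float32)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over the 256 experts
    top = np.argsort(probs)[-k:][::-1]     # indices of the k highest-probability experts
    gates = probs[top] / probs[top].sum()  # renormalized gate weights (an assumed convention)
    return top, gates

rng = np.random.default_rng(0)
experts, gates = route_top8(rng.normal(size=4096), rng.normal(size=(256, 4096)))
```

Each selected expert's output would then be combined with weight `gates[j]`, while the remaining 248 experts are skipped entirely, which is the source of the 15B-of-309B active-parameter ratio.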
2. Hybrid Attention Mechanism
MiMo-V2-Flash's attention paradigm is a hybrid of Sliding-Window Attention (SWA) and Global Attention (GA), interleaved in a 5:1 ratio (39 SWA:9 GA). This structure aims to optimize scalability and reduce quadratic compute inherent in full attention while preserving representation quality for long-context reasoning.
- SWA employs a window of size $W$, such that at position $i$, attention is restricted to positions $j$ with $|i-j| \leq W/2$. An attention sink bias is introduced, incorporated into the attention calculation:
\begin{align*}
a_{ij} &= \frac{q_i k_j^\top}{\sqrt{d}}, \quad |i-j| \leq W/2 \\
m_i &= \max\big(\max_j a_{ij},\ \text{sink}\big) \\
s_{ij} &= \frac{\exp(a_{ij} - m_i)}{\exp(\text{sink} - m_i) + \sum_j \exp(a_{ij} - m_i)} \\
o_i &= \sum_j s_{ij} v_j
\end{align*}
- GA applies identical computation over the full context, retaining the sink bias.
- Hybridization allows blocks 2–8 to consist of five SWA layers followed by one GA, with the first block using GA to stabilize early layer representations.
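The sink-bias softmax above can be sketched numerically. This is a simplified single-query illustration under stated assumptions: it uses a causal window (generation-time form) rather than the symmetric $|i-j| \leq W/2$ window, and all names are hypothetical.

```python
import numpy as np

def swa_sink_attention(q, K, V, i, window, sink=0.0):
    """Sliding-window attention for query position i with an attention-sink bias.

    q: (d,); K, V: (T, d). The sink term contributes probability mass to
    the softmax denominator but attends to no value vector.
    """
    d = q.shape[-1]
    lo = max(0, i - window + 1)           # causal window of `window` positions (assumed form)
    a = K[lo:i + 1] @ q / np.sqrt(d)      # scores a_ij within the window
    m = max(float(a.max()), sink)         # stabilizing max includes the sink logit
    w = np.exp(a - m)
    denom = np.exp(sink - m) + w.sum()    # sink adds mass only to the normalizer
    return (w / denom) @ V[lo:i + 1]
```

Because the sink inflates the denominator, the attention weights sum to slightly less than one; this lets the layer "attend to nothing" when no key in the window is relevant.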
This configuration achieves a near six-fold reduction in key-value cache storage and compute for long contexts, with minimal observed performance loss (Xiao et al., 6 Jan 2026).
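The cache-reduction factor follows from simple arithmetic: SWA layers cache only a window of keys/values while GA layers cache the full context. A back-of-envelope check, assuming a hypothetical window of 4096 tokens (the summary does not state $W$):

```python
# Per-head KV-cache positions: 39 SWA + 9 GA hybrid vs. an all-GA 48-layer baseline.
# W = 4096 is an illustrative assumption; the exact factor depends on the real window.
T, W = 262_144, 4_096                  # 256K context, assumed SWA window
all_ga = 48 * T                        # all-global baseline: every layer caches T positions
hybrid = 39 * min(W, T) + 9 * T        # SWA layers cache only the last W positions
print(round(all_ga / hybrid, 2))       # roughly 5x under these assumptions
```

At long contexts the GA layers dominate the hybrid cache, so the ratio approaches $48/9 \approx 5.3$; the report's "near six-fold" figure is consistent with a smaller window or with per-layer head-count differences.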
3. Pretraining and Context Extension via Multi-Token Prediction (MTP)
Pretraining is conducted using a Multi-Token Prediction (MTP) objective over 27T tokens. Initially, pretraining utilizes a context length of 32K tokens, later extended to 256K during continued training with updated RoPE frequencies.
The MTP loss augments the standard next-token loss with $K$-step parallel future prediction:

$$\mathcal{L} = \mathcal{L}_{\text{NTP}} + \lambda \sum_{k=1}^{K} \mathcal{L}^{(k)}_{\text{MTP}},$$

where $\mathcal{L}^{(k)}_{\text{MTP}}$ is the cross-entropy of predicting the token $k$ positions beyond the next token, and $\lambda$ weights the auxiliary terms. This approach improves computational efficiency and arithmetic intensity in both training and inference.
During the extension phase (after 26T tokens), context is expanded from 32K to 256K tokens, with RoPE base frequencies adjusted accordingly. The adaptation to long sequences is achieved by a brief continuation of training at the longer sequence length (Xiao et al., 6 Jan 2026).
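Adjusting RoPE base frequencies for a longer context can be illustrated as below. The base values (10,000 before extension, 1,000,000 after) are illustrative assumptions; the summary does not state the actual frequencies used.

```python
import numpy as np

def rope_inv_freq(dim=64, base=10_000.0):
    """Per-pair inverse frequencies for RoPE applied over `dim` dimensions.

    A larger base stretches the rotation wavelengths so that distant
    positions remain distinguishable at 256K-token contexts.
    """
    return base ** (-np.arange(0, dim, 2) / dim)

short = rope_inv_freq(base=10_000.0)      # assumed 32K-context pretraining phase
long_ = rope_inv_freq(base=1_000_000.0)   # assumed enlarged base for the 256K extension
```

Only the base changes between phases; the brief continued training at the longer sequence length lets the model adapt to the rescaled position signal.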
4. Multi-Teacher On-Policy Distillation (MOPD) Paradigm
Post-training is structured in three stages: supervised fine-tuning (SFT), reinforcement learning (RL) specialization via domain-expert teachers, and on-policy distillation. In the terminal stage, MOPD orchestrates synchronized policy transfer from multiple domain-specialized RL teachers to the student model.
The central loss takes the form of a reverse KL divergence between the student policy $\pi_\theta$ and a teacher policy $\pi_T$:

$$\mathcal{L}_{\text{MOPD}} = \mathbb{E}_{y \sim \pi_\theta}\big[\log \pi_\theta(y \mid x) - \log \pi_T(y \mid x)\big]$$

The implemented surrogate is an importance-weighted policy-gradient objective of the form:

$$\mathcal{L} = -\,\mathbb{E}\big[w_t \, \hat{A}_t \, \log \pi_\theta(a_t \mid s_t)\big],$$

where $w_t$ is the importance sampling weight and $\hat{A}_t$ combines the logits advantage with the reward-model advantage.
This procedure enables the model to assimilate domain-specialized knowledge densely, mitigating trade-offs common in sequential fine-tuning or naive parameter merging (Xiao et al., 6 Jan 2026).
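The direction of the divergence matters: reverse KL is mode-seeking, pulling the student onto behaviors the teacher assigns high probability on the student's own samples. A minimal per-token sketch (the importance weights and advantage terms of the actual surrogate are omitted; names are hypothetical):

```python
import numpy as np

def reverse_kl(student_logits, teacher_logits):
    """KL(pi_student || pi_theta_teacher) for one token position (categorical).

    Illustrates only the divergence's direction, not the paper's full
    importance-weighted MOPD surrogate.
    """
    def softmax(z):
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    # Expectation under the *student* distribution -- the "on-policy" part.
    return float((ps * (np.log(ps) - np.log(pt))).sum())
```

In MOPD the expectation is taken over rollouts sampled from the student, with the teacher chosen per domain, which is what lets multiple specialized teachers be distilled into one student without sequential-fine-tuning trade-offs.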
5. Inference Acceleration and Speculative Decoding via MTP
Inference leverages the MTP subnetwork for speculative decoding, wherein the MTP head proposes up to $K$ draft tokens per step. The output is then verified in parallel by the full model, which accepts a prefix of length $\ell \leq K$. The two principal metrics are:
- Acceptance length $\ell$: with 3 MTP layers, average acceptance length reaches up to 3.6 tokens in low-entropy contexts; the report fits $\ell$ empirically as a decreasing function of $H$, the next-token cross-entropy.
- Decoding speedup $S$: speedup scales nearly linearly with the acceptance length $\ell$. For input length 16K and output length 1K, 3-layer MTP achieves up to a 2.6× speedup, with batch size 64 among the reported operating points.
MTP blocks for draft decoding are intentionally lightweight (SWA-only, dense FFN, 0.33B params per block) to prevent new compute bottlenecks (Xiao et al., 6 Jan 2026).
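The draft-then-verify acceptance step can be sketched as below. This is the simplified greedy variant, an assumption: MiMo's actual sampling-based acceptance rule may differ.

```python
def accept_prefix(draft_tokens, verified_tokens):
    """Length of the accepted prefix in speculative decoding.

    Drafts proposed by the cheap MTP head are kept until the first
    position where the full model's parallel verification disagrees.
    """
    n = 0
    for d, v in zip(draft_tokens, verified_tokens):
        if d != v:
            break
        n += 1
    return n

# e.g. 3 draft tokens, full model agrees on the first two -> 2 accepted
assert accept_prefix([5, 9, 2], [5, 9, 7]) == 2
```

Since verification of all $K$ drafts costs one parallel forward pass of the full model, each accepted token beyond the first is nearly free, which is why speedup tracks $\ell$ almost linearly.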
6. Empirical Performance and Comparative Evaluation
MiMo-V2-Flash-Base (309B total; 15B active) is benchmarked against Kimi-K2-Base (1043B total; 32B active) and DeepSeek-V3.2-Exp-Base (671B total; 37B active). Results from key domains include:
| Benchmark | MiMo-V2-Flash | Kimi-K2 | DeepSeek-V3.2 |
|---|---|---|---|
| MMLU-Pro (5-shot) | 73.2 | 69.2 | 62.1 |
| GPQA-Diamond | 55.1 | 48.1 | 52.0 |
| AIME24/25 (2-shot) | 35.3 | 31.6 | 24.8 |
| NIAH-Multi (context @ 128K) | ≈100% | 99.5% | <99.5% |
| GSM-Infinite Hard (128K) | 29.0 | 8.8 | 25.7 |
Post-MOPD, MiMo-V2-Flash matches or slightly trails the largest competitors in post-training benchmarks:
- MMLU-Pro: 84.9 (MiMo) vs. 84.6 (Kimi-K2) vs. 85.0 (DeepSeek)
- AIME 2025: 94.1 (MiMo) vs. 94.5 (Kimi-K2) vs. 93.1 (DeepSeek)
- LiveCodeBench: 85.1 (MiMo) vs. 83.1 (Kimi-K2) vs. 83.3 (DeepSeek)
- SWE-Bench Verified: 73.4% (MiMo) vs. 71.3% (Kimi-K2) vs. 73.1% (DeepSeek)
Despite drastically fewer active parameters, MiMo-V2-Flash achieves comparable or superior efficiency and accuracy, especially on long-context retrieval and reasoning tasks, and shows 2–3× inference speedups on standard hardware (Xiao et al., 6 Jan 2026).
7. Key Implementation Details and Ablations
- Attention Sink Bias and Window Size: Ablations indicate that the chosen SWA window size combined with the sink bias outperforms both all-GA configurations and alternative window sizes on general, long-context, and reasoning tasks.
- Attention Head Structure: Each SWA block utilizes 64 query heads and 8 key/value heads (Grouped Query Attention); GA blocks use 64 query heads and 4 key/value heads.
- Position Encoding: Rotary embeddings (RoPE) are applied to the first 64 dimensions of Q/K vectors.
- Context Management: The operational context length is extended natively to 256K tokens.
- Open Research: Both the main model and 3-layer MTP weights are available for community use.
A plausible implication is that the aggressive sparsity schedule, backed by adaptive hybrid attention and efficient MOPD training, may serve as a reference design for scalable, high-performance LLMs constrained by training and inference budgets (Xiao et al., 6 Jan 2026).