FlashDMoE Architecture
- FlashDMoE activates only 15B of its 309B parameters per token via sparse, top-k expert routing, enabling efficient scaling.
- It employs a hybrid attention backbone combining sliding-window and global attention, reducing attention compute and key-value-cache cost by up to 5.3× while supporting long-context processing.
- The architecture pairs a monolithic GPU kernel with speculative decoding, yielding up to 2.6× faster inference with improved throughput and latency.
FlashDMoE designates a class of scalable Mixture-of-Experts (MoE) Transformer architectures and their associated GPU-optimized implementation frameworks, as exemplified by the 309B-parameter MiMo-V2-Flash model (Xiao et al., 6 Jan 2026) and the distributed systems advancements detailed in (Aimuyo et al., 5 Jun 2025). The architecture integrates high-parameter-count MoE backbones with efficient GPU pipelining, hybrid attention mechanisms, context-scaling techniques, multi-teacher distillation, and speculative inference optimizations. FlashDMoE achieves leading throughput, latency, and active parameter efficiency relative to similarly sized open-weight MoEs and demonstrates a pattern of hardware–software co-design for large-scale distributed model deployments.
1. Mixture-of-Experts Model Structure
FlashDMoE adopts a sparse-activated MoE Transformer layout with the following characteristics (Xiao et al., 6 Jan 2026):
- Parameterization: 309 billion total parameters with only 15 billion active per token.
- Layer Structure: 48 transformer layers, segmented into 39 employing Sliding Window Attention (SWA) and 9 employing Global Attention (GA). Each layer contains 256 experts, with Top-8 selection per token.
- Routing Mechanism: Each token's hidden state $h_t$ is routed via a learned projection $W_r$, a softmax, and Top-K selection (a sketch follows this list):
$$p_t = \mathrm{softmax}(W_r h_t), \qquad \mathcal{E}_t = \mathrm{TopK}(p_t,\, k{=}8), \qquad g_{t,e} = \frac{p_{t,e}}{\sum_{e' \in \mathcal{E}_t} p_{t,e'}}, \quad e \in \mathcal{E}_t.$$
The gate values $g_{t,e}$ denote token $t$'s proportion assigned to each selected expert $e$.
- Expert Load Balancing: To prevent routing collapse and promote balanced expert utilization, an auxiliary loss of the standard form $\mathcal{L}_{\text{aux}} = \alpha \sum_{e=1}^{E} f_e\,\bar{P}_e$ is applied, where $f_e$ is the fraction of tokens routed to expert $e$ and $\bar{P}_e$ is the average gate value for expert $e$ over the batch.
- Expert Capacity and Scaling: Each expert processes on the order of $kT/E$ tokens per batch of $T$ tokens, keeping per-token sparsity constant and allowing near-linear scale-up in parameter and expert count while activation sparsity is maintained at roughly 15B/309B ≈ 5% of total parameters per token.
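A minimal PyTorch sketch of the routing and load-balancing computation described above; the tensor sizes, the loss coefficient `alpha`, and the renormalization of the top-k gates are illustrative assumptions rather than values taken from the report.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=8):
    """Top-k expert routing: softmax over a learned projection, then TopK.

    hidden:        (num_tokens, d_model) token hidden states
    router_weight: (num_experts, d_model) learned routing projection
    Returns per-token expert indices, normalized gate values, and the
    full softmax probabilities (used by the auxiliary loss).
    """
    logits = hidden @ router_weight.t()                 # (T, E)
    probs = F.softmax(logits, dim=-1)                   # routing distribution
    gate_vals, expert_idx = torch.topk(probs, k, dim=-1)
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # token's proportions
    return expert_idx, gate_vals, probs

def load_balance_loss(probs, expert_idx, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: fraction of tokens routed to each expert
    times the batch-average gate value per expert (alpha is an assumed coefficient)."""
    one_hot = F.one_hot(expert_idx, num_experts).sum(dim=1).float()  # (T, E)
    tokens_per_expert = one_hot.mean(dim=0)             # f_e
    mean_gate = probs.mean(dim=0)                       # \bar{P}_e
    return alpha * num_experts * torch.sum(tokens_per_expert * mean_gate)

# Toy usage: 16 tokens, 256 experts, top-8, model width 64 (illustrative sizes).
h = torch.randn(16, 64)
W_r = torch.randn(256, 64)
idx, gates, p = route_tokens(h, W_r)
aux = load_balance_loss(p, idx, num_experts=256)
```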
2. Hybrid Attention Backbone
FlashDMoE’s backbone interleaves local and global attention to efficiently scale to long contexts (Xiao et al., 6 Jan 2026):
- Sliding Window Attention (SWA): Each SWA block restricts attention to a local window of $w$ tokens per layer, with $w \ll n$, where $n$ is the sequence length.
- Global Attention (GA): Incorporated in a 5:1 ratio; every sixth layer employs GA for full-sequence modeling.
- Overall Complexity: Compared to full self-attention's $O(n^2)$ cost per input, the hybrid SWA/GA stack reduces the dominant cost to $O(nw)$ in the SWA layers, yielding up to a 5.3× reduction in both compute and key-value-cache memory overhead.
- Learnable Sink Bias: A learnable per-head bias is added to the SWA softmax denominator, allowing tokens to be effectively ignored by certain heads and promoting robustness to windowing (sketched below):
$$\mathrm{Attn}_h(q_i) = \sum_{j \in \mathcal{W}(i)} \frac{\exp\!\big(q_i^\top k_j / \sqrt{d}\big)}{\exp(b_h) + \sum_{j' \in \mathcal{W}(i)} \exp\!\big(q_i^\top k_{j'} / \sqrt{d}\big)}\, v_j,$$
where $b_h$ is the learnable sink bias of head $h$ and $\mathcal{W}(i)$ is the set of key positions inside token $i$'s window.
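A minimal PyTorch sketch of causal sliding-window attention with a learnable per-head sink bias added to the softmax denominator; the window size, head count, and tensor shapes are illustrative assumptions, not the model's actual configuration.

```python
import math
import torch

def sliding_window_attention(q, k, v, sink_bias, window=128):
    """Causal sliding-window attention with a learnable per-head sink bias.

    q, k, v:    (heads, seq, d_head)
    sink_bias:  (heads,) learnable scalar added to the softmax denominator,
                letting a head send probability mass to "nothing".
    window:     local attention window (illustrative value).
    """
    heads, seq, d = q.shape
    scores = q @ k.transpose(-1, -2) / math.sqrt(d)      # (H, S, S)
    # Causal + sliding-window mask: token i attends to j in (i - window, i].
    i = torch.arange(seq).unsqueeze(-1)
    j = torch.arange(seq).unsqueeze(0)
    mask = (j > i) | (j <= i - window)
    scores = scores.masked_fill(mask, float("-inf"))
    # Softmax with an extra exp(sink_bias) term in the denominator.
    row_max = scores.amax(dim=-1, keepdim=True)
    exp_scores = torch.exp(scores - row_max)
    denom = exp_scores.sum(dim=-1, keepdim=True) + torch.exp(
        sink_bias.view(-1, 1, 1) - row_max)
    weights = exp_scores / denom                         # rows may sum to < 1
    return weights @ v

# Toy usage: 4 heads, 16 tokens, 8-dim heads, window of 4.
q = torch.randn(4, 16, 8); k = torch.randn(4, 16, 8); v = torch.randn(4, 16, 8)
out = sliding_window_attention(q, k, v, sink_bias=torch.zeros(4), window=4)
```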
3. Pre-training and Context Extension
- Multi-Token Prediction (MTP): MTP attaches $D$ dense prediction heads, the $d$-th forecasting the token $d$ steps ahead of the current position from the hidden state $h_t$. The loss
$$\mathcal{L}_{\text{MTP}} = \sum_{d=1}^{D} \lambda_d \sum_{t} -\log p_\theta^{(d)}\!\left(x_{t+d} \mid h_t\right)$$
is used throughout pre-training and adapted for joint prediction in post-training (a sketch follows this list).
- Context Length Regime: The model is pre-trained natively up to sequence length 32,768, with rotary position encoding (RoPE) base frequencies set differently for GA (640K) and SWA (10K) blocks. Subsequent fine-tuning extends the context to 262,144 tokens, using a RoPE base of 5M and position interpolation for stability.
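A minimal PyTorch sketch of the multi-token-prediction loss over a stack of auxiliary heads; the number of heads, the per-head weighting, and the module shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mtp_loss(hidden, targets, heads, weights=None):
    """Multi-Token Prediction loss.

    hidden:  (seq, d_model) hidden states h_t
    targets: (seq,) token ids x_t
    heads:   list of D linear heads; head d (1-indexed) predicts the token
             d steps ahead of position t from h_t (depth and weighting are assumptions).
    """
    D = len(heads)
    weights = weights or [1.0 / D] * D
    total = hidden.new_zeros(())
    for d, head in enumerate(heads, start=1):
        logits = head(hidden[:-d])            # predictions for x_{t+d}
        total = total + weights[d - 1] * F.cross_entropy(logits, targets[d:])
    return total

# Toy usage: vocab 100, width 32, three MTP heads (sizes are illustrative).
seq, d_model, vocab = 12, 32, 100
h = torch.randn(seq, d_model)
x = torch.randint(0, vocab, (seq,))
mtp_heads = [torch.nn.Linear(d_model, vocab) for _ in range(3)]
loss = mtp_loss(h, x, mtp_heads)
```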
4. Multi-Teacher On-Policy Distillation (MOPD)
The post-training paradigm is structured as follows (Xiao et al., 6 Jan 2026):
- Distillation Pipeline:
  1. Supervised Fine-Tuning (SFT) on instruction–response data.
  2. Specialized RL/SFT teachers trained for target domains (mathematics, coding, search, etc.).
  3. The student policy generates samples and receives token-level KL-based rewards from the relevant domain teacher.
- Distillation Objective: the student policy $\pi_\theta$ is optimized on its own rollouts with a token-level KL-based reward against the selected teacher,
$$\mathcal{J}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\Big[\sum_t r_t\Big], \qquad r_t = -\,D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}, x)\,\big\|\,\pi_{\text{teacher}}(\cdot \mid y_{<t}, x)\big),$$
with $\pi_{\text{teacher}}$ chosen per prompt according to its domain (a reward sketch follows this list).
Token-level rewards circumvent sample inefficiency and mitigate trade-offs between expertise domains, while modularity in teacher addition/removal is preserved without full retraining.
- Motivations: The regime is designed to eliminate dataset re-generation, maintain stable multi-domain learning via token-level feedback, and support efficient, scalable, and modular teacher–student coevolution.
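A minimal PyTorch sketch of the token-level KL-based reward between student and teacher next-token distributions on a student-sampled rollout; the KL direction and reward sign follow the reconstructed objective above and should be read as assumptions.

```python
import torch
import torch.nn.functional as F

def token_level_kl_rewards(student_logits, teacher_logits):
    """Per-token reward r_t = -KL(student || teacher) over the vocabulary.

    student_logits, teacher_logits: (seq, vocab) logits evaluated on the same
    student-sampled rollout. Higher reward means the student already matches
    the domain teacher at that token.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)  # (seq,)
    return -kl

# Toy usage: 10-token rollout, vocab 50; the teacher is selected by prompt domain.
s = torch.randn(10, 50); t = torch.randn(10, 50)
rewards = token_level_kl_rewards(s, t)   # fed into the RL-style update of the student
```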
5. GPU Kernel Implementation and Distributed Compute
The “FlashDMoE” implementation addresses prevailing MoE deployment bottlenecks (Aimuyo et al., 5 Jun 2025):
- Monolithic Kernel: The entire MoE forward pass—gate, dispatch, expert FFNs, and combine—executes within a single persistent CUDA kernel per layer, eliminating repeated kernel launches and host-initiated communication.
- Kernel Design:
- Pipelines: Three concurrent GPU-resident actors:
- Subscriber fetches and decodes incoming token tile packets.
- Processor (N–1 thread blocks per GPU) draws and processes tile-level tasks via in-kernel GEMMs and activation.
- Scheduler block (warp) assigns tasks using shared memory queues.
- Thread Structure: Each processor block computes a 128×64 GEMM tile; a dedicated administrative (scheduler) block handles task coordination.
- Queueing: All intra-kernel handoff via shared memory or NVSHMEM atomics; no CPU or NCCL intervention after kernel launch.
- Inter-GPU Communication: Implements device-initiated, one-sided (R)DMA via NVSHMEM, removing the global barriers associated with AllToAll collectives and substantially raising payload efficiency. Tile-aligned symmetric tensor buffers encode data for the dispatch/combine rounds with in-place local buffer padding; only active token tiles are transmitted (see the packing sketch after this list).
- Performance Effects:
- Achieves substantially higher GPU utilization, lower latency, and higher throughput (17.7M tokens/sec on 8 H100s at S=16K, E=128, top-2 routing, capacity factor 1.0) than prior frameworks running in FP16, despite FlashDMoE itself running in FP32.
- Computation–communication overlap efficiency remains high, and SM utilization stays flat rather than degrading as the expert count increases.
- Marginal memory overhead: the symmetric communication buffers add roughly 0.5 GB per GPU in the evaluated configurations.
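The sketch below mimics, on the host side in PyTorch, the dispatch bookkeeping that FlashDMoE performs on-device: tokens are grouped per destination expert into tile-aligned, zero-padded buffers, and only tiles containing active tokens would be transmitted. The tile size, top-1 routing, and all names are illustrative assumptions; the real system performs this inside the persistent kernel via NVSHMEM.

```python
import torch

TILE_TOKENS = 128   # tile height used for illustration (the kernel computes 128x64 GEMM tiles)

def pack_dispatch_tiles(tokens, expert_idx, num_experts):
    """Group tokens by destination expert into tile-aligned, zero-padded buffers.

    tokens:     (T, d_model) activations
    expert_idx: (T,) destination expert per token (top-1 shown for simplicity)
    Returns {expert: list of (TILE_TOKENS, d_model) tiles}; experts that receive
    no tokens get no tiles, which is what lets the kernel skip inactive payloads.
    """
    tiles = {}
    for e in range(num_experts):
        rows = tokens[expert_idx == e]
        if rows.numel() == 0:
            continue                       # inactive expert: nothing is sent
        pad = (-rows.shape[0]) % TILE_TOKENS
        rows = torch.cat([rows, rows.new_zeros(pad, rows.shape[1])])
        tiles[e] = list(rows.split(TILE_TOKENS))
    return tiles

# Toy usage: 300 tokens, 8 experts; most experts receive a single padded tile.
x = torch.randn(300, 64)
dest = torch.randint(0, 8, (300,))
per_expert_tiles = pack_dispatch_tiles(x, dest, num_experts=8)
```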
6. Inference Optimization through Speculative Decoding
FlashDMoE leverages the Multi-Token Prediction (MTP) block as a lightweight speculative “draft” model during inference (Xiao et al., 6 Jan 2026):
- Procedure: At each step,
- The MTP “draft” model proposes a short block of candidate tokens.
- The full FlashDMoE model validates these, accepting the longest prefix that matches its own predictions.
- Unaccepted tokens are rolled back, and the process repeats.
Efficiency gains: this recasting of MTP as a draft model yields a high mean acceptance length and a substantial end-to-end decoding speedup with a three-layer MTP configuration (a minimal sketch of the draft-and-verify loop follows).
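A minimal sketch of the draft-and-verify loop under greedy acceptance: the draft proposes a block of tokens, the full model scores the extended sequence once, and the longest prefix matching the full model's own greedy choices is accepted. The callables `full_model`/`draft_model`, the block size, and the greedy acceptance rule are assumptions; the report's acceptance criterion may differ.

```python
import torch

def speculative_decode(full_model, draft_model, prompt_ids, max_new=64, block=3):
    """Greedy speculative decoding with an MTP-style draft model.

    full_model / draft_model: callables mapping (1, seq) ids -> (1, seq, vocab) logits.
    block: number of draft tokens proposed per step (e.g. 3 for a 3-layer MTP).
    """
    ids = prompt_ids.clone()
    while ids.shape[1] - prompt_ids.shape[1] < max_new:
        # 1) Draft proposes `block` tokens autoregressively.
        draft = ids
        for _ in range(block):
            nxt = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        proposed = draft[:, ids.shape[1]:]                   # (1, block)
        # 2) One full-model pass scores every proposed position.
        logits = full_model(draft)[:, ids.shape[1] - 1:-1]   # predictions for proposed slots
        verified = logits.argmax(dim=-1)                     # (1, block)
        # 3) Accept the longest matching prefix, then append one verified token.
        match = (verified == proposed)[0]
        n_accept = int(match.long().cumprod(dim=0).sum())
        ids = torch.cat([ids, proposed[:, :n_accept],
                         verified[:, n_accept:n_accept + 1]], dim=1)
    return ids
```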
7. Comparative Metrics and Open Access
FlashDMoE compares favorably to contemporary large-scale open-weight MoEs on reasoning and long-context performance (Xiao et al., 6 Jan 2026):
| Model | Active Params | Total Params | MMLU-Pro | AIME-2025 |
|---|---|---|---|---|
| FlashDMoE (MiMo-V2-Flash) | 15 B | 309 B | 73.2 | 94.1 |
| DeepSeek-V3.2 | 37 B | 671 B | 62.1 | 95.0 |
| Kimi-K2 | 32 B | 1043 B | – | – |
Despite using one-half to one-third as many active parameters as these peers, FlashDMoE matches or outperforms them on MMLU-Pro and achieves leading long-context capabilities.
The architecture and weights are open-sourced, including standalone 3-layer MTP modules, in the MiMo-V2-Flash repository. Ongoing research targets further scaling, dynamic hybrid window selection, and iterative refinement of the MOPD paradigm to improve cross-domain reasoning and agentic capabilities.
References
- Xiao et al., “MiMo-V2-Flash Technical Report,” 6 Jan 2026.
- Aimuyo et al., “FlashDMoE: Fast Distributed MoE in a Single Kernel,” 5 Jun 2025.