FlashDMoE Architecture

Updated 15 March 2026
  • FlashDMoE activates only 15B of its 309B parameters per token via sparse, top-k expert routing, enabling efficient scaling.
  • It employs a hybrid attention backbone combining sliding window and global attention, reducing compute complexity by up to 5.3× while supporting long-context processing.
  • The architecture integrates a monolithic GPU kernel and speculative decoding, yielding up to 2.6× faster inference with enhanced throughput and latency efficiency.

FlashDMoE designates a class of scalable Mixture-of-Experts (MoE) Transformer architectures and their associated GPU-optimized implementation frameworks, as exemplified by the 309B-parameter MiMo-V2-Flash model (Xiao et al., 6 Jan 2026) and the distributed systems advancements detailed in (Aimuyo et al., 5 Jun 2025). The architecture integrates high-parameter-count MoE backbones with efficient GPU pipelining, hybrid attention mechanisms, context-scaling techniques, multi-teacher distillation, and speculative inference optimizations. FlashDMoE achieves leading throughput, latency, and active parameter efficiency relative to similarly sized open-weight MoEs and demonstrates a pattern of hardware–software co-design for large-scale distributed model deployments.

1. Mixture-of-Experts Model Structure

FlashDMoE adopts a sparse-activated MoE Transformer layout with the following characteristics (Xiao et al., 6 Jan 2026):

  • Parameterization: 309 billion total parameters with only 15 billion active per token.
  • Layer Structure: 48 transformer layers, segmented into 39 employing Sliding Window Attention (SWA) and 9 employing Global Attention (GA). Each layer contains 256 experts, with Top-8 selection per token.
  • Routing Mechanism: Each token’s hidden state $x \in \mathbb{R}^d$ is routed via a learned projection $W_g \in \mathbb{R}^{E \times d}$, a softmax, and Top-K selection:

\mathrm{Gate}(x) = \mathrm{softmax}(W_g x) \in \mathbb{R}^{E}, \quad \mathcal{E}(x) = \mathrm{TopK}(\mathrm{Gate}(x)), \quad k = 8.

Assignments $g_{i,e}$ denote token $i$’s routing proportion to each selected expert $e$.

  • Expert Load Balancing: To prevent collapse and promote balanced expert utilization, an auxiliary loss $\mathcal{L}_{\mathrm{aux}} = \lambda \sum_{e=1}^{E} \left( \bar{g}_e - \tfrac{1}{E} \right)^2$ with $\lambda = 10^{-5}$ is applied, where $\bar{g}_e$ is the average gate value for expert $e$ over the batch.
  • Expert Capacity and Scaling: Each expert processes $\sim N \cdot k / E$ tokens per batch, guaranteeing constant sparsity and allowing near-linear scale-up in parameter and expert count while maintaining activation sparsity at $k/E \approx 3\%$.
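
The routing just described can be sketched compactly. The following PyTorch snippet is an illustrative sketch rather than the released implementation; the function and variable names (`route_tokens`, `w_g`, `aux_lambda`) are assumptions, while the shapes, Top-8 selection, and auxiliary-loss form follow the definitions above.

```python
import torch
import torch.nn.functional as F

def route_tokens(x, w_g, k=8, aux_lambda=1e-5):
    """x: (N, d) token hidden states; w_g: (E, d) gating projection."""
    logits = x @ w_g.t()                         # (N, E) router logits
    gate = F.softmax(logits, dim=-1)             # Gate(x) in R^E per token
    topk_vals, topk_idx = gate.topk(k, dim=-1)   # Top-8 expert selection

    # g_{i,e}: proportion of token i sent to each selected expert e,
    # renormalized over the selected experts so the weights sum to 1.
    weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)

    # Auxiliary load-balancing loss: penalize deviation of the batch-mean
    # gate value per expert from the uniform value 1/E.
    num_experts = gate.size(-1)
    mean_gate = gate.mean(dim=0)                 # \bar{g}_e over the batch
    aux_loss = aux_lambda * ((mean_gate - 1.0 / num_experts) ** 2).sum()
    return topk_idx, weights, aux_loss

# Toy usage: 16 tokens, d=32, E=256 experts -> k/E ~ 3% activation sparsity.
x = torch.randn(16, 32)
w_g = torch.randn(256, 32)
idx, w, loss = route_tokens(x, w_g)
```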

2. Hybrid Attention Backbone

FlashDMoE’s backbone interleaves local and global attention to efficiently scale to long contexts (Xiao et al., 6 Jan 2026):

  • Sliding Window Attention (SWA): Each SWA block restricts attention to a window of $W = 128$ tokens, costing $O(nW)$ per layer for sequence length $n$.
  • Global Attention (GA): Incorporated in a 5:1 ratio; every sixth layer employs GA for full-sequence modeling.
  • Overall Complexity: Compared to full self-attention’s $O(48\,n^2)$ per input, the hybrid SWA/GA stack costs $O(9\,n^2 + 39\,nW)$, yielding up to a 5.3× reduction in compute and key-value cache memory overhead.
  • Learnable Sink Bias: A learnable bias term is added to the SWA softmax denominator, letting individual heads place little or no weight on the tokens in their window and improving robustness to windowing:

s_{ij} = \frac{\exp(a_{ij} - m_i)}{\exp(\mathrm{sink} - m_i) + \sum_{j'} \exp(a_{ij'} - m_i)},

where $a_{ij} = q_i k_j^\top / \sqrt{d}$ and $m_i = \max\left(\max_j a_{ij}, \mathrm{sink}\right)$.
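
A minimal sketch of these sink-biased attention weights follows, assuming a simple boolean window mask and omitting batch and head dimensions; the function name and the way the sink scalar is passed in are illustrative, not the paper’s implementation.

```python
import torch

def sink_softmax_attention(q, k, v, window_mask, sink):
    """q, k, v: (n, d); window_mask: (n, n) bool, True where attention is allowed;
    sink: scalar bias entering the softmax denominator (learnable in the model)."""
    d = q.size(-1)
    scores = (q @ k.t()) / d ** 0.5                        # a_ij = q_i k_j^T / sqrt(d)
    scores = scores.masked_fill(~window_mask, float("-inf"))
    m = torch.maximum(scores.max(dim=-1).values,
                      torch.as_tensor(sink, dtype=scores.dtype))  # m_i
    exp_scores = torch.exp(scores - m[:, None])
    denom = torch.exp(sink - m) + exp_scores.sum(dim=-1)   # extra sink term
    weights = exp_scores / denom[:, None]                  # rows may sum to < 1
    return weights @ v

# Toy usage: width-4 causal sliding window on an 8-token sequence.
n, d, w = 8, 16, 4
offs = torch.arange(n)[:, None] - torch.arange(n)[None, :]
mask = (offs >= 0) & (offs < w)
out = sink_softmax_attention(torch.randn(n, d), torch.randn(n, d),
                             torch.randn(n, d), mask, sink=0.0)
```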

3. Pre-training and Context Extension

  • Multi-Token Prediction (MTP): MTP attaches $K$ dense prediction heads, with head $k$ forecasting the token $k$ steps ahead of the current hidden state $h_t$ (a sketch follows this list). The loss

\mathcal{L}_{\mathrm{MTP}} = -\frac{1}{K} \sum_{k=1}^{K} \log p(y_{t+k} \mid h_t)

is used throughout pre-training and adapted for joint prediction in post-training.

  • Context Length Regime: The model is pre-trained natively up to sequence length 32,768, with rotary position encoding (RoPE) base frequencies set differently for GA ($640$K) and SWA ($10$K) blocks. Subsequent finetuning extends to length 262,144, using RoPE base $5$M and position interpolation for stability.
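
As referenced above, the MTP loss can be sketched as follows; the head/target alignment and tensor layout are assumptions made for illustration, and only the averaged cross-entropy over $K$ heads mirrors the formula.

```python
import torch
import torch.nn.functional as F

def mtp_loss(head_logits, targets):
    """head_logits: list of K tensors of shape (T, vocab); the k-th head
    (1-indexed) predicts the token k steps ahead of each of the T positions.
    targets: (T + K,) token ids, with targets[t] the token at position t."""
    K = len(head_logits)
    T = head_logits[0].size(0)
    losses = []
    for k, logits in enumerate(head_logits, start=1):
        # Head k at position t predicts y_{t+k}: align logits[t] with targets[t+k].
        losses.append(F.cross_entropy(logits, targets[k : k + T]))
    return sum(losses) / K      # average over the K heads, as in L_MTP

# Toy usage: K=3 heads, T=10 positions, vocabulary of 100 tokens.
heads = [torch.randn(10, 100) for _ in range(3)]
tgt = torch.randint(0, 100, (13,))
loss = mtp_loss(heads, tgt)
```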

4. Multi-Teacher On-Policy Distillation (MOPD)

The post-training paradigm is structured as follows (Xiao et al., 6 Jan 2026):

  • Distillation Pipeline:

    1. Supervised Fine-Tuning (SFT) using instruction–response data.
    2. Specialized RL/SFT teachers target individual domains (mathematics, coding, search, etc.).
    3. The student policy $\pi_\theta$ generates samples, with token-level KL-based rewards from the relevant teacher $\pi_{\mathrm{domain}}$.

  • Distillation Objective:

\mathcal{L}_{\mathrm{MOPD}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D},\, y \sim \mu_\theta} \sum_{t=1}^{|y|} w_t\, \hat{A}_t \log \pi_\theta(y_t \mid x, y_{<t})

with

\hat{A}_t = \mathrm{sg}\!\left[ \log \frac{\pi_{\mathrm{domain}}}{\pi_\theta} \right] + \alpha\, \hat{A}_{\mathrm{ORM}}

Token-level rewards circumvent sample inefficiency and mitigate trade-offs between expertise domains, while teachers can be added or removed without full retraining (see the sketch after this section’s bullets).

  • Motivations: The regime is designed to eliminate dataset re-generation, maintain stable multi-domain learning via token-level feedback, and support efficient, scalable, and modular teacher–student coevolution.
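
The sketch below illustrates the token-level MOPD objective under simplifying assumptions: per-token log-probabilities are taken as precomputed tensors, the token weights $w_t$ and $\alpha$ are placeholders, and the stop-gradient $\mathrm{sg}[\cdot]$ is realized with `.detach()`.

```python
import torch

def mopd_loss(student_logp, teacher_logp, orm_advantage, token_weights, alpha=1.0):
    """All inputs are (T,) tensors over the sampled response y:
    student_logp[t] = log pi_theta(y_t | x, y_<t)
    teacher_logp[t] = log pi_domain(y_t | x, y_<t)
    orm_advantage   = outcome-reward advantage, broadcast per token."""
    # Token-level KL-style reward from the domain teacher; stop-gradient (sg[.])
    # so the advantage is treated as a constant when differentiating.
    advantage = (teacher_logp - student_logp).detach() + alpha * orm_advantage
    # Policy-gradient style loss: weighted negative log-likelihood scaled by A_t.
    return -(token_weights * advantage * student_logp).sum()

# Toy usage on a 5-token sampled response.
T = 5
loss = mopd_loss(student_logp=torch.randn(T, requires_grad=True),
                 teacher_logp=torch.randn(T),
                 orm_advantage=torch.zeros(T),
                 token_weights=torch.ones(T))
loss.backward()
```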

5. GPU Kernel Implementation and Distributed Compute

The “FlashDMoE” implementation addresses prevailing MoE deployment bottlenecks (Aimuyo et al., 5 Jun 2025):

  • Monolithic Kernel: The entire forward pass (gating, dispatch, expert FFNs, and combine) executes within a single persistent CUDA kernel per layer, eliminating repeated kernel launches and host-initiated communication.
  • Kernel Design:
    • Pipelines: Three concurrent GPU-resident actors run within the kernel:
      • Subscriber: fetches and decodes incoming token tile packets.
      • Processor: N–1 thread blocks per GPU draw and process tile-level tasks via in-kernel GEMMs and activations.
      • Scheduler: a single block (warp) assigns tasks using shared-memory queues.
    • Thread Structure: Each processor block operates on a (128×64) tile per GEMM; the scheduler (admin) block handles task coordination.
    • Queueing: All intra-kernel handoff occurs via shared memory or NVSHMEM atomics, with no CPU or NCCL intervention after kernel launch.
  • Inter-GPU Communication: Implements device-initiated, one-sided (R)DMA via NVSHMEM, removing the global barriers associated with AllToAll and substantially raising payload efficiency. Tile-aligned symmetric tensor buffers encode data for dispatch/combine rounds with in-place local buffer padding; only active token tiles are transmitted (see the dispatch-accounting sketch after this list).
  • Performance Effects:
    • Achieves up to 9× higher GPU utilization, 6× lower latency, and 5.7× higher throughput (17.7M tokens/sec on 8 H100s at S=16K, E=128, top-2 routing, capacity factor 1.0) relative to prior frameworks running in FP16, despite FlashDMoE running in FP32.
    • Overlap efficiency remains high ($O_e(4) \approx 0.85$, $O_e(8) \approx 0.83$), and SM utilization stays flat as expert count increases.
    • Marginal memory overhead: for $S = 16$K, $E = 128$, the total is $\sim 0.5$ GB per GPU.
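
To make the dispatch description concrete, the plain-Python sketch below only counts which 128-row token tiles would be sent per (GPU, expert) pair; it models none of the actual NVSHMEM transfers, queues, or kernels, and all names are illustrative.

```python
import math
import random
from collections import defaultdict

TILE_TOKENS = 128   # rows of one dispatch tile, matching the (128x64) GEMM tile

def plan_dispatch(token_expert_assignments, expert_to_gpu):
    """token_expert_assignments: list of (token_id, expert_id) pairs from the router;
    expert_to_gpu: dict mapping expert_id -> owning GPU rank.
    Returns tile counts per (gpu, expert); the last partial tile is padded."""
    tokens_per_expert = defaultdict(int)
    for _, expert in token_expert_assignments:
        tokens_per_expert[expert] += 1

    tiles = defaultdict(int)
    for expert, count in tokens_per_expert.items():
        # Only experts that actually received tokens generate traffic; inactive
        # experts contribute zero tiles, so there is no dense AllToAll payload.
        tiles[(expert_to_gpu[expert], expert)] = math.ceil(count / TILE_TOKENS)
    return dict(tiles)

# Toy usage: 512 tokens, top-2 routing over 8 experts placed on 2 GPUs.
assignments = [(t, e) for t in range(512) for e in random.sample(range(8), 2)]
print(plan_dispatch(assignments, expert_to_gpu={e: e % 2 for e in range(8)}))
```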

6. Inference Optimization through Speculative Decoding

FlashDMoE leverages the Multi-Token Prediction (MTP) block as a lightweight speculative “draft” model during inference (Xiao et al., 6 Jan 2026):

  • Procedure: At each step,

    1. The MTP “draft” model proposes $K$ tokens.
    2. The full FlashDMoE model validates these sequentially, accepting the longest matching prefix of length $\ell \leq K$.
    3. Unaccepted tokens are rolled back, and the process repeats.
  • Efficiency gains: This scheme yields a mean acceptance length of $\approx 3.6$ tokens and up to a 2.6× overall decoding speedup with a three-layer MTP configuration (see the sketch below).
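
A schematic draft-and-verify loop is shown below, with `draft_k` and `verify` as stand-ins for the MTP heads and the full model; the acceptance rule shown (longest matching prefix plus one verified token) is the generic speculative-decoding recipe, simplified for illustration.

```python
def speculative_decode(prompt, draft_k, verify, max_new=64):
    """draft_k(seq) -> list of K proposed next tokens;
    verify(seq, proposals) -> (length of longest accepted prefix <= K,
                               the full model's token after that prefix)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        proposals = draft_k(seq)                     # 1. MTP draft proposes K tokens
        accepted, next_tok = verify(seq, proposals)  # 2. full model validates prefix
        seq.extend(proposals[:accepted])             #    keep the accepted prefix
        seq.append(next_tok)                         # 3. drop the rest, emit one
        # verified token from the full model, then repeat.
    return seq

# Toy usage with dummy draft/verify callbacks that always agree.
out = speculative_decode([1, 2, 3],
                         draft_k=lambda s: [s[-1] + 1, s[-1] + 2, s[-1] + 3],
                         verify=lambda s, p: (len(p), p[-1] + 1),
                         max_new=8)
```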

7. Comparative Metrics and Open Access

FlashDMoE compares favorably to contemporary large-scale open-weight MoEs on reasoning and long-context performance (Xiao et al., 6 Jan 2026):

Model                      | Active Params | Total Params | MMLU-Pro | AIME-2025
FlashDMoE (MiMo-V2-Flash)  | 15 B          | 309 B        | 73.2     | 94.1
DeepSeek-V3.2              | 37 B          | 671 B        | 62.1     | 95.0
Kimi-K2                    | 32 B          | 1043 B       |          |

Despite using $1/2$ to $1/3$ as many active parameters as these peers, FlashDMoE matches or outperforms on MMLU-Pro and achieves leading long-context capabilities.

The architecture and weights are open-sourced, including standalone 3-layer MTP modules, in the MiMo-V2-Flash repository. Ongoing research is directed toward further scaling, dynamic hybrid window selection, and iterative improvement of the MOPD paradigm to refine cross-domain reasoning and agentic properties.
