The MoE Multi-Token Prediction (MTP) layer is a specialized component that predicts several future tokens, improving overall language model performance.
It integrates with a mixture-of-experts architecture by leveraging gated routing, small transformer submodules, and a shared output head to manage predictions.
Empirical results indicate performance improvements of up to 1 point on benchmarks and higher multi-token acceptance rates in speculative decoding.
The MoE Multi-Token Prediction (MTP) layer is a specialized architectural component integrated into the SlimQwen model’s mixture-of-experts (MoE) backbone. Designed to augment conventional next-token prediction, the MTP layer enables direct supervision over a short span of future tokens at each position. This approach synergizes with knowledge distillation (KD) and pruning-based compression, enhancing model efficiency and language modeling performance, particularly in downstream knowledge-intensive benchmarks (Tang et al., 9 May 2026).
1. Architectural Integration of the MTP Layer
In SlimQwen, the transformer block processes each input token embedding $x_i$ through $L$ alternating Gated Attention (or Gated DeltaNet) and MoE sublayers. The MoE sublayer’s router computes
$$z(x) = \mathrm{softmax}(\mathrm{TopK}(x W_G, k)) \in \mathbb{R}^{n_{\mathrm{routed}}}$$
for routed experts, along with a shared-expert gate
$$z_s(x) = \sigma(x w_{\mathrm{sh}}) \in \mathbb{R}^{n_{\mathrm{shared}}}$$
Each expert employs a SwiGLU MLP, and the MoE output combines routed and shared expert results:
$$\mathrm{MoE}(x) = \sum_{e=1}^{n_{\mathrm{routed}}} z_e(x)\,\mathrm{Expert}_e(x) + \sum_{s=1}^{n_{\mathrm{shared}}} z_s(x)\,\mathrm{Expert}_s(x)$$
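The routing and combination rule above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not SlimQwen's implementation: the experts are scalar functions rather than SwiGLU MLPs, the router and shared-gate logits are passed in directly instead of being computed from $x W_G$ and $x w_{\mathrm{sh}}$, and the names `route_top_k` and `moe_output` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Logistic function used for the shared-expert gate."""
    return 1.0 / (1.0 + math.exp(-v))

def route_top_k(logits, k):
    """Softmax over the k largest router logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    probs = softmax([logits[e] for e in top])
    gates = [0.0] * len(logits)
    for e, p in zip(top, probs):
        gates[e] = p
    return gates

def moe_output(x, routed_experts, shared_experts, router_logits, shared_logits, k):
    """Gate-weighted sum of routed experts plus gate-weighted sum of shared experts."""
    z = route_top_k(router_logits, k)            # routed gates z_e(x)
    zs = [sigmoid(l) for l in shared_logits]     # shared gates z_s(x)
    routed = sum(g * f(x) for g, f in zip(z, routed_experts) if g > 0.0)
    shared = sum(g * f(x) for g, f in zip(zs, shared_experts))
    return routed + shared
```

Only the selected routed experts are evaluated, which is what makes the routed term sparse; the shared experts are always evaluated and scaled by their sigmoid gate.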
After passing through all layers, the architecture yields hidden states $h_{1:T}^{0} \in \mathbb{R}^{T \times d}$. The MTP module operates atop these representations, generalizing next-token prediction to simultaneous prediction of up to $D$ future tokens per position.
For each position $i$ and prediction depth $k = 1, \dots, D$, MTP:
Normalizes and concatenates $h_i^{k-1}$ with the embedding of the future token associated with depth $k$,
Projects the concatenated vector back to the hidden dimension $d$ via a learned projection,
Applies a small transformer block (unique per depth $k$),
Uses a shared linear “OutHead” projecting to the vocabulary size.
The process yields, for each depth $k$, a predicted distribution over the vocabulary at the corresponding future offset. At the lowest depth this reduces to traditional next-token prediction; at greater depths, multiple future tokens are predicted in parallel.
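The per-depth steps listed above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: both the hidden state and the token embedding are RMS-normalized before concatenation, the RMS normalization has no learned gain, the depth-specific transformer block is replaced by an arbitrary callable, and the names `mtp_step` and `mtp_forward` are hypothetical. It processes a single position with plain lists rather than batched tensors.

```python
import math

def rms_norm(v, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def matvec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(a - m) for a in values]
    total = sum(exps)
    return [e / total for e in exps]

def mtp_step(h_prev, tok_emb, proj, block, out_head):
    """One MTP depth: normalize, concatenate, project to d, apply block, predict."""
    concat = rms_norm(h_prev) + rms_norm(tok_emb)  # list concatenation, length 2d
    h_in = matvec(proj, concat)                    # (d x 2d) projection back to d
    h_k = block(h_in)                              # stand-in for the depth-k block
    probs = softmax(matvec(out_head, h_k))         # shared head over the vocabulary
    return h_k, probs

def mtp_forward(h0, tok_embs, projs, blocks, out_head):
    """Chain depths 1..D; depth k consumes the hidden state produced at depth k-1."""
    h, dists = h0, []
    for tok_emb, proj, block in zip(tok_embs, projs, blocks):
        h, probs = mtp_step(h, tok_emb, proj, block, out_head)
        dists.append(probs)
    return dists
```

Note that `out_head` is passed unchanged to every depth, mirroring the shared OutHead, while `projs` and `blocks` hold one entry per depth.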
2. Multi-Token Distillation and Training Objective
The training objective jointly balances the standard next-token LM loss, its distillation (KD) analog, and the corresponding MTP losses for all depths $k$ in $1, \dots, D$. For a sequence of length $T$ over the model's vocabulary, the losses are:
Next-token LM: the standard language modeling loss on the next-token prediction
Next-token KD: a distillation loss that matches the student's next-token distribution to the teacher's distribution
MTP LM: the LM loss applied to the MTP predictions at each depth
MTP KD: the KD loss applied to the MTP predictions at each depth
The total objective combines these terms with scheduled weights: one weight decays linearly and another cosine-decays over training.
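The two schedule shapes named above, a linear decay and a cosine decay, can be sketched as below. The endpoint values, the assignment of each weight to particular loss terms, and the function names are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math

def linear_decay(step, total_steps, start, end):
    """Interpolate linearly from start to end as step goes from 0 to total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def cosine_decay(step, total_steps, start, end):
    """Follow a half-cosine from start to end as step goes from 0 to total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * frac))

def total_loss(lm, kd, mtp_lm, mtp_kd, step, total_steps):
    """Combine four loss values with scheduled weights.

    Assumption: which weight multiplies which term, and the endpoints
    1.0 -> 0.1 and 0.3 -> 0.1, are placeholders for illustration only.
    """
    w_lin = linear_decay(step, total_steps, start=1.0, end=0.1)
    w_cos = cosine_decay(step, total_steps, start=0.3, end=0.1)
    return w_lin * lm + kd + w_cos * (mtp_lm + mtp_kd)
```

Both schedules clamp the progress fraction to the range 0 to 1, so steps past `total_steps` simply hold the final weight.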
3. Interaction with MoE Gating and Expert Selection
The MTP layer operates atop the MoE backbone but remains tightly coupled with the expert routing mechanism. Gradients arising from the MTP head propagate through the entire MoE network, including router gates. The partial derivative of the MTP-KD term with respect to the parameters of expert $e$ is modulated by the gating score $z_e(x)$, so the router is encouraged to send tokens to the experts whose selection improves future-token prediction. No new gating mechanism is introduced within MTP itself, as it leverages the MoE’s pre-existing soft-top-k routing protocol.
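The way gate scores scale expert gradients can be checked numerically on a toy model. The sketch below assumes linear scalar experts of the form `theta_e * x` and holds the gate scores fixed, so it shows only the expert-parameter path and not the gradient that flows into the router itself; the function names are hypothetical.

```python
def moe_scalar(x, thetas, gates):
    """Toy MoE with linear experts Expert_e(x) = theta_e * x and fixed gate scores."""
    return sum(g * t * x for g, t in zip(gates, thetas))

def grad_wrt_expert(x, thetas, gates, e, eps=1e-6):
    """Central finite-difference estimate of d MoE / d theta_e."""
    up = list(thetas)
    up[e] += eps
    down = list(thetas)
    down[e] -= eps
    return (moe_scalar(x, up, gates) - moe_scalar(x, down, gates)) / (2 * eps)
```

For this toy model the derivative with respect to `theta_e` equals `gates[e] * x`, so an expert with a zero gate score receives no gradient from that token, while a heavily weighted expert receives a proportionally larger update.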
4. MTP Forward and Backward Pass
The core stages in the MTP layer’s dataflow are summarized below (see the pseudocode section of Tang et al., 9 May 2026):
Forward pass:
Initialize the input representations from token embeddings.
Iterate through the $L$ transformer blocks, with router gating, to obtain $h_{1:T}^{0}$.
For each prediction depth $k$ in $1, \dots, D$, concatenate the RMS-normalized $h^{k-1}$ with the token embedding for that depth; project and pass through the depth-$k$ transformer block.
Compute the depth-$k$ predicted distribution via the shared OutHead.
Aggregate the four loss terms (next-token LM, next-token KD, MTP LM, MTP KD), then compute the overall loss as above.
Backward pass: Compute gradients via autodiff and update the model parameters using the chosen optimizer.
All computational steps are efficiently vectorized over batch and sequence. MTP introduces only $D$ small transformer submodules, which do not share parameters across depths, and one shared OutHead.
5. Empirical Gains from MTP Distillation
Empirical results in SlimQwen show consistent improvements from MTP distillation. In comparisons across training objectives on the 23A2B model trained for 120B tokens, combining MTP with KD and next-token LM losses produces gains of up to 1 point on knowledge-intensive benchmarks.
In speculative decoding, a protocol where the MTP head drafts multiple future tokens and the main backbone verifies them, MTP KD boosts multi-token acceptance rates. For instance, in the pretraining setting (GSM8K), two-token acceptance (“acc_2”) is higher with MTP KD than with MTP LM alone. In supervised fine-tuning (MTBench), four-token acceptance (“acc_4”) likewise rises when using MTP KD.
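To make the acceptance metric concrete, the sketch below counts how many drafted tokens survive verification. It rests on two assumptions that the text does not confirm: verification is greedy exact-match against the backbone's own tokens (real systems may accept probabilistically), and "acc_k" means the fraction of drafts whose first k tokens are all accepted. The function names are hypothetical.

```python
def accepted_prefix_length(draft, verified):
    """Number of leading draft tokens that match the verifier's tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

def acceptance_rate(drafts, verifieds, k):
    """Fraction of drafts whose first k tokens are all accepted (assumed acc_k)."""
    hits = sum(
        1
        for d, v in zip(drafts, verifieds)
        if accepted_prefix_length(d[:k], v[:k]) >= k
    )
    return hits / len(drafts)
```

Under this definition the rate can only stay equal or fall as k grows, since accepting k tokens requires accepting the first k-1; that is why a higher acc_4 is a stronger result than a higher acc_2.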
6. Significance and Practical Considerations
By supervising both next-token and multiple future-token predictions through a lightweight MTP head, SlimQwen demonstrates an ability to more effectively tune its MoE experts. The architecture achieves improved language modeling quality and increased efficiency in multi-token generation, especially under speculative decoding protocols. These effects are realized with modest overhead: $D$ additional transformer submodules (non-shared across depths) and a shared output layer. The integration of MTP thus suggests practical value in scaling multi-token objectives for efficient LLM pretraining and inference (Tang et al., 9 May 2026).