Llama 4 Scout (109B) MoE Transformer

Updated 19 November 2025
  • Llama 4 Scout is a 109B-parameter Mixture-of-Experts Transformer that recent work fully converts to FarSkip-Collective, which decouples layer dependencies so that collective communication overlaps with computation.
  • A self-distillation process restores the converted model's capabilities, limiting accuracy loss to within 1% of the original instruction-tuned version.
  • The converted model achieves substantial throughput gains in distributed environments while showing minimal degradation across 11 downstream tasks.

Llama 4 Scout (109B) is a 109-billion-parameter Mixture-of-Experts (MoE) Transformer that exemplifies the scaling and deployment challenges of contemporary LLMs in distributed environments. Recent work demonstrates that Llama 4 Scout can be fully converted to the FarSkip-Collective formulation, which rewires the model's skip connectivity so that collective communication overlaps with local computation. The conversion yields substantial throughput improvements while preserving the capabilities of the original instruction-tuned model to within 1% accuracy across a broad set of downstream tasks (Dukler et al., 14 Nov 2025).

1. Architectural Transformation via FarSkip-Collective

The key innovation in FarSkip-Collective is the systematic rewiring of residual ("skip") connections to relax strict layer-wise dependencies on blocking communication operations. In standard Transformer architectures, layer $k$ computes its output as $o_k = f_k(o_{k-1}) + o_{k-1}$, requiring the entire output $o_k$, including any results from blocking collective operations (e.g., all-reduce, all-to-all), before proceeding to the next layer. FarSkip-Collective decouples this by introducing the notion of an available activation $o_k^*$, which may be either "outdated" ($o_k^* = o_{k-1}$) or "partial" ($o_k^* = o_{k-1} + f_k^*(o_{k-1})$), depending on the sub-block in question.

Within a MoE block, the approach applies as follows:

  • Attention sub-block (partial input):

$$\mathrm{attn\_in}_k = o_{k-2} + \mathrm{attn\_out}_{k-1} + \mathrm{shared\_exp\_out}_{k-1}$$

This facilitates overlapping the Combine(all-to-all) communication on expert outputs with the core attention computation.

  • MLP sub-block (outdated input):

$$\mathrm{mlp\_in}_k = o_{k-1}$$

This enables overlapping Dispatch(all-to-all) for routed experts with the preceding attention computation.

This architectural modification “drops” immediate dependencies, allowing next-layer kernels to launch on stale or partial activations while the necessary collectives complete in parallel.
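
As an illustration of this rewiring, the PyTorch-style sketch below shows a single MoE block whose attention consumes the partial input and whose expert path consumes the outdated input. The module is a hypothetical stand-in, not the released implementation: routing, expert parallelism, and the all-to-all collectives are elided, and the class and argument names are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class FarSkipMoEBlock(nn.Module):
    """Illustrative sketch of FarSkip-style residual rewiring (not the released code).

    Attention at layer k consumes a *partial* activation that omits the previous
    layer's routed-expert output, while the expert path consumes the *outdated*
    activation o_{k-1}. Routing and the all-to-all collectives are elided.
    """

    def __init__(self, dim: int, hidden: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.routed_experts = nn.Sequential(  # placeholder for gated, dispatched experts
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, o_km2, attn_out_km1, shared_out_km1, routed_out_km1):
        # Full output of layer k-1; only the expert path below needs it.
        o_km1 = o_km2 + attn_out_km1 + shared_out_km1 + routed_out_km1

        # Partial input: attention does NOT wait for routed_out_{k-1}, so the
        # Combine(all-to-all) producing it may still be in flight at this point.
        attn_in = o_km2 + attn_out_km1 + shared_out_km1
        attn_out, _ = self.attn(attn_in, attn_in, attn_in, need_weights=False)

        # Outdated input: the expert path reads o_{k-1} rather than o_{k-1} + attn_out_k,
        # so Dispatch(all-to-all) for layer k can overlap with the attention above.
        shared_out = self.shared_expert(o_km1)
        routed_out = self.routed_experts(o_km1)

        # Return the pieces layer k+1 needs to assemble its own partial/outdated inputs.
        return o_km1, attn_out, shared_out, routed_out
```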

2. Self-Distillation for Capability Restoration

The altered activation flow changes the effective input distributions to downstream layers, necessitating a re-training process to restore original model capabilities. FarSkip-Collective Self-Distillation (FCSD) employs a teacher-student protocol, where the original instruction-tuned Llama 4 Scout serves as the teacher (qq) and the FarSkip-converted model is the student (pθp_\theta). The primary loss is token-wise KL-divergence:

$$\mathcal{L}_{\mathrm{KL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}} \sum_{t=1}^{|y|} \mathrm{KL}\big(q(\cdot \mid x, y_{<t}) \,\|\, p_\theta(\cdot \mid x, y_{<t})\big)$$

Additional objectives such as token-level cross-entropy and intermediate $L_2$ representation supervision are optionally mixed in:

$$\mathcal{L}_{\mathrm{distill}} = \alpha\, \mathcal{L}_{\mathrm{CE}} + (1-\alpha)\, \mathcal{L}_{\mathrm{KL}}, \qquad 0 \le \alpha \le 1$$
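
A minimal PyTorch sketch of this blended objective follows, assuming teacher and student logits of shape (batch, seq, vocab) and labels that use -100 to mask prompt and padding positions; the function name, the masking convention, and the default alpha are illustrative assumptions, not the paper's training code.

```python
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, labels, alpha: float = 0.1):
    """Blended objective: alpha * CE(student, labels) + (1 - alpha) * KL(q || p_theta).

    Logits: (batch, seq, vocab); labels: (batch, seq) with -100 marking positions
    (prompt, padding) excluded from both terms.
    """
    # Token-wise KL(q || p_theta): teacher probabilities against student log-probs.
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    kl_per_token = F.kl_div(log_p, q, reduction="none").sum(-1)  # (batch, seq)
    mask = (labels != -100).float()
    kl = (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)

    # Optional token-level cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    return alpha * ce + (1.0 - alpha) * kl
```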

Early stopping on the MBPP+ (Mostly Basic Programming Problems, extended) code benchmark is used to avoid late-stage distillation instabilities.

3. Communication Overlap Algorithm and Implementation

FarSkip-Collective leverages asynchronous primitives for all-to-all and all-reduce communication, re-ordered to maximize computation-communication overlap. In a typical forward pass through a MoE layer with Expert Parallelism (EP) and no Tensor Parallelism (TP), the routine prepares attention inputs, launches the token Dispatch (all-to-all) asynchronously, computes core attention, synchronizes and runs the expert MLPs, and finally launches the Combine (all-to-all) on expert outputs asynchronously, as sketched below.
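
The sketch below expresses this re-ordering for one layer using torch.distributed's asynchronous all-to-all. The helpers `layer.route`, `layer.attention`, and `layer.local_experts` are hypothetical placeholders for whatever the training framework provides; capacity handling, padding, and stream management are omitted.

```python
import torch
import torch.distributed as dist


def farskip_moe_forward(attn_in, mlp_in, layer, ep_group):
    """Illustrative forward pass with Dispatch/Combine overlapped against attention."""
    # 1. Route the *outdated* MLP input and pack tokens for expert parallelism.
    send_buf, in_splits, out_splits = layer.route(mlp_in)
    recv_buf = send_buf.new_empty((sum(out_splits), send_buf.shape[-1]))

    # 2. Launch Dispatch(all-to-all) asynchronously; do not wait yet.
    dispatch_work = dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=out_splits, input_split_sizes=in_splits,
        group=ep_group, async_op=True)

    # 3. Overlap: core attention runs on the *partial* input while tokens travel.
    attn_out = layer.attention(attn_in)

    # 4. Synchronize only when the expert MLPs actually need the dispatched tokens.
    dispatch_work.wait()
    expert_out = layer.local_experts(recv_buf)

    # 5. Launch Combine(all-to-all) asynchronously; the *next* layer's attention
    #    (which takes a partial input) can start before this completes.
    combined = expert_out.new_empty((sum(in_splits), expert_out.shape[-1]))
    combine_work = dist.all_to_all_single(
        combined, expert_out,
        output_split_sizes=in_splits, input_split_sizes=out_splits,
        group=ep_group, async_op=True)

    return attn_out, combined, combine_work  # caller waits on combine_work later
```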

Each asynchronous operation is launched on a secondary CUDA stream to ensure parallel execution with ongoing compute. The communication for layer $k$ can be fully hidden when:

$$T_{\mathrm{comm,\,Dispatch}} + T_{\mathrm{comm,\,Combine}} \;\le\; T_{\mathrm{comp,\,other\ ops}} = T_{\mathrm{layer}} - \big(T_{\mathrm{gate}} + T_{\mathrm{routed}}\big)$$

Maximal overlap occurs when non-routed computation dominates collective communication time, minimizing idle hardware.
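
One rough way to check this condition on a given cluster is to time the collectives and the surrounding compute with CUDA events, as in the sketch below; this is an assumed measurement harness, not the profiling setup used in the paper, and the four callables are user-supplied placeholders.

```python
import torch


def measure_overlap_budget(run_layer, run_dispatch, run_combine, run_gate_and_routed):
    """Roughly times the quantities in the overlap condition
    T_comm,Dispatch + T_comm,Combine <= T_layer - (T_gate + T_routed).
    Each argument is a zero-argument callable that enqueues work on the current stream.
    """
    def timed(fn):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    t_dispatch = timed(run_dispatch)
    t_combine = timed(run_combine)
    t_layer = timed(run_layer)
    t_gate_routed = timed(run_gate_and_routed)

    comm = t_dispatch + t_combine
    budget = t_layer - t_gate_routed
    print(f"comm={comm:.2f} ms, compute budget={budget:.2f} ms, "
          f"fully hideable={comm <= budget}")
    return comm, budget
```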

In the backward pass, a stateful approach records handles for forward and backward all-to-all operations. Custom hooks ensure delayed synchronization and reorder the backward graph to allow expert-parallel gradient computation to proceed jointly with attention backpropagation.
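
The snippet below sketches one way such delayed synchronization could be expressed with an autograd gradient hook; the registry, function names, and split-size handling are hypothetical assumptions, and the actual recorded-handle bookkeeping is more involved.

```python
import torch
import torch.distributed as dist

# Hypothetical registry mapping layer index -> in-flight backward collective handle.
work_registry = {}


def launch_backward_all_to_all(layer_idx, grad_send, out_splits, in_splits, group):
    """Start the expert-gradient all-to-all asynchronously and record its handle."""
    grad_recv = grad_send.new_empty((sum(in_splits), grad_send.shape[-1]))
    work = dist.all_to_all_single(
        grad_recv, grad_send,
        output_split_sizes=in_splits, input_split_sizes=out_splits,
        group=group, async_op=True)
    work_registry[layer_idx] = work
    return grad_recv


def make_delayed_sync_hook(layer_idx):
    """Gradient hook attached to an earlier tensor in the forward graph: by the
    time autograd reaches it, attention backprop has already run concurrently
    with the collective launched above, and only now do we wait on it."""
    def hook(grad):
        work = work_registry.pop(layer_idx, None)
        if work is not None:
            work.wait()
        return grad
    return hook

# Usage (illustrative): attn_in.register_hook(make_delayed_sync_hook(k))
```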

For inference under vLLM, all-reduce operations aggregating the nonzero expert outputs and the attention projections are deferred and launched asynchronously to reduce blocking, as sketched below.
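
A minimal sketch of the deferral pattern, assuming a process group `tp_group`: the all-reduce is launched with async_op=True and waited on only where the reduced tensor is first consumed. This illustrates the pattern, not vLLM's actual code path.

```python
import torch
import torch.distributed as dist


def deferred_all_reduce(partial_out, tp_group):
    """Launch the all-reduce now and return its handle; the caller waits right
    before the reduced tensor is consumed, letting independent kernels run in between."""
    work = dist.all_reduce(partial_out, op=dist.ReduceOp.SUM,
                           group=tp_group, async_op=True)
    return partial_out, work

# Usage pattern (illustrative):
#   y, work = deferred_all_reduce(expert_partial_output, tp_group)
#   ... launch independent kernels for the next sub-block here ...
#   work.wait()  # synchronize only when y is actually needed
```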

4. Empirical Performance Across Quality and Efficiency Metrics

Empirical evaluations on Llama 4 Scout and its FarSkip-Collective variants demonstrate minimal degradation in model accuracy and substantial throughput improvements. The average performance across 11 standard downstream tasks is summarized:

Model                      Params   Avg. score (11 tasks)
Llama-4-Scout (original)   109 B    76.0
Llama-4-Scout (FCSD)       109 B    75.1
Llama-4-Scout (SFT)        109 B    65.6

The distillation process (FCSD) recovers accuracy to within 1% of the original, whereas naive supervised fine-tuning (SFT) incurs notable quality losses, particularly on code-oriented tasks (e.g. HumanEval+).

Training overlap and inference speed-up metrics include:

  • Forward pass overlap (Megatron-LM, EP=8, 8×MI325X GPUs): 87.6%
  • Backward pass overlap: 89.0%
  • End-to-end training speed-up (16B model): 11%
  • Inference speed-up (vLLM, FP8 quantization, EP=8, 8×MI300X): time-to-first-token improves by 12.2% at a 512-token context, rising to 18.5% at a 2048-token context.

All results are averages over multiple runs, and the observed accuracy gaps fall within the evaluation variance for models at this scale. Early stopping on a held-out MBPP+ set mitigates overfitting and training collapse.

5. Comparison with Related Communication-Overlap Approaches

FarSkip-Collective is contrasted with several contemporary communication-overlap strategies:

  • Ladder-residual (Zhang et al. ’25): Uses outdated residuals to overlap all-reduce operations in dense TP, limited to smaller models (<10B) and not applied to MoE.
  • Kraken (Prabhakar et al. ’24), Partial-TP (Lamprecht et al. ’25): Employ partial activation synchronization in dense TP, again restricted to sub-10B models.
  • Apple track-parallelism (Günter et al. ’24) and Pre-gated MoE (Hwang et al. ’24): Aim to reduce MoE communication through design of the gating mechanism rather than overlapping collective operations.

FarSkip-Collective distinguishes itself by:

  • Modifying every layer of a >100B MoE model (Llama 4 Scout) to remove blocking dependencies.
  • Deploying universal self-distillation (FCSD) to recover high-fidelity capabilities.
  • Achieving and empirically validating 88%–97% communication-computation overlap at large scale.
  • Operating entirely at the PyTorch API level, ensuring portability to prominent training frameworks (Megatron-LM) and inference engines (vLLM).

6. Practical Deployment and Limitations

Converting Llama 4 Scout to FarSkip-Collective requires rewiring the residual connections in every layer, an extensive self-distillation run, and an implementation of overlapped all-to-all/all-reduce collectives. The result is a highly scalable MoE model that trains and serves measurably faster in distributed infrastructure (roughly 11% end-to-end training speed-up, measured on a 16B model, and 12–19% faster time-to-first-token for Scout inference), with verified accuracy preservation. Early stopping and careful loss blending are required for stable distillation, particularly to avoid collapse on code benchmarks such as MBPP+.

A plausible implication is that FarSkip-style connectivity and distillation may generalize to future MoE architectures at even larger scales, subject to the stability of the self-distillation regime and communication infrastructure. Constraints could arise in environments with non-uniform compute or communication capabilities, or when extending to architectures employing hybrid parallelism at extreme scales.
