Llama 4 Scout (109B) MoE Transformer

Updated 19 November 2025
  • Llama 4 Scout is a 109B-parameter Mixture-of-Experts Transformer that recent work fully converts to FarSkip-Collective, which decouples layer dependencies so that collective communication overlaps with computation.
  • A self-distillation process restores the converted model's capabilities, limiting accuracy loss to within 1% of the original instruction-tuned version.
  • The converted model achieves substantial throughput gains in distributed environments while showing minimal degradation across 11 downstream tasks.

Llama 4 Scout (109B) is a 109-billion-parameter Mixture-of-Experts (MoE) Transformer that exemplifies the scaling and deployment challenges of contemporary LLMs in distributed environments. Recent work demonstrates that Llama 4 Scout can be fully converted to the FarSkip-Collective formulation, which rewires the model's skip connectivity so that collective communication overlaps with local computation. The conversion yields substantial throughput improvements while preserving the capabilities of the original instruction-tuned model to within 1% accuracy across a broad set of downstream tasks (Dukler et al., 14 Nov 2025).

1. Architectural Transformation via FarSkip-Collective

The key innovation in FarSkip-Collective is the systematic rewiring of residual ("skip") connections to relax strict layer-wise dependencies on blocking communication operations. In standard Transformer architectures, layer $k$ computes its output as $o_k = f_k(o_{k-1}) + o_{k-1}$, requiring the entire output $o_k$, including any results from blocking collective operations (e.g., all-reduce, all-to-all), before proceeding to the next layer. FarSkip-Collective decouples this by introducing the notion of an available activation $o_k^*$, which may be either "outdated" ($o_k^* = o_{k-1}$) or "partial" ($o_k^* = o_{k-1} + f_k^*(o_{k-1})$), depending on the sub-block in question.

Within a MoE block, the approach applies as follows:

  • Attention sub-block (partial input):

$$\mathrm{attn\_in}_k = o_{k-2} + \mathrm{attn\_out}_{k-1} + \mathrm{shared\_exp\_out}_{k-1}$$

This facilitates overlapping the Combine(all-to-all) communication on expert outputs with the core attention computation.

  • MLP sub-block (outdated input):

$$\mathrm{mlp\_in}_k = o_{k-1}$$

This enables overlapping Dispatch(all-to-all) for routed experts with the preceding attention computation.

This architectural modification “drops” immediate dependencies, allowing next-layer kernels to launch on stale or partial activations while the necessary collectives complete in parallel.
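
As an illustration of this rewiring, the PyTorch-style sketch below shows a single MoE block whose attention consumes the partial input and whose expert path consumes the outdated input. The module is a hypothetical stand-in, not the released implementation: routing, expert parallelism, and the all-to-all collectives are elided, and the class and argument names are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class FarSkipMoEBlock(nn.Module):
    """Illustrative sketch of FarSkip-style residual rewiring (not the released code).

    Attention at layer k consumes a *partial* activation that omits the previous
    layer's routed-expert output, while the expert path consumes the *outdated*
    activation o_{k-1}. Routing and the all-to-all collectives are elided.
    """

    def __init__(self, dim: int, hidden: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.routed_experts = nn.Sequential(  # placeholder for gated, dispatched experts
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, o_km2, attn_out_km1, shared_out_km1, routed_out_km1):
        # Full output of layer k-1; only the expert path below needs it.
        o_km1 = o_km2 + attn_out_km1 + shared_out_km1 + routed_out_km1

        # Partial input: attention does NOT wait for routed_out_{k-1}, so the
        # Combine(all-to-all) producing it may still be in flight at this point.
        attn_in = o_km2 + attn_out_km1 + shared_out_km1
        attn_out, _ = self.attn(attn_in, attn_in, attn_in, need_weights=False)

        # Outdated input: the expert path reads o_{k-1} rather than o_{k-1} + attn_out_k,
        # so Dispatch(all-to-all) for layer k can overlap with the attention above.
        shared_out = self.shared_expert(o_km1)
        routed_out = self.routed_experts(o_km1)

        # Return the pieces layer k+1 needs to assemble its own partial/outdated inputs.
        return o_km1, attn_out, shared_out, routed_out
```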

2. Self-Distillation for Capability Restoration

The altered activation flow changes the effective input distributions to downstream layers, necessitating a re-training process to restore original model capabilities. FarSkip-Collective Self-Distillation (FCSD) employs a teacher-student protocol, where the original instruction-tuned Llama 4 Scout serves as the teacher (qq) and the FarSkip-converted model is the student (pθp_\theta). The primary loss is token-wise KL-divergence:

$$\mathcal{L}_{\mathrm{KL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}} \sum_{t=1}^{|y|} \mathrm{KL}\big(q(\cdot \mid x, y_{<t}) \,\|\, p_\theta(\cdot \mid x, y_{<t})\big)$$

Additional objectives such as token-level cross-entropy and intermediate $L_2$ representation supervision are optionally mixed in:

$$\mathcal{L}_{\mathrm{distill}} = \alpha\, \mathcal{L}_{\mathrm{CE}} + (1-\alpha)\, \mathcal{L}_{\mathrm{KL}}, \qquad 0 \le \alpha \le 1$$
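
A minimal PyTorch sketch of this blended objective follows, assuming teacher and student logits of shape (batch, seq, vocab) and labels that use -100 to mask prompt and padding positions; the function name, the masking convention, and the default alpha are illustrative assumptions, not the paper's training code.

```python
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, labels, alpha: float = 0.1):
    """Blended objective: alpha * CE(student, labels) + (1 - alpha) * KL(q || p_theta).

    Logits: (batch, seq, vocab); labels: (batch, seq) with -100 marking positions
    (prompt, padding) excluded from both terms.
    """
    # Token-wise KL(q || p_theta): teacher probabilities against student log-probs.
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    kl_per_token = F.kl_div(log_p, q, reduction="none").sum(-1)  # (batch, seq)
    mask = (labels != -100).float()
    kl = (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)

    # Optional token-level cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    return alpha * ce + (1.0 - alpha) * kl
```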

Early stopping on the MBPP+ (Mostly Basic Programming Problems, extended) code benchmark is used to avoid late-stage distillation instabilities.

3. Communication Overlap Algorithm and Implementation

FarSkip-Collective leverages asynchronous primitives for all-to-all and all-reduce communication, re-ordered to maximize computation-communication overlap. In a typical forward pass through a MoE layer with Expert Parallelism (EP) and no Tensor Parallelism (TP), the routine prepares attention inputs, launches the token Dispatch (all-to-all) asynchronously, computes core attention, synchronizes and runs the expert MLPs, and finally launches the Combine (all-to-all) on expert outputs asynchronously, as sketched below.
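
The sketch below expresses this re-ordering for one layer using torch.distributed's asynchronous all-to-all. The helpers `layer.route`, `layer.attention`, and `layer.local_experts` are hypothetical placeholders for whatever the training framework provides; capacity handling, padding, and stream management are omitted.

```python
import torch
import torch.distributed as dist


def farskip_moe_forward(attn_in, mlp_in, layer, ep_group):
    """Illustrative forward pass with Dispatch/Combine overlapped against attention."""
    # 1. Route the *outdated* MLP input and pack tokens for expert parallelism.
    send_buf, in_splits, out_splits = layer.route(mlp_in)
    recv_buf = send_buf.new_empty((sum(out_splits), send_buf.shape[-1]))

    # 2. Launch Dispatch(all-to-all) asynchronously; do not wait yet.
    dispatch_work = dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=out_splits, input_split_sizes=in_splits,
        group=ep_group, async_op=True)

    # 3. Overlap: core attention runs on the *partial* input while tokens travel.
    attn_out = layer.attention(attn_in)

    # 4. Synchronize only when the expert MLPs actually need the dispatched tokens.
    dispatch_work.wait()
    expert_out = layer.local_experts(recv_buf)

    # 5. Launch Combine(all-to-all) asynchronously; the *next* layer's attention
    #    (which takes a partial input) can start before this completes.
    combined = expert_out.new_empty((sum(in_splits), expert_out.shape[-1]))
    combine_work = dist.all_to_all_single(
        combined, expert_out,
        output_split_sizes=in_splits, input_split_sizes=out_splits,
        group=ep_group, async_op=True)

    return attn_out, combined, combine_work  # caller waits on combine_work later
```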

Each asynchronous operation is launched on a secondary CUDA stream to ensure parallel execution with ongoing compute. The communication for layer $k$ can be fully hidden when:

$$T_{\mathrm{comm,\,Dispatch}} + T_{\mathrm{comm,\,Combine}} \;\le\; T_{\mathrm{comp,\,other\ ops}} = T_{\mathrm{layer}} - \big(T_{\mathrm{gate}} + T_{\mathrm{routed}}\big)$$

Maximal overlap occurs when non-routed computation dominates collective communication time, minimizing idle hardware.
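
One rough way to check this condition on a given cluster is to time the collectives and the surrounding compute with CUDA events, as in the sketch below; this is an assumed measurement harness, not the profiling setup used in the paper, and the four callables are user-supplied placeholders.

```python
import torch


def measure_overlap_budget(run_layer, run_dispatch, run_combine, run_gate_and_routed):
    """Roughly times the quantities in the overlap condition
    T_comm,Dispatch + T_comm,Combine <= T_layer - (T_gate + T_routed).
    Each argument is a zero-argument callable that enqueues work on the current stream.
    """
    def timed(fn):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    t_dispatch = timed(run_dispatch)
    t_combine = timed(run_combine)
    t_layer = timed(run_layer)
    t_gate_routed = timed(run_gate_and_routed)

    comm = t_dispatch + t_combine
    budget = t_layer - t_gate_routed
    print(f"comm={comm:.2f} ms, compute budget={budget:.2f} ms, "
          f"fully hideable={comm <= budget}")
    return comm, budget
```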

In the backward pass, a stateful approach records handles for forward and backward all-to-all operations. Custom hooks ensure delayed synchronization and reorder the backward graph to allow expert-parallel gradient computation to proceed jointly with attention backpropagation.
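
The snippet below sketches one way such delayed synchronization could be expressed with an autograd gradient hook; the registry, function names, and split-size handling are hypothetical assumptions, and the actual recorded-handle bookkeeping is more involved.

```python
import torch
import torch.distributed as dist

# Hypothetical registry mapping layer index -> in-flight backward collective handle.
work_registry = {}


def launch_backward_all_to_all(layer_idx, grad_send, out_splits, in_splits, group):
    """Start the expert-gradient all-to-all asynchronously and record its handle."""
    grad_recv = grad_send.new_empty((sum(in_splits), grad_send.shape[-1]))
    work = dist.all_to_all_single(
        grad_recv, grad_send,
        output_split_sizes=in_splits, input_split_sizes=out_splits,
        group=group, async_op=True)
    work_registry[layer_idx] = work
    return grad_recv


def make_delayed_sync_hook(layer_idx):
    """Gradient hook attached to an earlier tensor in the forward graph: by the
    time autograd reaches it, attention backprop has already run concurrently
    with the collective launched above, and only now do we wait on it."""
    def hook(grad):
        work = work_registry.pop(layer_idx, None)
        if work is not None:
            work.wait()
        return grad
    return hook

# Usage (illustrative): attn_in.register_hook(make_delayed_sync_hook(k))
```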

For inference under vLLM, all-reduce operations aggregating the nonzero expert outputs and the attention projections are deferred and launched asynchronously to reduce blocking, as sketched below.
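
A minimal sketch of the deferral pattern, assuming a process group `tp_group`: the all-reduce is launched with async_op=True and waited on only where the reduced tensor is first consumed. This illustrates the pattern, not vLLM's actual code path.

```python
import torch
import torch.distributed as dist


def deferred_all_reduce(partial_out, tp_group):
    """Launch the all-reduce now and return its handle; the caller waits right
    before the reduced tensor is consumed, letting independent kernels run in between."""
    work = dist.all_reduce(partial_out, op=dist.ReduceOp.SUM,
                           group=tp_group, async_op=True)
    return partial_out, work

# Usage pattern (illustrative):
#   y, work = deferred_all_reduce(expert_partial_output, tp_group)
#   ... launch independent kernels for the next sub-block here ...
#   work.wait()  # synchronize only when y is actually needed
```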

4. Empirical Performance Across Quality and Efficiency Metrics

Empirical evaluations on Llama 4 Scout and its FarSkip-Collective variants demonstrate minimal degradation in model accuracy and substantial throughput improvements. The average performance across 11 standard downstream tasks is summarized:

Model                      Params   Avg. score (11 tasks)
Llama-4-Scout (original)   109 B    76.0
Llama-4-Scout (FCSD)       109 B    75.1
Llama-4-Scout (SFT)        109 B    65.6

The distillation process (FCSD) recovers accuracy to within 1% of the original, whereas naive supervised fine-tuning (SFT) incurs notable quality losses, particularly on code-oriented tasks (e.g. HumanEval+).

Training overlap and inference speed-up metrics include:

  • Forward pass overlap (Megatron-LM, EP=8, 8×MI325X GPUs): 87.6%
  • Backward pass overlap: 89.0%
  • End-to-end training speed-up (16B model): 11%
  • Inference speed-up (vLLM, FP8 quantization, EP=8, 8×MI300X): time-to-first-token improves by 12.2% at a 512-token context, rising to 18.5% at a 2048-token context.

All results are averages over multiple runs, and the observed accuracy gaps fall within the evaluation variance for models at this scale. Early stopping on a held-out MBPP+ set mitigates overfitting and training collapse.

5. Comparison with Related Communication-Overlap Approaches

FarSkip-Collective is contrasted with several contemporary communication-overlap strategies:

  • Ladder-residual (Zhang et al. ’25): Uses outdated residuals to overlap all-reduce operations in dense TP, limited to smaller models (<10B) and not applied to MoE.
  • Kraken (Prabhakar et al. ’24), Partial-TP (Lamprecht et al. ’25): Employ partial activation synchronization in dense TP, again restricted to sub-10B models.
  • Apple track-parallelism (Günter et al. ’24) and Pre-gated MoE (Hwang et al. ’24): Aim to reduce MoE communication through design of the gating mechanism rather than overlapping collective operations.

FarSkip-Collective distinguishes itself by:

  • Modifying every layer of a >100B MoE model (Llama 4 Scout) to remove blocking dependencies.
  • Deploying universal self-distillation (FCSD) to recover high-fidelity capabilities.
  • Achieving and empirically validating 88%–97% communication-computation overlap at large scale.
  • Operating entirely at the PyTorch API level, ensuring portability to prominent training frameworks (Megatron-LM) and inference engines (vLLM).

6. Practical Deployment and Limitations

Converting Llama 4 Scout to FarSkip-Collective requires rewiring the residual connections in every layer, an extensive self-distillation run, and an implementation of overlapped all-to-all/all-reduce collectives. The result is a highly scalable MoE model that trains and serves measurably faster in distributed infrastructure (roughly 11% end-to-end training speed-up, measured on a 16B model, and 12–19% faster time-to-first-token for Scout inference), with verified accuracy preservation. Early stopping and careful loss blending are required for stable distillation, particularly to avoid collapse on code benchmarks such as MBPP+.

A plausible implication is that FarSkip-style connectivity and distillation may generalize to future MoE architectures at even larger scales, subject to the stability of the self-distillation regime and communication infrastructure. Constraints could arise in environments with non-uniform compute or communication capabilities, or when extending to architectures employing hybrid parallelism at extreme scales.
