MoDA: Mixture-of-Depths Attention in Transformers

Updated 17 March 2026
  • Mixture-of-Depths Attention (MoDA) is a transformer innovation that uses dynamic token routing and depth-aware attention to allocate computation efficiently.
  • It employs mechanisms like token-sparse routing and selective layer processing to reduce computational cost and overcome dense architecture bottlenecks.
  • MoDA supports multimodal and long-sequence tasks by enabling efficient scaling, faster convergence, and improved performance across various benchmarks.

Mixture-of-Depths Attention (MoDA) is a class of architectural and algorithmic innovations for transformer models that integrates dynamic, data-driven compute allocation along the depth (layer) axis. The goal is to improve computational efficiency, scalability, and representational fidelity, especially as models become deeper or handle multimodal, long-sequence, or complex reasoning tasks. MoDA encompasses both token-sparse routing, where tokens are adaptively processed at select layers, and depth-aware attention, where attention heads retrieve information from activations across multiple layers. These mechanisms address the inefficiencies and information bottlenecks inherent in dense, fixed-depth transformer architectures. MoDA is now central to the efficient scaling of both unimodal and multimodal transformers, as evidenced by recent advances in unified large models and depth-recurrent architectures.

1. Core Mechanisms and Formulations

The fundamental principle of Mixture-of-Depths Attention is to relax the assumption that every token must be processed by every layer and, in recent depth-aware variants, to let each attention head attend over both the current layer's and preceding layers' key-value pairs.

Token Routing and Gating

In classic MoDA routing, at each transformer layer, a lightweight router network computes a per-token importance score. Tokens with high scores are processed by the full attention and feed-forward subnet; the others are routed via the residual connection only, bypassing the heavy computation. The routing mechanism is typically implemented as

$$r_i = \sigma(W_r x_i + b_r)$$

where $x_i$ is the token embedding and $r_i$ is its scalar importance score. A hard threshold or percentile selection (top-k or threshold-p) is used to decide which tokens to process:

$$\tilde{x}_i = \begin{cases} x_i + D(x_i), & r_i \geq o_s \\ x_i, & r_i < o_s \end{cases}$$

where $D(\cdot)$ denotes the attention+FFN block and $o_s$ is the routing threshold (Mao et al., 10 Feb 2025, Raposo et al., 2024, Zhang et al., 2024, Zhang et al., 2024).
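Below is a minimal PyTorch sketch of this routing step, assuming a generic `block` module standing in for the attention+FFN sub-network $D(\cdot)$ and a capacity-based top-k rule; the class name, tensor layout, and the scaling of the block output by the router score are illustrative rather than taken from any cited implementation.

```python
import torch
import torch.nn as nn

class TokenRoutedLayer(nn.Module):
    """Illustrative mixture-of-depths routing: only the highest-scoring fraction
    of tokens passes through the heavy block; the rest follow the residual path."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # W_r, b_r
        self.block = block                    # attention + FFN sub-network D(.)
        self.capacity = capacity              # fraction of tokens to process

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d = x.shape
        scores = torch.sigmoid(self.router(x)).squeeze(-1)       # r_i, shape (B, T)
        k = max(1, int(self.capacity * seq_len))                  # tokens kept per sequence
        top = scores.topk(k, dim=-1)                              # per-sequence top-k
        idx = top.indices.unsqueeze(-1).expand(-1, -1, d)         # (B, k, d) gather index
        routed = torch.gather(x, 1, idx)                          # selected tokens
        # Scale the block output by the router score so the router receives gradient.
        routed = routed + top.values.unsqueeze(-1) * self.block(routed)
        return x.scatter(1, idx, routed)                          # unselected tokens pass unchanged
```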

Depth-Aware Attention

Recent MoDA variants enable direct attention over intermediate representations from multiple preceding layers, augmenting sequence attention with “depth” attention (Zhu et al., 16 Mar 2026, Knupp et al., 29 Jan 2026). At layer $l$, each head jointly attends to the current sequence and to a set of depth-indexed key-value memories. This is formalized as

$$\mathrm{MoDA}_h(Q_h^l, K_{\text{all}}, V_{\text{all}}) = \mathrm{softmax}\!\left(\frac{Q_h^l K_{\text{all}}^\top}{\sqrt{d}} + \mathcal{M}\right) V_{\text{all}}$$

where $K_{\text{all}}$ and $V_{\text{all}}$ concatenate the current and all previous layers' keys and values, and $\mathcal{M}$ masks for causality and validity.
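A minimal sketch of this unified softmax over the sequence and depth axes is shown below, assuming per-layer key/value caches are already available; the function name and tensor layout are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def depth_aware_attention(q, kv_per_layer, mask=None):
    """Queries from the current layer attend jointly over key/value pairs
    gathered from the current and all preceding layers (unified softmax).

    q:            (batch, heads, seq_len, d_head) queries at layer l
    kv_per_layer: list of (K, V) pairs for layers 0..l, each (batch, heads, seq_len, d_head)
    mask:         optional additive mask (causality/validity) broadcastable to the score matrix
    """
    k_all = torch.cat([k for k, _ in kv_per_layer], dim=2)       # concatenate along the key axis
    v_all = torch.cat([v for _, v in kv_per_layer], dim=2)
    scores = q @ k_all.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores + mask
    return F.softmax(scores, dim=-1) @ v_all
```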

2. Architectural Variants and Mathematical Details

MoDA encompasses several design families, including but not limited to:

| Variant/Family | Routing Principle | Depth Mixing |
|---|---|---|
| Token Pruning MoDA | Per-token router (MLP/linear) selects tokens to process per layer (Raposo et al., 2024, Mao et al., 10 Feb 2025) | No |
| Attention-based MoDA (A-MoD) | Uses normalized attention scores from the previous layer as routing scores (no extra parameters) (Gadhikar et al., 2024) | No |
| Threshold-p Gating | Replaces fixed top-k with adaptive thresholding for better effectiveness (Zhang et al., 2024) | No |
| Depth-aware MoDA | Every attention head accesses activations from all or a subset of previous layers (depth key/value) with a unified softmax (Zhu et al., 16 Mar 2026, Knupp et al., 29 Jan 2026) | Yes |
| Multimodal MoDA | Separate routers or schedules for vision and language tokens, task- or modality-aware (Mao et al., 10 Feb 2025, Zhang et al., 2024, Luo et al., 2024) | Optional |

“A-MoD” and “MoDA” are sometimes used synonymously for attention-routed token pruning, but recent works in depth-recurrent architectures (Zhu et al., 16 Mar 2026, Knupp et al., 29 Jan 2026) use MoDA to mean genuine depth attention.

Layer- and Token-level Routing

  • Top-k Routing: Selects exactly k tokens per layer via scoring; this enables static computation graphs but is rigid (Raposo et al., 2024).
  • Threshold-p Routing: Dynamically determines, per layer, how many tokens are kept based on a score threshold so that the retained fraction is approximately p; this reduces routing artifacts and matches input heterogeneity (Zhang et al., 2024). A sketch contrasting the two selection rules follows this list.
  • Task-/Modality-Aware Routers: Separate scoring networks/thresholds per data source or objective, exploiting heterogeneity in token redundancy (Mao et al., 10 Feb 2025, Zhang et al., 2024, Luo et al., 2024).
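
The two selection rules can be contrasted with a short sketch; the quantile-based threshold below is one plausible way to realize threshold-p and is not taken from the cited implementation.

```python
import torch

def select_top_k(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k routing: keep exactly k tokens per sequence (static compute graph)."""
    idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

def select_threshold_p(scores: torch.Tensor, p: float) -> torch.Tensor:
    """Threshold-p routing: a single score threshold (here the batch-level (1-p)
    quantile) keeps roughly a fraction p of tokens overall, while the number of
    kept tokens may differ across sequences, matching input heterogeneity."""
    threshold = torch.quantile(scores.flatten(), 1.0 - p)
    return scores >= threshold
```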

Depth Attention and Recurrent Mixtures

  • Full Depth Attention: Each attention head can mix information from all prior layer activations for its token position, dynamically retrieving the most relevant depth (Zhu et al., 16 Mar 2026).
  • Hybrid Mixture-of-Attentions: Modular stacking of sequence attention, depth attention, and sparse MoE expert attention in a single block; enables cross-depth retrieval and capacity scaling (Knupp et al., 29 Jan 2026). A schematic sketch of such a stacked block follows this list.
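
The sketch below shows one way such a block could be composed; the sub-modules are placeholders for the sequence-attention, depth-attention, and expert components described above, and the pre-norm residual arrangement is an assumption, not the exact layout of the cited architectures.

```python
import torch.nn as nn

class HybridMoDABlock(nn.Module):
    """Illustrative modular stacking: sequence attention, depth attention, and an
    FFN/MoE expert stage applied in turn, each with its own residual connection."""

    def __init__(self, d_model: int, seq_attn: nn.Module, depth_attn: nn.Module, experts: nn.Module):
        super().__init__()
        self.seq_attn, self.depth_attn, self.experts = seq_attn, depth_attn, experts
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, layer_cache):
        x = x + self.seq_attn(self.norms[0](x))                 # standard sequence attention
        x = x + self.depth_attn(self.norms[1](x), layer_cache)  # retrieval over previous layers' KVs
        x = x + self.experts(self.norms[2](x))                  # dense FFN or sparse MoE experts
        return x
```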

3. Efficiency, Scalability, and Hardware Considerations

MoDA achieves substantial reductions in computational cost, GPU memory, and wall-clock training/inference time by bypassing redundant computations at both the token and layer level.

  • Token skipping yields FLOPs and memory savings that scale with the skipped fraction. In UniMoD (Mao et al., 10 Feb 2025), 15–40% of training/inference FLOPs are saved with negligible or even improved benchmark accuracy; in p-MoD, up to a 44% reduction in inference FLOPs and 54% lower KV cache usage are achieved in multimodal decoders (Zhang et al., 2024).
  • A-MoD (attention-driven routing) introduces zero additional parameters and negligible overhead beyond leveraging already-computed attention matrices; this not only simplifies implementation but also accelerates convergence (2× faster in transfer settings) and can yield up to 2% accuracy improvements at equal FLOPs compared to standard MoD (Gadhikar et al., 2024).
  • Depth-attentive MoDA incurs only a moderate computational overhead (+3.7% FLOPs for 1.5B LLMs) while reducing average perplexity by 0.2 and increasing downstream accuracy by 2.11% (Zhu et al., 16 Mar 2026).
  • Hardware-efficient deployments rely on custom memory layouts, chunk-aware indexing, and group query attention packing to achieve over 97% of vanilla FlashAttention-2 throughput even for 64K sequence length (Zhu et al., 16 Mar 2026).

Token skipping and depth attention together allow both FLOP budget predictability and adaptive, data-dependent compute allocation. These characteristics are crucial for scaling to deeper architectures and longer sequences.
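
As a back-of-envelope illustration of this predictability, the remaining fraction of block FLOPs can be estimated up front from the per-layer keep ratio and the number of routed layers. The function below is a rough sketch, not a profiling tool; the quadratic option reflects attention cost when both queries and keys are pruned.

```python
def estimated_flops_fraction(keep_ratio: float, routed_layers: int, total_layers: int,
                             quadratic_attention: bool = False) -> float:
    """Rough estimate of the fraction of block FLOPs remaining when a fraction
    `keep_ratio` of tokens is processed in `routed_layers` out of `total_layers`
    layers; the remaining layers stay dense."""
    per_layer = keep_ratio ** 2 if quadratic_attention else keep_ratio
    dense_layers = total_layers - routed_layers
    return (dense_layers + routed_layers * per_layer) / total_layers

# Example: keeping 50% of tokens in 24 of 32 layers leaves ~62.5% of linear-cost FLOPs.
print(estimated_flops_fraction(0.5, 24, 32))  # 0.625
```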

4. Empirical Findings and Applications

Performance and Efficiency Metrics

  • On unified multimodal benchmarks (Show-o, Emu3), task-aware MoDA layers in UniMoD maintain or improve accuracy with 15–40% fewer FLOPs; for example, Emu3's MMU benchmark score rises from 881.3 to 901.0 at 40% lower FLOPs (Mao et al., 10 Feb 2025).
  • In computer vision, DeiT/ViT models with A-MoD routed pruning attain up to +2% accuracy over baselines at matched FLOPs and require half the epochs to reach peak transfer learning accuracy (Gadhikar et al., 2024).
  • Multimodal LLMs using progressive retention schedules (p-MoD) halve FLOPs and cache with no accuracy loss across 14 tasks (Zhang et al., 2024).
  • Depth-wise attentive LLMs (MoDA, Dreamer) exhibit robust gains in language modeling and downstream reasoning, requiring up to 8× fewer training tokens for equivalent accuracy (e.g., +11–19 points improvement in 0-shot math reasoning vs matched LA baselines) (Knupp et al., 29 Jan 2026, Zhu et al., 16 Mar 2026).
  • γ-MoD demonstrates over 50% speedup in inference and training with <2% drop in accuracy by converting 80–90% of layers to MoD blocks using the ARank redundancy metric (Luo et al., 2024).

Token/Layer Redundancy and Routing Insights

  • Token redundancy is highly layer- and task-dependent; generation tasks exhibit higher ARank (lower redundancy), while understanding tasks allow more aggressive pruning. A sketch of an ARank-style measurement follows this list.
  • Skipping early layers dramatically harms accuracy, confirming their representational importance; redundancy generally grows with depth (Mao et al., 10 Feb 2025).
  • Depth-attentive heads display diverse attention distributions over sequence and depth slots, dynamically retrieving contextually useful features and mitigating attention sinks (Zhu et al., 16 Mar 2026).
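
One way to quantify this layer-wise redundancy is an attention-rank style measure in the spirit of ARank; the sketch below computes the average numerical rank of a layer's attention maps and is an illustration, not the exact definition used in γ-MoD.

```python
import torch

def attention_rank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Average numerical rank of the attention matrices in one layer; a lower
    rank suggests higher token redundancy, i.e. a better candidate for MoD conversion.

    attn: (batch, heads, queries, keys) attention weights for a single layer
    """
    singulars = torch.linalg.svdvals(attn.float())                          # (batch, heads, min(Q, K))
    ranks = (singulars > tol * singulars.amax(dim=-1, keepdim=True)).sum(dim=-1)
    return ranks.float().mean().item()
```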

5. Practical Deployment and Tuning

Integration Procedures

  • MoDA and its attention-based variant (A-MoD) can be retrofitted onto pretrained transformers without training from scratch. Token routing uses either newly trained lightweight routers, attention-based scoring, or shared routers tuned with small auxiliary losses (Gadhikar et al., 2024, Luo et al., 2024); a sketch of attention-based scoring follows this list.
  • Layer selection for MoD conversion is optimized via redundancy metrics such as ARank (Luo et al., 2024, Mao et al., 10 Feb 2025). Typically, the majority of intermediate layers can be safely converted to MoD variants.
  • In multimodal transformers, per-task or per-modality routers decouple redundancy schedules, preserving cross-task performance and stability (Mao et al., 10 Feb 2025). Auxiliary losses enforce global or per-layer compute budgets.
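
As one illustration of attention-based scoring in the spirit of A-MoD, a token's routing score can be derived from the attention it receives in the previous layer; the exact aggregation used by Gadhikar et al. may differ, so treat this as a hedged sketch.

```python
import torch

def attention_routing_scores(prev_attn: torch.Tensor) -> torch.Tensor:
    """Routing scores derived from existing attention maps (no extra parameters).

    prev_attn: (batch, heads, queries, keys) attention weights from the previous layer
    returns:   (batch, keys) per-token scores, normalized to sum to 1 per sequence
    """
    received = prev_attn.mean(dim=1).mean(dim=1)   # average over heads, then over query positions
    return received / received.sum(dim=-1, keepdim=True)
```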

Optimization Details

  • The choice of skip ratio, thresholding method (top-k vs threshold-p), and gating normalization (e.g., TanhNorm) is critical. Optimal settings may include interleaving routed with dense layers and scheduling layer-wise retention via shifted-cosine or learned profiles (Zhang et al., 2024, Zhang et al., 2024); a sketch of a shifted-cosine schedule follows this list.
  • Shared routers with masking (γ-MoD) are preferred for stability when retrofitting to large MLLMs (Luo et al., 2024).
  • MoDA is distinct from Mixture-of-Experts (MoE): MoDA routes tokens along the depth axis so that they can skip entire blocks, whereas MoE routes tokens to different FFN experts while still executing attention and an expert MLP for every token (Zhang et al., 2024).
  • Depth attention generalizes DenseNet-style cross-layer concatenation, achieving adaptation and feature retrieval with linear, rather than quadratic, overhead (Zhu et al., 16 Mar 2026).
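
For illustration, a shifted-cosine retention profile can be written as below; the endpoint ratios `r_max` and `r_min` are hypothetical values, not the settings used in the cited work.

```python
import math

def shifted_cosine_retention(layer: int, num_layers: int,
                             r_max: float = 1.0, r_min: float = 0.3) -> float:
    """Per-layer token retention ratio following a half-cosine from r_max at the
    first layer down to r_min at the last layer (progressive retention decay)."""
    t = layer / max(1, num_layers - 1)           # 0 at the first layer, 1 at the last
    return r_min + 0.5 * (r_max - r_min) * (1.0 + math.cos(math.pi * t))

# Example: a 32-layer decoder keeps all tokens in layer 0 and ~30% in the final layer.
ratios = [round(shifted_cosine_retention(l, 32), 3) for l in range(32)]
```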

6. Future Directions and Outlook

  • Hardware and Scaling: Further optimization of CUDA kernels, fused depth/sequence attention, and advanced memory management are anticipated to extend MoDA to trillion-parameter and ultra-long-context regimes (Zhu et al., 16 Mar 2026).
  • Architectural Extensions: MoDA's depth-aware principles may generalize to cross-modal retrieval (vision, speech), bounded-sliding depth memories, or hybrid data- and depth-based retrieval (Zhu et al., 16 Mar 2026).
  • Theoretical Questions: Investigating the optimal balance of depth/sequence attention and expressivity-redundancy tradeoffs as depth increases, as well as robust redundancy metrics for task-adaptive MoDA scheduling, remains an open area.

MoDA establishes a new paradigm in efficient transformer design: by merging depth-aware computation and token-adaptive pruning, it enables deeper, more effective, and computationally tractable models across domains and modalities (Zhu et al., 16 Mar 2026, Mao et al., 10 Feb 2025, Gadhikar et al., 2024, Zhang et al., 2024, Knupp et al., 29 Jan 2026, Luo et al., 2024, Zhang et al., 2024, Raposo et al., 2024).
