
All-or-Here Attention (AHA) in Transformers

Updated 30 December 2025
  • All-or-Here Attention (AHA) is a dynamic mechanism that routes each token between global and local (sliding-window) attention, optimizing computation in Transformers.
  • It replaces over 90% of full attention calls with efficient local alternatives, using a lightweight binary router per token-head pair.
  • Empirical evaluations show that, even with extreme sparsity, models retain or exceed baseline performance across diverse NLP tasks.

All-or-Here Attention (AHA) is a dynamic architectural mechanism designed to drastically reduce the frequency of full global attention computations in Transformer-based LLMs by enabling per-head, per-token routing between global (full) and local (sliding-window) attention. With a lightweight router and supervised fine-tuning, AHA achieves extensive sparsity in attention computations, replacing over 90% of global attention with local alternatives at no cost to downstream accuracy for practical window sizes. Empirical analysis reveals that the requirement for global attention exhibits a highly skewed distribution, with only a minority of tokens and heads necessitating access to the entire context, while the majority operate effectively within local context windows (Luo et al., 27 Dec 2025).

1. Formal Specification and Mechanisms

AHA extends the standard decoder-only Transformer architecture by incorporating a binary routing function, per layer and per head, that determines whether each token–head pair requires global or local context. Let $X \in \mathbb{R}^{n \times d}$ denote the hidden states, with sequence length $n$ and model dimensionality $d$. Each of the $m$ attention heads defines projections $W_Q, W_K, W_V \in \mathbb{R}^{d \times (d/m)}$:

  • Full Self-Attention (Global):

$$\text{Attention}_\text{all}(q_t,\ K,\ V) = \text{Softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_k}}\right)V, \quad q_t = X_t W_Q,\ d_k = \frac{d}{m}$$

  • Sliding-Window Attention (Local):

$$\text{Attention}_\text{here}(q_t,\ K_{t-w+1:t},\ V_{t-w+1:t}) = \text{Softmax}\!\left(\frac{q_t K_{t-w+1:t}^\top}{\sqrt{d_k}}\right)V_{t-w+1:t}$$

The routing function is implemented as $S = \sigma(X W_\text{router}) \in (0,1)^{n \times m}$, where $W_\text{router} \in \mathbb{R}^{d \times m}$. The soft scores are then binarized as $G = \mathbb{1}[S \ge \tau]$ for a threshold $\tau$. Each head's output is

$$\text{Output}_{t,h} = G_{t,h}\,\text{Attention}_\text{all}(q_t, K, V) + (1 - G_{t,h})\,\text{Attention}_\text{here}(q_t, K_{t-w+1:t}, V_{t-w+1:t})$$

Gradients during backpropagation are passed via a Straight-Through Estimator (STE), which treats the hard thresholding as the identity in the backward pass: $\partial\mathcal{L}/\partial S \approx \partial\mathcal{L}/\partial G$.
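The per-token gating above can be sketched in NumPy for a single head. This is a minimal illustration, not the paper's implementation: the dimensions, threshold, and the use of one scalar router score per token (rather than a full per-head score matrix) are simplifying assumptions, and no STE is shown since there is no autograd here.

```python
# Minimal NumPy sketch of one AHA attention head: each token is routed
# between a causal global read and a sliding-window local read.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aha_head(X, W_Q, W_K, W_V, w_router, w=4, tau=0.5):
    """X: (n, d) hidden states; w: local window size; tau: gate threshold."""
    n, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # (n, d_k) each
    d_k = Q.shape[-1]
    S = 1.0 / (1.0 + np.exp(-(X @ w_router)))      # soft router score, (n,)
    G = (S >= tau).astype(float)                   # hard binary gate; an STE
                                                   # would pass gradients here
    out = np.empty_like(V)
    for t in range(n):
        # Gate picks the read range: full causal prefix vs. last w tokens.
        lo = 0 if G[t] == 1.0 else max(0, t - w + 1)
        att = softmax(Q[t] @ K[lo:t + 1].T / np.sqrt(d_k))
        out[t] = att @ V[lo:t + 1]
    return out, G
```

The loop makes the routing explicit; a practical kernel would instead batch the globally- and locally-routed tokens and dispatch each group to its own attention computation.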

2. Architectural Integration and Modifications

AHA substitutes the standard full self-attention module in each Transformer (decoder-only) block with an AHA block. The router is applied as a linear projection immediately before the standard Q/K/V computations. For each token and head, the binary gate determines whether to perform a full global attention read or a local sliding-window read. All other architectural components, including normalization, feed-forward modules, and residual pathways, remain unchanged. The system is designed for compatibility with established models (e.g., LLaMA, OLMo-2); implementation proceeds by initializing from a pre-trained checkpoint, inserting the router projection $W_\text{router}$, and continuing with supervised fine-tuning (Luo et al., 27 Dec 2025).

3. Training Objective and Regularization

The optimization objective is the sum of the standard autoregressive language modeling loss and an $\ell_1$-based regularization on the router's output:

$$\mathcal{L} = \mathcal{L}_\text{LM} + \lambda\,\mathcal{L}_\text{sparsity}, \qquad \mathcal{L}_\text{sparsity} = \frac{1}{nm}\sum_{t=1}^{n}\sum_{h=1}^{m} S_{t,h}$$

where

  • $\mathcal{L}_\text{LM}$: cross-entropy next-token prediction,
  • $\mathcal{L}_\text{sparsity}$ penalizes global-attention usage,
  • $\lambda$ controls the sparsity–performance trade-off,
  • $\tau$ is the binarization threshold; the STE enables gradients to flow through the discrete gate selection.

4. Empirical Evaluation and Performance

AHA was evaluated by fine-tuning OLMo-2-0425-1B-SFT for one epoch on TULU-v3 with AdamW (3% warmup, batch size 128). Window sizes $w \in \{16, 32, 64, 128, 256\}$ were explored.

Downstream evaluation included:

  • Single-token: MMLU (Acc), HellaSwag (Acc_norm), CSQA (Acc)
  • Multi-token: GSM8K (EM, 5-shot), MBPP (Pass@1, zero-shot), MultiNews (ROUGE, zero-shot)

Key quantitative results for $w = 256$:

  • Relative retained accuracy: 102.5% of baseline
  • Average global-attention usage of 6.7%, i.e., 93.3% of full-attention operations supplanted by local attention
  • Global-attention usage rises as the window shrinks (see Section 5)
  • Even with extreme locality (the smallest window, $w = 16$), the model retains the bulk of baseline performance

5. Distributional Analysis of Context Dependency

Analysis reveals a pronounced “long-tail” in the necessity of full attention as a function of window size:

  • As the window size $w$ increases, average global-attention usage drops: 52.7% ($w=16$) → 41.4% (32) → 28.1% (64) → 11.6% (128) → 6.7% (256)
  • Most heads and tokens rarely invoke global attention, confirmed by heatmaps across layers/heads where only a minority display high global attention rates (“heavy-hitters”)

Representative per-head “full-attention gap” (number of tokens between global-attention activations):

  • Layer 5, Head 2: gap ≈ 1 (always-on)
  • Layer 12, Head 8: gap in tens–hundreds
  • Layer 3, Head 6: gap in thousands

This suggests most model computations can be executed efficiently with only local context, with sporadic global reads required.
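The per-head gap statistic above can be computed directly from a head's binary gate trace. The helper below is a hypothetical illustration (the function name and convention of measuring gaps in token positions are assumptions):

```python
# Sketch: compute the mean "full-attention gap" (positions between
# successive global-attention activations) from one head's gate trace.
import numpy as np

def full_attention_gap(gates):
    """gates: 1-D sequence of 0/1 gate decisions for a single head."""
    idx = np.flatnonzero(np.asarray(gates) == 1)
    if len(idx) < 2:
        return float("inf")   # head (almost) never routes to global
    return float(np.diff(idx).mean())
```

An always-on head (e.g., Layer 5, Head 2) yields a gap of 1, while a head that fires rarely yields gaps in the hundreds or thousands.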

6. Inference Algorithm and Implementation

Efficient inference with AHA jointly computes router scores and per-head binary gates, then selectively routes each token–head pair to either a global or a local attention read.
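One autoregressive decode step for a single head can be sketched as below. This is a minimal illustration under stated assumptions (a sigmoid router producing one score per head, threshold `tau`, and the name `decode_step` are not from the paper); the key point is that a locally-routed head only needs to read the last $w$ entries of the KV cache.

```python
# Hypothetical sketch of one AHA decode step for a single head: the
# router score selects the full KV cache or only its last w entries.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(x_t, K_cache, V_cache, W_Q, w_router, w=64, tau=0.5):
    """x_t: (d,) current hidden state; K_cache, V_cache: (t, d_k) so far."""
    q_t = x_t @ W_Q
    d_k = q_t.shape[-1]
    s = 1.0 / (1.0 + np.exp(-(x_t @ w_router)))   # router score for this head
    if s >= tau:
        K, V = K_cache, V_cache                   # global: read the whole cache
    else:
        K, V = K_cache[-w:], V_cache[-w:]         # local: last w positions only
    return softmax(q_t @ K.T / np.sqrt(d_k)) @ V
```

In this form, a head routed locally performs $O(w)$ work per step regardless of context length, which is the source of the algorithmic savings described above.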

During backpropagation, gradients propagate through the STE so that the router weights $W_\text{router}$ are trained end-to-end.

7. Limitations and Potential Extensions

AHA exhibits several constraints and extension opportunities:

  • Sparsity–Performance Trade-off: Overly strong regularization (large $\lambda$) can hurt accuracy (e.g., GSM8K EM drops from 0.4291 to 0.3730), while too-weak regularization retains excess global attention without accuracy benefit; an intermediate $\lambda$ achieves an effective sparsity–retention balance.
  • System-Level Speedup: While AHA achieves algorithmic sparsity, current hardware and attention kernels (e.g., FlashAttention) are optimized for static dense computation. Realizing wall-clock speedups is contingent on hardware-adaptive, dynamic computation support.
  • Scalability and Generalization: No from-scratch pre-training is required for AHA deployment. Integration and extension possibilities include multi-choice routing (heads select among multiple window sizes or global), mixture-of-experts/spans routing, and layering AHA atop other efficient attention mechanisms (sparse, low-rank, index-based) (Luo et al., 27 Dec 2025).

AHA establishes a minimal, data-dependent conditional computation design for dynamic sparsity in large-scale Transformers, showing that full global attention is usually redundant outside a narrow subset of model computations.

References (1)
