All-or-Here Attention (AHA) in Transformers
- All-or-Here Attention (AHA) is a dynamic mechanism that routes each token between global and local (sliding-window) attention, optimizing computation in Transformers.
- It replaces over 90% of full attention calls with efficient local alternatives, using a lightweight binary router per token-head pair.
- Empirical evaluations show that, even with extreme sparsity, models retain or exceed baseline performance across diverse NLP tasks.
All-or-Here Attention (AHA) is a dynamic architectural mechanism designed to drastically reduce the frequency of full global attention computations in Transformer-based LLMs by enabling per-head, per-token routing between global (full) and local (sliding-window) attention. With a lightweight router and supervised fine-tuning, AHA achieves extensive sparsity in attention computations, replacing over 90% of global attention with local alternatives at no cost to downstream accuracy for practical window sizes. Empirical analysis reveals that the requirement for global attention exhibits a highly skewed distribution, with only a minority of tokens and heads necessitating access to the entire context, while the majority operate effectively within local context windows (Luo et al., 27 Dec 2025).
1. Formal Specification and Mechanisms
AHA extends the standard decoder-only Transformer architecture by incorporating a binary routing function, per layer and per head, that determines whether each token–head pair requires global or local context. Let $X \in \mathbb{R}^{n \times d}$ denote the hidden states, with $n$ the sequence length and $d$ the model dimensionality. Each attention head $h$ defines projections $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} \in \mathbb{R}^{d \times d_h}$, giving $Q_h = X W_Q^{(h)}$, $K_h = X W_K^{(h)}$, $V_h = X W_V^{(h)}$:
- Full Self-Attention (Global): token $t$ attends causally over the entire prefix, $\mathrm{Attn}^{\text{full}}_{t,h} = \mathrm{softmax}\!\big(Q_{h,t} K_{h,1:t}^{\top}/\sqrt{d_h}\big)\, V_{h,1:t}$
- Sliding-Window Attention (Local, window $w$): token $t$ attends only to the most recent $w$ positions, $\mathrm{Attn}^{\text{local}}_{t,h} = \mathrm{softmax}\!\big(Q_{h,t} K_{h,\max(1,\,t-w+1):t}^{\top}/\sqrt{d_h}\big)\, V_{h,\max(1,\,t-w+1):t}$
The routing function is implemented as $S = \sigma(X W_{\text{router}})$, where $W_{\text{router}} \in \mathbb{R}^{d \times m}$ and $m$ is the number of heads. Binarization follows, with $G_{t,h} = \mathbb{1}[S_{t,h} > \tau]$ for threshold $\tau$. Each head's output is
$$A_{t,h} = G_{t,h}\,\mathrm{Attn}^{\text{full}}_{t,h} + (1 - G_{t,h})\,\mathrm{Attn}^{\text{local}}_{t,h}.$$
Gradients during backpropagation are passed via a Straight-Through Estimator (STE): $\partial \mathcal{L}/\partial S_{t,h} \approx \partial \mathcal{L}/\partial G_{t,h}$.
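The gating and STE above can be made concrete with a short PyTorch sketch; the class name `AHARouter`, the default threshold value, and the exact tensor shapes are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class AHARouter(nn.Module):
    """Per-token, per-head router sketch: S = sigmoid(X @ W_router), G = 1[S > tau].

    Illustrative only; shapes and names are assumptions based on the description above.
    """
    def __init__(self, d_model: int, n_heads: int, tau: float = 0.5):
        super().__init__()
        self.w_router = nn.Linear(d_model, n_heads, bias=False)  # W_router
        self.tau = tau  # binarization threshold (0.5 is an assumed default)

    def forward(self, x: torch.Tensor):
        # x: (batch, n, d_model) -> scores S: (batch, n, n_heads) in [0, 1]
        s = torch.sigmoid(self.w_router(x))
        g_hard = (s > self.tau).float()        # discrete gates, no gradient path
        # Straight-Through Estimator: forward pass uses the hard gate,
        # backward pass routes dL/dG into dL/dS unchanged
        g = s + (g_hard - s).detach()
        return g, s
```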
2. Architectural Integration and Modifications
AHA substitutes the standard full self-attention module in each Transformer (decoder-only) block with an AHA block. The router is applied as a linear projection immediately before the standard Q/K/V computations. For each token and head, binary gating determines whether to perform a full global attention or a local sliding-window attention read. All other architectural components, including normalization, feed-forward modules, and residual pathways, remain unchanged. The system is designed for compatibility with established models (e.g., LLaMA, OLMo-2); implementation proceeds by initializing from a pre-trained checkpoint, inserting the router projection $W_{\text{router}}$, and continuing with supervised fine-tuning (Luo et al., 27 Dec 2025).
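A minimal sketch of that integration path is shown below, reusing the `AHARouter` class from the previous sketch: a freshly initialized router is attached to each decoder layer of a pre-trained model before fine-tuning. The attribute names `model.layers` and `self_attn` mirror LLaMA/OLMo-style codebases but are assumptions, not a documented API.

```python
import torch.nn as nn

def attach_aha_routers(model: nn.Module, d_model: int, n_heads: int, tau: float = 0.5) -> nn.Module:
    """Add a new, randomly initialized router to every decoder layer of a
    pre-trained checkpoint; all other weights are left untouched and the model
    is then trained with ordinary supervised fine-tuning."""
    for layer in model.layers:                         # assumed attribute layout
        # AHARouter is the sketch class defined in the previous code block
        layer.self_attn.router = AHARouter(d_model, n_heads, tau)
    return model
```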
3. Training Objective and Regularization
The optimization objective is the sum of the standard autoregressive language-modeling loss and an $\ell_1$-based regularization on the router's output:
$$\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda\,\|S\|_1,$$
where
- $\mathcal{L}_{\text{LM}}$: cross-entropy next-token prediction loss,
- $\|S\|_1$ penalizes global attention usage,
- $\lambda$ controls the sparsity–performance trade-off,
- threshold $\tau$ binarizes the router scores into gates. The STE enables gradients to flow through the discrete gate selection.
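A minimal sketch of this objective in PyTorch, assuming the regularizer is the mean of the router scores $S$ across layers, tokens, and heads (the exact normalization is not specified here):

```python
import torch
import torch.nn.functional as F

def aha_loss(logits, labels, router_scores, lam):
    """Combined objective: next-token cross-entropy plus an l1-style penalty
    on the router scores that discourages global attention.

    logits:        (batch, n, vocab)
    labels:        (batch, n) token ids
    router_scores: list of per-layer score tensors S, each (batch, n, n_heads)
    lam:           sparsity coefficient lambda
    """
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predict token t+1 from prefix
        labels[:, 1:].reshape(-1),
    )
    # S lies in [0, 1], so its mean is an l1 norm up to a constant factor
    sparsity = torch.stack([s.mean() for s in router_scores]).mean()
    return lm_loss + lam * sparsity
```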
4. Empirical Evaluation and Performance
AHA was evaluated by fine-tuning OLMo-2-0425-1B-SFT for one epoch on TULU-v3 using AdamW optimization ($\beta_1 = 0.9$, $\beta_2 = 0.95$, 3% warmup, batch size 128). Window sizes explored: 16, 32, 64, 128, 256.
Downstream evaluation included:
- Single-token: MMLU (Acc), HellaSwag (Acc_norm), CSQA (Acc)
- Multi-token: GSM8K (EM, 5-shot), MBPP (Pass@1, zero-shot), MultiNews (ROUGE, zero-shot)
Key quantitative results for $w = 256$:
- Relative retained accuracy: 102.5% of baseline
- Average global attention usage of 6.7%, i.e., 93.3% of full-attention operations supplanted by local attention
- Smaller windows require proportionally more global attention (see the per-window breakdown in Section 5)
- Even with extreme locality (the smallest windows), the model retains baseline performance
5. Distributional Analysis of Context Dependency
Analysis reveals a pronounced “long-tail” in the necessity of full attention as a function of window size:
- As $w$ increases, global attention usage drops from 52.7% ($w = 16$) → 41.4% ($w = 32$) → 28.1% ($w = 64$) → 11.6% ($w = 128$) → 6.7% ($w = 256$)
- Most heads and tokens rarely invoke global attention, confirmed by heatmaps across layers/heads where only a minority display high global attention rates (“heavy-hitters”)
Representative per-head “full-attention gap” (number of tokens between global-attention activations):
- Layer 5, Head 2: gap 1 (always-on)
- Layer 12, Head 8: gap in tens–hundreds
- Layer 3, Head 6: gap in thousands
This suggests most model computations can be executed efficiently with only local context, with sporadic global reads required.
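One way to reproduce such gap statistics from the binary gate matrices is sketched below; the function name and the gap definition (distance in tokens between consecutive global-attention activations within a head) are assumptions based on the description above.

```python
import torch

def full_attention_gaps(gates: torch.Tensor) -> list:
    """gates: (n_tokens, n_heads) binary gate matrix for one layer.
    Returns the mean gap (in tokens) between consecutive global-attention
    activations for each head; inf for heads that fire at most once."""
    n_tokens, n_heads = gates.shape
    gaps = []
    for h in range(n_heads):
        idx = torch.nonzero(gates[:, h], as_tuple=False).flatten()
        if idx.numel() < 2:
            gaps.append(float("inf"))                        # never or rarely global
        else:
            gaps.append((idx[1:] - idx[:-1]).float().mean().item())
    return gaps  # ~1 for "always-on" heads, hundreds or thousands for mostly-local heads
```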
6. Inference Algorithm and Implementation
Efficient inference with AHA is enabled by jointly computing importance scores and per-head binary gating, followed by selective routing to either global or local attention:
```
for each layer k in 1..L:
    X ← hidden_states
    S = sigmoid( X @ W_router^(k) )        # shape n×m
    G = (S > τ).float()                    # binary gates, n×m
    for each head h in 1..m:
        Q_h = X @ W_Q^(k,h)
        K_h = X @ W_K^(k,h)
        V_h = X @ W_V^(k,h)
        for each token t in 1..n:
            if G[t,h] == 1:
                # full causal attention
                A[t,h] = Attention( Q_h[t], K_h[1:t], V_h[1:t] )
            else:
                # sliding-window attention
                start = max(1, t-w+1)
                A[t,h] = Attention( Q_h[t], K_h[start:t], V_h[start:t] )
    # Output aggregation
    Concat all A[:, h] across heads, project by W_O, add residual & FFN
```
During backpropagation, gradients propagate through the STE so that the router $W_{\text{router}}$ is trained end-to-end.
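For concreteness, a compact (and deliberately unoptimized) PyTorch rendering of the routing above follows: it materializes both the causal and sliding-window masks, computes both attention variants densely, and then selects per token–head gate, so it demonstrates the semantics of the algorithm rather than its intended compute savings. Shapes and the helper name are assumptions.

```python
import torch
import torch.nn.functional as F

def aha_attention(q, k, v, gates, window):
    """q, k, v: (n_heads, n, d_head); gates: (n, n_heads) binary float; window: int.
    Returns per-head outputs of shape (n_heads, n, d_head)."""
    n, d_head = q.size(1), q.size(2)
    i = torch.arange(n).unsqueeze(1)                  # query positions (column)
    j = torch.arange(n).unsqueeze(0)                  # key positions (row)
    causal = j <= i                                   # full causal mask
    local = causal & (j > i - window)                 # last `window` positions only

    scores = q @ k.transpose(-1, -2) / d_head ** 0.5  # (n_heads, n, n)

    def masked_attn(mask):
        s = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(s, dim=-1) @ v               # (n_heads, n, d_head)

    out_full, out_local = masked_attn(causal), masked_attn(local)
    g = gates.t().unsqueeze(-1)                       # (n_heads, n, 1)
    return g * out_full + (1 - g) * out_local         # route per token-head pair
```

In a deployed setting, the gated tokens would instead be dispatched to separate kernels so that the local branch actually skips distant keys, which is the source of the efficiency claims above.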
7. Limitations and Potential Extensions
AHA exhibits several constraints and extension opportunities:
- Sparsity–Performance Trade-off: Overly aggressive regularization (a large $\lambda$ combined with a small window $w$) can hurt accuracy (e.g., GSM8K EM drops from 0.4291 to 0.3730), while insufficient regularization (a small $\lambda$) leaves excess global attention without any accuracy benefit; an intermediate $\lambda$ achieves an effective sparsity–retention balance.
- System-Level Speedup: While AHA achieves algorithmic sparsity, current hardware and attention kernels (e.g., FlashAttention) are optimized for static dense computation. Realizing wall-clock speedups is contingent on hardware-adaptive, dynamic computation support.
- Scalability and Generalization: No from-scratch pre-training is required for AHA deployment. Integration and extension possibilities include multi-choice routing (heads select among multiple window sizes or global), mixture-of-experts/spans routing, and layering AHA atop other efficient attention mechanisms (sparse, low-rank, index-based) (Luo et al., 27 Dec 2025).
AHA establishes a minimal, data-dependent conditional computation design for dynamic sparsity in large-scale Transformers, conclusively showing that full global attention is usually redundant except for a narrow subset of model computations.