
All-or-Here Attention (AHA) in Transformers

Updated 30 December 2025
  • All-or-Here Attention (AHA) is a dynamic mechanism that routes each token between global and local (sliding-window) attention, optimizing computation in Transformers.
  • It replaces over 90% of full attention calls with efficient local alternatives, using a lightweight binary router per token-head pair.
  • Empirical evaluations show that, even with extreme sparsity, models retain or exceed baseline performance across diverse NLP tasks.

All-or-Here Attention (AHA) is a dynamic architectural mechanism designed to drastically reduce the frequency of full global attention computations in Transformer-based LLMs by enabling per-head, per-token routing between global (full) and local (sliding-window) attention. With a lightweight router and supervised fine-tuning, AHA achieves extensive sparsity in attention computations, replacing over 90% of global attention with local alternatives at no cost to downstream accuracy for practical window sizes. Empirical analysis reveals that the requirement for global attention exhibits a highly skewed distribution, with only a minority of tokens and heads necessitating access to the entire context, while the majority operate effectively within local context windows (Luo et al., 27 Dec 2025).

1. Formal Specification and Mechanisms

AHA extends the standard decoder-only Transformer architecture by incorporating a binary routing function, per layer and per head, that determines whether each token–head pair requires global or local context. Let $X\in\mathbb{R}^{n\times d}$ denote the hidden states, $n$ the sequence length, $d$ the model dimensionality, and $m$ the number of heads. Each attention head $h$ defines projections $W_Q, W_K, W_V \in \mathbb{R}^{d\times(d/m)}$:

  • Full Self-Attention (Global):

$$\text{Attention}_\text{all}(q_t,\ K,\ V) = \text{Softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_k}}\right)V,\qquad q_t = X_t W_Q,\quad d_k = \frac{d}{m}$$

  • Sliding-Window Attention (Local, window ww):

$$\text{Attention}_\text{here}(q_t,\ K_{t-w+1:t},\ V_{t-w+1:t}) = \text{Softmax}\!\left(\frac{q_t K_{t-w+1:t}^\top}{\sqrt{d_k}}\right)V_{t-w+1:t}$$

The routing function is implemented as $S = \sigma(X W_\text{router}) \in (0,1)^{n\times m}$, where $W_\text{router}\in\mathbb{R}^{d\times m}$. Binarization follows, with $g_{t,h} = \mathbb{1}(s_{t,h} > \tau)$ for threshold $\tau = 0.5$. Each head's output is

$$a_{t,h} = g_{t,h}\,\text{Attention}_\text{all}(q_{t,h},\ K_{1:t,h},\ V_{1:t,h}) + (1 - g_{t,h})\,\text{Attention}_\text{here}(q_{t,h},\ K_{t-w+1:t,h},\ V_{t-w+1:t,h})$$

Gradients during backpropagation are passed via a Straight-Through Estimator (STE): $\partial\mathcal{L}/\partial s_{t,h} \approx \partial\mathcal{L}/\partial g_{t,h}$.
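
A minimal PyTorch sketch of the router and the straight-through binarization, using the notation above (an illustrative sketch, not the released implementation):

import torch

def route(X, W_router, tau=0.5):
    """Per-token, per-head routing scores and hard gates.

    X        : (n, d) hidden states
    W_router : (d, m) router projection, one logit per head
    Returns soft scores S in (0,1)^{n x m} and hard gates G in {0,1}^{n x m};
    the straight-through trick lets gradients flow through G as if it were S.
    """
    S = torch.sigmoid(X @ W_router)          # soft scores, shape (n, m)
    G_hard = (S > tau).float()               # binary gates, no gradient
    G = G_hard + S - S.detach()              # straight-through estimator
    return S, G

# Example: n=8 tokens, d=16 model dim, m=4 heads
X = torch.randn(8, 16, requires_grad=True)
W_router = torch.randn(16, 4, requires_grad=True)
S, G = route(X, W_router)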

2. Architectural Integration and Modifications

AHA substitutes the standard full self-attention module in each Transformer (decoder-only) block with an AHA block. The router is applied as a linear projection immediately before the standard Q/K/V computations. For each token and head, binary gating determines whether to perform a full global attention read or a local sliding-window read. All other architectural components, including normalization, feed-forward modules, and residual pathways, remain unchanged. The system is designed for compatibility with established models (e.g., LLaMA, OLMo-2); implementation proceeds by initializing from a pre-trained checkpoint, inserting $W_\text{router}$, and continuing with supervised fine-tuning (Luo et al., 27 Dec 2025).
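
As a sketch of the integration step, assuming a LLaMA/OLMo-style module layout (`model.model.layers` and `layer.self_attn` are assumed attribute names, and the small-normal initialization is an assumption), attaching the routers before fine-tuning could look as follows; the routed forward pass itself is sketched in Section 6:

import torch.nn as nn

def insert_aha_routers(model, d_model, n_heads):
    """Attach a freshly initialized per-head router to every decoder block.
    `model.model.layers` and `layer.self_attn` are assumed attribute names for
    a LLaMA/OLMo-style checkpoint; all pretrained weights are kept as-is.
    """
    for layer in model.model.layers:
        router = nn.Linear(d_model, n_heads, bias=False)   # W_router in R^{d x m}
        nn.init.normal_(router.weight, std=0.02)            # generic small init (assumption)
        layer.self_attn.router = router
    return model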

3. Training Objective and Regularization

The optimization objective is a sum of the standard autoregressive language-modeling loss and an $L_1$-based regularization on the router's output:

$$\mathcal{L} = \mathcal{L}_\text{LM} + \lambda\,\mathcal{L}_\text{reg}$$

where

  • $\mathcal{L}_\text{LM}$: cross-entropy next-token prediction,
  • $\mathcal{L}_\text{reg} = \frac{1}{Lnm}\sum_{k=1}^{L}\sum_{t=1}^{n}\sum_{h=1}^{m} s_{t,h}^{(k)}$: penalizes global attention usage,
  • $\lambda$ ($3\times10^{-4}$ by default): controls the sparsity–performance trade-off,
  • threshold $\tau = 0.5$; the STE enables gradients to flow through the discrete gate selection.
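
In code, the combined objective above reduces to adding the mean router score across layers, tokens, and heads to the language-modeling loss; a minimal sketch with illustrative variable names:

import torch
import torch.nn.functional as F

def aha_loss(logits, labels, router_scores, lam=3e-4):
    """Cross-entropy LM loss plus the L1 sparsity penalty on router scores.

    logits        : (batch, n, vocab) next-token logits
    labels        : (batch, n) target token ids (already shifted)
    router_scores : list of (batch, n, m) soft scores, one tensor per layer
    """
    lm_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    # Mean over layers of per-layer means = (1/(L*n*m)) * triple sum, batch-averaged
    reg_loss = torch.stack([s.mean() for s in router_scores]).mean()
    return lm_loss + lam * reg_loss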

4. Empirical Evaluation and Performance

AHA was evaluated by fine-tuning OLMo-2-0425-1B-SFT for one epoch on TULU-v3 using AdamW optimization (learning rate $3\times10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.95$, warmup 3%, batch size 128). Window sizes $w$ explored: 16, 32, 64, 128, 256.
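
For reference, an optimizer setup consistent with these hyperparameters might look as follows; the linear-warmup shape and the flat post-warmup schedule are assumptions not stated above:

import torch

def build_optimizer(model, total_steps, lr=3e-5, warmup_frac=0.03):
    """AdamW with beta1=0.9, beta2=0.95 and a 3% warmup, as reported.
    The warmup shape and constant post-warmup rate are assumptions."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95))
    warmup_steps = max(1, int(warmup_frac * total_steps))
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    return opt, sched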

Downstream evaluation included:

  • Single-token: MMLU (Acc), HellaSwag (Acc_norm), CSQA (Acc)
  • Multi-token: GSM8K (EM, 5-shot), MBPP (Pass@1, zero-shot), MultiNews (ROUGE, zero-shot)

Key quantitative results for $w = 256$:

  • Relative retained accuracy: 102.5% of baseline
  • Average global attention usage $\mu_f \approx 6.7\%$; i.e., 93.3% of full attention operations are supplanted by local attention
  • At $w = 128$, $\mu_f \approx 11.6\%$; at $w = 16$, $\mu_f \approx 52.7\%$
  • Even with extreme locality ($w = 16$), the model retains $\approx 92.7\%$ of baseline performance

5. Distributional Analysis of Context Dependency

Analysis reveals a pronounced “long-tail” in the necessity of full attention as a function of window size:

  • As $w$ increases, $\mu_f$ drops from 52.7% ($w=16$) → 41.4% ($w=32$) → 28.1% ($w=64$) → 11.6% ($w=128$) → 6.7% ($w=256$)
  • Most heads and tokens rarely invoke global attention, confirmed by heatmaps across layers/heads where only a minority display high global attention rates (“heavy-hitters”)

Representative per-head “full-attention gap” (number of tokens between global-attention activations):

  • Layer 5, Head 2: gap $\approx 1$ (always-on)
  • Layer 12, Head 8: gap in tens–hundreds
  • Layer 3, Head 6: gap in thousands

This suggests most model computations can be executed efficiently with only local context, with sporadic global reads required.
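
For illustration, the gap statistic above can be computed directly from logged binary gates; the gate matrix `G` and its logging during inference are assumptions for this sketch:

import numpy as np

def full_attention_gaps(G, head):
    """Distances (in tokens) between consecutive global-attention activations
    for one head, given a logged binary gate matrix G of shape (n_tokens, n_heads)."""
    positions = np.flatnonzero(G[:, head])   # token indices where the gate fired
    if positions.size < 2:
        return np.array([])                  # head fired at most once
    return np.diff(positions)                # gap ≈ 1 means "always-on"

# Example: a head that fires every token vs. one that fires rarely
G = np.zeros((1000, 2), dtype=int)
G[:, 0] = 1                  # always-on head -> gaps of 1
G[::250, 1] = 1              # sparse head    -> gaps of 250
print(full_attention_gaps(G, 0).mean(), full_attention_gaps(G, 1).mean())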

6. Inference Algorithm and Implementation

Efficient inference with AHA is enabled by jointly computing importance scores and per-head binary gating, followed by selective routing to either global or local attention:

for each layer k in 1..L:
  X ← hidden_states
  S = sigmoid( X @ W_router^(k) )          # importance scores, shape n×m
  G = (S > τ).float()                      # binary gates, shape n×m
  for each head h in 1..m:
    Q_h = X @ W_Q^(k,h)
    K_h = X @ W_K^(k,h)
    V_h = X @ W_V^(k,h)
    for each token t in 1..n:
      if G[t,h] == 1:
        # full causal attention over the entire prefix
        A[t,h] = Attention( Q_h[t], K_h[1:t], V_h[1:t] )
      else:
        # sliding-window attention over the last w tokens
        start = max(1, t-w+1)
        A[t,h] = Attention( Q_h[t], K_h[start:t], V_h[start:t] )
  # Output aggregation
  concatenate A[:,h] across heads, project with W_O, add residual & FFN
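
For concreteness, the routing loop above can be written as a minimal runnable PyTorch function for a single layer and a single sequence; variable names are illustrative, and this is a sketch rather than the authors' released implementation:

import torch

def aha_layer(X, W_router, W_Q, W_K, W_V, w=256, tau=0.5):
    """One AHA attention layer for a single sequence (sketch).

    X        : (n, d) hidden states
    W_router : (d, m)
    W_Q/K/V  : (m, d, d_k) per-head projections with d_k = d // m
    Returns per-head outputs A of shape (n, m, d_k).
    """
    n, d = X.shape
    m, _, d_k = W_Q.shape
    S = torch.sigmoid(X @ W_router)          # (n, m) importance scores
    G = (S > tau)                            # (n, m) binary gates
    A = X.new_zeros(n, m, d_k)
    for h in range(m):
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]     # (n, d_k) each
        for t in range(n):
            if G[t, h]:
                lo = 0                       # "all": full causal prefix
            else:
                lo = max(0, t - w + 1)       # "here": sliding window
            scores = Q[t] @ K[lo:t + 1].T / d_k ** 0.5   # (t - lo + 1,)
            A[t, h] = torch.softmax(scores, dim=-1) @ V[lo:t + 1]
    return A

# Toy usage: n=10 tokens, d=16, m=4 heads, window w=4
X = torch.randn(10, 16)
Wr = torch.randn(16, 4)
WQ = torch.randn(4, 16, 4); WK = torch.randn(4, 16, 4); WV = torch.randn(4, 16, 4)
out = aha_layer(X, Wr, WQ, WK, WV, w=4)      # shape (10, 4, 4)

Note that for tokens with t < w the two branches coincide, so the gate only changes the computation once the prefix exceeds the window.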

During backpropagation, gradients propagate through the STE to train $W_\text{router}$ in an end-to-end manner.

7. Limitations and Potential Extensions

AHA exhibits several constraints and extension opportunities:

  • Sparsity–Performance Trade-off: Extreme regularization (e.g., $\lambda = 1\times10^{-3}$, $\mu_f \approx 4.5\%$) can hurt accuracy (e.g., GSM8K EM drops from 0.4291 to 0.3730), while insufficient regularization yields excess global attention ($\lambda = 1\times10^{-4}$, $\mu_f \approx 53.5\%$) without accuracy benefit. $\lambda = 3\times10^{-4}$ achieves an effective sparsity–retention balance.
  • System-Level Speedup: While AHA achieves algorithmic sparsity, current hardware and attention kernels (e.g., FlashAttention) are optimized for static dense computation. Realizing wall-clock speedups is contingent on hardware-adaptive, dynamic computation support.
  • Scalability and Generalization: No from-scratch pre-training is required for AHA deployment. Integration and extension possibilities include multi-choice routing (heads select among multiple window sizes or global), mixture-of-experts/spans routing, and layering AHA atop other efficient attention mechanisms (sparse, low-rank, index-based) (Luo et al., 27 Dec 2025).

AHA establishes a minimal, data-dependent conditional computation design for dynamic sparsity in large-scale Transformers, conclusively showing that full global attention is usually redundant except for a narrow subset of model computations.
