
All-or-Here Attention (AHA) in Transformers

Updated 30 December 2025
  • All-or-Here Attention (AHA) is a dynamic mechanism that routes each token between global and local (sliding-window) attention, optimizing computation in Transformers.
  • It replaces over 90% of full attention calls with efficient local alternatives, using a lightweight binary router per token-head pair.
  • Empirical evaluations show that, even with extreme sparsity, models retain or exceed baseline performance across diverse NLP tasks.

All-or-Here Attention (AHA) is a dynamic architectural mechanism designed to drastically reduce the frequency of full global attention computations in Transformer-based LLMs by enabling per-head, per-token routing between global (full) and local (sliding-window) attention. With a lightweight router and supervised fine-tuning, AHA achieves extensive sparsity in attention computations, replacing over 90% of global attention with local alternatives at no cost to downstream accuracy for practical window sizes. Empirical analysis reveals that the requirement for global attention exhibits a highly skewed distribution, with only a minority of tokens and heads necessitating access to the entire context, while the majority operate effectively within local context windows (Luo et al., 27 Dec 2025).

1. Formal Specification and Mechanisms

AHA extends the standard decoder-only Transformer architecture by incorporating a binary routing function, per layer and per head, that determines whether each token–head pair requires global or local context. Let $X\in\mathbb{R}^{n\times d}$ denote the hidden states, $n$ the sequence length, $d$ the model dimensionality, and $m$ the number of heads. Each attention head $h$ defines projections $W_Q, W_K, W_V \in \mathbb{R}^{d\times(d/m)}$:

  • Full Self-Attention (Global):

$$\text{Attention}_\text{all}(q_t,\ K,\ V) = \text{Softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_k}}\right)V,\qquad q_t = X_t W_Q,\quad d_k = \frac{d}{m}$$

  • Sliding-Window Attention (Local, window ww):

$$\text{Attention}_\text{here}(q_t,\ K_{t-w+1:t},\ V_{t-w+1:t}) = \text{Softmax}\!\left(\frac{q_t K_{t-w+1:t}^\top}{\sqrt{d_k}}\right)V_{t-w+1:t}$$

The routing function is implemented as $S = \sigma(X W_\text{router}) \in (0,1)^{n\times m}$, where $W_\text{router}\in\mathbb{R}^{d\times m}$. Binarization follows, with $g_{t,h} = \mathbb{1}(s_{t,h} > \tau)$ for threshold $\tau = 0.5$. Each head's output is

$$a_{t,h} = g_{t,h}\,\text{Attention}_\text{all}(q_{t,h},\ K_{1:t,h},\ V_{1:t,h}) + (1 - g_{t,h})\,\text{Attention}_\text{here}(q_{t,h},\ K_{t-w+1:t,h},\ V_{t-w+1:t,h})$$

Gradients during backpropagation are passed via a Straight-Through Estimator (STE): $\partial\mathcal{L}/\partial s_{t,h} \approx \partial\mathcal{L}/\partial g_{t,h}$.
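
A minimal PyTorch sketch of the router and the straight-through binarization, using the notation above (an illustrative sketch, not the released implementation):

import torch

def route(X, W_router, tau=0.5):
    """Per-token, per-head routing scores and hard gates.

    X        : (n, d) hidden states
    W_router : (d, m) router projection, one logit per head
    Returns soft scores S in (0,1)^{n x m} and hard gates G in {0,1}^{n x m};
    the straight-through trick lets gradients flow through G as if it were S.
    """
    S = torch.sigmoid(X @ W_router)          # soft scores, shape (n, m)
    G_hard = (S > tau).float()               # binary gates, no gradient
    G = G_hard + S - S.detach()              # straight-through estimator
    return S, G

# Example: n=8 tokens, d=16 model dim, m=4 heads
X = torch.randn(8, 16, requires_grad=True)
W_router = torch.randn(16, 4, requires_grad=True)
S, G = route(X, W_router)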

2. Architectural Integration and Modifications

AHA substitutes the standard full self-attention module in each Transformer (decoder-only) block with an AHA block. The router is applied as a linear projection immediately before the standard Q/K/V computations. For each token and head, binary gating determines whether to perform a full global attention read or a local sliding-window read. All other architectural components, including normalization, feed-forward modules, and residual pathways, remain unchanged. The system is designed for compatibility with established models (e.g., LLaMA, OLMo-2); implementation proceeds by initializing from a pre-trained checkpoint, inserting $W_\text{router}$, and continuing with supervised fine-tuning (Luo et al., 27 Dec 2025).
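
As a sketch of the integration step, assuming a LLaMA/OLMo-style module layout (`model.model.layers` and `layer.self_attn` are assumed attribute names, and the small-normal initialization is an assumption), attaching the routers before fine-tuning could look as follows; the routed forward pass itself is sketched in Section 6:

import torch.nn as nn

def insert_aha_routers(model, d_model, n_heads):
    """Attach a freshly initialized per-head router to every decoder block.
    `model.model.layers` and `layer.self_attn` are assumed attribute names for
    a LLaMA/OLMo-style checkpoint; all pretrained weights are kept as-is.
    """
    for layer in model.model.layers:
        router = nn.Linear(d_model, n_heads, bias=False)   # W_router in R^{d x m}
        nn.init.normal_(router.weight, std=0.02)            # generic small init (assumption)
        layer.self_attn.router = router
    return model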

3. Training Objective and Regularization

The optimization objective is a sum of the standard autoregressive language-modeling loss and an $L_1$-based regularization on the router's output:

$$\mathcal{L} = \mathcal{L}_\text{LM} + \lambda\,\mathcal{L}_\text{reg}$$

where

  • $\mathcal{L}_\text{LM}$: cross-entropy next-token prediction,
  • $\mathcal{L}_\text{reg} = \frac{1}{Lnm}\sum_{k=1}^{L}\sum_{t=1}^{n}\sum_{h=1}^{m} s_{t,h}^{(k)}$: penalizes global attention usage,
  • $\lambda$ ($3\times10^{-4}$ by default): controls the sparsity–performance trade-off,
  • threshold $\tau = 0.5$; the STE enables gradients to flow through the discrete gate selection.
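
In code, the combined objective above reduces to adding the mean router score across layers, tokens, and heads to the language-modeling loss; a minimal sketch with illustrative variable names:

import torch
import torch.nn.functional as F

def aha_loss(logits, labels, router_scores, lam=3e-4):
    """Cross-entropy LM loss plus the L1 sparsity penalty on router scores.

    logits        : (batch, n, vocab) next-token logits
    labels        : (batch, n) target token ids (already shifted)
    router_scores : list of (batch, n, m) soft scores, one tensor per layer
    """
    lm_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    # Mean over layers of per-layer means = (1/(L*n*m)) * triple sum, batch-averaged
    reg_loss = torch.stack([s.mean() for s in router_scores]).mean()
    return lm_loss + lam * reg_loss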

4. Empirical Evaluation and Performance

AHA was evaluated by fine-tuning OLMo-2-0425-1B-SFT for one epoch on TULU-v3 using AdamW optimization (learning rate $3\times10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.95$, warmup 3%, batch size 128). Window sizes $w$ explored: 16, 32, 64, 128, 256.
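
For reference, an optimizer setup consistent with these hyperparameters might look as follows; the linear-warmup shape and the flat post-warmup schedule are assumptions not stated above:

import torch

def build_optimizer(model, total_steps, lr=3e-5, warmup_frac=0.03):
    """AdamW with beta1=0.9, beta2=0.95 and a 3% warmup, as reported.
    The warmup shape and constant post-warmup rate are assumptions."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95))
    warmup_steps = max(1, int(warmup_frac * total_steps))
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    return opt, sched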

Downstream evaluation included:

  • Single-token: MMLU (Acc), HellaSwag (Acc_norm), CSQA (Acc)
  • Multi-token: GSM8K (EM, 5-shot), MBPP (Pass@1, zero-shot), MultiNews (ROUGE, zero-shot)

Key quantitative results for $w = 256$:

  • Relative retained accuracy: 102.5% of baseline
  • Average global attention usage $\mu_f \approx 6.7\%$; i.e., 93.3% of full attention operations are supplanted by local attention
  • At $w = 128$, $\mu_f \approx 11.6\%$; at $w = 16$, $\mu_f \approx 52.7\%$
  • Even with extreme locality ($w = 16$), the model retains $\approx 92.7\%$ of baseline performance

5. Distributional Analysis of Context Dependency

Analysis reveals a pronounced “long-tail” in the necessity of full attention as a function of window size:

  • As $w$ increases, $\mu_f$ drops from 52.7% ($w=16$) → 41.4% ($w=32$) → 28.1% ($w=64$) → 11.6% ($w=128$) → 6.7% ($w=256$)
  • Most heads and tokens rarely invoke global attention, confirmed by heatmaps across layers/heads where only a minority display high global attention rates (“heavy-hitters”)

Representative per-head “full-attention gap” (number of tokens between global-attention activations):

  • Layer 5, Head 2: gap $\approx 1$ (always-on)
  • Layer 12, Head 8: gap in tens–hundreds
  • Layer 3, Head 6: gap in thousands

This suggests most model computations can be executed efficiently with only local context, with sporadic global reads required.
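
For illustration, the gap statistic above can be computed directly from logged binary gates; the gate matrix `G` and its logging during inference are assumptions for this sketch:

import numpy as np

def full_attention_gaps(G, head):
    """Distances (in tokens) between consecutive global-attention activations
    for one head, given a logged binary gate matrix G of shape (n_tokens, n_heads)."""
    positions = np.flatnonzero(G[:, head])   # token indices where the gate fired
    if positions.size < 2:
        return np.array([])                  # head fired at most once
    return np.diff(positions)                # gap ≈ 1 means "always-on"

# Example: a head that fires every token vs. one that fires rarely
G = np.zeros((1000, 2), dtype=int)
G[:, 0] = 1                  # always-on head -> gaps of 1
G[::250, 1] = 1              # sparse head    -> gaps of 250
print(full_attention_gaps(G, 0).mean(), full_attention_gaps(G, 1).mean())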

6. Inference Algorithm and Implementation

Efficient inference with AHA is enabled by jointly computing importance scores and per-head binary gating, followed by selective routing to either global or local attention:

for each layer k in 1..L:
  X ← hidden_states
  S = sigmoid( X @ W_router^(k) )          # importance scores, shape n×m
  G = (S > τ).float()                      # binary gates, shape n×m
  for each head h in 1..m:
    Q_h = X @ W_Q^(k,h)
    K_h = X @ W_K^(k,h)
    V_h = X @ W_V^(k,h)
    for each token t in 1..n:
      if G[t,h] == 1:
        # full causal attention over the entire prefix
        A[t,h] = Attention( Q_h[t], K_h[1:t], V_h[1:t] )
      else:
        # sliding-window attention over the last w tokens
        start = max(1, t-w+1)
        A[t,h] = Attention( Q_h[t], K_h[start:t], V_h[start:t] )
  # Output aggregation
  concatenate A[:,h] across heads, project with W_O, add residual & FFN
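
For concreteness, the routing loop above can be written as a minimal runnable PyTorch function for a single layer and a single sequence; variable names are illustrative, and this is a sketch rather than the authors' released implementation:

import torch

def aha_layer(X, W_router, W_Q, W_K, W_V, w=256, tau=0.5):
    """One AHA attention layer for a single sequence (sketch).

    X        : (n, d) hidden states
    W_router : (d, m)
    W_Q/K/V  : (m, d, d_k) per-head projections with d_k = d // m
    Returns per-head outputs A of shape (n, m, d_k).
    """
    n, d = X.shape
    m, _, d_k = W_Q.shape
    S = torch.sigmoid(X @ W_router)          # (n, m) importance scores
    G = (S > tau)                            # (n, m) binary gates
    A = X.new_zeros(n, m, d_k)
    for h in range(m):
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]     # (n, d_k) each
        for t in range(n):
            if G[t, h]:
                lo = 0                       # "all": full causal prefix
            else:
                lo = max(0, t - w + 1)       # "here": sliding window
            scores = Q[t] @ K[lo:t + 1].T / d_k ** 0.5   # (t - lo + 1,)
            A[t, h] = torch.softmax(scores, dim=-1) @ V[lo:t + 1]
    return A

# Toy usage: n=10 tokens, d=16, m=4 heads, window w=4
X = torch.randn(10, 16)
Wr = torch.randn(16, 4)
WQ = torch.randn(4, 16, 4); WK = torch.randn(4, 16, 4); WV = torch.randn(4, 16, 4)
out = aha_layer(X, Wr, WQ, WK, WV, w=4)      # shape (10, 4, 4)

Note that for tokens with t < w the two branches coincide, so the gate only changes the computation once the prefix exceeds the window.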

During backpropagation, gradients propagate through the STE to train $W_\text{router}$ in an end-to-end manner.

7. Limitations and Potential Extensions

AHA exhibits several constraints and extension opportunities:

  • Sparsity–Performance Trade-off: Extreme regularization (e.g., $\lambda = 1\times10^{-3}$, $\mu_f \approx 4.5\%$) can hurt accuracy (e.g., GSM8K EM drops from 0.4291 to 0.3730), while insufficient regularization yields excess global attention ($\lambda = 1\times10^{-4}$, $\mu_f \approx 53.5\%$) without accuracy benefit. $\lambda = 3\times10^{-4}$ achieves an effective sparsity–retention balance.
  • System-Level Speedup: While AHA achieves algorithmic sparsity, current hardware and attention kernels (e.g., FlashAttention) are optimized for static dense computation. Realizing wall-clock speedups is contingent on hardware-adaptive, dynamic computation support.
  • Scalability and Generalization: No from-scratch pre-training is required for AHA deployment. Integration and extension possibilities include multi-choice routing (heads select among multiple window sizes or global), mixture-of-experts/spans routing, and layering AHA atop other efficient attention mechanisms (sparse, low-rank, index-based) (Luo et al., 27 Dec 2025).

AHA establishes a minimal, data-dependent conditional computation design for dynamic sparsity in large-scale Transformers, conclusively showing that full global attention is usually redundant except for a narrow subset of model computations.
