Attention-Based Peeking (ABP) in Transformers

Updated 4 July 2026
  • ABP is a technique that controls transformer attention by manipulating token access via masking or multi-token prediction heads.
  • In causal transformers, ABP isolates an 'All-for-One' subgraph that enables efficient mental arithmetic with precise layer configurations and high faithfulness.
  • In ASR, ABP enhances contextual biasing by leveraging future-token logits to score named entities without additional bias encoders.

Attention-Based Peeking (ABP) denotes two distinct techniques introduced in late 2025 under the same acronym, each using constrained or augmented attention to expose or exploit latent structure in sequence processing. In "All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens," ABP is a causal-attention masking intervention for transformer analysis, designed to restrict cross-token information transfer so that only the last token may access earlier tokens during designated layers, thereby isolating an "All-for-One" subgraph for mental arithmetic (Mamidanna et al., 11 Sep 2025). In "Peeking Into The Future For Contextual Biasing," ABP is an inference and training mechanism for attention-based encoder-decoder automatic speech recognition, in which multiple future-token prediction heads are used to score candidate named entities from a bias list without adding a separate bias encoder or cross-attention layers (Selvakumar et al., 19 Dec 2025). The shared label reflects a common high-level intuition, namely controlled access to nonlocal information, but the two methods differ substantially in objective, architecture, formalization, and empirical setting.

1. Terminological scope and problem settings

The ABP method in (Mamidanna et al., 11 Sep 2025) arises from a mechanistic interpretability question for causal transformers. The motivating issue is whether, in practice, an LLM solving direct mental-math next-token prediction actually uses the full computational latitude theoretically available under causal self-attention and multilayer perceptron layers, or whether useful computation is concentrated in a much narrower subgraph. ABP is paired with Context-Aware Mean Ablation (CAMA) to test whether cross-token attention can be pruned almost everywhere while preserving performance on arithmetic tasks (Mamidanna et al., 11 Sep 2025).

The ABP method in (Selvakumar et al., 19 Dec 2025) addresses a different problem: contextual biasing in end-to-end automatic speech recognition. Here the goal is to improve recognition of rare or unseen named entities supplied at inference time in a bias list, while avoiding additional entity encoders or cross-attention modules. The method uses multi-token prediction heads to "peek into the future" and score bias-list entities directly from decoder logits (Selvakumar et al., 19 Dec 2025).

This terminological overlap can be a source of confusion. In the first case, ABP is fundamentally an attention-mask surgery over existing transformer heads; in the second, it is a multi-head future-token prediction extension attached to the decoder of an attention-based encoder-decoder model. A plausible implication is that "ABP" should be interpreted contextually rather than as a single standardized method family.

2. ABP for causal-transformer circuit isolation

In (Mamidanna et al., 11 Sep 2025), ABP is introduced to enforce a precise information-flow regime in a standard causal transformer. During designated "information-transfer" layers, only the last token may peek at other tokens, whereas all non-last tokens are restricted to self-attention together with an always-allowed beginning-of-sequence token. In later layers, even the last token is reduced to self-attention, so that all input-specific computation must occur within the last token’s residual stream (Mamidanna et al., 11 Sep 2025).

Formally, with sequence length $T$ and transformer depth $L$, each attention head in layer $l$ produces a pre-softmax attention-score matrix $M^{(l)} \in \mathbb{R}^{T \times T}$, with standard causal masking forbidding attention to future positions. ABP introduces peeking index sets $K_q \subseteq \{1,2,\dots,q\}$ for each query position $q$, and updates attention scores by retaining $M^{(l)}_{q,k}$ only when $k \in K_q$, replacing all other entries by $-\infty$ (Mamidanna et al., 11 Sep 2025). The resulting softmax enforces the desired sparsity pattern:

$\widetilde M^{(l)}_{q,k} \;=\; \begin{cases} M^{(l)}_{q,k}, & k\in K_q,\\[6pt] -\infty, & k\notin K_q, \end{cases} \qquad\forall\,q,k\in\{1,\dots,T\}.$

Two canonical cases are defined. "Full-peeking" uses $K_q = \{1,\dots,q\}$ and recovers standard causal attention. "Self-peeking" uses $K_q = \{q\}$. To preserve the common attention sink on the beginning-of-sequence token, every set is augmented by the BOS position 1, so effectively self-peeking becomes $K_q = \{1, q\}$ (Mamidanna et al., 11 Sep 2025).
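
The masking rule and the two canonical peeking sets can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `peek_set`, `apply_abp`, and `softmax`, the string labels "full" and "self", and the `keep_bos` flag are all assumptions introduced here for illustration. Positions are 1-indexed to match the notation above.

```python
import math

NEG_INF = float("-inf")

def peek_set(q, mode, keep_bos=True):
    """Return the peeking set K_q (1-indexed key positions) for query q."""
    if mode == "full":
        keys = set(range(1, q + 1))   # every position up to q: causal attention
    elif mode == "self":
        keys = {q}                    # the token may look only at itself
    else:
        raise ValueError(f"unknown mode: {mode}")
    if keep_bos:
        keys.add(1)                   # BOS (position 1) is always allowed
    return keys

def apply_abp(scores, modes, keep_bos=True):
    """Keep score (q, k) only when k is in K_q; set every other entry to -inf."""
    T = len(scores)
    masked = []
    for q in range(1, T + 1):
        allowed = peek_set(q, modes[q - 1], keep_bos)
        masked.append([scores[q - 1][k - 1] if k in allowed else NEG_INF
                       for k in range(1, T + 1)])
    return masked

def softmax(row):
    m = max(row)                      # finite: each row keeps its diagonal entry
    exps = [math.exp(v - m) for v in row]   # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: uniform scores, only the last of four tokens gets full-peeking.
scores = [[0.0] * 4 for _ in range(4)]
masked = apply_abp(scores, ["self", "self", "self", "full"])
weights = [softmax(row) for row in masked]
```

In this toy case the third token splits its attention between BOS and itself, while the last token attends uniformly to all four positions.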

The method is applied in a three-phase layer partition. In the waiting layers, CAMA is used and ABP is not applied. In the transfer layers, ABP allows the last token, at position $T$, full-peeking, $K_T = \{1,\dots,T\}$, while all other tokens remain in self-peeking mode. In the final layers, all tokens, including the last, are restricted to self-peeking (Mamidanna et al., 11 Sep 2025). This structure is intended to determine whether a narrow window of cross-token transfer suffices to support final-token arithmetic.
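
The three-phase partition can be written down as a per-layer plan. The sketch below is illustrative only: `abp_schedule`, `n_wait`, `n_transfer`, and the "cama"/"abp" labels are hypothetical names, and the example values (32 layers, 15 waiting layers, 2 transfer layers) are taken from the Llama-3-8B configuration described later in this article.

```python
def abp_schedule(num_layers, n_wait, n_transfer, seq_len):
    """Return, for each layer, the intervention and per-position peeking modes."""
    if n_wait + n_transfer > num_layers:
        raise ValueError("waiting + transfer layers exceed model depth")
    plan = []
    for layer in range(num_layers):
        if layer < n_wait:
            # Waiting phase: CAMA is applied and ABP is not.
            plan.append(("cama", None))
        elif layer < n_wait + n_transfer:
            # Transfer phase: only the last token gets full-peeking.
            plan.append(("abp", ["self"] * (seq_len - 1) + ["full"]))
        else:
            # Final phase: every token, including the last, is self-peeking.
            plan.append(("abp", ["self"] * seq_len))
    return plan

# 0-indexed layers 0-14 wait, 15-16 transfer, 17-31 are final layers.
plan = abp_schedule(num_layers=32, n_wait=15, n_transfer=2, seq_len=5)
```

With 0-indexed layers, the transfer window in this example falls on layers 15 and 16.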

3. Integration with CAMA and the All-for-One subgraph

ABP in (Mamidanna et al., 11 Sep 2025) is not a standalone intervention but part of a decomposition used to identify an "All-for-One" subgraph (AF1). CAMA first strips away early cross-token dependence while preserving task-general processing by replacing each token’s residual representation with a context-aware expectation conditioned on the token identity. The paper defines this phase-1 replacement for each waiting layer (Mamidanna et al., 11 Sep 2025).

ABP then governs the middle and late layers, ensuring that only the last token receives information from other tokens in the transfer window and that subsequent computation remains local to the last token’s residual stream (Mamidanna et al., 11 Sep 2025).

This arrangement operationalizes a claim about where and when input-specific computation occurs. According to the paper, because CAMA eliminates early cross-token paths and ABP prunes middle and late ones, the resulting AF1 subgraph shows that all input-specific computation occurs at the very last token, and only during the transfer layers (Mamidanna et al., 11 Sep 2025). The paper evaluates such subgraphs with a faithfulness metric that compares the subgraph's behavior with the full model's.

The central significance of ABP in this setting is therefore diagnostic rather than architectural. It is a controlled masking method for testing whether a sharply restricted practical information-flow graph can reproduce the full model’s behavior on prompts the full model already solves correctly. This suggests an intervention-based view of transformer computation in which layerwise and positional sparsification can expose compact functional circuits.

4. Empirical behavior of ABP in mental-math transformers

The main empirical focus in (Mamidanna et al., 11 Sep 2025) is Llama-3-8B on a direct mental-math next-token task, with additional cross-model tests on Llama-3.1-8B, Pythia-6.9B, and GPT-J-6B for two-operand tasks. For Llama-3-8B, the authors grid-searched the number of waiting layers and the number of transfer layers and found a sharp threshold at 15 CAMA layers followed by 2 transfer layers (Mamidanna et al., 11 Sep 2025).

The reported results identify phase transitions. Faithfulness stays high while the number of waiting layers is at or below the threshold and collapses once it is exceeded. Likewise, faithfulness stays high when the transfer window is at least as long as the threshold and collapses when it is shorter (Mamidanna et al., 11 Sep 2025). The interpretation offered in the paper is that the model can defer practically all meaningful arithmetic work until very late in depth, provided the last token receives access to earlier tokens for a brief middle-layer burst.

Head-level ablations further refine the picture. Within the two transfer layers, 64 heads exist; iterative removal of the least important heads permits removal of 59 heads with only a 1–2% drop, while a final handful of heads, including L15H13, L15H3, L16H1, and L16H21, each cause drastic collapse when ablated (Mamidanna et al., 11 Sep 2025). This suggests that the ABP-defined transfer window is sparse not only across layers and positions but also across attention heads.
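
One plausible way to organize such an iterative head ablation is greedy backward elimination. The sketch below is an assumption-laden illustration, not the paper's procedure: `prune_heads`, `max_drop`, and `toy_faithfulness` are hypothetical names. The toy scoring function treats only the four heads named above as mattering, so it does not reproduce the paper's head counts.

```python
def prune_heads(heads, faithfulness, max_drop):
    """Greedily drop heads while faithfulness stays within max_drop of the start."""
    kept = list(heads)
    baseline = faithfulness(kept)
    removed = []
    while len(kept) > 1:
        # Find the head whose removal leaves faithfulness highest.
        best_head, best_score = None, float("-inf")
        for h in kept:
            score = faithfulness([x for x in kept if x != h])
            if score > best_score:
                best_head, best_score = h, score
        if baseline - best_score > max_drop:
            break                      # every remaining removal costs too much
        kept.remove(best_head)
        removed.append(best_head)
    return kept, removed

# Toy setup: two transfer layers with 32 heads each (64 heads in total).
critical = {"L15H13", "L15H3", "L16H1", "L16H21"}
all_heads = [f"L{layer}H{head}" for layer in (15, 16) for head in range(32)]

def toy_faithfulness(kept):
    # Hypothetical stand-in for a real evaluation on the model.
    return len(critical & set(kept)) / len(critical)

kept, removed = prune_heads(all_heads, toy_faithfulness, max_drop=0.02)
```

In a real experiment `faithfulness` would run the masked model on held-out prompts, which makes each greedy step far more expensive than in this toy.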

Cross-model transfer yields weaker but related patterns. Pythia and GPT-J exhibit AF1-style circuits with shorter waits and longer transfer windows, and recover faithfulness only partially (Mamidanna et al., 11 Sep 2025). The paper also reports that alternative masking and ablation strategies (direct-embedding copy, random-token mean, self-peek-as-waiting, and isolated-forward-pass) fail to preserve accuracy when the number of waiting layers is large, whereas CAMA plus ABP uncovers the minimal subgraph (Mamidanna et al., 11 Sep 2025).

5. ABP for contextual biasing in attention-based encoder-decoder ASR

In (Selvakumar et al., 19 Dec 2025), ABP refers to a different mechanism built on multi-token prediction in an attention-based encoder-decoder architecture. The baseline model takes 80-dimensional log-Mel frames as input, applies two convolutional downsampling layers and a linear projection, passes the result through a 12-block Conformer stack, and decodes with a standard autoregressive decoder, which factorizes the transcript probability into per-token conditionals given the audio and the previously emitted tokens. At each decoder step, the decoder state is mapped to next-token logits (Selvakumar et al., 19 Dec 2025).

ABP extends this decoder by attaching several parallel prediction heads to the decoder’s top layer. Each head maps the shared decoder output at the current step into logits for one future token position, so that the joint probability of the next few tokens is approximated by a product of per-head predictions. All heads share the final projection to reduce parameters (Selvakumar et al., 19 Dec 2025). The multi-token training objective is a weighted sum of cross-entropies, one per head, with the per-head weights specified in the reported setup (Selvakumar et al., 19 Dec 2025).
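
The weighted multi-token objective can be sketched for a single decoder step as follows. This is a minimal illustration under stated assumptions: `mtp_loss` and `log_softmax` are hypothetical helper names, and the weights in the example are placeholders chosen here, not the values used in the paper.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mtp_loss(head_logits, targets, weights):
    """Weighted sum of per-head cross-entropies at one decoder step.

    head_logits[k] holds head k's vocabulary logits, targets[k] is the id of
    the token that head k should predict, and weights[k] is its loss weight.
    """
    total = 0.0
    for logits, target, weight in zip(head_logits, targets, weights):
        total += weight * -log_softmax(logits)[target]
    return total

# Two heads over a 4-token vocabulary with uniform logits (placeholder weights).
loss = mtp_loss(
    head_logits=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    targets=[2, 1],
    weights=[1.0, 0.5],
)
```

With uniform logits each head contributes ln 4 times its weight, so the toy loss equals 1.5 ln 4.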

The contextual-biasing use of ABP begins by collecting the future-token logits from all heads into a single tensor. Given a dynamic bias list, where each entity is a sub-word sequence, the model extracts for each entity a vector of the logits corresponding to that entity's sub-word tokens, with padding or truncation as needed (Selvakumar et al., 19 Dec 2025). A special no-bias entity is added, and a small trainable scorer maps each entity's vector to a score, from which an entity posterior is formed by softmax over the candidate entities and the no-bias entity.
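
The entity-scoring step can be sketched as below. Several details are assumptions made for illustration rather than facts from the paper: the feature for an entity is taken to be head k's logit for the entity's k-th sub-word, padding uses a constant `pad_value`, the no-bias entity receives a fixed score, and a plain mean stands in for the trainable scorer. All function names are hypothetical.

```python
import math

def entity_features(future_logits, entity_tokens, pad_value=0.0):
    """Collect one logit per head: head k's logit for the entity's k-th sub-word."""
    feats = []
    for k in range(len(future_logits)):
        if k < len(entity_tokens):
            feats.append(future_logits[k][entity_tokens[k]])
        else:
            feats.append(pad_value)   # entity shorter than the lookahead: pad
    return feats                      # sub-words beyond the last head are dropped

def entity_posterior(future_logits, bias_list, scorer, no_bias_score=0.0):
    """Softmax over the no-bias entity (index 0) followed by the bias-list entities."""
    scores = [no_bias_score]
    scores += [scorer(entity_features(future_logits, e)) for e in bias_list]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_scorer(features):
    # Hypothetical stand-in for the small trainable scorer.
    return sum(features) / len(features)

# Two heads over a 3-token vocabulary; two candidate entities of two sub-words.
future_logits = [[2.0, 0.0, -1.0], [0.0, 3.0, -1.0]]
bias_list = [[0, 1], [2, 2]]
posterior = entity_posterior(future_logits, bias_list, mean_scorer)
```

Here the first entity, whose sub-words match the high-logit tokens of both heads, receives most of the posterior mass.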

Training combines the multi-token prediction loss with an entity classification loss computed on the entity posterior. The target at each decoder step is the index of the entity beginning at that step, or the no-bias entity if none begins there, and the total loss combines the two terms (Selvakumar et al., 19 Dec 2025).

At inference time, the method scores the bias-list entities at each decoding step, applies a pruning threshold, and builds a unified candidate score over static tokens and bias-list entities, using a bias weight to control the strength of entity insertion (Selvakumar et al., 19 Dec 2025).

6. Experimental results in ASR contextual biasing

The ASR model in (Selvakumar et al., 19 Dec 2025) is trained on LibriSpeech-960h and evaluated on test-clean and test-other. The reported setup uses 80-dimensional log-Mel features with SpecAugment, a 12-layer Conformer encoder with a 4x expansion FFN and 8 attention heads, and a 6-layer Transformer-style decoder with the same model dimension and number of heads. Named-entity annotations use spaCy "en_core_web_trf", with approximately 700 unique entities in test (Selvakumar et al., 19 Dec 2025).

The principal metrics are WER, B-WER, and U-WER, where B-WER is the error rate only on named entities and U-WER is the error rate on all other words. The paper reports results for several bias-list sizes and compares a baseline AED model, CLAS, AED + MTP without bias, and the ABP method under two configurations (Selvakumar et al., 19 Dec 2025).

Model | test-clean (WER / B-WER / U-WER, %) | test-other (WER / B-WER / U-WER, %)
Baseline AED | 2.73 / 17.52 / 2.27 | 6.01 / 32.34 / 5.07
AED + MTP w/o bias | 2.58 / 17.27 / 2.27 | 6.00 / 30.63 / 5.12
Ours (first configuration) | 2.34 / 10.98 / 2.07 | 5.82 / 21.85 / 5.24
Ours (second configuration) | 2.27 / 8.70 / 2.07 | 5.64 / 17.22 / 5.22

On test-clean, B-WER drops from 17.52 for the baseline AED model to 8.70 for the best ABP configuration in the table, an approximately 50% relative reduction, with no U-WER degradation (Selvakumar et al., 19 Dec 2025). The paper also reports a further setting change that lowers B-WER again at a slight cost in total WER, while U-WER remains stable (Selvakumar et al., 19 Dec 2025). The paper characterizes the method as architecturally simple because it uses no separate bias-phrase encoder and no cross-attention, and as mitigating entity fragmentation by treating each multi-subword entity as a single score (Selvakumar et al., 19 Dec 2025).
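
The relative reductions implied by the table can be checked with a few lines of arithmetic. The helper name `relative_reduction` is introduced here for illustration; the input values are the B-WER entries from the table above.

```python
def relative_reduction(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

# B-WER on test-clean: baseline AED versus the two ABP rows in the table.
clean_best = relative_reduction(17.52, 8.70)    # roughly 50.3
clean_first = relative_reduction(17.52, 10.98)  # roughly 37.3

# B-WER on test-other: baseline AED versus the best ABP row.
other_best = relative_reduction(32.34, 17.22)   # roughly 46.8
```

The test-clean figure for the best row is consistent with the approximately 50% relative reduction cited in the text.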

The limitations reported are also specific. The fixed number of prediction heads constrains the maximum entity length that can be peeked; the paper states that 87% of entities are of length at most 4 tokens. Additional overhead arises from running the extra prediction heads at every decoding step, and careful negative sampling of bias lists is required during training (Selvakumar et al., 19 Dec 2025).

7. Comparative interpretation, misconceptions, and open questions

The two ABP methods share a structural intuition—selective access to information that would otherwise be diffuse—but they operate at different levels. In (Mamidanna et al., 11 Sep 2025), ABP is a masking-based intervention on the internal information-flow graph of a causal transformer. In (Selvakumar et al., 19 Dec 2025), ABP is a predictive extension that converts future-token logits into entity scores for contextual biasing. One common misconception would be to treat them as minor variants of the same algorithm; the formal definitions in the two papers do not support that interpretation.

The mechanistic conclusions from (Mamidanna et al., 11 Sep 2025) are narrow in scope. The method was tested on pure mental-math prompts, and the paper notes that word problems or embedded code fail under the same AF1 circuit, suggesting that additional subgraphs govern semantic understanding (Mamidanna et al., 11 Sep 2025). It also requires a tokenizer that assigns each integer a single token; models that split large numbers into multiple subtokens need an extension of ABP and CAMA (Mamidanna et al., 11 Sep 2025). More generally, whether similarly sparse subgraphs can be revealed for multi-step chain-of-thought, commonsense reasoning, or factual retrieval remains open (Mamidanna et al., 11 Sep 2025).

The ASR contextual-biasing ABP likewise has bounded scope. It is designed for attention-based encoder-decoder models with a dynamic bias list and uses a conditional-independence approximation over multiple future-token heads (Selvakumar et al., 19 Dec 2025). The method assumes that meaningful entity evidence can be extracted from a short lookahead window and compressed into a small scorer over candidate entities. This suggests a lightweight alternative to explicit bias encoders, but the fixed lookahead horizon and dependence on candidate-list construction remain substantive design constraints.

Taken together, the two 2025 ABP methods illustrate a broader research pattern in sequence modeling: attention can be made more analytically transparent or more operationally useful by restricting or repurposing access patterns. In one case, ABP exposes a sparse last-token arithmetic circuit in causal transformers (Mamidanna et al., 11 Sep 2025). In the other, it converts decoder lookahead into contextual biasing signals for named-entity recognition in ASR (Selvakumar et al., 19 Dec 2025). The shared acronym therefore names not a single established paradigm but a pair of contemporaneous methods linked by the idea of controlled peeking.
