Attention-Based Peeking (ABP) in Transformers

Updated 4 July 2026

ABP is a technique that controls transformer attention by manipulating token access via masking or multi-token prediction heads.
In causal transformers, ABP isolates an 'All-for-One' subgraph that enables efficient mental arithmetic with precise layer configurations and high faithfulness.
In ASR, ABP enhances contextual biasing by leveraging future-token logits to score named entities without additional bias encoders.

Attention-Based Peeking (ABP) denotes two distinct techniques introduced in late 2025 under the same acronym, each using constrained or augmented attention to expose or exploit latent structure in sequence processing. In "All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens," ABP is a causal-attention masking intervention for transformer analysis, designed to restrict cross-token information transfer so that only the last token may access earlier tokens during designated layers, thereby isolating an "All-for-One" subgraph for mental arithmetic (Mamidanna et al., 11 Sep 2025). In "Peeking Into The Future For Contextual Biasing," ABP is an inference and training mechanism for attention-based encoder-decoder automatic speech recognition, in which multiple future-token prediction heads are used to score candidate named entities from a bias list without adding a separate bias encoder or cross-attention layers (Selvakumar et al., 19 Dec 2025). The shared label reflects a common high-level intuition, controlled access to nonlocal information, but the two methods differ substantially in objective, architecture, formalization, and empirical setting.

1. Terminological scope and problem settings

The ABP method in (Mamidanna et al., 11 Sep 2025) arises from a mechanistic interpretability question for causal transformers. The motivating issue is whether, in practice, an LLM solving direct mental-math next-token prediction actually uses the full computational latitude theoretically available under causal self-attention and multilayer perceptron layers, or whether useful computation is concentrated in a much narrower subgraph. ABP is paired with Context-Aware Mean Ablation (CAMA) to test whether cross-token attention can be pruned almost everywhere while preserving performance on arithmetic tasks (Mamidanna et al., 11 Sep 2025).

The ABP method in (Selvakumar et al., 19 Dec 2025) addresses a different problem: contextual biasing in end-to-end automatic speech recognition. Here the goal is to improve recognition of rare or unseen named entities supplied at inference time in a bias list, while avoiding additional entity encoders or cross-attention modules. The method uses multi-token prediction heads to "peek into the future" and score bias-list entities directly from decoder logits (Selvakumar et al., 19 Dec 2025).

This terminological overlap can be a source of confusion. In the first case, ABP is fundamentally attention-mask surgery on existing transformer heads; in the second, it is a multi-head future-token prediction extension of the decoder in an attention-based encoder-decoder model. A plausible implication is that "ABP" should be interpreted contextually rather than as a single standardized method family.

2. ABP for causal-transformer circuit isolation

In (Mamidanna et al., 11 Sep 2025), ABP is introduced to enforce a precise information-flow regime in a standard causal transformer. During designated "information-transfer" layers, only the last token may peek at other tokens, whereas all non-last tokens are restricted to self-attention together with an always-allowed beginning-of-sequence token. In later layers, even the last token is reduced to self-attention, so that all input-specific computation must occur within the last token’s residual stream (Mamidanna et al., 11 Sep 2025).

Formally, with sequence length $T$ and transformer depth $L$ , each attention head in layer $l$ produces a pre-softmax attention-score matrix $M^{(l)} \in \mathbb{R}^{T \times T}$ , with standard causal masking forbidding attention to future positions. ABP introduces peeking index sets $K_q \subseteq \{1,2,\dots,q\}$ for each query position $q$ , and updates attention scores by retaining $M^{(l)}_{q,k}$ only when $k \in K_q$ , replacing all other entries by $-\infty$ (Mamidanna et al., 11 Sep 2025). The resulting softmax enforces the desired sparsity pattern:

$\widetilde M^{(l)}_{q,k} \;=\; \begin{cases} M^{(l)}_{q,k}, & k\in K_q,\\[6pt] -\infty, & k\notin K_q, \end{cases} \qquad\forall\,q,k\in\{1,\dots,T\}.$

Two canonical cases are defined. "Full-peeking" uses $K_q = \{1,2,\dots,q\}$ and recovers standard causal attention. "Self-peeking" uses $K_q = \{q\}$. To preserve the common attention sink on the beginning-of-sequence token, every set is augmented by the BOS position 1, so effectively self-peeking becomes $K_q = \{1, q\}$ (Mamidanna et al., 11 Sep 2025).
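The masking rule and the two canonical peeking sets can be illustrated with a minimal sketch in plain Python. It assumes 1-indexed positions with BOS at position 1, as in the text; the function names `peeking_sets`, `apply_abp`, and `softmax_rows` are illustrative and do not come from the paper.

```python
import math

def peeking_sets(T, mode, bos=1):
    """Return K_q for q = 1..T (1-indexed), always including the BOS position."""
    sets = {}
    for q in range(1, T + 1):
        if mode == "full":
            allowed = set(range(1, q + 1))   # standard causal attention
        elif mode == "self":
            allowed = {q}                    # the token attends only to itself
        else:
            raise ValueError(f"unknown mode: {mode}")
        sets[q] = allowed | {bos}            # keep the BOS attention sink
    return sets

def apply_abp(scores, sets):
    """Keep scores[q][k] when k is in K_q; replace every other entry with -inf."""
    T = len(scores)
    return [
        [scores[q - 1][k - 1] if k in sets[q] else -math.inf
         for k in range(1, T + 1)]
        for q in range(1, T + 1)
    ]

def softmax_rows(masked):
    """Row-wise softmax; masked entries receive exactly zero weight."""
    out = []
    for row in masked:
        m = max(row)                          # finite, since q is always in K_q
        exps = [math.exp(x - m) for x in row] # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```

With uniform scores on a four-token sequence, self-peeking leaves the last token with weight 0.5 on BOS and 0.5 on itself, while full-peeking spreads its weight evenly over all four positions.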

The method is applied in a three-phase layer partition. In the waiting layers, CAMA is used and ABP is not applied. In the transfer layers, ABP allows the last token $q = T$ full-peeking, $K_T = \{1,2,\dots,T\}$, while all other tokens remain in self-peeking mode. In the final layers, all tokens, including the last, are restricted to self-peeking (Mamidanna et al., 11 Sep 2025). This structure is intended to determine whether a narrow window of cross-token transfer suffices to support final-token arithmetic.
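The three-phase partition can be expressed as a small lookup that returns the peeking set for a given layer and query position. This is a sketch under stated assumptions: layers are 0-indexed, and the parameter names `n_wait` and `n_transfer` are hypothetical labels for the sizes of the waiting and transfer phases.

```python
def abp_schedule(layer, position, T, n_wait, n_transfer, bos=1):
    """Return the allowed key positions for one query, or None when ABP is off.

    layer is 0-indexed; position and the returned keys are 1-indexed.
    """
    if layer < n_wait:
        return None                      # waiting phase: CAMA is used, no ABP mask
    in_transfer = layer < n_wait + n_transfer
    if in_transfer and position == T:
        return set(range(1, T + 1))      # last token gets full-peeking
    return {bos, position}               # everyone else: self-peeking plus BOS
```

For example, with 15 waiting layers and 2 transfer layers, `abp_schedule(15, 5, 5, 15, 2)` returns all five positions for the last token, whereas the same call at layer 17 returns only `{1, 5}`.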

3. Integration with CAMA and the All-for-One subgraph

ABP in (Mamidanna et al., 11 Sep 2025) is not a standalone intervention but part of a decomposition used to identify an "All-for-One" subgraph (AF1). CAMA first strips away early cross-token dependence while preserving task-general processing by replacing each token’s residual representation with a context-aware expectation conditioned on the token identity. The paper gives the phase-1 replacement at layer $L$ 8 as

$L$ 9

ABP then governs the middle and late layers, ensuring that only the last token receives information from other tokens in the transfer window and that subsequent computation remains local to the last token’s residual stream (Mamidanna et al., 11 Sep 2025).

This arrangement operationalizes a claim about where and when input-specific computation occurs. According to the paper, because CAMA eliminates early cross-token paths and ABP prunes middle and late ones, the resulting AF1 subgraph shows that all input-specific computation occurs at the very last token, and only during the $l$ 0 transfer layers (Mamidanna et al., 11 Sep 2025). The faithfulness metric used to evaluate such subgraphs is

$l$ 1

The central significance of ABP in this setting is therefore diagnostic rather than architectural. It is a controlled masking method for testing whether a sharply restricted practical information-flow graph can reproduce the full model’s behavior on prompts the full model already solves correctly. This suggests an intervention-based view of transformer computation in which layerwise and positional sparsification can expose compact functional circuits.

4. Empirical behavior of ABP in mental-math transformers

The main empirical focus in (Mamidanna et al., 11 Sep 2025) is Llama-3-8B on the $l$ 2 mental-math next-token task, with additional cross-model tests on Llama-3.1-8B, Pythia-6.9B, and GPT-J-6B for two-operand tasks. For Llama-3-8B, the authors grid-searched $l$ 3 and $l$ 4 and found a sharp threshold at $l$ 5, corresponding to 15 CAMA layers followed by 2 transfer layers (Mamidanna et al., 11 Sep 2025).

The reported results identify phase transitions. Waiting layers with $l$ 6 yield faithfulness of approximately $l$ 7, whereas beyond $l$ 8 faithfulness collapses to approximately $l$ 9. Transfer layers with $M^{(l)} \in \mathbb{R}^{T \times T}$ 0 also yield faithfulness of approximately $M^{(l)} \in \mathbb{R}^{T \times T}$ 1, while $M^{(l)} \in \mathbb{R}^{T \times T}$ 2 collapses to near $M^{(l)} \in \mathbb{R}^{T \times T}$ 3 (Mamidanna et al., 11 Sep 2025). The interpretation offered in the paper is that the model can defer practically all meaningful arithmetic work until very late in depth, provided the last token receives access to earlier tokens for a brief middle-layer burst.

Head-level ablations further refine the picture. Within the two transfer layers, 64 heads exist; iterative removal of the least important heads permits removal of 59 heads with only a 1–2% drop, while a final handful of heads, including L15H13, L15H3, L16H1, and L16H21, each cause drastic collapse when ablated (Mamidanna et al., 11 Sep 2025). This suggests that the ABP-defined transfer window is sparse not only across layers and positions but also across attention heads.
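One way to picture iterative head removal is a greedy loop that repeatedly drops the head whose ablation costs the least. The sketch below is an assumption about the general shape of such a procedure, not the paper's exact criterion; `evaluate` is a hypothetical callback that returns a faithfulness-like score for a set of active heads, and `max_drop` is a hypothetical tolerance.

```python
def prune_heads(heads, evaluate, max_drop):
    """Greedily remove heads while the score stays within max_drop of the baseline."""
    active = set(heads)
    baseline = evaluate(active)
    removed = []
    while len(active) > 1:
        # Find the head whose removal leaves the highest remaining score.
        best_head, best_score = None, -float("inf")
        for h in sorted(active):
            score = evaluate(active - {h})
            if score > best_score:
                best_head, best_score = h, score
        if baseline - best_score > max_drop:
            break                        # every remaining head is too important
        active.remove(best_head)
        removed.append(best_head)
    return active, removed
```

The loop stops as soon as every remaining head would push the score below the tolerance, which mirrors the reported pattern of many removable heads followed by a small set whose ablation is catastrophic.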

Cross-model transfer yields weaker but related patterns. Pythia and GPT-J exhibit AF1-style circuits with shorter waits and longer transfer windows, recovering only approximately $M^{(l)} \in \mathbb{R}^{T \times T}$ 4– $M^{(l)} \in \mathbb{R}^{T \times T}$ 5 faithfulness (Mamidanna et al., 11 Sep 2025). The paper also reports that alternative masking and ablation strategies—direct-embedding copy, random-token mean, self-peek-as-waiting, and isolated-forward-pass—fail to preserve accuracy when $M^{(l)} \in \mathbb{R}^{T \times T}$ 6 is large, whereas CAMA plus ABP uncovers the minimal subgraph (Mamidanna et al., 11 Sep 2025).

5. ABP for contextual biasing in attention-based encoder-decoder ASR

In (Selvakumar et al., 19 Dec 2025), ABP refers to a different mechanism built on multi-token prediction in an attention-based encoder-decoder architecture. The baseline model uses 80-dim log-Mel frames as input, two convolutional downsampling layers plus a linear projection followed by a 12-block conformer stack, and a standard autoregressive decoder producing

$M^{(l)} \in \mathbb{R}^{T \times T}$ 7

At decoder step $M^{(l)} \in \mathbb{R}^{T \times T}$ 8, the decoder state $M^{(l)} \in \mathbb{R}^{T \times T}$ 9 is mapped to next-token logits $K_q \subseteq \{1,2,\dots,q\}$ 0 (Selvakumar et al., 19 Dec 2025).

ABP extends this decoder by attaching $K_q \subseteq \{1,2,\dots,q\}$ 1 parallel prediction heads to the decoder’s top layer. Each head $K_q \subseteq \{1,2,\dots,q\}$ 2 maps the shared decoder output $K_q \subseteq \{1,2,\dots,q\}$ 3 into logits $K_q \subseteq \{1,2,\dots,q\}$ 4 for predicting $K_q \subseteq \{1,2,\dots,q\}$ 5, yielding the approximation

$K_q \subseteq \{1,2,\dots,q\}$ 6

with all $K_q \subseteq \{1,2,\dots,q\}$ 7 heads sharing the final projection $K_q \subseteq \{1,2,\dots,q\}$ 8 to reduce parameters (Selvakumar et al., 19 Dec 2025). The multi-token training objective is a weighted sum of cross-entropies:

$K_q \subseteq \{1,2,\dots,q\}$ 9

with $q$ 0 for $q$ 1 in the reported setup (Selvakumar et al., 19 Dec 2025).
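The multi-token setup described above (several heads over one decoder output, a shared final projection, and a weighted sum of cross-entropies) can be sketched in plain Python. Everything specific here is an assumption made for illustration: each head is modeled as a single linear map, and the names `head_weights`, `shared_proj`, and `weights` are hypothetical, as are any weight values passed in.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def mtp_logits(h, head_weights, shared_proj):
    """One logit vector per head: head-specific map, then the shared projection."""
    return [matvec(shared_proj, matvec(W_k, h)) for W_k in head_weights]

def mtp_loss(logits_per_head, targets, weights):
    """Weighted sum of per-head cross-entropies against the future tokens."""
    loss = 0.0
    for z, y, lam in zip(logits_per_head, targets, weights):
        loss += lam * -log_softmax(z)[y]
    return loss
```

Sharing `shared_proj` means that adding a head costs only the head-specific map rather than a full vocabulary-sized output layer, which is the parameter saving the text refers to.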

The contextual-biasing use of ABP begins by collecting the $q$ 2 future-token logits into a tensor $q$ 3. Given a dynamic bias list $q$ 4, where each entity $q$ 5 is a sub-word sequence $q$ 6, the model extracts a vector

$q$ 7

for each entity, with padding or truncation as needed (Selvakumar et al., 19 Dec 2025). A special no-bias entity $q$ 8 is added, and a small trainable scorer $q$ 9 computes $M^{(l)}_{q,k}$ 0, from which an entity posterior is formed by softmax over candidate entities and $M^{(l)}_{q,k}$ 1.
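The entity-scoring step can be illustrated as follows: for each bias-list entity, read one logit per future-token head at that entity's sub-word ids, pad or truncate to the number of heads, score the resulting vector, and normalize over all candidates plus a no-bias option. This is a sketch under assumptions not confirmed by the source: the scorer is taken to be linear, the no-bias entity is represented by an all-padding vector at index 0, and the names `entity_features` and `entity_posterior` are hypothetical.

```python
import math

def entity_features(future_logits, subwords, pad_value=0.0):
    """Pick the logit of the k-th sub-word from the k-th head; pad or truncate to K."""
    K = len(future_logits)
    feats = [future_logits[k][tok] for k, tok in enumerate(subwords[:K])]
    return feats + [pad_value] * (K - len(feats))

def entity_posterior(future_logits, bias_list, scorer_w, scorer_b=0.0):
    """Softmax over [no-bias] + bias-list entities; index 0 is the no-bias option."""
    K = len(future_logits)
    vectors = [[0.0] * K] + [entity_features(future_logits, e) for e in bias_list]
    scores = [sum(w * v for w, v in zip(scorer_w, vec)) + scorer_b
              for vec in vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the features are gathered directly from logits the decoder already produces, no separate entity encoder is involved; the only additional trainable piece in this sketch is the small scorer.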

Training combines the multi-token prediction loss with an entity classification loss,

$M^{(l)}_{q,k}$ 2

where $M^{(l)}_{q,k}$ 3 is the index of the entity beginning at step $M^{(l)}_{q,k}$ 4, or $M^{(l)}_{q,k}$ 5 if none, and the total loss is

$M^{(l)}_{q,k}$ 6

At inference time, the method computes $M^{(l)}_{q,k}$ 7, applies a pruning threshold $M^{(l)}_{q,k}$ 8, and builds a unified candidate score $M^{(l)}_{q,k}$ 9 over static tokens and bias-list entities, using bias weight $k \in K_q$ 0 to control the strength of entity insertion (Selvakumar et al., 19 Dec 2025).

6. Experimental results in ASR contextual biasing

The ASR study in (Selvakumar et al., 19 Dec 2025) is trained on LibriSpeech-960h and evaluated on test-clean and test-other. The reported setup uses 80-dim log-Mel features with SpecAugment, a 12-layer Conformer encoder with $k \in K_q$ 1, 4x expansion FFN, and 8 heads, and a 6-layer Transformer-style decoder with the same dimensionality and number of heads. Named-entity annotations use spaCy "en_core_web_trf", with approximately 700 unique entities in test (Selvakumar et al., 19 Dec 2025).

The principal metrics are WER, B-WER, and U-WER, where B-WER is the error rate only on named entities and U-WER is the error rate on all other words. The paper reports results for bias-list sizes $k \in K_q$ 2 and compares a baseline AED model, CLAS, AED + MTP without bias, and the ABP method with $k \in K_q$ 3 and $k \in K_q$ 4 (Selvakumar et al., 19 Dec 2025).

Model	test-clean (WER/B-WER/U-WER, %)	test-other (WER/B-WER/U-WER, %)
Baseline AED, $k \in K_q$ 5	2.73/17.52/2.27	6.01/32.34/5.07
AED + MTP w/o bias, $k \in K_q$ 6	2.58/17.27/2.27	6.00/30.63/5.12
Ours $k \in K_q$ 7, $k \in K_q$ 8	2.34/10.98/2.07	5.82/21.85/5.24
Ours $k \in K_q$ 9, $-\infty$ 0	2.27/8.70/2.07	5.64/17.22/5.22

At $-\infty$ 1, test-clean B-WER drops from $-\infty$ 2 to $-\infty$ 3, described as an approximately $-\infty$ 4 relative reduction, with no U-WER degradation (Selvakumar et al., 19 Dec 2025). Increasing $-\infty$ 5 to $-\infty$ 6 further lowers B-WER at slight cost of total WER while U-WER remains stable (Selvakumar et al., 19 Dec 2025). The paper characterizes the method as architecturally simple because it uses no separate bias-phrase encoder and no cross-attention, and as mitigating entity fragmentation by treating each multi-subword entity as a single score (Selvakumar et al., 19 Dec 2025).
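The relative B-WER reductions implied by the table above can be checked with a few lines of arithmetic. The sketch below uses only the B-WER values printed in the table; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, value):
    """Relative reduction in percent when moving from baseline to value."""
    return 100.0 * (baseline - value) / baseline

# B-WER values copied from the table: baseline AED, then the two ABP rows.
b_wer = {
    "test-clean": {"baseline": 17.52, "abp": [10.98, 8.70]},
    "test-other": {"baseline": 32.34, "abp": [21.85, 17.22]},
}

reductions = {
    split: [relative_reduction(row["baseline"], v) for v in row["abp"]]
    for split, row in b_wer.items()
}
```

On test-clean this gives roughly 37.3% and 50.3% for the two ABP rows, and on test-other roughly 32.4% and 46.8%.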

The limitations reported are also specific. Fixed $-\infty$ 7 constrains the maximum entity length that can be peeked; the paper uses $-\infty$ 8 and states that 87% of entities are of length at most 4 tokens. Additional overhead arises from $-\infty$ 9 forward heads per decoding step, and careful negative sampling of bias lists is required during training (Selvakumar et al., 19 Dec 2025).

7. Comparative interpretation, misconceptions, and open questions

The two ABP methods share a structural intuition—selective access to information that would otherwise be diffuse—but they operate at different levels. In (Mamidanna et al., 11 Sep 2025), ABP is a masking-based intervention on the internal information-flow graph of a causal transformer. In (Selvakumar et al., 19 Dec 2025), ABP is a predictive extension that converts future-token logits into entity scores for contextual biasing. One common misconception would be to treat them as minor variants of the same algorithm; the formal definitions in the two papers do not support that interpretation.

The mechanistic conclusions from (Mamidanna et al., 11 Sep 2025) are narrow in scope. The method was tested on pure mental-math prompts, and the paper notes that word problems or embedded code fail under the same AF1 circuit, suggesting that additional subgraphs govern semantic understanding (Mamidanna et al., 11 Sep 2025). It also requires a tokenizer that assigns each integer a single token; models that split large numbers into multiple subtokens need an extension of ABP and CAMA (Mamidanna et al., 11 Sep 2025). More generally, whether similarly sparse subgraphs can be revealed for multi-step chain-of-thought, commonsense reasoning, or factual retrieval remains open (Mamidanna et al., 11 Sep 2025).

The ASR contextual-biasing ABP likewise has bounded scope. It is designed for attention-based encoder-decoder models with a dynamic bias list and uses a conditional-independence approximation over multiple future-token heads (Selvakumar et al., 19 Dec 2025). The method assumes that meaningful entity evidence can be extracted from a short lookahead window and compressed into a small scorer over candidate entities. This suggests a lightweight alternative to explicit bias encoders, but the fixed lookahead horizon and dependence on candidate-list construction remain substantive design constraints.

Taken together, the two 2025 ABP methods illustrate a broader research pattern in sequence modeling: attention can be made more analytically transparent or more operationally useful by restricting or repurposing access patterns. In one case, ABP exposes a sparse last-token arithmetic circuit in causal transformers (Mamidanna et al., 11 Sep 2025). In the other, it converts decoder lookahead into contextual biasing signals for named-entity recognition in ASR (Selvakumar et al., 19 Dec 2025). The shared acronym therefore names not a single established paradigm but a pair of contemporaneous methods linked by the idea of controlled peeking.

References

Mamidanna et al. (2025). All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens.

Selvakumar et al. (2025). Peeking Into The Future For Contextual Biasing.
