CASTLE: Lookahead Attention in Transformers

Updated 10 September 2025
  • The paper introduces dynamic lookahead key updates that enrich earlier token representations while preserving strict autoregressive constraints.
  • It employs efficient parallel matrix operations and masking techniques to update keys with reduced computational complexity and improved validation loss.
  • Empirical results across multiple model scales demonstrate lower perplexity and enhanced accuracy on tasks like language modeling and commonsense reasoning.

CAuSal aTtention with Lookahead kEys (CASTLE) denotes a class of attention mechanisms for transformer architectures in which the key representation of each token is incrementally updated as context expands, so that earlier tokens' keys encode information from tokens that arrive later in the input sequence, subject to strict autoregressive constraints. CASTLE preserves causal ordering: at each step, neither queries nor values ever utilize future information, and key updates introduce no information leakage, yet keys become "dynamic" objects that reflect more of the observed context than is possible in standard causal attention. The approach balances improved context integration against computational tractability. Empirical evaluation demonstrates improved validation loss and perplexity over conventional causal attention across multiple model scales and downstream task settings (Song et al., 9 Sep 2025).

1. Motivation and Conceptual Foundation

Standard causal attention in autoregressive transformers uses static queries, keys, and values at each position, computed only from the available past context. The key limitation of this design is that the key and value representations of early tokens never benefit from information arising later in the sequence, which restricts the model's ability to capture global dependencies critical for tasks such as language modeling, commonsense reasoning, and question answering. CASTLE is motivated by the observation that updating earlier positions' keys with information from subsequent tokens can make self-attention more globally expressive without violating the autoregressive property required for sequence generation (Song et al., 9 Sep 2025).

In CASTLE, at generation step $t+1$, the key representations for positions $s = 1, \ldots, t$ are renewed such that they "look ahead" to tokens at positions $s+1, \ldots, t$, but are never influenced by information at positions $u > t+1$. The resulting keys are formalized as lookahead keys, distinguished from standard causal keys by their dynamic dependence on the already-generated context.

2. Mathematical Formulation and Mechanism

CASTLE attention introduces two key groups within each layer: causal keys (static, computed as in standard causal attention) and lookahead keys (dynamic, recurrently updated). For a sequence of length $t$, the lookahead keys $K^{\text{lookahead}}_{1:t}$ are updated so that each $K_s^{\text{lookahead}}$ ($1 \leq s \leq t$) incorporates projections of all $X_{s+1}, \ldots, X_t$, subject to causal masking:

$$K^t = \begin{pmatrix} K_1^t \\ K_2^t \\ \vdots \\ K_t^t \end{pmatrix} = \left( \frac{U_t U_t^\top}{\sqrt{d}} \odot \widetilde{M}_t + U_t \right), \quad U_t \in \mathbb{R}^{t \times d}$$

where $\widetilde{M}_t$ is a mask ensuring that only positions $j > s$ contribute to the update of $K_s^t$ at step $t$.

The recurrence for key update at each generation step can be written as

$$K^t = \begin{pmatrix} K^{t-1} + \left( \frac{U_{t-1} U_t^\top}{\sqrt{d}} \odot \widetilde{M}_{t-1,t} \right) \\ 0^{1 \times d} \end{pmatrix}$$

These updates occur after every token is processed, but at no point do queries or values access future tokens—maintaining strict autoregressivity.
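
To make the update concrete, the following NumPy sketch constructs lookahead keys for a prefix of length t under a simplified parameterization. The strictly upper-triangular mask restricts each position s to contributions from tokens s+1, ..., t only; the projection matrices W_u and W_k and the aggregation back through U are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def lookahead_keys(U, base_K):
    """Build lookahead keys for a prefix of length t (illustrative sketch).
    U      : (t, d) lookahead projections of tokens 1..t
    base_K : (t, d) standard causal keys for the same tokens
    """
    t, d = U.shape
    M = np.triu(np.ones((t, t)), k=1)   # M[s, j] = 1 only when j > s
    mix = (U @ U.T) / np.sqrt(d)        # (t, t) pairwise token interactions
    correction = (mix * M) @ U          # row s aggregates only tokens s+1..t
    return base_K + correction          # keys never see beyond the current prefix

# Incremental view: keys are refreshed as each new token of the prefix arrives.
rng = np.random.default_rng(0)
d, L = 16, 8
X = rng.standard_normal((L, d))                     # token representations
W_u = rng.standard_normal((d, d))                   # lookahead projection (assumed)
W_k = rng.standard_normal((d, d))                   # causal-key projection (assumed)
for t in range(1, L + 1):
    K_t = lookahead_keys(X[:t] @ W_u, X[:t] @ W_k)  # keys visible at step t
```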

Central to CASTLE is the derivation of a parallel computation that is mathematically equivalent to the recurrence, obviating the need to materialize all lookahead keys during training:

$$\text{Attention}(L) = \operatorname{row}\left( \frac{C C^\top}{\sqrt{d}} + C - \left[ \left( \frac{C U^\top}{\sqrt{d}} \right) \odot \widetilde{M}^C \left( \frac{U U^\top}{\sqrt{d}} \right) \right] \right)$$

where $C$ are causal keys, $U$ are lookahead projections, and $\widetilde{M}^C$ is a mask encoding causality. This parallelization enables training in $O(L^2 d)$ time and $O(L d)$ memory.
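
The equivalence between the sequential and parallel views can be checked numerically. The sketch below keeps the simplified key definition from the example above (an assumption, not the paper's exact formulation) and verifies that a single pass of masked matrix products reproduces the attention logits obtained by rebuilding lookahead keys at every step; the paper's algorithm additionally exploits the low-rank and masked structure of these products to reach the stated $O(L^2 d)$ cost.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 6, 4
Q = rng.standard_normal((L, d))   # queries
C = rng.standard_normal((L, d))   # standard causal keys
U = rng.standard_normal((L, d))   # lookahead projections

def logits_sequential():
    """Reference: rebuild the lookahead keys at every step t (O(L^3 d))."""
    out = np.full((L, L), -np.inf)
    for t in range(1, L + 1):
        M = np.triu(np.ones((t, t)), k=1)                        # j > s within the prefix
        K_t = C[:t] + ((U[:t] @ U[:t].T) / np.sqrt(d) * M) @ U[:t]
        out[t - 1, :t] = Q[t - 1] @ K_t.T                        # logits for query t
    return out

def logits_parallel():
    """Equivalent masked-matmul form computed once for the whole sequence."""
    A = Q @ U.T                              # q_t . u_j
    B = (U @ U.T) / np.sqrt(d)               # u_s . u_j / sqrt(d)
    incl = np.tril(np.ones((L, L)))          # keep j <= t
    strict = np.triu(np.ones((L, L)), k=1)   # keep j > s
    out = Q @ C.T + (A * incl) @ (B * strict).T
    out[np.triu_indices(L, k=1)] = -np.inf   # queries still never look ahead
    return out

assert np.allclose(logits_sequential()[np.tril_indices(L)],
                   logits_parallel()[np.tril_indices(L)])
```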

3. Computational Efficiency and Training Algorithms

A naïve approach to recurrent key updating incurs $O(L^3 d)$ complexity for a sequence of length $L$. CASTLE's parallelization leverages the low-rank structure of the lookahead correction (rank at most $d$) and masking sparsity to maintain computational tractability. The central result, provided in Theorem 1 of (Song et al., 9 Sep 2025), is that the same output as explicit sequential key updating is achievable via efficient matrix operations over the sequence.

During autoregressive inference, a cache (UQ-KV) is maintained, storing the evolving lookahead keys and the vanilla key-value pairs. Each incremental decoding step requires only $O(t d)$ operations for the update, where $t$ is the current sequence position, and can be implemented with a single additional matrix-vector multiplication.
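
A minimal sketch of such a cache follows, again under the simplified key rule used above; the class name and layout are hypothetical rather than the paper's UQ-KV implementation. It illustrates the claimed cost: refreshing all existing lookahead keys when a new token arrives reduces to one matrix-vector product followed by a rank-one update, i.e. $O(t d)$ work per step.

```python
import numpy as np

class LookaheadKVCache:
    """Hypothetical decoding cache: lookahead projections U, evolving
    lookahead keys K, and static values V (simplified, single head)."""

    def __init__(self, d):
        self.d = d
        self.U = np.zeros((0, d))   # lookahead projections u_1..u_t
        self.K = np.zeros((0, d))   # current lookahead keys
        self.V = np.zeros((0, d))   # values (never updated)

    def append(self, u_new, k_new, v_new):
        if len(self.U):
            # K_s <- K_s + (u_s . u_new / sqrt(d)) * u_new  for all s <= t
            coeff = (self.U @ u_new) / np.sqrt(self.d)   # one matrix-vector product
            self.K = self.K + np.outer(coeff, u_new)     # rank-1 key refresh
        self.U = np.vstack([self.U, u_new])
        self.K = np.vstack([self.K, k_new])              # new token starts from its causal key
        self.V = np.vstack([self.V, v_new])

    def attend(self, q):
        scores = self.K @ q / np.sqrt(self.d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.V

# Usage: feed projected token vectors one step at a time.
cache = LookaheadKVCache(d=16)
rng = np.random.default_rng(2)
for _ in range(5):
    x = rng.standard_normal(16)
    cache.append(u_new=x, k_new=x, v_new=x)   # shared projection purely for illustration
out = cache.attend(q=rng.standard_normal(16))
```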

4. Empirical Performance and Benchmark Evaluation

After pretraining on 50B tokens, CASTLE demonstrates consistent improvements over baseline causal attention architectures:

  • Lower validation loss (e.g., improvements of 0.0059 to 0.0348, depending on model scale)
  • Lower perplexity per token
  • Superior accuracy and generalization on diverse downstream tasks, including ARC, BoolQ, HellaSwag, MMLU, OBQA, PIQA, and Winogrande

CASTLE-XL, for instance, achieves lower validation loss and outperforms previous methods on reasoning and commonsense benchmarks. Performance improvements are attributed specifically to the enrichment of context via lookahead key updates—contrasting with approaches where increased model size is the main source of improvement.

| Model Scale | Loss Reduction | Perplexity Improvement | Downstream Accuracy |
|---|---|---|---|
| Small | 0.0059 | Lower (exact value in paper) | Improved over baseline |
| Medium | 0.0245 | Lower | Consistent across tasks |
| Large | 0.0356 | Lower | Higher in 0-shot/few-shot |
| XL | 0.0348 | Lower | Outperforms baseline |

5. Architectural Implications and Applications

CASTLE's lookahead key updates target settings where global dependencies emerge late in the input, improving prediction for sequences with strongly non-local dependencies. Typical applications are:

  • Language modeling requiring garden-path sentence understanding
  • Question answering, especially when crucial cues arrive late
  • Commonsense reasoning tasks needing awareness of global context
  • Natural language generation, machine translation, and summarization requiring autoregressive fidelity

The mechanism is agnostic to the downstream modality and may be applied to any autoregressive transformer variant.

6. Relation to Other Future-Aware and Sparse Attention Mechanisms

CASTLE’s innovation is orthogonal to proposals in planning-based lookahead attention (Du et al., 2023), sparse FlashAttention (Pagliardini et al., 2023), and vision-language mask relaxation (Pei et al., 24 May 2025). Unlike planning methods that append sampled rollouts or vision models that compress future context via pooling and mask manipulation, CASTLE’s approach is to update keys incrementally using strictly observed context, ensuring autoregressive guarantees. Compared to sparse attention, its lookahead key updates enrich the context of early tokens rather than merely reduce computation.

A plausible implication is that CASTLE could be combined with sparse attention or mask-relaxation techniques, with dynamic keys and efficient computation providing complementary benefits. For causal reasoning in large models or in non-text modalities, the incremental enrichment of key representations without violating autoregressive structure serves as a general design principle.

7. Future Directions and Broader Significance

The incremental renewal of key vectors in CASTLE establishes a new design space for autoregressive self-attention. Potential alterations include evolving value representations ("lookahead values"), gating key updates with learned functions (for instance, using sigmoid or silu activations), and integrating multimodal or structured inputs. Empirical evidence indicates the gains are robust across scales; thus, further exploration into scalable variants and adaptation to domains such as code generation, protein modeling, or cross-modal inference is merited.
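
As one illustration of the gating idea, the fragment below (purely hypothetical; no such variant is evaluated in the paper) scales the rank-one key refresh from the decoding sketch above by a learned sigmoid gate:

```python
import numpy as np

def gated_key_update(K, U, u_new, w_gate, d):
    """Hypothetical gated variant of the rank-1 lookahead key refresh:
    each position's correction is scaled by a learned gate in (0, 1)."""
    coeff = (U @ u_new) / np.sqrt(d)            # per-position interaction strength
    gate = 1.0 / (1.0 + np.exp(-(U @ w_gate)))  # sigmoid gate per existing position
    return K + np.outer(gate * coeff, u_new)    # gated rank-1 update
```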

This suggests that next-generation autoregressive transformer architectures may systematically incorporate future-aware key dynamics to address global context sensitivity while preserving sample-wise parallel training. The incremental, non-leaky expansion of context is likely to shape training regimes and model architectures across sequence modeling.
