CASTLE: Causal Attention with Lookahead Keys
- The paper demonstrates that dynamic updating of key vectors with lookahead tokens reduces perplexity and improves language modeling accuracy.
- CASTLE integrates late-emerging context within a causal framework, mitigating garden-path ambiguities and enhancing global representations.
- The mechanism employs a hybrid attention approach combining static and lookahead keys to achieve efficient, parallelizable computation.
Causal Attention with Lookahead Keys (CASTLE) is an attention mechanism that augments the standard causal attention framework in Transformer architectures: it enables dynamic updating of key vectors to integrate information as a sequence unfolds, while rigorously preserving autoregressive constraints. CASTLE permits each position’s key to incorporate contextual information from subsequent tokens encountered up to a given prediction step, providing richer representations for modeling long-range dependencies, particularly when key information appears late in a sequence. The mechanism is formulated to admit efficient, fully parallel computation, allowing practical scaling to large model and data regimes, and demonstrates empirical improvements across core language modeling and downstream language understanding benchmarks (Song et al., 9 Sep 2025).
1. Motivation and Context
Standard causal attention mechanisms in Transformers constrain each prediction to depend only on past and current tokens by restricting every token’s query, key, and value computations to local context up to its position. This autoregressive property is essential for generative modeling and consistent decoding; however, it limits the model’s ability to represent information that emerges later in the sequence. This restriction impairs the resolution of garden-path ambiguities and hinders the encoding of global context, especially in tasks where crucial disambiguating information arrives after the initial tokens.
Prevailing methods, such as BeLLM, Echo Embeddings, and re-reading strategies, tackle backward dependencies or perform repeated passes to approximate global representation. These approaches typically require modified training or evaluation protocols, or are narrowly optimized for sentence embedding efficiency. CASTLE was introduced to enable key representations of prior tokens to absorb information from subsequently observed tokens—without ever violating the causal modeling requirement in pretraining or generation (Song et al., 9 Sep 2025).
2. Mechanism and Distinction from Standard Causal Attention
In standard causal attention, the keys are static: once computed, a token’s key incorporates only information up to its own position, and cannot account for contextual evidence that emerges later. This is formalized as $k_i = W_K h_i$ for token $i$, where $h_i$ is its contextualized embedding.
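For contrast with the lookahead mechanism described next, here is a minimal NumPy reference for single-head causal attention with static keys; the weight names (`W_q`, `W_k`, `W_v`) are generic, not the paper's notation.

```python
# Minimal single-head causal attention with static keys (NumPy).
import numpy as np

def causal_attention(X, W_q, W_k, W_v):
    """X: (n, d_model) token representations; returns (n, d) outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # keys computed once, never updated
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # (n, n) attention logits
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out = causal_attention(X, W_q, W_k, W_v)         # K[i] depends only on X[i]
```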
CASTLE introduces a dynamic updating process for the keys, termed lookahead keys. At time step $t$, every token $i \le t$ can update its key to incorporate information from tokens $i+1$ through $t$, yielding a lookahead key $u_i^{(t)}$ built from $p_{i+1}, \dots, p_t$, where the $p_j$ are projections of the token representations into the unconstrained (lookahead) space.
Each new prediction step updates these lookahead keys using only the context available up to that step, never accessing future tokens and never violating the autoregressive constraint. The attention output at step $t$ is calculated via a hybrid of static causal keys and dynamic lookahead keys: the logit for each position $i$ combines $q_t^\top k_i / \sqrt{d}$, the standard attention score, with a term derived from $q_t^\top u_i^{(t)}$ that encodes lookahead content, with $\operatorname{SiLU}$ serving as a gating (forgetting) mechanism on the lookahead contribution.
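To make the per-step data flow concrete, the sketch below implements a schematic version of this hybrid under explicit simplifying assumptions: the lookahead key of position $i$ at step $t$ is taken to be a projection of the mean of the representations at positions $i+1, \dots, t$, and the lookahead logit is gated with SiLU before being added to the causal logit. The paper's actual parameterization differs; `castle_step` and `W_u` are illustrative names only.

```python
# Schematic per-step view of CASTLE-style hybrid attention (NumPy).
# Assumed form: lookahead key u_i^(t) = mean(X[i+1..t]) @ W_u, SiLU-gated.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))                # SiLU: soft "forgetting" gate

def castle_step(X, t, W_q, W_k, W_u):
    """Hybrid logits at step t: causal score + SiLU-gated lookahead score."""
    d = W_q.shape[1]
    q_t = X[t] @ W_q
    logits = np.empty(t + 1)
    for i in range(t + 1):
        logits[i] = q_t @ (X[i] @ W_k) / np.sqrt(d)       # static causal key
        if i < t:                                         # tokens i+1..t already seen
            u_it = X[i + 1 : t + 1].mean(axis=0) @ W_u    # lookahead key (assumed form)
            logits[i] += silu(q_t @ u_it / np.sqrt(d))    # gated lookahead content
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ X[: t + 1]                                 # values kept as raw X for brevity

rng = np.random.default_rng(1)
n, d = 6, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_u = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
outs = np.stack([castle_step(X, t, W_q, W_k, W_u) for t in range(n)])
```

Note that position $i = t$ carries no lookahead term and no index beyond $t$ is ever touched, so the autoregressive constraint holds by construction.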
Ablation studies show that utilizing only the lookahead keys or only static keys is insufficient for optimal performance; CASTLE’s hybrid attention achieves stronger results (Song et al., 9 Sep 2025).
3. Construction, Parallelization, and Theoretical Guarantees
While the lookahead key update process may appear inherently sequential, CASTLE provides a mathematical equivalence (see Theorem 1 in (Song et al., 9 Sep 2025)) allowing calculation of all lookahead keys for all positions in a single, parallelized pass. For a sequence of length $n$, this avoids the prohibitive cost of materializing every per-step key update explicitly and matches the quadratic cost characteristic of conventional attention layers, ensuring scalability to large-batch pretraining.
The CASTLE attention for the entire sequence may be summarized as

$$\text{Attention}(X^L) = \operatorname{row}\!\left( \frac{Q^C (K^C)^\top}{\sqrt{d}} + M^C - \frac{R}{\sqrt{d}} \right) V^C,$$

where $\operatorname{row}(\cdot)$ denotes the row-wise softmax and all matrices leverage causal and lookahead projections: $Q^C$, $K^C$, and $V^C$ are the causal query, key, and value matrices, $R$ aggregates the lookahead-key contributions, the lower-triangular attention masks enforce causality for the causal and lookahead terms respectively, and an eligibility pattern designates which lookahead keys exist at each step. All matrix computations can be structured for efficiency using optimized blockwise attention routines.
This equivalence guarantees that the parallelized computation strictly matches the sequentially updated lookahead key formulation and that future context is never illicitly accessed. The approach is compatible with high-throughput hardware and does not require modification to standard pretraining or decoding procedures.
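A toy analogue of this equivalence can be checked directly. If each lookahead key is assumed to be a running sum of projected later-seen tokens (a linear recurrence, simpler than the paper's actual construction), every per-step lookahead logit $q_t^\top u_i^{(t)} = \sum_{j=i+1}^{t} q_t^\top (W_u x_j)$ collapses to a row-wise prefix sum of a single matrix product, and the sequential and parallel computations agree exactly:

```python
# Toy equivalence check: sequential per-step lookahead logits vs. one
# matmul + prefix sums. Illustrates the parallelization principle only;
# Theorem 1 in the paper covers its actual parameterization.
import numpy as np

rng = np.random.default_rng(2)
n, d = 7, 8
X = rng.normal(size=(n, d))
W_q, W_u = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(2))
Q, P = X @ W_q, X @ W_u

# Sequential reference: materialize every lookahead key u_i^(t) explicitly.
L_seq = np.zeros((n, n))
for t in range(n):
    for i in range(t):                          # positions i < t carry lookahead keys
        u_it = P[i + 1 : t + 1].sum(axis=0)     # key of i after seeing tokens <= t
        L_seq[t, i] = Q[t] @ u_it

# Parallel form: one matmul plus row-wise prefix sums, no per-step loop.
S = Q @ P.T                                     # S[t, j] = q_t . (W_u x_j)
C = np.cumsum(S, axis=1)                        # C[t, i] = sum_{j<=i} S[t, j]
L_par = np.tril(np.diag(C)[:, None] - C, k=-1)  # sum_{i<j<=t} S[t, j] for i < t

assert np.allclose(L_seq, L_par)                # matches the sequential pass exactly
```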
4. Empirical Performance and Ablation Findings
Evaluation on standard language modeling benchmarks demonstrates consistent improvements in validation loss and perplexity across all model scales:
- CASTLE-Small: validation perplexity 16.411 (baseline) → 16.315;
- CASTLE-Medium: 14.004 → 13.665;
- CASTLE-Large: 12.269 → 11.840;
- CASTLE-XL: 11.309 → 10.922, with corresponding validation loss reductions (see Table 1 in (Song et al., 9 Sep 2025) and the conversion sketch below).
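Since validation loss is the natural logarithm of perplexity, the loss reductions implied by these perplexity pairs can be recovered directly:

```python
# Perplexity = exp(mean cross-entropy loss), so loss deltas follow from the
# reported baseline -> CASTLE perplexity pairs above.
import math

pairs = {"Small": (16.411, 16.315), "Medium": (14.004, 13.665),
         "Large": (12.269, 11.840), "XL": (11.309, 10.922)}
for scale, (base, castle) in pairs.items():
    delta = math.log(base) - math.log(castle)   # validation-loss reduction (nats)
    print(f"{scale}: loss drop = {delta:.4f} nats")
```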
On a range of downstream tasks (ARC, BoolQ, HellaSwag, MMLU, OBQA, PIQA, WinoGrande; 0-shot and 5-shot), CASTLE consistently outperforms size-matched baselines, with improvements most notable in larger models and on challenging reasoning and commonsense tasks.
Ablation analyses confirm that improvements are not artifacts of increased parameter count or key dimensionality, as CASTLE outperforms even when strictly matching the number of heads and key projections to baseline models. The hybrid use of causal and lookahead keys is superior to either alone. Gated forgetting with the SiLU function has minimal impact on perplexity but consistently enhances downstream generalization. This suggests CASTLE’s improvements are due to its capacity to dynamically “absorb” emerging context at each position, especially benefitting late-disambiguation scenarios.
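To make the gating interpretation concrete, $\operatorname{SiLU}(x) = x\,\sigma(x)$ passes large positive logits nearly unchanged while squashing negative ones toward zero, so lookahead content the query scores as irrelevant contributes little (illustrative values only):

```python
# SiLU: near-identity for large positive logits, near-zero for negative ones,
# which is why it can act as a soft "forgetting" gate on lookahead scores.
import numpy as np

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(np.round(x / (1.0 + np.exp(-x)), 3))   # [-0.072 -0.269  0.     0.731  3.928]
```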
5. Relation to Streaming and Adaptive Lookahead Approaches
CASTLE’s fixed, parallelizable lookahead key mechanism is distinct from adaptive lookahead strategies designed for streaming scenarios, such as DCN (Moritz et al., 2021) or ANCAT (Strimel et al., 2023). In DCN, twin streams (causal/non-causal) and carefully structured key-value exchanges ensure bounded receptive field in deep architectures, avoiding lookahead accumulation. ANCAT generalizes windowed lookahead by learning per-input, per-layer lookahead schedules, dynamically balancing accuracy and algorithmic latency in streaming speech recognition. In contrast, CASTLE’s update scheme is motivated by static sequence modeling, and guarantees that previous tokens’ representations continually absorb information as new tokens are processed (subject to autoregressive constraints and efficient pretraining implementation).
6. Technical Summary and Comparative Table
| Aspect | Standard Causal Attention | CASTLE |
|---|---|---|
| Key vectors | Static; only own/past tokens | Dynamic; updated at each prediction step |
| Context per step | Context up to current position | All context up to the current generation step |
| Attention formula | Causal softmax, static masking | Hybrid of static and lookahead keys with SiLU gating |
| Pretraining cost | Efficient ($O(n^2)$) | Efficient via mathematical equivalence ($O(n^2)$) |
| Empirical gain | Limited for late context | Lower perplexity; improved downstream accuracy |
CASTLE’s innovations permit each position to “re-encode” its history in light of newly revealed context at every step, yielding richer global representations that remain consistent with autoregressive generation and scale to large data and model regimes.
7. Significance and Implications
CASTLE presents a principled resolution to the limitations imposed by static keys in causal attention schemes. By dynamically updating past keys as sequences unfold, it enables the model to resolve ambiguities introduced by delayed context, thereby enhancing modeling capacity for tasks that depend on long-range or globally-distributed evidence. Parallelizable by construction, CASTLE can be directly implemented within libraries optimized for large-batch attention (e.g., FlashAttention), without modifications to the core decoding logic. Ablation and benchmarking evidence suggest the lookahead key mechanism, distinctly from superficial architectural expansions, is responsible for improved perplexity and language understanding performance (Song et al., 9 Sep 2025). A plausible implication is that similar dynamic context encoding techniques may be beneficial in other domains where causal constraints, delayed information, and context re-interpretation are central.