CASTLE: Causal Attention with Lookahead Keys

Updated 4 November 2025
  • The paper demonstrates that dynamic updating of key vectors with lookahead tokens reduces perplexity and improves language modeling accuracy.
  • CASTLE integrates late-emerging context within a causal framework, mitigating garden-path ambiguities and enhancing global representations.
  • The mechanism employs a hybrid attention approach combining static and lookahead keys to achieve efficient, parallelizable computation.

Causal Attention with Lookahead Keys (CASTLE) is an attention mechanism that augments the standard causal attention framework in Transformer architectures by enabling the dynamic updating of key vectors to integrate information as a sequence unfolds, while rigorously preserving autoregressive constraints. CASTLE permits each position’s key to incorporate contextual information from subsequent tokens encountered up to a given step in sequence prediction, providing richer representations for improved modeling of long-range dependencies, particularly in scenarios where key information appears late in a sequence. The mechanism is formulated to admit efficient, fully parallel computation, allowing for practical scaling to large model and data regimes, and demonstrates empirical improvements across core language modeling and downstream language understanding benchmarks (Song et al., 9 Sep 2025).

1. Motivation and Context

Standard causal attention mechanisms in Transformers constrain each prediction to depend only on past and current tokens by restricting every token’s query, key, and value computations to local context up to its position. This autoregressive property is essential for generative modeling and consistent decoding; however, it limits the model’s ability to represent information that emerges later in the sequence. This restriction impairs the resolution of garden-path ambiguities and hinders the encoding of global context, especially in tasks where crucial disambiguating information arrives after the initial tokens.

Prevailing methods, such as BeLLM, Echo Embeddings, and re-reading strategies, tackle backward dependencies or perform repeated passes to approximate global representation. These approaches typically require modified training or evaluation protocols, or are narrowly optimized for sentence embedding efficiency. CASTLE was introduced to enable key representations of prior tokens to absorb information from subsequently observed tokens—without ever violating the causal modeling requirement in pretraining or generation (Song et al., 9 Sep 2025).

2. Mechanism and Distinction from Standard Causal Attention

In standard causal attention, the keys are static: once computed, a token's key incorporates only information up to its own position and cannot account for contextual evidence that emerges later. This is formalized as $k^C_s = x_s W^C_K$ for token $s$, where $x_s$ is its contextualized embedding.

CASTLE introduces a dynamic updating process for the keys, termed lookahead keys. At time step $t$, every token $s \leq t$ can update its key to incorporate information from tokens $s+1$ through $t$, yielding $k^U_s = \operatorname{sigmoid}\!\left( \frac{q^U_s (K^U_{s+1:t})^\top}{\sqrt{d}} + \text{mask} \right) V^U_{s+1:t}$, where $q^U_s, K^U_{s+1:t}, V^U_{s+1:t}$ are projections of the representations in the unconstrained (lookahead) space.
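For concreteness, the following is a minimal single-head sketch of this per-step update in its naive sequential form; the projection names (`W_Qu`, `W_Ku`, `W_Vu`) and the zero lookahead key assigned to the most recent position are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def lookahead_keys_at_step(x, W_Qu, W_Ku, W_Vu, t):
    """Naive computation of the lookahead keys k^U_s for all positions s <= t.

    x               : (L, d_model) contextualized embeddings; only rows 0..t are read.
    W_Qu, W_Ku, W_Vu: (d_model, d) projections into the lookahead space (assumed names).
    Returns         : (t + 1, d) lookahead keys, one per position s in [0, t].
    """
    q_u = x[: t + 1] @ W_Qu            # q^U_s for s = 0..t
    K_u = x[: t + 1] @ W_Ku            # K^U over positions 0..t
    V_u = x[: t + 1] @ W_Vu            # V^U over positions 0..t
    d = K_u.shape[-1]

    keys = torch.zeros(t + 1, d, dtype=x.dtype)
    for s in range(t):                 # position t sees no later tokens within step t
        # position s attends only over tokens s+1 .. t, never beyond step t
        scores = (q_u[s] @ K_u[s + 1 : t + 1].T) / d ** 0.5
        keys[s] = torch.sigmoid(scores) @ V_u[s + 1 : t + 1]
    return keys
```

Recomputing these keys from scratch at every step in this way would cost $O(L^3)$ over a sequence, which is precisely the overhead the parallel formulation in Section 3 removes.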

Each new prediction step updates these lookahead keys using only the context available up to that step, never accessing future tokens and never violating the autoregressive constraint. Attention output is calculated via a hybrid of static causal keys and dynamic lookahead keys: $w_t = \operatorname{softmax}(\alpha^C_t - \operatorname{SiLU}(\alpha^U_t))$, where $\alpha^C_t$ is the standard attention score and $\alpha^U_t$ encodes lookahead content, with $\operatorname{SiLU}$ serving as a gating (forgetting) mechanism.
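Continuing the sketch, the hybrid combination at step $t$ might be implemented as below; using the causal query $q^C_t$ to score both the static and the lookahead keys is an assumption of this illustration rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_step(q_c_t, K_c, K_lookahead, V_c):
    """Hybrid attention output for the prediction at step t.

    q_c_t       : (d,)      causal query for position t.
    K_c         : (t+1, d)  static causal keys for positions 0..t.
    K_lookahead : (t+1, d)  dynamic lookahead keys from the previous sketch.
    V_c         : (t+1, d)  causal values for positions 0..t.
    """
    d = K_c.shape[-1]
    alpha_c = (q_c_t @ K_c.T) / d ** 0.5           # standard causal scores
    alpha_u = (q_c_t @ K_lookahead.T) / d ** 0.5   # scores against lookahead keys
    # the SiLU-gated lookahead term is subtracted, acting as a forgetting mechanism
    w_t = F.softmax(alpha_c - F.silu(alpha_u), dim=-1)
    return w_t @ V_c                               # weighted sum of causal values
```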

Ablation studies show that utilizing only the lookahead keys or only static keys is insufficient for optimal performance; CASTLE’s hybrid attention achieves stronger results (Song et al., 9 Sep 2025).

3. Construction, Parallelization, and Theoretical Guarantees

While the lookahead key update process may appear inherently sequential, CASTLE provides a mathematical equivalence (see Theorem 1 in (Song et al., 9 Sep 2025)) allowing calculation of all lookahead keys for all positions in a single, parallelized pass. For a sequence of length $L$, this avoids prohibitive $O(L^3)$ costs and matches the $O(L^2 d)$ cost characteristic of conventional attention layers, ensuring scalability to large-batch pretraining.

The CASTLE attention for the entire sequence may be summarized as $\text{Attention}(X^L) = \operatorname{softmax}\!\left( \frac{Q^C (K^C)^\top}{\sqrt{d}} + M^C - \frac{R}{\sqrt{d}} \right) V^C$ (softmax taken row-wise), with

$R = \left(Q^C (K^U)^\top \odot \tilde{M}^C\right) \cdot \operatorname{sigmoid}\!\left( \frac{Q^U (K^U)^\top}{\sqrt{d}} + M^U \right)$

where all matrices leverage causal and lookahead projections. $M^C$ and $M^U$ represent lower-triangular attention masks, and $\tilde{M}^C$ designates key eligibility. All matrix computations can be structured for efficiency using optimized blockwise attention routines.

This equivalence guarantees that the parallelized computation strictly matches the sequentially updated lookahead key formulation and that future context is never illicitly accessed. The approach is compatible with high-throughput hardware and does not require modification to standard pretraining or decoding procedures.
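As a rough illustration of the single-pass form above, the sketch below evaluates the masked score matrix and the correction term $R$ for a full sequence; the projection names and the exact strict/non-strict triangular mask conventions are assumptions made for illustration, not details of the reference implementation, and in practice one would verify the result against the sequential formulation on small inputs.

```python
import torch

def castle_attention_parallel(x, W_Qc, W_Kc, W_Vc, W_Qu, W_Ku):
    """Single-pass sketch of CASTLE attention for a full sequence.

    Mirrors Attention(X^L) = softmax(Q^C K^C^T / sqrt(d) + M^C - R / sqrt(d)) V^C,
    with R = (Q^C K^U^T elementwise-masked by M~^C) @ sigmoid(Q^U K^U^T / sqrt(d) + M^U).
    Mask conventions below are illustrative assumptions.
    """
    L = x.shape[0]
    Qc, Kc, Vc = x @ W_Qc, x @ W_Kc, x @ W_Vc
    Qu, Ku = x @ W_Qu, x @ W_Ku
    d = Kc.shape[-1]

    lower = torch.tril(torch.ones(L, L, dtype=torch.bool))        # positions j <= i
    M_c = torch.zeros(L, L).masked_fill(~lower, float("-inf"))    # additive causal mask
    M_u = torch.zeros(L, L).masked_fill(lower, float("-inf"))     # lookahead mask: only j > i
    M_tilde_c = lower.float()                                     # eligible (query, key) pairs

    gate = torch.sigmoid(Qu @ Ku.T / d ** 0.5 + M_u)              # lookahead mixing weights
    R = ((Qc @ Ku.T) * M_tilde_c) @ gate                          # correction term

    scores = Qc @ Kc.T / d ** 0.5 + M_c - R / d ** 0.5
    return torch.softmax(scores, dim=-1) @ Vc                     # (L, d) attention output
```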

4. Empirical Performance and Ablation Findings

Evaluation on standard language modeling benchmarks demonstrates consistent improvements in validation loss and perplexity across all model scales:

  • CASTLE-Small: validation perplexity 16.411 (baseline) → 16.315;
  • CASTLE-Medium: 14.004 → 13.665;
  • CASTLE-Large: 12.269 → 11.840;
  • CASTLE-XL: 11.309 → 10.922 (with corresponding validation loss reductions; see Table 1 in (Song et al., 9 Sep 2025)).

On a range of downstream tasks (ARC, BoolQ, HellaSwag, MMLU, OBQA, PIQA, WinoGrande; 0-shot and 5-shot), CASTLE consistently outperforms size-matched baselines, with improvements most notable in larger models and on challenging reasoning and commonsense tasks.

Ablation analyses confirm that improvements are not artifacts of increased parameter count or key dimensionality, as CASTLE outperforms even when strictly matching the number of heads and key projections to baseline models. The hybrid use of causal and lookahead keys is superior to either alone. Gated forgetting with the SiLU function has minimal impact on perplexity but consistently enhances downstream generalization. This suggests CASTLE’s improvements are due to its capacity to dynamically “absorb” emerging context at each position, especially benefitting late-disambiguation scenarios.

5. Relation to Streaming and Adaptive Lookahead Approaches

CASTLE's fixed, parallelizable lookahead key mechanism is distinct from adaptive lookahead strategies designed for streaming scenarios, such as DCN (Moritz et al., 2021) or ANCAT (Strimel et al., 2023). In DCN, twin streams (causal/non-causal) and carefully structured key-value exchanges ensure a bounded receptive field in deep architectures, avoiding lookahead accumulation. ANCAT generalizes windowed lookahead by learning per-input, per-layer lookahead schedules, dynamically balancing accuracy and algorithmic latency in streaming speech recognition. In contrast, CASTLE's update scheme is motivated by offline (non-streaming) autoregressive sequence modeling and guarantees that previous tokens' representations continually absorb information as new tokens are processed, subject to autoregressive constraints and an efficient pretraining implementation.

6. Technical Summary and Comparative Table

| Aspect | Standard Causal Attention | CASTLE |
|---|---|---|
| Key vectors | Static; only own/past tokens | Dynamic; updated at each prediction step |
| Context per step | Context up to current position | All context up to the current generation step |
| Attention formula | Causal softmax with static masking | Hybrid of static and lookahead keys with SiLU gating |
| Pretraining cost | Efficient ($O(L^2 d)$) | Efficient via mathematical equivalence ($O(L^2 d)$) |
| Empirical gain | Limited for late-arriving context | Lower perplexity; improved downstream accuracy |

CASTLE’s innovations permit each position to “re-encode” its history in the light of newly revealed context at every step, yielding richer global representations, consistent with autoregressive generation, and scalable to large-scale data and model regimes.

7. Significance and Implications

CASTLE presents a principled resolution to the limitations imposed by static keys in causal attention schemes. By dynamically updating past keys as sequences unfold, it enables the model to resolve ambiguities introduced by delayed context, thereby enhancing modeling capacity for tasks that depend on long-range or globally-distributed evidence. Parallelizable by construction, CASTLE can be directly implemented within libraries optimized for large-batch attention (e.g., FlashAttention), without modifications to the core decoding logic. Ablation and benchmarking evidence suggest the lookahead key mechanism, distinctly from superficial architectural expansions, is responsible for improved perplexity and language understanding performance (Song et al., 9 Sep 2025). A plausible implication is that similar dynamic context encoding techniques may be beneficial in other domains where causal constraints, delayed information, and context re-interpretation are central.
