
Spectral Koopman Attention (SKA)

Updated 12 May 2026
  • Spectral Koopman Attention (SKA) is a method that replaces conventional self-attention by using closed-form ridge regression and spectral filtering to achieve constant-memory, KV-cache-free associative recall.
  • SKA integrates with state-space model (SSM) blocks, such as Echo and Mamba-2, to efficiently handle long-context reasoning without incurring linear growth in key–value cache.
  • Empirical evaluations demonstrate that SKA significantly improves retrieval accuracy over long token gaps while reducing computational complexity compared to standard attention mechanisms.

Spectral Koopman Attention (SKA) is a drop-in replacement for causal self-attention designed for associative recall with constant inference memory. By leveraging a closed-form dynamical operator and power-iterated spectral filtering, SKA augments state-space model (SSM) blocks to enable key–value binding and retrieval without the need for a linearly growing key–value (KV) cache. The method accumulates sufficient statistics in an O(r^2) streaming state, where r is a small projection rank, and fits a compact linear (Koopman) operator to the key and value stream via kernel ridge regression. This approach is central to Echo, a KV-cache-free associative recall architecture, and allows for infinite-horizon content-based retrieval in long-context agentic inference (Sridhar et al., 7 May 2026).

1. Motivation and Architectural Distinctions

SKA addresses critical limitations in both Transformer-based attention and recurrent SSMs. In standard softmax attention, compute and memory scale as O(N^2) and O(Nd), respectively, where N is sequence length and d is model dimensionality; the linearly growing KV cache becomes a bottleneck for long-context reasoning. By contrast, SSMs operate at constant memory per token but demonstrate a "memory cliff," with retrieval accuracy decaying exponentially as the gap between a stored fact and its recall increases.

SKA differs fundamentally from prior compression-based or linear attention variants and SSMs in the following respects:

  • Memory Usage: Maintains only three fixed-size matrices per head, each r × r or P × r (where P is head dimensionality), resulting in constant O(r^2 + Pr) memory per head.
  • Retrieval Dynamics: Rather than compressing all history into a fixed recurrent state, SKA computes and stores sufficient statistics for a ridge-regressed Koopman operator, preserving exact content associations for arbitrarily long horizons.
  • Interleaving with SSMs: In architectures like Echo, SKA layers are alternated with SSM (e.g., Mamba-2) layers, allowing local sequential modeling and global associative retrieval.
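To make the memory contrast concrete, the following sketch counts stored floats per head for a KV cache and for an SKA state. The split into two r × r matrices and one P × r matrix is an assumption (the text only says each of the three matrices has one of those two shapes), and the function names and sizes are illustrative.

```python
def kv_cache_floats(n_tokens: int, head_dim: int) -> int:
    """Floats held by one head's KV cache: one key and one value vector per token."""
    return 2 * n_tokens * head_dim


def ska_state_floats(rank: int, head_dim: int) -> int:
    """Floats held by one head's SKA state.

    Assumption: two r x r matrices plus one P x r matrix.
    """
    return 2 * rank * rank + head_dim * rank


if __name__ == "__main__":
    P, r = 64, 64  # hypothetical head dimension and projection rank
    for n in (1_000, 100_000, 10_000_000):
        # The KV cache grows with n; the SKA state does not depend on n at all.
        print(n, kv_cache_floats(n, P), ska_state_floats(r, P))
```

Under these assumed sizes the SKA state is the same for every sequence length, while the KV cache grows in proportion to the number of tokens.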

2. Mathematical Formulation

SKA operates on input representations […], projecting into key, query, and value spaces via matrices […] and […]. Key and query vectors are normalized per block:

[…]

[…]

Three streaming statistics are accumulated for each head:

[…]

[…]

[…]

Closed-form kernel ridge regression computes the Koopman operator and regression weights:

[…]

[…]
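The streaming accumulation and closed-form ridge fit can be sketched in NumPy. This is one plausible instantiation, not the authors' exact definitions: the names G, M, C, A, B and lam, the choice of a one-step key cross-moment for M, and the specific ridge formulas are all assumptions chosen to match the shapes and the prose description.

```python
import numpy as np


def accumulate(keys, values):
    """Stream over (key, value) pairs, keeping three fixed-size matrices.

    Assumed statistics (hypothetical names):
      G = sum_t k_t k_t^T        (r x r key Gram matrix)
      M = sum_t k_{t+1} k_t^T    (r x r one-step key cross-moment)
      C = sum_t v_t k_t^T        (P x r value-key cross-moment)
    """
    r, P = keys.shape[1], values.shape[1]
    G, M, C = np.zeros((r, r)), np.zeros((r, r)), np.zeros((P, r))
    prev = None
    for k, v in zip(keys, values):
        G += np.outer(k, k)
        C += np.outer(v, k)
        if prev is not None:
            M += np.outer(k, prev)
        prev = k
    return G, M, C


def ridge_fit(G, M, C, lam=1e-3):
    """Ridge solutions A = M (G + lam I)^-1 and B = C (G + lam I)^-1."""
    G_reg = G + lam * np.eye(G.shape[0])
    # G_reg is symmetric, so solving against the transposes avoids an explicit inverse.
    A = np.linalg.solve(G_reg, M.T).T
    B = np.linalg.solve(G_reg, C.T).T
    return A, B
```

With orthonormal keys and a small lam, `B @ keys[i]` returns approximately `values[i]`, which is the key–value binding behavior the section describes. The state size depends only on r and P, not on how many pairs were streamed.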

With Cholesky factorization […], the whitened Koopman operator is […] and is spectrally normalized to ensure bounded norm:

[…]

Retrieval proceeds by mapping the query to whitened space, applying […] steps of […], and then projecting back for value readout:

[…]

[…]

[…]

[…]

Alternatively, this process can be interpreted as spectral weighting of operator eigenmodes.
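The whitening, normalization, and power-iterated readout can be sketched as follows, together with the equivalent eigenmode view. Everything here is a hedged illustration: the statistics G, M, C, the regularizer lam, the step count K, the bound gamma, and the rule "rescale so the spectral norm is at most gamma" are assumptions rather than the paper's definitions.

```python
import numpy as np


def ska_retrieve(G, M, C, q, lam=1e-3, K=2, gamma=1.0):
    """Whitened, spectrally normalized, power-iterated readout (illustrative only)."""
    r = G.shape[0]
    L = np.linalg.cholesky(G + lam * np.eye(r))      # G + lam*I = L @ L.T
    # Whitened operator A_w = L^-1 M L^-T. np.linalg.solve is a general solver;
    # a dedicated triangular solver would exploit the structure of L.
    A_w = np.linalg.solve(L, np.linalg.solve(L, M.T).T)
    # Assumed normalization: rescale so the spectral norm is at most gamma.
    sigma = np.linalg.norm(A_w, 2)
    A_w = A_w * min(1.0, gamma / max(sigma, 1e-12))
    z = np.linalg.solve(L, q)                        # whiten the query
    for _ in range(K):                               # K applications of A_w
        z = A_w @ z
    y = C @ np.linalg.solve(L.T, z)                  # un-whiten, read out value
    return y, A_w


def filter_by_eigenmodes(A_w, z, K):
    """The same K-step filter written as per-eigenmode weighting by eigenvalue**K."""
    evals, V = np.linalg.eig(A_w)        # assumes A_w is diagonalizable
    coeffs = np.linalg.solve(V, z)       # coordinates of z in the eigenbasis
    return V @ (evals ** K * coeffs)     # may carry a negligible imaginary part
```

In this sketch, K = 0 reduces the readout to the plain ridge solution C (G + lam I)^-1 q. For K > 0, modes whose eigenvalue magnitude is below 1 are attenuated geometrically, which is one way to read the "spectral weighting" remark.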

3. Integration with SSM Blocks

In the Echo architecture, SKA is interleaved with SSM layers, such as those based on Mamba-2. The typical update sequence per token is:

  1. Apply SSM update to obtain […] from […] and […].
  2. Compute SKA projections and update streaming statistics ([…]).
  3. Perform Cholesky factorization, update operator fits ([…]), apply spectral normalization, and execute power-iterated retrieval.
  4. Add the SKA output to the residual, apply LayerNorm, and feed into an MLP.

This workflow enables the model to maintain constant-size state for associative recall, while allowing high-capacity local sequential modeling through SSM recurrence. Chunk-causal computation supports efficient LM training and ensures equivalence to full-prefix statistics when partitioned into reasonable chunk sizes (e.g., […]).
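The equivalence between chunked and full-prefix statistics follows from the statistics being plain sums. A minimal sketch, assuming a key Gram matrix of the form sum_t k_t k_t^T (a hypothetical choice of statistic) and illustrative function names:

```python
import numpy as np


def chunked_grams(keys, chunk):
    """Return the running key Gram matrix after each chunk of `chunk` tokens."""
    r = keys.shape[1]
    G = np.zeros((r, r))
    snapshots = []
    for start in range(0, len(keys), chunk):
        block = keys[start:start + chunk]
        G = G + block.T @ block        # one matmul per chunk instead of per token
        snapshots.append(G)
    return snapshots


def prefix_gram(keys, t):
    """Gram matrix over the first t tokens, accumulated token by token."""
    r = keys.shape[1]
    G = np.zeros((r, r))
    for k in keys[:t]:
        G += np.outer(k, k)
    return G
```

At every chunk boundary the chunked snapshot matches the token-by-token prefix statistic up to floating-point rounding, so the chunk size affects throughput rather than the fitted operator at those boundaries.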

4. Computational Complexity

The computational and memory complexity of SKA, standard attention, and SSM blocks are as follows:

Model Type           Training Complexity    Inference Complexity    State Memory
Standard Attention   […]                    […]                     O(Nd) (KV-cache)
SSM (Mamba-2)        […]                    […]                     […]
SKA                  […]*                   […]                     O(r^2 + Pr) (fixed)

*Per chunk of size […].

In SKA, memory and compute requirements remain constant in sequence length N, scaling instead with projection rank r and head dimensionality P. The dominant cost arises from Cholesky decomposition and triangular solves in O(r^3) per token.

5. Empirical Evaluations

Empirical results demonstrate that SKA-augmented SSMs eliminate the memory cliff observed in pure SSMs and outperform both standard attention and hybrid SSM+Attention approaches:

  • Sub-Million Scale Transfer: On mixed synthetic benchmarks (tool-trace, recall, multi-hop) at ~1M parameter scale, SSM+SKA achieves ≈81% mean accuracy compared to SSM+Attn (76%) and pure SSM (54%). In length generalization, SSM+SKA retains 65% retrieval accuracy at gaps of […] tokens, where others fall below 5%.
  • Multi-Query Associative Recall (MQAR, 50M): Pure Mamba-2 is at chance (≈3%), Mamba-2+Attn nearly achieves 100%, and Mamba-2+SKA achieves 100% accuracy in all tested configurations, including settings with 32 KV pairs and gaps up to […].
  • Language Modeling (180M): Echo-180M achieves best-in-class performance on 5 out of 6 held-out transfer benchmarks and records a perplexity of 16.48 on WikiText-103, outperforming or matching baseline Transformer, Mamba-2/3, and GDN models at comparable scale with less data.

6. Ablation and Analysis

Ablation studies indicate that the retrieval accuracy gains of SKA are attributable primarily to the spectral operator and closed-form ridge regression, not to the prefix mask or masking strategies. Removing the action mask (i.e., accumulating statistics across all tokens) yields negligible differences in training. Both prefix and masked retrieval variants share equivalent operator fits. Chunk-causal training at reasonable chunk size (e.g., […]) results in no loss in retrieval relative to full-prefix compute, confirming correct accumulation semantics.

7. Implementation Recommendations

Key implementation considerations for SKA are as follows:

  • Projection rank r: Should be selected near the per-head dimension P or in the range […] for 50–200M models. Larger r offers better conditioning (stability, retrieval bandwidth) but increases cubic cost.
  • Regularization […] ([…]): Set near […] to […] for Gram invertibility while preserving operator expressiveness.
  • Power filter order […]: […] separates persistent from transient modes; higher values may oversuppress intermediate eigenmodes.
  • Spectral normalization […]: Learnable in […] to maintain output variance.
  • Numerical precision: Key accumulations and factorizations should employ full FP32 precision; cast outputs to the model dtype for speed.
  • Kernel fusion: Fusing the batched Cholesky/inversion steps and power iterations in a single GPU kernel improves throughput.
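The regularization and precision recommendations can be illustrated with a small NumPy sketch. The values of lam, the rank, the token count, and the float16 output type are hypothetical, chosen only to show the effect; they are not the settings from the source.

```python
import numpy as np


def rank_deficient_gram(r=16, T=8, seed=0):
    """Gram matrix built from fewer tokens than the rank, so it is singular."""
    keys = np.random.default_rng(seed).standard_normal((T, r)).astype(np.float32)
    return keys.T @ keys               # accumulation stays in float32


def regularized_cond(G, lam):
    """Condition number of G + lam*I, evaluated in float64 for reporting."""
    G64 = G.astype(np.float64)
    return np.linalg.cond(G64 + lam * np.eye(G.shape[0]))


def factor_fp32(G, lam, out_dtype=np.float16):
    """Factor in float32, then cast the result down to a (hypothetical) model dtype."""
    eye = np.eye(G.shape[0], dtype=np.float32)
    G32 = G.astype(np.float32) + np.float32(lam) * eye
    L = np.linalg.cholesky(G32)        # float32 in, float32 out
    return L.astype(out_dtype)
```

Early in a sequence the Gram matrix has rank at most the number of tokens seen, so some ridge term is needed before a Cholesky factorization is possible at all. A larger lam lowers the condition number, at the price of shrinking the fitted operator toward zero.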

In summary, Spectral Koopman Attention enables content-addressed retrieval via explicit kernel ridge regression and spectral filtering from constant streaming state, eliminating the SSM memory cliff and achieving attention-equivalent associative recall with constant inference memory. This provides a backbone for architectures such as Echo, which deliver KV-cache-free associative recall for long-context agentic reasoning and tool-calling scenarios (Sridhar et al., 7 May 2026).
