Spectral Koopman Attention (SKA)
- Spectral Koopman Attention (SKA) is a method that replaces conventional self-attention by using closed-form ridge regression and spectral filtering to achieve constant-memory, KV-cache-free associative recall.
- SKA integrates with state-space model (SSM) blocks such as Mamba-2, as in the Echo architecture, to handle long-context reasoning efficiently without incurring linear growth of the key–value cache.
- Empirical evaluations demonstrate that SKA significantly improves retrieval accuracy over long token gaps while reducing computational complexity compared to standard attention mechanisms.
Spectral Koopman Attention (SKA) is a drop-in replacement for causal self-attention designed for associative recall with constant inference memory. By leveraging a closed-form dynamical operator and power-iterated spectral filtering, SKA augments state-space model (SSM) blocks to enable key–value binding and retrieval without a linearly growing key–value (KV) cache. The method accumulates sufficient statistics in a fixed-size streaming state whose dimensions are set by a small projection rank, and fits a compact linear (Koopman) operator to the key and value stream via kernel ridge regression. This approach is central to Echo, a KV-cache-free associative recall architecture, and allows for infinite-horizon content-based retrieval in long-context agentic inference (Sridhar et al., 7 May 2026).
1. Motivation and Architectural Distinctions
SKA addresses critical limitations in both Transformer-based attention and recurrent SSMs. In standard softmax attention, memory grows linearly and compute quadratically with sequence length, each scaled by the model dimensionality; the linearly growing KV cache creates a bottleneck for long-context reasoning. By contrast, SSMs operate at constant memory per token but exhibit a "memory cliff," with retrieval accuracy decaying exponentially as the gap between a stored fact and its recall increases.
SKA differs fundamentally from prior compression-based or linear attention variants and SSMs in the following respects:
- Memory Usage: Maintains only three fixed-size matrices per head, with dimensions set by the projection rank and the head dimensionality, resulting in constant memory per head.
- Retrieval Dynamics: Rather than compressing all history into a fixed recurrent state, SKA computes and stores sufficient statistics for a ridge-regressed Koopman operator, preserving exact content associations for arbitrarily long horizons.
- Interleaving with SSMs: In architectures like Echo, SKA layers are alternated with SSM (e.g., Mamba-2) layers, allowing local sequential modeling and global associative retrieval.
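The memory contrast described above can be made concrete with a little arithmetic. The following is a minimal sketch, not the paper's accounting: the function names, the layer and head counts, and the assumed shapes of the three per-head matrices (two rank × rank and one head_dim × rank) are all illustrative assumptions.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache one key and one value vector per past token."""
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem


def fixed_state_bytes(n_layers, n_heads, rank, head_dim, bytes_per_elem=4):
    """Bytes for three fixed-size matrices per head.

    Assumed shapes (hypothetical): two (rank x rank) matrices and one
    (head_dim x rank) matrix, stored in FP32.
    """
    per_head = 2 * rank * rank + head_dim * rank
    return n_layers * n_heads * per_head * bytes_per_elem


for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_layers=12, n_heads=12, head_dim=64)
    fixed = fixed_state_bytes(n_layers=12, n_heads=12, rank=64, head_dim=64)
    print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"fixed state {fixed / 2**20:5.1f} MiB")
```

The KV-cache figure grows in proportion to the number of tokens, while the fixed-state figure does not depend on `seq_len` at all.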
2. Mathematical Formulation
SKA operates on the input representations, projecting them into key, query, and value spaces via projection matrices. Key and query vectors are normalized per block.
Three streaming statistics, each a fixed-size matrix, are accumulated for each head.
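As an illustration of how such statistics can be maintained online, here is a minimal NumPy sketch. The specific choice of statistics is an assumption, not taken from the source: a key Gram matrix `G`, a one-step key transition matrix `M`, and a value-key cross matrix `C`, which are the usual sufficient statistics for ridge-regressed linear operators. The class name `StreamingStats` is hypothetical.

```python
import numpy as np


class StreamingStats:
    """Running sums for one head (assumed definitions, see lead-in)."""

    def __init__(self, rank, head_dim):
        self.G = np.zeros((rank, rank))      # sum_t k_t k_t^T
        self.M = np.zeros((rank, rank))      # sum_t k_t k_{t-1}^T
        self.C = np.zeros((head_dim, rank))  # sum_t v_t k_t^T
        self.prev_k = None

    def update(self, k, v):
        """Fold one (key, value) pair into the fixed-size state."""
        self.G += np.outer(k, k)
        self.C += np.outer(v, k)
        if self.prev_k is not None:
            self.M += np.outer(k, self.prev_k)
        self.prev_k = k


rng = np.random.default_rng(0)
K = rng.standard_normal((50, 8))    # 50 keys of rank 8
V = rng.standard_normal((50, 16))   # 50 values of head_dim 16

stats = StreamingStats(rank=8, head_dim=16)
for k, v in zip(K, V):
    stats.update(k, v)

# The streaming sums match the batch computation over the full prefix.
print(np.allclose(stats.G, K.T @ K))
```

The state size is independent of how many tokens have been folded in, which is the property the text relies on.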
Closed-form kernel ridge regression on these statistics yields the Koopman operator and the regression weights.
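A closed-form ridge fit of this kind can be sketched as follows. The formulas used here, `A = M (G + lam I)^{-1}` for the operator and `W = C (G + lam I)^{-1}` for the readout weights, are the standard ridge-regression solutions and are an assumption about the method, as are the names `ridge_fit`, `lam`, `G`, `M`, and `C`.

```python
import numpy as np


def ridge_fit(G, M, C, lam=1e-3):
    """Closed-form ridge solutions from accumulated statistics (assumed form)."""
    reg = G + lam * np.eye(G.shape[0])   # regularized Gram matrix, symmetric
    A = np.linalg.solve(reg, M.T).T      # M @ inv(reg), without forming the inverse
    W = np.linalg.solve(reg, C.T).T      # C @ inv(reg)
    return A, W


rng = np.random.default_rng(0)
# Four orthonormal keys in an 8-dimensional key space, each bound to a value.
Q, _ = np.linalg.qr(rng.standard_normal((8, 4)))
keys = Q.T                                # shape (4, 8)
values = rng.standard_normal((4, 16))     # shape (4, 16)

G = keys.T @ keys
C = values.T @ keys
M = keys[1:].T @ keys[:-1]

A, W = ridge_fit(G, M, C)
recalled = W @ keys[2]                    # query with the third stored key
print(np.allclose(recalled, values[2], rtol=1e-2))
```

With orthonormal keys the readout returns each stored value shrunk by a factor of `1 / (1 + lam)`, so a small regularizer leaves recall nearly exact.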
With a Cholesky factorization, the Koopman operator is expressed in whitened coordinates and spectrally normalized to ensure bounded norm.
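One way to realize whitening and spectral normalization is sketched below. It assumes the Cholesky factor is taken of the regularized Gram matrix, that the whitened operator is `L^{-1} M L^{-T}`, and that normalization divides by the largest singular value whenever it exceeds 1. These forms and the name `whiten_and_normalize` are assumptions for illustration; the source's exact normalization may differ.

```python
import numpy as np


def whiten_and_normalize(G, M, lam=1e-3):
    """Return the Cholesky factor L and a whitened operator with norm <= 1."""
    rank = G.shape[0]
    L = np.linalg.cholesky(G + lam * np.eye(rank))  # lower-triangular, L @ L.T = G + lam*I
    X = np.linalg.solve(L, M)                       # L^{-1} M
    A_w = np.linalg.solve(L, X.T).T                 # L^{-1} M L^{-T}
    sigma_max = np.linalg.norm(A_w, 2)              # largest singular value
    A_n = A_w / max(1.0, sigma_max)                 # cap the spectral norm at 1
    return L, A_n


rng = np.random.default_rng(0)
K = rng.standard_normal((200, 8))
G = K.T @ K                   # key Gram matrix
M = K[1:].T @ K[:-1]          # one-step transition statistic

L, A_n = whiten_and_normalize(G, M)
print(np.linalg.norm(A_n, 2) <= 1.0 + 1e-9)
```

Bounding the spectral norm keeps repeated application of the operator from amplifying any direction, which matters once the operator is raised to a power during retrieval.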
Retrieval proceeds by mapping the query to whitened space, applying a fixed number of steps of the normalized operator (the power filter order), and then projecting back for value readout.
Alternatively, this process can be interpreted as spectral weighting of operator eigenmodes.
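The eigenmode interpretation can be checked numerically. The sketch below uses a symmetric operator with hand-picked eigenvalues, which is a simplifying assumption (the fitted operator need not be symmetric), and a hypothetical filter order of 4. Applying the operator repeatedly is the same as weighting each eigenmode by its eigenvalue raised to that power.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric operator with known eigenvalues (illustrative choice).
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
eigvals = np.array([1.0, 0.9, 0.5, 0.1])
A = U @ np.diag(eigvals) @ U.T

z = rng.standard_normal(4)
order = 4  # hypothetical power filter order

# Route 1: apply the operator `order` times.
z_power = z.copy()
for _ in range(order):
    z_power = A @ z_power

# Route 2: weight each eigenmode by eigenvalue ** order.
weights = eigvals ** order
z_spectral = U @ (weights * (U.T @ z))

print(np.allclose(z_power, z_spectral))
print(weights)  # persistent modes keep their weight, transient modes shrink
```

Modes with eigenvalue near 1 pass through almost unchanged, while a mode with eigenvalue 0.1 is attenuated by four orders of magnitude after four steps.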
3. Integration with SSM Blocks
In the Echo architecture, SKA is interleaved with SSM layers, such as those based on Mamba-2. The typical update sequence per token is:
- Apply the SSM update to obtain the token's hidden representation.
- Compute SKA projections and update the three streaming statistics.
- Perform Cholesky factorization, update the operator fits (Koopman operator and regression weights), apply spectral normalization, and execute power-iterated retrieval.
- Add the SKA output to the residual, apply LayerNorm, and feed into an MLP.
This workflow enables the model to maintain constant-size state for associative recall, while allowing high-capacity local sequential modeling through SSM recurrence. Chunk-causal computation supports efficient LM training and ensures equivalence to full-prefix statistics when the sequence is partitioned into reasonably sized chunks.
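The equivalence between chunked and full-prefix statistics follows from the statistics being plain sums. A minimal sketch for a key Gram matrix is shown below; the function name `chunked_prefix_grams` and the chunk size of 32 are illustrative assumptions, and a transition statistic would additionally need the last key of the previous chunk carried across the boundary.

```python
import numpy as np


def chunked_prefix_grams(keys, chunk_size):
    """Return (end_index, running Gram matrix) after each chunk of keys."""
    rank = keys.shape[1]
    G = np.zeros((rank, rank))
    out = []
    for start in range(0, keys.shape[0], chunk_size):
        chunk = keys[start:start + chunk_size]
        G = G + chunk.T @ chunk          # one batched update per chunk
        out.append((start + chunk.shape[0], G))
    return out


rng = np.random.default_rng(0)
keys = rng.standard_normal((100, 8))
prefix_grams = chunked_prefix_grams(keys, chunk_size=32)

for end, G in prefix_grams:
    # Each chunk boundary sees exactly the full-prefix statistic.
    print(end, np.allclose(G, keys[:end].T @ keys[:end]))
```

Because each chunk contributes a single matrix product, training can process tokens in parallel within a chunk while keeping the same accumulated state that token-by-token streaming would produce at every chunk boundary.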
4. Computational Complexity
The computational and memory complexity of SKA, standard attention, and SSM blocks are as follows:
| Model Type | Training Complexity | Inference Complexity | State Memory |
|---|---|---|---|
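| Standard Attention | Quadratic in sequence length | Linear in sequence length per token | Linear in sequence length (KV-cache) |
| SSM (Mamba-2) | Linear in sequence length | Constant per token | Constant |
| SKA | Constant in sequence length* | Constant per token | Constant (fixed) |
*Per training chunk.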
In SKA, memory and compute requirements remain constant in sequence length, scaling instead with projection rank and head dimensionality. The dominant per-token cost arises from the Cholesky decomposition, which is cubic in the projection rank, and the associated triangular solves.
5. Empirical Evaluations
Empirical results demonstrate that SKA-augmented SSMs eliminate the memory cliff observed in pure SSMs and outperform both standard attention and hybrid SSM+Attention approaches:
- Sub-Million Scale Transfer: On mixed synthetic benchmarks (tool-trace, recall, multi-hop) at ~1M parameter scale, SSM+SKA achieves ≈81% mean accuracy compared to SSM+Attn (76%) and pure SSM (54%). In length generalization, SSM+SKA retains 65% retrieval accuracy at long token gaps where the other models fall below 5%.
- Multi-Query Associative Recall (MQAR, 50M): Pure Mamba-2 is at chance (≈3%), Mamba-2+Attn nearly achieves 100%, and Mamba-2+SKA achieves 100% accuracy in all tested configurations, including settings with 32 KV pairs and long token gaps.
- Language Modeling (180M): Echo-180M achieves best-in-class performance on 5 out of 6 held-out transfer benchmarks and records a perplexity of 16.48 on WikiText-103, outperforming or matching baseline Transformer, Mamba-2/3, and GDN models at comparable scale while using less data.
6. Ablation and Analysis
Ablation studies indicate that the retrieval accuracy gains of SKA are attributable primarily to the spectral operator and closed-form ridge regression, not to the prefix mask or masking strategies. Removing the action mask (i.e., accumulating statistics across all tokens) yields negligible differences in training. Both prefix and masked retrieval variants share equivalent operator fits. Chunk-causal training at a reasonable chunk size results in no loss in retrieval relative to full-prefix compute, confirming correct accumulation semantics.
7. Implementation Recommendations
Key implementation considerations for SKA are as follows:
- Projection rank: Should be selected near the per-head dimension for 50–200M models. A larger rank offers better conditioning (stability, retrieval bandwidth) but increases the cubic factorization cost.
- Regularization: Set to a small positive value, large enough for Gram invertibility while preserving operator expressiveness.
- Power filter order: A low order separates persistent from transient modes; higher values may oversuppress intermediate eigenmodes.
- Spectral normalization scale: Learnable within a bounded range to maintain output variance.
- Numerical precision: Key accumulations and factorizations should use full FP32 precision; cast outputs to the model dtype for speed.
- Kernel fusion: Fusing the batched Cholesky/inversion steps and power iterations in a single GPU kernel improves throughput.
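The numerical precision recommendation can be illustrated with a small experiment. This sketch accumulates a key Gram matrix in FP16 and in FP32 and compares both against an FP64 reference; the function name `accumulate_gram`, the key count, and the key scaling are illustrative assumptions rather than settings from the source.

```python
import numpy as np


def accumulate_gram(keys, dtype):
    """Accumulate sum_t k_t k_t^T one token at a time in the given dtype."""
    G = np.zeros((keys.shape[1], keys.shape[1]), dtype=dtype)
    for k in keys.astype(dtype):
        G += np.outer(k, k)   # running sum stays in `dtype`
    return G


rng = np.random.default_rng(0)
keys = rng.standard_normal((2000, 8)) / np.sqrt(8)   # roughly unit-norm keys
reference = keys.T @ keys                            # FP64 reference

G16 = accumulate_gram(keys, np.float16)
G32 = accumulate_gram(keys, np.float32)

err16 = np.abs(G16.astype(np.float64) - reference).max()
err32 = np.abs(G32.astype(np.float64) - reference).max()
print(f"max abs error  fp16: {err16:.3e}  fp32: {err32:.3e}")

# Accumulate in FP32, then cast only the result to the model dtype.
output = G32.astype(np.float16)
```

As the running sum grows, the spacing between representable FP16 values becomes comparable to each increment, so low-precision accumulation loses information that a later factorization cannot recover.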
In summary, Spectral Koopman Attention enables content-addressed retrieval via explicit kernel ridge regression and spectral filtering from constant streaming state, eliminating the SSM memory cliff and achieving attention-equivalent associative recall with constant inference memory. This provides a backbone for architectures such as Echo, which deliver KV-cache-free associative recall for long-context agentic reasoning and tool-calling scenarios (Sridhar et al., 7 May 2026).