
Gated KalmaNet: Full-History, Linear-Memory Model

Updated 3 December 2025
  • Gated KalmaNet is a neural sequence layer that fully conditions on past inputs by formulating sequence updates as an online ridge regression problem solved with Chebyshev iteration.
  • It achieves constant-memory, linear-time computation while employing adaptive regularization and gating mechanisms to ensure numerical stability and precise long-range recall.
  • Empirical evaluations reveal that GKA outperforms fading-memory models on long-context tasks and scales efficiently on modern accelerators with ultra-long sequences.

Gated KalmaNet (GKA) is a neural sequence layer that bridges the performance gap between quadratic-cost softmax attention and linear-memory fading-memory state-space models (SSMs). GKA achieves constant-memory, linear-time computation while conditioning the output at each timestep on the complete sequence history, leveraging test-time online ridge regression solved via a numerically stable Chebyshev iteration. This approach retains the efficiency and scalability of SSMs while enabling exact recall of the entire context, addressing limitations inherent in previous methods.

1. Motivation and Relation to Prior Architectures

Traditional softmax attention mechanisms, as used in Transformers, maintain explicit access to all past key–value pairs, enabling “eidetic” memory at quadratic cost in sequence length. This renders ultra-long-context inference (≫10K tokens) computationally expensive and often impractical. Linear SSM layers such as RetNet, Mamba2, DeltaNet, and GLA replace the attention memory with a fixed-size state $S_t \in \mathbb{R}^{D\times D}$, updated by

$$S_t = \gamma_t S_{t-1} + \beta_t v_t k_t^\top, \quad y_t = S_t q_t,$$

eliminating the quadratic memory cost and reducing per-token computation to $O(D^2)$. However, because $\gamma_t<1$, the effective state retains only a fading, lossy summary of the distant past, resulting in inferior performance on tasks that require precise, long-range recall.
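
The fading behaviour of this recurrence can be illustrated with a minimal NumPy sketch. The function name, the constant decay of 0.9, and the random data are assumptions chosen for illustration; they are not taken from any of the cited architectures.

```python
import numpy as np

def fading_memory_scan(keys, values, queries, gamma, beta):
    """Run S_t = gamma_t * S_{t-1} + beta_t * v_t k_t^T and y_t = S_t q_t."""
    T, D = keys.shape
    S = np.zeros((D, D))
    outputs = np.empty((T, D))
    for t in range(T):
        # Decay the old state, then write the new key-value association.
        S = gamma[t] * S + beta[t] * np.outer(values[t], keys[t])
        outputs[t] = S @ queries[t]
    return outputs, S

rng = np.random.default_rng(0)
T, D = 64, 8
keys = rng.standard_normal((T, D))
values = rng.standard_normal((T, D))
gamma = np.full(T, 0.9)   # assumed constant decay, for illustration only
beta = np.ones(T)

outputs, S_final = fading_memory_scan(keys, values, keys, gamma, beta)

# Weight with which the first token still contributes to the final state:
# it has been multiplied by every later decay factor.
first_token_weight = np.prod(gamma[1:])
print(f"weight of token 0 after {T} steps: {first_token_weight:.2e}")
```

With a decay of 0.9, the first token's contribution shrinks by several orders of magnitude within a few dozen steps, which is the lossy-summary effect described above.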

GKA is designed to preserve the compute and memory efficiency of linear SSMs while, at each timestep, exactly conditioning on all prior inputs. This is accomplished by formulating the sequence model update as a test-time online ridge regression over the entire history, systematically overcoming the recall limitations of fading-memory models (Peng et al., 26 Nov 2025).

2. Mathematical Formulation

At each time $t$, GKA computes a state $S_t$ by solving a regularized, weighted least-squares regression in dual (information) form:

$$\hat S_t = \arg\min_{S\in\mathbb{R}^{D\times D}} \lambda\|S\|_{\mathrm F}^2 + \sum_{i=1}^t \eta_i \|S k_i - v_i\|_2^2,$$

where, for each step $i$:

  • $k_i, v_i \in \mathbb{R}^D$ are the key and value vectors,
  • $\eta_i\in [0,1]$ are learned exponential fading weights,
  • $\lambda > 0$ provides Tikhonov regularization.

The analytic solution is

$$\hat S_t = \Big(\sum_{i=1}^t \eta_i v_i k_i^\top\Big)\Big(\sum_{i=1}^t \eta_i k_i k_i^\top + \lambda I\Big)^{-1},$$

with $I$ the $D\times D$ identity matrix. The output for query $q_t \in \mathbb{R}^D$ is $y_t = \hat S_t q_t$.

By using all past key–value pairs $(k_i, v_i)$ (with appropriately chosen or learned exponential fading weights), GKA departs from conventional SSMs’ fixed state summaries, providing a theoretically optimal solution to full-history regression under linear-memory constraints.
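
As a rough check of the closed form implied by the ridge objective, the sketch below solves the regression directly and verifies that the gradient of the objective vanishes at the result. The names `H` (weighted Gram matrix of the keys) and `U` (weighted key–value correlation), the helper `ridge_state`, and the toy sizes are assumptions introduced for this example.

```python
import numpy as np

def ridge_state(keys, values, eta, lam):
    """Minimise lam * ||S||_F^2 + sum_i eta_i * ||S k_i - v_i||^2 in closed form."""
    D = keys.shape[1]
    H = (keys * eta[:, None]).T @ keys     # sum_i eta_i k_i k_i^T
    U = (values * eta[:, None]).T @ keys   # sum_i eta_i v_i k_i^T
    # S (H + lam I) = U; H is symmetric, so solve the transposed system.
    return np.linalg.solve(H + lam * np.eye(D), U.T).T

rng = np.random.default_rng(0)
T, D = 6, 8                      # fewer stored pairs than state dimensions
keys = rng.standard_normal((T, D))
values = rng.standard_normal((T, D))
eta = np.ones(T)                 # no fading, for illustration
lam = 1e-6

S = ridge_state(keys, values, eta, lam)

# With weak regularisation and T <= D, every stored value is recalled
# almost exactly from its own key.
recalled = keys @ S.T            # row i is S k_i
print("max recall error:", np.abs(recalled - values).max())
```

In this regime the state reproduces every stored value from its key, in contrast to a decayed running sum, which is one way to read the "full-history" claim.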

3. Adaptive Regularization and Gating Mechanisms

Numerical stability of ridge regression deteriorates on long sequences because the condition number of the regularized Gram matrix $\sum_{i=1}^t \eta_i k_i k_i^\top + \lambda I$ grows as more tokens accumulate. A uniform regularizer causes either loss of memory (if $\lambda$ is too large) or instability (if it is too small). GKA addresses this with an adaptive regularization schedule in which the regularization strength at each step is scaled to the magnitude of the Gram matrix through a coefficient that is either learned or set as a hyperparameter. This bounds the condition number by a constant, ensuring numerical tractability across sequence lengths and avoiding catastrophic forgetting or gradient instability.
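
To illustrate how tying the regularizer to the scale of the Gram matrix keeps the system well conditioned, the sketch below assumes one particular schedule, $\lambda_t = a\,\|H_t\|_{\mathrm F}$, where $H_t$ is the Gram matrix of the keys and `a` is a hypothetical coefficient. This exact form is an assumption made for the example, not a confirmed detail of GKA. Under it, the condition number cannot exceed $1 + 1/a$, because the largest eigenvalue of $H_t$ is at most its Frobenius norm.

```python
import numpy as np

def adaptive_lambda(H, a):
    """Assumed schedule: regulariser proportional to the Frobenius norm of H."""
    return a * np.linalg.norm(H, "fro")

rng = np.random.default_rng(0)
D, a = 16, 0.05
bound = 1.0 + 1.0 / a
conds = []
for T in (4, 64, 1024):
    K = rng.standard_normal((T, D))
    H = K.T @ K                               # Gram matrix, grows with T
    lam = adaptive_lambda(H, a)
    cond = np.linalg.cond(H + lam * np.eye(D))
    conds.append(cond)
    print(f"T={T:5d}  cond={cond:7.2f}  bound={bound:.2f}")
```

The printed condition numbers stay below the same bound whether the Gram matrix is rank-deficient (few tokens) or large in magnitude (many tokens), whereas a fixed $\lambda$ would let the ratio grow with the sequence.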

Additionally, recency bias and flexible memory can be learned using an input-conditioned gating architecture. The per-token fading weights $\eta_i$ are parameterized as a product of gate values, each obtained by applying the sigmoid activation $\sigma$ to an internal summary state. This product form ensures exponential decay in memory contribution and admits a memory-efficient implementation.
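
One way such a product-form gate could be realised is sketched below. The gate logits stand in for whatever the layer computes from its summary state; their source, the helper names, and the recursive Gram update are assumptions for illustration only. The sketch checks that decaying the running Gram matrix by the current gate at every step is equivalent to weighting each past key by the product of all later gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fading_weights(gate_logits):
    """eta_i = product of sigmoid(gate_logits[j]) over j > i, at the final step."""
    log_g = np.log(sigmoid(gate_logits))
    # Suffix sums of the log-gates, excluding position i itself.
    suffix = np.cumsum(log_g[::-1])[::-1] - log_g
    return np.exp(suffix)

rng = np.random.default_rng(0)
T, D = 32, 4
keys = rng.standard_normal((T, D))
gate_logits = rng.standard_normal(T)   # hypothetical input-conditioned logits
gates = sigmoid(gate_logits)
eta = fading_weights(gate_logits)

# Recursive form: decay the running Gram matrix by the current gate,
# then add the new key's outer product.
H = np.zeros((D, D))
for t in range(T):
    H = gates[t] * H + np.outer(keys[t], keys[t])

print("oldest weight:", eta[0], " newest weight:", eta[-1])
```

Because every gate lies strictly between 0 and 1, the weights shrink monotonically with age, and the recursion never needs to store the individual weights.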

Ablation studies reveal that omitting adaptive regularization causes spiky gradients and training collapse; removing gating degrades recall by 7–10% on retrieval tasks. These components are thus essential to GKA’s performance and stability (Peng et al., 26 Nov 2025).

4. Chebyshev Iteration and Numerical Solvers

Direct inversion or Cholesky decomposition of the $D\times D$ regularized Gram matrix scales as $O(D^3)$, impractical for modern architectures and large $D$. GKA substitutes this with a fixed number of iterations of the Chebyshev method, which offers the following properties:

  • Complexity per step: $O(D^2)$ via matrix-vector products and rank-one updates.
  • Convergence in $O(\sqrt{\kappa})$ steps, up to a logarithmic factor in the target accuracy, where $\kappa$ is the condition number.

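Chebyshev iteration starts from an initial iterate and refines it with a three-term recurrence: each new update direction combines the current residual with the previous update direction, using step and momentum coefficients that are computed in advance from lower and upper bounds on the eigenvalues of the system matrix.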

Compared to conjugate gradient, Chebyshev iteration is more robust in low-precision environments (such as bfloat16) because its coefficients are fixed in advance from spectral bounds rather than computed from inner products of the iterates, which avoids ill-conditioned momentum scaling. The structure of the recurrence also allows the backward pass to be computed without storing all intermediate iterates, reducing memory overhead.
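
A textbook Chebyshev iteration for a symmetric positive-definite system can be sketched as follows. This is the standard three-term form, not necessarily the variant used in GKA; the function name, the Frobenius-norm spectral bound, the coefficient 0.1, and the iteration count are assumptions for illustration.

```python
import numpy as np

def chebyshev_solve(A, b, lo, hi, num_iters):
    """Approximately solve A x = b, with A's eigenvalues assumed in [lo, hi]."""
    theta = 0.5 * (hi + lo)       # centre of the spectral interval
    delta = 0.5 * (hi - lo)       # half-width of the spectral interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x for x = 0
    d = r / theta                 # first update direction
    for _ in range(num_iters):
        x = x + d
        r = r - A @ d             # one matrix-vector product per iteration
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * r
        rho = rho_next
    return x

rng = np.random.default_rng(0)
T, D = 32, 8
K = rng.standard_normal((T, D))
H = K.T @ K
h_norm = np.linalg.norm(H, "fro")
lam = 0.1 * h_norm                # assumed adaptive regulariser
A = H + lam * np.eye(D)
q = rng.standard_normal(D)

# H is positive semi-definite with largest eigenvalue at most ||H||_F,
# so the eigenvalues of A lie in [lam, lam + ||H||_F].
x = chebyshev_solve(A, q, lo=lam, hi=lam + h_norm, num_iters=30)
exact = np.linalg.solve(A, q)
print("relative error:", np.linalg.norm(x - exact) / np.linalg.norm(exact))
```

Note that no inner products of iterates appear in the loop: all coefficients depend only on `lo` and `hi`, which is the property that makes the method attractive in low precision.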

5. Hardware-Aware Implementation

To further optimize for modern accelerators, GKA uses chunk-wise state management:

  • The token stream is divided into fixed-size chunks.
  • Only the states at chunk boundaries are materialized.
  • Within each chunk, key matrices, Gram matrices, and the Frobenius norms are updated in parallel, supporting efficient Chebyshev iterations.
  • The Frobenius norm of the Gram matrix is maintained using block/triangular masks and cumulative product vectors from local chunked key sets, avoiding the storage of a separate Gram matrix for every token.

Backward gradients are handled via implicit differentiation and reapplication of Chebyshev iterations on transposed systems, allowing efficient memory usage and low-latency training on hardware such as GPUs and TPUs.
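
The chunk-boundary idea can be illustrated with a deliberately simplified sketch that ignores gating and the in-chunk masking tricks: the Gram matrix is advanced with one matrix product per chunk, and only the boundary states are kept. The function name and chunk size are hypothetical.

```python
import numpy as np

def chunked_gram_states(keys, chunk_size):
    """Return the running Gram matrix at each chunk boundary (no gating)."""
    T, D = keys.shape
    H = np.zeros((D, D))
    boundary_states = []
    for start in range(0, T, chunk_size):
        chunk = keys[start:start + chunk_size]   # (C, D) block of keys
        H = H + chunk.T @ chunk                  # one matmul per chunk
        boundary_states.append(H.copy())
    return boundary_states

rng = np.random.default_rng(0)
T, D, C = 100, 8, 16
keys = rng.standard_normal((T, D))
states = chunked_gram_states(keys, C)

print("stored states:", len(states), "instead of", T)
print("Frobenius norm at final boundary:", np.linalg.norm(states[-1], "fro"))
```

Storing one state per chunk rather than per token reduces the number of materialized matrices by roughly the chunk size, while the dense per-chunk product maps well onto accelerator matrix units.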

6. Computational Complexity

GKA’s per-token and total complexity is summarized in the following table, where $r$ denotes the number of Chebyshev iterations and $D$ the state dimension:

Operation | Per-Token Complexity | Memory Requirement
Chebyshev solve | $O(rD^2)$ | $O(D^2)$ per chunk state
Running-statistics update | $O(D^2)$ | $O(D^2)$
Total | $O(rD^2)$ | $O(D^2)$

The number of Chebyshev iterations is typically 20–30, independent of sequence length or state dimension. This yields linear compute in sequence length and constant memory with respect to context length, aligning with the best-case characteristics of SSMs.

7. Empirical Findings and Extensions

Empirical results establish GKA’s state-of-the-art performance among linear-memory models:

  • On synthetic associative-recall (MQAR) tasks (up to 8K tokens), GKA outperforms Mamba2, GLA, and Gated DeltaNet by 5–10 points in recall accuracy at matched state dimensions.
  • For short-context language modeling (LM-Harness, 2.8B parameter regime, FDA and SWDE tasks), GKA surpasses SSM baselines by approximately 10% relative, approaching the performance of full softmax Transformers.
  • On long-context applications such as retrieval-augmented generation (RAG) and LongQA (up to 128K tokens), GKA delivers >10% relative improvement over fading-memory methods, with recall competitive to full attention up to 32K tokens.
  • Ablation analyses show that both adaptive regularization and gating are necessary for stability and superior recall; Chebyshev iteration is indispensable for reliable low-precision training and inference.

Further, the architecture supports several prospective enhancements:

  • Sketching the normal equations to a lower dimension can improve the throughput of the Chebyshev iterations by about 10%, with under 1% accuracy loss.
  • Hybrid designs that alternate full-attention heads with GKA yield additional recall gains at low incremental cost.
  • Extensions to kernelized or non-linear ridge regression (deep test-time optimization) are open research directions.
  • Scaling to architectures above 10B parameters and integrating efficient inference schemes (prefix caching, custom kernels) are promising for increased deployment.

GKA thus operationalizes full-history regression within a linear-memory, hardware-friendly architecture, substantially mitigating the historical tradeoff between efficiency and memory retention in neural sequence modeling (Peng et al., 26 Nov 2025).
