
Gated KalmaNet: Full-History, Linear-Memory Model

Updated 3 December 2025
  • Gated KalmaNet (GKA) is a neural sequence layer that fully conditions on past inputs by formulating sequence updates as an online ridge regression problem solved with Chebyshev iteration.
  • It achieves constant-memory, linear-time computation while employing adaptive regularization and gating mechanisms to ensure numerical stability and precise long-range recall.
  • Empirical evaluations reveal that GKA outperforms fading-memory models on long-context tasks and scales efficiently on modern accelerators with ultra-long sequences.

Gated KalmaNet (GKA) is a neural sequence layer that bridges the performance gap between quadratic-cost softmax attention and linear-memory fading-memory state-space models (SSMs). GKA achieves constant-memory, linear-time computation while conditioning the output at each timestep on the complete sequence history, leveraging test-time online ridge regression solved via a numerically stable Chebyshev iteration. This approach retains the efficiency and scalability of SSMs while enabling exact recall of the entire context, addressing limitations inherent in previous methods.

1. Motivation and Relation to Prior Architectures

Traditional softmax attention mechanisms, as used in Transformers, maintain explicit access to all past key–value pairs, enabling “eidetic” memory at quadratic cost in sequence length. This renders ultra-long-context inference (≫10K tokens) computationally expensive and often impractical. Linear SSM layers such as RetNet, Mamba2, DeltaNet, and GLA replace the attention memory with a fixed-size state $S_t \in \mathbb{R}^{D\times D}$, updated by

$$S_t = \gamma_t S_{t-1} + \beta_t v_t k_t^\top, \qquad y_t = S_t q_t,$$

eliminating the quadratic memory cost and reducing per-token computation to $O(D^2)$. However, because $\gamma_t < 1$, the effective state retains only a fading, lossy summary of the distant past, resulting in inferior performance on tasks that require precise, long-range recall.
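For concreteness, the fading-memory recurrence above can be sketched in a few lines of NumPy (an illustration of the generic update only; the scalar gates and names are placeholders, not the exact parameterization of RetNet, Mamba2, DeltaNet, or GLA):

```python
import numpy as np

def fading_memory_step(S, k, v, q, gamma, beta):
    """One step of a generic gated linear-SSM recurrence:
    S_t = gamma_t * S_{t-1} + beta_t * v_t k_t^T,  y_t = S_t q_t."""
    S = gamma * S + beta * np.outer(v, k)   # fixed-size D x D state update
    y = S @ q                               # read-out for the current query
    return S, y

# Toy usage: the state stays D x D regardless of sequence length, but
# contributions from old tokens decay geometrically whenever gamma < 1.
D = 8
S = np.zeros((D, D))
rng = np.random.default_rng(0)
for _ in range(100):
    k, v, q = rng.standard_normal((3, D))
    S, y = fading_memory_step(S, k, v, q, gamma=0.95, beta=1.0)
```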

GKA is designed to preserve the compute and memory efficiency of linear SSMs while, at each timestep, exactly conditioning on all prior inputs. This is accomplished by formulating the sequence model update as a test-time online ridge regression over the entire history, systematically overcoming the recall limitations of fading-memory models (Peng et al., 26 Nov 2025).

2. Mathematical Formulation

At each time $t$, GKA computes a state $S_t$ by solving a regularized, weighted least-squares regression in dual (information) form:

$$\hat S_t = \arg\min_{S\in\mathbb{R}^{D\times D}} \lambda\|S\|_{\mathrm F}^2 + \sum_{i=1}^t \eta_i \|S k_i - v_i\|_2^2,$$

where, for each step $i$:

  • $k_i, v_i \in \mathbb{R}^D$ are the key and value vectors,
  • $\eta_i \in [0,1]$ are learned exponential fading weights,
  • $\lambda > 0$ provides Tikhonov regularization.

The analytic solution is

$$S_t = U_t (H_t + \lambda I)^{-1}, \qquad U_t = \sum_{i=1}^t \eta_i v_i k_i^\top, \quad H_t = \sum_{i=1}^t \eta_i k_i k_i^\top,$$

with $I$ the identity matrix. The output for query $q_t$ is $y_t = S_t q_t = U_t x_t$, where $x_t \approx (H_t + \lambda I)^{-1} q_t$ is computed iteratively rather than by explicit inversion (Section 4).
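A minimal NumPy sketch of this closed form, accumulating $U_t$ and $H_t$ online and reading out $y_t$ with a direct solve (for illustration only; in the layer itself the solve is replaced by the Chebyshev iteration of Section 4, and all names here are placeholders):

```python
import numpy as np

def gka_closed_form_step(U, H, k, v, q, eta, lam):
    """Online ridge-regression read-out:
    U_t = U_{t-1} + eta_t v_t k_t^T,  H_t = H_{t-1} + eta_t k_t k_t^T,
    y_t = U_t (H_t + lam I)^{-1} q_t."""
    U = U + eta * np.outer(v, k)                  # rank-one update of the cross-moment
    H = H + eta * np.outer(k, k)                  # rank-one update of the Gram matrix
    D = H.shape[0]
    x = np.linalg.solve(H + lam * np.eye(D), q)   # x_t = (H_t + lam I)^{-1} q_t
    y = U @ x                                     # y_t = U_t x_t
    return U, H, y

D = 8
U, H = np.zeros((D, D)), np.zeros((D, D))
rng = np.random.default_rng(0)
for _ in range(16):
    k, v, q = rng.standard_normal((3, D))
    U, H, y = gka_closed_form_step(U, H, k, v, q, eta=1.0, lam=1e-2)
```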

By using all past $k_i, v_i$ (with appropriately chosen or learned exponential fading weights), GKA departs from conventional SSMs’ fixed state summaries, providing a theoretically optimal solution to full-history regression under linear-memory constraints.

3. Adaptive Regularization and Gating Mechanisms

Numerical stability of ridge regression deteriorates on long sequences due to the increasing condition number of the matrix $H_t + \lambda I$. Uniform regularization can lead either to loss of memory (if $\lambda$ is too large) or to instability (if it is too small). GKA addresses this with an adaptive regularization schedule

$$\lambda_t = a\,\|H_t\|_{\mathrm F}, \qquad a > 0,$$

where $a$ is a learnable or fixed hyperparameter. This bounds the condition number of $H_t + \lambda_t I$ by the constant $(a+1)/a$, ensuring numerical tractability across sequence lengths and avoiding catastrophic forgetting or gradient instability.
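The bound follows from a one-line estimate (a sketch; it uses only that $H_t$ is positive semidefinite and that $\|H_t\|_2 \le \|H_t\|_{\mathrm F}$):

```latex
\kappa(H_t + \lambda_t I)
  = \frac{\lambda_{\max}(H_t) + \lambda_t}{\lambda_{\min}(H_t) + \lambda_t}
  \le \frac{\|H_t\|_{\mathrm F} + a\,\|H_t\|_{\mathrm F}}{a\,\|H_t\|_{\mathrm F}}
  = \frac{a+1}{a}.
```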

Additionally, recency bias and flexible memory can be learned using an input-conditioned gating architecture. The per-token fading weights $\eta_{t,i}$ are parameterized as

$$\eta_{t,i} = \prod_{j=i+1}^t \gamma_j, \qquad \gamma_j = \sigma\!\left(w_g^\top [x_j; h_{j-1}] + b_g\right),$$

with $\sigma$ the sigmoid activation and $h_{j-1}$ an internal summary state. This product form ensures exponential decay in memory contribution and is implemented efficiently in $O(1)$ memory per token.
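Because $\eta_{t,i} = \prod_{j=i+1}^{t}\gamma_j$, the weighted sums $U_t$ and $H_t$ can be maintained by decaying the running statistics by $\gamma_t$ before each rank-one update. A minimal sketch combining the gate with the adaptive regularizer (illustrative; the gate input and parameter shapes are assumptions, and a direct solve stands in for the Chebyshev iteration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_gka_step(U, H, k, v, q, gate_in, w_g, b_g, a):
    """Gated update with adaptive regularization:
    gamma_t  = sigmoid(w_g^T gate_in + b_g)          (scalar gate in (0, 1))
    U_t      = gamma_t U_{t-1} + v_t k_t^T
    H_t      = gamma_t H_{t-1} + k_t k_t^T
    lambda_t = a * ||H_t||_F
    y_t      = U_t (H_t + lambda_t I)^{-1} q_t."""
    gamma = sigmoid(w_g @ gate_in + b_g)
    # Decay-then-update realizes eta_{t,i} = prod_{j=i+1}^t gamma_j implicitly.
    U = gamma * U + np.outer(v, k)
    H = gamma * H + np.outer(k, k)
    lam = a * np.linalg.norm(H, "fro")               # adaptive Tikhonov term
    D = H.shape[0]
    x = np.linalg.solve(H + lam * np.eye(D), q)      # stands in for Chebyshev
    return U, H, U @ x
```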

Ablation studies reveal that omitting adaptive regularization causes spiky gradients and training collapse; removing gating degrades recall by 7–10% on retrieval tasks. These components are thus essential to GKA’s performance and stability (Peng et al., 26 Nov 2025).

4. Chebyshev Iteration and Numerical Solvers

Direct inversion or Cholesky decomposition of $H_t + \lambda_t I$ scales as $O(D^3)$, impractical for modern architectures and large $D$. GKA substitutes this with $r$ iterations of the Chebyshev method, which offers the following properties:

  • Complexity per step: $O(D^2)$ via matrix–vector products and rank-one updates.
  • Convergence in $O(\sqrt{\kappa}\,\log(1/\varepsilon))$ steps, where $\kappa$ is the condition number.

The Chebyshev iteration is initialized with $\omega_0 = 2$, $\xi^{(-1)} = 0$, $\xi^{(0)} = \tfrac{2}{L+\mu}\,q$, and for $k \ge 0$ it updates

$$\omega_{k+1} = \frac{4}{4 - \rho^2 \omega_k}, \qquad \xi^{(k+1)} = \xi^{(k)} - \frac{2\,\omega_{k+1}}{L+\mu}\left(H\xi^{(k)} - q\right) + (\omega_{k+1}-1)\left(\xi^{(k)} - \xi^{(k-1)}\right),$$

for $\mu = \lambda_t$, $L = \|H_t\|_{\mathrm F} + \lambda_t$, and $\rho = (L-\mu)/(L+\mu)$.
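A self-contained sketch of this recurrence as a solver for $(H_t + \lambda_t I)\,x = q_t$ (a minimal illustration assuming the residual is taken with respect to the regularized matrix; variable names are not the paper's):

```python
import numpy as np

def chebyshev_solve(H, q, lam, num_iter=24):
    """Approximate x = (H + lam*I)^{-1} q with the Chebyshev recurrence.
    mu and L bound the spectrum of A = H + lam*I from below and above."""
    A = H + lam * np.eye(H.shape[0])
    mu = lam                                   # lower bound: H is PSD
    L = np.linalg.norm(H, "fro") + lam         # upper bound: ||H||_2 <= ||H||_F
    rho = (L - mu) / (L + mu)

    omega = 2.0                                # omega_0
    xi_prev = np.zeros_like(q)                 # xi^(-1)
    xi = (2.0 / (L + mu)) * q                  # xi^(0)
    for _ in range(num_iter):
        omega = 4.0 / (4.0 - rho**2 * omega)
        residual = A @ xi - q
        xi, xi_prev = (xi - (2.0 * omega / (L + mu)) * residual
                       + (omega - 1.0) * (xi - xi_prev)), xi
    return xi

# Quick check against a direct solve (toy sizes).
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 8))
H = K.T @ K
q = rng.standard_normal(8)
lam = 0.1 * np.linalg.norm(H, "fro")
x_cheb = chebyshev_solve(H, q, lam)
x_ref = np.linalg.solve(H + lam * np.eye(8), q)
```

With the adaptive $\lambda_t$ of Section 3, the bounds $\mu$ and $L$ keep the condition number small, so a fixed, small iteration count suffices, consistent with the $r$ of roughly 20–30 quoted in Section 6.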

Compared to conjugate gradient, Chebyshev iteration is more robust in low-precision environments (such as bfloat16) because it avoids ill-conditioned momentum scaling. The recurrence’s backward structure allows the backward pass to be computed without storing all intermediate iterates, reducing memory overhead.

5. Hardware-Aware Implementation

To further optimize for modern accelerators, GKA uses chunk-wise state management:

  • The token stream of length $T$ is divided into $N = T/C$ chunks of size $C$.
  • Only the states $H_{t_0}$ are materialized at chunk boundaries ($t_0 = 0, C, 2C, \dots$).
  • Within each chunk, key matrices, Gram matrices, and Frobenius norms are updated in parallel, supporting efficient Chebyshev iterations.
  • The Frobenius norm $\|H_{t_0 + c}\|_{\mathrm F}$ is maintained using block/triangular masks and cumulative product vectors from local chunked key sets, avoiding $O(D^2)$ space per token (a simplified sketch of the chunked accumulation follows this list).
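A highly simplified sketch of that chunked accumulation pattern (illustrative only; it drops the gating weights, masking details, and the parallel intra-chunk Chebyshev solves):

```python
import numpy as np

def chunked_gram_states(K, chunk_size):
    """Materialize H only at chunk boundaries; inside a chunk, H_{t0+c} is
    the boundary state plus a prefix sum over the chunk's local keys.
    K has shape (T, D); returns the boundary states H_0, H_C, H_{2C}, ..."""
    T, D = K.shape
    boundary_states = [np.zeros((D, D))]
    for start in range(0, T, chunk_size):
        chunk = K[start:start + chunk_size]          # (C, D) local keys
        H = boundary_states[-1] + chunk.T @ chunk    # next boundary state
        boundary_states.append(H)
    return boundary_states

# Within a chunk, H_{t0+c} = H_{t0} + sum_{j<=c} k_j k_j^T can be formed on the
# fly (e.g. via masked matmuls), so per-token D x D states are never stored.
```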

Backward gradients are handled via implicit differentiation and reapplication of Chebyshev iterations on transposed systems, allowing efficient memory usage and low-latency training on hardware such as GPUs and TPUs.
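As a rough sketch of the implicit-differentiation idea (assuming the standard adjoint identities for a symmetric linear solve; this is not the paper's backward kernel): if $x = A^{-1}q$ with $A = H + \lambda I$ symmetric, then $\bar q = A^{-1}\bar x$ and $\bar H = -\bar q\,x^\top$, so the backward pass reduces to one more solve with the same matrix and no stored iterates.

```python
import numpy as np

def solve_fwd_bwd(H, q, lam, solver):
    """Forward: x = (H + lam I)^{-1} q.
    Backward (implicit differentiation): for upstream gradient x_bar,
      q_bar   = (H + lam I)^{-1} x_bar   (same symmetric system, re-solved)
      H_bar   = -outer(q_bar, x)
      lam_bar = -q_bar . x
    so no intermediate solver iterates need to be stored."""
    x = solver(H, q, lam)

    def backward(x_bar):
        q_bar = solver(H, x_bar, lam)     # e.g. the chebyshev_solve sketch above
        H_bar = -np.outer(q_bar, x)
        lam_bar = -float(q_bar @ x)
        return q_bar, H_bar, lam_bar

    return x, backward
```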

6. Computational Complexity

GKA’s per-token and total complexity is summarized in the following table:

| Operation | Compute | Memory |
|---|---|---|
| Chebyshev solve (per token) | $O(r D^2)$ | $O(D^2)$ per chunk state |
| Update of $U_t$, $H_t$ (per token) | $O(D^2)$ | $O(D^2)$ |
| Total (sequence of length $T$) | $O(r T D^2)$ | $O(D^2)$ |

Here $r$ is the number of Chebyshev iterations (typically 20–30, independent of $T$ or $D$). This yields compute linear in sequence length, $O(T)$, and memory constant with respect to context length, aligning with the best-case characteristics of SSMs.
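As a purely illustrative, hypothetical instance of these figures (ignoring constants and head counts; $D = 128$, $r = 24$, $T = 32\mathrm{K}$ are not values from the paper), the per-token multiply–accumulate count stays flat for GKA while autoregressive softmax attention grows with position:

```latex
r\,D^2 = 24 \cdot 128^2 \approx 3.9\times 10^{5}
\qquad \text{vs.} \qquad
T\,D = 32{,}768 \cdot 128 \approx 4.2\times 10^{6}.
```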

7. Empirical Findings and Extensions

Empirical results establish GKA’s state-of-the-art performance among linear-memory models:

  • On synthetic associative-recall (MQAR) tasks (up to 8K tokens), GKA outperforms Mamba2, GLA, and Gated DeltaNet by 5–10 points in recall accuracy at matched state dimensions.
  • For short-context language modeling (LM-Harness, 2.8B parameter regime, FDA and SWDE tasks), GKA surpasses SSM baselines by approximately 10% relative, approaching the performance of full softmax Transformers.
  • On long-context applications such as retrieval-augmented generation (RAG) and LongQA (up to 128K tokens), GKA delivers >10% relative improvement over fading-memory methods, with recall competitive to full attention up to 32K tokens.
  • Ablation analyses show that both adaptive regularization and gating are necessary for stability and superior recall; Chebyshev iteration is indispensable for reliable low-precision training and inference.

Further, the architecture supports several prospective enhancements:

  • Sketching the normal equations to dimension $d \ll D$ can accelerate Chebyshev iterations by about 10% throughput, with under 1% accuracy loss.
  • Hybrid designs that alternate full-attention heads with GKA yield additional recall gains at low incremental cost.
  • Extensions to kernelized or non-linear ridge regression (deep test-time optimization) are open research directions.
  • Scaling to architectures above 10B parameters and integrating efficient inference schemes (prefix caching, custom kernels) are promising for increased deployment.

GKA thus operationalizes full-history regression within a linear-memory, hardware-friendly architecture, substantially mitigating the historical tradeoff between efficiency and memory retention in neural sequence modeling (Peng et al., 26 Nov 2025).
