
Gated KalmaNet: Full-History, Linear-Memory Model

Updated 3 December 2025
  • Gated KalmaNet (GKA) is a neural sequence layer that fully conditions on past inputs by formulating sequence updates as an online ridge regression problem solved with Chebyshev iteration.
  • It achieves constant-memory, linear-time computation while employing adaptive regularization and gating mechanisms to ensure numerical stability and precise long-range recall.
  • Empirical evaluations reveal that GKA outperforms fading-memory models on long-context tasks and scales efficiently on modern accelerators with ultra-long sequences.

Gated KalmaNet (GKA) is a neural sequence layer that bridges the performance gap between quadratic-cost softmax attention and linear-memory fading-memory state-space models (SSMs). GKA achieves constant-memory, linear-time computation while conditioning the output at each timestep on the complete sequence history, leveraging test-time online ridge regression solved via a numerically stable Chebyshev iteration. This approach retains the efficiency and scalability of SSMs while enabling exact recall of the entire context, addressing limitations inherent in previous methods.

1. Motivation and Relation to Prior Architectures

Traditional softmax attention mechanisms, as used in Transformers, maintain explicit access to all past key–value pairs, enabling “eidetic” memory at quadratic cost in sequence length. This renders ultra-long-context inference (≫10K tokens) computationally expensive and often impractical. Linear SSM layers such as RetNet, Mamba2, DeltaNet, and GLA replace the attention memory with a fixed-size state $S_t \in \mathbb{R}^{D\times D}$, updated by

$$S_t = \gamma_t S_{t-1} + \beta_t v_t k_t^\top, \qquad y_t = S_t q_t,$$

eliminating the quadratic memory cost and reducing per-token computation to $O(D^2)$. However, because $\gamma_t < 1$, the effective state retains only a fading, lossy summary of the distant past, resulting in inferior performance on tasks that require precise, long-range recall.
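For concreteness, the fading-memory recurrence above can be sketched in a few lines of NumPy (an illustration of the generic update only; the scalar gates and names are placeholders, not the exact parameterization of RetNet, Mamba2, DeltaNet, or GLA):

```python
import numpy as np

def fading_memory_step(S, k, v, q, gamma, beta):
    """One step of a generic gated linear-SSM recurrence:
    S_t = gamma_t * S_{t-1} + beta_t * v_t k_t^T,  y_t = S_t q_t."""
    S = gamma * S + beta * np.outer(v, k)   # fixed-size D x D state update
    y = S @ q                               # read-out for the current query
    return S, y

# Toy usage: the state stays D x D regardless of sequence length, but
# contributions from old tokens decay geometrically whenever gamma < 1.
D = 8
S = np.zeros((D, D))
rng = np.random.default_rng(0)
for _ in range(100):
    k, v, q = rng.standard_normal((3, D))
    S, y = fading_memory_step(S, k, v, q, gamma=0.95, beta=1.0)
```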

GKA is designed to preserve the compute and memory efficiency of linear SSMs while, at each timestep, exactly conditioning on all prior inputs. This is accomplished by formulating the sequence model update as a test-time online ridge regression over the entire history, systematically overcoming the recall limitations of fading-memory models (Peng et al., 26 Nov 2025).

2. Mathematical Formulation

At each time $t$, GKA computes a state $S_t$ by solving a regularized, weighted least-squares regression in dual (information) form:

$$\hat S_t = \arg\min_{S\in\mathbb{R}^{D\times D}} \lambda\|S\|_{\mathrm F}^2 + \sum_{i=1}^t \eta_i \|S k_i - v_i\|_2^2,$$

where, for each step $i$:

  • $k_i, v_i \in \mathbb{R}^D$ are the key and value vectors,
  • $\eta_i \in [0,1]$ are learned exponential fading weights,
  • $\lambda > 0$ provides Tikhonov regularization.

The analytic solution is

$$S_t = U_t (H_t + \lambda I)^{-1}, \qquad U_t = \sum_{i=1}^t \eta_i v_i k_i^\top, \quad H_t = \sum_{i=1}^t \eta_i k_i k_i^\top,$$

with $I$ the identity matrix. The output for query $q_t$ is $y_t = S_t q_t = U_t x_t$, where $x_t \approx (H_t + \lambda I)^{-1} q_t$ is computed iteratively rather than by explicit inversion (Section 4).
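A minimal NumPy sketch of this closed form, accumulating $U_t$ and $H_t$ online and reading out $y_t$ with a direct solve (for illustration only; in the layer itself the solve is replaced by the Chebyshev iteration of Section 4, and all names here are placeholders):

```python
import numpy as np

def gka_closed_form_step(U, H, k, v, q, eta, lam):
    """Online ridge-regression read-out:
    U_t = U_{t-1} + eta_t v_t k_t^T,  H_t = H_{t-1} + eta_t k_t k_t^T,
    y_t = U_t (H_t + lam I)^{-1} q_t."""
    U = U + eta * np.outer(v, k)                  # rank-one update of the cross-moment
    H = H + eta * np.outer(k, k)                  # rank-one update of the Gram matrix
    D = H.shape[0]
    x = np.linalg.solve(H + lam * np.eye(D), q)   # x_t = (H_t + lam I)^{-1} q_t
    y = U @ x                                     # y_t = U_t x_t
    return U, H, y

D = 8
U, H = np.zeros((D, D)), np.zeros((D, D))
rng = np.random.default_rng(0)
for _ in range(16):
    k, v, q = rng.standard_normal((3, D))
    U, H, y = gka_closed_form_step(U, H, k, v, q, eta=1.0, lam=1e-2)
```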

By using all past $k_i, v_i$ (with appropriately chosen or learned exponential fading weights), GKA departs from conventional SSMs’ fixed state summaries, providing a theoretically optimal solution to full-history regression under linear-memory constraints.

3. Adaptive Regularization and Gating Mechanisms

Numerical stability of ridge regression deteriorates on long sequences due to the increasing condition number of the matrix $H_t + \lambda I$. Uniform regularization can lead either to loss of memory (if $\lambda$ is too large) or to instability (if it is too small). GKA addresses this with an adaptive regularization schedule

$$\lambda_t = a\,\|H_t\|_{\mathrm F}, \qquad a > 0,$$

where $a$ is a learnable or fixed hyperparameter. This bounds the condition number of $H_t + \lambda_t I$ by the constant $(a+1)/a$, ensuring numerical tractability across sequence lengths and avoiding catastrophic forgetting or gradient instability.
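The bound follows from a one-line estimate (a sketch; it uses only that $H_t$ is positive semidefinite and that $\|H_t\|_2 \le \|H_t\|_{\mathrm F}$):

```latex
\kappa(H_t + \lambda_t I)
  = \frac{\lambda_{\max}(H_t) + \lambda_t}{\lambda_{\min}(H_t) + \lambda_t}
  \le \frac{\|H_t\|_{\mathrm F} + a\,\|H_t\|_{\mathrm F}}{a\,\|H_t\|_{\mathrm F}}
  = \frac{a+1}{a}.
```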

Additionally, recency bias and flexible memory can be learned using an input-conditioned gating architecture. The per-token fading weights $\eta_{t,i}$ are parameterized as

$$\eta_{t,i} = \prod_{j=i+1}^t \gamma_j, \qquad \gamma_j = \sigma\!\left(w_g^\top [x_j; h_{j-1}] + b_g\right),$$

with $\sigma$ the sigmoid activation and $h_{j-1}$ an internal summary state. This product form ensures exponential decay in memory contribution and is implemented efficiently in $O(1)$ memory per token.
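Because $\eta_{t,i} = \prod_{j=i+1}^{t}\gamma_j$, the weighted sums $U_t$ and $H_t$ can be maintained by decaying the running statistics by $\gamma_t$ before each rank-one update. A minimal sketch combining the gate with the adaptive regularizer (illustrative; the gate input and parameter shapes are assumptions, and a direct solve stands in for the Chebyshev iteration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_gka_step(U, H, k, v, q, gate_in, w_g, b_g, a):
    """Gated update with adaptive regularization:
    gamma_t  = sigmoid(w_g^T gate_in + b_g)          (scalar gate in (0, 1))
    U_t      = gamma_t U_{t-1} + v_t k_t^T
    H_t      = gamma_t H_{t-1} + k_t k_t^T
    lambda_t = a * ||H_t||_F
    y_t      = U_t (H_t + lambda_t I)^{-1} q_t."""
    gamma = sigmoid(w_g @ gate_in + b_g)
    # Decay-then-update realizes eta_{t,i} = prod_{j=i+1}^t gamma_j implicitly.
    U = gamma * U + np.outer(v, k)
    H = gamma * H + np.outer(k, k)
    lam = a * np.linalg.norm(H, "fro")               # adaptive Tikhonov term
    D = H.shape[0]
    x = np.linalg.solve(H + lam * np.eye(D), q)      # stands in for Chebyshev
    return U, H, U @ x
```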

Ablation studies reveal that omitting adaptive regularization causes spiky gradients and training collapse; removing gating degrades recall by 7–10% on retrieval tasks. These components are thus essential to GKA’s performance and stability (Peng et al., 26 Nov 2025).

4. Chebyshev Iteration and Numerical Solvers

Direct inversion or Cholesky decomposition of $H_t + \lambda_t I$ scales as $O(D^3)$, impractical for modern architectures and large $D$. GKA substitutes this with $r$ iterations of the Chebyshev method, which offers the following properties:

  • Complexity per step: $O(D^2)$ via matrix–vector products and rank-one updates.
  • Convergence in $O(\sqrt{\kappa}\,\log(1/\varepsilon))$ steps, where $\kappa$ is the condition number.

The Chebyshev iteration is initialized with $\omega_0 = 2$, $\xi^{(-1)} = 0$, $\xi^{(0)} = \tfrac{2}{L+\mu}\,q$, and for $k \ge 0$ it updates

$$\omega_{k+1} = \frac{4}{4 - \rho^2 \omega_k}, \qquad \xi^{(k+1)} = \xi^{(k)} - \frac{2\,\omega_{k+1}}{L+\mu}\left(H\xi^{(k)} - q\right) + (\omega_{k+1}-1)\left(\xi^{(k)} - \xi^{(k-1)}\right),$$

for $\mu = \lambda_t$, $L = \|H_t\|_{\mathrm F} + \lambda_t$, and $\rho = (L-\mu)/(L+\mu)$.
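A self-contained sketch of this recurrence as a solver for $(H_t + \lambda_t I)\,x = q_t$ (a minimal illustration assuming the residual is taken with respect to the regularized matrix; variable names are not the paper's):

```python
import numpy as np

def chebyshev_solve(H, q, lam, num_iter=24):
    """Approximate x = (H + lam*I)^{-1} q with the Chebyshev recurrence.
    mu and L bound the spectrum of A = H + lam*I from below and above."""
    A = H + lam * np.eye(H.shape[0])
    mu = lam                                   # lower bound: H is PSD
    L = np.linalg.norm(H, "fro") + lam         # upper bound: ||H||_2 <= ||H||_F
    rho = (L - mu) / (L + mu)

    omega = 2.0                                # omega_0
    xi_prev = np.zeros_like(q)                 # xi^(-1)
    xi = (2.0 / (L + mu)) * q                  # xi^(0)
    for _ in range(num_iter):
        omega = 4.0 / (4.0 - rho**2 * omega)
        residual = A @ xi - q
        xi, xi_prev = (xi - (2.0 * omega / (L + mu)) * residual
                       + (omega - 1.0) * (xi - xi_prev)), xi
    return xi

# Quick check against a direct solve (toy sizes).
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 8))
H = K.T @ K
q = rng.standard_normal(8)
lam = 0.1 * np.linalg.norm(H, "fro")
x_cheb = chebyshev_solve(H, q, lam)
x_ref = np.linalg.solve(H + lam * np.eye(8), q)
```

With the adaptive $\lambda_t$ of Section 3, the bounds $\mu$ and $L$ keep the condition number small, so a fixed, small iteration count suffices, consistent with the $r$ of roughly 20–30 quoted in Section 6.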

Compared to conjugate gradient, Chebyshev iteration is more robust in low-precision environments (such as bfloat16) because it avoids ill-conditioned momentum scaling. The recurrence’s backward structure allows the backward pass to be computed without storing all intermediate iterates, reducing memory overhead.

5. Hardware-Aware Implementation

To further optimize for modern accelerators, GKA uses chunk-wise state management:

  • The token stream of length $T$ is divided into $N = T/C$ chunks of size $C$.
  • Only the states $H_{t_0}$ are materialized at chunk boundaries ($t_0 = 0, C, 2C, \dots$).
  • Within each chunk, key matrices, Gram matrices, and Frobenius norms are updated in parallel, supporting efficient Chebyshev iterations.
  • The Frobenius norm $\|H_{t_0 + c}\|_{\mathrm F}$ is maintained using block/triangular masks and cumulative product vectors from local chunked key sets, avoiding $O(D^2)$ space per token (a simplified sketch of the chunked accumulation follows this list).
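A highly simplified sketch of that chunked accumulation pattern (illustrative only; it drops the gating weights, masking details, and the parallel intra-chunk Chebyshev solves):

```python
import numpy as np

def chunked_gram_states(K, chunk_size):
    """Materialize H only at chunk boundaries; inside a chunk, H_{t0+c} is
    the boundary state plus a prefix sum over the chunk's local keys.
    K has shape (T, D); returns the boundary states H_0, H_C, H_{2C}, ..."""
    T, D = K.shape
    boundary_states = [np.zeros((D, D))]
    for start in range(0, T, chunk_size):
        chunk = K[start:start + chunk_size]          # (C, D) local keys
        H = boundary_states[-1] + chunk.T @ chunk    # next boundary state
        boundary_states.append(H)
    return boundary_states

# Within a chunk, H_{t0+c} = H_{t0} + sum_{j<=c} k_j k_j^T can be formed on the
# fly (e.g. via masked matmuls), so per-token D x D states are never stored.
```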

Backward gradients are handled via implicit differentiation and reapplication of Chebyshev iterations on transposed systems, allowing efficient memory usage and low-latency training on hardware such as GPUs and TPUs.
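As a rough sketch of the implicit-differentiation idea (assuming the standard adjoint identities for a symmetric linear solve; this is not the paper's backward kernel): if $x = A^{-1}q$ with $A = H + \lambda I$ symmetric, then $\bar q = A^{-1}\bar x$ and $\bar H = -\bar q\,x^\top$, so the backward pass reduces to one more solve with the same matrix and no stored iterates.

```python
import numpy as np

def solve_fwd_bwd(H, q, lam, solver):
    """Forward: x = (H + lam I)^{-1} q.
    Backward (implicit differentiation): for upstream gradient x_bar,
      q_bar   = (H + lam I)^{-1} x_bar   (same symmetric system, re-solved)
      H_bar   = -outer(q_bar, x)
      lam_bar = -q_bar . x
    so no intermediate solver iterates need to be stored."""
    x = solver(H, q, lam)

    def backward(x_bar):
        q_bar = solver(H, x_bar, lam)     # e.g. the chebyshev_solve sketch above
        H_bar = -np.outer(q_bar, x)
        lam_bar = -float(q_bar @ x)
        return q_bar, H_bar, lam_bar

    return x, backward
```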

6. Computational Complexity

GKA’s per-token and total complexity is summarized in the following table:

| Operation | Compute | Memory |
|---|---|---|
| Chebyshev solve (per token) | $O(r D^2)$ | $O(D^2)$ per chunk state |
| Update of $U_t$, $H_t$ (per token) | $O(D^2)$ | $O(D^2)$ |
| Total (sequence of length $T$) | $O(r T D^2)$ | $O(D^2)$ |

Here $r$ is the number of Chebyshev iterations (typically 20–30, independent of $T$ or $D$). This yields compute linear in sequence length, $O(T)$, and memory constant with respect to context length, aligning with the best-case characteristics of SSMs.
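As a purely illustrative, hypothetical instance of these figures (ignoring constants and head counts; $D = 128$, $r = 24$, $T = 32\mathrm{K}$ are not values from the paper), the per-token multiply–accumulate count stays flat for GKA while autoregressive softmax attention grows with position:

```latex
r\,D^2 = 24 \cdot 128^2 \approx 3.9\times 10^{5}
\qquad \text{vs.} \qquad
T\,D = 32{,}768 \cdot 128 \approx 4.2\times 10^{6}.
```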

7. Empirical Findings and Extensions

Empirical results establish GKA’s state-of-the-art performance among linear-memory models:

  • On synthetic associative-recall (MQAR) tasks (up to 8K tokens), GKA outperforms Mamba2, GLA, and Gated DeltaNet by 5–10 points in recall accuracy at matched state dimensions.
  • For short-context language modeling (LM-Harness, 2.8B parameter regime, FDA and SWDE tasks), GKA surpasses SSM baselines by approximately 10% relative, approaching the performance of full softmax Transformers.
  • On long-context applications such as retrieval-augmented generation (RAG) and LongQA (up to 128K tokens), GKA delivers >10% relative improvement over fading-memory methods, with recall competitive to full attention up to 32K tokens.
  • Ablation analyses show that both adaptive regularization and gating are necessary for stability and superior recall; Chebyshev iteration is indispensable for reliable low-precision training and inference.

Further, the architecture supports several prospective enhancements:

  • Sketching the normal equations to dimension $d \ll D$ can accelerate Chebyshev iterations by about 10% throughput, with under 1% accuracy loss.
  • Hybrid designs that alternate full-attention heads with GKA yield additional recall gains at low incremental cost.
  • Extensions to kernelized or non-linear ridge regression (deep test-time optimization) are open research directions.
  • Scaling to architectures above 10B parameters and integrating efficient inference schemes (prefix caching, custom kernels) are promising for increased deployment.

GKA thus operationalizes full-history regression within a linear-memory, hardware-friendly architecture, substantially mitigating the historical tradeoff between efficiency and memory retention in neural sequence modeling (Peng et al., 26 Nov 2025).
