
Infinite Multi-State RNNs

Updated 28 April 2026
  • Infinite Multi-State RNNs are an extension of classical RNNs where a new hidden state is appended at each timestep, yielding an unbounded memory.
  • The framework unifies transformer key-value caches with memory recurrent units, leveraging mechanisms like TOVA for efficient and bounded memory compression.
  • Empirical studies show minimal performance loss with aggressive memory compression, enabling memory-efficient language modeling and improved parallel recurrent designs.

Infinite Multi-State Recurrent Neural Networks (MSRNNs) generalize classical RNNs by maintaining an extensible collection of hidden states at each time step, rather than a single vector. In the infinite case, the number of such states grows without bound with sequence length, yielding a memory structure analogous to the key-value cache of transformers. This framework formally unifies sequence models with persistent, non-vanishing memory, encompassing both transformers and recently proposed memory recurrent units (MRUs) that leverage multistability to encode persistent information. Conceptually, infinite MSRNNs model sequences by continuously appending new state components; in practice, compression mechanisms such as Token Omission Via Attention (TOVA) bound this memory with little accuracy loss. This perspective yields practical methods for memory-efficient language modeling, opens new paths for parallelizable recurrent design, and situates transformer and advanced RNN architectures along a common theoretical axis (Oren et al., 2024, Geeter et al., 14 Jan 2026).

1. Formalism: Multi-State RNNs and the Infinite Case

A classical RNN layer maintains at each timestep $t$ a single hidden vector $h_t^l \in \mathbb{R}^d$. In contrast, an MSRNN layer stores a matrix of hidden states,

$$H_t^l = \begin{bmatrix} h_t^{l,1} \\ h_t^{l,2} \\ \vdots \\ h_t^{l,g(t)} \end{bmatrix} \in \mathbb{R}^{g(t) \times d},$$

where $g(t)$ is the number of concurrent states. The dynamics are given by a per-state recurrence,

$$h_t^{l,i} = f_i^l(h_{t-1}^{l,i}, x_t^l), \quad i = 1, \ldots, g(t),$$

with the full state updated via

$$(x_t^{l+1}, H_t^l) = f_{\mathrm{MSRNN}}^l(x_t^l, H_{t-1}^l).$$

When $g(t)=1$, this reduces to a standard RNN; a fixed $g(t)=k>1$ yields a finite-capacity MSRNN. The infinite MSRNN is defined by $g(t)=t$: every timestep appends a new state, resulting in unbounded memory (Oren et al., 2024).
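The regimes of $g(t)$ can be made concrete with a small sketch. The Python below is illustrative only: `msrnn_step`, `identity_update`, and `state_counts` are hypothetical names, the per-state recurrence is assumed to be the identity, and the bounded variant keeps the most recent states purely as a placeholder policy. It records how many states a layer holds after each timestep.

```python
from typing import Callable, List, Optional

Vector = List[float]

def identity_update(h: Vector, x: Vector) -> Vector:
    """Assumed per-state recurrence f_i: leave an existing state unchanged."""
    return h

def msrnn_step(states: List[Vector], x: Vector,
               update: Callable[[Vector, Vector], Vector] = identity_update,
               max_states: Optional[int] = None) -> List[Vector]:
    """Update every existing state, then append a new state for the current input."""
    updated = [update(h, x) for h in states]
    updated.append(list(x))
    if max_states is not None and len(updated) > max_states:
        # Placeholder policy (an assumption): keep only the most recent states.
        updated = updated[-max_states:]
    return updated

def state_counts(inputs: List[Vector], max_states: Optional[int] = None) -> List[int]:
    """Return g(t), the number of states held after each timestep."""
    states: List[Vector] = []
    counts = []
    for x in inputs:
        states = msrnn_step(states, x, max_states=max_states)
        counts.append(len(states))
    return counts

inputs = [[float(t)] for t in range(1, 7)]
infinite_counts = state_counts(inputs)               # g(t) = t
bounded_counts = state_counts(inputs, max_states=3)  # g(t) = min(t, 3)
print(infinite_counts, bounded_counts)
```

With six inputs, the unbounded run holds one more state at every step, while the run capped at three states levels off once the cap is reached.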

2. Infinite MSRNNs as a Model of the Transformer Key-Value Cache

Decoder-only transformers, commonly considered distinct from RNNs, match precisely the infinite MSRNN formalism. At each step, transformers accumulate a growing sequence of key-value pairs per layer,

$$K_t^l = [k_1^l, \dots, k_t^l]^\top, \quad V_t^l = [v_1^l, \dots, v_t^l]^\top \in \mathbb{R}^{t \times d}.$$

Each row pair $(k_i^l, v_i^l)$ constitutes an MSRNN sub-state $h_t^{l,i}$. At every step, a new key-value pair is appended, and the output is computed via attention-weighted summation over all cached values. Unlike conventional RNNs, these states are not overwritten or decayed but persist indefinitely, an instantiation of an infinite-capacity recurrent system. In this framing, transformer autoregressive decoding becomes a dynamic recurrent process with an unbounded hidden state (Oren et al., 2024).
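A minimal single-head sketch of this view, in plain Python: the cache plays the role of the recurrent state, and each decoding step appends one row and attends over all rows. The function names are hypothetical, and passing the input vector directly as query, key, and value is an assumption that stands in for the learned projections of a real transformer layer.

```python
import math
from typing import List, Tuple

Vector = List[float]
Cache = Tuple[List[Vector], List[Vector]]  # (keys, values), one row per timestep

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(cache: Cache, q: Vector, k: Vector, v: Vector) -> Tuple[Vector, Cache]:
    """One recurrent step: append (k, v) to the state, then attend over every row."""
    keys, values = cache
    keys, values = keys + [k], values + [v]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    weights = softmax(scores)
    out = [sum(w * row[j] for w, row in zip(weights, values)) for j in range(len(v))]
    return out, (keys, values)

cache: Cache = ([], [])
for t in range(1, 5):
    x = [float(t), 1.0]
    out, cache = attention_step(cache, q=x, k=x, v=x)
cache_rows = len(cache[0])
print(cache_rows)
```

Nothing is ever removed from `cache`, so its row count equals the number of steps taken, which is the $g(t)=t$ regime.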

3. Memory Compression: From Infinite to Bounded MSRNNs

Practical deployment of infinite MSRNNs is limited by hardware memory constraints, since the hidden state (key-value cache) grows as $O(t)$, where $t$ is the sequence length. To address this, bounded MSRNNs enforce $g(t) \le k$ for some fixed $k$, and apply a compression policy when $t > k$.
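A back-of-the-envelope calculation gives a rough sense of scale. The configuration below (32 layers, width 4096, 16-bit values, keys and values both of full model width) is an assumed example chosen for illustration, not a figure from the cited work, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers: int, tokens: int, width: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for `tokens` positions in every layer."""
    return 2 * layers * tokens * width * bytes_per_value  # factor 2: keys and values

MIB = 1024 ** 2

full = kv_cache_bytes(layers=32, tokens=4096, width=4096)   # unbounded, after 4096 tokens
capped = kv_cache_bytes(layers=32, tokens=512, width=4096)  # bounded at 512 rows

print(f"full cache:   {full / MIB:.0f} MiB")
print(f"capped cache: {capped / MIB:.0f} MiB")
```

Under these assumptions the unbounded cache keeps growing with every generated token, whereas the capped cache stays at one eighth of the 4096-token footprint no matter how long generation continues.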

Token Omission Via Attention (TOVA) is a training-free, greedy policy that, at each decoding step, omits the least attended token (row) from the key-value cache of each layer. Operationally:

  1. Append the current key-value pair to the cache.
  2. Compute attention scores for all $k+1$ tokens.
  3. Remove the token with minimal attention.

Once the cache is full, this yields exactly $k$ cache rows after each step, dynamically selecting the most salient information during generation. TOVA is parameter-free, requires no retraining, and is applied independently per layer (Oren et al., 2024).
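The three steps above can be sketched for a single attention head in plain Python. This is a simplified illustration rather than the authors' implementation: `tova_step` and `limit` are hypothetical names, the query, key, and value vectors are passed in directly, the attention output is assumed to be computed over all rows before the eviction, and any aggregation across heads is omitted.

```python
import math
from typing import List, Tuple

Vector = List[float]
Cache = Tuple[List[Vector], List[Vector]]  # (keys, values), one row per kept token

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tova_step(cache: Cache, q: Vector, k: Vector, v: Vector,
              limit: int) -> Tuple[Vector, Cache]:
    keys, values = cache
    # Step 1: append the current key-value pair.
    keys, values = keys + [k], values + [v]
    # Step 2: attention of the current query over every cached row.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, key)) for key in keys])
    out = [sum(w * row[j] for w, row in zip(weights, values)) for j in range(len(v))]
    # Step 3: if over budget, drop the row with the smallest attention weight.
    if len(keys) > limit:
        drop = min(range(len(weights)), key=weights.__getitem__)
        keys = keys[:drop] + keys[drop + 1:]
        values = values[:drop] + values[drop + 1:]
    return out, (keys, values)

cache: Cache = ([], [])
for key, value in [([5.0], [0.0]), ([-5.0], [1.0]), ([1.0], [2.0])]:
    out, cache = tova_step(cache, q=[1.0], k=key, v=value, limit=2)
print(cache[0])
```

In this toy run the third step pushes the cache over its budget of two rows, and the row whose key aligns worst with the query is the one evicted. Note that the dropped row need not be the oldest one, which is what distinguishes this policy from a sliding window.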

4. Empirical Assessment and Benchmarking

TOVA has been empirically validated across four long-context tasks (language modeling, summarization, long-range QA, story generation) and three LLMs (LLaMA-2-7B, Mistral-7B, Yi-7B). Results demonstrate negligible loss despite aggressive compression:

  • Language modeling (PG-19): […] perplexity overhead at […] (¼ cache); for Mistral-7B, […] (⅛ cache) is within […] perplexity.
  • Summarization (SQuALITY, ROUGE geometric mean): […] within 1 ROUGE point of the full cache for all models; […] within 1 point for LLaMA-2 and Yi-7B, within 0.8 for Mistral-7B.
  • Long-range QA (QASPER, F1): […] within 1 F1 of full, […] within 1.5 F1.
  • Story generation (GPT-4 preference): at […], TOVA matches/wins 53% vs. full; at […] 81%; at […] 94%.

On GPUs, shrinking the cache size to […] (via […]) yields an estimated […] throughput gain, while nearly matching full-context performance in all evaluated domains (Oren et al., 2024).

5. Multistability and Persistent Memory in MRUs

The infinite MSRNN concept is tightly linked to multistability in MRUs, a class of RNNs capable of encoding persistent memories as attractors of the recurrence. An MRU is defined by an implicit update […], with the next hidden state selected by continuity from the previous one. When this update has several attracting roots, the cell is multistable and can persistently store one distinct memory state per attracting root; which state is held depends only on initialization and input, rather than on transient dynamical behavior (Geeter et al., 14 Jan 2026).

To construct truly infinite memory, one replaces the sum over discrete attractors with an integral or a periodic nonlinearity with infinitely many intersections with the identity, such as a sawtooth function. Stability at each attractor is validated by ensuring all Jacobian eigenvalues have magnitude less than unity.
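A toy scalar example illustrates both ideas: a periodic nonlinearity that meets the identity infinitely often, and the derivative test for stability. The map below is an assumption made for illustration (a smooth sine-based stand-in for the sawtooth, not the MRU update from the cited work). It has a fixed point at every half-integer, and the fixed points at the integers are attracting.

```python
import math

def step(h: float) -> float:
    """Toy update h -> h - sin(2*pi*h) / (4*pi); fixed wherever sin(2*pi*h) = 0."""
    return h - math.sin(2.0 * math.pi * h) / (4.0 * math.pi)

def slope(h: float) -> float:
    """Derivative of `step` at h, i.e. the 1x1 Jacobian of the update."""
    return 1.0 - 0.5 * math.cos(2.0 * math.pi * h)

def settle(h0: float, steps: int = 200) -> float:
    """Iterate the update from h0 and return the state it settles on."""
    h = h0
    for _ in range(steps):
        h = step(h)
    return h

for start in (2.3, -0.8, 41.6):
    end = settle(start)
    print(f"start {start:+.1f} -> settles near {end:+.4f}, |slope| = {abs(slope(end)):.2f}")
```

Each starting value settles on the nearest integer, where the slope has magnitude 0.5 and the state is therefore stable. At the half-integers the slope is 1.5, so those fixed points repel and act as boundaries between neighbouring memories. Because the pattern repeats with period one, the number of storable values is not limited by the map itself.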

Parallelizable implementations are possible for MRUs, notably the bistable MRU (BMRU). Its updates conform to a first-order linear recurrence and therefore support computation in $O(\log T)$ parallel depth over a sequence of length $T$ using prefix-scan algorithms. This enables scaling to long sequences otherwise intractable for classical RNNs (Geeter et al., 14 Jan 2026).
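The prefix-scan idea can be sketched for a generic first-order linear recurrence $h_t = a_t h_{t-1} + b_t$. The coefficient names `a` and `b`, the function names, and the Hillis-Steele scan variant are assumptions for illustration and are not taken from the BMRU definition. Each step is treated as an affine map, and composing affine maps is associative, which is what allows the scan.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # the affine map h -> a * h + b

def combine(first: Pair, second: Pair) -> Pair:
    """Compose two affine maps, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)

def sequential(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Reference implementation: one step after another, T dependent steps."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_recurrence(a: List[float], b: List[float],
                    h0: float = 0.0) -> Tuple[List[float], int]:
    """Inclusive scan; every entry within a round is independent of the others."""
    pairs = list(zip(a, b))
    offset, rounds = 1, 0
    while offset < len(pairs):
        # On parallel hardware this whole list could be computed at once.
        pairs = [combine(pairs[i - offset], pairs[i]) if i >= offset else pairs[i]
                 for i in range(len(pairs))]
        offset *= 2
        rounds += 1
    return [big_a * h0 + big_b for big_a, big_b in pairs], rounds

a = [0.5, 0.9, 1.0, 0.2, 0.7, 1.0, 0.3, 0.8]
b = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
states, rounds = scan_recurrence(a, b, h0=0.25)
reference = sequential(a, b, h0=0.25)
print(rounds, max(abs(x - y) for x, y in zip(states, reference)))
```

For eight timesteps the scan needs three rounds instead of eight dependent steps, and the number of rounds grows with the base-2 logarithm of the sequence length. This variant performs more total work than the sequential loop; work-efficient scans exist but are longer to write down.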

6. Implications and Applications

The infinite MSRNN lens unifies transformer and RNN architectures, demonstrating that the transformer's memory system is a concrete example of an unbounded multi-state recurrent process. All windowing or learned-selection mechanisms correspond to selecting a bounded submatrix of the theoretically infinite hidden state. Techniques like TOVA serve as plug-in compression modules, enabling memory-efficient inference and extending model context ranges without retraining.

Further, the multistability framework undergirds advances in persistent-memory RNNs: by encoding multiple, or even infinitely many, attractors, MRUs can achieve robust, non-vanishing retention of information that does not degrade with sequence length and that supports parallel training paradigms (Oren et al., 2024, Geeter et al., 14 Jan 2026).

7. Experimental Protocols for Infinite MSRNNs

Evaluating infinite MSRNNs requires benchmarks on both synthetic and real-world long-context tasks, including:

  • Copy memory and adding tasks for an increasing number of attractors.
  • Tracking the gradient norm […] over extended sequences to check that gradients neither vanish nor explode.
  • Measuring wall-clock training and inference times, confirming $O(\log T)$ depth for parallel implementations compared to $O(T)$ for classical RNNs.
  • Comparative benchmarks on tasks such as permuted sequential MNIST and Pathfinder to assess persistent memory under distributional shifts and input gaps (Geeter et al., 14 Jan 2026).

Empirical results reveal that infinite MSRNNs, when suitably implemented and regularized, avoid memory fading and outperform classical RNNs, LSTMs, and SSMs on tasks with long-term dependencies. Hybrid architectures (e.g., combining BMRU with SSMs) also yield superior results by coupling persistent memory with efficient transient dynamics (Geeter et al., 14 Jan 2026).
