
Predictive-State and Memory Necessity

Updated 5 March 2026
  • Predictive-State and Memory Necessity is a framework that encodes a process's state through future observable predictions, eliminating reliance on latent variables.
  • The approach uses recursive filtering and spectral initialization to reduce sample complexity and enhance robustness under partial observability.
  • Empirical and theoretical studies show that leveraging predictive information provides computational efficiency and insights into biological memory architectures.

Predictive-state representations (PSRs) generalize classical state-space models by representing the internal state of a stochastic process or agent entirely in terms of predictions about future observable quantities, conditioning on action-observation history. The "memory necessity" problem interrogates the minimal sufficient summary of past experience required to optimally predict future observables, learn effective control policies, and reconstruct hidden structure in complex dynamical systems. This article synthesizes the computational theory, key results, empirical findings, and theoretical limitations surrounding predictive-state models and the necessity (or reduction) of explicit memory in both artificial and biological contexts.

1. Predictive-State Representations: Formalism and Principle

PSRs encode the state of a process via the joint conditional distribution of future observations (and optionally, future rewards) given the observed history and a prescribed future action sequence. This reframes classical belief-state tracking and latent-state models: rather than maintaining posteriors over unobserved latent states (as in HMMs or POMDPs), a PSR is a sufficient statistic for prediction, constructed from strictly observable quantities (Hefny et al., 2018).

For a fixed prediction horizon $k$, let $\psi_t = \mathbb{E}[\phi(o_{t:t+k-1}) \mid do(a_{t:t+k-1}),\, h_t]$ denote the vector of conditional expectations of feature functions $\phi$ of the next $k$ observations, conditioned on a history $h_t$ and an intervention (action) sequence. PSRs can be recursively updated using:

  • State extension: $p_t = W_{\text{ext}}\,\psi_{t-1}$
  • Conditioning step: $\psi_t = f_{\text{cond}}(p_t, a_{t-1}, o_{t-1})$

Alternatively, a single (differentiable) mapping $\psi_t = f_W([\psi_{t-1}; a_{t-1}; o_{t-1}])$ can be used. The resulting $\psi_t$ replaces any recurrent neural network latent state with a minimal, observable "predictive belief" (Hefny et al., 2018).
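As a concrete sketch (hypothetical dimensions, with a random linear map standing in for a learned $f_W$), the single-map update above takes only a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

d_psi, d_a, d_o = 8, 2, 3  # hypothetical state, action, observation dimensions
W = rng.normal(scale=0.1, size=(d_psi, d_psi + d_a + d_o))

def update(psi, a, o):
    """One recursive step psi_t = f_W([psi_{t-1}; a_{t-1}; o_{t-1}])."""
    z = np.concatenate([psi, a, o])
    return np.tanh(W @ z)  # any differentiable map can play the role of f_W

# Filter a short action-observation stream.
psi = np.zeros(d_psi)
for _ in range(5):
    a, o = rng.normal(size=d_a), rng.normal(size=d_o)
    psi = update(psi, a, o)

print(psi.shape)  # the predictive state stays a fixed-size summary
```

In a trained PSR the entries of $\psi_t$ would be predictions of observable features rather than arbitrary activations; the point here is only the recursive, latent-free form of the update.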

The successor representation (SR)—the expectation of future discounted visits to each state Mγ(s,s)=Eπ[t=0γt1{st=s}s0=s]M_\gamma(s,s') = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t 1\{s_t=s'\} \mid s_0 = s]—is a canonical PSR for planning and reinforcement learning, generalizable to successor features for linear reward models (Momennejad, 2024).
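For a fully observed Markov chain the SR has the closed form $M_\gamma = (I - \gamma P)^{-1}$; a minimal sketch on a hypothetical 4-state random walk:

```python
import numpy as np

# Random-walk policy on a 4-state ring: P[s, s'] = P(s' | s) under the policy.
P = np.array([
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])
gamma = 0.9

# Closed form M_gamma = (I - gamma P)^{-1}: expected discounted visit counts.
M = np.linalg.inv(np.eye(4) - gamma * P)

# Linear reward models reuse M directly: V = M r (the successor-features idea).
r = np.array([1.0, 0.0, 0.0, 0.0])
V = M @ r
print(np.round(V, 3))
```

Changing the reward vector $r$ revalues every state without relearning $M$, which is what makes the SR useful for rapid reward revaluation.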

2. Memory Necessity: Theory and Information-Theoretic Limits

The question of how much memory—past or computational state—is necessary for optimal prediction is sharply illuminated by both information-theoretic and algorithmic characterizations.

  • Finite Mutual Information Predicts Effective Memory: For stationary processes with average past–future mutual information $I(\mathcal{M})$, an order-$\ell$ Markov model with $\ell = I(\mathcal{M})/\epsilon$ suffices to achieve expected KL prediction error $\epsilon$ relative to the infinite-memory Bayes-optimal predictor (Sharan et al., 2016). For an $n$-state HMM, $I(\mathcal{M}) \leq \log n$, so $\ell = O(\log n/\epsilon)$ is sufficient and information-theoretically necessary (see Table 1).

| Model Class | Sufficient Window $\ell$ | Memory Bound | Sample Complexity |
|:-----------:|:------------------------:|:------------:|:-----------------:|
| HMM ($n$ states) | $O(\log n/\epsilon)$ | $O(\log n/\epsilon)$ | $d^{O(\log n/\epsilon)}$ |
| General stationary | $I/\epsilon$ | $I/\epsilon$ | $d^{O(I/\epsilon)}$ |
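A hedged illustration of the window-length tradeoff (with hypothetical HMM parameters): fit order-$\ell$ Markov predictors by counting, and observe the per-symbol log-likelihood improving with diminishing returns as $\ell$ grows:

```python
import collections

import numpy as np

rng = np.random.default_rng(1)

# Sample a binary sequence from a small 2-state HMM (hypothetical parameters).
T = np.array([[0.9, 0.1], [0.2, 0.8]])  # hidden-state transition matrix
E = np.array([[0.8, 0.2], [0.3, 0.7]])  # emission probabilities P(o | s)
s, seq = 0, []
for _ in range(50_000):
    seq.append(1 if rng.random() < E[s, 1] else 0)
    s = 1 if rng.random() < T[s, 1] else 0

def avg_loglik(seq, ell):
    """Per-symbol log-likelihood of a Laplace-smoothed order-ell predictor."""
    counts = collections.defaultdict(lambda: np.zeros(2))
    for i in range(ell, len(seq)):
        counts[tuple(seq[i - ell:i])][seq[i]] += 1
    lls = [np.log((counts[tuple(seq[i - ell:i])][seq[i]] + 1)
                  / (counts[tuple(seq[i - ell:i])].sum() + 2))
           for i in range(ell, len(seq))]
    return float(np.mean(lls))

for ell in (1, 2, 4, 8):
    print(ell, round(avg_loglik(seq, ell), 4))
```

The gain from each doubling of $\ell$ shrinks quickly, consistent with the $I(\mathcal{M})/\epsilon$ bound, while the number of contexts (and hence the sample demand) grows as $2^\ell$.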

  • Predictive-State Complexity Dimension: For HMMs and latent stochastic processes, the minimal predictive model is the set of all "mixed states" (distributions over hidden states conditioned on the observed past). For non-unifilar HMMs, this set is generically fractal and uncountably infinite. The statistical complexity dimension $d_\mu$ gives the scaling law for memory: to achieve error $\epsilon$, one needs $\sim \epsilon^{-d_\mu}$ distinct predictive states. Processes with $d_\mu > 0$ have intrinsically infinite-memory requirements (Jurgens et al., 2021).
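The mixed-state construction is easy to simulate. Below, a hypothetical non-unifilar binary HMM is written with labeled transition matrices $T^{(o)}[s,s'] = \Pr(s', o \mid s)$; tracking the Bayes-filtered belief and collecting the distinct values it visits shows the predictive-state set proliferating rather than closing into a finite set:

```python
import numpy as np

rng = np.random.default_rng(2)

# Labeled transition matrices T[o][s, s'] = P(s', o | s); rows of T[0] + T[1]
# sum to one. The same symbol can lead to several states (non-unifilarity).
T = {0: np.array([[0.3, 0.2], [0.1, 0.1]]),
     1: np.array([[0.1, 0.4], [0.4, 0.4]])}

def filter_update(eta, o):
    """Bayes update of the mixed state eta = P(hidden state | observed past)."""
    v = eta @ T[o]
    return v / v.sum()

eta = np.array([0.5, 0.5])
seen = set()
for _ in range(2000):
    probs = [float((eta @ T[o]).sum()) for o in (0, 1)]  # P(next symbol)
    o = int(rng.choice(2, p=probs))
    eta = filter_update(eta, o)
    seen.add(tuple(np.round(eta, 6)))

print(len(seen))  # the set of distinct mixed states keeps growing with time
```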

3. Empirical and Algorithmic Approaches to Memory Reduction

A series of results in model-based reinforcement learning, planning, and sequence modeling has demonstrated practical reductions in memory or sample complexity by utilizing predictive-state, rather than purely episodic, architectures.

  • Recursive Spectral Initialization: Two-stage spectral regression methods allow for consistent initialization of the PSR filter, significantly improving learning speed and stability over randomly-initialized or frozen filters, especially under partial observability (Hefny et al., 2018).
  • Comparison with Finite-Memory and RNN Models: Empirical studies with Recurrent Predictive State Policy (RPSP) networks demonstrate that PSR-based recurrent filters with reactive policies (no learned recurrence) consistently outperform or match gated RNNs (e.g., GRUs) and finite-history Markov models on partially observable control tasks. The predictive-state filter is more robust under observation noise than finite-memory models (Hefny et al., 2018).
  • Banked Multiscale Predictive-States: Biological and deep learning models benefit from maintaining predictive-state representations at multiple temporal scales (varying discount $\gamma$), yielding flexible generalization and rapid adaptation to both reward and transition revaluation (Momennejad, 2024).
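A bank of SRs at several discounts is cheap to maintain when the transition model is known; a sketch on a hypothetical 3-state chain:

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# One SR per temporal scale: small gamma = short horizon, large = long horizon.
bank = {g: np.linalg.inv(np.eye(3) - g * P) for g in (0.5, 0.9, 0.99)}

# Reward revaluation: a new reward vector reuses every cached SR instantly.
r_new = np.array([0.0, 0.0, 1.0])
values = {g: bank[g] @ r_new for g in bank}
for g, v in values.items():
    print(g, np.round(v, 2))
```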

4. Predictive-State and Biological Memory Architectures

Neural and behavioral evidence suggests that the brain's memory and planning systems are organized around multiscale predictive-state representations. In mammals:

  • Hippocampal Hierarchy: Posterior hippocampus encodes fine-scale, short-horizon predictive maps; anterior hippocampus encodes coarse, long-horizon predictions.
  • Prefrontal Cortex (PFC): Rostro-caudal gradients in PFC support abstraction across temporal scales, assembling predictive memories for scalable planning.
  • Empirical Signatures: Human behavioral revaluation experiments, fMRI replay studies, and rodent/grid-cell recordings all robustly indicate that predictive-state organization (SR-like encoding) is essential for efficient generalization and adaptation (Momennejad, 2024).

5. Predictive-State Learning, Sample Complexity, and Memory Metrics

  • Consistency and Sample Efficiency: Under stationarity and ergodicity, the empirical predictive-state estimates converge (in the weak topology) to the true predictive law, with explicit rates for Markov, sofic, and renewal processes. For RKHS embeddings, the effective memory needed for approximation error $\varepsilon$ is only $\ell = O(\ln(1/\varepsilon))$, dramatically compressing the memory required for high-dimensional, non-Markovian processes (Loomis et al., 2021).
  • Kernel Bayes Rule and Hilbert Embeddings: Truncating RKHS representations to finite memory introduces an error that is only exponentially small in the window length, enabling efficient empirical learning (Loomis et al., 2021).
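The exponential decay of memory can be checked exactly for a small HMM (hypothetical parameters below): compute the optimal next-symbol prediction from a length-$\ell$ window by the forward recursion, and measure how much prepending one more past symbol can still change it:

```python
import itertools

import numpy as np

# Labeled transition matrices T[o][s, s'] = P(s', o | s) for a 2-state HMM
# (hypothetical, strictly positive entries), with a uniform initial belief.
T = {0: np.array([[0.63, 0.07], [0.06, 0.24]]),
     1: np.array([[0.18, 0.12], [0.14, 0.56]])}
pi = np.array([0.5, 0.5])

def predictive(word):
    """P(next symbol = 1 | word), computed exactly by the forward recursion."""
    v = pi.copy()
    for o in word:
        v = v @ T[o]
    v = v / v.sum()
    return float((v @ T[1]).sum())

# Worst-case change in the prediction from one extra symbol of history.
gaps = {}
for ell in (1, 3, 5, 7):
    gaps[ell] = max(abs(predictive(w) - predictive((b,) + w))
                    for w in itertools.product((0, 1), repeat=ell)
                    for b in (0, 1))
    print(ell, round(gaps[ell], 6))
```

The worst-case gap shrinks geometrically in $\ell$, so a window of length $O(\ln(1/\varepsilon))$ already pins the prediction down to accuracy $\varepsilon$.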

6. Special and Limiting Cases

  • Quantum Simulation: In quantum contextuality scenarios (e.g., Peres–Mermin square), simulating all quantum measurement statistics classically requires at least three internal states—a sharp, constructive lower and upper bound (Fagundes et al., 2016).
  • Linear Networks: Memory vs Prediction Tradeoff: In linear recurrent reservoirs, memory capacity and predictive capacity are provably distinct and may be in direct opposition: maximizing "memory" (w.r.t. input reconstruction) can destroy future prediction ability, and vice versa. Optimal predictive models are tightly linked to Wiener-filter theory, not to memory size alone (Marzen, 2017).
  • Limitations of Episodic and Markovian Models: Purely episodic or one-step memory is insufficient for generalization under partial observability or compositional structure. Predictive-state models provide the unique minimal sufficient statistic for prediction but may still require state-space approximations or hierarchical composition in complex environments (Hefny et al., 2018, Momennejad, 2024).

7. Open Problems and Practical Considerations

Contemporary research identifies several unresolved challenges:

  • Unavoidable Memory Scaling: For processes with positive statistical complexity dimension $d_\mu$ (fractal mixed-state sets, effectively infinite memory), no finite-state or fixed-order Markov predictor is fully optimal as $\epsilon \to 0$; approximations must explicitly trade off memory cost against prediction error (Jurgens et al., 2021).
  • Data-Memory Tradeoff: Increasing memory (e.g., the window length $\ell$) increases sample complexity exponentially—on the order of $d^{O(\ell)}$ for alphabet size $d$—an inherent statistical and computational barrier (Sharan et al., 2016).
  • Real-Time Prediction: In neuroengineering contexts, balancing transition-table memory length against computational latency and accuracy is an active area with only open questions and schematic analyses to date (Taranath, 2025).
  • Hierarchical Predictive Control: Integrating predictive-state representations across abstraction levels (episodic buffer to multiscale SR) is critical for scalable planning in both artificial and biological agents (Momennejad, 2024).
  • Robustness and Model Misspecification: Under model mismatch or function approximation error, explicit learned recurrence (e.g., LSTM/GRU) may still contribute necessary flexibility not afforded by pure PSR models (Hefny et al., 2018).

References

  • (Hefny et al., 2018) Recurrent Predictive State Policy Networks
  • (Sharan et al., 2016) Prediction with a Short Memory
  • (Jurgens et al., 2021) Divergent Predictive States: The Statistical Complexity Dimension of Stationary, Ergodic Hidden Markov Processes
  • (Momennejad, 2024) Memory, Space, and Planning: Multiscale Predictive Representations
  • (Loomis et al., 2021) Topology, Convergence, and Reconstruction of Predictive States
  • (Marzen, 2017) The difference between memory and prediction in linear recurrent networks
  • (Fagundes et al., 2016) Memory cost for simulating all quantum correlations of the Peres-Mermin scenario
  • (Wayne et al., 2018) Unsupervised Predictive Memory in a Goal-Directed Agent
  • (Taranath, 2025) On Questions of Predictability and Control of an Intelligent System Using Probabilistic State-Transitions
