Minimal Predictive Sufficiency SSM
- The paper introduces an info-theoretic framework where the hidden state is a minimal predictive sufficient statistic for accurate future forecasting.
- It employs a relaxed Lagrangian objective combining prediction loss with an information regularizer to compress non-causal history efficiently.
- Empirical results show MPS-SSM achieves state-of-the-art accuracy and robustness against noisy inputs across multiple time-series benchmarks.
The Minimal Predictive Sufficiency State Space Model (MPS-SSM) is a sequence modeling framework whose content-selective state gating is derived from a first-principles information-theoretic criterion. MPS-SSM builds on the principle that the model’s hidden state should be a minimal sufficient statistic of the past for predicting the future. This results in a model that maximally compresses historical context, learns to ignore non-causal information, and exhibits robustness and accuracy across long-horizon and noisy forecasting scenarios (Wang et al., 5 Aug 2025).
1. Principle of Predictive Sufficiency
The central theoretical construct underlying MPS-SSM is the principle of predictive sufficiency. For a sequence $x_{1:T}$, where $x_{\le t}$ is the observed history and $x_{>t}$ denotes a segment of future targets, MPS-SSM demands that the hidden state $h_t$ at every time $t$ satisfies two criteria:
- Predictive sufficiency: The hidden state $h_t$ must retain all information in $x_{\le t}$ relevant for predicting $x_{>t}$; formally, $I(h_t; x_{>t}) = I(x_{\le t}; x_{>t})$.
- Minimality: Among all statistics satisfying sufficiency, $h_t$ should minimize $I(h_t; x_{\le t})$, i.e., $I(h_t; x_{\le t}) \le I(h'_t; x_{\le t})$ for any $h'_t$ also satisfying sufficiency.
Collectively, these constraints characterize $h_t$ as a minimal predictive sufficient statistic and can be formalized by the optimization problem

$$\min_{h_t} \; I(h_t; x_{\le t}) \quad \text{s.t.} \quad I(h_t; x_{>t}) = I(x_{\le t}; x_{>t}).$$

This setup ensures that the hidden state captures only the causal structure necessary for accurate sequence prediction and discards spurious or non-predictive variability.
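The sufficiency criterion can be checked numerically on a toy example. The sketch below (illustrative, not from the paper) builds a two-state first-order Markov chain $x_1 \to x_2 \to x_3$ and verifies that the latest observation $x_2$ carries exactly as much information about the future $x_3$ as the full history $(x_1, x_2)$ does, which is the defining property of a predictive sufficient statistic:

```python
# Toy illustration (not from the paper): for a first-order Markov chain
# x1 -> x2 -> x3, the latest observation x2 is a predictive sufficient
# statistic of the history (x1, x2) for the future x3.
import numpy as np

def mutual_info(p_xy):
    """I(X; Y) in nats, computed from a joint probability table p_xy[i, j]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])).sum())

# Transition matrix T[i, j] = P(next = j | current = i), initial dist pi.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])

# Joint P(x1, x2, x3) for the chain x1 -> x2 -> x3.
joint = pi[:, None, None] * T[:, :, None] * T[None, :, :]

# I((x1, x2); x3): flatten the history pair (x1, x2) into one variable.
I_hist_future = mutual_info(joint.reshape(4, 2))
# I(x2; x3): marginalize x1 out of the joint.
I_state_future = mutual_info(joint.sum(axis=0))

print(I_hist_future, I_state_future)  # equal: x2 is predictively sufficient
```

Because $P(x_3 \mid x_1, x_2)$ depends only on $x_2$, the two mutual informations coincide, while $x_2$ compresses away the non-predictive variable $x_1$.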
2. MPS-SSM Objective Function Derivation
Directly enforcing the sufficiency constraint is intractable, so MPS-SSM introduces a relaxed Lagrangian objective. The predictive sufficiency criterion is represented by a standard prediction loss,

$$\mathcal{L}_{\text{pred}} = \mathbb{E}\big[\ell(\hat{x}_{>t}, x_{>t})\big],$$

while the minimality term is realized as an information-theoretic regularizer,

$$\mathcal{L}_{\text{min}} = I(h_t; x_{\le t}).$$

The total objective becomes

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{min}},$$

with $\lambda > 0$ balancing prediction performance and information compression.

As direct computation of $I(h_t; x_{\le t})$ is intractable, MPS-SSM employs a variational upper bound computed with an auxiliary decoder $q_\phi(x_{\le t} \mid h_t)$, yielding a tractable surrogate $\hat{\mathcal{L}}_{\text{min}} \ge I(h_t; x_{\le t})$ and enabling practical and stable optimization via backpropagation with

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \hat{\mathcal{L}}_{\text{min}}.$$
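In code, the relaxed objective pairs a prediction loss with a sample-based surrogate for $I(h_t; x_{\le t})$. The sketch below is a minimal illustration that assumes a CLUB-style mutual-information upper bound with a linear-Gaussian decoder; the paper's exact variational bound and decoder parameterization may differ, and `W` here is a hypothetical stand-in for the decoder $q_\phi$:

```python
# Sketch (assumed details, not the paper's exact estimator): a CLUB-style
# sample-based upper bound on I(h; x) with a linear-Gaussian decoder
# q_phi(x | h) = N(x; h @ W, I), combined with the prediction loss into
# the relaxed Lagrangian objective L_total = L_pred + lambda * L_min.
import numpy as np

def gaussian_logpdf(x, mean):
    # log N(x; mean, I), up to the additive constant -d/2 * log(2*pi).
    return -0.5 * np.sum((x - mean) ** 2, axis=-1)

def club_upper_bound(h, x, W):
    """CLUB estimate of I(h; x): E_joint[log q] - E_marginals[log q]."""
    mean = h @ W                           # decoder mean, shape (B, d_x)
    pos = gaussian_logpdf(x, mean).mean()  # matched pairs (h_i, x_i)
    # All cross pairs (h_i, x_j) approximate the product of marginals.
    neg = gaussian_logpdf(x[None, :, :], mean[:, None, :]).mean()
    return pos - neg

def mps_objective(pred, target, h, x, W, lam=0.1):
    """L_total = L_pred + lambda * L_min (variational surrogate)."""
    l_pred = np.mean((pred - target) ** 2)
    l_min = club_upper_bound(h, x, W)
    return l_pred + lam * l_min

rng = np.random.default_rng(0)
B, d_h, d_x = 32, 8, 4
h = rng.normal(size=(B, d_h))
x = rng.normal(size=(B, d_x))
W = rng.normal(size=(d_h, d_x))
loss = mps_objective(rng.normal(size=(B, d_x)),
                     rng.normal(size=(B, d_x)), h, x, W)
print(loss)
```

In a real training setup the same computation would run inside an autodiff framework so the penalty backpropagates into the state-producing parameters.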
3. Architecture and Training Methodology
MPS-SSM extends a content-selective SSM backbone—such as Mamba—by integrating several key modules:
- Selection Gate $g_\theta$: Computes adaptive state-space parameters $(\Delta_t, B_t, C_t)$ conditioned on each input $x_t$.
- SSM Recurrence: The core transition follows
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$
where the discretized $\bar{A}_t, \bar{B}_t$ are approximated via zero-order hold or NPLR techniques.
- Minimality Module: A lightweight decoder $q_\phi$ reconstructs $x_t$ from $h_t$ to facilitate the variational information regularization.
- Prediction Head: Projects $h_t$ into target predictions $\hat{x}_{>t}$.
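The gated recurrence above can be sketched as follows. This is a deliberately simplified toy (diagonal continuous-time $A$, a scalar step size, and a single scalar input channel are all assumptions for brevity), not the paper's implementation:

```python
# Minimal sketch of a selective SSM step, Mamba-style: per-step gates
# produce (Delta_t, B_t, C_t) from x_t, and the diagonal A is
# discretized by zero-order hold (ZOH). Shapes and gates are toy choices.
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """x: (T, d_in); A: (d_state,) diagonal continuous-time matrix."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        # Selection gate: input-dependent SSM parameters.
        delta = np.log1p(np.exp(x_t @ W_delta))  # softplus -> positive step
        B_t = x_t @ W_B                          # (d_state,)
        C_t = x_t @ W_C                          # (d_state,)
        # Zero-order hold discretization of the diagonal A.
        A_bar = np.exp(delta * A)                # (d_state,)
        B_bar = (A_bar - 1.0) / A * B_t          # exact ZOH for diagonal A
        # Recurrence: h_t = A_bar * h_{t-1} + B_bar * u_t.
        h = A_bar * h + B_bar * x_t.mean()       # scalar input channel (toy)
        ys.append(C_t @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
T, d_in, d_state = 16, 3, 5
A = -np.abs(rng.normal(size=d_state)) - 0.1     # stable (negative) poles
y = selective_ssm(rng.normal(size=(T, d_in)),
                  A,
                  rng.normal(size=d_in),        # W_delta -> scalar delta
                  rng.normal(size=(d_in, d_state)),
                  rng.normal(size=(d_in, d_state)))
print(y.shape)  # (16,)
```

Because the poles of $A$ are negative and $\Delta_t > 0$, each $\bar{A}_t = e^{\Delta_t A}$ has entries in $(0, 1)$, so the recurrence contracts and the scan is stable.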
Training is conducted over entire unrolled sequences, jointly optimizing $\theta$ (gate), $\phi$ (decoder), and the SSM matrices to minimize $\mathcal{L}_{\text{total}}$ with standard first-order methods. The procedure scales efficiently and is practical for large-scale time-series tasks.
Training Workflow Table
| Step | Operation | Output |
|---|---|---|
| 1 | Selection Gate | Adaptive params $(\Delta_t, B_t, C_t)$ |
| 2 | SSM Recurrence | Hidden state $h_t$ |
| 3 | Prediction | Forecasted values $\hat{x}_{>t}$ |
| 4 | Minimality Module | Info loss $\hat{\mathcal{L}}_{\text{min}}$ |
| 5 | Backpropagation | Parameter update |
4. Empirical Results and Robustness Analysis
MPS-SSM has been evaluated on established sequence modeling and forecasting benchmarks, including ETT (ETTh1/2, ETTm1/2), Weather, Electricity, Traffic, and Exchange, across forecast horizons (96, 192, 336, 720) and measured via MSE and MAE.
Key findings include:
- Optimal Regularization ($\lambda$) Sensitivity: Each dataset and horizon displays a dataset-specific “sweet-spot” value of $\lambda$, and the optimal $\lambda$ increases with forecast length.
- State-of-the-Art Accuracy:
- On ETTh1 (96), MPS-SSM achieves MSE = 0.375, second only to PatchTST (0.360).
- On ETTm2 (96), MSE = 0.165, outperforming PatchTST (0.224).
- On Electricity (96), MSE = 0.151 (vs. next-best 0.225).
- On long horizons, MPS-SSM routinely ranks best or second-best.
- Robustness to Noise: Under impulse-noise perturbations to inputs, increasing $\lambda$ monotonically reduces forecast-error degradation; at the strongest tested regularization, degradation is approximately threefold lower than at the weakest. This empirically validates the theoretical prediction that MPS-SSM is resilient to non-causal, spurious input patterns.
5. Generalization to a Regularization Framework
The MPS principle is not restricted to SSMs and can be instantiated as a model-agnostic regularizer for any sequential architecture. This extension involves:
- Selecting an internal representation $z_t$ (e.g., an SSM state, Transformer embedding, or linear hidden vector).
- Attaching a lightweight decoder $q_\phi$.
- Adding the minimality regularization term $\lambda\, \hat{\mathcal{L}}_{\text{min}}(z_t)$ to the base task loss.
This general regularization strategy takes the form

$$\mathcal{L} = \mathcal{L}_{\text{task}}(f) + \lambda\, \hat{\mathcal{L}}_{\text{min}}(z_t),$$

where $f$ denotes the original task model.
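The three-step recipe above can be sketched generically. All names below are hypothetical, and a simple linear reconstruction penalty stands in for the paper's variational information bound, purely for illustration:

```python
# Sketch of the model-agnostic recipe (illustrative names, not from the
# paper): take any sequence model's internal representation z_t, attach a
# lightweight linear decoder, and add the minimality term to the task loss.
import numpy as np

def reconstruction_surrogate(z, x, W_dec):
    """Simple stand-in for the information regularizer: squared error of a
    lightweight linear decoder recovering the input x from z."""
    return np.mean((z @ W_dec - x) ** 2)

def mps_regularized_loss(task_loss, z, x, W_dec, lam=0.1):
    """L = L_task + lambda * L_min(z), applicable to any architecture."""
    return task_loss + lam * reconstruction_surrogate(z, x, W_dec)

rng = np.random.default_rng(2)
B, d_z, d_x = 8, 16, 4
z = rng.normal(size=(B, d_z))   # e.g. SSM state or Transformer embedding
x = rng.normal(size=(B, d_x))   # the inputs the representation summarizes
W_dec = rng.normal(size=(d_z, d_x))
total = mps_regularized_loss(task_loss=0.42, z=z, x=x, W_dec=W_dec)
print(total)
```

Since the penalty touches only the chosen representation and an auxiliary decoder, the base model (`MPS-Mamba`, `MPS-DLinear`, `MPS-PatchTST`, etc.) needs no architectural change.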
Empirical evidence demonstrates utility across architectures such as Mamba (MPS-Mamba), linear models (MPS-DLinear), and Transformers (MPS-PatchTST), with consistent improvements on ETT and other datasets (e.g., MPS-PatchTST achieves ETTh1/96 MSE=0.328 compared to 0.360 for vanilla PatchTST).
6. Significance and Implications
MPS-SSM is the first selective SSM whose gating is derived from the information-theoretic requirement that hidden states encode the minimal predictive sufficient statistic. The resulting mutual information penalty confers both empirical state-of-the-art generalization and robustness properties, notably resistance to non-causal and spurious noise. Furthermore, the principle’s generality enables its adoption as an effective regularizer in architectures beyond SSMs, including popular sequence models such as Transformers and linear models (Wang et al., 5 Aug 2025). A plausible implication is the emergence of a new paradigm for designing sequential models grounded in first principles rather than heuristic mechanism design.