Minimal Predictive Sufficiency SSM
- The paper introduces an info-theoretic framework where the hidden state is a minimal predictive sufficient statistic for accurate future forecasting.
- It employs a relaxed Lagrangian objective combining prediction loss with an information regularizer to compress non-causal history efficiently.
- Empirical results show MPS-SSM achieves state-of-the-art accuracy and robustness against noisy inputs across multiple time-series benchmarks.
The Minimal Predictive Sufficiency State Space Model (MPS-SSM) is a sequence modeling framework whose content-selective state gating is derived from a first-principles information-theoretic criterion. MPS-SSM builds on the principle that the model’s hidden state should be a minimal sufficient statistic of the past for predicting the future. This results in a model that maximally compresses historical context, learns to ignore non-causal information, and exhibits robustness and accuracy across long-horizon and noisy forecasting scenarios (Wang et al., 5 Aug 2025).
1. Principle of Predictive Sufficiency
The central theoretical construct underlying MPS-SSM is the principle of predictive sufficiency. For a sequence $x_{1:T}$, where $x_{\le t}$ is the observed history and $x_{>t}$ denotes a segment of future targets, MPS-SSM demands that the hidden state $h_t$ at every time $t$ satisfies two criteria:
- Predictive sufficiency: The hidden state $h_t$ must retain all information in $x_{\le t}$ relevant for predicting $x_{>t}$; formally, $I(h_t; x_{>t}) = I(x_{\le t}; x_{>t})$.
- Minimality: Among all statistics satisfying sufficiency, $h_t$ should minimize $I(h_t; x_{\le t})$, i.e., $I(h_t; x_{\le t}) \le I(h'_t; x_{\le t})$ for any $h'_t$ also satisfying sufficiency.
Collectively, these constraints characterize $h_t$ as a minimal predictive sufficient statistic and can be formalized by the optimization problem

$$\min_{h_t} \; I(h_t; x_{\le t}) \quad \text{s.t.} \quad I(h_t; x_{>t}) = I(x_{\le t}; x_{>t}).$$

This setup ensures that the hidden state captures only the causal structure necessary for accurate sequence prediction and discards spurious or non-predictive variability.
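The sufficiency criterion can be checked numerically on a toy example. The sketch below (illustrative, not from the paper) builds a two-state first-order Markov chain $x_1 \to x_2 \to x_3$ and verifies that the latest observation $x_2$ carries exactly as much information about the future $x_3$ as the full history $(x_1, x_2)$ does, which is the defining property of a predictive sufficient statistic:

```python
# Toy illustration (not from the paper): for a first-order Markov chain
# x1 -> x2 -> x3, the latest observation x2 is a predictive sufficient
# statistic of the history (x1, x2) for the future x3.
import numpy as np

def mutual_info(p_xy):
    """I(X; Y) in nats, computed from a joint probability table p_xy[i, j]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])).sum())

# Transition matrix T[i, j] = P(next = j | current = i), initial dist pi.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])

# Joint P(x1, x2, x3) for the chain x1 -> x2 -> x3.
joint = pi[:, None, None] * T[:, :, None] * T[None, :, :]

# I((x1, x2); x3): flatten the history pair (x1, x2) into one variable.
I_hist_future = mutual_info(joint.reshape(4, 2))
# I(x2; x3): marginalize x1 out of the joint.
I_state_future = mutual_info(joint.sum(axis=0))

print(I_hist_future, I_state_future)  # equal: x2 is predictively sufficient
```

Because $P(x_3 \mid x_1, x_2)$ depends only on $x_2$, the two mutual informations coincide, while $x_2$ compresses away the non-predictive variable $x_1$.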
2. MPS-SSM Objective Function Derivation
Directly enforcing the sufficiency constraint is intractable, so MPS-SSM introduces a relaxed Lagrangian objective. The predictive sufficiency criterion is represented by a standard prediction loss,

$$\mathcal{L}_{\text{pred}} = \mathbb{E}\big[\ell(\hat{x}_{>t}, x_{>t})\big],$$

while the minimality term is realized as an information-theoretic regularizer,

$$\mathcal{L}_{\text{min}} = I(h_t; x_{\le t}).$$

The total objective becomes

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{min}},$$

with $\lambda > 0$ balancing prediction performance and information compression.

As direct computation of $I(h_t; x_{\le t})$ is intractable, MPS-SSM employs a variational upper bound computed with an auxiliary decoder $q_\phi(x_{\le t} \mid h_t)$, yielding a tractable surrogate $\hat{\mathcal{L}}_{\text{min}} \ge I(h_t; x_{\le t})$ and enabling practical and stable optimization via backpropagation with

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \hat{\mathcal{L}}_{\text{min}}.$$
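In code, the relaxed objective pairs a prediction loss with a sample-based surrogate for $I(h_t; x_{\le t})$. The sketch below is a minimal illustration that assumes a CLUB-style mutual-information upper bound with a linear-Gaussian decoder; the paper's exact variational bound and decoder parameterization may differ, and `W` here is a hypothetical stand-in for the decoder $q_\phi$:

```python
# Sketch (assumed details, not the paper's exact estimator): a CLUB-style
# sample-based upper bound on I(h; x) with a linear-Gaussian decoder
# q_phi(x | h) = N(x; h @ W, I), combined with the prediction loss into
# the relaxed Lagrangian objective L_total = L_pred + lambda * L_min.
import numpy as np

def gaussian_logpdf(x, mean):
    # log N(x; mean, I), up to the additive constant -d/2 * log(2*pi).
    return -0.5 * np.sum((x - mean) ** 2, axis=-1)

def club_upper_bound(h, x, W):
    """CLUB estimate of I(h; x): E_joint[log q] - E_marginals[log q]."""
    mean = h @ W                           # decoder mean, shape (B, d_x)
    pos = gaussian_logpdf(x, mean).mean()  # matched pairs (h_i, x_i)
    # All cross pairs (h_i, x_j) approximate the product of marginals.
    neg = gaussian_logpdf(x[None, :, :], mean[:, None, :]).mean()
    return pos - neg

def mps_objective(pred, target, h, x, W, lam=0.1):
    """L_total = L_pred + lambda * L_min (variational surrogate)."""
    l_pred = np.mean((pred - target) ** 2)
    l_min = club_upper_bound(h, x, W)
    return l_pred + lam * l_min

rng = np.random.default_rng(0)
B, d_h, d_x = 32, 8, 4
h = rng.normal(size=(B, d_h))
x = rng.normal(size=(B, d_x))
W = rng.normal(size=(d_h, d_x))
loss = mps_objective(rng.normal(size=(B, d_x)),
                     rng.normal(size=(B, d_x)), h, x, W)
print(loss)
```

In a real training setup the same computation would run inside an autodiff framework so the penalty backpropagates into the state-producing parameters.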
3. Architecture and Training Methodology
MPS-SSM extends a content-selective SSM backbone—such as Mamba—by integrating several key modules:
- Selection Gate $g_\theta$: Computes adaptive state-space parameters $(\Delta_t, B_t, C_t)$ conditioned on each input $x_t$.
- SSM Recurrence: The core transition follows
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$
where the discretized $\bar{A}_t, \bar{B}_t$ are approximated via zero-order hold or NPLR techniques.
- Minimality Module: A lightweight decoder $q_\phi$ reconstructs $x_t$ from $h_t$ to facilitate the variational information regularization.
- Prediction Head: Projects $h_t$ into target predictions $\hat{x}_{>t}$.
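The gated recurrence above can be sketched as follows. This is a deliberately simplified toy (diagonal continuous-time $A$, a scalar step size, and a single scalar input channel are all assumptions for brevity), not the paper's implementation:

```python
# Minimal sketch of a selective SSM step, Mamba-style: per-step gates
# produce (Delta_t, B_t, C_t) from x_t, and the diagonal A is
# discretized by zero-order hold (ZOH). Shapes and gates are toy choices.
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """x: (T, d_in); A: (d_state,) diagonal continuous-time matrix."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        # Selection gate: input-dependent SSM parameters.
        delta = np.log1p(np.exp(x_t @ W_delta))  # softplus -> positive step
        B_t = x_t @ W_B                          # (d_state,)
        C_t = x_t @ W_C                          # (d_state,)
        # Zero-order hold discretization of the diagonal A.
        A_bar = np.exp(delta * A)                # (d_state,)
        B_bar = (A_bar - 1.0) / A * B_t          # exact ZOH for diagonal A
        # Recurrence: h_t = A_bar * h_{t-1} + B_bar * u_t.
        h = A_bar * h + B_bar * x_t.mean()       # scalar input channel (toy)
        ys.append(C_t @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
T, d_in, d_state = 16, 3, 5
A = -np.abs(rng.normal(size=d_state)) - 0.1     # stable (negative) poles
y = selective_ssm(rng.normal(size=(T, d_in)),
                  A,
                  rng.normal(size=d_in),        # W_delta -> scalar delta
                  rng.normal(size=(d_in, d_state)),
                  rng.normal(size=(d_in, d_state)))
print(y.shape)  # (16,)
```

Because the poles of $A$ are negative and $\Delta_t > 0$, each $\bar{A}_t = e^{\Delta_t A}$ has entries in $(0, 1)$, so the recurrence contracts and the scan is stable.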
Training is conducted over entire unrolled sequences, jointly optimizing $\theta$ (gate), $\phi$ (decoder), and the SSM matrices to minimize $\mathcal{L}_{\text{total}}$ with standard first-order methods. The procedure scales efficiently and is practical for large-scale time-series tasks.
Training Workflow Table
| Step | Operation | Output |
|---|---|---|
| 1 | Selection Gate | Adaptive params $(\Delta_t, B_t, C_t)$ |
| 2 | SSM Recurrence | Hidden state $h_t$ |
| 3 | Prediction | Forecasted values $\hat{x}_{>t}$ |
| 4 | Minimality Module | Info loss $\hat{\mathcal{L}}_{\text{min}}$ |
| 5 | Backpropagation | Parameter update |
4. Empirical Results and Robustness Analysis
MPS-SSM has been evaluated on established sequence modeling and forecasting benchmarks, including ETT (ETTh1/2, ETTm1/2), Weather, Electricity, Traffic, and Exchange, across forecast horizons (96, 192, 336, 720) and measured via MSE and MAE.
Key findings include:
- Optimal Regularization ($\lambda$) Sensitivity: Each dataset and horizon displays a dataset-specific “sweet-spot” value of $\lambda$, and the optimal $\lambda$ increases with forecast length.
- State-of-the-Art Accuracy:
- On ETTh1 (96), MPS-SSM achieves MSE = 0.375, second only to PatchTST (0.360).
- On ETTm2 (96), MSE = 0.165, outperforming PatchTST (0.224).
- On Electricity (96), MSE = 0.151 (vs. next-best 0.225).
- On long horizons, MPS-SSM routinely ranks best or second-best.
- Robustness to Noise: Under impulse-noise perturbations to inputs, increasing $\lambda$ monotonically reduces forecast-error degradation; at the strongest tested regularization, degradation is approximately threefold lower than at the weakest. This empirically validates the theoretical prediction that MPS-SSM is resilient to non-causal, spurious input patterns.
5. Generalization to a Regularization Framework
The MPS principle is not restricted to SSMs and can be instantiated as a model-agnostic regularizer for any sequential architecture. This extension involves:
- Selecting an internal representation $z_t$ (e.g., an SSM state, Transformer embedding, or linear hidden vector).
- Attaching a lightweight decoder $q_\phi$.
- Adding the minimality regularization term $\lambda\, \hat{\mathcal{L}}_{\text{min}}(z_t)$ to the base task loss.
This general regularization strategy takes the form

$$\mathcal{L} = \mathcal{L}_{\text{task}}(f) + \lambda\, \hat{\mathcal{L}}_{\text{min}}(z_t),$$

where $f$ denotes the original task model.
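The three-step recipe above can be sketched generically. All names below are hypothetical, and a simple linear reconstruction penalty stands in for the paper's variational information bound, purely for illustration:

```python
# Sketch of the model-agnostic recipe (illustrative names, not from the
# paper): take any sequence model's internal representation z_t, attach a
# lightweight linear decoder, and add the minimality term to the task loss.
import numpy as np

def reconstruction_surrogate(z, x, W_dec):
    """Simple stand-in for the information regularizer: squared error of a
    lightweight linear decoder recovering the input x from z."""
    return np.mean((z @ W_dec - x) ** 2)

def mps_regularized_loss(task_loss, z, x, W_dec, lam=0.1):
    """L = L_task + lambda * L_min(z), applicable to any architecture."""
    return task_loss + lam * reconstruction_surrogate(z, x, W_dec)

rng = np.random.default_rng(2)
B, d_z, d_x = 8, 16, 4
z = rng.normal(size=(B, d_z))   # e.g. SSM state or Transformer embedding
x = rng.normal(size=(B, d_x))   # the inputs the representation summarizes
W_dec = rng.normal(size=(d_z, d_x))
total = mps_regularized_loss(task_loss=0.42, z=z, x=x, W_dec=W_dec)
print(total)
```

Since the penalty touches only the chosen representation and an auxiliary decoder, the base model (`MPS-Mamba`, `MPS-DLinear`, `MPS-PatchTST`, etc.) needs no architectural change.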
Empirical evidence demonstrates utility across architectures such as Mamba (MPS-Mamba), linear models (MPS-DLinear), and Transformers (MPS-PatchTST), with consistent improvements on ETT and other datasets (e.g., MPS-PatchTST achieves ETTh1/96 MSE=0.328 compared to 0.360 for vanilla PatchTST).
6. Significance and Implications
MPS-SSM is the first selective SSM whose gating is derived from the information-theoretic requirement that hidden states encode the minimal predictive sufficient statistic. The resulting mutual information penalty confers both empirical state-of-the-art generalization and robustness properties, notably resistance to non-causal and spurious noise. Furthermore, the principle’s generality enables its adoption as an effective regularizer in architectures beyond SSMs, including popular sequence models such as Transformers and linear models (Wang et al., 5 Aug 2025). A plausible implication is the emergence of a new paradigm for designing sequential models grounded in first principles rather than heuristic mechanism design.