
Minimal Predictive Sufficiency SSM

Updated 26 February 2026
  • The paper introduces an info-theoretic framework where the hidden state is a minimal predictive sufficient statistic for accurate future forecasting.
  • It employs a relaxed Lagrangian objective combining prediction loss with an information regularizer to compress non-causal history efficiently.
  • Empirical results show MPS-SSM achieves state-of-the-art accuracy and robustness against noisy inputs across multiple time-series benchmarks.

The Minimal Predictive Sufficiency State Space Model (MPS-SSM) is a sequence modeling framework whose content-selective state gating is derived from a first-principles information-theoretic criterion. MPS-SSM builds on the principle that the model's hidden state should be a minimal sufficient statistic of the past for predicting the future. The result is a model that maximally compresses historical context, learns to ignore non-causal information, and exhibits robustness and accuracy across long-horizon and noisy forecasting scenarios (Wang et al., 5 Aug 2025).

1. Principle of Predictive Sufficiency

The central theoretical construct underlying MPS-SSM is the principle of predictive sufficiency. For a sequence (U_{1:t}, Y_{t:t+\tau}), where U_{1:t} is the observed history and Y_{t:t+\tau} denotes a segment of future targets, MPS-SSM demands that the hidden state h_t at every time t satisfy two criteria:

  • Predictive sufficiency: The hidden state h_t must retain all information in U_{1:t} relevant for predicting Y_{t:t+\tau}; formally, I(h_t; Y_{t:t+\tau}) = I(U_{1:t}; Y_{t:t+\tau}).
  • Minimality: Among all statistics satisfying sufficiency, h_t should minimize I(U_{1:t}; h_t), i.e., I(U_{1:t}; h_t) \le I(U_{1:t}; h_t') for any h_t' also satisfying sufficiency.

Collectively, these constraints characterize h_t as a minimal predictive sufficient statistic and can be formalized as the optimization problem

\min_{p(h_t \mid U_{1:t})} I(U_{1:t}; h_t) \quad \text{s.t.} \quad I(h_t; Y_{t:t+\tau}) = I(U_{1:t}; Y_{t:t+\tau}).

This setup ensures that the hidden state captures only the causal structure necessary for accurate sequence prediction and discards spurious or non-predictive variability.
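The two criteria can be made concrete on a toy discrete example. The sketch below (my own illustration, not from the paper) builds a history U = (causal bit, noise bit) whose future target Y equals the causal bit alone; the statistic h that keeps only the causal bit is predictively sufficient, I(h; Y) = I(U; Y), while compressing the history, I(U; h) < H(U).

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# History U = (causal bit c, noise bit n), uniform over its 4 values,
# enumerated as index 2*c + n; the future target Y equals c.
p_uy = np.zeros((4, 2))
for c in (0, 1):
    for n in (0, 1):
        p_uy[2 * c + n, c] = 0.25  # p(U=(c,n), Y=c)

# Candidate state h = c (the noise bit is dropped): joint p(U, h)
p_uh = p_uy.copy()
# Joint p(h, Y): h = c = Y, so probability mass sits on the diagonal
p_hy = np.array([[0.5, 0.0], [0.0, 0.5]])

i_uy = mutual_information(p_uy)  # I(U; Y) = log 2
i_hy = mutual_information(p_hy)  # I(h; Y) = log 2  (sufficiency holds)
i_uh = mutual_information(p_uh)  # I(U; h) = log 2 < H(U) = log 4 (compression)
print(i_uy, i_hy, i_uh)
```

Any state that also stored the noise bit would satisfy sufficiency but cost I(U; h) = log 4, so minimality selects the causal bit alone.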

2. MPS-SSM Objective Function Derivation

Directly enforcing the constraint I(h_t; Y_{t:t+\tau}) = I(U_{1:t}; Y_{t:t+\tau}) is intractable, so MPS-SSM introduces a relaxed Lagrangian objective. The predictive sufficiency criterion is represented by a standard prediction loss,

\mathcal{L}_\mathrm{Pred} = \frac{1}{T\tau} \sum_{t=1}^T \sum_{i=1}^\tau \mathbb{E}_{p(h_t \mid U_{1:t})} \left[\|\hat{y}_{t+i}(h_t) - y_{t+i}\|^2\right],

while the minimality term is realized as an information-theoretic regularizer,

\mathcal{L}_\mathrm{Min} = \frac{1}{T} \sum_{t=1}^T I(U_{1:t}; h_t).

The total objective becomes

\mathcal{L}_\mathrm{Total} = \mathcal{L}_\mathrm{Pred} + \lambda\,\mathcal{L}_\mathrm{Min}

with \lambda > 0 balancing prediction performance and information compression.

As direct computation of I(U_{1:t}; h_t) is intractable, MPS-SSM employs a variational upper bound using a decoder q_\theta(u_t \mid h_t):

I(U_{1:t}; h_t) \le \mathbb{E}_{p(U_{1:t}, h_t)}\left[-\log q_\theta(u_t \mid h_t)\right] + \mathrm{const},

enabling practical and stable optimization via backpropagation with

\mathcal{L}_\mathrm{Min} \approx \frac{1}{T} \sum_{t=1}^T \left[-\log q_\theta(u_t \mid h_t)\right].
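For a Gaussian decoder with unit variance, -\log q_\theta(u_t \mid h_t) reduces to a squared reconstruction error plus a constant, so the regularizer becomes a per-step reconstruction loss. The numpy sketch below illustrates this under that assumption; the linear decoder, the stand-in prediction loss value, and all dimensions are hypothetical choices for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_log_q(u, h, W, b, sigma=1.0):
    """-log q_theta(u | h) for a Gaussian decoder with mean h @ W + b,
    up to the additive constant 0.5 * d * log(2 * pi * sigma**2)."""
    mean = h @ W + b
    return 0.5 * np.sum((u - mean) ** 2, axis=-1) / sigma**2

# Toy sequence of T inputs u_t (dim d_u) and hidden states h_t (dim d_h)
T, d_u, d_h = 16, 3, 8
u = rng.normal(size=(T, d_u))
h = rng.normal(size=(T, d_h))
W = rng.normal(scale=0.1, size=(d_h, d_u))
b = np.zeros(d_u)

# L_Min ~= (1/T) sum_t -log q_theta(u_t | h_t)
l_min = neg_log_q(u, h, W, b).mean()

# L_Total = L_Pred + lambda * L_Min  (L_Pred is a stand-in value here)
l_pred = 0.42
lam = 2.0
l_total = l_pred + lam * l_min
print(l_total)
```

In training, the gradient of l_min flows into both the decoder parameters (W, b) and whatever network produced h, which is what pressures the state to discard input detail it does not need.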

3. Architecture and Training Methodology

MPS-SSM extends a content-selective SSM backbone—such as Mamba—by integrating several key modules:

  • Selection Gate G_\phi(u_t): Computes adaptive state-space parameters (\Delta_t, B_t, C_t) conditioned on each input u_t.
  • SSM Recurrence: The core transition follows

h_t = A(\Delta_t) h_{t-1} + B_t u_t, \quad \hat{y}_{t+1} = C_t h_t

where A(\Delta_t) = \exp(\Delta_t A) is approximated via zero-order hold or NPLR techniques.

  • Minimality Module: A lightweight decoder q_\theta(u_t \mid h_t) reconstructs u_t from h_t to facilitate the variational information regularization.
  • Prediction Head: Projects h_t into the target predictions \{\hat{y}_{t+i}\}_{i=1}^\tau.
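The gated recurrence above can be sketched in a few lines of numpy. This is a minimal toy, not the paper's implementation: A is taken diagonal so that \exp(\Delta_t A) is elementwise, the gate is a fixed random projection with a softplus to keep \Delta_t positive, and B_t, C_t are held static for brevity (a real selective SSM would condition them on u_t as well).

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_u, d_y, T = 4, 2, 2, 10  # toy dimensions and sequence length

# Stable diagonal continuous-time transition matrix A
a_diag = -rng.uniform(0.5, 1.5, size=d_h)

# Hypothetical gate parameters (a real model learns these projections)
W_delta = rng.normal(scale=0.5, size=d_u)
W_B = rng.normal(scale=0.5, size=(d_h, d_u))
W_C = rng.normal(scale=0.5, size=(d_y, d_h))

def selection_gate(u_t):
    """G_phi(u_t): input-conditioned (Delta_t, B_t, C_t)."""
    delta_t = np.log1p(np.exp(W_delta @ u_t))  # softplus keeps the step positive
    return delta_t, W_B, W_C  # B_t, C_t kept static here for brevity

u_seq = rng.normal(size=(T, d_u))
h = np.zeros(d_h)
preds = []
for u_t in u_seq:
    delta_t, B_t, C_t = selection_gate(u_t)
    A_bar = np.exp(delta_t * a_diag)  # exp(Delta_t A) is elementwise for diagonal A
    h = A_bar * h + B_t @ u_t         # h_t = A(Delta_t) h_{t-1} + B_t u_t
    preds.append(C_t @ h)             # y_hat_{t+1} = C_t h_t

preds = np.array(preds)
print(preds.shape)  # (10, 2)
```

Because a_diag is negative and \Delta_t > 0, each entry of A_bar lies in (0, 1), so the recurrence is stable and the gate controls how quickly past state decays for each input.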

Training is conducted over entire unrolled sequences, jointly optimizing \phi (gate), \theta (decoder), and the SSM matrices to minimize \mathcal{L}_\mathrm{Total} with standard first-order methods. The procedure scales efficiently to large-scale time-series tasks.

Training Workflow Table

| Step | Operation | Output |
|------|-----------|--------|
| Selection gate | (\Delta_t, B_t, C_t) \leftarrow G_\phi(u_t) | Adaptive parameters |
| SSM recurrence | h_t \leftarrow A(\Delta_t) h_{t-1} + B_t u_t | Hidden state |
| Prediction | \hat{y}_{t+i} \leftarrow C_t h_t | Forecasted values |
| Minimality module | -\log q_\theta(u_t \mid h_t) | Information loss |
| Backpropagation | \nabla \mathcal{L}_\mathrm{Total} | Parameter update |

4. Empirical Results and Robustness Analysis

MPS-SSM has been evaluated on established sequence modeling and forecasting benchmarks, including ETT (ETTh1/2, ETTm1/2), Weather, Electricity, Traffic, and Exchange, across forecast horizons of 96, 192, 336, and 720 steps, with accuracy measured via MSE and MAE.

Key findings include:

  • Optimal Regularization (\lambda) Sensitivity: Each dataset and horizon displays a “sweet-spot” \lambda (e.g., ETTh1: \lambda \approx 2.0; Weather: \lambda \approx 0.5; ETTm2: \lambda \approx 100), and the optimal \lambda increases with forecast length.
  • State-of-the-Art Accuracy:
    • On ETTh1 (96), MPS-SSM achieves MSE = 0.375, second only to PatchTST (0.360).
    • On ETTm2 (96), MSE = 0.165, outperforming PatchTST (0.224).
    • On Electricity (96), MSE = 0.151 (vs. next-best 0.225).
    • On long horizons, MPS-SSM routinely ranks best or second-best.
  • Robustness to Noise: Under impulse-noise perturbations to inputs, increasing \lambda monotonically reduces forecast-error degradation; at \lambda = 100, degradation is approximately threefold lower than at \lambda = 0. This empirically validates the theoretical prediction that MPS-SSM is resilient to non-causal, spurious input patterns.

5. Generalization to a Regularization Framework

The MPS principle is not restricted to SSMs and can be instantiated as a model-agnostic regularizer for any sequential architecture. This extension involves:

  1. Selecting an internal representation z_t (e.g., an SSM state, a Transformer embedding, or a linear hidden vector).
  2. Attaching a lightweight decoder q_\psi(u_t \mid z_t).
  3. Adding the minimality regularizer \mathcal{L}_\mathrm{Min} = \frac{1}{T} \sum_t [-\log q_\psi(u_t \mid z_t)] to the base task loss.

This general regularization strategy takes the form

\mathcal{L}_\mathrm{Total} = \mathcal{L}_\mathrm{Task}(f) + \lambda \frac{1}{T} \sum_{t=1}^T \left[-\log q_\psi(u_t \mid z_t)\right],

where f denotes the original task model.
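Because the recipe only needs access to some per-step representation z_t, it can be wrapped around any base model as a single helper. The sketch below assumes (as in the earlier section) a unit-variance Gaussian decoder, so -\log q_\psi reduces to a squared reconstruction error; the linear decoder and the stand-in task-loss value are illustrative, not from the paper.

```python
import numpy as np

def mps_regularized_loss(task_loss, z_seq, u_seq, decoder, lam):
    """L_Total = L_Task + lam * (1/T) sum_t -log q_psi(u_t | z_t).

    `decoder(z_t)` returns the mean of a unit-variance Gaussian q_psi,
    so -log q_psi reduces to 0.5 * ||u_t - decoder(z_t)||**2 (+ const).
    Works for any architecture that exposes per-step representations z_t.
    """
    recon = np.stack([decoder(z) for z in z_seq])
    l_min = 0.5 * np.mean(np.sum((u_seq - recon) ** 2, axis=-1))
    return task_loss + lam * l_min

# Usage with a hypothetical linear decoder on top of arbitrary states z_t
rng = np.random.default_rng(2)
T, d_z, d_u = 8, 6, 3
z_seq = rng.normal(size=(T, d_z))   # e.g., Transformer embeddings or SSM states
u_seq = rng.normal(size=(T, d_u))   # the inputs being reconstructed
W = rng.normal(scale=0.1, size=(d_z, d_u))

total = mps_regularized_loss(task_loss=0.3, z_seq=z_seq, u_seq=u_seq,
                             decoder=lambda z: z @ W, lam=0.5)
print(total)
```

Swapping in a Transformer's patch embeddings or a linear model's hidden vector for z_seq is all that changes between MPS-PatchTST, MPS-DLinear, and MPS-Mamba in this view.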

Empirical evidence demonstrates utility across architectures such as Mamba (MPS-Mamba), linear models (MPS-DLinear), and Transformers (MPS-PatchTST), with consistent improvements on ETT and other datasets (e.g., MPS-PatchTST achieves ETTh1/96 MSE=0.328 compared to 0.360 for vanilla PatchTST).

6. Significance and Implications

MPS-SSM is the first selective SSM whose gating is derived from the information-theoretic requirement that hidden states encode the minimal predictive sufficient statistic. The resulting mutual information penalty confers both empirical state-of-the-art generalization and robustness properties, notably resistance to non-causal and spurious noise. Furthermore, the principle’s generality enables its adoption as an effective regularizer in architectures beyond SSMs, including popular sequence models such as Transformers and linear models (Wang et al., 5 Aug 2025). A plausible implication is the emergence of a new paradigm for designing sequential models grounded in first principles rather than heuristic mechanism design.
