
CompreSSM: Efficient State Space Model Compression

Updated 18 February 2026
  • CompreSSM is a suite of techniques for compressing state space models, using balanced truncation and selective gating to reduce memory and computation while maintaining accuracy.
  • It employs in-training model order reduction by discarding low-energy state components, achieving robust performance with significantly reduced model dimensions.
  • The method leverages information-theoretic rate-distortion trade-offs to dynamically balance compression and expressivity in diverse sequence modeling applications.

CompreSSM refers to methods and theoretical constructs for compressing state space models (SSMs) to minimize memory, computation, or storage without incurring significant loss in expressivity or performance. In contemporary literature, CompreSSM specifically denotes two overlapping but distinct research avenues: (1) in-training model order reduction via balanced truncation for discrete linear SSMs, most notably in "The Curious Case of In-Training Compression of State Space Models" (Chahine et al., 3 Oct 2025), and (2) selective memory compression using gating and information-theoretic rate-distortion trade-offs, framed as "Compressive Selective State Space Models" (Bhat, 2024). Both approaches target efficient long-context sequence modeling, providing algorithmic and theoretical tools to overcome the computational and representational bottlenecks typical of large SSMs.

1. State Space Model Compression: Problem Formulation

State space models process sequential data using the recurrence x_{k+1} = A x_k + B u_k, \quad y_k = C x_k + D u_k, where x_k \in \mathbb{R}^n is the hidden state, u_k \in \mathbb{R}^p the input, and y_k \in \mathbb{R}^q the output. A high state dimension n is necessary for modeling long-range dependencies, but creates quadratic per-step update and storage costs. CompreSSM addresses the need to reduce the state dimension n while maintaining or, in some cases, improving performance relative to a scratch-trained low-dimensional counterpart.
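As a concrete illustration, the recurrence above can be sketched in a few lines of NumPy; the system matrices here are arbitrary toy values, not taken from either paper:

```python
import numpy as np

def ssm_step(A, B, C, D, x, u):
    """One step of the linear SSM recurrence:
    x' = A x + B u,  y = C x + D u."""
    x_next = A @ x + B @ u
    y = C @ x + D @ u
    return x_next, y

# Tiny illustrative system (n=3 states, p=q=1); all values arbitrary.
rng = np.random.default_rng(0)
A = 0.5 * np.eye(3)              # stable: spectral radius < 1
B = rng.standard_normal((3, 1))
C = rng.standard_normal((1, 3))
D = np.zeros((1, 1))

x = np.zeros((3, 1))
for u in [np.ones((1, 1))] * 4:  # run four steps on a constant input
    x, y = ssm_step(A, B, C, D, x, u)
```

With A = 0.5 I the state converges geometrically toward 2B under a constant unit input, which makes the per-step cost of the dense n x n update visible: each step is one matrix-vector product in A.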

Model order reduction in SSMs is classically approached via balanced truncation, which exploits the controllability and observability structure encoded in the system matrices (A,B,C,DA,B,C,D) to identify and discard low-energy state components. Modern techniques further introduce adaptive or selective compression mechanisms, such as input-conditioned gating, offering additional avenues for dynamic, data-driven memory savings (Bhat, 2024).

2. Balanced Truncation and In-Training Model Reduction

Balanced truncation is grounded in control theory and leverages the Hankel singular values (HSVs) of an LTI system to quantify the joint controllability and observability of state dimensions. Given the controllability (W_c) and observability (W_o) Gramians, which solve the discrete Lyapunov equations and admit the series representations

W_c = \sum_{k=0}^\infty A^k B B^T (A^T)^k, \quad W_o = \sum_{k=0}^\infty (A^T)^k C^T C A^k,

the HSVs are the sorted square roots of the eigenvalues of W_c W_o. A balanced realization is a similarity transform rendering W_c and W_o diagonal and equal, allowing truncation of the state to the r directions with the largest HSVs, where r is the smallest order satisfying

\sum_{i=1}^{r} \sigma_i \geq \tau \sum_{i=1}^n \sigma_i,

with hyperparameter \tau \in [0,1] (Chahine et al., 3 Oct 2025). The truncated system inherits \mathcal{H}_\infty error guarantees.
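A minimal sketch of this computation, assuming a stable A (spectral radius below 1) so the Gramian series converge; direct term-by-term summation stands in here for a proper Lyapunov solver:

```python
import numpy as np

def gramians(A, B, C, tol=1e-12, max_terms=10_000):
    """Controllability/observability Gramians of a stable discrete LTI
    system, summed term by term (converges when rho(A) < 1)."""
    n = A.shape[0]
    Wc, Wo = np.zeros((n, n)), np.zeros((n, n))
    Ak = np.eye(n)
    for _ in range(max_terms):
        tc = Ak @ B @ B.T @ Ak.T        # A^k B B^T (A^T)^k
        to = Ak.T @ C.T @ C @ Ak        # (A^T)^k C^T C A^k
        Wc += tc
        Wo += to
        if max(np.abs(tc).max(), np.abs(to).max()) < tol:
            break
        Ak = A @ Ak
    return Wc, Wo

def reduced_order(A, B, C, tau=0.99):
    """Hankel singular values, plus the smallest r whose cumulative
    HSV mass reaches a fraction tau of the total."""
    Wc, Wo = gramians(A, B, C)
    hsv = np.sqrt(np.maximum(np.linalg.eigvals(Wc @ Wo).real, 0.0))
    hsv = np.sort(hsv)[::-1]
    r = int(np.searchsorted(np.cumsum(hsv), tau * hsv.sum())) + 1
    return hsv, min(r, hsv.size)
```

In practice one would replace the series summation with a dedicated solver (e.g., `scipy.linalg.solve_discrete_lyapunov`); the returned Gramians satisfy the fixed-point identity A W_c A^T + B B^T = W_c, which makes a convenient sanity check.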

CompreSSM (Chahine et al., 3 Oct 2025) uniquely integrates this reduction into stochastic gradient descent training. At scheduled checkpoints during early optimization (e.g., within the first 10% of steps), blockwise reduction is triggered, collapsing n \to r by computing balancing transforms and discarding low-impact directions. This dynamic in-training reduction leads to more robust and performant small models than standard slim-from-scratch training.

3. Selective Gating and Information-Theoretic Compression

A complementary paradigm, introduced in (Bhat, 2024), views compression as dynamic, selective retention of subspaces through gating:

x_t = G(u_t, x_{t-1}) \odot (A x_{t-1} + B u_t) + [1 - G(u_t, x_{t-1})] \odot x_{t-1} + w_t,

where G(u_t, x_{t-1}) \in [0,1]^d is an input- and state-dependent vector of gates, and \odot denotes elementwise multiplication. This mechanism implements a form of adaptive memory compression, reducing the effective dimensionality

\mathrm{dim}_{\mathrm{eff}}(x_t) = \sum_{i=1}^d g_i(u_t, x_{t-1})

at each time step. The formalism is grounded in information theory, analyzing the trade-off between the mutual information retained in the compressed state \hat{x}_t and the input history u_{1:t}:

R(D) = \min_{p(\hat{x}_t \mid x_t)} I(x_t ; \hat{x}_t) \quad \text{s.t.} \quad \mathbb{E}\|x_t - \hat{x}_t\|^2 \le D.

A tunable L_1 regularization on G promotes sparsity, directly managing this trade-off.
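One gated update step can be sketched as follows, assuming a simple sigmoid-of-affine parameterization of G; the weights Wg, bg are illustrative assumptions, and the noise term w_t is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(A, B, Wg, bg, x_prev, u):
    """Selective update: each coordinate interpolates between the fresh
    SSM state A x + B u and the carried-over state x_prev, with an
    input/state-dependent gate g in [0, 1]^d (noise term omitted)."""
    g = sigmoid(Wg @ np.concatenate([u, x_prev]) + bg)
    x_new = g * (A @ x_prev + B @ u) + (1.0 - g) * x_prev
    eff_dim = g.sum()          # per-step effective state dimension
    return x_new, g, eff_dim
```

Coordinates where the gate saturates near zero are simply carried over, so they need not be recomputed or re-stored at full precision, which is the source of the memory savings.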

Theoretical results establish mean-square convergence under mild conditions and derive explicit rate-distortion bounds linking achievable memory savings to information retention.

4. Algorithmic Workflow and Practical Implementation

The in-training CompreSSM method (Chahine et al., 3 Oct 2025) is realized as follows:

  1. At each designated training checkpoint, extract the SSM weights (A, B, C) per block.
  2. Compute the Gramians (W_c, W_o) and the HSVs.
  3. Determine the reduced order r satisfying the HSV energy threshold.
  4. Compute the balancing similarity transform T.
  5. Transform to balanced coordinates, truncate to r, and write back the reduced matrices.
  6. Resume training with the reduced SSM.
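Steps 4–5 can be sketched with the standard square-root balanced truncation algorithm, assuming the Gramians W_c, W_o have already been computed (e.g., by a Lyapunov solver) and are positive definite; this is a generic textbook construction, not the repository's exact code:

```python
import numpy as np

def balanced_truncate(A, B, C, Wc, Wo, r):
    """Square-root balanced truncation: build the balancing transform
    from Cholesky factors of the Gramians, keep the r largest-HSV
    directions, and return the reduced (A_r, B_r, C_r)."""
    Lc = np.linalg.cholesky(Wc)          # Wc = Lc Lc^T
    Lo = np.linalg.cholesky(Wo)          # Wo = Lo Lo^T
    U, s, Vt = np.linalg.svd(Lo.T @ Lc)  # singular values s = HSVs
    S_half = np.diag(s[:r] ** -0.5)
    T = Lc @ Vt[:r].T @ S_half           # n x r: x = T x_r
    Tinv = S_half @ U[:, :r].T @ Lo.T    # r x n left inverse of T
    return Tinv @ A @ T, Tinv @ B, C @ T
```

Because Tinv is a left inverse of T, choosing r = n recovers an exact similarity transform (identical transfer function), while r < n drops the lowest-HSV directions with the usual H-infinity error bound.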

This process is integrated into standard machine learning frameworks, e.g., PyTorch, with utility modules handling schedule definition, matrix operations, and Lyapunov solvers. Hyperparameters include the energy threshold \tau (e.g., 0.01–0.2), the number of checkpointed compressions (3–5), and the minimum allowed reduction fraction (default 0.95). A single epoch's learning-rate warmup typically hosts all compression checkpoints.
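A minimal sketch of such a checkpoint schedule, assuming evenly spaced compressions inside the warmup window; the defaults below are illustrative, not the repository's:

```python
def compression_schedule(total_steps, n_checkpoints=4, warmup_frac=0.10):
    """Evenly spaced compression checkpoints inside the early-warmup
    window (the method places them within roughly the first 10% of
    training); n_checkpoints and warmup_frac are assumed defaults."""
    window = max(1, int(total_steps * warmup_frac))
    stride = max(1, window // n_checkpoints)
    return [stride * (i + 1) for i in range(n_checkpoints)]
```

For example, a 10,000-step run with four checkpoints would trigger blockwise reduction at steps 250, 500, 750, and 1000, all inside the first epoch's warmup.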

The selective gating CompreSSM approach (Bhat, 2024) implements gating networks parameterized by differentiable functions (e.g., a sigmoid of affine maps of (u_t, x_{t-1})), paired with L_1 regularization for controlled sparsity. Hyperparameters are chosen to target the desired rate-distortion envelope, and the observed effective state dimension is routinely an order of magnitude below the full d.
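The training objective implied here can be sketched as a task loss plus an L_1 penalty on the gate activations; the weight `lam` and the array shapes below are illustrative assumptions:

```python
import numpy as np

def selective_loss(task_loss, gates, lam=1e-3):
    """Total objective = task loss + lam * mean_t ||g_t||_1.
    The L1 term on the gate activations (gates: T x d, values in
    [0, 1]) controls the rate-distortion trade-off: larger lam
    yields sparser gates and a smaller effective state."""
    l1 = np.abs(gates).sum(axis=-1).mean()
    return task_loss + lam * l1, l1

def effective_dim(gates):
    """Average effective state dimension sum_i g_i over time steps."""
    return gates.sum(axis=-1).mean()
```

Since the gates are non-negative, the L_1 penalty and the effective dimension coincide up to the factor lam, which is why tuning lam directly trades memory for accuracy.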

5. Empirical Validation and Performance Characteristics

Empirically, in-training CompreSSM (Chahine et al., 3 Oct 2025) produces SSMs that are both smaller and more expressive than those directly trained at small n. On CIFAR-10, for instance, an n = 384 SSM achieves 86.5% accuracy; CompreSSM compresses it to n = 57 (84.4%, +6.2 pp over the n = 57 scratch-trained baseline) with 4% less training time. On ListOps, a final n ≈ 57 yields 48.3% (CompreSSM) versus 43.4% (vanilla). On MNIST, 95.9% (n ≈ 13) compares to 92.6% for small-from-scratch.

Selective gating SSMs (Bhat, 2024) achieve accuracy and memory trade-offs surpassing standard RNNs/GRUs/LSTMs and classical SSMs, e.g., a 1.5× speedup and a 0.6× or better memory reduction at matching or higher accuracy across time-series, NLP, and signal tasks. The "gate-off" ablation confirms that the gating mechanism is the primary driver of compression without accuracy degradation.

A summary table of performance comparisons appears below.

| Dataset | CompreSSM Accuracy | Baseline/Small SSM | Memory (MB, CompreSSM) | Memory (MB, Baseline) |
|---|---|---|---|---|
| CIFAR-10 | 84.4% (n = 57) | 78.2% (n = 57) | n/a | n/a |
| ListOps | 48.3% (n ≈ 57) | 43.4% (n ≈ 57) | n/a | n/a |
| MNIST | 95.9% (n ≈ 13) | 92.6% (n ≈ 13) | n/a | n/a |
| Time-Series [Selective] | 92.1% | 90.3% (LSTM) | 250 | 400 |
| NLP [Selective] | 85.6% | 82.5% (LSTM) | 210 | 360 |

6. Theoretical Guarantees, Limitations, and Extensions

Balanced truncation guarantees an input-output error bounded above by twice the sum of the discarded HSVs (see Antoulas, §7), and empirical studies support the stability and monotonic ordering of HSVs under gradient-based parameter updates. Selective gating models possess mean-square convergence under mild norm and Lipschitz assumptions and provide rate-distortion-theoretic performance bounds.
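For reference, the classical balanced-truncation bound (Antoulas, §7) can be written explicitly, with G and G_r the transfer functions of the full and order-r systems and \sigma_i the HSVs:

```latex
\| G - G_r \|_{\mathcal{H}_\infty} \;\le\; 2 \sum_{i=r+1}^{n} \sigma_i
```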

Practical limitations include sensitivity to hyperparameters (especially for gating and compression thresholds), theoretical non-constructiveness of rate-distortion inverses, and restricted applicability of linear analysis to nonlinear SSMs. Extensions under consideration include nonlinear and hybrid SSMs, adaptive gating for online learning, multi-task selective gating, and integration of constant-time SSM blocks for ultra-long-context applications (Chahine et al., 3 Oct 2025, Bhat, 2024).

7. Code Availability and Usage Guidelines

A reference implementation for in-training CompreSSM is available at github.com/camail-official/compressm (Chahine et al., 3 Oct 2025). Key components manage compression scheduling, blockwise model reduction, and fast Lyapunov-solving routines. Best practices include setting early-warmup reduction intervals, tuning \tau and the minimum reduction fraction based on validation, and employing blockwise processing for models with many SSM layers.

For selective compression, the regularization magnitudes governing gate sparsity and the gating function architecture should be tuned via validation in accord with target rate-distortion profiles. Use of L_1 penalties and restriction to small Lipschitz constants is advised for convergence stability (Bhat, 2024).

CompreSSM, viewed both as a suite of practical methods and a set of theoretical bounds, forms a rigorous, high-performance compression strategy for modern SSM-based sequence models.
