
CompreSSM: Efficient State Space Model Compression

Updated 18 February 2026
  • CompreSSM is a suite of techniques for compressing state space models, using balanced truncation and selective gating to reduce memory and computation while maintaining accuracy.
  • It employs in-training model order reduction by discarding low-energy state components, achieving robust performance with significantly reduced model dimensions.
  • The method leverages information-theoretic rate-distortion trade-offs to dynamically balance compression and expressivity in diverse sequence modeling applications.

CompreSSM refers to methods and theoretical constructs for compressing state space models (SSMs) to minimize memory, computation, or storage without incurring significant loss in expressivity or performance. In contemporary literature, CompreSSM specifically denotes two overlapping but distinct research avenues: (1) in-training model order reduction via balanced truncation for discrete linear SSMs, most notably in "The Curious Case of In-Training Compression of State Space Models" (Chahine et al., 3 Oct 2025), and (2) selective memory compression using gating and information-theoretic rate-distortion trade-offs, framed as "Compressive Selective State Space Models" (Bhat, 2024). Both approaches target efficient long-context sequence modeling, providing algorithmic and theoretical tools to overcome the computational and representational bottlenecks typical of large SSMs.

1. State Space Model Compression: Problem Formulation

State space models process sequential data using the recurrence x_{k+1} = A x_k + B u_k, \quad y_k = C x_k + D u_k, where x_k \in \mathbb{R}^n is the hidden state, u_k \in \mathbb{R}^p the input, and y_k \in \mathbb{R}^q the output. A high state dimension n is necessary for modeling long-range dependencies, but creates quadratic per-step update and storage costs. CompreSSM addresses the need to reduce the state dimension n while maintaining or, in some cases, improving performance relative to a scratch-trained low-dimensional counterpart.
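As a concrete illustration, the recurrence above can be sketched in a few lines of NumPy; the system matrices here are arbitrary toy values, not taken from either paper:

```python
import numpy as np

def ssm_step(A, B, C, D, x, u):
    """One step of the linear SSM recurrence:
    x' = A x + B u,  y = C x + D u."""
    x_next = A @ x + B @ u
    y = C @ x + D @ u
    return x_next, y

# Tiny illustrative system (n=3 states, p=q=1); all values arbitrary.
rng = np.random.default_rng(0)
A = 0.5 * np.eye(3)              # stable: spectral radius < 1
B = rng.standard_normal((3, 1))
C = rng.standard_normal((1, 3))
D = np.zeros((1, 1))

x = np.zeros((3, 1))
for u in [np.ones((1, 1))] * 4:  # run four steps on a constant input
    x, y = ssm_step(A, B, C, D, x, u)
```

With A = 0.5 I the state converges geometrically toward 2B under a constant unit input, which makes the per-step cost of the dense n x n update visible: each step is one matrix-vector product in A.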

Model order reduction in SSMs is classically approached via balanced truncation, which exploits the controllability and observability structure encoded in the system matrices (A,B,C,DA,B,C,D) to identify and discard low-energy state components. Modern techniques further introduce adaptive or selective compression mechanisms, such as input-conditioned gating, offering additional avenues for dynamic, data-driven memory savings (Bhat, 2024).

2. Balanced Truncation and In-Training Model Reduction

Balanced truncation is grounded in control theory and leverages the Hankel singular values (HSVs) of an LTI system to quantify the joint controllability and observability of state dimensions. Given the controllability (W_c) and observability (W_o) Gramians, which solve the discrete Lyapunov equations and admit the series representations

W_c = \sum_{k=0}^\infty A^k B B^T (A^T)^k, \quad W_o = \sum_{k=0}^\infty (A^T)^k C^T C A^k,

the HSVs are the sorted square roots of the eigenvalues of W_c W_o. A balanced realization is a similarity transform rendering W_c and W_o diagonal and equal, allowing truncation of the state to the r directions with the largest HSVs, where r is the smallest order satisfying

\sum_{i=1}^{r} \sigma_i \geq \tau \sum_{i=1}^n \sigma_i,

with hyperparameter \tau \in [0,1] (Chahine et al., 3 Oct 2025). The truncated system inherits \mathcal{H}_\infty error guarantees.
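A minimal sketch of this computation, assuming a stable A (spectral radius below 1) so the Gramian series converge; direct term-by-term summation stands in here for a proper Lyapunov solver:

```python
import numpy as np

def gramians(A, B, C, tol=1e-12, max_terms=10_000):
    """Controllability/observability Gramians of a stable discrete LTI
    system, summed term by term (converges when rho(A) < 1)."""
    n = A.shape[0]
    Wc, Wo = np.zeros((n, n)), np.zeros((n, n))
    Ak = np.eye(n)
    for _ in range(max_terms):
        tc = Ak @ B @ B.T @ Ak.T        # A^k B B^T (A^T)^k
        to = Ak.T @ C.T @ C @ Ak        # (A^T)^k C^T C A^k
        Wc += tc
        Wo += to
        if max(np.abs(tc).max(), np.abs(to).max()) < tol:
            break
        Ak = A @ Ak
    return Wc, Wo

def reduced_order(A, B, C, tau=0.99):
    """Hankel singular values, plus the smallest r whose cumulative
    HSV mass reaches a fraction tau of the total."""
    Wc, Wo = gramians(A, B, C)
    hsv = np.sqrt(np.maximum(np.linalg.eigvals(Wc @ Wo).real, 0.0))
    hsv = np.sort(hsv)[::-1]
    r = int(np.searchsorted(np.cumsum(hsv), tau * hsv.sum())) + 1
    return hsv, min(r, hsv.size)
```

In practice one would replace the series summation with a dedicated solver (e.g., `scipy.linalg.solve_discrete_lyapunov`); the returned Gramians satisfy the fixed-point identity A W_c A^T + B B^T = W_c, which makes a convenient sanity check.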

CompreSSM (Chahine et al., 3 Oct 2025) uniquely integrates this reduction into stochastic gradient descent training. At scheduled checkpoints during early optimization (e.g., within the first 10% of steps), blockwise reduction is triggered, collapsing n \to r by computing balancing transforms and discarding low-impact directions. This dynamic in-training reduction leads to more robust and performant small models than standard slim-from-scratch training.

3. Selective Gating and Information-Theoretic Compression

A complementary paradigm, introduced in (Bhat, 2024), views compression as dynamic, selective retention of subspaces through gating:

x_t = G(u_t, x_{t-1}) \odot (A x_{t-1} + B u_t) + [1 - G(u_t, x_{t-1})] \odot x_{t-1} + w_t,

where G(u_t, x_{t-1}) \in [0,1]^d is an input- and state-dependent vector of gates, and \odot denotes elementwise multiplication. This mechanism implements a form of adaptive memory compression, reducing the effective dimensionality

\mathrm{dim}_{\mathrm{eff}}(x_t) = \sum_{i=1}^d g_i(u_t, x_{t-1})

at each time step. The formalism is grounded in information theory, analyzing the trade-off between the mutual information retained in the compressed state \hat{x}_t and the input history u_{1:t}:

R(D) = \min_{p(\hat{x}_t \mid x_t)} I(x_t ; \hat{x}_t) \quad \text{s.t.} \quad \mathbb{E}\|x_t - \hat{x}_t\|^2 \le D.

A tunable L_1 regularization on G promotes sparsity, directly managing this trade-off.
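One gated update step can be sketched as follows, assuming a simple sigmoid-of-affine parameterization of G; the weights Wg, bg are illustrative assumptions, and the noise term w_t is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(A, B, Wg, bg, x_prev, u):
    """Selective update: each coordinate interpolates between the fresh
    SSM state A x + B u and the carried-over state x_prev, with an
    input/state-dependent gate g in [0, 1]^d (noise term omitted)."""
    g = sigmoid(Wg @ np.concatenate([u, x_prev]) + bg)
    x_new = g * (A @ x_prev + B @ u) + (1.0 - g) * x_prev
    eff_dim = g.sum()          # per-step effective state dimension
    return x_new, g, eff_dim
```

Coordinates where the gate saturates near zero are simply carried over, so they need not be recomputed or re-stored at full precision, which is the source of the memory savings.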

Theoretical results establish mean-square convergence under mild conditions and derive explicit rate-distortion bounds linking achievable memory savings to information retention.

4. Algorithmic Workflow and Practical Implementation

The in-training CompreSSM method (Chahine et al., 3 Oct 2025) is realized as follows:

  1. At each designated training checkpoint, extract the SSM weights (A, B, C) per block.
  2. Compute the Gramians (W_c, W_o) and the HSVs.
  3. Determine the reduced order r satisfying the HSV energy threshold.
  4. Compute the balancing similarity transform T.
  5. Transform to balanced coordinates, truncate to r, and write back the reduced matrices.
  6. Resume training with the reduced SSM.
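Steps 4–5 can be sketched with the standard square-root balanced truncation algorithm, assuming the Gramians W_c, W_o have already been computed (e.g., by a Lyapunov solver) and are positive definite; this is a generic textbook construction, not the repository's exact code:

```python
import numpy as np

def balanced_truncate(A, B, C, Wc, Wo, r):
    """Square-root balanced truncation: build the balancing transform
    from Cholesky factors of the Gramians, keep the r largest-HSV
    directions, and return the reduced (A_r, B_r, C_r)."""
    Lc = np.linalg.cholesky(Wc)          # Wc = Lc Lc^T
    Lo = np.linalg.cholesky(Wo)          # Wo = Lo Lo^T
    U, s, Vt = np.linalg.svd(Lo.T @ Lc)  # singular values s = HSVs
    S_half = np.diag(s[:r] ** -0.5)
    T = Lc @ Vt[:r].T @ S_half           # n x r: x = T x_r
    Tinv = S_half @ U[:, :r].T @ Lo.T    # r x n left inverse of T
    return Tinv @ A @ T, Tinv @ B, C @ T
```

Because Tinv is a left inverse of T, choosing r = n recovers an exact similarity transform (identical transfer function), while r < n drops the lowest-HSV directions with the usual H-infinity error bound.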

This process is integrated into standard machine learning frameworks, e.g., PyTorch, with utility modules handling schedule definition, matrix operations, and Lyapunov solvers. Hyperparameters include the energy threshold \tau (e.g., 0.01–0.2), the number of checkpointed compressions (3–5), and the minimum allowed reduction fraction (default 0.95). A single epoch's learning-rate warmup typically hosts all compression checkpoints.
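A minimal sketch of such a checkpoint schedule, assuming evenly spaced compressions inside the warmup window; the defaults below are illustrative, not the repository's:

```python
def compression_schedule(total_steps, n_checkpoints=4, warmup_frac=0.10):
    """Evenly spaced compression checkpoints inside the early-warmup
    window (the method places them within roughly the first 10% of
    training); n_checkpoints and warmup_frac are assumed defaults."""
    window = max(1, int(total_steps * warmup_frac))
    stride = max(1, window // n_checkpoints)
    return [stride * (i + 1) for i in range(n_checkpoints)]
```

For example, a 10,000-step run with four checkpoints would trigger blockwise reduction at steps 250, 500, 750, and 1000, all inside the first epoch's warmup.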

The selective gating CompreSSM approach (Bhat, 2024) implements gating networks parameterized by differentiable functions (e.g., a sigmoid of affine maps of (u_t, x_{t-1})), paired with L_1 regularization for controlled sparsity. Hyperparameters are chosen to target the desired rate-distortion envelope, and the observed effective state dimension is routinely an order of magnitude below the full d.
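The training objective implied here can be sketched as a task loss plus an L_1 penalty on the gate activations; the weight `lam` and the array shapes below are illustrative assumptions:

```python
import numpy as np

def selective_loss(task_loss, gates, lam=1e-3):
    """Total objective = task loss + lam * mean_t ||g_t||_1.
    The L1 term on the gate activations (gates: T x d, values in
    [0, 1]) controls the rate-distortion trade-off: larger lam
    yields sparser gates and a smaller effective state."""
    l1 = np.abs(gates).sum(axis=-1).mean()
    return task_loss + lam * l1, l1

def effective_dim(gates):
    """Average effective state dimension sum_i g_i over time steps."""
    return gates.sum(axis=-1).mean()
```

Since the gates are non-negative, the L_1 penalty and the effective dimension coincide up to the factor lam, which is why tuning lam directly trades memory for accuracy.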

5. Empirical Validation and Performance Characteristics

Empirically, in-training CompreSSM (Chahine et al., 3 Oct 2025) produces SSMs that are both smaller and more expressive than those directly trained at small n. On CIFAR-10, for instance, an n = 384 SSM achieves 86.5% accuracy; CompreSSM compresses it to n = 57 (84.4%, +6.2 pp over the n = 57 scratch-trained baseline) with 4% less training time. On ListOps, a final n ≈ 57 yields 48.3% (CompreSSM) versus 43.4% (vanilla). On MNIST, 95.9% (n ≈ 13) compares to 92.6% for small-from-scratch.

Selective gating SSMs (Bhat, 2024) achieve accuracy and memory trade-offs surpassing standard RNNs/GRUs/LSTMs and classical SSMs, e.g., a 1.5× speedup and a 0.6× or better memory reduction at matching or higher accuracy across time-series, NLP, and signal tasks. The "gate-off" ablation confirms that the gating mechanism is the primary driver of compression without accuracy degradation.

A summary table of performance comparisons appears below.

| Dataset | CompreSSM Accuracy | Baseline/Small SSM | Memory (MB, CompreSSM) | Memory (MB, Baseline) |
|---|---|---|---|---|
| CIFAR-10 | 84.4% (n = 57) | 78.2% (n = 57) | n/a | n/a |
| ListOps | 48.3% (n ≈ 57) | 43.4% (n ≈ 57) | n/a | n/a |
| MNIST | 95.9% (n ≈ 13) | 92.6% (n ≈ 13) | n/a | n/a |
| Time-Series [Selective] | 92.1% | 90.3% (LSTM) | 250 | 400 |
| NLP [Selective] | 85.6% | 82.5% (LSTM) | 210 | 360 |

6. Theoretical Guarantees, Limitations, and Extensions

Balanced truncation guarantees an input-output error bounded above by twice the sum of the discarded HSVs (see Antoulas, §7), and empirical studies support the stability and monotonic ordering of HSVs under gradient-based parameter updates. Selective gating models possess mean-square convergence under mild norm and Lipschitz assumptions and provide rate-distortion-theoretic performance bounds.
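For reference, the classical balanced-truncation bound (Antoulas, §7) can be written explicitly, with G and G_r the transfer functions of the full and order-r systems and \sigma_i the HSVs:

```latex
\| G - G_r \|_{\mathcal{H}_\infty} \;\le\; 2 \sum_{i=r+1}^{n} \sigma_i
```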

Practical limitations include sensitivity to hyperparameters (especially for gating and compression thresholds), theoretical non-constructiveness of rate-distortion inverses, and restricted applicability of linear analysis to nonlinear SSMs. Extensions under consideration include nonlinear and hybrid SSMs, adaptive gating for online learning, multi-task selective gating, and integration of constant-time SSM blocks for ultra-long-context applications (Chahine et al., 3 Oct 2025, Bhat, 2024).

7. Code Availability and Usage Guidelines

A reference implementation for in-training CompreSSM is available at github.com/camail-official/compressm (Chahine et al., 3 Oct 2025). Key components manage compression scheduling, blockwise model reduction, and fast Lyapunov-solving routines. Best practices include setting early-warmup reduction intervals, tuning \tau and the minimum reduction fraction based on validation, and employing blockwise processing for models with many SSM layers.

For selective compression, the regularization magnitudes governing gate sparsity and the gating function architecture should be tuned via validation in accord with target rate-distortion profiles. Use of L_1 penalties and restriction to small Lipschitz constants is advised for convergence stability (Bhat, 2024).

CompreSSM, viewed both as a suite of practical methods and a set of theoretical bounds, forms a rigorous, high-performance compression strategy for modern SSM-based sequence models.
