Memory-Light Compression of Second Moments
- The paper introduces SlimAdam, which leverages SNR-guided thresholding to compress second moments and achieve up to 98% memory savings without sacrificing stability.
- Homomorphic projection and parametric vectorization are used to compress high-dimensional covariance matrices and bilinear features efficiently.
- The SMSO approach normalizes and vectorizes second-order statistics, enabling significant memory reduction in deep visual recognition and large-scale learning tasks.
Memory-light compression of second moments refers to methodologies that dramatically reduce the memory burden associated with storing or operating on second-moment statistics—such as covariance matrices, squared-gradient accumulators, or bilinear feature representations—without incurring a significant loss in information or algorithmic stability. These techniques have become indispensable across optimization in large-scale learning, scientific data analysis, and deep visual recognition, where full second-moment storage is often prohibitive.
1. Second Moments in Modern Algorithms and Their Memory Bottlenecks
Second-moment statistics arise ubiquitously—as moving averages of squared gradients in adaptive optimizers (Adam, RMSProp), as empirical covariance in bilinear pooling, or as correlation matrices in time series analysis. In Adam, the optimizer maintains parameter-wise moving averages:
- First moment: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
- Second moment: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
Storing both $m_t$ and $v_t$ doubles the optimizer's memory compared to first-order methods. For models with billions of parameters, the memory for $v_t$ can severely restrict batch size or model scaling (Kalra et al., 3 Mar 2025). In vision models, naive second-order pooling of $d$-dimensional features leads to $d^2$-dimensional heads; e.g., a feature map with $d$ channels requires handling on the order of $d^2$ covariance elements per image (Yu et al., 2018).
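As a concrete illustration of this optimizer-state cost, the following minimal sketch (plain NumPy; the function name `adam_step` and the toy layer size are illustrative assumptions, not taken from the cited paper) maintains Adam's two moment buffers and shows how much extra memory they require relative to the weights alone.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; m and v are full-size moment buffers."""
    m = b1 * m + (1 - b1) * grads           # first moment: moving average of gradients
    v = b2 * v + (1 - b2) * grads ** 2      # second moment: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy layer: the optimizer state (m, v) is twice the size of the weights themselves.
w = np.random.randn(4096, 4096).astype(np.float32)
m, v = np.zeros_like(w), np.zeros_like(w)
print("weights:", w.nbytes / 2**20, "MiB; optimizer state:", (m.nbytes + v.nbytes) / 2**20, "MiB")
```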
In scientific computing, real-time correlation analysis (e.g., X-ray photon correlation spectroscopy, XPCS) involves computing two-time correlations of the form $C(t_1, t_2) \propto \sum_{p=1}^{N} I_p(t_1)\, I_p(t_2)$ for all frame pairs $t_1, t_2 \in \{1, \dots, T\}$, with $N$ detector pixels per frame. This results in memory and compute requirements that rapidly become unmanageable as $N$ and $T$ grow (Strempfer et al., 29 Jul 2024).
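For intuition, a naive two-time correlation computed directly on the raw pixel matrix looks like the sketch below (shapes, names, and sizes are illustrative assumptions, not values from the cited work):

```python
import numpy as np

# Illustrative sizes: T frames, N detector pixels per frame.
T, N = 500, 100_000
X = np.random.rand(T, N).astype(np.float32)   # raw data matrix, frames x pixels

C = X @ X.T / N                               # T x T two-time correlation, O(T^2 * N) multiply-adds

print(f"raw frames: {X.nbytes / 2**20:.0f} MiB, correlation matrix: {C.nbytes / 2**20:.2f} MiB")
# Megapixel detectors and tens of thousands of frames push the same computation toward
# hundreds of gigabytes of data and ~10^15 operations, which motivates compression.
```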
2. Signal-to-Noise Ratio–Guided Compression in Optimizers
SlimAdam introduces a layer-wise, statistically grounded criterion for compressing second-moment (variance) tensors in Adam. For a second-moment tensor $v$ of shape $(d_1, \dots, d_k)$, compressibility along a dimension set $\mathcal{S}$ is quantified as

$$\mathrm{SNR}_{\mathcal{S}}(v) = \mathbb{E}_{\bar{\mathcal{S}}}\!\left[\frac{\mu_{\mathcal{S}}(v)^2}{\sigma_{\mathcal{S}}^2(v)}\right],$$

where $\mu_{\mathcal{S}}$ and $\sigma_{\mathcal{S}}^2$ denote mean and variance over axes in $\mathcal{S}$, and $\bar{\mathcal{S}}$ are the complementary axes. If $\mathrm{SNR}_{\mathcal{S}} \gg 1$, entries along $\mathcal{S}$ are tightly clustered and the second-moment tensor can be replaced by their mean over $\mathcal{S}$ with minimal performance loss. The trajectory-averaged compressibility,

$$\overline{\mathrm{SNR}}_{\mathcal{S}} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{SNR}_{\mathcal{S}}(v_t),$$

is used for robust decision-making across training (Kalra et al., 3 Mar 2025).
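A minimal sketch of this statistic for a single snapshot (NumPy; the helper name `snr_along` and the axis convention are assumptions for illustration, and the trajectory average would accumulate this over training checkpoints):

```python
import numpy as np

def snr_along(v, axes):
    """SNR of a moment tensor v for averaging over `axes`:
    mean^2 / variance over `axes`, then averaged over the remaining axes."""
    mu = v.mean(axis=axes)
    var = v.var(axis=axes)
    return float(np.mean(mu ** 2 / (var + 1e-12)))   # small eps guards against zero variance

v = np.abs(np.random.randn(768, 3072)).astype(np.float32) ** 2   # toy second moment (d_out x d_in)
print("SNR averaging over rows:", snr_along(v, axes=0))
print("SNR averaging over cols:", snr_along(v, axes=1))
```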
Compression is guided by a threshold $\kappa$ on this quantity:
- For each layer $\ell$ and each candidate axis set $\mathcal{S}$, compute $\overline{\mathrm{SNR}}_{\mathcal{S}}^{(\ell)}$.
- Let $\mathcal{S}^{\star} = \arg\max_{\mathcal{S}} \overline{\mathrm{SNR}}_{\mathcal{S}}^{(\ell)}$.
- If $\overline{\mathrm{SNR}}_{\mathcal{S}^{\star}}^{(\ell)} \geq \kappa$, compress along $\mathcal{S}^{\star}$; else leave that layer's second moment uncompressed.
This procedure realizes up to 98% memory savings for the second moment in Transformers and ResNets, with compression rules derived at small learning rates, without compromising convergence or stability. Other low-memory Adam variants frequently fail at high learning rates, whereas SlimAdam's SNR-guided compression matches full Adam across loss and accuracy curves (Kalra et al., 3 Mar 2025).
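The following sketch shows how a per-layer rule of this kind can be applied and how the compressed state is then used (NumPy; the helper names `snr` and `maybe_compress`, the snapshot SNR in place of the trajectory average, and the threshold value are illustrative assumptions rather than the paper's exact defaults):

```python
import numpy as np

def snr(v, axes):
    mu, var = v.mean(axis=axes), v.var(axis=axes)
    return float(np.mean(mu ** 2 / (var + 1e-12)))

def maybe_compress(v, kappa=1.0):
    """Keep the mean along the highest-SNR axis if it clears the threshold, else keep full v."""
    scores = {0: snr(v, 0), 1: snr(v, 1)}
    axis = max(scores, key=scores.get)
    if scores[axis] >= kappa:
        return v.mean(axis=axis, keepdims=True)   # stored as a single row/column vector
    return v

v = np.abs(np.random.randn(1024, 4096)).astype(np.float32) ** 2   # toy per-layer second moment
v_slim = maybe_compress(v)
print("stored entries:", v_slim.size, "of", v.size)
# In the Adam update, v_slim broadcasts back to the full parameter shape:
# params -= lr * m / (np.sqrt(v_slim) + eps)
```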
3. Homomorphic Linear Compression for Second-Moment Computation
For high-throughput correlation (e.g., in XPCS), the homomorphic compression framework projects the data matrix $X \in \mathbb{R}^{T \times N}$ into a lower-dimensional space,

$$\tilde{X} = X B, \qquad B \in \mathbb{R}^{N \times k}, \quad k \ll N,$$

and directly computes the second moment in the compressed domain:

$$\tilde{C} = \tilde{X}\tilde{X}^{\top} = X B B^{\top} X^{\top} \approx X X^{\top} = C.$$

If $B$ is constructed from the top $k$ right singular vectors of $X$ (as in SVD), the transformation preserves all bilinear forms of the type $x_i^{\top} x_j$ between rows of $X$, i.e., it is a homomorphism for such operations. In the lossless case, $k = \operatorname{rank}(X)$ and $\tilde{C}$ is exact; lossy compression with $k < \operatorname{rank}(X)$ yields the best rank-$k$ approximation in Frobenius norm. Critically, all calculations can proceed in the compressed space, with no need to decompress to full dimension, enabling orders-of-magnitude speedup and memory reduction in practice (Strempfer et al., 29 Jul 2024).
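A minimal NumPy sketch of the projection and the homomorphism check (shapes, rank, and variable names are illustrative assumptions consistent with the description above, not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 200, 5000, 32

# Low-rank-ish data: a few temporal modes plus noise, mimicking correlated frames.
X = rng.standard_normal((T, 8)) @ rng.standard_normal((8, N)) + 0.01 * rng.standard_normal((T, N))

# Compression operator: top-k right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:k].T                      # N x k

X_tilde = X @ B                   # T x k compressed representation
C_full = X @ X.T                  # exact T x T second moment
C_comp = X_tilde @ X_tilde.T      # computed entirely in the compressed domain

rel_err = np.linalg.norm(C_full - C_comp) / np.linalg.norm(C_full)
print(f"relative Frobenius error: {rel_err:.2e}; memory ratio: {X_tilde.size / X.size:.4f}")
```

Because $B$ spans the leading right singular directions, the residual is governed entirely by the neglected singular values, which is the Eckart-Young bound discussed in Section 5.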
4. Parametric and Statistical Second-Order Compression in Deep Networks
Statistically Motivated Second-Order Pooling (SMSO) compresses $d \times d$ covariance matrices into $p$-dimensional Gaussian-distributed representations:
- Parametric Vectorization (PV): given the covariance $\Sigma \in \mathbb{R}^{d \times d}$ of the local features, compute $p$ learnable quadratic forms $y_j = w_j^{\top} \Sigma\, w_j$, $j = 1, \dots, p$. By design, each $y_j$ follows a Wishart-type distribution (quadratic forms of the pooled statistics).
- Gaussianization: each $y_j$ is normalized using a square root and scaling such that, for large degrees of freedom, the result is approximately standard normal. A learnable affine transform is applied, akin to BatchNorm.
All operations are differentiable. In an alternative, more efficient realization, PV can be mapped to $1 \times 1$ convolutions followed by global pooling of the squared responses.
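A minimal sketch of this efficient realization (NumPy; the function name `smso_pool`, the small epsilon, and the toy sizes are illustrative assumptions rather than the authors' released code):

```python
import numpy as np

def smso_pool(feats, W, gamma, beta, eps=1e-6):
    """feats: (n, d) local descriptors; W: (d, p) projection (the '1x1 conv').
    Returns a p-dimensional Gaussian-like second-order descriptor."""
    z = feats @ W                       # (n, p): 1x1-conv responses
    y = (z ** 2).mean(axis=0)           # quadratic forms w_j^T Sigma w_j via global squared pooling
    g = np.sqrt(y + eps)                # square-root Gaussianization
    return gamma * g + beta             # learnable affine, akin to BatchNorm

n, d, p = 196, 512, 64                  # e.g., 14x14 spatial grid, 512 channels, 64-d head
feats = np.random.randn(n, d).astype(np.float32)
W = np.random.randn(d, p).astype(np.float32) / np.sqrt(d)
out = smso_pool(feats, W, gamma=np.ones(p, np.float32), beta=np.zeros(p, np.float32))
print(out.shape)                        # (64,) instead of a d*d = 262,144-dim bilinear head
```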
Empirically, SMSO achieves roughly 10–100× compression of bilinear heads, with the output dimension $p$ typically between 64 and 2048, while outperforming both uncompressed and compact random-projection baselines on major vision datasets (Yu et al., 2018).
5. Error Control, Memory Scaling, and Practical Performance
The memory-light compression approaches are characterized by mathematically analyzable tradeoffs:
- Optimizer compression (SlimAdam): memory savings up to 98% for second moments; performance and stability indistinguishable from full Adam over diverse Transformer, ResNet, and ViT architectures (Kalra et al., 3 Mar 2025).
- Homomorphic sketching: using a rank-$k$ projection, the error in the second-moment estimate is bounded per the Eckart-Young theorem,
$$\|C - \tilde{C}\|_F = \Big(\sum_{i > k} \sigma_i^4\Big)^{1/2},$$
and is controlled by the neglected singular values $\sigma_{k+1}, \sigma_{k+2}, \dots$ of $X$. Choosing $k$ so that the residual energy falls below a small target fraction typically suffices (Strempfer et al., 29 Jul 2024).
- SMSO: for representative choices of $d$ and $p$, SMSO reduces the feature size by 10–100× over bilinear pooling (which scales as $d^2$) and outperforms all tested baselines; SMSO-64 features ($p = 64$) are 128× smaller than CBP-8192 but match or exceed its accuracy (Yu et al., 2018).
The computational cost is reduced correspondingly: slim optimizer states free memory for larger models or batches; homomorphic sketching replaces operations over the $N$-dimensional pixel space with operations over a $k$-dimensional compressed space ($k \ll N$); SMSO reduces the pooled head from $O(d^2)$ to $O(p)$ entries in inference and storage.
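To make these scalings concrete, a back-of-the-envelope comparison with illustrative sizes (none of the specific numbers below are taken from the cited papers):

```python
GiB = 2**30

# Adam second moment for a 1-billion-parameter model in fp32 vs. a 98%-compressed state.
n_params = 1_000_000_000
print(f"full v: {4 * n_params / GiB:.2f} GiB, slim v: {4 * n_params * 0.02 / GiB:.2f} GiB")

# XPCS-style data matrix: T frames x N pixels vs. k-dimensional compressed rows.
T, N, k = 10_000, 1_000_000, 100
print(f"raw X: {4 * T * N / GiB:.1f} GiB, compressed X~: {4 * T * k / GiB:.4f} GiB")

# Bilinear head (d^2 values) vs. SMSO head (p values) per image.
d, p = 512, 64
print(f"bilinear head: {d * d} values, SMSO head: {p} values")
```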
6. Deployment Guidelines, Insights, and Limitations
- SlimAdam: Set the SNR threshold to its recommended default. Rules for which layers and axes to compress should be derived at a small learning rate and retained for the full optimization. Inconsistent compressibility is observed in certain MLP and “Gate” layers, and token embeddings may necessitate uncompressed storage due to low SNR, which is critical for handling rare tokens (Kalra et al., 3 Mar 2025).
- Homomorphic approaches: optimal for second-moment computations and other linear/bilinear forms; they do not extend to higher-order or nonlinear statistics. For stationary signals, fixed compression operators suffice, but drift in the signal subspace may warrant re-computation of the compression basis (Strempfer et al., 29 Jul 2024).
- SMSO: The parametric vectorization approach preserves second-order statistical structure and regularizes the compressed features toward Gaussianity, ensuring effective gradient flow and compatibility with first-order training paradigms. The value of $p$ can be selected to match hardware or deployment constraints, with empirical evidence favoring modest values for most applications. No SVD or matrix-power operations are needed at inference (Yu et al., 2018).
Potential limitations noted include reduced compressibility for fine-tuning or in highly nonstationary datasets (SlimAdam), possible loss of fidelity for low signal-to-noise eigenmodes (homomorphic sketching), and overfitting risk if $p$ is chosen too large relative to dataset scale in SMSO. Re-evaluation of the compression rule or basis may be needed mid-training if gradient statistics shift significantly.
7. Summary Table: Major Memory-Light Second-Moment Compression Methods
| Method | Domain | Mechanism | Savings (Typical) |
|---|---|---|---|
| SlimAdam | Optimizer (Adam) | Layer- and axis-wise SNR, aggregate & threshold, mean/broadcast substitution | Up to 98% |
| Homomorphic sketch | XPCS, time series | SVD-based projection, computations on compressed space | Up to 10,000× |
| SMSO | Deep vision pooling | Parametric quadratic form, square-root, learnable affine | 10–100× |
These advances offer principled, practical solutions for compressing second-moment information, enabling large-scale learning and real-time analytics previously infeasible due to memory or computational constraints (Kalra et al., 3 Mar 2025, Strempfer et al., 29 Jul 2024, Yu et al., 2018).