
Memory-Light Compression of Second Moments

Updated 4 December 2025
  • The paper introduces SlimAdam, which leverages SNR-guided thresholding to compress second moments and achieve up to 98% memory savings without sacrificing stability.
  • Homomorphic projection and parametric vectorization compress high-dimensional covariance matrices and bilinear features, respectively, so that downstream computations can run directly on the compressed representations.
  • The SMSO approach normalizes and vectorizes second-order statistics, enabling significant memory reduction in deep visual recognition and large-scale learning tasks.

Memory-light compression of second moments refers to methodologies that dramatically reduce the memory burden associated with storing or operating on second-moment statistics—such as covariance matrices, squared-gradient accumulators, or bilinear feature representations—without incurring a significant loss in information or algorithmic stability. These techniques have become indispensable across optimization in large-scale learning, scientific data analysis, and deep visual recognition, where full second-moment storage is often prohibitive.

1. Second Moments in Modern Algorithms and Their Memory Bottlenecks

Second-moment statistics arise ubiquitously—as moving averages of squared gradients in adaptive optimizers (Adam, RMSProp), as empirical covariance in bilinear pooling, or as correlation matrices in time series analysis. In Adam, the optimizer maintains parameter-wise moving averages:

  • First moment: $M_t \leftarrow \beta_1 M_{t-1} + (1-\beta_1)\, g_t$
  • Second moment: $V_t \leftarrow \beta_2 V_{t-1} + (1-\beta_2)\, g_t^2$

Storing both $M_t$ and $V_t$ doubles the optimizer state relative to methods that track only the first moment. For models with billions of parameters, the $O(N)$ memory for $V_t$ can severely restrict batch size or model scaling (Kalra et al., 3 Mar 2025). In vision models, naive second-order pooling leads to $O(c^2)$-dimensional heads; e.g., a $c = 512$ feature map requires handling $2.6 \times 10^5$ covariance elements per image (Yu et al., 2018).
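As a concrete, purely hypothetical illustration of this overhead, the NumPy sketch below maintains both moment buffers for a single weight matrix; the optimizer state alone occupies twice the parameter memory before any update is applied.

```python
import numpy as np

# Hypothetical layer: a fan_out x fan_in weight matrix and its current gradient.
fan_out, fan_in = 4096, 1024
grad = np.random.randn(fan_out, fan_in).astype(np.float32)

beta1, beta2 = 0.9, 0.999
M = np.zeros_like(grad)   # first-moment buffer, same size as the parameters
V = np.zeros_like(grad)   # second-moment buffer, same size again

# One Adam-style moment update.
M = beta1 * M + (1 - beta1) * grad
V = beta2 * V + (1 - beta2) * grad ** 2

mb = 2 ** 20
print(f"params: {grad.nbytes / mb:.1f} MB, optimizer state: {(M.nbytes + V.nbytes) / mb:.1f} MB")
```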

In scientific computing, real-time correlation analysis (e.g., X-ray photon correlation spectroscopy, XPCS) involves computing $G = XX^T$ for $X \in \mathbb{R}^{N \times M}$ with $M \gg 10^5$. The resulting memory and compute requirements rapidly become unmanageable (Strempfer et al., 29 Jul 2024).

2. Signal-to-Noise Ratio–Guided Compression in Optimizers

SlimAdam introduces a layer-wise, statistically grounded criterion for compressing second-moment (variance) tensors in Adam. For a tensor $V_t$ of shape $(\text{fan\_out} \times \text{fan\_in})$, compressibility along a dimension set $K$ is quantified as

$$\mathrm{SNR}_K(V_t) = \mathbb{E}_{K'}\!\left[\frac{(\mathbb{E}_K[V_t])^2}{\mathrm{Var}_K[V_t]}\right]$$

where $\mathbb{E}_K[\cdot]$ and $\mathrm{Var}_K[\cdot]$ denote the mean and variance over the axes in $K$, and $K'$ denotes the complementary axes. If $\mathrm{SNR}_K \gg 1$, entries along $K$ are tightly clustered and the second-moment tensor can be replaced by their mean over $K$ with minimal performance loss. The trajectory-averaged compressibility,

$$\widehat{\mathrm{SNR}}_K = \frac{1}{T} \sum_{t=1}^T \mathrm{SNR}_K(V_t),$$

is used for robust decision-making across training (Kalra et al., 3 Mar 2025).
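In code, the per-step score reduces to axis-wise means and variances. The minimal NumPy sketch below illustrates this; the axis conventions and the small variance guard are assumptions of the illustration, not details taken from the SlimAdam reference implementation.

```python
import numpy as np

def snr_along(V, axis):
    """SNR_K(V_t): squared mean over the reduced axis divided by the variance over
    that axis, then averaged over the complementary axes K'."""
    mean_sq = np.mean(V, axis=axis) ** 2
    var = np.var(V, axis=axis) + 1e-12          # guard against zero variance
    return float(np.mean(mean_sq / var))

# Stand-in second-moment tensor of shape (fan_out, fan_in).
V_t = np.abs(np.random.randn(4096, 1024)).astype(np.float32) ** 2
snr_fan_in = snr_along(V_t, axis=1)     # K = fan_in: one candidate value per output row
snr_fan_out = snr_along(V_t, axis=0)    # K = fan_out: one candidate value per input column
print(snr_fan_in, snr_fan_out)
# The trajectory average simply accumulates these scores over T training steps.
```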

Compression is guided by a threshold:

  • For each layer $\ell$ and each $k \in \{\mathrm{none}, \mathrm{fan\_in}, \mathrm{fan\_out}\}$, compute $\widehat{\mathrm{SNR}}^{(\ell)}_k$.
  • Let $k^* = \operatorname{argmax}_k \widehat{\mathrm{SNR}}^{(\ell)}_k$.
  • If $k^* \neq \mathrm{none}$ and $\widehat{\mathrm{SNR}}^{(\ell)}_{k^*} > \tau$ (default $\tau \approx 1.0$), compress $V^{(\ell)}$ along $k^*$; otherwise leave it uncompressed.

This procedure realizes up to $98\%$ memory savings for the second moment in Transformers and ResNets at small learning rates, without compromising convergence or stability. Whereas other low-memory Adam variants frequently fail at high learning rates, SlimAdam's SNR-guided compression matches full Adam's loss and accuracy curves (Kalra et al., 3 Mar 2025).
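A minimal sketch of the per-layer decision rule follows; the SNR values in the dictionary are illustrative, and the exact bookkeeping in SlimAdam (trajectory averaging, broadcasting back into the update) may differ from this simplification.

```python
import numpy as np

def slim_second_moment(V, snr_hat, tau=1.0):
    """Keep only the mean of V along the winning axis when its trajectory-averaged
    SNR clears the threshold tau; otherwise keep the full tensor.
    snr_hat: dict such as {"none": 0.0, "fan_in": 3.2, "fan_out": 0.7} (illustrative)."""
    k_star = max(snr_hat, key=snr_hat.get)
    if k_star == "none" or snr_hat[k_star] <= tau:
        return V                                   # uncompressed: O(fan_out * fan_in) state
    axis = 1 if k_star == "fan_in" else 0
    return V.mean(axis=axis, keepdims=True)        # compressed: one value per row or column

V = np.abs(np.random.randn(4096, 1024)).astype(np.float32) ** 2   # stand-in second moment
V_slim = slim_second_moment(V, {"none": 0.0, "fan_in": 3.2, "fan_out": 0.7})
print(V.shape, "->", V_slim.shape)     # (4096, 1024) -> (4096, 1) in this example
# In the Adam update, the slim state broadcasts against the gradient, so the
# denominator sqrt(V_slim) + eps needs no explicit decompression.
```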

3. Homomorphic Linear Compression for Second-Moment Computation

For high-throughput correlation (e.g., in XPCS), the homomorphic compression framework projects the data matrix $X \in \mathbb{R}^{N\times M}$ into a lower-dimensional space:

$$C(X) = Y_K = X V_K, \qquad V_K \in \mathbb{R}^{M \times K},\ K \ll M,$$

and directly computes the second moment on the compressed domain:

$$\tilde{G} = Y_K Y_K^T = X V_K V_K^T X^T.$$

If $V_K$ is constructed from the top $K$ right singular vectors of $X$ (as in SVD), the transformation preserves all bilinear forms of the type $X A X^T$, i.e., $C$ is a homomorphism for such operations. In the lossless case, $K = \operatorname{rank}(X)$ and $\tilde{G}$ is exact; lossy compression with $K \ll M$ yields the best rank-$K$ approximation in Frobenius norm. Critically, all calculations proceed in the compressed space, with no need to decompress to full dimension, enabling $10^3$–$10^4\times$ speedups and memory reductions in practice (Strempfer et al., 29 Jul 2024).
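The sketch below reproduces the projection and the compressed-domain second moment on a small synthetic low-rank matrix. The sizes are illustrative, and the SVD is computed on the full data purely for demonstration; a streaming pipeline would estimate the basis differently.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 128, 4096, 32
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, M))   # rank-K data, N x M

# Compression operator: top-K right singular vectors of X (M x K).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_K = Vt[:K].T

Y_K = X @ V_K              # compressed representation: N x K instead of N x M
G_tilde = Y_K @ Y_K.T      # second moment computed entirely in the compressed space
G = X @ X.T                # full-dimensional reference

# Exact (up to floating point) because K matches the rank of X.
print(np.allclose(G, G_tilde, rtol=1e-5, atol=1e-6 * np.abs(G).max()))
```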

4. Parametric and Statistical Second-Order Compression in Deep Networks

Statistically Motivated Second-Order Pooling (SMSO) compresses $O(c^2)$ covariance matrices into $p \ll c^2$-dimensional, Gaussian-distributed representations:

  1. Parametric Vectorization (PV):

$$z_j = w_j^T S w_j, \qquad S = X X^T \in \mathbb{R}^{c\times c}, \qquad W \in \mathbb{R}^{c \times p}$$

By design, $z_j$ is proportional to a $\chi^2$-distributed variable (a Wishart quadratic form).

  2. Gaussianization: Each $z_j$ is normalized by a scaling $\alpha_j$ and a square-root transform such that, for large $n$ (the number of pooled spatial positions), the result is approximately standard normal. A learnable affine transform $(\gamma_j, \beta_j)$ is then applied, akin to BatchNorm:

$$y_j = \gamma_j \sqrt{2\alpha_j z_j} - \sqrt{2n - 1} + \beta_j$$

All operations are differentiable. In an alternative, more efficient realization, PV can be mapped to $1\times 1$ convolutions followed by global $\ell_2$-pooling.
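A minimal NumPy sketch of PV plus Gaussianization for a single feature map follows. The per-dimension scaling `alpha` and the trivially initialized affine parameters are assumptions of this illustration; a trained SMSO layer learns `W`, `gamma`, and `beta` end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n, p = 256, 196, 64               # channels, spatial positions, compressed dimension
X = rng.standard_normal((c, n))      # feature map flattened over spatial locations

W = rng.standard_normal((c, p)) / np.sqrt(c)   # learnable projection (c x p)
gamma, beta = np.ones(p), np.zeros(p)          # learnable affine, BatchNorm-style

# Parametric vectorization: z_j = w_j^T S w_j with S = X X^T, computed without ever
# forming the c x c matrix S (equivalently, a 1x1 convolution followed by squared
# global l2-pooling over spatial positions).
proj = W.T @ X                                 # (p, n)
Z = np.sum(proj ** 2, axis=1)                  # (p,)

# Gaussianization: scale each z_j so it behaves like a chi-square with ~n degrees of
# freedom, then apply the square-root transform that is ~N(0, 1) for large n.
alpha = 1.0 / np.maximum(np.var(proj, axis=1), 1e-8)   # illustrative choice of scaling
Y = gamma * np.sqrt(2.0 * alpha * Z) - np.sqrt(2.0 * n - 1.0) + beta

print(Z.shape, Y.shape)   # (64,) each, versus 256*256 = 65536 bilinear entries
```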

Empirically, SMSO achieves $10\times$–$100\times$ compression of bilinear heads, typically with $p$ between 64 and 2048, while outperforming both uncompressed and compact random-projection baselines on major vision datasets (Yu et al., 2018).

5. Error Control, Memory Scaling, and Practical Performance

The memory-light compression approaches are characterized by mathematically analyzable tradeoffs:

  • Optimizer compression (SlimAdam): Memory savings up to $98\%$ for second moments; performance and stability indistinguishable from full Adam over diverse Transformer, ResNet, and ViT architectures (Kalra et al., 3 Mar 2025).
  • Homomorphic sketching: Using $K \ll M$, the error in the second-moment estimate is bounded per the Eckart–Young theorem,

$$\|X - X V_K V_K^T\|_F = \left( \sum_{i>K} \sigma_i^2 \right)^{1/2},$$

and is controlled by the neglected singular values. Choosing $K$ so that less than $1\%$ of the spectral energy is discarded typically suffices (Strempfer et al., 29 Jul 2024); a sketch of this selection rule appears at the end of this section.

  • SMSO: For $c=256$ and $p=2048$, SMSO reduces the feature size $32\times$ relative to bilinear pooling and outperforms all tested baselines; SMSO-64 features ($p=64$) are $128\times$ smaller than CBP-8192 but match or exceed its accuracy (Yu et al., 2018).

The computational cost is reduced correspondingly: slim optimizer states allow larger models or batches; homomorphic sketching reduces $O(NM)$ to $O(NK)$; SMSO reduces $O(c^2)$ to $O(cp)$ in inference and storage.
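Below is the small K-selection sketch referenced above: pick the smallest $K$ whose discarded singular values carry less than a chosen fraction of the total energy. The $1\%$ default and the synthetic data are illustrative assumptions.

```python
import numpy as np

def choose_rank(X, residual_energy=0.01):
    """Smallest K such that the neglected singular values carry at most the given
    fraction of the total squared spectral energy (the Eckart-Young residual)."""
    s = np.linalg.svd(X, compute_uv=False)
    cum_energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum_energy, 1.0 - residual_energy) + 1)

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 40)) @ rng.standard_normal((40, 2000))   # ~rank-40 signal
X += 0.01 * rng.standard_normal(X.shape)                               # small noise floor
print(choose_rank(X), "of", min(X.shape), "singular vectors retained")
```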

6. Deployment Guidelines, Insights, and Limitations

  • SlimAdam: Select $\tau \approx 1.0$ as the SNR threshold. Rules for which layers and axes to compress should be derived at a small learning rate and retained for the full optimization. Inconsistent compressibility is observed in certain MLP and “Gate” layers, and token embeddings may necessitate uncompressed storage due to low SNR, which is critical for handling rare tokens (Kalra et al., 3 Mar 2025).
  • Homomorphic approaches: Well suited to second moments and other bilinear forms of the data; they do not extend to higher-order or nonlinear statistics. For stationary signals, a fixed compression operator suffices, but drift in the signal subspace may warrant recomputation of the compression basis (Strempfer et al., 29 Jul 2024).
  • SMSO: The parametric vectorization approach preserves second-order statistical structure and regularizes the compressed features to Gaussianity, ensuring effective gradient flow and compatibility with first-order training paradigms. The value of $p$ can be selected to match hardware or deployment constraints, with empirical evidence favoring modest $p$ values for most applications. No SVD or matrix-power operations are needed at inference (Yu et al., 2018).

Potential limitations noted include reduced compressibility for fine-tuning or in highly nonstationary datasets (SlimAdam), possible loss of fidelity for low signal-to-noise eigenmodes (homomorphic sketching), and overfitting risk if $p$ is chosen too large relative to dataset scale in SMSO. Re-evaluation of the compression rule or basis may be needed mid-training if gradient statistics shift significantly.

7. Summary Table: Major Memory-Light Second-Moment Compression Methods

| Method | Domain | Mechanism | Typical savings |
|---|---|---|---|
| SlimAdam | Optimizer (Adam) | Layer- and axis-wise SNR, trajectory-averaged and thresholded; mean/broadcast substitution | Up to 98% |
| Homomorphic sketch | XPCS, time series | SVD-based projection; computations performed in the compressed space | Up to 10,000× |
| SMSO | Deep vision pooling | Parametric quadratic form, square-root Gaussianization, learnable affine | 10×–100× |

These advances offer principled, practical solutions for compressing second-moment information, enabling large-scale learning and real-time analytics previously infeasible due to memory or computational constraints (Kalra et al., 3 Mar 2025, Strempfer et al., 29 Jul 2024, Yu et al., 2018).
