
4-bit Shampoo: Efficient Matrix Optimization

Updated 31 December 2025
  • 4-bit Shampoo is a set of second-order optimizers that use 4-bit quantization to compress preconditioner states, reducing memory while retaining theoretical and empirical convergence guarantees.
  • It quantizes either the eigenvector matrices (with orthogonality correction) or the Cholesky factors (with error feedback) of the preconditioners, robustly approximating the matrix inverse roots used during deep network training.
  • Empirical evaluations on vision and language benchmarks demonstrate that 4-bit Shampoo reduces memory footprint significantly while achieving performance nearly matching that of full-precision Shampoo.

4-bit Shampoo is a set of memory-efficient second-order optimizers for deep neural network training that exploit low-bitwidth quantization to compress the states of matrix preconditioners. These optimizers preserve the theoretical and empirical advantages of full-matrix Shampoo while markedly reducing the memory footprint, making second-order optimization practical for large-scale models. 4-bit Shampoo achieves this by quantizing the eigenvector matrices or Cholesky factors of the preconditioners to 4 bits, combined with targeted orthogonality correction and error feedback mechanisms that keep the approximation of the matrix inverse roots required for preconditioned updates robust. Experimental evaluations across computer vision and natural language benchmarks demonstrate near-lossless accuracy compared to 32-bit Shampoo, freeing resources for larger models or batch sizes under fixed hardware constraints (Wang et al., 28 May 2024, Li et al., 14 Dec 2024).

1. Mathematical Underpinnings of Shampoo Preconditioning

Shampoo belongs to the class of full-matrix preconditioned optimizers, maintaining for each parameter block $W \in \mathbb{R}^{m \times n}$ two positive-definite (PD) matrices

L_t = \beta L_{t-1} + (1-\beta)\, G_t G_t^{\mathsf{T}}, \qquad R_t = \beta R_{t-1} + (1-\beta)\, G_t^{\mathsf{T}} G_t,

where $G_t = \nabla_W \mathcal{L}_t(W_{t-1})$ is the stochastic gradient. The updates require applying the inverse fourth root of each preconditioner,

A_t^{-1/4} = Q_t \Lambda_t^{-1/4} Q_t^{\mathsf{T}},

with the eigen-decomposition $A_t = Q_t \Lambda_t Q_t^{\mathsf{T}}$ ($Q_t$ orthonormal, $\Lambda_t$ diagonal with positive entries).

A crucial insight in 4-bit Shampoo is that quantizing the eigenvector matrix $Q_t$ yields bounded effects on the preconditioned gradient, whereas quantizing $A_t$ directly causes severe distortions due to the sensitivity of inverse roots to small eigenvalue perturbations (Wang et al., 28 May 2024).
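
To make the update above concrete, the following NumPy sketch computes the inverse fourth roots via an eigen-decomposition and applies one preconditioned step for a single parameter block. It is a minimal illustration, not any released implementation; the hyperparameters (`beta`, `lr`, `eps`) and function names are assumptions.

```python
import numpy as np

def inverse_fourth_root(A, eps=1e-6):
    """A^{-1/4} for a positive-definite A via the eigen-decomposition A = Q diag(lam) Q^T."""
    lam, Q = np.linalg.eigh(A + eps * np.eye(A.shape[0]))   # eps*I guards against tiny eigenvalues
    return Q @ np.diag(lam ** -0.25) @ Q.T

def shampoo_step(W, G, L, R, beta=0.95, lr=1e-3):
    """One simplified Shampoo step for a parameter block W with stochastic gradient G."""
    L = beta * L + (1 - beta) * G @ G.T      # left preconditioner statistics  L_t
    R = beta * R + (1 - beta) * G.T @ G      # right preconditioner statistics R_t
    W = W - lr * inverse_fourth_root(L) @ G @ inverse_fourth_root(R)
    return W, L, R
```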

2. 4-bit Quantization Schemes

Two principal quantization strategies are established for compressing the PD matrix states.

A. Eigenvector Quantization (Linear-2 Method) (Wang et al., 28 May 2024):

  • Each block of matrix or vector entries is normalized by its maximum absolute value.
  • Entries are mapped to a 4-bit code $\mathbb{T}_4 = \{0, \dots, 15\}$ via the piecewise quadratic Linear-2 mapping:

\mathcal{R}(j) = \begin{cases} -(-1 + 2j/15)^2 & j < 7 \\ 0 & j = 7 \\ (1 - 2j/15)^2 & j > 7 \end{cases}

  • Dequantization reconstructs the value by combining the stored maximum and mapped code.
  • Linear-2 outperforms dynamic tree quantization at 4 bits in terms of normwise relative error (NRE) and angle error (AE) for inverse-root computations.
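
A minimal NumPy sketch of block-wise 4-bit quantization with the Linear-2 codebook above. The block size of 64, the nearest-code lookup, and the helper names are assumptions made for illustration, not the published implementation.

```python
import numpy as np

# Linear-2 codebook: maps 4-bit codes {0, ..., 15} into [-1, 1] via the piecewise quadratic above
j = np.arange(16)
CODEBOOK = np.where(j < 7, -(-1 + 2 * j / 15) ** 2, (1 - 2 * j / 15) ** 2)
CODEBOOK[7] = 0.0

def quantize_4bit(x, block_size=64):
    """Block-wise 4-bit quantization: per-block max-abs scale plus one 4-bit code per entry."""
    x = np.asarray(x, dtype=np.float64)
    flat = x.ravel()
    pad = (-flat.size) % block_size                             # zero-pad to a whole number of blocks
    blocks = np.concatenate([flat, np.zeros(pad)]).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # per-block normalizer
    codes = np.abs((blocks / scales)[..., None] - CODEBOOK).argmin(axis=-1).astype(np.uint8)
    return codes, scales, x.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    """Reconstruct an approximate tensor from the stored codes and per-block scales."""
    flat = (CODEBOOK[codes] * scales).ravel()
    return flat[:flat.size - pad].reshape(shape)
```

In a real optimizer the codes would be bit-packed two per byte; they are left as `uint8` here for clarity.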

B. Cholesky Factor Quantization (Li et al., 14 Dec 2024):

  • The preconditioner is decomposed as $L_k + \epsilon I = C_k^L (C_k^L)^{\mathsf{T}}$, and similarly for $R_k$.
  • Only the off-diagonal entries of the lower-triangular Cholesky factors $C_k^L, C_k^R$ are quantized to 4 bits, with the diagonals kept in full precision.
  • This approach reduces storage by around 50% for each factor.
  • Spectral properties are better preserved in Cholesky factorization than in direct eigenvector quantization for certain input distributions.
| Quantization Target | 4-bit Memory Usage | Accuracy Retention |
|---|---|---|
| Eigenvectors ($Q_t$) | $4n^2 + n^2/16$ bits | ±0.7% accuracy gap |
| Cholesky factors ($C_k$) | $\sim 3d^2/2$ at 4 bits | ±0.17% accuracy gap |

This suggests that Cholesky factor quantization combined with error feedback provides additional savings and accuracy restoration over vanilla 4-bit eigenvector quantization.
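
A minimal sketch of this storage scheme, reusing the hypothetical `quantize_4bit`/`dequantize_4bit` helpers from the previous snippet; the function names and the handling of the $\epsilon I$ shift are illustrative choices rather than the published implementation.

```python
import numpy as np

def compress_cholesky(A, eps=1e-6):
    """Store a PD preconditioner as a Cholesky factor: 4-bit off-diagonals, full-precision diagonal."""
    C = np.linalg.cholesky(A + eps * np.eye(A.shape[0]))   # A + eps*I = C @ C.T, C lower-triangular
    diag = np.diag(C).copy()                               # diagonal kept at 32 bits
    off_diag = C[np.tril_indices_from(C, k=-1)]            # strictly lower-triangular entries
    return diag, quantize_4bit(off_diag)

def decompress_cholesky(diag, packed):
    """Rebuild the approximate factor C and the preconditioner it encodes."""
    n = diag.shape[0]
    C = np.zeros((n, n))
    C[np.tril_indices(n, k=-1)] = dequantize_4bit(*packed)
    C[np.diag_indices(n)] = diag
    return C, C @ C.T                                      # approximately A + eps*I
```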

3. Orthonormality and Error Compensation

Quantization threatens the orthogonality (for eigenvectors) and numerical properties (for Cholesky factors) of states used in matrix root computations.

  • Björck Orthonormalization (Wang et al., 28 May 2024): After dequantizing the eigenvector matrices, a single step of Björck iteration

V_1 = 1.5\, V_0 - 0.5\, V_0 V_0^{\mathsf{T}} V_0

where $V_0$ is the reconstructed matrix, restores orthogonality to a relative error below $10^{-3}$, reducing NRE/AE by 30-50% and stabilizing the inverse-root estimation (a minimal numerical sketch follows after this list).

  • Error Feedback for Cholesky Factors (Li et al., 14 Dec 2024): Error-state matrices $E_k^L$ and $E_k^R$ are maintained in the opposing triangular slots of the same storage, enabling compensated quantization:

U_k^L = C_k^L + D(E_{k-1}^L)

with subsequent error update via EMA on the quantization residual. This mechanism bridges accuracy gaps and maintains optimal convergence rates.
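
To illustrate the Björck correction concretely, the snippet below applies a single iteration to a synthetically perturbed orthogonal matrix and prints the orthogonality defect before and after; the 64 x 64 size and the 1e-2 perturbation standing in for quantization error are arbitrary choices.

```python
import numpy as np

def bjorck_orthonormalize(V0):
    """One Björck iteration, V1 = 1.5*V0 - 0.5*V0 @ V0.T @ V0, nudging V0 back toward orthogonality."""
    return 1.5 * V0 - 0.5 * V0 @ V0.T @ V0

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # an exactly orthogonal matrix
V0 = Q + 1e-2 * rng.standard_normal((64, 64))        # stand-in for dequantization error
for V in (V0, bjorck_orthonormalize(V0)):
    print(np.linalg.norm(V.T @ V - np.eye(64)))      # defect shrinks after one iteration
```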

4. Algorithmic Workflow

The streamlined pseudocode structures for each variant maintain four compressed states per layer: left/right preconditioners and left/right inverse-roots, with block-wise quantization/dequantization and periodic orthogonality correction.

For the Cholesky variant, the error feedback logic is intertwined with the update and quantization steps, as follows:

for k = 1 to T:
    # Gradient computation (cf. Section 1)
    Gₖ = ∇ℒ(Wₖ₋₁)

    # Preconditioner update & compensated quantization, every T₁ steps (T₁ = update interval)
    if k % T₁ == 0:
        Lₖ = β D(C̄ₖ₋₁^L) D(C̄ₖ₋₁^L)^T + (1 - β) Gₖ Gₖ^T   # dequantize stored factor, EMA update
        Cₖ^L = Cholesky(Lₖ + ε I)
        Uₖ^L = Cₖ^L + D(Eₖ₋₁^L)                           # add back accumulated quantization error
        C̄ₖ^L = Q(Uₖ^L)                                    # re-quantize the compensated factor to 4 bits
        Eₖ^L = β_e Eₖ₋₁^L + (1 - β_e)(Uₖ^L - D(C̄ₖ^L))     # EMA update of the error state
        # (the right factor C̄ₖ^R is handled symmetrically)
    # Periodic inverse-root update and the subsequent preconditioned step are omitted for brevity

Full pseudocode for both methods appears in (Wang et al., 28 May 2024) and (Li et al., 14 Dec 2024).
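
For the eigenvector variant, a single (left) preconditioner refresh can be sketched as below, reusing the hypothetical `quantize_4bit`, `dequantize_4bit`, and `bjorck_orthonormalize` helpers from earlier snippets. The state layout (full-precision eigenvalues alongside quantized eigenvectors and inverse root) and all hyperparameters are assumptions made for illustration, not a transcription of the papers' bookkeeping.

```python
import numpy as np

def refresh_left_state(state, G, beta=0.95, eps=1e-6):
    """Periodic refresh of the left preconditioner L and its inverse fourth root,
    with the eigenvector matrix held in block-wise 4-bit storage between refreshes."""
    # 1. Dequantize the stored eigenvectors and correct the orthogonality lost to quantization
    Q = bjorck_orthonormalize(dequantize_4bit(*state["Q"]))
    L = Q @ np.diag(state["eigvals"]) @ Q.T                 # reconstruct the previous L
    # 2. Fold in fresh gradient statistics (Shampoo EMA update from Section 1)
    L = beta * L + (1 - beta) * G @ G.T
    # 3. Recompute the eigen-decomposition and re-quantize the refreshed states
    eigvals, Q = np.linalg.eigh(L + eps * np.eye(L.shape[0]))
    state["eigvals"] = eigvals                              # kept in full precision here
    state["Q"] = quantize_4bit(Q)
    state["L_inv_root"] = quantize_4bit(Q @ np.diag(eigvals ** -0.25) @ Q.T)
    return state
```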

5. Memory Complexity Analysis

  • 32-bit Shampoo: stores four dense $d \times d$ matrices per block, for a total of $4d^2$ floats ($O(d^2)$ space).
  • Vanilla 4-bit Shampoo: four full matrices, each entry quantized to 4 bits; total space reduced by a factor of roughly 7.
  • Cholesky 4-bit Shampoo: two lower-triangular Cholesky factors per block (off-diagonals at 4 bits, diagonals at 32 bits), plus two full-matrix inverse roots, also quantized. Overall state storage is roughly $3d^2$ 4-bit entries, about 25% less than vanilla 4-bit Shampoo.

This implies substantial resource reallocation capability for model and batch size scaling under standard GPU limits.
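
As a back-of-the-envelope check of this accounting (ignoring the small per-block scale overhead; the block size d = 1024 is an illustrative choice):

```python
d = 1024
fp32_shampoo  = 4 * d * d * 4                # four dense fp32 matrices, 4 bytes per entry
vanilla_4bit  = 4 * d * d // 2               # four dense matrices at 4 bits (0.5 byte) per entry
cholesky_4bit = 3 * d * d // 2 + 2 * d * 4   # two triangular factors plus two inverse roots at
                                             # 4 bits, plus two full-precision diagonals
print(fp32_shampoo / 2**20, vanilla_4bit / 2**20, cholesky_4bit / 2**20)
# roughly 16.0 MiB, 2.0 MiB, and 1.5 MiB of optimizer state for this single block
```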

6. Convergence and Empirical Performance

Theoretical analysis demonstrates that both variants of 4-bit Shampoo retain the optimal convergence rate $O(1/\sqrt{T})$ in smooth nonconvex stochastic optimization, with stationary-point convergence in Whitney-stratifiable (nonsmooth) settings (Li et al., 14 Dec 2024).

Empirical results across CIFAR-100, Tiny-ImageNet, and ImageNet-1k show:

  • 4-bit Shampoo achieves test accuracy within ±0.7% of 32-bit Shampoo in vision architectures such as VGG19, ResNet34/50, ViT-Small, Swin-Tiny, ViT-Base/32.
  • Cholesky factor quantization with error feedback further narrows the gap, with deviations as low as 0.17% compared to float32 benchmarks.
  • Overall GPU memory usage over a full training run drops by 4.5-41%, with optimizer-state storage shrinking roughly 7×; savings exceed 50% for the Cholesky/error-feedback approach.
  • Convergence curves for accuracy almost perfectly overlap between 4-bit and full-precision Shampoo, while first-order optimizers (SGDM, AdamW) trail in epoch-wise convergence and final scores (Wang et al., 28 May 2024, Li et al., 14 Dec 2024).

7. Practical Significance and Implications

4-bit Shampoo makes high-quality second-order preconditioning feasible for deep network training under tight memory budgets (Wang et al., 28 May 2024). The quantization and correction scheme circumvents previous barriers to using full-matrix preconditioners by maintaining stability and accuracy in matrix root computation, with theoretical guarantees matching non-quantized Shampoo. Error feedback and triangular storage further enable efficient scaling. Applications span large-scale computer vision and language modeling, with plausible extension to any domain requiring second-order optimization at scale.
