Companded Optimizer State Quantization

Updated 4 March 2026
  • Companded optimizer state quantization is a technique that applies nonlinear mapping to optimizer states to reduce memory usage and quantization error.
  • It integrates group scaling with forward quantization and inverse companding to maintain fidelity in momentum and variance buffers.
  • Empirical studies demonstrate significant memory savings and throughput gains while preserving convergence in large-scale neural network training.

Companded optimizer state quantization refers to a class of techniques that reduce the memory footprint and quantization error of optimizer state tensors (such as the momentum and variance buffers of Adam-style optimizers) by applying nonlinear mappings ("companders") to the data before quantization and inverting them after dequantization. These methods enable stable, memory-efficient low-bit representations (typically 8-bit, but extending down to 2–4 bits) while maintaining optimizer convergence and downstream model quality. Companded quantization helps scale large models within tight accelerator memory constraints and is employed in systems such as COAT, FlashOptim, and SOLO (Xi et al., 2024, Ortiz et al., 26 Feb 2026, Xu et al., 1 May 2025).

1. Mathematical Principles of Companded Optimizer Quantization

Companding maps transform the distribution of optimizer state values to better exploit the available dynamic range of the target quantization format, focusing quantization precision where it has the greatest impact. The essential steps are:

  • Companding transform: A nonlinear, invertible map $f:\mathbb{R}\to\mathbb{R}$ compresses or expands state values to match the target format's representable range.
  • Quantization: The companded state is quantized using a uniform quantizer (for fixed-point/float formats) or a nonuniform quantizer (for logarithmic/DE grids).
  • Inverse companding: Upon dequantization, the inverse map $f^{-1}$ reconstructs the original value distribution, minimizing information loss.

For each optimizer state (e.g., Adam momentum $m$ and variance $v$), separate companders are often chosen to match statistical properties:

  • COAT (FP8/E4M3): Uses a per-group power-law transform $f(x) = \operatorname{sign}(x)|x|^k$ for both moments, with the exponent $k$ chosen dynamically to match the raw group dynamic range $R_X$ to that of the E4M3 format ($R_{E4M3} \approx 2.29\times10^5$). The exponent is $k = \log_{R_X}(R_{E4M3})$ and is typically clamped to $[1,16]$ for stability (Xi et al., 2024).
  • FlashOptim (INT8): Applies softsign companding for signed momenta ($\varphi_m(x) = 2x/(1+|x|)$, inverted by $\varphi_m^{-1}(z) = z/(2-|z|)$) and a square-root compander for nonnegative variances ($\varphi_v(x) = \sqrt{x}$, $\varphi_v^{-1}(z)=z^2$) (Ortiz et al., 26 Feb 2026).
  • SOLO (2–4 bit, log): Uses logarithmic companding for unsigned EMAs, mapping to levels $y_k=\alpha^k$ over $[0,1]$, with $\alpha$ chosen per block from the upper quantile of values. For signed momenta, SOLO applies a dynamic-exponent quantizer (Xu et al., 1 May 2025).
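
The closed-form companders listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code of any of the cited systems: the function names are hypothetical, and the group dynamic range $R_X$ is assumed here to be the ratio of the largest to the smallest nonzero magnitude in the group. The E4M3 constants (largest magnitude 448, smallest subnormal $2^{-9}$) reproduce the $2.29\times10^5$ range quoted above.

```python
import math

E4M3_MAX = 448.0               # largest finite E4M3 magnitude
E4M3_MIN = 2.0 ** -9           # smallest positive (subnormal) E4M3 magnitude
R_E4M3 = E4M3_MAX / E4M3_MIN   # 229376, i.e. about 2.29e5


def coat_exponent(group, k_min=1.0, k_max=16.0):
    """k = log_{R_X}(R_E4M3), clamped to [k_min, k_max].

    Assumption: R_X is max|x| / min nonzero |x| over the group.
    """
    mags = [abs(x) for x in group if x != 0.0]
    if not mags:
        return k_min
    r_x = max(mags) / min(mags)
    if r_x <= 1.0:             # single magnitude: the log base is degenerate
        return k_max
    k = math.log(R_E4M3) / math.log(r_x)
    return min(max(k, k_min), k_max)


def power_compand(x, k):
    """f(x) = sign(x) |x|^k."""
    return math.copysign(abs(x) ** k, x)


def power_expand(y, k):
    """Inverse of power_compand."""
    return math.copysign(abs(y) ** (1.0 / k), y)


def softsign_compand(x):
    """phi_m(x) = 2x / (1 + |x|)."""
    return 2.0 * x / (1.0 + abs(x))


def softsign_expand(z):
    """phi_m^{-1}(z) = z / (2 - |z|), valid for |z| < 2."""
    return z / (2.0 - abs(z))


def sqrt_compand(x):
    """phi_v(x) = sqrt(x) for nonnegative x."""
    return math.sqrt(x)


def sqrt_expand(z):
    """phi_v^{-1}(z) = z^2."""
    return z * z


group = [1e-3, 0.5, -2.0]      # toy group with dynamic range 2000
k = coat_exponent(group)
print("k =", k)
print("companded:", [power_compand(x, k) for x in group])
```

In this toy group the raw range of 2000 is stretched so that the companded range equals that of E4M3, which is the matching condition stated in the COAT bullet.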

2. Quantization Workflow and Optimizer Integration

Companded quantization is integrated into the optimizer loop as follows:

  • State grouping and scaling: Optimizer state vectors are partitioned into groups, and per-group scaling is applied: group absmax for signed states, maximum for square-rooted variances, or normalization to $[0,1]$ for log companding.
  • Forward pass (quantization):
  1. Apply companding $f$ elementwise.
  2. Linearly quantize (symmetric for signed, asymmetric or log for unsigned).
  3. Store quantized codes and group scales.
  • Backward pass (dequantization):
  1. Dequantize codes using stored scales.
  2. Recover the representative value via $f^{-1}$.
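
The quantize and dequantize steps above can be illustrated for a single group of signed values. The sketch below assumes absmax scaling followed by the softsign compander and symmetric rounding to 127 levels. The function names and the choice of one Python list per group are hypothetical, and real implementations operate on tensors inside fused kernels.

```python
def quantize_group(values, levels=127):
    """Scale by group absmax, compand with softsign, round to signed codes."""
    scale = max(abs(v) for v in values) or 1.0   # all-zero group: use scale 1
    codes = []
    for v in values:
        u = v / scale                            # u in [-1, 1]
        z = 2.0 * u / (1.0 + abs(u))             # companded, still in [-1, 1]
        codes.append(round(z * levels))          # integer code in [-levels, levels]
    return codes, scale


def dequantize_group(codes, scale, levels=127):
    """Undo the linear quantization, then apply the inverse softsign map."""
    out = []
    for c in codes:
        z = c / levels
        out.append(scale * z / (2.0 - abs(z)))
    return out


values = [0.02, -0.5, 0.001, 1.5]
codes, scale = quantize_group(values)
restored = dequantize_group(codes, scale)
for v, r in zip(values, restored):
    print(f"{v:+.4f} -> {r:+.4f}")
```

Only the integer codes and one scale per group need to be stored; the compander itself has no per-group parameters in this variant.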

The modified optimizer updates (e.g., AdamW/FlashAdamW) simply replace full-precision moments with quantized states, ensuring all intermediate arithmetic remains in high precision before re-quantizing (Xi et al., 2024, Ortiz et al., 26 Feb 2026).
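
A minimal sketch of such an update, assuming plain Adam without weight decay and a single group covering the whole state, is shown below. The helper names are hypothetical and this is not FlashAdamW or COAT code. It only illustrates the order of operations: dequantize, update in full precision, take the parameter step, then re-quantize.

```python
import math

M_LEVELS, V_LEVELS = 127, 255  # signed and unsigned 8-bit code ranges


def quantize_m(m):
    """Softsign-compand momentum after absmax scaling; returns (codes, scale)."""
    scale = max(abs(x) for x in m) or 1.0
    return [round(M_LEVELS * 2 * (x / scale) / (1 + abs(x / scale))) for x in m], scale


def dequantize_m(codes, scale):
    return [scale * (c / M_LEVELS) / (2 - abs(c / M_LEVELS)) for c in codes]


def quantize_v(v):
    """Square-root-compand the nonnegative variance; returns (codes, scale)."""
    scale = max(math.sqrt(x) for x in v) or 1.0
    return [round(V_LEVELS * math.sqrt(x) / scale) for x in v], scale


def dequantize_v(codes, scale):
    return [(scale * c / V_LEVELS) ** 2 for c in codes]


def adam_step(params, grads, m_state, v_state, t,
              lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # 1. Recover full-precision moments from the stored codes and scales.
    m = dequantize_m(*m_state)
    v = dequantize_v(*v_state)
    # 2. Standard Adam moment updates, carried out in float arithmetic.
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grads)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grads)]
    # 3. Bias-corrected parameter step using the unquantized moments.
    new_params = [
        p - lr * (mi / (1 - b1 ** t)) / (math.sqrt(vi / (1 - b2 ** t)) + eps)
        for p, mi, vi in zip(params, m, v)
    ]
    # 4. Re-quantize the moments for storage until the next step.
    return new_params, quantize_m(m), quantize_v(v)


params, grads = [1.0, -2.0], [0.1, -0.3]
m_state, v_state = quantize_m([0.0, 0.0]), quantize_v([0.0, 0.0])
params, m_state, v_state = adam_step(params, grads, m_state, v_state, t=1)
print(params, m_state, v_state)
```

Because the parameter step uses the moments before they are re-quantized, quantization error enters only through the state carried over to the next iteration.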

3. Quantization Error Analysis and Bounds

A key benefit of companded quantization is provably lower quantization error compared to naïve uniform quantization. The error bound is a function of both the quantizer resolution and the compander’s local Lipschitz constant:

  • COAT: For FP8 E4M3, step size $\delta\approx A/448$ in the companded domain (“A” is group max). Absolute error after inverse companding is bounded as $|\hat{x}-x| \leq \frac{1}{k}|y|^{1/k-1}\frac{\delta}{2}$, where $y$ denotes the companded value. Empirically, mean squared error (MSE) for updates $m/\sqrt{v}$ is reduced by up to $1.6\times$ relative to plain FP8 (Xi et al., 2024).
  • FlashOptim: Momentum companding yields $|\hat{x}-x|\leq 2/127\approx0.0157$; variance error is $\leq 0.00393$ per element, an order of magnitude lower than linear quantization (Ortiz et al., 26 Feb 2026).
  • SOLO: Uniform linear quantization can cause “signal swamping” (no state change) at high $\beta$ and low bit-width; log companding ensures quantization radii $r$ are small near zero, preserving EMA dynamics down to 2 bits. The variance of $1/\sqrt{\hat{x}}$ is bounded due to log-level clustering near zero (Xu et al., 1 May 2025).
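
The form of the COAT bound can be recovered with a first-order argument. The sketch below assumes round-to-nearest with a uniform step $\delta$ in the companded domain, so that $|\hat{y}-y|\le\delta/2$, and linearizes the inverse compander around $y$; it is not taken from the cited paper.

```latex
\begin{aligned}
f^{-1}(y) &= \operatorname{sign}(y)\,|y|^{1/k}, \\
\frac{d}{dy} f^{-1}(y) &= \frac{1}{k}\,|y|^{1/k-1} \quad (y \neq 0), \\
|\hat{x}-x| = \bigl|f^{-1}(\hat{y}) - f^{-1}(y)\bigr|
  &\approx \frac{1}{k}\,|y|^{1/k-1}\,|\hat{y}-y|
  \le \frac{1}{k}\,|y|^{1/k-1}\,\frac{\delta}{2}.
\end{aligned}
```

The linearization is only meaningful away from $y=0$, where the derivative of the inverse map is unbounded for $k>1$.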

4. Memory, Throughput, and Empirical Performance

Memory savings are substantial:

| Method | Optimizer State Reduction | End-to-End Memory Reduction | Speedup |
|---|---|---|---|
| COAT (FP8) | $\sim2\times$ vs BF16 | $1.54\times$ (Llama-2-13B) | $1.43$–$2.25\times$ vs BF16 |
| FlashOptim | $59.8\rightarrow23.4$ GB | $175.2\rightarrow112.9$ GB | No slowdown |
| SOLO (4/2b) | $50.24\rightarrow5.18$ GB | $10\times$–$15\times$ | Maintains quality |

COAT also demonstrates doubled batch sizes and throughput gains of $\sim2.25\times$ compared to BF16 in large model training (Xi et al., 2024). FlashOptim incurs virtually zero convergence or accuracy loss on vision/language benchmarks such as ResNet-50, GSM8k, and GPT-2 (Ortiz et al., 26 Feb 2026). SOLO preserves or slightly improves mean metric scores at extreme bit reductions (2–4 bits) for both ImageNet and language tasks (Xu et al., 1 May 2025).
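
As a rough back-of-the-envelope check on where the state savings come from, the following sketch counts bytes for one quantized state tensor, including one stored scale per group. The group size of 32 and the 16-bit scale are illustrative assumptions, not values reported by the cited papers, and `state_bytes` is a hypothetical helper.

```python
def state_bytes(num_elements, code_bits, group_size, scale_bits=16):
    """Bytes for the quantized codes plus one scale per group."""
    num_groups = -(-num_elements // group_size)   # ceiling division
    total_bits = num_elements * code_bits + num_groups * scale_bits
    return total_bits / 8


n = 1_000_000                 # elements in one optimizer state tensor
bf16_bytes = n * 2            # 16-bit baseline
int8_bytes = state_bytes(n, code_bits=8, group_size=32)
int4_bytes = state_bytes(n, code_bits=4, group_size=32)

print("8-bit reduction vs BF16:", bf16_bytes / int8_bytes)
print("4-bit reduction vs BF16:", bf16_bytes / int4_bytes)
```

Under these assumptions the per-group scale adds half a bit per element, so the 8-bit state is slightly less than twice as small as BF16. The overhead matters more as the code width shrinks.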

5. Extensions: Optimizer Types, Bit-Widths, and Mapping Strategies

Companded quantization strategies are portable across optimizers:

  • Momentum-only optimizers (SGD with momentum, Lion): Reuse the signed companding pipeline for all state (Ortiz et al., 26 Feb 2026).
  • Low-bit quantization (down to 2 bits): Requires careful mapping (log or DE) and, for signed states, precision-aware $\beta$ reduction to control variance amplification (Xu et al., 1 May 2025).
  • Compander design: While hand-crafted maps (softsign, sqrt, log) suffice, avenues exist for learned, piecewise-linear, or per-layer companders. The main constraints are invertibility and boundedness on target intervals (Ortiz et al., 26 Feb 2026).

No extra calibration phases are required; scale and map parameters are computed on-the-fly per group/block.

6. Limitations and Theoretical Considerations

Uniform quantization at low bitwidths leads to signal swamping and variance explosion, especially for EMA variance buffers with high decay ($\beta\to 1$). This necessitates companding schemes with denser level allocation near zero (log) or increased momentum decay (lower $\beta$) for signed states to counteract excess quantization noise. Group/block size choices balance quantization error (smaller group = tighter scaling) against per-group scale overhead (larger group = more savings) (Xu et al., 1 May 2025). COAT’s power companding and SOLO’s log companding are specifically justified by theoretical results on step-size, state “decay matching,” and quantized EMA preservation.
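
Signal swamping is easy to reproduce in a toy setting. The sketch below runs an EMA whose state is rounded to the nearest level of a uniform 4-bit grid on $[0,1]$ after every update, and then compares the level spacing of that grid with a log grid $y_k=\alpha^k$. The deterministic round-to-nearest rule, the value $\alpha=0.5$, and the function names are illustrative assumptions. The sketch does not implement SOLO itself and makes no claim about how SOLO rounds.

```python
def nearest(x, levels):
    """Deterministic round-to-nearest onto a list of representable levels."""
    return min(levels, key=lambda q: abs(q - x))


def ema_quantized(inputs, levels, beta, start=0.0):
    """EMA whose state is re-quantized after every update."""
    state = start
    for u in inputs:
        state = nearest(beta * state + (1.0 - beta) * u, levels)
    return state


bits, beta, steps = 4, 0.999, 10_000
uniform = [i / (2 ** bits - 1) for i in range(2 ** bits)]   # step 1/15
log_levels = [0.5 ** k for k in range(2 ** bits)]           # alpha = 0.5 (assumed)

# Constant input 0.6: the exact EMA converges towards 0.6.
exact = 0.6 * (1.0 - beta ** steps)
swamped = ema_quantized([0.6] * steps, uniform, beta)

# Each update moves the state by at most 0.0006, far below half a uniform
# step, so rounding returns the state to its starting level every time.
print("exact EMA:", exact, "uniform 4-bit EMA:", swamped)

# Smallest gap between adjacent levels, i.e. the resolution near zero.
uniform_gap = min(b - a for a, b in zip(uniform, uniform[1:]))
log_gap = min(a - b for a, b in zip(log_levels, log_levels[1:]))
print("uniform gap:", uniform_gap, "log gap near zero:", log_gap)
```

The uniform grid never leaves its initial level, while the log grid places its finest spacing near zero, which is where small EMA values live.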

7. Summary and Outlook

Companded optimizer state quantization constitutes a mathematically principled, empirically validated approach for aggressive memory reduction in neural network training. Techniques such as dynamic power-law companding (COAT), softsign and square-root companding (FlashOptim), and log-base quantization with precision-specific optimizer heuristics (SOLO) enable low-bit optimizer states with little or no reported loss in quality across vision and language tasks, even at 2–4 bit state precision. These methods address the main obstacles of signal swamping and variance inflation and are portable to a range of optimizers and deployment scenarios, facilitating efficient large-scale training and enabling new model and batch size regimes (Xi et al., 2024, Ortiz et al., 26 Feb 2026, Xu et al., 1 May 2025).
