
Folded Optimizer with Approximate Moment (FOAM)

Updated 15 December 2025
  • The paper introduces FOAM’s main contribution: a block-wise folding technique that maintains convergence rates while drastically reducing memory overhead.
  • FOAM employs fold-unfold matrices with residual correction to compress gradients without sacrificing per-parameter update accuracy.
  • Empirical results show FOAM achieves comparable perplexity to Adam and increases throughput with up to 50% memory reduction.

Folded Optimizer with Approximate Moment (FOAM) is a memory-efficient adaptive optimization technique designed for LLM training. FOAM compresses optimizer states by folding gradients into block-wise means and restoring full-rank updates via an ephemeral residual. The method achieves convergence rates equivalent to Adam under standard non-convex settings, eliminates up to 90% of optimizer state overhead, and improves throughput, making it practical for multi-billion-parameter LLMs (Wen et al., 8 Dec 2025).

1. Block-Wise Folding Constructs and Mathematical Foundations

FOAM operates on weight matrices $W \in \mathbb{R}^{m \times n}$ and maintains block-compressed first and second moments of the gradient $G_t$. Given a fold level $l$ (block size $2^l$), FOAM utilizes a “fold” matrix $A^{(l)} \in \mathbb{R}^{n \times (n/2^l)}$ and an “unfold” matrix $E^{(l)} \in \mathbb{R}^{(n/2^l) \times n}$. These operators perform averaging and replication within blocks, respectively:

  • $A^{(l)}_{i,j} = 1/2^l$ if $(j-1)\cdot 2^l + 1 \leq i \leq j \cdot 2^l$, else 0.
  • $E^{(l)}_{i,j} = 1$ if $(i-1)\cdot 2^l + 1 \leq j \leq i \cdot 2^l$, else 0.

The idempotent projector $P^{(l)} = A^{(l)} E^{(l)}$ (with spectral norm 1) projects gradients into block-constant subspaces. Folding compresses both moments to $\mathbb{R}^{m \times (n/2^l)}$, and the residual $R_t = G_t - G_t P^{(l)}$ restores per-parameter information on each update, ensuring differentiation within blocks.
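The construction is easy to verify numerically. Below is a minimal NumPy sketch (not from the paper; shapes and names follow the definitions above) that builds $A^{(l)}$ and $E^{(l)}$ and checks that $P^{(l)} = A^{(l)} E^{(l)}$ is an idempotent projector with spectral norm 1:

```python
import numpy as np

def fold_unfold(n: int, l: int):
    """Build fold matrix A^(l) (n x n/2^l) and unfold matrix E^(l) (n/2^l x n).

    A averages each contiguous block of 2^l coordinates; E replicates each
    block mean back to full width.
    """
    b = 2 ** l                                  # block size
    k = n // b                                  # number of blocks (assumes b divides n)
    E = np.kron(np.eye(k), np.ones((1, b)))     # (k, n): ones over each block
    A = E.T / b                                 # (n, k): block-wise averaging
    return A, E

n, l = 8, 1
A, E = fold_unfold(n, l)
P = A @ E                                       # projector onto block-constant subspace
assert np.allclose(P @ P, P)                    # idempotent
assert np.isclose(np.linalg.norm(P, 2), 1.0)    # spectral norm 1

G = np.random.randn(4, n)                       # toy gradient with m = 4 rows
G_fold = G @ A                                  # compressed to R^{m x (n/2^l)}
R = G - G_fold @ E                              # residual restores within-block variation
assert np.allclose(R, G - G @ P)
```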

2. Update Scheme, Residual Correction, and Algorithmic Implementation

On each optimization step, FOAM performs:

  1. Compute gradient: $G_t = \nabla f(W_{t-1})$.
  2. Fold the gradient: $\tilde{G}_t = G_t A^{(l)}$.
  3. Compute residual: $R_t = G_t - \tilde{G}_t E^{(l)}$.
  4. Update folded moments:
    • $\tilde{M}_t = \beta_1 \tilde{M}_{t-1} + (1-\beta_1)\tilde{G}_t$
    • $\tilde{V}_t = \beta_2 \tilde{V}_{t-1} + (1-\beta_2)\tilde{G}_t^2$
  5. Unfold and incorporate residual:
    • $M_t = \tilde{M}_t E^{(l)} + R_t$
    • $V_t = \tilde{V}_t E^{(l)} + R_t^2$
  6. Update parameters: $W_t = W_{t-1} - \eta_t \alpha \dfrac{M_t}{\sqrt{V_t + \epsilon}}$

FOAM’s design ensures the residual does not persist across steps, minimizing permanent memory overhead.
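For illustration, here is a minimal NumPy sketch of one FOAM step following steps 1–6 above, assuming a single $m \times n$ weight matrix and the fold/unfold operators from Section 1; function and variable names are illustrative, not taken from a released implementation:

```python
import numpy as np

def foam_step(W, grad_fn, state, A, E, lr=1e-3, alpha=1.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One FOAM update for a single m x n weight matrix.

    `state` persists only the folded moments (shape m x n/2^l); the residual
    R is recomputed every step and never stored.
    """
    G = grad_fn(W)                                              # 1. gradient
    G_fold = G @ A                                              # 2. fold the gradient
    R = G - G_fold @ E                                          # 3. residual
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_fold      # 4. folded first moment
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_fold**2   #    folded second moment
    M = state["M"] @ E + R                                      # 5. unfold + residual
    V = state["V"] @ E + R**2
    return W - lr * alpha * M / np.sqrt(V + eps)                # 6. parameter update

# Usage: folded moments are initialized at the compressed shape m x (n / 2^l),
# with A, E built as in the Section 1 sketch.
# m, n, l = 64, 128, 2
# state = {"M": np.zeros((m, n // 2**l)), "V": np.zeros((m, n // 2**l))}
```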

3. Memory Reduction and Comparative Analysis of Optimizer States

FOAM reduces optimizer state memory as follows:

| Optimizer   | State Storage (scalar count) | Empirical Footprint (LLaMA-350M, BF16, model + optimizer) |
|-------------|------------------------------|-----------------------------------------------------------|
| Adam        | $2mn$                        | 4.40 GB                                                   |
| MUON        | Reduced rank                 | 3.80 GB                                                   |
| Adam-Mini   | $2m$                         | 3.30 GB                                                   |
| GaLore-1/4  | Low-rank, projected          | 3.58 GB                                                   |
| APOLLO-1/4  | Low-rank, projected          | 3.58 GB                                                   |
| FOAM-2      | $2m(n/4)$                    | 2.75 GB                                                   |
| FOAM-Mini   | $2m$                         | 2.22 GB                                                   |

For fold level $l=1$ (block size 2), the optimizer state is halved. For “FOAM-Mini” (maximal folding, $l \approx \log_2 n$), the state is $O(m)$, a ~99% reduction in optimizer overhead. Empirically, FOAM-2 reduces total memory by ≈50% while matching or exceeding Adam's throughput, e.g., 35.7K tok/sec versus 34.5K on 4×3090 GPUs.
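The scalar counts in the table follow directly from the fold level. A short illustrative calculation (optimizer state only, for a single hypothetical $m \times n$ matrix; the GB figures above also include model weights):

```python
# Optimizer-state scalar counts for one m x n weight matrix (illustrative sizes).
m, n = 4096, 4096

adam      = 2 * m * n                          # full first + second moments
foam      = lambda l: 2 * m * (n // 2 ** l)    # folded moments at fold level l
foam_mini = 2 * m                              # maximal folding, l ~ log2(n)

print(f"Adam      : {adam:>12,}")
print(f"FOAM-1    : {foam(1):>12,}  ({1 - foam(1) / adam:.0%} smaller)")
print(f"FOAM-2    : {foam(2):>12,}  ({1 - foam(2) / adam:.0%} smaller)")
print(f"FOAM-Mini : {foam_mini:>12,}  ({1 - foam_mini / adam:.2%} smaller)")
```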

4. Convergence Properties and Theoretical Guarantees

Under non-convex assumptions ($L$-smoothness, unbiased gradient noise, bounded gradients), FOAM attains the same convergence rate as Adam. Denoting the residual energy ratio $\delta_l = \max_t \|R_t\| / \|G_t\| \leq 1$, the main theorem states:

$$\min_{1 \leq t \leq T} \mathbb{E}\left[\|\nabla f(W_t)\|^2\right] = O\!\left(\frac{\log T + \delta_l}{\sqrt{T}}\right) + O(\sigma^2)$$

The proof utilizes spectral properties of $P^{(l)}$ and moment error bounds. The bound decays as $O(\log T/\sqrt{T})$ over $T$ steps, matching Adam's asymptotics. This suggests that block-wise folding with residual correction does not degrade the optimizer's theoretical guarantees.
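The residual energy ratio $\delta_l$ is easy to estimate empirically. The toy sketch below (not from the paper) measures it for i.i.d. Gaussian gradients, where it approaches $\sqrt{1 - 2^{-l}} \leq 1$; gradients that are more coherent within blocks yield smaller values:

```python
import numpy as np

# Toy estimate of delta_l = ||R_t|| / ||G_t|| for random Gaussian "gradients".
rng = np.random.default_rng(0)
m, n = 256, 1024
G = rng.standard_normal((m, n))

for l in (1, 2, 3):
    b = 2 ** l
    E = np.kron(np.eye(n // b), np.ones((1, b)))  # unfold (replication) operator
    A = E.T / b                                   # fold (block-averaging) operator
    R = G - (G @ A) @ E                           # per-step residual
    delta = np.linalg.norm(R) / np.linalg.norm(G)
    print(f"l={l}: delta_l ~ {delta:.3f}  (Gaussian theory: {np.sqrt(1 - 1 / b):.3f})")
```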

5. Empirical Profiling: Performance, Throughput, and Ablation

FOAM matches or surpasses full-Adam and memory-efficient baselines across LLM pretraining tasks:

  • LLaMA-350M, 1.3B tokens: FOAM-2 achieves final PPL 15.87; Adam 17.33; GaLore-1/4 19.36; APOLLO-1/4 16.73; FOAM-Mini 16.53.
  • FOAM requires roughly half the steps to reach Adam’s PPL.
  • Throughput increases by 4–5% over Adam at equivalent memory.
  • Robustness: FOAM-$l$ degrades by <0.5 PPL even at $l=3$ (1/8 of the state).

Ablation studies confirm the necessity of residual injection: removing it results in uniform updates within each block and >1 PPL degradation. Adding the residual to the second moment ($V$) stabilizes optimization. Empirical findings extend to LLMs up to 7B parameters.

6. Compatibility, Limitations, and Open Questions

FOAM is optimizer-agnostic: it composes with Adam-Mini and MUON and is compatible with 8-bit quantization. It avoids SVD and random projections, delivering both memory and computational efficiency. A plausible implication is that FOAM can be combined with sharding frameworks (e.g., ZeRO) for further scaling and communication reduction.

Open research directions:

  • Validation for LLMs >7B parameters (e.g., at the 100B scale).
  • Generalization to higher-order tensors (for vision transformers and diffusion architectures).
  • Communication efficiency analysis for distributed training.

FOAM extends the lineage of memory-efficient optimizers, including Adam (Kingma & Ba, 2014), SM3 (Anil et al., 2019), GaLore (Zhao et al., 2024), and APOLLO (Zhu et al., 2025), by exploiting block structure and residual correction rather than relying on low-rank projections, quantization, or weight freezing. FOAM's “folded state” paradigm is distinct in incurring no additional projection memory or computational overhead. This approach addresses practical bottlenecks in scaling adaptive optimizers to ultra-large LLMs (Wen et al., 8 Dec 2025).
