
Folded Optimizer with Approximate Moment (FOAM)

Updated 15 December 2025
  • The paper introduces FOAM’s main contribution: a block-wise folding technique that maintains convergence rates while drastically reducing memory overhead.
  • FOAM employs fold-unfold matrices with residual correction to compress gradients without sacrificing per-parameter update accuracy.
  • Empirical results show FOAM achieves comparable perplexity to Adam and increases throughput with up to 50% memory reduction.

Folded Optimizer with Approximate Moment (FOAM) is a memory-efficient adaptive optimization technique designed for LLM training. FOAM compresses optimizer states by folding gradients into block-wise means and restoring full-rank updates via an ephemeral residual. The method achieves convergence rates equivalent to Adam under standard non-convex settings, eliminates up to 90% of optimizer state overhead, and improves throughput, making it practical for multi-billion-parameter LLMs (Wen et al., 8 Dec 2025).

1. Block-Wise Folding Constructs and Mathematical Foundations

FOAM operates on weight matrices $W \in \mathbb{R}^{m \times n}$ and maintains block-compressed first and second moments of the gradient $G_t$. Given a fold level $l$ (block size $2^l$), FOAM utilizes a “fold” matrix $A^{(l)} \in \mathbb{R}^{n \times (n/2^l)}$ and an “unfold” matrix $E^{(l)} \in \mathbb{R}^{(n/2^l) \times n}$. These operators perform averaging and replication within blocks, respectively:

  • $A^{(l)}_{i,j} = 1/2^l$ if $(j-1)\cdot 2^l + 1 \leq i \leq j \cdot 2^l$, else 0.
  • $E^{(l)}_{i,j} = 1$ if $(i-1)\cdot 2^l + 1 \leq j \leq i \cdot 2^l$, else 0.

The idempotent projector $P^{(l)} = A^{(l)} E^{(l)}$ (with spectral norm 1) projects gradients into block-constant subspaces. Folding compresses both moments to $\mathbb{R}^{m \times (n/2^l)}$, and the residual $R_t = G_t - G_t P^{(l)}$ restores per-parameter information on each update, ensuring differentiation within blocks.
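The construction is easy to verify numerically. Below is a minimal NumPy sketch (not from the paper; shapes and names follow the definitions above) that builds $A^{(l)}$ and $E^{(l)}$ and checks that $P^{(l)} = A^{(l)} E^{(l)}$ is an idempotent projector with spectral norm 1:

```python
import numpy as np

def fold_unfold(n: int, l: int):
    """Build fold matrix A^(l) (n x n/2^l) and unfold matrix E^(l) (n/2^l x n).

    A averages each contiguous block of 2^l coordinates; E replicates each
    block mean back to full width.
    """
    b = 2 ** l                                  # block size
    k = n // b                                  # number of blocks (assumes b divides n)
    E = np.kron(np.eye(k), np.ones((1, b)))     # (k, n): ones over each block
    A = E.T / b                                 # (n, k): block-wise averaging
    return A, E

n, l = 8, 1
A, E = fold_unfold(n, l)
P = A @ E                                       # projector onto block-constant subspace
assert np.allclose(P @ P, P)                    # idempotent
assert np.isclose(np.linalg.norm(P, 2), 1.0)    # spectral norm 1

G = np.random.randn(4, n)                       # toy gradient with m = 4 rows
G_fold = G @ A                                  # compressed to R^{m x (n/2^l)}
R = G - G_fold @ E                              # residual restores within-block variation
assert np.allclose(R, G - G @ P)
```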

2. Update Scheme, Residual Correction, and Algorithmic Implementation

On each optimization step, FOAM performs:

  1. Compute gradient: $G_t = \nabla f(W_{t-1})$.
  2. Fold the gradient: $\tilde{G}_t = G_t A^{(l)}$.
  3. Compute residual: $R_t = G_t - \tilde{G}_t E^{(l)}$.
  4. Update folded moments:
    • $\tilde{M}_t = \beta_1 \tilde{M}_{t-1} + (1-\beta_1)\tilde{G}_t$
    • $\tilde{V}_t = \beta_2 \tilde{V}_{t-1} + (1-\beta_2)\tilde{G}_t^2$
  5. Unfold and incorporate residual:
    • $M_t = \tilde{M}_t E^{(l)} + R_t$
    • $V_t = \tilde{V}_t E^{(l)} + R_t^2$
  6. Update parameters: $W_t = W_{t-1} - \eta_t \alpha \dfrac{M_t}{\sqrt{V_t + \epsilon}}$

FOAM’s design ensures the residual does not persist across steps, minimizing permanent memory overhead.
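For illustration, here is a minimal NumPy sketch of one FOAM step following steps 1–6 above, assuming a single $m \times n$ weight matrix and the fold/unfold operators from Section 1; function and variable names are illustrative, not taken from a released implementation:

```python
import numpy as np

def foam_step(W, grad_fn, state, A, E, lr=1e-3, alpha=1.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One FOAM update for a single m x n weight matrix.

    `state` persists only the folded moments (shape m x n/2^l); the residual
    R is recomputed every step and never stored.
    """
    G = grad_fn(W)                                              # 1. gradient
    G_fold = G @ A                                              # 2. fold the gradient
    R = G - G_fold @ E                                          # 3. residual
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_fold      # 4. folded first moment
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_fold**2   #    folded second moment
    M = state["M"] @ E + R                                      # 5. unfold + residual
    V = state["V"] @ E + R**2
    return W - lr * alpha * M / np.sqrt(V + eps)                # 6. parameter update

# Usage: folded moments are initialized at the compressed shape m x (n / 2^l),
# with A, E built as in the Section 1 sketch.
# m, n, l = 64, 128, 2
# state = {"M": np.zeros((m, n // 2**l)), "V": np.zeros((m, n // 2**l))}
```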

3. Memory Reduction and Comparative Analysis of Optimizer States

FOAM reduces optimizer state memory as follows:

| Optimizer   | State Storage (scalar count) | Empirical Footprint (LLaMA-350M, BF16, model + optimizer) |
|-------------|------------------------------|-----------------------------------------------------------|
| Adam        | $2mn$                        | 4.40 GB                                                   |
| MUON        | Reduced rank                 | 3.80 GB                                                   |
| Adam-Mini   | $2m$                         | 3.30 GB                                                   |
| GaLore-1/4  | Low-rank, projected          | 3.58 GB                                                   |
| APOLLO-1/4  | Low-rank, projected          | 3.58 GB                                                   |
| FOAM-2      | $2m(n/4)$                    | 2.75 GB                                                   |
| FOAM-Mini   | $2m$                         | 2.22 GB                                                   |

For fold level $l=1$ (block size 2), the optimizer state is halved. For “FOAM-Mini” (maximal folding, $l \approx \log_2 n$), the state is $O(m)$, a ~99% reduction in optimizer overhead. Empirically, FOAM-2 reduces total memory by ≈50% while matching or exceeding Adam's throughput, e.g., 35.7K tok/sec versus 34.5K on 4×3090 GPUs.
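The scalar counts in the table follow directly from the fold level. A short illustrative calculation (optimizer state only, for a single hypothetical $m \times n$ matrix; the GB figures above also include model weights):

```python
# Optimizer-state scalar counts for one m x n weight matrix (illustrative sizes).
m, n = 4096, 4096

adam      = 2 * m * n                          # full first + second moments
foam      = lambda l: 2 * m * (n // 2 ** l)    # folded moments at fold level l
foam_mini = 2 * m                              # maximal folding, l ~ log2(n)

print(f"Adam      : {adam:>12,}")
print(f"FOAM-1    : {foam(1):>12,}  ({1 - foam(1) / adam:.0%} smaller)")
print(f"FOAM-2    : {foam(2):>12,}  ({1 - foam(2) / adam:.0%} smaller)")
print(f"FOAM-Mini : {foam_mini:>12,}  ({1 - foam_mini / adam:.2%} smaller)")
```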

4. Convergence Properties and Theoretical Guarantees

Under non-convex assumptions ($L$-smoothness, unbiased gradient noise, bounded gradients), FOAM attains the same convergence rate as Adam. Denoting the residual energy ratio $\delta_l = \max_t \|R_t\| / \|G_t\| \leq 1$, the main theorem states:

$$\min_{1 \leq t \leq T} \mathbb{E}\left[\|\nabla f(W_t)\|^2\right] = O\!\left(\frac{\log T + \delta_l}{\sqrt{T}}\right) + O(\sigma^2)$$

The proof utilizes spectral properties of $P^{(l)}$ and moment error bounds. The bound decays as $O(\log T/\sqrt{T})$ over $T$ steps, matching Adam's asymptotics. This suggests that block-wise folding with residual correction does not degrade the optimizer's theoretical guarantees.
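The residual energy ratio $\delta_l$ is easy to estimate empirically. The toy sketch below (not from the paper) measures it for i.i.d. Gaussian gradients, where it approaches $\sqrt{1 - 2^{-l}} \leq 1$; gradients that are more coherent within blocks yield smaller values:

```python
import numpy as np

# Toy estimate of delta_l = ||R_t|| / ||G_t|| for random Gaussian "gradients".
rng = np.random.default_rng(0)
m, n = 256, 1024
G = rng.standard_normal((m, n))

for l in (1, 2, 3):
    b = 2 ** l
    E = np.kron(np.eye(n // b), np.ones((1, b)))  # unfold (replication) operator
    A = E.T / b                                   # fold (block-averaging) operator
    R = G - (G @ A) @ E                           # per-step residual
    delta = np.linalg.norm(R) / np.linalg.norm(G)
    print(f"l={l}: delta_l ~ {delta:.3f}  (Gaussian theory: {np.sqrt(1 - 1 / b):.3f})")
```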

5. Empirical Profiling: Performance, Throughput, and Ablation

FOAM matches or surpasses full-Adam and memory-efficient baselines across LLM pretraining tasks:

  • LLaMA-350M, 1.3B tokens: FOAM-2 achieves final PPL 15.87; Adam 17.33; GaLore-1/4 19.36; APOLLO-1/4 16.73; FOAM-Mini 16.53.
  • FOAM requires roughly half the steps to reach Adam’s PPL.
  • Throughput increases by 4–5% over Adam at equivalent memory.
  • Robustness: FOAM-$l$ degrades by <0.5 PPL even at $l=3$ (1/8 of the state).

Ablation studies confirm the necessity of residual injection: removing it results in uniform updates within each block and >1 PPL degradation. Adding the residual to the second moment ($V$) stabilizes optimization. Empirical findings extend to LLMs up to 7B parameters.

6. Compatibility, Limitations, and Open Questions

FOAM is optimizer-agnostic: it composes with Adam-Mini and MUON and is compatible with 8-bit quantization. It avoids SVD and random projections, delivering both memory and computational efficiency. A plausible implication is that FOAM can be combined with sharding frameworks (e.g., ZeRO) for further scaling and communication reduction.

Open research directions:

  • Validation for LLMs >7B parameters (e.g., at the 100B scale).
  • Generalization to higher-order tensors (for vision transformers and diffusion architectures).
  • Communication efficiency analysis for distributed training.

FOAM extends the lineage of memory-efficient optimizers, including Adam (Kingma & Ba, 2014), SM3 (Anil et al., 2019), GaLore (Zhao et al., 2024), and APOLLO (Zhu et al., 2025), by exploiting block structure and residual correction rather than relying on low-rank projections, quantization, or weight freezing. FOAM's “folded state” paradigm is distinct in incurring no additional projection memory or computational overhead. This approach addresses practical bottlenecks in scaling adaptive optimizers to ultra-large LLMs (Wen et al., 8 Dec 2025).
