Folded Optimizer with Approximate Moment (FOAM)
- The paper introduces FOAM’s main contribution: a block-wise folding technique that maintains convergence rates while drastically reducing memory overhead.
- FOAM employs fold-unfold matrices with residual correction to compress gradients without sacrificing per-parameter update accuracy.
- Empirical results show FOAM matches or improves on Adam's perplexity while increasing throughput and reducing total memory by up to 50%.
Folded Optimizer with Approximate Moment (FOAM) is a memory-efficient adaptive optimization technique designed for LLM training. FOAM compresses optimizer states by folding gradients into block-wise means and restoring full-rank updates via an ephemeral residual. The method achieves convergence rates equivalent to Adam under standard non-convex settings, eliminates up to 90% of optimizer state overhead, and accelerates throughput, rendering it practical for LLMs with multi-billion parameters (Wen et al., 8 Dec 2025).
1. Block-Wise Folding Constructs and Mathematical Foundations
FOAM operates on weight matrices $W \in \mathbb{R}^{m \times n}$ and maintains block-compressed first and second moments of the gradient $G \in \mathbb{R}^{m \times n}$. Given a fold level $\ell$ (block size $\ell$), FOAM utilizes a “fold” matrix $F \in \mathbb{R}^{(n/\ell) \times n}$ and an “unfold” matrix $U \in \mathbb{R}^{n \times (n/\ell)}$. These operators perform averaging and replication within blocks, respectively:
- $F_{ij} = 1/\ell$ if column $j$ lies in block $i$ (i.e., $\lceil j/\ell \rceil = i$), else $0$.
- $U_{ij} = 1$ if row $i$ lies in block $j$ (i.e., $\lceil i/\ell \rceil = j$), else $0$.
The idempotent projector $P = UF$ (with spectral norm 1) projects gradients onto the block-constant subspace. Folding compresses both moments from $m \times n$ to $m \times (n/\ell)$, and the residual $G - GP$ restores per-parameter information on each update, ensuring differentiation within blocks.
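A minimal NumPy sketch of these operators (the exact shapes and the column-wise block orientation are assumptions consistent with the state counts in Section 3, not code from the paper):

```python
import numpy as np

def fold_unfold(n, block):
    """Build the fold (block averaging) and unfold (block replication) operators."""
    assert n % block == 0
    k = n // block
    F = np.zeros((k, n))              # fold: (n/l) x n
    U = np.zeros((n, k))              # unfold: n x (n/l)
    for b in range(k):
        cols = slice(b * block, (b + 1) * block)
        F[b, cols] = 1.0 / block      # average within block b
        U[cols, b] = 1.0              # replicate block b's mean
    return F, U

F, U = fold_unfold(n=8, block=2)
P = U @ F                             # projector onto block-constant vectors
assert np.allclose(P @ P, P)                      # idempotence
assert np.isclose(np.linalg.norm(P, 2), 1.0)      # spectral norm 1

G = np.random.randn(4, 8)             # toy gradient, m = 4, n = 8
G_fold = G @ F.T                      # block means, shape (4, 4)
R = G - G_fold @ U.T                  # residual G(I - P): within-block detail
assert np.allclose(G_fold @ U.T + R, G)           # fold + residual is lossless
```

Because $FU = I$, the composition $P = UF$ is a symmetric idempotent projector, which is what makes the fold/unfold pair lossless once the residual is added back.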
2. Update Scheme, Residual Correction, and Algorithmic Implementation
On each optimization step, FOAM performs:
- Compute gradient: $G_t = \nabla_W \mathcal{L}(W_t)$.
- Fold the gradient into block means (for the moment updates): $\hat{G}_t = G_t F^{\top} \in \mathbb{R}^{m \times (n/\ell)}$.
- Compute residual: $R_t = G_t - \hat{G}_t U^{\top} = G_t(I - P)$.
- Update folded moments: $\hat{M}_t = \beta_1 \hat{M}_{t-1} + (1-\beta_1)\hat{G}_t$, $\;\hat{V}_t = \beta_2 \hat{V}_{t-1} + (1-\beta_2)\hat{G}_t^{\odot 2}$.
- Unfold and incorporate residual: the folded moments are expanded back to full size ($\hat{M}_t U^{\top}$, $\hat{V}_t U^{\top}$) and the current-step residual $R_t$ is injected into both to restore within-block variation.
- Update parameters: apply the standard Adam-style rule using the unfolded, residual-corrected moments (see the sketch after this list).
FOAM’s design ensures the residual does not persist across steps, minimizing permanent memory overhead.
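A sketch of a single FOAM-style step built on the operators above; the residual-injection weights (reusing $1-\beta_1$ and $1-\beta_2$) and the final Adam-style rule are assumptions for illustration, since the paper's exact coefficients are not reproduced here:

```python
import numpy as np

def foam_step(W, grad, state, F, U, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One illustrative FOAM-style update (assumed coefficients, see above)."""
    g_fold = grad @ F.T                       # block-mean compression, m x (n/l)
    resid = grad - g_fold @ U.T               # ephemeral residual, never stored
    state["m"] = b1 * state["m"] + (1 - b1) * g_fold       # folded first moment
    state["v"] = b2 * state["v"] + (1 - b2) * g_fold ** 2  # folded second moment
    state["t"] += 1
    t = state["t"]
    # Unfold and inject the current-step residual to restore per-parameter detail.
    m_full = state["m"] @ U.T + (1 - b1) * resid
    v_full = state["v"] @ U.T + (1 - b2) * resid ** 2
    m_hat = m_full / (1 - b1 ** t)            # Adam-style bias correction
    v_hat = v_full / (1 - b2 ** t)
    return W - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: only the folded moments (2 * m * n/l scalars) persist in `state`.
m_dim, n_dim, block = 4, 8, 2
k = n_dim // block
F = np.kron(np.eye(k), np.ones((1, block)) / block)   # fold operator, as above
U = np.kron(np.eye(k), np.ones((block, 1)))           # unfold operator, as above
W = np.random.randn(m_dim, n_dim)
state = {"m": np.zeros((m_dim, k)), "v": np.zeros((m_dim, k)), "t": 0}
W = foam_step(W, np.random.randn(m_dim, n_dim), state, F, U)
```

Note that `resid` is a local variable recomputed from the current gradient, matching the point above that the residual never persists across steps.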
3. Memory Reduction and Comparative Analysis of Optimizer States
FOAM reduces optimizer state memory as follows:
| Optimizer | State Storage (scalar count) | Empirical Footprint (LLaMA-350M, BF16, model + optimizer) |
|---|---|---|
| Adam | $2mn$ | 4.40 GB |
| MUON | Reduced rank | 3.80 GB |
| Adam-Mini | $2m$ | 3.30 GB |
| GaLore-1/4 | Low-rank, projected | 3.58 GB |
| APOLLO-1/4 | Low-rank, projected | 3.58 GB |
| FOAM-2 | $2m(n/4)$ | 2.75 GB |
| FOAM-Mini | $2m$ | 2.22 GB |
For fold level $\ell = 2$ (block size 2), optimizer state is halved. For “FOAM-Mini” (maximal folding, $\ell = n$), state is $2m$, representing a 99% reduction in optimizer overhead. Empirically, FOAM-2 reduces total memory by 50% and matches or exceeds Adam's throughput, e.g., FOAM-2 reaches 35.7K tok/sec versus Adam's 34.5K on 4×3090 GPUs (see the scaling sketch below).
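As a quick sanity check on the scaling (assuming the FOAM-$\ell$ state is exactly the two folded moments, $2m(n/\ell)$ scalars, and using a hypothetical layer shape rather than figures from the paper):

```python
def opt_state_scalars(m, n, fold=None):
    """Optimizer-state scalar count: 2*m*n for Adam, 2*m*(n/fold) for FOAM-fold."""
    return 2 * m * n if fold is None else 2 * m * (n // fold)

m, n = 4096, 11008                              # hypothetical layer shape
adam      = opt_state_scalars(m, n)             # 90,177,536 scalars
foam_2    = opt_state_scalars(m, n, fold=2)     # 45,088,768 scalars (halved)
foam_mini = opt_state_scalars(m, n, fold=n)     # 8,192 scalars (one pair per row)
print(f"FOAM-Mini keeps {foam_mini / adam:.4%} of Adam's optimizer state")
```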
4. Convergence Properties and Theoretical Guarantees
Under standard non-convex assumptions (L-smoothness, unbiased gradient noise, bounded gradients), FOAM attains the same convergence rate as Adam. Denoting the residual energy ratio by $\rho$ (the fraction of gradient energy outside the block-constant subspace), the main theorem bounds the expected gradient norm of the iterates in terms of $\rho$ and the problem constants.
The proof utilizes spectral properties of the projector $P = UF$ and moment-error bounds. The bound decays as $O(1/\sqrt{T})$ over $T$ steps, matching Adam's asymptotics. This suggests block-wise folding with residual correction does not degrade the optimizer's theoretical guarantees.
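As an illustration of the shape such a guarantee takes (the constant $C(\rho, L, \sigma)$ and its dependence on the residual energy ratio are placeholders, not the paper's exact statement):

```latex
% Generic form of a non-convex convergence guarantee matching Adam's rate;
% C(\rho, L, \sigma) is a placeholder constant, not the paper's exact expression.
\[
  \min_{1 \le t \le T} \; \mathbb{E}\bigl\|\nabla f(W_t)\bigr\|^{2}
  \;\le\; \frac{C(\rho, L, \sigma)}{\sqrt{T}} .
\]
```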
5. Empirical Profiling: Performance, Throughput, and Ablation
FOAM matches or surpasses full-Adam and memory-efficient baselines across LLM pretraining tasks:
- LLaMA-350M, 1.3B tokens: FOAM-2 achieves final PPL 15.87; Adam 17.33; GaLore-1/4 19.36; APOLLO-1/4 16.73; FOAM-Mini 16.53.
- FOAM requires roughly half the steps to reach Adam’s PPL.
- Throughput increases by 4–5% over Adam at equivalent memory.
- Robustness: FOAM-$\ell$ degrades by only about 0.5 PPL even at $\ell = 8$ (1/8 of the optimizer state).
Ablation studies confirm the necessity of residual injection: removing it results in uniform within-block updates and a 1 PPL degradation. Adding the residual to the second moment ($V$) as well stabilizes optimization. Empirical findings extend to LLMs up to 7B parameters.
6. Compatibility, Limitations, and Open Questions
FOAM is optimizer-agnostic: it is composable with Adam-Mini and MUON, and compatible with 8-bit quantization. It avoids SVD and random projections, delivering both memory and computational efficiency. A plausible implication is that FOAM can be combined with sharded-training frameworks (e.g., ZeRO) for further scaling and communication reduction.
Open research directions:
- Validation for LLMs beyond 7B parameters (e.g., at the 100B-parameter scale).
- Generalization to higher-order tensors (for vision transformers and diffusion architectures).
- Communication efficiency analysis for distributed training.
7. Context and Related Methods
FOAM extends the lineage of memory-efficient optimizers—Adam (Kingma & Ba, 2014), SM3 (Anil et al., 2019), GaLore (Zhao et al., 2024), APOLLO (Zhu et al., 2025)—by exploiting block structure and residual correction rather than relying on low-rank projections, quantization, or weight freezing. FOAM's “folded state” paradigm is distinct in that it incurs no additional projection memory or computational overhead. This approach addresses practical bottlenecks in scaling adaptive optimizers for ultra-large LLMs (Wen et al., 8 Dec 2025).