Low-Memory Optimization (LOMO) Methods

Updated 23 June 2026

Low-Memory Optimization (LOMO) is a collection of techniques that minimize memory usage in optimizing large-scale machine learning models while preserving accuracy and convergence.
It reduces auxiliary memory needs through methods like optimizer state compression, fused backward-update loops, and reversible network architectures.
LOMO enables practical applications in on-device fine-tuning, federated learning, and resource-efficient training by balancing memory efficiency with computational trade-offs.

Low-Memory Optimization (LOMO) encompasses a diverse set of algorithmic, architectural, and systems-level techniques designed to enable efficient optimization of modern large-scale machine learning models under stringent memory constraints. LOMO methods span a wide range of domains, including deep learning, nonsmooth optimization, embedded systems, and resource-constrained distributed settings. They are unified by the principle of reducing auxiliary memory required for optimizer state, activations, intermediate buffers, or system-level data structures, while preserving accuracy and convergence guarantees. Recent advances in LOMO have enabled full-parameter fine-tuning of multi-billion parameter models on commodity GPUs, on-device adaptation of LLMs, memory-optimal federated learning, and highly resource-efficient inference and training regimes.

1. Historical Foundations in Subgradient and Convex Optimization

Classical low-memory optimization strategies originated in large-scale convex and nonsmooth optimization, where memory limits necessitated algorithms with only O(n) or O(n+m) auxiliary storage—far less than required for Hessians, full gradient history, or large cutting-plane bundles. Notable developments include:

Shor’s Normalized Subgradient Method: Maintains only the current iterate and one subgradient vector; requires O(n) storage and guarantees suboptimality within λ(1+ε)/2 for any ε>0 (Dvurechensky et al., 2019).
Nesterov’s Smoothing and Accelerated Methods: Applies adaptive smoothing and acceleration to nonsmooth objectives, storing only the current point, a dual variable, and a subgradient, yielding O(n+m) memory and O(1/ε) complexity (Dvurechensky et al., 2019).
Universal Accelerated and Mirror-Prox Methods: Provide optimal black-box rates for Hölder continuous objectives and monotone variational inequalities, with minimal per-iteration storage (Dvurechensky et al., 2019).
Adaptive Mirror Descent (Primal-Dual): Solves constrained and variational problems using only a running iterate, (sub)gradient, and small auxiliary state, achieving O(1/ε²) iteration complexity with O(n+m) storage (Dvurechensky et al., 2019).
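
To make the O(n) storage claim concrete, the following is a minimal sketch of a normalized subgradient iteration on a toy nonsmooth objective. The objective (an L1 distance to a fixed target), the step-size rule h0/sqrt(k+1), and all function names are illustrative assumptions and are not taken from the cited work. The only vectors kept are the current iterate and one subgradient; the best objective value is tracked as a single scalar.

```python
import math

def l1_distance(x, c):
    """Toy nonsmooth objective f(x) = ||x - c||_1 (assumed for illustration)."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def l1_subgradient(x, c):
    """One valid subgradient of f at x: sign(x_i - c_i), with 0 at a kink."""
    return [(xi > ci) - (xi < ci) for xi, ci in zip(x, c)]

def normalized_subgradient(x0, c, steps=2000, h0=1.0):
    x = list(x0)                      # current iterate: O(n) storage
    best = l1_distance(x, c)          # best value seen so far: one scalar
    for k in range(steps):
        g = l1_subgradient(x, c)      # single subgradient: O(n) storage
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break                     # zero subgradient: x is a minimizer
        h = h0 / math.sqrt(k + 1)     # diminishing step size (assumed rule)
        for i in range(len(x)):
            x[i] -= h * g[i] / norm   # step along the normalized direction
        best = min(best, l1_distance(x, c))
    return x, best

target = [1.0, -2.0, 0.5]
x_final, best_value = normalized_subgradient([0.0, 0.0, 0.0], target)
```

No gradient history, bundle, or curvature estimate is stored, which is the property the methods above share.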

These foundational techniques established that, even for high-dimensional and nonsmooth problems, efficiency and optimality could be attained with only linear, O(n) or O(n+m), memory overhead.

2. System-Level and Architectural LOMO: Memory Partitioning and Reversible Networks

System-level LOMO methods exploit operating system or hardware-level mechanisms to partition, isolate, or reconstruct memory resources:

Virtual Memory Node Partitioning: Lim et al. introduced a scheme that partitions physical memory into k virtual nodes (“vnodes”), each dedicated to a process class (e.g., critical vs non-critical applications) (Lim et al., 2021). Formally, $\sum_{i=1}^k M_i = M$ (total memory), with strict caps $T_i = M_i$ per vnode. Local memory pressure is handled within vnode boundaries via selective reclamation. This prevents the activation of global low-memory killers, dramatically increasing free memory (+670 MB), reducing process launch latency (–1015 ms), and fully eliminating system-wide LMK/OOMK events under workload (Lim et al., 2021).
Reversible Transformer Networks: In large model fine-tuning, RevFFN introduces reversible blocks that split activations and perform coupled forward updates $Y_1 = X_1 + f(X_1, X_2)$, $Y_2 = X_2 + g(Y_1)$, enabling reconstruction of the original activations during the backward pass without caching all intermediates (Liu et al., 24 Dec 2025). This achieves a 2× reduction in activation memory and enables full-model fine-tuning (e.g., 2.7B Mixture-of-Experts LLMs) on a single 80 GB GPU, outperforming activation checkpointing and distributed-sharding frameworks (Liu et al., 24 Dec 2025).
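
The reconstruction idea behind reversible blocks can be illustrated with a small sketch. It assumes the standard additive coupling in which the first residual branch depends only on the second half, $Y_1 = X_1 + f(X_2)$, $Y_2 = X_2 + g(Y_1)$; this is a simplification of the coupling quoted above and does not reproduce RevFFN's actual blocks. The branch functions `f` and `g` are arbitrary placeholders.

```python
import math

def f(x2):
    """Placeholder residual branch (assumed); any function of x2 works."""
    return [math.tanh(v) for v in x2]

def g(y1):
    """Placeholder residual branch (assumed); any function of y1 works."""
    return [0.5 * math.sin(v) for v in y1]

def forward(x1, x2):
    # Coupled forward updates: y1 = x1 + f(x2), y2 = x2 + g(y1).
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Recover the inputs from the outputs alone, in reverse order.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2], [2.0, 0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)   # inputs recomputed, not cached
```

Because the inputs can be recomputed from the outputs up to floating-point rounding, a backward pass only needs the outputs of the last block rather than every intermediate activation.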

These approaches shift parts of the memory management problem to system primitives or invertible computation, offloading or reconstructing information as needed to stay within hardware bounds.

3. Optimizer-Centric LOMO: State Compression, Fusion, and Block Selection

Mainstream LOMO research focuses on redesigning optimizer state management and parameter updates to minimize memory overhead without sacrificing statistical adaptivity:

Optimizer State Compression and Factorization:
- Extreme Tensoring (ET): Diagonal preconditioners (e.g., AdaGrad) are factorized as Kronecker products of mode-wise marginal accumulators, reducing O(d) memory to O(k·d^{1/k}). With $k=2$ or 3, ET matches Adam’s performance at 1000× less memory on ResNet-18 and transformer tasks (Chen et al., 2019).
- Adafactor and SMMF: Both maintain low-rank or factored statistics for second moments (e.g., NNMF on square-matricized tensors in SMMF, minimizing auxiliary state to $\sim 2(\hat n + \hat m)$ per tensor) with up to 96% memory reduction compared to Adam, and maintain performance parity across CNNs and transformers (Park et al., 2024).
- CAME: Supplements Adafactor with a confidence-guided momentum scaling, stabilizing updates via per-block moving-average factorization of instantaneous instability, storing only four vectors per block (Luo et al., 2023).
- Low-rank Momentum Factorization (MoFaSGD): Approximates the first-order moment with an adaptive rank-$r$ SVD, updating thin factors via efficient QR/SVD operations, requiring O((m+n)r) memory (e.g., 66% saving over AdamW for r=32) (Mahdavinia et al., 10 Jul 2025).
Fused Backward-Update Loops: LOMO fuses gradient computation and parameter update in a single backward pass, so that no full gradient tensor is ever materialized; peak memory is O(1) in the number of layers for gradient storage (Lv et al., 2023). AdaLomo extends this by adding an NNMF-compressed adaptive learning rate per parameter; grouped update normalization ensures stability and recovers AdamW-level convergence (Lv et al., 2023).
Block-Coordinate Descent (BlockLLM): Updates are performed only on selected parameter blocks (typically transformer layers) with the highest normalized gradient magnitude, updating fewer than 5% of parameters and optimizer state per step. Memory scales as $[1+3(1-s)]n\,b$ for sparsity $s$, with empirical VRAM savings of 13% vs. rank-8 GaLore and accuracy parity (Ramesh et al., 2024).
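
As an illustration of the factored second-moment idea used by Adafactor-style optimizers, the sketch below keeps only row and column sums of the squared gradient of an m×n weight and rebuilds individual entries on demand. It is a single-step simplification: the exponential moving averages, clipping, and update rule of the real optimizers are omitted, and the function names are hypothetical.

```python
def factor_stats(grad):
    """Accumulate row and column sums of the squared gradient (m + n numbers)."""
    m, n = len(grad), len(grad[0])
    row = [0.0] * m
    col = [0.0] * n
    for i in range(m):
        for j in range(n):
            sq = grad[i][j] ** 2
            row[i] += sq
            col[j] += sq
    return row, col

def reconstruct(row, col, i, j):
    """Rank-1 estimate of the squared-gradient entry (i, j)."""
    return row[i] * col[j] / sum(row)

# A rank-1 gradient a_i * b_j, for which the factored estimate is exact.
a, b = [1.0, 2.0, 3.0], [1.0, -2.0]
grad = [[ai * bj for bj in b] for ai in a]
row, col = factor_stats(grad)
estimate = reconstruct(row, col, 2, 1)   # compare with grad[2][1] ** 2
```

For this 3×2 example the factored state holds 5 numbers instead of 6; the gap grows to m+n versus m·n for large matrices, at the cost of an approximation whenever the squared gradient is not rank one.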

Combinations of these methods—statistical adaptivity, low-rank/matrix compressions, per-block or vnodal isolation, fusion—underpin almost all current LOMO optimizers.
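
The fused backward-update loop described in this section can be sketched on a toy model. The example assumes a chain of scalar linear layers with a squared-error loss and plain SGD, which is far simpler than the LOMO implementation; it only shows that updating each layer as soon as its gradient is available gives the same result as the usual two-phase step while never holding a full gradient buffer.

```python
def forward(weights, x):
    """Forward pass through scalar layers h_{l+1} = w_l * h_l, keeping activations."""
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])
    return acts

def sgd_step_standard(weights, x, y, lr):
    """Two-phase step: materialize every gradient, then update."""
    acts = forward(weights, x)
    delta = acts[-1] - y                  # dLoss/dOutput for 0.5 * (out - y)^2
    grads = [0.0] * len(weights)          # full gradient buffer
    for l in reversed(range(len(weights))):
        grads[l] = delta * acts[l]
        delta = delta * weights[l]
    return [w - lr * g for w, g in zip(weights, grads)]

def sgd_step_fused(weights, x, y, lr):
    """Fused step: update each layer inside the backward loop."""
    weights = list(weights)
    acts = forward(weights, x)
    delta = acts[-1] - y
    for l in reversed(range(len(weights))):
        grad_l = delta * acts[l]          # the only gradient alive at this point
        delta = delta * weights[l]        # propagate using the pre-update weight
        weights[l] -= lr * grad_l         # apply the update immediately
    return weights

w0 = [0.5, -1.5, 2.0]
standard = sgd_step_standard(w0, x=1.0, y=0.25, lr=0.1)
fused = sgd_step_fused(w0, x=1.0, y=0.25, lr=0.1)
```

Note that the sketch still stores activations; fusion removes the gradient buffer only, which is why it is typically combined with activation-side techniques.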

4. Activation and Forward Memory Minimization: Token/Fine-Grained Sparsity and Low-Precision

Beyond optimizer state, LOMO targets reduction of activation and intermediate buffer memory, critical in long-context and extreme classification:

Token Elimination and Contextual Sparsity (LeMo): By profiling attention scores at the block level and training tiny predictors of block informativeness, only a fraction of the tokens is retained per layer; memory footprint and step speed improve by up to 1.93× and 1.36×, respectively, over LoRA/LongLoRA without notable loss in perplexity or accuracy for sequences up to 16K (Wang et al., 15 Jan 2025).
Gradient and Update Fusion in Low Precision (ELMO, ME-MPO): ELMO eliminates both FP32 master weights and DRAM-resident gradient/momentum buffers for gigantic classification heads (e.g., 3M labels). All forward passes and updates are carried out in BF16/FP8 with Kahan summation or stochastic rounding, and chunked kernel-fused updates reduce peak memory to 6.6 GB (vs. 39.7 GB for the baseline) with less than 1% accuracy loss (Zhang et al., 13 Oct 2025). ME-MPO generalizes this by representing parameters as FP16 plus a number of “extra bits” and fusing optimizer steps into backward hooks, yielding 16–25% lower memory (Lewandowski et al., 2023).
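
The role of Kahan summation when no high-precision master weights are kept can be shown with a small experiment. The sketch below emulates IEEE half precision (FP16) with Python's `struct` format `"e"`; FP16 is used here only because the standard library supports it, whereas the methods above target BF16/FP8. The weight, update size, and step count are arbitrary illustrative values.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def naive_accumulate(weight, update, steps):
    """Add a small update repeatedly, rounding to FP16 after every step."""
    w, u = fp16(weight), fp16(update)
    for _ in range(steps):
        w = fp16(w + u)
    return w

def kahan_accumulate(weight, update, steps):
    """Same loop with a Kahan compensation term, also stored in FP16."""
    w, u = fp16(weight), fp16(update)
    c = fp16(0.0)                      # running rounding error
    for _ in range(steps):
        y = fp16(u - c)                # correct the update by the stored error
        t = fp16(w + y)                # low-precision add (may drop y entirely)
        c = fp16(fp16(t - w) - y)      # what was actually added minus intended
        w = t
    return w

naive = naive_accumulate(1.0, 1e-4, 1000)   # each update is below FP16 resolution at 1.0
kahan = kahan_accumulate(1.0, 1e-4, 1000)   # compensation recovers the lost mass
```

In the naive loop every update is rounded away, so the weight never moves; the compensated loop ends close to the exact value of about 1.1 while storing only one extra low-precision number per parameter.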

These approaches exploit the redundancy in token and label space, as well as modern hardware support for low-precision arithmetic, to minimize transient memory use.

5. LOMO in Distributed, Federated, and Zeroth-Order Environments

LOMO directly addresses federated and on-device learning, where training must proceed under strict per-device memory envelopes:

Federated Foresight Pruning + BP-Free Training: Devices compute per-parameter saliency via Monte Carlo NTK-based scores; pruned networks retain only the most influential parameters (Zhang et al., 2024). Zeroth-order (Stein’s identity) optimization then runs on the sparse subnet, reducing peak memory up to 9× over full backprop and lowering FLOPs by 5–10× (Zhang et al., 2024). Empirically, accuracy loss relative to FedAvg is within 3–5% at 90% sparsity.
Low-Rank Jacobian Backpropagation with Error Feedback (GradLite): LLM fine-tuning under severe memory budgets employs low-rank projections, storing only O(k) activations per layer and compensating for projection bias via error feedback (Yang et al., 26 Oct 2025). Empirical peak VRAM drops 41–50% vs. checkpointing; theoretical convergence guarantees are preserved.
Memory Partitioning in Embedded Systems: In Android environments, fixed vnode boundaries and CPU affinity ensure applications never trigger system-level OOM, crucial for reliability (Lim et al., 2021).
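
Backpropagation-free training on a pruned subnetwork can be sketched as follows. The example assumes a fixed, hand-chosen pruning mask, a toy quadratic loss, and a two-point Gaussian-smoothing gradient estimator; the saliency scoring, federated aggregation, and exact estimator of the cited method are not reproduced, and all names are hypothetical.

```python
import random

def loss(theta, target):
    """Toy quadratic loss standing in for a model's forward pass."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def zo_step(theta, target, mask, rng, lr=0.05, mu=1e-3):
    # Random Gaussian direction restricted to retained (unpruned) coordinates.
    u = [rng.gauss(0.0, 1.0) if keep else 0.0 for keep in mask]
    plus = [t + mu * d for t, d in zip(theta, u)]
    minus = [t - mu * d for t, d in zip(theta, u)]
    # Two forward evaluations estimate the directional derivative along u.
    slope = (loss(plus, target) - loss(minus, target)) / (2.0 * mu)
    # Update retained coordinates only; pruned ones stay untouched.
    return [t - lr * slope * d if keep else t
            for t, d, keep in zip(theta, u, mask)]

rng = random.Random(0)
target = [1.0, -1.0, 2.0, 0.5, -0.5]
mask = [True, False, True, True, False]    # assumed pruning mask
theta = [0.0] * 5
initial_loss = loss(theta, target)
for _ in range(200):
    theta = zo_step(theta, target, mask, rng)
final_loss = loss(theta, target)
```

Each step needs only forward evaluations and one direction vector over the retained parameters, so no activations have to be stored for a backward pass.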

These advances have made memory-optimal collaborative learning feasible for edge, mobile, and federated networks.

6. Formal Analysis, Regret, and Convergence Properties

Theoretical guarantees are central to the credibility of LOMO:

Regret and Convergence Bounds:
- SMMF, Adafactor, Extreme Tensoring, MoFaSGD: All match the best-known O(1/√T) or O(1/ε) rates of their full-memory analogues under standard stochastic or deterministic oracles (Chen et al., 2019, Mahdavinia et al., 10 Jul 2025, Park et al., 2024).
- GradLite: Low-rank backprop with error feedback maintains unbiasedness and achieves the same mean convergence guarantee as vanilla SGD/Adam (Yang et al., 26 Oct 2025).
Empirical-Statistical Trade-offs: Adafactor, AdaLomo, MoFaSGD, and CAME show performance degradation only if the factorization/compression rank is too aggressive relative to parameter structure; all provide detailed ablation studies justifying memory/accuracy trade-offs (Lv et al., 2023, Mahdavinia et al., 10 Jul 2025, Park et al., 2024, Luo et al., 2023).
Practical Memory Formulas: For optimizers, per-layer state scales as (parameters) + (O(factorization state)), with auxiliary buffer reduction typically 2–10× depending on technique and rank/pattern. For activations, reductions of 2–10× are common with sparse token involvement or reversible architectures.
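
A short worked calculation shows how such state counts compare for a single weight matrix. The sketch counts stored optimizer-state entries only, under the assumptions that Adam keeps two full moments, a factored second moment keeps one row and one column vector, and a rank-r momentum keeps two thin factors; it ignores parameters, gradients, and activations, so the ratios are not end-to-end memory savings. The matrix size and rank are illustrative.

```python
def adam_state(m, n):
    """First and second moments, each the size of the m x n weight."""
    return 2 * m * n

def factored_second_moment_state(m, n):
    """One row vector and one column vector of statistics."""
    return m + n

def low_rank_momentum_state(m, n, r):
    """Two thin factors of a rank-r momentum approximation: O((m + n) r)."""
    return (m + n) * r

m = n = 4096
full = adam_state(m, n)                         # 33,554,432 entries
factored = factored_second_moment_state(m, n)   # 8,192 entries
low_rank = low_rank_momentum_state(m, n, 32)    # 262,144 entries
```

End-to-end savings are smaller than these per-matrix ratios suggest because parameters, gradients, and activations remain, which is consistent with the more modest overall reductions reported above.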

LOMO thus bridges the gap between theoretical optimization and practical system limitations.

7. Applications, Limitations, and Future Outlook

LOMO is now critical to full-parameter fine-tuning of LLMs, on-device adaptation, federated learning, and large-output extreme classification. Standard practices include:

Combining LOMO with ZeRO/factorized sharding or activation checkpointing for end-to-end minimization of parameter, state, and activation memory (Lv et al., 2023, Lv et al., 2023).
Practical tuning: memory/fidelity can be dialed in via factorization rank, block size, or token sparsity threshold; factorization overhead is negligible for suitable k, and practical implementations routinely profile or ablate these parameters (Mahdavinia et al., 10 Jul 2025, Park et al., 2024, Wang et al., 15 Jan 2025).
Limitations: Most LOMO schemes trade memory for additional compute, especially in backprop recomputation or matrix/tensor factorization. Some methods (e.g., activation pruning) may degrade compatibility with inflexible hardware layouts or black-box accelerator stacks (Wang et al., 15 Jan 2025). Certain extremely large-batch/federated scenarios still demand further research in stabilizing convergence.

A plausible implication is that continued progress in LOMO, including hybrid quantization, dynamic adaptive partitioning, and hierarchical or multi-scale factorization, will be foundational for the next generation of resource- and privacy-aware AI systems at all scales.