Low-Memory Optimization (LOMO)
- LOMO is a set of algorithmic and systems-level strategies designed to reduce the memory footprint of activations, parameters, gradients, and optimizer states in deep learning.
- It employs techniques such as in-place updates, low-rank factorization, token and block pruning, and optimized memory scheduling to manage resource constraints.
- LOMO enables full-parameter and parameter-efficient model adaptation on commodity hardware, democratizing access to large-scale neural network training and deployment with minimal performance impact.
Low-Memory Optimization (LOMO) refers to a broad class of algorithmic and systems-level strategies that explicitly target the reduction of memory footprint—activation, parameter, gradient, and/or optimizer-state—during the training, fine-tuning, or inference of deep neural networks. The rapid growth in parameter count, context window, and output size in models spanning LLMs, computer vision, and extreme classification has made memory the principal bottleneck in scaling such systems. LOMO methods enable full-parameter and parameter-efficient adaptation of large models on commercially available hardware, democratizing access and improving the practicality of downstream deployment.
1. Foundational Principles and Motivation
LOMO approaches are motivated by the empirical scaling behavior of standard deep learning workflows, in which memory use grows linearly or quadratically with the number of activations (context length, batch size), model parameters, and optimizer states. For instance, transformer-based LLMs require activation memory that frequently exceeds the parameter state by more than 50× at long context windows (Wang et al., 15 Jan 2025). AdamW-style optimizers demand an additional 2× the parameter count to store first and second moments per parameter (Xiao et al., 26 Nov 2024, Luo et al., 2023), while extreme classification heads with millions of output labels may drive GPU usage into the tens of GBs (Zhang et al., 13 Oct 2025). In federated and edge settings, full backpropagation is not feasible for low-memory clients (Zhang et al., 8 May 2024).
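As a concrete illustration of these scaling terms, the sketch below estimates the per-component training memory of a dense model under a conventional mixed-precision AdamW recipe (BF16 weights and gradients plus FP32 master weights and two moment buffers). The recipe and byte counts are standard conventions rather than figures from any single cited paper; the 65B parameter count mirrors the LLaMA-65B example discussed in Section 3.

```python
def adamw_training_memory_gib(n_params: float) -> dict:
    """Rough per-component memory estimate for mixed-precision AdamW.

    Assumes BF16 weights and gradients (2 bytes each) plus FP32 master
    weights, first moment, and second moment (4 bytes each). Activation
    memory is excluded: it depends on batch size and context length.
    """
    gib = 1024 ** 3
    return {
        "bf16_weights": 2 * n_params / gib,
        "bf16_grads": 2 * n_params / gib,
        "fp32_master": 4 * n_params / gib,
        "fp32_moment1": 4 * n_params / gib,
        "fp32_moment2": 4 * n_params / gib,
    }

if __name__ == "__main__":
    est = adamw_training_memory_gib(65e9)  # LLaMA-65B-scale model
    for name, gib in est.items():
        print(f"{name:>13}: {gib:8.1f} GiB")
    print(f"{'total':>13}: {sum(est.values()):8.1f} GiB")
    # Optimizer state alone (master weights + two moments) is ~3x the
    # BF16 weights -- exactly the bottleneck many LOMO methods target.
```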
The objective of LOMO is therefore to compress, sparsify, or partition memory usage at each major bottleneck with minimal sacrifice (often no detectable degradation) in training convergence or final task performance.
2. Core Algorithmic Strategies
LOMO encompasses multiple methodological classes:
- In-place and Fused Update Schemes: These methods fuse gradient computation with in-place parameter updates, immediately updating parameters as their gradients are computed and discarding each gradient after use. This eliminates the need for a full-model gradient buffer and, when combined with plain SGD (i.e., no momentum/variance states), reduces the memory devoted to gradients and optimizer state to essentially a single in-flight gradient tensor (Lv et al., 2023, Lv et al., 2023); see the fused-update sketch after this list. AdaLomo extends this with non-negative matrix factorization of the second-moment state, supporting AdamW-style adaptivity at comparable memory cost (Lv et al., 2023).
- Low-rank and Tensorized Optimizer State: By factorizing momentum or second-moment accumulators into rank-1 (or low-rank) representations, methods such as Adafactor, SMMF, and CAME replace full $O(mn)$ matrix storage with $O(m+n)$ row/column factors for $m \times n$ weight matrices (Park et al., 12 Dec 2024, Luo et al., 2023); a schematic factorization appears after this list. Extreme tensoring generalizes adaptive preconditioning to storage that scales with the sum of mode dimensions of a $k$-mode tensor rather than their product, achieving order-of-magnitude reductions at the expense of weaker cross-mode adaptivity (Chen et al., 2019).
- Block-structured Selective Update and Pruning: Selecting small, dynamically chosen blocks of trainable parameters (BlockLLM), or aggressively pruning uninformative weights or tokens (LeMo, federated foresight pruning), drastically reduces the parameter and activation set involved in both forward and backward computation (Ramesh et al., 25 Jun 2024, Wang et al., 15 Jan 2025, Zhang et al., 8 May 2024). BlockLLM achieves 13–15% VRAM savings by freezing 90–95% of parameters, with minimal convergence degradation (Ramesh et al., 25 Jun 2024).
- Activation and Memory Management: Integer programming (OLLA) schedules operator sequencing, allocation, and deallocation to minimize peak resident sets and memory fragmentation, achieving an average 30–40% peak-memory reduction over online/malloc-style allocation without changing the model or recomputing activations (Steiner et al., 2022). LeMo exploits token-level contextual sparsity to substantially shrink activation memory (Wang et al., 15 Jan 2025). Chunking and segmentation approaches split large weight or activation blocks into manageable pieces, controlling peak requirements in the classification layer (Zhang et al., 13 Oct 2025).
- Low-precision Arithmetic, Mixed-precision, and FP32 Master-State Elimination: Pure BF16/FP8 training (ELMO) exploits Kahan summation and stochastic rounding, removing the need for FP32 master weights in encoder and classification blocks (Zhang et al., 13 Oct 2025). Fusing optimizer updates inside backprop also eliminates the per-step gradient buffer and can yield an additional 15–25% savings, even with standard mixed-precision protocols (Lewandowski et al., 2023).
- Federated and Edge-enabling Pruning with BP-free Optimization: In federated learning for AIoT, foresight pruning via neural tangent kernel (NTK) sensitivity, combined with zeroth-order (“backprop-free”) training using Stein’s identity, achieves true O(1) backprop memory, an 8–9× device memory reduction, and performance parity with dense FL models (Zhang et al., 8 May 2024); a generic zeroth-order estimator is sketched after this list.
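To make the fused in-place scheme concrete, here is a minimal PyTorch sketch in the spirit of LOMO's fused SGD update, not the authors' implementation: it relies on `register_post_accumulate_grad_hook` (PyTorch ≥ 2.1) to apply each parameter's update and free its gradient the moment it is produced, so a full-model gradient buffer never materializes.

```python
import torch

def attach_fused_sgd(model: torch.nn.Module, lr: float = 1e-3):
    """Update each parameter inside backward and immediately drop its grad.

    A minimal sketch of the fused in-place scheme: at no point does the
    full set of gradients coexist in memory, and plain SGD needs no
    persistent optimizer state.
    """
    @torch.no_grad()
    def hook(param: torch.Tensor):
        param.add_(param.grad, alpha=-lr)  # in-place SGD step
        param.grad = None                  # free this gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

# Usage: after attach_fused_sgd(model), a plain loss.backward() both
# computes gradients and applies the updates; there is no optimizer.step().
```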
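The low-rank optimizer-state idea can be sketched just as briefly. The fragment below keeps an Adafactor-style rank-1 (row/column) factorization of the second-moment accumulator for an $m \times n$ weight matrix, storing $m+n$ values instead of $mn$; it is a schematic of the factorization step only, not the full update rule of any cited optimizer.

```python
import torch

class Rank1SecondMoment:
    """Adafactor-style row/column factorization of the second moment.

    For an (m, n) gradient G, stores exponential moving averages of the
    row means and column means of G**2 (m + n floats in total) and applies
    the implied rank-1 preconditioner without ever storing an (m, n)
    second-moment matrix.
    """
    def __init__(self, m: int, n: int, beta2: float = 0.999, eps: float = 1e-30):
        self.row = torch.zeros(m)   # EMA of row means of G**2
        self.col = torch.zeros(n)   # EMA of column means of G**2
        self.beta2, self.eps = beta2, eps

    def update(self, grad: torch.Tensor) -> torch.Tensor:
        g2 = grad.pow(2) + self.eps
        self.row.mul_(self.beta2).add_(g2.mean(dim=1), alpha=1 - self.beta2)
        self.col.mul_(self.beta2).add_(g2.mean(dim=0), alpha=1 - self.beta2)
        # Equivalent to grad / sqrt(outer(row, col) / mean(row)), applied
        # factored so the only (m, n) tensors are grad-sized transients.
        scale = self.row.mean().clamp_min(self.eps).sqrt()
        precond = (grad / self.row.sqrt().unsqueeze(1)) / self.col.sqrt().unsqueeze(0)
        return precond * scale  # a real optimizer would then apply lr, etc.
```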
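Finally, the backprop-free direction in the last bullet can be illustrated with a generic two-point zeroth-order gradient estimator (a common SPSA-style construction shown schematically, not the cited work's exact Stein-identity estimator). Because it needs only forward evaluations, no activation graph is ever retained:

```python
import torch

@torch.no_grad()
def zo_grad_estimate(loss_fn, params: torch.Tensor,
                     mu: float = 1e-3, samples: int = 8) -> torch.Tensor:
    """Two-point zeroth-order gradient estimate for a flat parameter vector.

    Uses only forward evaluations of loss_fn, so no backward graph (and
    hence no activation storage) is needed -- O(1) backprop memory.
    """
    grad = torch.zeros_like(params)
    for _ in range(samples):
        u = torch.randn_like(params)                     # random probe direction
        delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
        grad += (delta / (2 * mu)) * u                   # directional estimate
    return grad / samples

# Usage sketch on a hypothetical quadratic objective:
# f = lambda w: (w ** 2).sum()
# w = torch.ones(10)
# g = zo_grad_estimate(f, w)   # approximates the true gradient 2*w in expectation
```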
3. Memory Complexity and Empirical Benchmarks
The quantitative impact of LOMO methods is consistently supported by large-scale empirical results:
| Approach | Primary Memory Reduction | Activation/Gradient Savings | Acc/Conv Impact | Reference |
|---|---|---|---|---|
| LeMo (token) | up to 1.93× (activations) | token-level activation sparsity | ≈2% PPL | (Wang et al., 15 Jan 2025) |
| AdaLomo | ≥75% optimizer state vs. AdamW | fused per-tensor gradient | ≈ AdamW | (Lv et al., 2023) |
| SMMF | up to ~60% vs. Adam/Adafactor/SM3 | rank-1 factors per tensor | ≤0.5% drop | (Park et al., 12 Dec 2024) |
| BlockLLM | 13–15% VRAM drop | proportional to block size | ≈ full-tune | (Ramesh et al., 25 Jun 2024) |
| ELMO (FP8) | 6× (FP8), 4× (BF16) encoder/head | chunked output layer | SOTA (P@1) | (Zhang et al., 13 Oct 2025) |
| OLLA | ~1/3 average peak RAM | activation allocation only | none | (Steiner et al., 2022) |
| MoFaSGD | ≥3× (low-rank, LoRA-compatible) | — | ≈ AdamW | (Mahdavinia et al., 10 Jul 2025) |
For GPT-3 175B, LeMo's activation memory at long context lengths is 1.93× lower than LoRA/LongLoRA (Wang et al., 15 Jan 2025). AdaLomo supports full LLaMA-65B training on 8×24 GB GPUs (19.2 GB/GPU, versus roughly 80 GB/GPU for standard AdamW) with no loss of stability or quality (Lv et al., 2023, Lv et al., 2023). SMMF achieves up to roughly 60% optimizer-state reduction on BERT/GPT-2 with stable convergence (Park et al., 12 Dec 2024). BlockLLM, updating only 5% of parameters, matches full fine-tuning performance on GLUE tasks (Ramesh et al., 25 Jun 2024). ELMO enables 3-million-label XMC models using only 6.6 GiB (FP8), whereas the previous SOTA needed 39.7 GiB (Zhang et al., 13 Oct 2025). OLLA's offline ILP-based memory planning runs all standard models in up to one-third less RAM with sub-second solver times (Steiner et al., 2022).
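The chunked output-layer processing behind the ELMO-style numbers can be sketched generically: compute logits and the loss one label block at a time, so peak logit memory scales with the chunk size rather than the full label count. The snippet below uses a plain sum-of-BCE extreme-classification loss; `chunk_size` and the loss choice are illustrative, not ELMO's exact recipe.

```python
import torch
import torch.nn.functional as F

def chunked_xmc_loss(hidden: torch.Tensor, weight: torch.Tensor,
                     targets: torch.Tensor, chunk_size: int = 65536):
    """Binary cross-entropy over millions of labels, one chunk at a time.

    hidden:  (batch, d) encoder outputs
    weight:  (n_labels, d) classification head
    targets: (batch, n_labels) multi-hot labels
    Peak logit memory is batch * chunk_size instead of batch * n_labels.
    """
    total = hidden.new_zeros(())
    for start in range(0, weight.size(0), chunk_size):
        w = weight[start:start + chunk_size]          # (chunk, d) slice of head
        logits = hidden @ w.t()                       # (batch, chunk) logits only
        total = total + F.binary_cross_entropy_with_logits(
            logits, targets[:, start:start + chunk_size].float(),
            reduction="sum",
        )
    return total / hidden.size(0)
```

During training, each chunk's contribution would be backpropagated immediately or wrapped in activation checkpointing so that chunk logits are freed before the next chunk is formed; at inference the savings are direct.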
4. Design Trade-offs, Limitations, and Extensions
LOMO introduces several explicit trade-offs:
- Adaptivity vs. Memory: Low-rank and factorized optimizers risk losing update adaptivity, especially with severe rank constraints; confidence-guided correction (CAME) and decorrelating projection updates (COAP) address this by penalizing directions with high sketch error, regaining Adam-level convergence (Xiao et al., 26 Nov 2024, Luo et al., 2023).
- Parameter vs. Activation Savings: Most optimizer-state LOMO approaches cannot reduce activation memory, which dominates in long-context or large-batch regimes; approaches targeting redundant tokens/weights or chunked processing (LeMo, ELMO, OLLA) are required to control peak activation footprint.
- Implementation Complexity: Advanced block selection, joint ILP solvers, custom CUDA/Triton kernels, and fused updates require nontrivial engineering. Some methods depend on hardware features (e.g. native FP8 support) (Zhang et al., 13 Oct 2025).
- Hyperparameter Sensitivity and Convergence: Naively aggressive pruning or block-sparsity can compromise rare-feature modeling or slow convergence; practical deployment requires monitoring and may demand domain-specific predictor retraining (LeMo) (Wang et al., 15 Jan 2025, Ramesh et al., 25 Jun 2024).
- Compatibility: Certain LOMO variants (e.g., in-place fused update) are not compatible with multi-step gradient accumulation or certain optimizer-closure architectures (Lewandowski et al., 2023), but can typically be integrated with activation or parameter quantization, checkpointing, or sharded distributed frameworks (Lv et al., 2023, Lv et al., 2023).
5. Applications and Broader Impact
LOMO has transformed the feasibility of training and adapting massive models for a range of scenarios:
- Full-parameter fine-tuning of LLMs (LLaMA-65B, 30B, etc.) on single-node consumer-grade GPUs (Lv et al., 2023, Lv et al., 2023), supporting both English and multilingual corpora.
- Extreme multi-label classification tasks with millions of outputs, previously intractable on commodity hardware (Zhang et al., 13 Oct 2025).
- Federated learning on edge and IoT devices, with O(1) per-step memory, substantially reduced communication, and robust performance in non-IID heterogeneous settings (Zhang et al., 8 May 2024).
- Memory-bounded and mobile/embedded computing, including OS-level kernel memory allocation partitions for prioritized application responsiveness (Lim et al., 2021).
Table: LOMO Variant Classes and Main Applications
| Class | Technical Strategy | Typical Applications |
|---|---|---|
| Fused In-place | Per-step in-place param updates | LLM full-tune, edge FL |
| Low-rank | Matrix/tensor momentum factorization | Transformer/CNN pretraining |
| Token/Weight Pruning | Redundant feature elimination | Long-context LLM, FL, XMC |
| Memory Scheduling | Execution/lifetime ILP optimization | Any deep net on fixed RAM |
| Low-precision | Pure BF16/FP8, round-elimination | XMC, distributed LLM |
6. Theoretical Guarantees
Several LOMO optimizers, including SMMF, MoFaSGD, and Extreme Tensoring, have established regret and convergence guarantees that closely match those of the corresponding full-memory methods, under standard convexity, smoothness, and bounded-gradient assumptions (Park et al., 12 Dec 2024, Mahdavinia et al., 10 Jul 2025, Chen et al., 2019):
- MoFaSGD achieves an optimal convergence rate for nonconvex stochastic optimization, matching full AdamW (Mahdavinia et al., 10 Jul 2025); the generic form of such a guarantee is written out after this list.
- Extreme tensoring interpolates between the regret guarantees of full AdaGrad and plain SGD (both of order $\sqrt{T}$, with AdaGrad carrying tighter data-dependent constants), with intermediate memory regimes along the way (Chen et al., 2019).
- SMMF’s regret bound is identical to AdamNC/AMSGrad up to bounded compression error per step (Park et al., 12 Dec 2024).
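For context, the generic stationarity guarantee that such rate-optimality claims refer to, stated here as the textbook bound for smooth nonconvex stochastic optimization rather than as a quotation from the cited papers, is

$$\min_{1 \le t \le T} \; \mathbb{E}\big[\|\nabla f(x_t)\|^2\big] \;\le\; \mathcal{O}\!\big(1/\sqrt{T}\big),$$

so a memory-reduced optimizer is rate-optimal when its compression leaves this $T$-dependence intact.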
7. Outlook and Future Research
Future progress in LOMO is directed toward:
- Unified multi-dimensional sparsity: Simultaneously optimizing token, parameter, and hidden-dimension sparsity (e.g., sparse activation schedules combined with block-update selection) (Wang et al., 15 Jan 2025, Ramesh et al., 25 Jun 2024).
- Learned block selection and meta-tuning: Dynamic adjustment of sparsity level, block partitioning, and mask thresholds as training progresses (Ramesh et al., 25 Jun 2024).
- Hardware-software co-design: Efficient integration of low-precision arithmetic, chunked execution, and kernel fusion with rapidly evolving accelerator architectures (Zhang et al., 13 Oct 2025).
- Robustness and calibration: Ensuring that predictor models and threshold choices generalize across domains and data distributions (Wang et al., 15 Jan 2025).
- Automated memory-planning and allocation for arbitrary computational graphs at compile time, accessible to non-expert users (Steiner et al., 2022).
LOMO represents a maturing, multi-paradigm toolkit with proven impact across optimization, model adaptation, and systems-level deployment for memory-bound deep learning (Wang et al., 15 Jan 2025, Lv et al., 2023, Lv et al., 2023, Xiao et al., 26 Nov 2024, Park et al., 12 Dec 2024, Zhang et al., 13 Oct 2025, Ramesh et al., 25 Jun 2024, Mahdavinia et al., 10 Jul 2025, Zhang et al., 8 May 2024, Steiner et al., 2022, Chen et al., 2019).