
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

Published 1 May 2026 in cs.LG and cs.AI | (2605.00650v1)

Abstract: Fine-tuning LLMs is necessary for various dedicated downstream tasks, but classic backpropagation-based fine-tuning methods require substantial GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements at the cost of slower convergence due to its indifference to loss landscapes. Standard solutions, such as Adam, explore loss landscapes by estimating the first- and second-order moments and storing them in memory to guide the model's movement through dimensions with lower curvature and vice versa. However, directly applying Adam negates MeZO's advantage as it will triple the memory requirement. In light of this, we propose AdaMeZO, a zeroth-order optimizer that leverages Adam-style first- and second-moment estimates without maintaining them in memory. We present a theoretical analysis of AdaMeZO, corroborated by extensive experiments demonstrating AdaMeZO's performance, showing that AdaMeZO can outperform MeZO while requiring up to $70\%$ fewer forward passes. Trajectory visualizations affirm AdaMeZO's ability to adapt to diverse loss landscapes.

Authors (3)

Summary

  • The paper introduces AdaMeZO, which achieves Adam-style moment adaptation in a zeroth-order setting without storing full moment histories.
  • It leverages truncated moment estimates and blockwise PRNG state caching to reconstruct gradients on demand, reducing memory cost and accelerating convergence.
  • Empirical results show that AdaMeZO lowers forward passes by up to 70.9% and improves accuracy compared to MeZO across diverse LLM architectures.

AdaMeZO: Efficient Adam-Style Zeroth-Order Optimization for LLM Fine-Tuning

Introduction and Motivation

Fine-tuning LLMs for specialized downstream tasks is memory-intensive, primarily because first-order methods rely on backpropagation. Parameter-efficient fine-tuning (PEFT) reduces memory overhead by updating only a subset of the parameters, but it still relies on gradients, so memory remains a bottleneck. Zeroth-order (ZO) optimization, which fine-tunes using forward passes only, has recently been proposed to cut memory requirements and enable fine-tuning on resource-constrained hardware. MeZO estimates stochastic gradients purely from forward passes, which allows in-place parameter updates with minimal memory. However, its SGD-style updates converge slowly and adapt poorly to anisotropic loss landscapes.
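As a rough illustration of the forward-only update described above, the following is a minimal sketch of a MeZO-style SPSA step on a toy problem. The names `mezo_step` and `quadratic`, the hyperparameters, and the two-parameter loss are all assumptions made for illustration; the toy loss stands in for a model's forward pass.

```python
import random

def mezo_step(theta, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One MeZO-style SPSA step on a list of floats, updated in place."""
    def perturb(scale):
        # Re-seeding replays the same Gaussian direction z without storing it.
        rng = random.Random(seed)
        for i in range(len(theta)):
            theta[i] += scale * rng.gauss(0.0, 1.0)

    perturb(+eps)                      # theta + eps * z
    loss_plus = loss_fn(theta)
    perturb(-2 * eps)                  # theta - eps * z
    loss_minus = loss_fn(theta)
    perturb(+eps)                      # restore theta
    g = (loss_plus - loss_minus) / (2 * eps)   # scalar projected gradient

    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] -= lr * g * rng.gauss(0.0, 1.0)   # SGD-style update along z
    return g

def quadratic(theta):
    # Toy anisotropic loss (assumption), standing in for a forward pass.
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

theta = [1.0, 1.0]
start = quadratic(theta)
for step in range(500):
    mezo_step(theta, quadratic, lr=0.01, seed=step)
end = quadratic(theta)
print(start, end)
```

Every coordinate shares the same scalar `g` and the same learning rate, which is one way to see why a plain SGD-style ZO update has no mechanism for treating steep and flat directions differently.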

AdaMeZO addresses this by achieving Adam-style preconditioning in ZO optimization without explicitly storing first- and second-moment histories. It thereby preserves MeZO’s memory efficiency while accelerating convergence toward the adaptive behavior of Adam.

Technical Contributions and Methodology

AdaMeZO introduces an Adam-style ZO optimizer with two technical innovations:

  1. Truncated Moment Estimates: Instead of maintaining full first- and second-moment accumulators (which would triple memory consumption), AdaMeZO computes truncated versions over a finite horizon $h$, discarding outdated gradient contributions. This truncation leverages the rapid decay of exponential moving averages, enabling effective moment estimation with only the most recent gradients.
  2. Block-wise PRNG State Caching: By caching PRNG states at fine granularity, AdaMeZO can exactly regenerate the stochastic perturbation directions whenever it re-evaluates or updates the moments, without keeping past gradients or moment vectors in memory. Block-wise partitioning (e.g., layer-wise) further minimizes the temporary memory required for moment computation.
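The block-wise caching idea in item 2 can be sketched with Python's standard `random` module. This is only an illustration of the general technique under assumed names (`block_sizes`, `sample_blocks`, `regenerate_block`) and an assumed block layout; the paper's implementation would presumably cache the generator state of its GPU framework instead.

```python
import random

# Hypothetical block layout: block name -> number of parameters.
block_sizes = {"embed": 4, "layer0": 6, "head": 3}

def sample_blocks(seed, sizes):
    """Draw one Gaussian direction per block, caching the PRNG state at each block start."""
    rng = random.Random(seed)
    states, directions = {}, {}
    for name, n in sizes.items():
        states[name] = rng.getstate()          # snapshot taken before this block
        directions[name] = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return states, directions

def regenerate_block(state, n):
    """Rebuild a single block's direction from its cached state alone."""
    rng = random.Random()
    rng.setstate(state)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

states, directions = sample_blocks(seed=123, sizes=block_sizes)
# Only `states` would need to be kept; `directions` is retained here just for checking.
rebuilt = regenerate_block(states["layer0"], block_sizes["layer0"])
assert rebuilt == directions["layer0"]
```

Because each block has its own snapshot, a block in the middle of the model can be regenerated without replaying the random stream for every block before it, so only one block-sized buffer needs to exist at a time.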

The optimizer uses SPSA-based ZO gradient estimates similar to MeZO, but updates parameters with moment-aware, Adam-style scaling. Both the first and second moments are reconstructed on demand with blockwise operations, enabled by precise, low-overhead state restoration of the PRNG.
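To make the combination concrete, here is a minimal sketch of an Adam-style ZO step that stores only a short history of (seed, scalar) pairs and rebuilds truncated moments block by block. It is not the paper's algorithm: the update form (no bias correction), the per-block seeding scheme used as a stand-in for cached PRNG states, the names `adaptive_zo_step`, `block_z`, and `BLOCKS`, and all hyperparameters are assumptions for illustration.

```python
import math
import random
from collections import deque

BLOCKS = [(0, 2), (2, 4)]   # hypothetical (start, end) index ranges, e.g. one per layer

def block_z(step_seed, b):
    """Regenerate block b's Gaussian direction for a given step from its seed."""
    start, end = BLOCKS[b]
    rng = random.Random(step_seed * len(BLOCKS) + b)   # stand-in for a cached PRNG state
    return [rng.gauss(0.0, 1.0) for _ in range(end - start)]

def perturb(theta, step_seed, scale):
    for b, (start, _) in enumerate(BLOCKS):
        for j, z in enumerate(block_z(step_seed, b)):
            theta[start + j] += scale * z

def adaptive_zo_step(theta, loss_fn, history, step_seed,
                     lr=0.02, eps=1e-3, beta1=0.9, beta2=0.99, delta=1e-8):
    # SPSA: two forward passes give a scalar projected gradient.
    perturb(theta, step_seed, +eps)
    loss_plus = loss_fn(theta)
    perturb(theta, step_seed, -2 * eps)
    loss_minus = loss_fn(theta)
    perturb(theta, step_seed, +eps)
    history.appendleft((step_seed, (loss_plus - loss_minus) / (2 * eps)))

    # Rebuild truncated moments one block at a time; only block-sized buffers exist.
    for b, (start, end) in enumerate(BLOCKS):
        m = [0.0] * (end - start)
        v = [0.0] * (end - start)
        for age, (seed, g) in enumerate(history):       # age 0 = newest entry
            w1 = (1 - beta1) * beta1 ** age
            w2 = (1 - beta2) * beta2 ** age
            for j, z in enumerate(block_z(seed, b)):
                m[j] += w1 * g * z
                v[j] += w2 * (g * z) ** 2
        for j in range(end - start):
            theta[start + j] -= lr * m[j] / (math.sqrt(v[j]) + delta)

def loss(theta):
    # Toy ill-conditioned quadratic (assumption) standing in for a forward pass.
    return 0.5 * sum(c * x * x for c, x in zip((1.0, 10.0, 100.0, 1000.0), theta))

theta = [1.0, 1.0, 1.0, 1.0]
history = deque(maxlen=8)        # horizon h = 8: older (seed, g) pairs drop out
initial = loss(theta)
for step in range(300):
    adaptive_zo_step(theta, loss, history, step_seed=step)
final = loss(theta)
print(initial, final)
```

In this sketch the persistent optimizer state is just `h` seeds and `h` scalars, independent of the number of parameters. The price is the extra regeneration work inside the inner loops, which corresponds to the per-step overhead the summary attributes to PRNG and aggregation.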

In theory, AdaMeZO achieves a convergence rate and stationary-point guarantee similar to preconditioned MeZO (see Theorem 4.7), under standard smoothness and bounded-variance assumptions, without the prohibitive memory cost of adaptive moment methods.

Empirical Results

AdaMeZO is validated empirically on LLMs spanning encoder-only masked language models (e.g., RoBERTa-large) and decoder-only architectures (e.g., OPT-1.3B/13B, LLaMA-3B/7B), evaluated on a range of standard NLP tasks under few-shot settings.

Numerical Findings:

  • On RoBERTa-large (SST-2, SST-5, SNLI, MNLI, RTE, TREC), AdaMeZO achieves up to 1.2% average absolute accuracy improvement over MeZO.
  • On OPT-1.3B, AdaMeZO requires up to 70.9% fewer forward passes to reach MeZO’s final loss values and delivers superior accuracy on nearly all tasks.
  • On larger architectures (OPT-13B, LLaMA-3B/7B), AdaMeZO maintains or improves performance relative to memory-efficient ZO baselines and matches or exceeds the adaptive ZO optimizers HiZOO and Helene, usually with a lower or similar memory footprint.

Memory Usage: Compared to MeZO (1x), HiZOO (1.5x), and Adam (4.4x), AdaMeZO requires only ~1.07x the memory for ZO fine-tuning. This minimal increase is attributed to ephemeral, blockwise PRNG state storage required for moment reconstruction.

Convergence/Speed: AdaMeZO matches Adam’s trajectory adaptability and approaches its minima in toy landscape experiments, but with ZO noise and longer paths due to horizon warm-up. Wall-clock time per step increases compared to MeZO, primarily due to PRNG and aggregation overhead, but the total training time benefits from significantly fewer steps.

Theoretical and Practical Implications

The central insight of AdaMeZO is that adaptive ZO optimization does not require the memory overhead of stored moment vectors (roughly triple the memory for Adam), which was previously treated as unavoidable for moment-based methods. By reconstructing truncated, blockwise moments from deterministic random streams, AdaMeZO aims to recover the speed and stability benefits of Adam-like optimizers in a pure ZO context. This makes low-memory LLM fine-tuning feasible even for models with billions of parameters, broadening access in resource-constrained settings and on edge devices.

Practically, AdaMeZO enables:

  • Resource-efficient specialization of foundation LLMs in settings where GPU RAM is a hard constraint.
  • Better adaptation to complex, ill-conditioned loss landscapes during LLM fine-tuning without backpropagation.
  • Application to both masked and autoregressive transformer architectures across classification, ranking, and generation.

Limitations and Future Directions

AdaMeZO inherits the fundamental noise and bias limitations of ZO estimation, particularly for high-dimensional, stiff landscapes where ZO estimators are intrinsically inefficient. Truncated moments, while empirically effective, introduce potential estimation bias; the theoretical gap between truncated and full-moment adaptation remains to be precisely characterized. Second-moment estimation accuracy is limited by ZO noise and finite horizons. There is also additional computational cost for reconstructing blockwise historical gradients, which could be mitigated by further algorithmic and systems-level optimization.
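The truncation bias mentioned above can be made concrete under the standard exponential-moving-average definition. This is an assumed form (zero initialization, no bias correction, no renormalization), and the paper's exact estimator may differ. Writing $g_s$ for the gradient estimate at step $s$, $\beta$ for the decay rate, and $h$ for the horizon:

```latex
% Assumed standard EMA with m_0 = 0; \tilde m_t is the version truncated to the last h steps.
m_t = (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\, g_{t-k},
\qquad
\tilde m_t = (1-\beta)\sum_{k=0}^{h-1}\beta^{k}\, g_{t-k}
\\[4pt]
m_t - \tilde m_t = (1-\beta)\sum_{k=h}^{t-1}\beta^{k}\, g_{t-k} = \beta^{h}\, m_{t-h}
\qquad (t \ge h)
```

The discarded part is therefore the full moment from $h$ steps earlier, scaled by $\beta^{h}$. For illustrative values (not taken from the paper), $\beta = 0.9$ and $h = 20$ give $\beta^{h} \approx 0.12$, while $\beta = 0.99$ with the same horizon gives $\beta^{h} \approx 0.82$, so truncation removes far more weight from slowly decaying averages such as a typical second-moment estimate.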

Future research could improve the accuracy of ZO second-moment estimators, investigate structured or learning-based block partitioning strategies, and explore generalization to related adaptive optimizers (e.g., Adafactor, Lion) within the ZO paradigm. Further scaling studies on models with $\gg 10$B parameters would help validate AdaMeZO’s practical applicability at frontier scale.

Conclusion

AdaMeZO delivers Adam-style moment adaptation in zeroth-order optimization for LLM fine-tuning, avoiding the substantial memory overhead of classic Adam while significantly accelerating convergence compared to vanilla MeZO. The approach is theoretically justified and empirically validated at scale, offering a compelling methodology for memory-efficient, adaptive fine-tuning of large models without backward passes. Its adoption opens new directions for democratized LLM deployment and on-device customization in both research and applications.

Reference: "AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments" (2605.00650).
