- The paper introduces AdaMeZO, which achieves Adam-style moment adaptation in a zeroth-order setting without storing full moment histories.
- It leverages truncated moment estimates and blockwise PRNG state caching to reconstruct gradients on demand, reducing memory cost and accelerating convergence.
- Empirical results show that AdaMeZO reduces the number of forward passes by up to 70.9% and improves accuracy over MeZO across diverse LLM architectures.
AdaMeZO: Efficient Adam-Style Zeroth-Order Optimization for LLM Fine-Tuning
Introduction and Motivation
Fine-tuning LLMs for specialized downstream tasks is memory-intensive, primarily because of backpropagation in first-order methods. Approaches such as PEFT reduce memory overhead by updating only a subset of the model, but they still rely on backpropagated gradients, so memory remains a bottleneck. Zeroth-order (ZO) optimization, which fine-tunes using forward passes only and no backpropagation, has recently been proposed to reduce memory requirements significantly and enable fine-tuning on resource-constrained hardware. MeZO computes stochastic gradient estimates solely from forward passes, allowing in-place parameter updates with minimal memory. However, its SGD-style updates converge slowly and adapt poorly to anisotropic loss landscapes.
AdaMeZO addresses this gap: it achieves Adam-style preconditioning in ZO optimization without explicitly storing first- and second-moment histories, preserving MeZO’s memory efficiency while accelerating convergence toward the behavior of adaptive methods such as Adam.
Technical Contributions and Methodology
AdaMeZO introduces an Adam-style ZO optimizer with two technical innovations:
- Truncated Moment Estimates: Instead of maintaining full first and second moment accumulators (which would triple memory consumption), AdaMeZO computes truncated versions over a finite horizon h, discarding outdated gradient contributions. This truncation leverages the rapid decay of exponential moving averages, enabling effective moment estimation with only the most recent gradients.
- Block-wise PRNG State Caching: By caching PRNG states at fine granularity, AdaMeZO can regenerate past stochastic gradient directions exactly whenever it re-evaluates or updates the moments, without keeping past gradients or moment vectors in memory. Block-wise partitioning (e.g., layer-wise) further reduces the temporary memory required for moment computation.
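The truncation argument in the first bullet can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the truncated moment is the standard exponential moving average restricted to the last h gradients, and the names full_ema and truncated_ema are hypothetical. Under that assumption, the gap between the full and truncated averages is at most beta^h times the largest gradient magnitude.

```python
import numpy as np

def full_ema(grads, beta):
    """Standard EMA over the whole history, starting from zero (oldest gradient first)."""
    m = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m

def truncated_ema(grads, beta, h):
    """Same recursion, but restarted from zero on the last h gradients only.

    Assumption for illustration: this is one natural reading of a
    "truncated moment over a finite horizon h"; the paper's exact
    definition may differ (e.g., in normalization).
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    return full_ema(grads[-h:], beta)

# Small demonstration with synthetic gradients.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(200)]
beta, h = 0.9, 50

gap = np.max(np.abs(full_ema(grads, beta) - truncated_ema(grads, beta, h)))
bound = beta ** h * max(np.max(np.abs(g)) for g in grads)
print(f"max gap = {gap:.2e}, bound beta**h * max|g| = {bound:.2e}")
```

The bound holds because the discarded terms equal beta^h times an EMA of the older gradients, and an EMA started from zero never exceeds the largest input in magnitude.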
The optimizer uses SPSA-based ZO gradient estimates similar to MeZO, but updates parameters with moment-aware, Adam-style scaling. Both the first and second moments are reconstructed on demand with blockwise operations, enabled by precise, low-overhead state restoration of the PRNG.
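A minimal sketch of how on-demand moment reconstruction could look, assuming a MeZO-style setup in which each step stores only a seed and one scalar (the projected gradient). Everything here is a hypothetical illustration rather than the paper's algorithm: the function names, the per-(seed, block) generator that stands in for cached PRNG state, the normalization by 1 - beta^k, and all hyperparameter values are assumptions.

```python
import numpy as np

def block_noise(seed, block_idx, shape):
    # Deterministic stream per (step seed, block): a stand-in for restoring
    # a cached PRNG state at a block boundary (assumption, not the paper's code).
    return np.random.default_rng([seed, block_idx]).standard_normal(shape)

def perturb(blocks, seed, scale):
    # Returns perturbed copies for clarity; a memory-lean version would
    # perturb in place and undo the perturbation afterwards.
    return [b + scale * block_noise(seed, i, b.shape) for i, b in enumerate(blocks)]

def spsa_scalar(loss_fn, blocks, seed, eps=1e-3):
    # Two forward passes give the projected gradient along the direction z.
    loss_plus = loss_fn(perturb(blocks, seed, +eps))
    loss_minus = loss_fn(perturb(blocks, seed, -eps))
    return (loss_plus - loss_minus) / (2.0 * eps)

def block_moments(history, block_idx, shape, beta1=0.9, beta2=0.999):
    # Rebuild truncated moments for ONE block from (seed, scalar) pairs,
    # oldest first, regenerating each past gradient estimate c * z on the fly.
    m = np.zeros(shape)
    v = np.zeros(shape)
    for seed, c in history:
        g = c * block_noise(seed, block_idx, shape)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    k = len(history)
    # Normalize by the total EMA weight of the k retained terms (assumed choice).
    return m / (1.0 - beta1 ** k), v / (1.0 - beta2 ** k)

def zo_adam_step(loss_fn, blocks, history, seed, h=8, lr=1e-2, eps=1e-3, delta=1e-8):
    # Record this step's (seed, scalar) and keep only the last h entries.
    history.append((seed, spsa_scalar(loss_fn, blocks, seed, eps)))
    del history[:-h]
    new_blocks = []
    for i, b in enumerate(blocks):
        # Only one block's moments are materialized at a time.
        m_hat, v_hat = block_moments(history, i, b.shape)
        new_blocks.append(b - lr * m_hat / (np.sqrt(v_hat) + delta))
    return new_blocks
```

In this sketch the persistent optimizer state is just the list of at most h (seed, scalar) pairs, and the temporary moment buffers are the size of one block. The cost is the extra random-number generation needed to replay the horizon at every step.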
In theory, AdaMeZO achieves a convergence rate and stationary-point guarantee similar to preconditioned MeZO (see Theorem 4.7), under standard smoothness and bounded-variance assumptions, without the prohibitive memory cost of adaptive moment methods.
Empirical Results
AdaMeZO is validated empirically on a suite of LLMs spanning masked language models (e.g., RoBERTa-large) and decoder-only architectures (e.g., OPT-1.3B/13B, LLaMA-3B/7B), evaluated on a range of standard NLP tasks under few-shot settings.
Numerical Findings:
- On RoBERTa-large (SST-2, SST-5, SNLI, MNLI, RTE, TREC), AdaMeZO achieves up to 1.2% average absolute accuracy improvement over MeZO.
- On OPT-1.3B, AdaMeZO requires up to 70.9% fewer forward passes to reach MeZO’s final loss values and delivers superior accuracy on nearly all tasks.
- On larger architectures (OPT-13B, LLaMA-3B/7B), AdaMeZO maintains or improves performance relative to memory-efficient ZO baselines and matches or exceeds the strong adaptive ZO optimizers HiZOO and Helene, usually with a lower or similar memory footprint.
Memory Usage: Compared to MeZO (1x), HiZOO (1.5x), and Adam (4.4x), AdaMeZO requires only ~1.07x the memory for ZO fine-tuning. This minimal increase is attributed to ephemeral, blockwise PRNG state storage required for moment reconstruction.
Convergence/Speed: In toy landscape experiments, AdaMeZO adapts its trajectory much as Adam does and approaches the same minima, though with ZO noise and longer paths caused by horizon warm-up. Wall-clock time per step is higher than for MeZO, primarily because of PRNG and aggregation overhead, but total training time benefits from the significantly smaller number of steps.
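The trade-off between per-step overhead and step count can be made concrete with a short calculation. The sketch below uses the best-case 70.9% reduction in forward passes reported above and assumes both optimizers use the same number of forward passes per step; the per-step overhead factor of 1.5 is a hypothetical value chosen for illustration, not a measured one.

```python
def break_even_overhead(reduction):
    """Largest per-step slowdown factor at which total time still matches the baseline."""
    return 1.0 / (1.0 - reduction)

def net_speedup(reduction, step_overhead):
    """Baseline total time divided by the new total time."""
    return 1.0 / ((1.0 - reduction) * step_overhead)

best_case = 0.709  # "up to 70.9% fewer forward passes" (best case reported above)
print(round(break_even_overhead(best_case), 2))  # ≈ 3.44
print(round(net_speedup(best_case, 1.5), 2))     # hypothetical 1.5x per-step cost
```

Under these assumptions, AdaMeZO would remain faster in total wall-clock time as long as each step costs less than roughly 3.4 times a MeZO step. Because 70.9% is an upper bound, the margin is smaller on tasks where the reduction is smaller.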
Theoretical and Practical Implications
The theoretical advance of AdaMeZO is the realization that adaptive ZO optimization can be achieved without the multiplied memory footprint of stored moment accumulators, which was previously considered inevitable for moment-based methods. By reconstructing truncated, blockwise moments using deterministic random streams, AdaMeZO matches the speed and stability benefits of Adam-like optimizers in a pure ZO context. This makes truly low-memory LLM fine-tuning feasible even for models with billions of parameters, which broadens accessibility to resource-constrained settings and edge devices.
Practically, AdaMeZO enables:
- Resource-efficient specialization of foundation LLMs in settings where GPU RAM is a hard constraint.
- Better adaptation to complex, ill-conditioned loss landscapes during LLM fine-tuning without backpropagation.
- Application to both masked and autoregressive transformer architectures across classification, ranking, and generation.
Limitations and Future Directions
AdaMeZO inherits the fundamental noise and bias limitations of ZO estimation, particularly for high-dimensional, stiff landscapes where ZO estimators are intrinsically inefficient. Truncated moments, while empirically effective, introduce potential estimation bias; the theoretical gap between truncated and full-moment adaptation remains to be precisely characterized. Second-moment estimation accuracy is limited by ZO noise and finite horizons. There is also additional computational cost for reconstructing blockwise historical gradients, which could be mitigated by further algorithmic and systems-level optimization.
Future research could improve the accuracy of ZO second-moment estimators, investigate structured or learning-based block partitioning strategies, and explore generalization to related adaptive optimizers (e.g., Adafactor, Lion) within the ZO paradigm. Further scaling studies on models well beyond 10B parameters would help validate AdaMeZO’s practical applicability at frontier scales.
Conclusion
AdaMeZO delivers Adam-style moment adaptation in zeroth-order optimization for LLM fine-tuning, avoiding the substantial memory overhead of classic Adam while significantly accelerating convergence compared to vanilla MeZO. The approach is theoretically justified and empirically validated at scale, offering a compelling methodology for memory-efficient, adaptive fine-tuning of large models without backward passes. Its adoption opens new directions for democratized LLM deployment and on-device customization in both research and applications.
Reference: "AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments" (2605.00650).