
MeZO: Memory-Efficient Zeroth-Order Optimization

Updated 21 November 2025
  • MeZO is a memory-efficient zeroth-order optimization method that estimates gradients via forward-only, subspace-restricted perturbations, bypassing backpropagation.
  • It reduces gradient variance by confining updates to carefully selected subspaces and employs variants like block coordinate descent, sparse masking, and quantization to optimize memory use.
  • Empirical findings show that techniques such as QZO enable fine-tuning of large models (e.g., LLaMA-2-13B) with up to 25× memory reduction compared to traditional methods.

Memory-Efficient Zeroth-Order Optimization (MeZO) encompasses a suite of algorithms and theoretical results enabling large-scale neural network fine-tuning (particularly for LLMs) while keeping GPU or on-device memory consumption at, or near, inference-level cost. By avoiding backpropagation and instead estimating gradients via forward-only perturbations, MeZO and its descendants allow fine-tuning of models orders of magnitude larger than is practical with backprop-based optimizers, at the cost of slower convergence, which is mitigated through structured perturbations, subspace modeling, masking, and quantization.

1. Core Principles: Subspace Perturbation and Unified Framework

Memory-Efficient Zeroth-Order Optimization is built on stochastic gradient estimation by finite-difference queries in selected parameter subspaces. The classical SPSA-style ZO-SGD updates the parameter vector θ by approximating the gradient using random perturbations:

$g_t = \frac{L(\theta_t + \mu u_t) - L(\theta_t - \mu u_t)}{2\mu}\, u_t, \qquad u_t \sim \mathcal{N}(0, I_d)$
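
As a concrete illustration, here is a minimal NumPy sketch of this two-point estimator (an illustrative sketch, not any paper's reference implementation; the quadratic sanity check and all names are ours):

```python
import numpy as np

def spsa_grad(loss, theta, mu=1e-3, rng=None):
    """Two-point SPSA gradient estimate: two forward passes, no backprop."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)                  # u_t ~ N(0, I_d)
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    return g * u                                          # scalar projection times u_t

# Sanity check on L(theta) = 0.5 * ||theta||^2, whose true gradient is theta:
# averaging many single-probe estimates recovers the gradient in expectation.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
loss = lambda t: 0.5 * float(t @ t)
est = np.mean([spsa_grad(loss, theta, rng=rng) for _ in range(20000)], axis=0)
```

Note that each single-probe estimate is unbiased but has variance growing with the dimension d, which is precisely what the subspace restrictions below address.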

MeZO generalizes this by restricting perturbations and updates to subspaces. Let $S_t \subseteq \mathbb{R}^d$ be a $k$-dimensional subspace with projection matrix $M_t \in \mathbb{R}^{d \times d}$ (stable rank $s \approx k \ll d$); sample subspace-restricted perturbations $u_{S_t} = M_t u_t$:

$g_{S_t}(\theta_t) = \frac{L(\theta_t + \mu u_{S_t}) - L(\theta_t - \mu u_{S_t})}{2\mu}\, u_{S_t}$

This reduces the variance of the gradient estimate from $O(d)$ to $O(s)$, where $s$ is the subspace rank (Park et al., 31 Jan 2025). Crucially, the effectiveness of subspace perturbation is governed not just by the subspace’s dimension, but by its alignment with the dominant curvature directions of the loss landscape. The mean “effective overlap” (ρ̄) with the Hessian’s principal directions controls both convergence and generalization bounds.
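
The variance reduction can be seen directly with a diagonal (binary-mask) projection $M_t$. The sketch below is hypothetical (the toy loss, mask choice, and names are ours, not from the cited papers): when the loss varies only along a few coordinates, restricting the perturbation to those coordinates shrinks the total estimator variance.

```python
import numpy as np

def masked_spsa_grad(loss, theta, mask, mu=1e-3, rng=None):
    """Two-point estimate with the perturbation confined to masked coordinates,
    i.e. u_S = M u with M a diagonal 0/1 projection."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape) * mask
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    return g * u

# Toy comparison: d = 50 parameters, but the loss depends on 5 of them only.
rng = np.random.default_rng(0)
d, k = 50, 5
theta = rng.standard_normal(d)
mask = np.zeros(d); mask[:k] = 1.0
loss = lambda t: 0.5 * float(t[:k] @ t[:k])

full = np.array([masked_spsa_grad(loss, theta, np.ones(d), rng=rng) for _ in range(2000)])
sub  = np.array([masked_spsa_grad(loss, theta, mask,       rng=rng) for _ in range(2000)])
var_full = full.var(axis=0).sum()   # total variance, full-dimensional perturbation
var_sub  = sub.var(axis=0).sum()    # total variance, subspace-restricted perturbation
```

In this construction the full-dimensional estimator injects noise into all 50 coordinates even though only 5 matter, which is the $O(d)$-vs-$O(s)$ gap in miniature.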

2. Algorithmic Variants and Memory-Scaling Strategies

The MeZO methodology has spawned multiple algorithmic variants targeting different efficiency regimes, including:

  • Block Coordinate Descent (MeZO-BCD): Parameters θ are partitioned into N disjoint blocks, updating one block at each step with a subspace-restricted perturbation. This approach achieves wall-clock speedups by reducing per-step memory and compute to O(block size), while retaining convergence (Park et al., 31 Jan 2025).
  • Sparse and Low-Rank MeZO: Perturbations are restricted to small subsets via binary masks (sparsity) or low-rank factorization. Memory cost becomes O(k), where k is the number of perturbed coordinates (mask cardinality) or the rank. Empirically, the selection of perturbed coordinates (e.g., small-magnitude or sensitive parameters) impacts both accuracy and convergence (Liu et al., 24 Feb 2024, Guo et al., 5 Jun 2024).
  • Quantized MeZO (QZO): Perturbs only the continuous scaling factors in quantized models, keeping integer weights fixed and leveraging directional derivative clipping for stability. This enables full-model fine-tuning in as little as 5.8 GB GPU memory, with up to 18× reduction over baselines (Shang et al., 19 May 2025).
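
The block-coordinate idea can be sketched as follows, assuming a NumPy parameter vector and the deterministic reseeding trick so the perturbation is regenerated rather than stored (a hypothetical sketch, not the reference implementation; all names and hyperparameters are illustrative):

```python
import numpy as np

def mezo_bcd_step(loss, theta, blocks, t, mu=1e-3, lr=5e-2, seed=0):
    """One MeZO-BCD step: perturb and update a single parameter block.

    The perturbation is regenerated from (seed, t) for both probes and the
    update, so only a scalar seed needs storing -- the MeZO memory trick."""
    idx = blocks[t % len(blocks)]                 # cycle through blocks
    rng = np.random.default_rng(seed + t)
    u = np.zeros_like(theta)
    u[idx] = rng.standard_normal(len(idx))        # perturb this block only
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    theta[idx] -= lr * g * u[idx]                 # in-place block update
    return theta

# Toy run on L(theta) = 0.5 * ||theta||^2 with two blocks of size 3.
theta = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.2])
blocks = [np.arange(0, 3), np.arange(3, 6)]
loss = lambda t: 0.5 * float(t @ t)
start = loss(theta)
for t in range(500):
    theta = mezo_bcd_step(loss, theta, blocks, t)
end = loss(theta)
```

Per step, only one block's perturbation exists in memory, which is the O(block size) scaling claimed above.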

The following table summarizes leading MeZO variants and their memory cost determinants:

Variant      | Perturbation structure                   | Memory scaling
-------------|------------------------------------------|------------------------------
Full MeZO    | All parameters                           | O(d)
Sparse-Mask  | k nonzero coordinates                    | O(k)
Low-Rank     | d × s via tensor factors                 | O(d·s)
Block-BCD    | One block of size b per step             | O(b)
QZO          | Scaling factors only (quantized weights) | O(#scales) + int-weight storage

3. Theoretical Convergence and Generalization Analysis

The unified theory in (Park et al., 31 Jan 2025) establishes convergence rates for MeZO under broad classes of subspace perturbations. For L-smooth, nonconvex objectives, and subspace perturbations with stable rank s and alignment ρ̄ with local Hessian:

$\mathbb{E}[\|\nabla L(\theta_t)\|^2] \le O\!\left(\frac{1}{\bar{\rho}\,T}\left(r^2 + \frac{s^2}{d} + 1\right) + \frac{\Delta}{\alpha T} + \frac{\sigma^2}{B}\right)$

where r is the intrinsic dimension of the Hessian, T the iteration count, α the step size, and Δ, σ², B the initial suboptimality, gradient-noise variance, and minibatch size, respectively. Dimension-free convergence (i.e., no d dependence) is achieved when s ≈ r ≪ d and subspace alignment is maximized. Generalization bounds via uniform stability analysis scale with the subspace rank, not the full parameter count.
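
For intuition, in the well-aligned regime stated above (s ≈ r ≪ d, with ρ̄ bounded below), the s²/d term is dominated by r² and the bound collapses to a dimension-free rate:

$\mathbb{E}[\|\nabla L(\theta_t)\|^2] \le O\!\left(\frac{r^2}{\bar{\rho}\,T} + \frac{\Delta}{\alpha T} + \frac{\sigma^2}{B}\right)$

so the iteration complexity depends on the Hessian's intrinsic dimension r rather than the full parameter count d.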

4. Practical Implementations and Empirical Results

Extensive empirical validation on large LLMs (e.g., OPT-13B, LLaMA-2-13B) and diverse language tasks (SuperGLUE, SQuAD, COPA) demonstrates:

  • Speed and Memory: MeZO-BCD achieves up to 2.77× wall-clock speedup while matching the final performance of prior MeZO/low-rank MeZO methods, at sublinear memory and compute (Park et al., 31 Jan 2025).
  • Quantized ZO: QZO fine-tuning with int4 weights achieves strong performance (e.g., 80.5% SST-2 accuracy with LLaMA-2-13B in 5.8 GB) and outperforms zero-shot baselines on all reported tasks (Shang et al., 19 May 2025).
  • Sparse and Masked MeZO: Sparse-Mask and Masked-MeZO (MaZO) methods reduce the effective update dimension by 80–99%, often increasing accuracy and reducing required steps by 3–4× (Liu et al., 24 Feb 2024, Zhang et al., 17 Feb 2025).
  • Broader Impact: Empirical studies report robust results across PEFT scenarios, multi-task learning, continual learning, and on-device fine-tuning regimes under stringent memory constraints (Zhang et al., 17 Feb 2025, Yu et al., 23 Oct 2025, Katti et al., 14 Nov 2025).

5. Design Trade-Offs and Practical Considerations

The following trade-offs arise in MeZO-style ZO optimization:

  • Subspace size (k, s, b): Smaller subspaces reduce per-step compute and memory, but if too small, degrade “effective overlap” and slow convergence.
  • Mask selection: Coherent, importance-based or sensitive masks outperform random masking for convergence and generalization.
  • Block-wise and tensorized adapters: Block coordinate updating and tensor-train adapters (AdaZeta) offer further dimensionality reduction and enable large-batch, hardware-friendly implementations (Yang et al., 26 Jun 2024).
  • Variance and stability: Methods such as variance reduction (MeZO-SVRG), directional derivative clipping (QZO), annealed momentum, and adaptive scheduling (AdaZeta) are deployed to reduce gradient noise and prevent divergence, especially at scale.
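
As an illustration of directional-derivative clipping, here is a hedged NumPy sketch (the threshold, toy loss, and function names are ours; QZO applies this idea specifically to the continuous scaling factors of quantized weights):

```python
import numpy as np

def clipped_zo_grad(loss, theta, mu=1e-3, clip=1.0, rng=None):
    """Two-point ZO estimate with the scalar directional derivative clipped
    before scaling the perturbation -- this bounds the update norm for stability."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    d = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    d = float(np.clip(d, -clip, clip))   # clip the scalar, not the vector
    return d * u

# With a steep loss the raw directional derivative can be large; after
# clipping, the estimate is u scaled by a factor of magnitude at most `clip`.
u_probe = np.random.default_rng(1).standard_normal(5)   # same draw as inside the call
steep = lambda t: 100.0 * float(t.sum())
g = clipped_zo_grad(steep, np.zeros(5), rng=np.random.default_rng(1))
scale = g / u_probe                      # elementwise ratio recovers the clipped scalar
```

Because the estimate is always a scalar multiple of the perturbation, clipping that single scalar suffices to bound the whole update, which is cheaper than per-coordinate gradient clipping.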

6. Extensions and Future Directions

Major axes for future development and deployment include:

7. Comparison with Backpropagation and Limitations

Compared to first-order (FO, backprop-based) methods, MeZO reduces memory by 8–25×, permitting fine-tuning at scale on commodity hardware (e.g., 7–13B models on 8–12GB GPUs). However, classic ZO estimators have O(d) variance and converge much more slowly, a gap compensated by advanced subspace, block, and adaptive strategies. Empirical and theoretical results consistently show that careful design (masking, alignment, quantization) is vital to prevent performance and convergence degradation when scaling to the largest models (Park et al., 31 Jan 2025, Shang et al., 19 May 2025, Zhang et al., 18 Feb 2024, Malladi et al., 2023).

Limitations include:

  • Increased wall-clock time per task unless block, sparse, or parallel-perturbation techniques are employed.
  • Sensitivity to subspace/mask design and hyperparameters (block size, perturbation scale).
  • Need for acceleration techniques (momentum, second-order preconditioning) to match the speed of FO optimizers in critical applications (Zhao et al., 16 Nov 2024, Behric et al., 4 Nov 2025).

A plausible implication is that as ZO methods become more sophisticated—integrating the full spectrum of subspace perturbation, data-centric rewriting, adaptivity, and quantization—the boundary between memory-constrained and memory-unconstrained fine-tuning will continue to erode, enabling scalable adaptation for foundation models even under severe resource constraints (Park et al., 31 Jan 2025, Shang et al., 19 May 2025).
