
MeZO: Memory-Efficient Zeroth-Order Optimization

Updated 21 November 2025
  • MeZO is a memory-efficient zeroth-order optimization method that estimates gradients via forward-only, subspace-restricted perturbations, bypassing backpropagation.
  • It reduces gradient variance by confining updates to carefully selected subspaces and employs variants like block coordinate descent, sparse masking, and quantization to optimize memory use.
  • Empirical findings show that techniques such as QZO enable fine-tuning of large models (e.g., LLaMA-2-13B) with up to 25× memory reduction compared to traditional methods.

Memory-Efficient Zeroth-Order Optimization (MeZO) encompasses a suite of algorithms and theoretical results enabling large-scale neural network fine-tuning (particularly for LLMs) while keeping GPU or on-device memory consumption at, or near, inference-level cost. By avoiding backpropagation and instead estimating gradients via forward-only perturbations, MeZO and its descendants allow fine-tuning of models orders of magnitude larger than is practical with backprop-based optimizers, at the cost of slower convergence, which is mitigated through structured perturbations, subspace modeling, masking, and quantization.

1. Core Principles: Subspace Perturbation and Unified Framework

Memory-Efficient Zeroth-Order Optimization is built on stochastic gradient estimation by finite-difference queries in selected parameter subspaces. The classical SPSA-style ZO-SGD updates the parameter vector θ by approximating the gradient using random perturbations:

$g_t = \frac{L(\theta_t + \mu u_t) - L(\theta_t - \mu u_t)}{2\mu}\, u_t, \qquad u_t \sim \mathcal{N}(0, I_d)$
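
As a concrete illustration, here is a minimal NumPy sketch of this two-point estimator (an illustrative sketch, not any paper's reference implementation; the quadratic sanity check and all names are ours):

```python
import numpy as np

def spsa_grad(loss, theta, mu=1e-3, rng=None):
    """Two-point SPSA gradient estimate: two forward passes, no backprop."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)                  # u_t ~ N(0, I_d)
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    return g * u                                          # scalar projection times u_t

# Sanity check on L(theta) = 0.5 * ||theta||^2, whose true gradient is theta:
# averaging many single-probe estimates recovers the gradient in expectation.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
loss = lambda t: 0.5 * float(t @ t)
est = np.mean([spsa_grad(loss, theta, rng=rng) for _ in range(20000)], axis=0)
```

Note that each single-probe estimate is unbiased but has variance growing with the dimension d, which is precisely what the subspace restrictions below address.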

MeZO generalizes this by restricting perturbations and updates to subspaces. Let $S_t \subseteq \mathbb{R}^d$ be a $k$-dimensional subspace with projection matrix $M_t \in \mathbb{R}^{d \times d}$ (stable rank $s \approx k \ll d$); sample subspace-restricted perturbations $u_{S_t} = M_t u_t$:

$g_{S_t}(\theta_t) = \frac{L(\theta_t + \mu u_{S_t}) - L(\theta_t - \mu u_{S_t})}{2\mu}\, u_{S_t}$

This reduces the variance of the gradient estimate from $O(d)$ to $O(s)$, where $s$ is the subspace rank (Park et al., 31 Jan 2025). Crucially, the effectiveness of subspace perturbation is governed not just by the subspace’s dimension, but by its alignment with the dominant curvature directions of the loss landscape. The mean “effective overlap” (ρ̄) with the Hessian’s principal directions controls both convergence and generalization bounds.
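
The variance reduction can be seen directly with a diagonal (binary-mask) projection $M_t$. The sketch below is hypothetical (the toy loss, mask choice, and names are ours, not from the cited papers): when the loss varies only along a few coordinates, restricting the perturbation to those coordinates shrinks the total estimator variance.

```python
import numpy as np

def masked_spsa_grad(loss, theta, mask, mu=1e-3, rng=None):
    """Two-point estimate with the perturbation confined to masked coordinates,
    i.e. u_S = M u with M a diagonal 0/1 projection."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape) * mask
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    return g * u

# Toy comparison: d = 50 parameters, but the loss depends on 5 of them only.
rng = np.random.default_rng(0)
d, k = 50, 5
theta = rng.standard_normal(d)
mask = np.zeros(d); mask[:k] = 1.0
loss = lambda t: 0.5 * float(t[:k] @ t[:k])

full = np.array([masked_spsa_grad(loss, theta, np.ones(d), rng=rng) for _ in range(2000)])
sub  = np.array([masked_spsa_grad(loss, theta, mask,       rng=rng) for _ in range(2000)])
var_full = full.var(axis=0).sum()   # total variance, full-dimensional perturbation
var_sub  = sub.var(axis=0).sum()    # total variance, subspace-restricted perturbation
```

In this construction the full-dimensional estimator injects noise into all 50 coordinates even though only 5 matter, which is the $O(d)$-vs-$O(s)$ gap in miniature.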

2. Algorithmic Variants and Memory-Scaling Strategies

The MeZO methodology has spawned multiple algorithmic variants targeting different efficiency regimes, including:

  • Block Coordinate Descent (MeZO-BCD): Parameters θ are partitioned into N disjoint blocks, updating one block at each step with a subspace-restricted perturbation. This approach achieves wall-clock speedups by reducing per-step memory and compute to O(block size), while retaining convergence (Park et al., 31 Jan 2025).
  • Sparse and Low-Rank MeZO: Perturbations are restricted to small subsets via binary masks (sparsity) or low-rank factorization. Memory cost becomes O(k), where k is the number of perturbed coordinates (mask cardinality) or the rank. Empirically, the selection of perturbed coordinates (e.g., small-magnitude or sensitive parameters) impacts both accuracy and convergence (Liu et al., 24 Feb 2024, Guo et al., 5 Jun 2024).
  • Quantized MeZO (QZO): Perturbs only the continuous scaling factors in quantized models, keeping integer weights fixed and leveraging directional derivative clipping for stability. This enables full-model fine-tuning in as little as 5.8 GB GPU memory, with up to 18× reduction over baselines (Shang et al., 19 May 2025).
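
The block-coordinate idea can be sketched as follows, assuming a NumPy parameter vector and the deterministic reseeding trick so the perturbation is regenerated rather than stored (a hypothetical sketch, not the reference implementation; all names and hyperparameters are illustrative):

```python
import numpy as np

def mezo_bcd_step(loss, theta, blocks, t, mu=1e-3, lr=5e-2, seed=0):
    """One MeZO-BCD step: perturb and update a single parameter block.

    The perturbation is regenerated from (seed, t) for both probes and the
    update, so only a scalar seed needs storing -- the MeZO memory trick."""
    idx = blocks[t % len(blocks)]                 # cycle through blocks
    rng = np.random.default_rng(seed + t)
    u = np.zeros_like(theta)
    u[idx] = rng.standard_normal(len(idx))        # perturb this block only
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    theta[idx] -= lr * g * u[idx]                 # in-place block update
    return theta

# Toy run on L(theta) = 0.5 * ||theta||^2 with two blocks of size 3.
theta = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.2])
blocks = [np.arange(0, 3), np.arange(3, 6)]
loss = lambda t: 0.5 * float(t @ t)
start = loss(theta)
for t in range(500):
    theta = mezo_bcd_step(loss, theta, blocks, t)
end = loss(theta)
```

Per step, only one block's perturbation exists in memory, which is the O(block size) scaling claimed above.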

The following table summarizes leading MeZO variants and their memory cost determinants:

Variant      | Perturbation structure                   | Memory scaling
-------------|------------------------------------------|------------------------------
Full MeZO    | All parameters                           | O(d)
Sparse-Mask  | k nonzero coordinates                    | O(k)
Low-Rank     | d × s via tensor factors                 | O(d·s)
Block-BCD    | One block of size b per step             | O(b)
QZO          | Scaling factors only (quantized weights) | O(#scales) + int-weight storage

3. Theoretical Convergence and Generalization Analysis

The unified theory in (Park et al., 31 Jan 2025) establishes convergence rates for MeZO under broad classes of subspace perturbations. For L-smooth, nonconvex objectives, and subspace perturbations with stable rank s and alignment ρ̄ with local Hessian:

$\mathbb{E}[\|\nabla L(\theta_t)\|^2] \le O\!\left(\frac{1}{\bar{\rho}\,T}\left(r^2 + \frac{s^2}{d} + 1\right) + \frac{\Delta}{\alpha T} + \frac{\sigma^2}{B}\right)$

where r is the intrinsic dimension of the Hessian, T the iteration count, α the step size, and Δ, σ², B the initial suboptimality, gradient-noise variance, and minibatch size, respectively. Dimension-free convergence (i.e., no d dependence) is achieved when s ≈ r ≪ d and subspace alignment is maximized. Generalization bounds via uniform stability analysis scale with the subspace rank, not the full parameter count.
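
For intuition, in the well-aligned regime stated above (s ≈ r ≪ d, with ρ̄ bounded below), the s²/d term is dominated by r² and the bound collapses to a dimension-free rate:

$\mathbb{E}[\|\nabla L(\theta_t)\|^2] \le O\!\left(\frac{r^2}{\bar{\rho}\,T} + \frac{\Delta}{\alpha T} + \frac{\sigma^2}{B}\right)$

so the iteration complexity depends on the Hessian's intrinsic dimension r rather than the full parameter count d.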

4. Practical Implementations and Empirical Results

Extensive empirical validation on large LLMs (e.g., OPT-13B, LLaMA-2-13B) and diverse language tasks (SuperGLUE, SQuAD, COPA) demonstrates:

  • Speed and Memory: MeZO-BCD achieves up to 2.77× wall-clock speedup while matching the final performance of prior MeZO/low-rank MeZO methods, at sublinear memory and compute (Park et al., 31 Jan 2025).
  • Quantized ZO: QZO fine-tuning with int4 weights achieves strong performance (e.g., 80.5% SST-2 accuracy with LLaMA-2-13B in 5.8 GB) and outperforms zero-shot baselines on all reported tasks (Shang et al., 19 May 2025).
  • Sparse and Masked MeZO: Sparse-Mask and Masked-MeZO (MaZO) methods reduce the effective update dimension by 80–99%, often increasing accuracy and reducing required steps by 3–4× (Liu et al., 24 Feb 2024, Zhang et al., 17 Feb 2025).
  • Broader Impact: Empirical studies report robust results across PEFT scenarios, multi-task learning, continual learning, and on-device fine-tuning regimes under stringent memory constraints (Zhang et al., 17 Feb 2025, Yu et al., 23 Oct 2025, Katti et al., 14 Nov 2025).

5. Design Trade-Offs and Practical Considerations

The following trade-offs arise in MeZO-style ZO optimization:

  • Subspace size (k, s, b): Smaller subspaces reduce per-step compute and memory, but if too small, degrade “effective overlap” and slow convergence.
  • Mask selection: Coherent, importance-based or sensitive masks outperform random masking for convergence and generalization.
  • Block-wise and tensorized adapters: Block coordinate updating and tensor-train adapters (AdaZeta) offer further dimensionality reduction and enable large-batch, hardware-friendly implementations (Yang et al., 26 Jun 2024).
  • Variance and stability: Methods such as variance reduction (MeZO-SVRG), directional derivative clipping (QZO), annealed momentum, and adaptive scheduling (AdaZeta) are deployed to reduce gradient noise and prevent divergence, especially at scale.
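
As an illustration of directional-derivative clipping, here is a hedged NumPy sketch (the threshold, toy loss, and function names are ours; QZO applies this idea specifically to the continuous scaling factors of quantized weights):

```python
import numpy as np

def clipped_zo_grad(loss, theta, mu=1e-3, clip=1.0, rng=None):
    """Two-point ZO estimate with the scalar directional derivative clipped
    before scaling the perturbation -- this bounds the update norm for stability."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    d = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu)
    d = float(np.clip(d, -clip, clip))   # clip the scalar, not the vector
    return d * u

# With a steep loss the raw directional derivative can be large; after
# clipping, the estimate is u scaled by a factor of magnitude at most `clip`.
u_probe = np.random.default_rng(1).standard_normal(5)   # same draw as inside the call
steep = lambda t: 100.0 * float(t.sum())
g = clipped_zo_grad(steep, np.zeros(5), rng=np.random.default_rng(1))
scale = g / u_probe                      # elementwise ratio recovers the clipped scalar
```

Because the estimate is always a scalar multiple of the perturbation, clipping that single scalar suffices to bound the whole update, which is cheaper than per-coordinate gradient clipping.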

6. Extensions and Future Directions

Major axes for future development and deployment include:

7. Comparison with Backpropagation and Limitations

Compared to first-order (FO, backprop-based) methods, MeZO reduces memory by 8–25×, permitting fine-tuning at scale on commodity hardware (e.g., 7–13B models on 8–12GB GPUs). However, classic ZO estimators have O(d) variance and converge much more slowly, a gap compensated by advanced subspace, block, and adaptive strategies. Empirical and theoretical results consistently show that careful design (masking, alignment, quantization) is vital to prevent performance and convergence degradation when scaling to the largest models (Park et al., 31 Jan 2025, Shang et al., 19 May 2025, Zhang et al., 18 Feb 2024, Malladi et al., 2023).

Limitations include:

  • Increased wall-clock time per task unless block, sparse, or parallel-perturbation techniques are employed.
  • Sensitivity to subspace/mask design and hyperparameters (block size, perturbation scale).
  • Need for acceleration techniques (momentum, second-order preconditioning) to match the speed of FO optimizers in critical applications (Zhao et al., 16 Nov 2024, Behric et al., 4 Nov 2025).

A plausible implication is that as ZO methods become more sophisticated—integrating the full spectrum of subspace perturbation, data-centric rewriting, adaptivity, and quantization—the boundary between memory-constrained and memory-unconstrained fine-tuning will continue to erode, enabling scalable adaptation for foundation models even under severe resource constraints (Park et al., 31 Jan 2025, Shang et al., 19 May 2025).
