Quantized MeZO (QZO) Optimization
- Quantized MeZO (QZO) is a technique that fuses low-bit quantization with zeroth-order (gradient-free) optimization for efficient fine-tuning of large neural networks.
- It uses a two-stage stochastic quantization process to yield unbiased finite-difference gradient estimates even under ultra-low-bit formats like INT4 and INT8.
- Empirical studies indicate that QZO delivers significant memory reductions and competitive performance on language and vision tasks while eliminating the need for backpropagation.
Quantized MeZO (QZO) refers to a class of optimization techniques that combine quantized neural network representations with zeroth-order (gradient-free) optimization strategies for the purposes of memory- and compute-efficient fine-tuning. QZO arises in the context of training and adapting large neural models—including LLMs—using low-bit (4–8 bit) numerical formats and forward-only loss queries, thus avoiding both backpropagation and full-precision storage. This approach addresses the significant memory and computational barriers posed by large-scale architectures when deployed in resource-constrained environments (Zhou et al., 17 Feb 2025, Shang et al., 19 May 2025).
1. Fundamental Problem Formulation
Quantized neural networks map real-valued parameters to low-precision representations through quantization operators. For a fixed quantization codebook or scalar scale $s$, the quantized weight vector is

$$\hat{\theta} = s \odot c,$$

with discrete codes $c$ (e.g., INT4 or INT8 integers). Fine-tuning in this regime involves minimizing the task loss $\mathcal{L}(\hat{\theta})$, typically over a set of frozen codes $c$ and continuous scales $s$,

$$\min_{s}\; \mathcal{L}(s \odot c),$$

or, in the fully quantized scenario, keeping both parameters and perturbations in low-bit format and performing optimization steps without gradient information or backpropagation (Zhou et al., 17 Feb 2025, Shang et al., 19 May 2025).
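A minimal NumPy sketch of this parameterization, assuming simple symmetric per-tensor quantization with a single scalar scale; the helper names `quantize_codes` and `dequantize` are illustrative, not taken from either paper:

```python
import numpy as np

def quantize_codes(theta, scale, n_bits=4):
    """Map real-valued weights to signed integer codes (round-to-nearest)."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    codes = np.clip(np.round(theta / scale), -qmax - 1, qmax)
    return codes.astype(np.int8)

def dequantize(codes, scale):
    """Reconstruct the low-precision weights: theta_hat = scale * codes."""
    return scale * codes.astype(np.float32)

theta = np.random.randn(16).astype(np.float32)        # full-precision weights
scale = np.abs(theta).max() / 7.0                     # simple symmetric scale
codes = quantize_codes(theta, scale)                  # frozen discrete codes c
theta_hat = dequantize(codes, scale)                  # quantized weights s * c
```

Under this view, `codes` is the frozen object, while `scale` (and, in the fully quantized setting, the low-bit parameters themselves) is what fine-tuning actually adjusts.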
2. Quantized Zeroth-Order Gradient Estimation
Traditional zeroth-order (ZO) stochastic gradient descent estimates the gradient via finite differences along randomly sampled directions:

$$\hat{g} = \frac{\mathcal{L}(\theta + \epsilon u) - \mathcal{L}(\theta - \epsilon u)}{2\epsilon}\, u,$$

where $u \sim \mathcal{N}(0, I_d)$ and $\epsilon > 0$ is the perturbation scale. Directly quantizing the perturbations damages their distributional properties, introducing significant bias into the gradient estimates. To circumvent this, QZO frameworks employ independent stochastic quantization for each perturbation, $\tilde{u}^{+} = Q_{\mathrm{sr}}(u)$ and $\tilde{u}^{-} = Q_{\mathrm{sr}}(u)$, with stochastic rounding schemes ensuring $\mathbb{E}[\tilde{u}^{+} \mid u] = u$ and $\mathbb{E}[\tilde{u}^{-} \mid u] = u$. The unbiased quantized finite-difference estimator becomes

$$\hat{g}_{\text{Q-RGE2}} = \frac{\mathcal{L}(\theta + \epsilon \tilde{u}^{+}) - \mathcal{L}(\theta - \epsilon \tilde{u}^{-})}{2\epsilon}\; u.$$
This estimator, denoted Q-RGE2, retains unbiasedness even in ultra-low-bit formats (e.g., INT4), as opposed to naïve quantization (Q-RGE1), which exhibits rapidly increasing bias and instability (Zhou et al., 17 Feb 2025).
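A hedged NumPy sketch of the estimator on a toy objective, using an unbiased stochastic-rounding quantizer; the grid step, bit-width, and quadratic `loss` are illustrative assumptions rather than settings from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, step=0.25, n_bits=4):
    """Unbiased stochastic rounding onto a low-bit grid: E[Q(x)] = x (before clipping)."""
    qmax = (2 ** (n_bits - 1) - 1) * step
    scaled = x / step
    low = np.floor(scaled)
    p_up = scaled - low                                  # round up with prob. = fractional part
    rounded = low + (rng.random(x.shape) < p_up)
    return np.clip(rounded * step, -qmax, qmax)

def loss(theta):
    return 0.5 * np.sum(theta ** 2)                      # toy stand-in for L(theta)

theta = rng.normal(size=8)
eps = 1e-2
u = rng.normal(size=theta.shape)                         # full-precision direction
u_plus, u_minus = stochastic_round(u), stochastic_round(u)   # independent quantizations

# Q-RGE2: quantized perturbations in the two loss queries, original u as the direction.
delta = (loss(theta + eps * u_plus) - loss(theta - eps * u_minus)) / (2 * eps)
g_hat = delta * u
```

Averaged over many independent draws, `g_hat` tracks the (smoothed) gradient, whereas naively reusing a single deterministically rounded direction everywhere, as in the Q-RGE1 scheme mentioned above, accumulates the bias the text describes.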
3. Algorithmic Procedures and Implementation
The QZO update is performed entirely in low precision, with no backward pass required:
- For each minibatch, a random perturbation $u \sim \mathcal{N}(0, I_d)$ is sampled and stochastically quantized to $\tilde{u}^{+}$ and $\tilde{u}^{-}$.
- The loss is queried at $\theta + \epsilon \tilde{u}^{+}$ and $\theta - \epsilon \tilde{u}^{-}$, and the sensitivity $\delta = \frac{\mathcal{L}(\theta + \epsilon \tilde{u}^{+}) - \mathcal{L}(\theta - \epsilon \tilde{u}^{-})}{2\epsilon}$ is calculated.
- Model parameters are updated as

$$\theta \leftarrow Q_{\mathrm{sr}}\!\left(\theta - \eta\, \delta\, u\right),$$

where $Q_{\mathrm{sr}}(\cdot)$ denotes stochastic rounding and $\eta$ is the learning rate (Zhou et al., 17 Feb 2025); a sketch of one such update step follows this list.
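Combining the pieces, a compact sketch of one fully quantized QZO step under the same illustrative assumptions; it reuses `rng`, `stochastic_round`, and `loss` from the previous snippet and keeps the parameters on a fixed low-bit grid throughout:

```python
def qzo_step(theta_q, lr=1e-2, eps=1e-2, step=0.25):
    """One forward-only QZO update; theta_q is assumed to already lie on the low-bit grid."""
    u = rng.normal(size=theta_q.shape)
    u_plus, u_minus = stochastic_round(u, step), stochastic_round(u, step)

    # Two forward loss queries -- no backward pass is ever taken.
    delta = (loss(theta_q + eps * u_plus) - loss(theta_q - eps * u_minus)) / (2 * eps)

    # Apply the update, then stochastically round back onto the quantized grid.
    return stochastic_round(theta_q - lr * delta * u, step)

theta_q = stochastic_round(rng.normal(size=8))
for _ in range(100):
    theta_q = qzo_step(theta_q)
```

In a MeZO-style implementation the perturbation is typically regenerated from a stored random seed rather than materialized, which is what keeps the memory footprint at inference level.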
A variant, developed specifically for scenarios where only the quantization scales are optimized (with the codes $c$ fixed), estimates gradients w.r.t. the continuous scales $s$ via forward-only perturbations:
- Two loss queries, at $\mathcal{L}((s + \epsilon u) \odot c)$ and $\mathcal{L}((s - \epsilon u) \odot c)$, are performed for a random direction $u \sim \mathcal{N}(0, I_d)$.
- The directional derivative $\delta = \frac{\mathcal{L}((s + \epsilon u) \odot c) - \mathcal{L}((s - \epsilon u) \odot c)}{2\epsilon}$ is obtained and clipped to reduce variance, followed by the scale update $s \leftarrow s - \eta\, \mathrm{clip}(\delta)\, u$ (Shang et al., 19 May 2025); a sketch of this scale-only step follows this list.
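A hedged NumPy sketch of this scale-only variant, assuming per-row scales over frozen INT4 codes and a simple symmetric clipping threshold; the names, dimensions, and quadratic objective are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
codes = rng.integers(-8, 8, size=(4, 16)).astype(np.float32)   # frozen INT4 codes c
scales = np.full(4, 0.1, dtype=np.float32)                     # one trainable scale per row

def loss_from_scales(s):
    w = s[:, None] * codes                     # dequantized weights s * c
    return 0.5 * np.sum(w ** 2)                # toy stand-in for the task loss

def scale_zo_step(s, lr=1e-3, eps=1e-3, clip=1.0):
    u = rng.normal(size=s.shape)
    # Two forward queries around the current scales -- codes never change.
    delta = (loss_from_scales(s + eps * u) - loss_from_scales(s - eps * u)) / (2 * eps)
    delta = np.clip(delta, -clip, clip)        # directional-derivative clipping for stability
    return s - lr * delta * u

for _ in range(100):
    scales = scale_zo_step(scales)
```

Clipping the scalar `delta` rather than the direction bounds the step magnitude while leaving the perturbation direction intact, matching the variance-control role described in Section 4.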
4. Quantization Schemes and Theoretical Properties
- Supported datatypes: INT4, INT8, and mixed low-precision formats (e.g., FP8 and per-channel integer or floating-point schemes).
- Optimization properties: QZO with two-stage stochastic quantization yields provably unbiased gradient estimates in expectation, even in the presence of severe quantization (Zhou et al., 17 Feb 2025).
- Stability mechanisms: Directional derivative clipping constrains variance without introducing bias, ensuring convergence and preventing training collapse in practical deployments (Shang et al., 19 May 2025).
- Integration: QZO can be layered atop any post-training quantization pipeline, such as GPTQ, AWQ, AQLM, and QLoRA. The only requirement is a continuous quantization scale or codebook (Shang et al., 19 May 2025).
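As a purely illustrative sketch of this integration point, the snippet below assumes a hypothetical `QuantLinear` layer exposing frozen codes and a continuous per-channel `scale`; it is a stand-in for whatever a given PTQ backend (GPTQ, AWQ, AQLM, QLoRA) actually provides, not their real APIs:

```python
import numpy as np

rng = np.random.default_rng(2)

class QuantLinear:
    """Hypothetical PTQ layer: frozen integer codes plus a continuous per-channel scale."""
    def __init__(self, out_dim, in_dim):
        self.codes = rng.integers(-8, 8, size=(out_dim, in_dim)).astype(np.float32)
        self.scale = np.full(out_dim, 0.05, dtype=np.float32)   # the only trainable tensor

    def forward(self, x):
        return x @ (self.scale[:, None] * self.codes).T

layers = [QuantLinear(8, 16), QuantLinear(4, 8)]

def gather_scales(layers):
    """Flatten all trainable scales into one vector for the ZO optimizer."""
    return np.concatenate([l.scale for l in layers])

def set_scales(layers, flat):
    """Scatter an updated flat scale vector back into the layers."""
    i = 0
    for l in layers:
        l.scale = flat[i:i + l.scale.size]
        i += l.scale.size
```

With `gather_scales` and `set_scales`, the scale-only ZO step sketched above can be applied across a stack of quantized layers without touching the frozen codes or the backend's weight packing.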
5. Empirical Results and Efficiency Analysis
Extensive empirical studies demonstrate:
- Memory footprint reductions: substantial savings over quantized first-order baselines for both full-parameter tuning and LoRA-style tuning. Memory cost is driven down to $3.6$ GB for 4-bit 7B-parameter models, enabling single-GPU training and adaptation (Shang et al., 19 May 2025, Zhou et al., 17 Feb 2025).
- Performance parity: On LLM tuning tasks (GLUE, SQuAD, MultiRC), QZO achieves accuracy close to or surpassing full-precision MeZO and outperforms quantized first-order optimizers, especially under extreme (INT4/INT8) quantization (Zhou et al., 17 Feb 2025).
- Forward-only training: No backpropagation or full-precision accumulators are needed; each update requires only two forward loss queries, which can run on highly efficient quantized inference kernels (Zhou et al., 17 Feb 2025, Shang et al., 19 May 2025).
- Vision and diffusion tasks: QZO successfully fine-tunes Stable Diffusion 3.5 Large (quantized to 4 bits) using $12.4$ GB of VRAM, far below the $86$ GB required for fp16 backpropagation. Similar qualitative benefits are observed, though diffusion models display sensitivity to ZO noise (Shang et al., 19 May 2025).
6. Limitations and Extensions
- Quantization error: The quality of gradient estimation is constrained by quantization noise. Poor quantization of model weights amplifies stochastic estimation error.
- Performance gaps: There remains a gap to full-precision, backprop-based fine-tuning, especially for highly nonlinear objectives or ultra-low-bit settings.
- Specialized perturbation: For generative models (e.g., diffusion), ZO perturbations may interfere with scheduled denoising, impacting fidelity (Shang et al., 19 May 2025).
Proposed extensions include joint optimization of quantization scales and select full-precision parameters, adaptively scheduled perturbations, and advanced variance-reduced ZO estimators (antithetic sampling, multi-point queries) (Shang et al., 19 May 2025).
7. Historical and Nomenclatural Notes
Although “Quantized MeZO” (QZO or QuZO) was independently introduced in at least two contemporary works—“QuZO: Quantized Zeroth-Order Fine-Tuning for LLMs” (Zhou et al., 17 Feb 2025) and “Fine-tuning Quantized Neural Networks with Zeroth-order Optimization” (Shang et al., 19 May 2025)—both present the same core synthesis of quantized inference and zeroth-order optimization. The algorithms are orthogonal to specific quantization schemas and consistent with the MeZO (Memory-efficient Zeroth-order Optimization) principle—generalizing its applicability from full-precision to ultra-low-bit neural network fine-tuning.
A different but unrelated combinatorial quantity, termed “quantized Mező” numbers, appears in the context of $q$-analogs of Stirling and Bell identities, as in (Shattuck, 2014); this usage is distinct and not connected to neural optimization.