
Differentially Private Zeroth-Order Optimization

Updated 10 July 2025
  • Differentially private zeroth-order optimization is a framework that enables gradient-free learning by relying solely on function evaluations with strict privacy guarantees.
  • It employs techniques like randomized smoothing, two-point finite-difference estimators, and calibrated noise aggregation to mask individual data contributions.
  • Practical implementations have proven effective in high-dimensional settings, such as large language model fine-tuning, federated learning, and distributed optimization.

Differentially private zeroth-order optimization encompasses a rapidly evolving family of algorithmic techniques that enable data-driven learning and optimization in settings where only function values (not explicit gradients) are available, all while rigorously satisfying differential privacy guarantees. These algorithms are increasingly important in domains such as LLM fine-tuning, federated learning, and distributed optimization, particularly in high-dimensional and resource-constrained environments. The field covers theoretical insights, algorithmic advances, practical implementations, and important controversies regarding the sufficiency of inherent randomness for privacy.

1. Foundational Principles and Definitions

Differential privacy (DP) is a rigorous mathematical framework ensuring that the outputs of an algorithm do not reveal sensitive information about any individual data point. Formally, an algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if for all neighboring datasets $D, D'$ and for all measurable sets $S$,

$$\Pr[\mathcal{A}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

Zeroth-order (ZO) optimization refers to methods that do not require gradient information and instead operate via black-box queries to the objective function. In the context of DP, these methods must mask the dependency of function outputs on individual data points, typically by adding carefully calibrated noise to intermediate computations or outputs so that any single change in the data does not substantially affect the probability distribution over observed outputs.
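As a concrete illustration of calibrated noise addition, the classical Gaussian mechanism releases a bounded-sensitivity scalar privately. The sketch below uses the standard sufficient condition for the noise scale under $(\varepsilon, \delta)$-DP; the function name and the example query are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release a scalar privately by adding Gaussian noise whose scale is
    calibrated to the query's sensitivity (classical (eps, delta)-DP)."""
    rng = np.random.default_rng() if rng is None else rng
    # Standard sufficient condition: sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

# Example: privately release the mean of n records, each bounded in [0, 1].
data = np.linspace(0.0, 1.0, 100)
sensitivity = 1.0 / len(data)   # changing one record moves the mean by at most 1/n
private_mean = gaussian_mechanism(data.mean(), sensitivity, eps=1.0, delta=1e-5)
```

Because the sensitivity here shrinks as $1/n$, the same $(\varepsilon, \delta)$ budget permits proportionally less noise on larger datasets.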

Several nuanced privacy definitions appear in the literature:

  • $(\varepsilon, \delta)$-DP (classical, as above)
  • Rényi Differential Privacy (RDP): parameterized by an order $\alpha > 1$, providing tighter composition and subsampling analysis, used especially for iterative algorithms (Zhang et al., 27 Jun 2024; Chien et al., 30 May 2025).

A key challenge in ZO settings is that, compared to first-order methods, one must carefully handle the directional or anisotropic nature of sensitivity and noise calibration because updates rely on random directions rather than full data-dependent gradients.

2. Algorithmic Methodologies

Gradient-Free and Smoothing Schemes

The most prevalent approach to differentially private ZO optimization combines randomized smoothing with two-point finite-difference estimators. For a non-smooth or nonconvex Lipschitz function $f:\mathbb{R}^d \rightarrow \mathbb{R}$, a smoothed surrogate $\hat{f}_\delta(x)$ is defined via

$$\hat{f}_\delta(x) = \mathbb{E}_{u \sim \mathcal{U}(S^{d-1})}[f(x + \delta u)],$$

where $u$ is sampled uniformly from the unit sphere and $\delta > 0$ controls the smoothing. The gradient of $\hat{f}_\delta$ is estimated using a two-point difference:

$$\widetilde{\nabla} f(x) = \frac{d}{2\delta}\,(f(x+\delta u) - f(x-\delta u))\,u.$$

Variance reduction is performed either by averaging across multiple independent directions or by leveraging specific aggregation schemes such as the "tree mechanism" to compose private estimates while controlling cumulative privacy loss (Zhang et al., 27 Jun 2024).
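The two-point estimator can be sketched directly; the function name, defaults, and the quadratic sanity check are illustrative, and this building block is not private on its own:

```python
import numpy as np

def two_point_grad(f, x, delta=1e-3, num_dirs=1, rng=None):
    """Estimate the gradient of the smoothed surrogate f_delta at x via the
    two-point scheme (d / (2*delta)) * (f(x+delta*u) - f(x-delta*u)) * u,
    averaged over num_dirs directions u drawn uniformly on the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    est = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)      # uniform direction on S^{d-1}
        est += (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
    return est / num_dirs

# On a quadratic the estimator is unbiased for the true gradient 2x;
# averaging over many directions drives down its variance.
f = lambda z: float(z @ z)
x = np.array([1.0, -2.0, 0.5])
g = two_point_grad(f, x, num_dirs=5000, rng=np.random.default_rng(0))
```

Only two function evaluations per direction are needed, which is what makes the scheme attractive when gradients are unavailable or per-example gradient storage is prohibitive.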

Noise Calibration and Aggregation

Noise is typically added either to the scalar finite-difference evaluations themselves or to the aggregated (multi-sample averaged) gradient estimates.

A critical property is sensitivity calibration: the sensitivity of the (possibly multi-sample averaged) gradient estimator is quantified (often $O(dL/T)$, where $L$ is a Lipschitz constant and $T$ is the number of iterations), and noise is scaled accordingly. Batch and mini-batch structures are employed to further dilute the privacy cost (Zhang et al., 27 Feb 2025; Zhang et al., 27 Jun 2024).
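One way these pieces fit together is a single update in which the scalar difference is clipped to bound its sensitivity before Gaussian noise is added. This is a schematic sketch, not any specific published algorithm; the clip level, noise scale, and learning rate are illustrative assumptions, and a real guarantee would require accounting the noise scale across all $T$ iterations:

```python
import numpy as np

def private_zo_step(f, x, smooth, lr, clip, sigma, rng):
    """One illustrative DP-ZO update: the scalar finite difference is
    clipped (bounding its sensitivity), Gaussian noise scaled to that
    bound is added, and the data influences the update only through the
    resulting noised, bounded scalar."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    diff = (f(x + smooth * u) - f(x - smooth * u)) / (2.0 * smooth)
    diff = float(np.clip(diff, -clip, clip))      # sensitivity now bounded by the clip level
    noisy = diff + rng.normal(0.0, sigma * clip)  # noise calibrated to the clipped sensitivity
    return x - lr * d * noisy * u

# A few noisy steps on a quadratic (all hyperparameters illustrative).
rng = np.random.default_rng(1)
f = lambda z: float(z @ z)
x = np.ones(4)
for _ in range(50):
    x = private_zo_step(f, x, smooth=1e-3, lr=0.02, clip=5.0, sigma=0.1, rng=rng)
```

Note that only a scalar, not a $d$-dimensional vector, needs to be noised and communicated per query, which is the property exploited in the distributed and VFL settings discussed below.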

3. Theoretical Guarantees and Complexity

A central result of recent work is that differentially private zeroth-order algorithms can, under appropriate conditions, achieve sample complexity and convergence rates matching those of their nonprivate analogs (Zhang et al., 27 Jun 2024). Specifically, to reach a $(\delta, \epsilon)$-stationary point of the smoothed function, the algorithm requires

$$M = \widetilde{\Omega}\left(\frac{d}{\delta\epsilon^3} + \frac{d^{3/2}}{\rho\,\delta\epsilon^2}\right)$$

samples, where $\rho$ is a privacy parameter (e.g., in RDP) and $\widetilde{\Omega}$ hides logarithmic factors.

"Privacy for free": when $\rho \geq \sqrt{d}\,\epsilon$, the additional complexity due to privacy is dominated by the nonprivate term, indicating that strong privacy can sometimes be enforced with a negligible sample-complexity penalty (Zhang et al., 27 Jun 2024). This result extends the optimal dimension-dependent rates of standard nonprivate (stochastic) zeroth-order optimization to the private setting, especially for the high-dimensional models encountered in practice.
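The threshold is easy to verify numerically: at $\rho = \sqrt{d}\,\epsilon$ the two terms of the bound coincide, and for larger $\rho$ the nonprivate term dominates. A quick check, ignoring the hidden logarithmic factors:

```python
import math

def complexity_terms(d, delta, eps, rho):
    """The two terms of M = Omega~( d/(delta*eps^3) + d^{3/2}/(rho*delta*eps^2) )."""
    nonprivate = d / (delta * eps**3)
    private = d**1.5 / (rho * delta * eps**2)
    return nonprivate, private

d, delta, eps = 10_000, 0.1, 0.5
# At the threshold rho = sqrt(d) * eps, the two terms are equal:
t1, t2 = complexity_terms(d, delta, eps, rho=math.sqrt(d) * eps)
assert math.isclose(t1, t2)
```

Below the threshold (small $\rho$, i.e. stronger privacy) the private term grows as $\sqrt{d}\,\epsilon/\rho$ times the nonprivate one, so the sample penalty is no longer negligible.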

4. Extensions to Nonconvex, Nonsmooth, and High-Dimensional Settings

Unlike classic DP optimization, which often assumes convexity or smoothness (and is thus limited to first-order methods), the methodologies here apply uniform or randomized smoothing to create surrogate functions amenable to ZO gradient estimation, even for arbitrary nonconvex or nonsmooth problems (Zhang et al., 27 Jun 2024). Stationarity is defined in the "Goldstein-stationary" sense: a point $x$ is $(\delta, \epsilon)$-stationary if the norm of the averaged (local) gradients in a $\delta$-neighborhood of $x$ is less than $\epsilon$.
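A Monte-Carlo check of this notion can be sketched as follows. Strictly, Goldstein stationarity minimizes over convex combinations of nearby gradients; the plain average below is one such combination, so its norm upper-bounds the Goldstein gap (function name and parameters are illustrative):

```python
import numpy as np

def goldstein_gap(grad, x, delta=0.1, num_samples=2000, rng=None):
    """Monte-Carlo estimate of || E[grad f(y)] || for y uniform in the
    delta-ball around x. This averaged gradient is one element of the
    convex hull that Goldstein stationarity minimizes over, so the value
    returned upper-bounds the Goldstein gap at x."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    acc = np.zeros(d)
    for _ in range(num_samples):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        r = delta * rng.random() ** (1.0 / d)   # radius for a uniform draw in the ball
        acc += grad(x + r * v)
    return float(np.linalg.norm(acc / num_samples))

# |x| is nonsmooth at 0, yet 0 is Goldstein-stationary: the gradients
# +1 and -1 on either side average out over a symmetric neighborhood.
gap = goldstein_gap(lambda z: np.sign(z), np.zeros(3), delta=0.1,
                    rng=np.random.default_rng(0))
```

Here `gap` is close to zero even though the pointwise gradient norm of $|x|$ is 1 everywhere away from the kink, which is why this relaxed notion is the natural target for nonsmooth ZO analysis.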

The combination of

  • multiple function evaluations along independent directions,
  • variance reduction, and
  • tree-based aggregation

allows these algorithms to lift first-order privacy-utility tradeoffs and to operate with minimal structural assumptions on the underlying objective.
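The tree-based aggregation can be sketched as follows: each value enters $O(\log T)$ dyadic-interval nodes, noise is added once per node, and any running sum is assembled from at most $O(\log T)$ noisy nodes, so the error grows polylogarithmically in $T$ rather than linearly. This is a generic schematic (restricted to $T$ a power of two for brevity); the noise scale would have to be calibrated to the per-node sensitivity for an actual DP guarantee:

```python
import numpy as np

def tree_prefix_sums(values, sigma, rng=None):
    """Binary-tree ("tree mechanism") aggregation sketch: noise each
    dyadic partial sum once, then build every prefix sum from at most
    O(log T) noisy nodes."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(values)
    assert T & (T - 1) == 0, "this sketch assumes T is a power of two"
    nodes = {}                       # (level k, index i) -> noisy sum of values[i*2^k:(i+1)*2^k]
    level = [float(v) for v in values]
    k = 0
    while level:
        for i, s in enumerate(level):
            nodes[(k, i)] = s + rng.normal(0.0, sigma)
        level = [level[2 * i] + level[2 * i + 1] for i in range(len(level) // 2)]
        k += 1

    def prefix(t):
        """Noisy sum of values[:t], assembled greedily from disjoint dyadic nodes."""
        total, pos = 0.0, 0
        for k in range(T.bit_length(), -1, -1):
            size = 1 << k
            if t - pos >= size:
                total += nodes[(k, pos >> k)]
                pos += size
        return total

    return prefix

# With sigma = 0 the mechanism reproduces exact prefix sums.
p = tree_prefix_sums([1.0, 2.0, 3.0, 4.0], sigma=0.0)
```

The same structure is what lets iterative private ZO methods release a running aggregate of gradient estimates while each individual estimate is noised only a logarithmic number of times.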

5. Practical Implementations and Impact

DP ZO optimization is increasingly deployed for:

  • Fine-tuning LLMs under strict memory constraints (Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024, Chien et al., 30 May 2025), because per-example gradient computation and storage required by DP-SGD become impractical when model size reaches billions of parameters. ZO approaches require only forward passes (function evaluations), substantially reducing compute and memory requirements.
  • Vertical Federated Learning (VFL), avoiding gradient exposure and its associated privacy risks by replacing backward communication with a single forward pass and a DP scalar difference at the server (Zhang et al., 27 Feb 2025).
  • Distributed optimization settings, where no gradients are available or communication bottlenecks preclude vector-valued noise addition (Gratton et al., 2020).

Empirical results demonstrate that with careful hyperparameter scheduling, DP ZO methods can attain accuracy close to or sometimes exceeding typical DP-SGD baselines under strong privacy constraints, while being far more scalable (Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024, Zhang et al., 27 Feb 2025).

Summary Table: Representative Sample Complexities

Setting | Complexity bound | "Privacy for free" regime
Non-private ZO | $\tilde{O}(d/(\delta\epsilon^{3}))$ | n/a
Private ZO (RDP; Zhang et al., 27 Jun 2024) | $\tilde{O}\!\left(\frac{d}{\delta\epsilon^{3}} + \frac{d^{3/2}}{\rho\delta\epsilon^{2}}\right)$ | private cost negligible if $\rho \ge \sqrt{d}\,\epsilon$

6. Limitations and Controversies

Sufficiency of Inherent ZO Randomness for Privacy

A major, recently resolved controversy concerns whether the inherent randomness in ZO estimators substitutes for explicit DP noise addition. Several recent works posed the question of whether the stochasticity in sampling directions (as used in, e.g., SPSA or finite-difference schemes) suffices for DP (Gupta et al., 8 Jul 2025).

The answer is negative. Studies show that for common ZO methods (including projected ZO-GD) and a range of even convex objectives, the output distributions corresponding to neighboring datasets can be perfectly distinguishable, especially for "zero-preserving" estimators (where the estimated gradient is identically zero for zero-valued losses). The distributional gap can persist even with randomized initialization, and the privacy loss grows superlinearly with the number of iterations (as $T^{4/3}$ in derived bounds) (Gupta et al., 8 Jul 2025). Therefore, explicit additive noise is necessary to guarantee DP in ZO frameworks; inherent randomness alone is not sufficient.
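The zero-preserving failure mode is easy to reproduce in a toy example: for a record whose loss is identically zero near $x$, the two-point estimate is exactly the zero vector for every sampled direction, so without additive noise the output reveals whether such a record is present (the loss functions below are illustrative):

```python
import numpy as np

def two_point_est(loss, x, u, delta=1e-2):
    """Single-direction two-point gradient estimate for one record's loss."""
    return (x.shape[0] / (2.0 * delta)) * (loss(x + delta * u) - loss(x - delta * u)) * u

rng = np.random.default_rng(0)
x = np.zeros(3)
zero_loss = lambda z: 0.0                          # record fit perfectly: loss identically zero
quad_loss = lambda z: float((z - 1.0) @ (z - 1.0)) # record with nonzero loss around x

# Zero-preserving behavior: the perfectly fit record yields exactly the
# zero vector for EVERY direction, while the other record essentially
# never does, so the two output distributions are perfectly
# distinguishable without explicit additive noise.
for _ in range(100):
    u = rng.standard_normal(3)
    u /= np.linalg.norm(u)
    assert np.all(two_point_est(zero_loss, x, u) == 0.0)
    assert np.any(two_point_est(quad_loss, x, u) != 0.0)
```

The direction $u$ is random, but it multiplies a data-dependent scalar; when that scalar is deterministically zero for one dataset and almost surely nonzero for its neighbor, no amount of directional randomness can hide the difference.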

Amplification by Iteration and Hidden-State Analysis

Recent work generalizes the privacy amplification-by-iteration (PABI) principle from first-order DP-SGD to ZO settings using hybrid noise mechanisms (scalar directional plus isotropic noise) and multiple orthonormal update directions per iteration (Chien et al., 30 May 2025). Critically, these results show that under a hidden-state regime—when only the final iterate is released—the total privacy loss can converge independently of the number of optimization steps. This significantly improves upon naive composition analyses and relates the utility-privacy tradeoff to the number and geometry of queried directions.

7. Comparative Performance and Outlook

Private ZO optimization now matches or exceeds the theoretical and practical efficiency of nonprivate baselines and first-order DP methods in many regimes, particularly for nonconvex, nonsmooth, high-dimensional, or distributed settings (Zhang et al., 27 Jun 2024, Zhang et al., 27 Feb 2025, Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024). Table-based summaries and closed-form bounds in the literature provide precise quantification of the privacy-utility-resource tradeoff.

Future research directions include:

  • Adaptive and dynamic scheduling of noise and smoothing parameters
  • Integration with modular and parameter-efficient fine-tuning schemes
  • Extension to reinforcement learning and other black-box optimization domains
  • Tightening the theoretical gap on best possible sample complexity, especially under composition and amplification

The developments in this area are shaping practical, scalable, and theoretically grounded approaches for privacy-preserving optimization in modern machine learning workflows, particularly where first-order information is inaccessible or computationally costly.
