
Differentially Private Zeroth-Order Optimization

Updated 10 July 2025
  • Differentially private zeroth-order optimization is a framework that enables gradient-free learning by relying solely on function evaluations with strict privacy guarantees.
  • It employs techniques like randomized smoothing, two-point finite-difference estimators, and calibrated noise aggregation to mask individual data contributions.
  • Practical implementations have proven effective in high-dimensional settings, such as large language model fine-tuning, federated learning, and distributed optimization.

Differentially private zeroth-order optimization encompasses a rapidly evolving family of algorithmic techniques that enable data-driven learning and optimization in settings where only function values (not explicit gradients) are available, all while rigorously satisfying differential privacy guarantees. These algorithms are increasingly important in domains such as LLM fine-tuning, federated learning, and distributed optimization, particularly in high-dimensional and resource-constrained environments. The field covers theoretical insights, algorithmic advances, practical implementations, and important controversies regarding the sufficiency of inherent randomness for privacy.

1. Foundational Principles and Definitions

Differential privacy (DP) is a rigorous mathematical framework ensuring that the outputs of an algorithm do not reveal sensitive information about any individual data point. Formally, an algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if for all neighboring datasets $D, D'$ and for all measurable sets $S$,

$$\Pr[\mathcal{A}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

Zeroth-order (ZO) optimization refers to methods that do not require gradient information and instead operate via black-box queries to the objective function. In the context of DP, these methods must mask the dependency of function outputs on individual data points, typically by adding carefully calibrated noise to intermediate computations or outputs so that any single change in the data does not substantially affect the probability distribution over observed outputs.
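As a concrete illustration of calibrated noise addition, the classical Gaussian mechanism releases a bounded-sensitivity scalar privately. The sketch below uses the standard sufficient condition for the noise scale under $(\varepsilon, \delta)$-DP; the function name and the example query are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release a scalar privately by adding Gaussian noise whose scale is
    calibrated to the query's sensitivity (classical (eps, delta)-DP)."""
    rng = np.random.default_rng() if rng is None else rng
    # Standard sufficient condition: sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

# Example: privately release the mean of n records, each bounded in [0, 1].
data = np.linspace(0.0, 1.0, 100)
sensitivity = 1.0 / len(data)   # changing one record moves the mean by at most 1/n
private_mean = gaussian_mechanism(data.mean(), sensitivity, eps=1.0, delta=1e-5)
```

Because the sensitivity here shrinks as $1/n$, the same $(\varepsilon, \delta)$ budget permits proportionally less noise on larger datasets.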

Several nuanced privacy definitions appear in the literature:

  • $(\varepsilon, \delta)$-DP (classical, as above)
  • Rényi Differential Privacy (RDP): parameterized by an order $\alpha > 1$, providing tighter composition and subsampling analysis, used especially for iterative algorithms (Zhang et al., 27 Jun 2024; Chien et al., 30 May 2025).

A key challenge in ZO settings is that, compared to first-order methods, one must carefully handle the directional or anisotropic nature of sensitivity and noise calibration because updates rely on random directions rather than full data-dependent gradients.

2. Algorithmic Methodologies

Gradient-Free and Smoothing Schemes

The most prevalent approach to differentially private ZO optimization combines randomized smoothing with two-point finite-difference estimators. For a non-smooth or nonconvex Lipschitz function $f:\mathbb{R}^d \rightarrow \mathbb{R}$, a smoothed surrogate $\hat{f}_\delta(x)$ is defined via

$$\hat{f}_\delta(x) = \mathbb{E}_{u \sim \mathcal{U}(S^{d-1})}[f(x + \delta u)],$$

where $u$ is sampled uniformly from the unit sphere and $\delta > 0$ controls the smoothing. The gradient of $\hat{f}_\delta$ is estimated using a two-point difference:

$$\widetilde{\nabla} f(x) = \frac{d}{2\delta}\,(f(x+\delta u) - f(x-\delta u))\,u.$$

Variance reduction is performed either by averaging across multiple independent directions or by leveraging specific aggregation schemes such as the "tree mechanism" to compose private estimates while controlling cumulative privacy loss (Zhang et al., 27 Jun 2024).
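The two-point estimator can be sketched directly; the function name, defaults, and the quadratic sanity check are illustrative, and this building block is not private on its own:

```python
import numpy as np

def two_point_grad(f, x, delta=1e-3, num_dirs=1, rng=None):
    """Estimate the gradient of the smoothed surrogate f_delta at x via the
    two-point scheme (d / (2*delta)) * (f(x+delta*u) - f(x-delta*u)) * u,
    averaged over num_dirs directions u drawn uniformly on the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    est = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)      # uniform direction on S^{d-1}
        est += (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
    return est / num_dirs

# On a quadratic the estimator is unbiased for the true gradient 2x;
# averaging over many directions drives down its variance.
f = lambda z: float(z @ z)
x = np.array([1.0, -2.0, 0.5])
g = two_point_grad(f, x, num_dirs=5000, rng=np.random.default_rng(0))
```

Only two function evaluations per direction are needed, which is what makes the scheme attractive when gradients are unavailable or per-example gradient storage is prohibitive.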

Noise Calibration and Aggregation

Noise is typically added either to the scalar finite-difference evaluations themselves or to the aggregated (multi-sample averaged) gradient estimates.

A critical property is sensitivity calibration: the sensitivity of the (possibly multi-sample averaged) gradient estimator is quantified (often $O(dL/T)$, where $L$ is a Lipschitz constant and $T$ is the number of iterations), and noise is scaled accordingly. Batch and mini-batch structures are employed to further dilute the privacy cost (Zhang et al., 27 Feb 2025; Zhang et al., 27 Jun 2024).
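One way these pieces fit together is a single update in which the scalar difference is clipped to bound its sensitivity before Gaussian noise is added. This is a schematic sketch, not any specific published algorithm; the clip level, noise scale, and learning rate are illustrative assumptions, and a real guarantee would require accounting the noise scale across all $T$ iterations:

```python
import numpy as np

def private_zo_step(f, x, smooth, lr, clip, sigma, rng):
    """One illustrative DP-ZO update: the scalar finite difference is
    clipped (bounding its sensitivity), Gaussian noise scaled to that
    bound is added, and the data influences the update only through the
    resulting noised, bounded scalar."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    diff = (f(x + smooth * u) - f(x - smooth * u)) / (2.0 * smooth)
    diff = float(np.clip(diff, -clip, clip))      # sensitivity now bounded by the clip level
    noisy = diff + rng.normal(0.0, sigma * clip)  # noise calibrated to the clipped sensitivity
    return x - lr * d * noisy * u

# A few noisy steps on a quadratic (all hyperparameters illustrative).
rng = np.random.default_rng(1)
f = lambda z: float(z @ z)
x = np.ones(4)
for _ in range(50):
    x = private_zo_step(f, x, smooth=1e-3, lr=0.02, clip=5.0, sigma=0.1, rng=rng)
```

Note that only a scalar, not a $d$-dimensional vector, needs to be noised and communicated per query, which is the property exploited in the distributed and VFL settings discussed below.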

3. Theoretical Guarantees and Complexity

A central result of recent work is that differentially private zeroth-order algorithms can, under appropriate conditions, achieve sample complexity and convergence rates matching those of their nonprivate analogs (Zhang et al., 27 Jun 2024). Specifically, to reach a $(\delta, \epsilon)$-stationary point of the smoothed function, the algorithm requires

$$M = \widetilde{\Omega}\left(\frac{d}{\delta\epsilon^3} + \frac{d^{3/2}}{\rho\,\delta\epsilon^2}\right)$$

samples, where $\rho$ is a privacy parameter (e.g., in RDP) and $\widetilde{\Omega}$ hides logarithmic factors.

"Privacy for free": when $\rho \geq \sqrt{d}\,\epsilon$, the additional complexity due to privacy is dominated by the nonprivate term, indicating that strong privacy can sometimes be enforced with a negligible sample-complexity penalty (Zhang et al., 27 Jun 2024). This result extends the optimal dimension-dependent rates of standard nonprivate (stochastic) zeroth-order optimization to the private setting, especially for the high-dimensional models encountered in practice.
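The threshold is easy to verify numerically: at $\rho = \sqrt{d}\,\epsilon$ the two terms of the bound coincide, and for larger $\rho$ the nonprivate term dominates. A quick check, ignoring the hidden logarithmic factors:

```python
import math

def complexity_terms(d, delta, eps, rho):
    """The two terms of M = Omega~( d/(delta*eps^3) + d^{3/2}/(rho*delta*eps^2) )."""
    nonprivate = d / (delta * eps**3)
    private = d**1.5 / (rho * delta * eps**2)
    return nonprivate, private

d, delta, eps = 10_000, 0.1, 0.5
# At the threshold rho = sqrt(d) * eps, the two terms are equal:
t1, t2 = complexity_terms(d, delta, eps, rho=math.sqrt(d) * eps)
assert math.isclose(t1, t2)
```

Below the threshold (small $\rho$, i.e. stronger privacy) the private term grows as $\sqrt{d}\,\epsilon/\rho$ times the nonprivate one, so the sample penalty is no longer negligible.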

4. Extensions to Nonconvex, Nonsmooth, and High-Dimensional Settings

Unlike classic DP optimization, which often assumes convexity or smoothness (and is thus limited to first-order methods), the methodologies here apply uniform or randomized smoothing to create surrogate functions amenable to ZO gradient estimation, even for arbitrary nonconvex or nonsmooth problems (Zhang et al., 27 Jun 2024). Stationarity is defined in the "Goldstein-stationary" sense: a point $x$ is $(\delta, \epsilon)$-stationary if the norm of the averaged (local) gradients in a $\delta$-neighborhood of $x$ is less than $\epsilon$.
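A Monte-Carlo check of this notion can be sketched as follows. Strictly, Goldstein stationarity minimizes over convex combinations of nearby gradients; the plain average below is one such combination, so its norm upper-bounds the Goldstein gap (function name and parameters are illustrative):

```python
import numpy as np

def goldstein_gap(grad, x, delta=0.1, num_samples=2000, rng=None):
    """Monte-Carlo estimate of || E[grad f(y)] || for y uniform in the
    delta-ball around x. This averaged gradient is one element of the
    convex hull that Goldstein stationarity minimizes over, so the value
    returned upper-bounds the Goldstein gap at x."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    acc = np.zeros(d)
    for _ in range(num_samples):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        r = delta * rng.random() ** (1.0 / d)   # radius for a uniform draw in the ball
        acc += grad(x + r * v)
    return float(np.linalg.norm(acc / num_samples))

# |x| is nonsmooth at 0, yet 0 is Goldstein-stationary: the gradients
# +1 and -1 on either side average out over a symmetric neighborhood.
gap = goldstein_gap(lambda z: np.sign(z), np.zeros(3), delta=0.1,
                    rng=np.random.default_rng(0))
```

Here `gap` is close to zero even though the pointwise gradient norm of $|x|$ is 1 everywhere away from the kink, which is why this relaxed notion is the natural target for nonsmooth ZO analysis.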

The combination of

  • multiple function evaluations along independent directions,
  • variance reduction, and
  • tree-based aggregation

allows these algorithms to lift first-order privacy-utility tradeoffs and to operate with minimal structural assumptions on the underlying objective.
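The tree-based aggregation can be sketched as follows: each value enters $O(\log T)$ dyadic-interval nodes, noise is added once per node, and any running sum is assembled from at most $O(\log T)$ noisy nodes, so the error grows polylogarithmically in $T$ rather than linearly. This is a generic schematic (restricted to $T$ a power of two for brevity); the noise scale would have to be calibrated to the per-node sensitivity for an actual DP guarantee:

```python
import numpy as np

def tree_prefix_sums(values, sigma, rng=None):
    """Binary-tree ("tree mechanism") aggregation sketch: noise each
    dyadic partial sum once, then build every prefix sum from at most
    O(log T) noisy nodes."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(values)
    assert T & (T - 1) == 0, "this sketch assumes T is a power of two"
    nodes = {}                       # (level k, index i) -> noisy sum of values[i*2^k:(i+1)*2^k]
    level = [float(v) for v in values]
    k = 0
    while level:
        for i, s in enumerate(level):
            nodes[(k, i)] = s + rng.normal(0.0, sigma)
        level = [level[2 * i] + level[2 * i + 1] for i in range(len(level) // 2)]
        k += 1

    def prefix(t):
        """Noisy sum of values[:t], assembled greedily from disjoint dyadic nodes."""
        total, pos = 0.0, 0
        for k in range(T.bit_length(), -1, -1):
            size = 1 << k
            if t - pos >= size:
                total += nodes[(k, pos >> k)]
                pos += size
        return total

    return prefix

# With sigma = 0 the mechanism reproduces exact prefix sums.
p = tree_prefix_sums([1.0, 2.0, 3.0, 4.0], sigma=0.0)
```

The same structure is what lets iterative private ZO methods release a running aggregate of gradient estimates while each individual estimate is noised only a logarithmic number of times.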

5. Practical Implementations and Impact

DP ZO optimization is increasingly deployed for:

  • Fine-tuning LLMs under strict memory constraints (Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024, Chien et al., 30 May 2025), because per-example gradient computation and storage required by DP-SGD become impractical when model size reaches billions of parameters. ZO approaches require only forward passes (function evaluations), substantially reducing compute and memory requirements.
  • Vertical Federated Learning (VFL), avoiding gradient exposure and its associated privacy risks by replacing backward communication with a single forward pass and a DP scalar difference at the server (Zhang et al., 27 Feb 2025).
  • Distributed optimization settings, where no gradients are available or communication bottlenecks preclude vector-valued noise addition (Gratton et al., 2020).

Empirical results demonstrate that with careful hyperparameter scheduling, DP ZO methods can attain accuracy close to or sometimes exceeding typical DP-SGD baselines under strong privacy constraints, while being far more scalable (Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024, Zhang et al., 27 Feb 2025).

Summary Table: Representative Sample Complexities

Setting | Complexity bound | "Privacy for free" regime
Non-private ZO | $\tilde{O}(d/(\delta\epsilon^{3}))$ | n/a
Private ZO (RDP; Zhang et al., 27 Jun 2024) | $\tilde{O}\!\left(\frac{d}{\delta\epsilon^{3}} + \frac{d^{3/2}}{\rho\delta\epsilon^{2}}\right)$ | private cost negligible if $\rho \ge \sqrt{d}\,\epsilon$

6. Limitations and Controversies

Sufficiency of Inherent ZO Randomness for Privacy

A major, recently resolved controversy concerns whether the inherent randomness in ZO estimators substitutes for explicit DP noise addition. Several recent works posed the question of whether the stochasticity in sampling directions (as used in, e.g., SPSA or finite-difference schemes) suffices for DP (Gupta et al., 8 Jul 2025).

The answer is negative. Studies show that for common ZO methods (including projected ZO-GD) and a range of even convex objectives, the output distributions corresponding to neighboring datasets can be perfectly distinguishable, especially for "zero-preserving" estimators (where the estimated gradient is identically zero for zero-valued losses). The distributional gap can persist even with randomized initialization, and the privacy loss grows superlinearly with the number of iterations (as $T^{4/3}$ in derived bounds) (Gupta et al., 8 Jul 2025). Therefore, explicit additive noise is necessary to guarantee DP in ZO frameworks; inherent randomness alone is not sufficient.
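The zero-preserving failure mode is easy to reproduce in a toy example: for a record whose loss is identically zero near $x$, the two-point estimate is exactly the zero vector for every sampled direction, so without additive noise the output reveals whether such a record is present (the loss functions below are illustrative):

```python
import numpy as np

def two_point_est(loss, x, u, delta=1e-2):
    """Single-direction two-point gradient estimate for one record's loss."""
    return (x.shape[0] / (2.0 * delta)) * (loss(x + delta * u) - loss(x - delta * u)) * u

rng = np.random.default_rng(0)
x = np.zeros(3)
zero_loss = lambda z: 0.0                          # record fit perfectly: loss identically zero
quad_loss = lambda z: float((z - 1.0) @ (z - 1.0)) # record with nonzero loss around x

# Zero-preserving behavior: the perfectly fit record yields exactly the
# zero vector for EVERY direction, while the other record essentially
# never does, so the two output distributions are perfectly
# distinguishable without explicit additive noise.
for _ in range(100):
    u = rng.standard_normal(3)
    u /= np.linalg.norm(u)
    assert np.all(two_point_est(zero_loss, x, u) == 0.0)
    assert np.any(two_point_est(quad_loss, x, u) != 0.0)
```

The direction $u$ is random, but it multiplies a data-dependent scalar; when that scalar is deterministically zero for one dataset and almost surely nonzero for its neighbor, no amount of directional randomness can hide the difference.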

Amplification by Iteration and Hidden-State Analysis

Recent work generalizes the privacy amplification-by-iteration (PABI) principle from first-order DP-SGD to ZO settings using hybrid noise mechanisms (scalar directional plus isotropic noise) and multiple orthonormal update directions per iteration (Chien et al., 30 May 2025). Critically, these results show that under a hidden-state regime—when only the final iterate is released—the total privacy loss can converge independently of the number of optimization steps. This significantly improves upon naive composition analyses and relates the utility-privacy tradeoff to the number and geometry of queried directions.

7. Comparative Performance and Outlook

Private ZO optimization now matches or exceeds the theoretical and practical efficiency of nonprivate baselines and first-order DP methods in many regimes, particularly for nonconvex, nonsmooth, high-dimensional, or distributed settings (Zhang et al., 27 Jun 2024, Zhang et al., 27 Feb 2025, Tang et al., 9 Jan 2024, Liu et al., 12 Feb 2024). Table-based summaries and closed-form bounds in the literature provide precise quantification of the privacy-utility-resource tradeoff.

Future research directions include:

  • Adaptive and dynamic scheduling of noise and smoothing parameters
  • Integration with modular and parameter-efficient fine-tuning schemes
  • Extension to reinforcement learning and other black-box optimization domains
  • Tightening the theoretical gap on best possible sample complexity, especially under composition and amplification

The developments in this area are shaping practical, scalable, and theoretically grounded approaches for privacy-preserving optimization in modern machine learning workflows, particularly where first-order information is inaccessible or computationally costly.
