
Public-Data-Assisted Zeroth-Order Optimizers

Updated 18 November 2025
  • PAZO is a family of gradient-free optimization methods that leverage public-data priors to minimize empirical risk on private datasets under differential privacy constraints.
  • Variants like PAZO-M, PAZO-P, and PAZO-S combine public gradients with privatized zeroth-order estimates to reduce computational costs and improve statistical efficiency.
  • Empirical results demonstrate that PAZO methods offer significant runtime reductions and enhanced accuracy, particularly in high-dimensional and strict privacy settings.

Public-Data-Assisted Zeroth-Order Optimizers (PAZO) constitute a family of optimization algorithms designed to minimize empirical risk on private datasets while leveraging auxiliary public data to guide gradient estimation. These methods blend the computational efficiency and privacy properties of zeroth-order (gradient-free) approaches with the practical utility gains derived from public-data-informed surrogates, yielding superior privacy/utility trade-offs especially in highly private regimes and situations with compute or memory constraints (Gong et al., 13 Nov 2025, Cheng et al., 2021).

1. Problem Setting and Zeroth-Order Framework

The canonical problem formulation is to minimize, subject to a differential privacy (DP) constraint,

F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w; x_i)

over a private dataset D_{\mathrm{priv}}, given also an auxiliary public dataset D_{\mathrm{pub}}. In high-dimensional models (dimension d), direct per-sample gradient computations (as in DP-SGD) are often computationally or memory prohibitive. Zeroth-order (ZO) or gradient-free methods sidestep this bottleneck by estimating the gradient via directional finite differences:

g_\lambda(w; x) = \frac{\ell(w+\lambda u; x) - \ell(w-\lambda u; x)}{2\lambda}\, u, \quad u \in \sqrt{d}\,\mathbb{S}^{d-1}

Noise and clipping are introduced to privatize these estimates efficiently, in contrast to classical first-order methods, where privacy costs scale with full gradient vectors (Gong et al., 13 Nov 2025).
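The two-point estimator above can be sketched in NumPy. The sampling of u from the radius-\sqrt{d} sphere follows the formula; the function names and the quadratic sanity check are illustrative, not from the paper's code:

```python
import numpy as np

def zo_gradient(loss, w, x, lam=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along a random direction.

    Samples u uniformly from the sphere of radius sqrt(d) and returns the
    directional finite difference (l(w+lam*u; x) - l(w-lam*u; x)) / (2*lam) * u.
    """
    rng = rng or np.random.default_rng()
    d = w.shape[0]
    u = rng.standard_normal(d)
    u *= np.sqrt(d) / np.linalg.norm(u)  # project onto the sqrt(d)-radius sphere
    diff = (loss(w + lam * u, x) - loss(w - lam * u, x)) / (2 * lam)
    return diff * u

# Sanity check on f(w) = ||w||^2: averaging many estimates recovers
# the true gradient 2w, since E[u u^T] = I for this sphere.
w = np.ones(4)
g = zo_gradient(lambda w, x: float(w @ w), w, None)
```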

2. PAZO Algorithmic Suite: Variants and Mechanisms

PAZO algorithms integrate public data in various forms to reduce bias and variance in ZO gradient approximations. Three principal variants are prominent:

  • PAZO-M (Mixed): Combines a public-data first-order gradient (g_t^{\mathrm{pub}}) with a (privatized) ZO estimate (\widetilde g_t), forming

g_t = \alpha\, g_t^{\mathrm{pub}} + (1-\alpha)\,\widetilde g_t, \quad \alpha \in (0,1)

The mixing factor \alpha balances public gradient bias against private ZO variance. In practice, \alpha \approx 0.5 provides robust performance.

  • PAZO-P (Public Subspace): Constructs a low-dimensional subspace from k public gradients. Optimization is then restricted to this subspace by taking ZO queries along directions spanned by the public gradients, significantly reducing query complexity in favorable regimes.
  • PAZO-S (Best Public Gradient Selection): At each step, evaluates the private loss for candidate updates along k public gradients and selects the direction that minimizes the (privatized) objective.

Each PAZO variant achieves DP per iteration by clipping directional differences and adding calibrated Gaussian noise, exploiting the fact that ZO queries have lower sensitivity than vector-valued gradients (Gong et al., 13 Nov 2025).
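A minimal sketch of this per-iteration mechanism for PAZO-M, assuming scalar clipping of the directional differences followed by Gaussian noise and mixing with the public gradient (the names and the noise parameterization are ours, not the paper's reference code):

```python
import numpy as np

def pazo_m_step(priv_diffs, us, g_pub, clip=1.0, sigma=1.0, alpha=0.5, rng=None):
    """One illustrative PAZO-M update direction.

    priv_diffs: length-q array of scalar directional differences on the private batch
    us: (q, d) array of the corresponding random directions
    g_pub: first-order gradient computed on public data
    The scalar differences are clipped and Gaussian-noised (their sensitivity is
    scalar, unlike full gradient vectors), then mixed with the public gradient.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(priv_diffs, -clip, clip)
    noisy = clipped + rng.normal(0.0, sigma * clip, size=clipped.shape)
    g_zo = (noisy[:, None] * us).mean(axis=0)    # privatized ZO estimate
    return alpha * g_pub + (1.0 - alpha) * g_zo  # PAZO-M mixing
```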

| Algorithm | Use of Public Data | ZO Queries |
| --- | --- | --- |
| PAZO-M | Mix public/private gradients | Full-dimensional, q per step |
| PAZO-P | Public-gradient subspace | k-dimensional, q per step |
| PAZO-S | Direction selection | k+1 loss evaluations |
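PAZO-S's selection rule can be illustrated as follows; the helper name, the step size, and the optional noising of the loss values are assumptions made for the sketch:

```python
import numpy as np

def pazo_s_select(loss, w, public_grads, eta=0.1, sigma=0.0, rng=None):
    """Pick the public-gradient direction whose candidate update yields the
    lowest (optionally Gaussian-noised) private loss, and apply it."""
    rng = rng or np.random.default_rng()
    losses = [loss(w - eta * g) + rng.normal(0.0, sigma) for g in public_grads]
    best = int(np.argmin(losses))
    return w - eta * public_grads[best]
```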

3. Theoretical Foundations and Guarantees

PAZO methods are analyzed under the assumptions of L-smoothness, M-Lipschitz per-example loss, bounded variance (\sigma_1^2, \sigma_2^2), and \gamma-similarity between the public and private data distributions (i.e., \|\nabla f'(w) - \nabla f(w)\| \le \gamma for all w, where f' denotes the public-data objective).

  • Privacy: The per-iteration noise for DP protection is quantified as

\sigma \ge c_2\,\frac{b\sqrt{T\log(1/\delta)}}{n\,\varepsilon}

for total T iterations, private batch size b, and target (\varepsilon, \delta)-DP (Gong et al., 13 Nov 2025).
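Plugging concrete values into this bound gives a feel for the noise scale. The constant c_2 is left unspecified by the bound, so c_2 = 1 below is purely illustrative:

```python
import math

def dp_noise_scale(b, T, n, eps, delta, c2=1.0):
    """Per-iteration Gaussian noise scale from the bound
    sigma >= c2 * b * sqrt(T * log(1/delta)) / (n * eps).
    c2 is the unspecified constant in the bound; c2=1 is illustrative only."""
    return c2 * b * math.sqrt(T * math.log(1.0 / delta)) / (n * eps)

# E.g. batch size 64, 1000 iterations, n = 50000 private examples,
# (eps, delta) = (1, 1e-5):
s = dp_noise_scale(b=64, T=1000, n=50000, eps=1.0, delta=1e-5)
```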

  • Stationarity Rates (Nonconvex):

    • PAZO-M:

    \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_t)\|^2 \le O\left(\frac{1-\alpha}{\alpha}\sqrt{d}\right) + O\left(\gamma^2 \frac{\alpha \sqrt{d}}{2(1-\alpha)+\alpha \sqrt{d}}\right) + O\left(\frac{\sigma_1^2}{b}+\frac{\sigma_2^2}{b'}\right)

    • PAZO-P:

    \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_t)\|^2 \le O(k) + O\left(\sqrt{\gamma^2+\frac{\sigma_2^2}{b'}+\frac{\sigma_1^2}{b}}\right)

    • PAZO-S:

    \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_t)\|^2 \le O\left(\gamma^2+\frac{\sigma_2^2}{b'}\right)+O\left(\frac{1}{T}\right)

These results confirm that utility under PAZO improves as the public/private gradient dissimilarity \gamma decreases and as the public-data batch size b' increases (Gong et al., 13 Nov 2025).

  • Complexity: Per iteration, PAZO-M and PAZO-P require only forward evaluations on private data (eliminating expensive backpropagation on privacy-sensitive samples), and memory requirements are reduced to storing O(d) or O(kd) vectors.

4. Prior-Guided ZO: Surrogate-Driven Acceleration

The broader theoretical context for public-data priors is provided by prior-guided random gradient-free (PRGF) and accelerated random search (ARS) algorithms (Cheng et al., 2021). In these schemes, a prior direction p_t, instantiated as a surrogate gradient learned on public data, is combined with randomly sampled orthogonal subspaces to project and estimate the underlying (unobserved) true gradient. The convergence speed gains depend on the cosine similarity between p_t and the true gradient. When the prior is uninformative, the estimator reverts efficiently to the (slower) standard ZO case.

Implementation guidelines stipulate that priors must be periodically checked for quality (e.g., monitoring D_t^{\text{est}}) and downweighted or suppressed if public/private data mismatch becomes significant. Robustness is maintained under distribution shift by adaptively modulating reliance on the public-data prior (Cheng et al., 2021).
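A simplified version of this prior-guided construction uses a single random direction orthogonalized against the prior; the PRGF papers use richer subspaces and adaptive weights, so this sketch only captures the fallback behavior (an uninformative prior contributes nothing, leaving the random-direction term):

```python
import numpy as np

def prior_guided_direction(grad_est_along, p, d, rng=None):
    """Illustrative PRGF-style search direction (a simplification).

    grad_est_along(v): finite-difference estimate of the directional
    derivative of the black-box loss along unit vector v.
    p: prior direction, e.g. a surrogate gradient trained on public data.
    """
    rng = rng or np.random.default_rng()
    p = p / (np.linalg.norm(p) + 1e-12)
    u = rng.standard_normal(d)
    u -= (u @ p) * p                   # orthogonalize against the prior
    u /= np.linalg.norm(u) + 1e-12
    # Sum of the estimated gradient components along the prior and along
    # the orthogonal random direction.
    return grad_est_along(p) * p + grad_est_along(u) * u
```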

5. Empirical Evaluation and Practical Impact

PAZO methods have been assessed on vision and language benchmarks under strict DP constraints (public data fraction ≈4%). Empirical findings include:

  • Utility: PAZO-M achieves 71.3% accuracy on CIFAR-10 at \varepsilon = 1, exceeding DP-SGD (50.8%) and matching or outperforming public-data-augmented first-order baselines (DOPE-SGD at 70.9%). On MNLI, PAZO-P attains 69.8% at \varepsilon = 1, outperforming DOPE-SGD (68.0%) (Gong et al., 13 Nov 2025).
  • Computational Efficiency: Significant runtime reductions are observed: on CIFAR-10, PAZO-M is 8\times faster per iteration than DP-SGD; on IMDB, 9\times faster. Maximum speedups reach 16\times versus DP-SGD, with competitive or superior accuracy at tight privacy levels.
  • Best-Use Scenarios: PAZO is especially advantageous where per-sample backpropagation is expensive and a modest amount of distributionally similar public data is available. Tight-privacy (low-\varepsilon) regimes further accentuate its gains.

6. Limitations, Assumptions, and Implementation Nuances

PAZO algorithms require:

  • \gamma-similarity between public and private data gradients; large divergence between the distributions leads to reduced utility and slower convergence.
  • A small public batch/subspace size (k = 3–6 recommended).
  • Hyperparameters (e.g., mixing parameter \alpha \in [0.25, 0.75], smoothing \lambda, query number q) are robust in typical ranges.
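The recommendations above might be collected into a configuration such as the following; the \lambda and q values are assumed magnitudes for illustration, not taken from the paper:

```python
# Illustrative defaults drawn from the recommended ranges above;
# the names are ours, not from any released PAZO code.
pazo_defaults = {
    "alpha": 0.5,   # PAZO-M mixing weight, robust in [0.25, 0.75]
    "k": 4,         # public batch/subspace size (3-6 recommended)
    "lam": 1e-3,    # finite-difference smoothing (assumed magnitude)
    "q": 4,         # ZO queries per iteration (assumed magnitude)
}
```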

Limitations include reliance on surrogate prior informativeness—performance may revert to baseline ZO if the public prior is misaligned. PAZO-S is most effective when the public subspace is very low-dimensional. Surrogate models must be entirely public to preserve privacy guarantees (Gong et al., 13 Nov 2025, Cheng et al., 2021).

7. Relation to Broader Prior-Guided and ZO Methodologies

The principles of PAZO extend the notion of prior-guided ZO estimation, originally explored in the PRGF and ARS frameworks (Cheng et al., 2021), to privacy-sensitive contexts. By embedding public-data-trained surrogates as priors, these methods achieve superior query efficiency and reduced variance, provided distributional alignment is sufficient. The fallback mechanisms and adaptivity central to PAZO ensure robustness under modest domain mismatch.

In summary, Public-Data-Assisted Zeroth-Order Optimizers provide an operationally efficient, privacy-preserving class of algorithms, fundamentally enabled by leveraging public information as guidance for gradient estimation. The confluence of ZO privatization and data-driven priors closes the gap between first-order and gradient-free DP optimization in terms of utility, while offering substantial computational and memory efficiency in practical deployments (Gong et al., 13 Nov 2025, Cheng et al., 2021).
