Gradient-Free Federated Learning

Updated 23 June 2026

Gradient-Free Federated Learning is a set of techniques that builds collaborative models using function evaluations and summary statistics instead of gradients.
It incorporates diverse methods including Bayesian generative models, tree ensembles, operator-theoretic mappings, and zeroth-order optimization to enhance privacy and efficiency.
GFFL is applied in privacy-critical, heterogeneous environments, offering robust performance with significant communication savings.

Gradient-Free Federated Learning (GFFL) encompasses a family of federated learning techniques in which statistical or predictive models are collaboratively constructed across distributed clients without transmission or aggregation of explicit model gradients or their surrogates. GFFL methodologies leverage function value queries, summary statistics, ensemble artifacts, or kernelized representations rather than stochastic gradient information. This design mitigates privacy risk, reduces communication overhead, and improves interoperability in heterogeneous or privacy-critical environments. The approaches described below span Bayesian generative models, tree ensembles, zeroth-order optimization, operator-theoretic mappings, and Riemannian projection, collectively advancing the GFFL paradigm.

1. Theoretical Foundations and Motivations

Gradient-free federated learning arises from the recognition that direct exchange of gradient or Hessian information in traditional federated optimization (e.g., FedAvg) exposes sensitive data to inference attacks and induces high, repetitive communication costs. Classical federated algorithms often couple local stochastic optimization with parameter or gradient aggregation at the server, a structure that is ill-suited for non-differentiable objectives, hard parameter constraints, or privacy-restricted settings. GFFL frameworks instead operate in regimes where clients can only return function values, summary statistics, or high-level model outputs—enabling robust learning even when gradients are unavailable, ill-defined (e.g., for decision trees), or too privacy-sensitive to share (Hahn et al., 2020, Ma et al., 2023, Lobanov et al., 2022, Ma et al., 8 Mar 2025, Wang et al., 30 Jul 2025, Kumar et al., 30 Nov 2025).

The principal theoretical instruments include:

Zeroth-order (finite-difference) optimization,
Smoothing techniques for non-smooth and non-differentiable objectives,
Operator-theoretic mappings between function spaces,
Approximate Bayesian computation (ABC) and likelihood-free inference,
Aggregation and transfer via model artifacts or kernelized summaries.

Applications extend to settings with restricted computational budgets, structured parameter manifolds, privacy regulations (e.g., healthcare, finance), and heterogeneous or black-box clients.

2. Key Methodological Variants

Gradient-free FL is implemented via diverse algorithmic paradigms, determined by the nature of local workflows, the communicated artifacts, and the global aggregation mechanism. The principal GFFL classes include:

Bayesian Generative Models with Summary Statistics

The GRAFFL framework eschews gradient aggregation in favor of summary-statistic-based collaboration (Hahn et al., 2020). Clients compress their data into sufficient, linearly separable representations using locally trained autoencoders (SuffiAE), guaranteeing Bayesian sufficiency and inviolability via dimension reduction and injected noise. The central server applies ABC samplers to match simulated and observed summaries, accepting parameter proposals when their summary-discrepancy falls below a threshold. No gradients, raw data, or model parameters are exchanged. This design provably preserves privacy and enables federated inference of Bayesian generative models, such as Gaussian mixture models (GMMs).
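The accept/reject step described above can be illustrated with a minimal ABC rejection sampler. This is a sketch under stated assumptions, not the GRAFFL implementation: the model is a one-dimensional Gaussian with known unit variance, the prior is uniform on [-5, 5], the summary is a hand-written (mean, standard deviation) pair standing in for the learned SuffiAE encoding, and the server simply averages the client summaries. All function names are hypothetical.

```python
import random
import statistics

def summary(data):
    """Hypothetical stand-in for a learned summary: (mean, population stdev)."""
    return (statistics.fmean(data), statistics.pstdev(data))

def discrepancy(s1, s2):
    """Euclidean distance between two summary vectors."""
    return sum((a - b) ** 2 for a, b in zip(s1, s2)) ** 0.5

def abc_rejection(observed, n_proposals, eps, n_sim, rng):
    """Keep proposals whose simulated summary lies within eps of the observed one."""
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)  # assumed uniform prior on the mean
        simulated = [rng.gauss(theta, 1.0) for _ in range(n_sim)]
        if discrepancy(summary(simulated), observed) < eps:
            accepted.append(theta)
    return accepted

rng = random.Random(0)

# Three clients hold private samples; only their summaries leave the client.
clients = [[rng.gauss(2.0, 1.0) for _ in range(200)] for _ in range(3)]
client_summaries = [summary(c) for c in clients]

# Assumption: the server pools summaries by coordinate-wise averaging.
observed = tuple(statistics.fmean(col) for col in zip(*client_summaries))

accepted = abc_rejection(observed, n_proposals=2000, eps=0.2, n_sim=200, rng=rng)
posterior_mean = statistics.fmean(accepted)
```

Shrinking `eps` tightens the approximation to the posterior at the cost of fewer accepted proposals, which is the trade-off the threshold controls.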

Gradient-Free Tree Ensembles

In the FedXGBllr framework, federated gradient boosting is performed without any gradient or Hessian exchange (Ma et al., 2023). Each client trains a local XGBoost model on its own data; the gradients used for tree construction never leave the client. Clients transmit only their final local tree ensembles to a central server, which aggregates them and broadcasts all trees to all clients. Per-tree learning rates are then learned jointly by a small convolutional neural network in a federated optimization loop, and only the weights of these networks are communicated in subsequent rounds. This approach greatly reduces communication compared to split-level gradient sharing and enhances privacy by eliminating explicit derivative exposure.

Operator-Theoretic and Kernel-Based Methods

An operator-theoretic approach maps the optimal regression function in $L^2$ directly into a reproducing kernel Hilbert space (RKHS), bypassing parameter gradients entirely (Kumar et al., 30 Nov 2025). Each client computes class-conditional kernel summaries (e.g., via Kernel Affine Hull Machines, KAHMs) and transmits only scalar space-folding measures. The server aggregates these scalars and establishes the global classifier by minimum assignment, supporting differentially private or fully homomorphic encrypted (FHE) communication. This protocol limits communication to $O(QC)$ scalars and achieves provable excess risk bounds without iterative gradient updates.
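The one-shot "scalars in, minimum assignment out" pattern can be sketched as follows. This is only an illustration of the communication structure: the distance to a local class centroid is a hypothetical stand-in for the KAHM space-folding measure, and aggregating by taking the minimum over clients is an assumption, not a detail confirmed by the source.

```python
import math

def class_centroids(points, labels):
    """Per-class mean of a client's local points (stand-in for a kernel summary)."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def client_scores(centroids, x):
    """One scalar per locally known class: distance from the query to that class."""
    return {y: math.dist(c, x) for y, c in centroids.items()}

def server_predict(all_scores):
    """Assumed aggregation: minimum over clients, then minimum over classes."""
    best = {}
    for scores in all_scores:
        for y, s in scores.items():
            best[y] = min(s, best.get(y, math.inf))
    return min(best, key=best.get)

# Two clients with different label coverage (label heterogeneity).
client_a = class_centroids([(0, 0), (1, 0), (5, 5)], [0, 0, 1])
client_b = class_centroids([(6, 5), (0, 6), (1, 5)], [1, 2, 2])

queries = [(0.2, 0.1), (0.4, 5.0), (5.5, 5.0)]
predictions = [
    server_predict([client_scores(client_a, q), client_scores(client_b, q)])
    for q in queries
]
```

In this toy setting each client sends one scalar per class it knows about, and no iterative optimization or gradient exchange takes place.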

Zeroth-Order and Smoothing-Based Stochastic Optimization

For non-smooth stochastic convex objectives, GFFL employs smoothing via $\ell_1$- or $\ell_2$-randomization: function values are averaged over local perturbations to render the objectives smooth (Lobanov et al., 2022). Gradient-free estimators built from single-point or two-point finite-difference schemes, which are unbiased for the gradient of the smoothed objective, allow each client to perform stochastic updates and participate in federated averaging. This delivers optimal oracle complexity up to logarithmic factors, with robust dimension-scaling and greater resilience to noise for $\ell_1$-randomization.
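A minimal sketch of the two-point scheme with $\ell_2$-randomization inside a federated averaging loop is given below. The client objectives ($\ell_1$ distances to hypothetical targets), the step size, the smoothing parameter, and all function names are assumptions chosen for illustration; the objectives are deterministic here, whereas the setting above is stochastic.

```python
import math
import random

def unit_sphere(d, rng):
    """Sample a direction uniformly from the unit Euclidean sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def two_point_grad(f, x, mu, rng):
    """Two-point estimator: d / (2 mu) * [f(x + mu e) - f(x - mu e)] * e."""
    d = len(x)
    e = unit_sphere(d, rng)
    plus = f([xi + mu * ei for xi, ei in zip(x, e)])
    minus = f([xi - mu * ei for xi, ei in zip(x, e)])
    scale = d * (plus - minus) / (2.0 * mu)
    return [scale * ei for ei in e]

def make_l1_loss(target):
    """Non-smooth convex client objective: l1 distance to a local target."""
    return lambda x: sum(abs(xi - ti) for xi, ti in zip(x, target))

def federated_zo(losses, x0, rounds, local_steps, lr, mu, rng):
    """Each client runs local zeroth-order steps; the server averages the models."""
    x = list(x0)
    for _ in range(rounds):
        local_models = []
        for f in losses:
            y = list(x)
            for _ in range(local_steps):
                g = two_point_grad(f, y, mu, rng)
                y = [yi - lr * gi for yi, gi in zip(y, g)]
            local_models.append(y)
        x = [sum(col) / len(col) for col in zip(*local_models)]
    return x

rng = random.Random(0)
dim = 5
losses = [make_l1_loss([t] * dim) for t in (1.0, 2.0, 3.0)]

def global_loss(x):
    return sum(f(x) for f in losses) / len(losses)

x0 = [0.0] * dim
initial_loss = global_loss(x0)
x_final = federated_zo(losses, x0, rounds=200, local_steps=5, lr=0.02, mu=1e-3, rng=rng)
final_loss = global_loss(x_final)
```

Only function values are queried on each client, and only model vectors travel between clients and server.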

Zeroth-Order Projection and Manifold Optimization

For federated optimization under manifold constraints (orthogonality, low-rank), a projection-based zeroth-order estimator is constructed from Euclidean perturbations and subsequent projections onto the feasible manifold (Wang et al., 30 Jul 2025). The algorithm achieves bias and variance guarantees matching first-order methods, using only function evaluations and projections. Applications include black-box adversarial attack design, low-rank neural network training, and kernel PCA with strong convergence guarantees.
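The idea of perturbing in Euclidean space and then projecting back can be sketched on the simplest manifold, the unit sphere, where projection is normalization. The exact estimator of Wang et al. is not reproduced here; the symmetric two-point form, the quadratic client objectives, and the "average then project" server step are all assumptions made for illustration.

```python
import math
import random

def project_sphere(x):
    """Projection onto the unit sphere (the feasible manifold in this sketch)."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]

def random_direction(d, rng):
    """Uniform direction on the unit Euclidean sphere in R^d."""
    return project_sphere([rng.gauss(0.0, 1.0) for _ in range(d)])

def quad(diag):
    """Client objective x^T A x with a diagonal matrix A (hypothetical data)."""
    return lambda x: sum(a * c * c for a, c in zip(diag, x))

def zo_projected_grad(f, x, mu, rng):
    """Perturb in Euclidean space, project back, and difference function values."""
    d = len(x)
    u = random_direction(d, rng)
    plus = f(project_sphere([xi + mu * ui for xi, ui in zip(x, u)]))
    minus = f(project_sphere([xi - mu * ui for xi, ui in zip(x, u)]))
    scale = d * (plus - minus) / (2.0 * mu)
    return [scale * ui for ui in u]

def federated_sphere_zo(losses, x0, rounds, lr, mu, rng):
    """One projected zeroth-order step per client, then average and re-project."""
    x = project_sphere(x0)
    for _ in range(rounds):
        local_points = []
        for f in losses:
            g = zo_projected_grad(f, x, mu, rng)
            step = [xi - lr * gi for xi, gi in zip(x, g)]
            local_points.append(project_sphere(step))
        mean = [sum(col) / len(col) for col in zip(*local_points)]
        x = project_sphere(mean)
    return x

rng = random.Random(0)
losses = [quad([1.0, 2.0, 3.0]), quad([1.0, 4.0, 5.0])]

def global_loss(x):
    return sum(f(x) for f in losses) / len(losses)

x_final = federated_sphere_zo(losses, [1.0, 1.0, 1.0], rounds=300, lr=0.05, mu=1e-3, rng=rng)
final_value = global_loss(x_final)
final_norm = math.sqrt(sum(c * c for c in x_final))
```

The minimum of the averaged objective on the sphere is 1 (attained along the first coordinate axis), so `final_value` should approach 1 while the iterate stays feasible.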

Black-Box Data-Free Knowledge Transfer

In FedZGE, a generative model is maintained at the server and trained by sending synthetic data and its perturbed copies to clients for inference through their local models (Ma et al., 8 Mar 2025). Only the outputs (logits) of client models—never their parameters—are sent back, enabling black-box, model-agnostic federation without access to real data. Generator updates use zeroth-order estimated gradients via finite differences, and the protocol supports heterogeneous client architectures and data-free operation.
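The query pattern described above (synthetic sample plus perturbed copies sent out, logits returned, finite differences taken on the server) can be sketched as follows. This is a toy under stated assumptions rather than the FedZGE algorithm: the generator is a two-parameter map `tanh(theta) + 0.1 * z`, the client is a fixed linear classifier with hypothetical weights, the training signal is cross-entropy towards a target class, and a forward-difference estimator with `q` random directions is used.

```python
import math
import random

def unit_sphere(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def client_logits(batch):
    """Black-box client: the server only ever sees these outputs."""
    weights = [[2.0, -1.0], [-1.0, 2.0]]  # hypothetical, hidden from the server
    return [[sum(w * x for w, x in zip(row, s)) for row in weights] for s in batch]

def cross_entropy(logits, target):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[target]

def generator(theta, z):
    """Toy server-side generator with parameters theta and noise z."""
    return [math.tanh(t) + 0.1 * zi for t, zi in zip(theta, z)]

def zo_input_grad(x_hat, target, mu, q, rng):
    """Estimate dL/dx_hat from client logits on x_hat and q perturbed copies."""
    d = len(x_hat)
    dirs = [unit_sphere(d, rng) for _ in range(q)]
    batch = [x_hat] + [[x + mu * u for x, u in zip(x_hat, e)] for e in dirs]
    losses = [cross_entropy(l, target) for l in client_logits(batch)]
    grad = [0.0] * d
    for e, loss in zip(dirs, losses[1:]):
        coeff = d * (loss - losses[0]) / (mu * q)
        for j in range(d):
            grad[j] += coeff * e[j]
    return grad

def train_generator(theta, steps, lr, mu, q, target, rng):
    theta = list(theta)
    for _ in range(steps):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        g_x = zo_input_grad(generator(theta, z), target, mu, q, rng)
        # Chain rule through the generator: d x_hat_j / d theta_j = 1 - tanh(theta_j)^2.
        theta = [t - lr * g * (1.0 - math.tanh(t) ** 2) for t, g in zip(theta, g_x)]
    return theta

def loss_at(theta, target=0):
    x_hat = generator(theta, [0.0] * len(theta))
    return cross_entropy(client_logits([x_hat])[0], target)

rng = random.Random(0)
theta0 = [0.0, 0.0]
initial_loss = loss_at(theta0)
theta = train_generator(theta0, steps=200, lr=0.5, mu=1e-3, q=4, target=0, rng=rng)
final_loss = loss_at(theta)
```

The client never exposes its weights, and the server differentiates only through its own generator, which is why heterogeneous client architectures pose no obstacle in this pattern.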

3. Mathematical Schemes and Statistical Properties

The diversity of GFFL methodologies is reflected in their mathematical infrastructure:

Approximate Bayesian computation (ABC) in GRAFFL: Posterior inference of model parameters is replaced by threshold-acceptance of parameter proposals whose simulated summary statistics match those from real data within a selected $\epsilon$; with sufficient statistics, the scheme is optimal as $\epsilon \to 0$ (Hahn et al., 2020).
Finite-difference gradient estimators for smoothings: For a function $f(x)$ on $\mathbb{R}^d$, a smoothing parameter $\mu>0$, and a direction $e$ sampled uniformly from the unit sphere, finite-difference estimators such as

$g^{(2)}(x;\xi, e) = \frac{d}{2\mu}[f(x + \mu e, \xi) - f(x - \mu e, \xi)]e$

are unbiased for the gradient of the smoothed function (Lobanov et al., 2022).

Operator-theoretic risk bounds: Embedding into RKHS and mapping back ensures that the excess risk and generalization error decrease as the sample size grows, and empirical Rademacher-complexity analysis ensures learning-theoretic soundness (Kumar et al., 30 Nov 2025).
Riemannian zeroth-order estimation: Given a manifold $\mathcal{M}$, Euclidean perturbation combined with projection onto the manifold provides a computationally efficient estimator for constrained federated settings; sublinear convergence and rates matching first-order methods are achieved (Wang et al., 30 Jul 2025).
Zeroth-order generator optimization in black-box settings: FedZGE estimates gradients with respect to synthetic data by finite differences between client outputs on the synthetic samples and on their perturbed copies, leveraging black-box client inference, and the server updates the generator by backpropagation through its own model parameters (Ma et al., 8 Mar 2025).

4. Privacy, Communication, and Robustness Properties

A core motivation for gradient-free federated learning is enhancing privacy and reducing the attack surface:

Privacy: Omission of gradient and Hessian exchange prevents gradient-inversion and deep leakage attacks. Protocols such as GRAFFL guarantee that summary statistics and discrepancies are insufficient to reconstruct raw data, even under collusion among clients or with the server (Hahn et al., 2020). The operator-theoretic approach enables differential privacy guarantees via one-shot perturbation at the data matrix level and supports FHE inference with minimal overhead (Kumar et al., 30 Nov 2025). FedZGE transmits only synthetic data and logits, never parameters or raw instances (Ma et al., 8 Mar 2025).
Communication Efficiency: All GFFL variants reduce communication compared to conventional FL. In FedXGBllr, communication is lowered by 25–700×, transmitting only tree structures instead of per-split statistics (Ma et al., 2023). The operator-theoretic method requires only a single round of $O(QC)$ scalar summaries (Kumar et al., 30 Nov 2025). Zeroth-order optimizer variants require transmission only of model vectors per round (Lobanov et al., 2022). FedZGE achieves intermediate communication costs, lying between pure distillation-FL and full model parameter transfer (Ma et al., 8 Mar 2025).
Robustness and Model Flexibility: By abstracting away from gradient transmission, GFFL methods naturally support heterogeneous client models and data distributions. For example, FedZGE handles settings where clients have divergent architectures, and operator-theoretic methods remain robust under extreme label or feature distributional heterogeneity (Ma et al., 8 Mar 2025, Kumar et al., 30 Nov 2025).

5. Empirical Results and Use Cases

Empirical benchmarks across these frameworks confirm the practical feasibility and advantages of GFFL. Notable results include:

GRAFFL recovers true GMM parameters in synthetic scenarios, improves AUC in imbalanced clinical data, and restores F₁ score from 0 to ≈1 in scarce-data vehicle datasets. Convergence is controlled via the accepted set size (Hahn et al., 2020).
FedXGBllr achieves equivalent or superior accuracy compared to gradient-sharing FL, with orders-of-magnitude reduction in communication. For example, on the a9a dataset, FedXGBllr reaches 85.1% accuracy with ≈6 MB of traffic, compared to 84.9% for centralized training and ≈150–4,200 MB of traffic in gradient-sharing FL (Ma et al., 2023).
FedZGE offers 3–5 point accuracy improvements over distillation-based FL in non-IID settings, matches data-free white-box FL within 1 point on CIFAR benchmarks, and uniquely supports model heterogeneity (Ma et al., 8 Mar 2025).
Operator-theoretic framework realizes up to 23.7 points gain over parameter-efficient FL on severely imbalanced splits, and supports accurate encrypted inference at practical latencies (Kumar et al., 30 Nov 2025).
Riemannian GFFL matches first-order FL convergence for constrained and manifold-valued models, with application to federated low-rank neural networks and adversarial attack generation (Wang et al., 30 Jul 2025).
Smoothing-based GFFL empirically demonstrates that $\ell_1$-randomization achieves lower error floors and greater adversarial noise resilience than $\ell_2$-randomization for non-smooth convex objectives (Lobanov et al., 2022).
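As a quick consistency check on the FedXGBllr figures quoted above, the ≈6 MB and ≈150–4,200 MB traffic numbers can be turned into reduction factors and compared with the 25–700× range stated in Section 4. The variable names are illustrative, and the inputs are the rounded values from the text.

```python
# Reported traffic on a9a, in megabytes (rounded values from the text).
fedxgbllr_mb = 6
gradient_sharing_mb = (150, 4200)

# Reduction factor at the low and high end of the gradient-sharing range.
ratios = tuple(baseline / fedxgbllr_mb for baseline in gradient_sharing_mb)
```

The two ratios come out to 25 and 700, matching the reported range.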

6. Limitations and Prospects for Further Research

While GFFL delivers significant advantages, several limitations remain:

Model Expressiveness: Some protocols, such as fixed-tree GFFL, cannot adapt model structure after the initial aggregation round, limiting adaptability to highly non-IID scenarios (Ma et al., 2023).
Communication and Hyperparameter Tuning: Methods involving finite-difference or manifold perturbations require explicitly balancing bias and variance via the smoothing parameter ($\mu$) and the mini-batch size to avoid accuracy loss in high dimensions (Lobanov et al., 2022, Wang et al., 30 Jul 2025).
Differential Privacy and Security: While many GFFL frameworks are compatible with DP and FHE, formal DP guarantees for all communication steps are not always present, and the support for adaptive or representation-learned feature encoders is partial (Kumar et al., 30 Nov 2025).
Future Directions: Ongoing research priorities include integration of advanced ABC samplers, hybrid or adaptive tree refinement under privacy budgets, encoder learnability under the operator-theoretic regime, decentralized or peer-to-peer GFFL, adaptive or variance-reduced zeroth-order schemes, and strong DP or secure aggregation protocols in all regimes.

7. Representative Comparison of GFFL Approaches

Framework	Local Computation	Communicated Artifact	Privacy/DP Support
GRAFFL (Hahn et al., 2020)	SuffiAE encoding, ABC distance eval	Summary statistics/distances	Irreversible summaries, ABC noise
FedXGBllr (Ma et al., 2023)	Full tree building, CNN rate tuning	Tree ensembles, CNN weights	No gradient exposure; extensions to DP possible
Operator-Theoretic (Kumar et al., 30 Nov 2025)	Kernel, KAHM encoding	Space-folding scalars	One-shot DP, FHE
Smoothing/Zeroth-Order (Lobanov et al., 2022, Wang et al., 30 Jul 2025)	Smoothing, finite-diff updates	Model vectors	Inherent due to lack of gradients
FedZGE (Ma et al., 8 Mar 2025)	Local SGD, model serving	Synthetic data, logits	No param/model sharing, black-box

The GFFL paradigm thus provides a spectrum of mathematically principled, communication-efficient, and privacy-enhancing protocols for federated optimization in settings where gradient-based coordination is impossible, undesirable, or unsafe.