Gradient-Free Federated Learning
- Gradient-Free Federated Learning is a set of decentralized methods that bypass explicit gradient computation using forward passes and summary statistics to safeguard privacy and accommodate limited client resources.
- These techniques employ strategies like zeroth-order optimization, likelihood-free inference, and projection-based algorithms to handle heterogeneous data and device architectures while reducing communication overhead.
- They report near-state-of-the-art performance, in several cases with formal privacy and convergence guarantees, though challenges such as increased variance and slower convergence persist.
Gradient-free federated learning (GFFL) is a class of distributed machine learning approaches that train models collaboratively across decentralized data silos without computing or exchanging explicit gradients. These methods are motivated by four concerns: privacy risks from gradient leakage, client hardware that cannot run backpropagation efficiently, heterogeneity of devices and models, and the lower communication and storage overhead of operating without gradient information. GFFL algorithms use likelihood-free inference, zeroth-order optimization, model/rule composition, and other operator-theoretic or combinatorial methods to train effectively while strictly controlling the information flow between clients and servers.
1. Core Principles of Gradient-Free Federated Learning
GFFL replaces standard first-order optimization—in which gradients of a loss or likelihood with respect to model parameters form the atomic unit of communication—with mechanisms that rely on function evaluations, summary statistics, aggregated outputs, or low-dimensional surrogates. The main motivations and principles are:
- Privacy Enhancement: Avoiding explicit sharing of gradients or model states mitigates the risk of gradient-based inversion attacks and reduces the attack surface for data leakage (Hahn et al., 2020, Feng et al., 2023, Ma et al., 8 Mar 2025).
- Client Hardware Constraints: Many edge devices lack resources for storing backward-pass activations or running full backpropagation; inference-only protocols are compatible with quantization and pruning (Feng et al., 2023).
- Model and Data Heterogeneity: GFFL approaches can accommodate varying local model architectures and statistical distributions, as communication depends only on forward passes or coarse summaries (Ma et al., 8 Mar 2025).
- Communication Efficiency: Summary-based or function-evaluation-based protocols can substantially reduce the number and size of messages exchanged per training round, especially when clever surrogates are used (Ma et al., 2023, Kumar et al., 30 Nov 2025).
2. Algorithmic Paradigms and Methodologies
The spectrum of GFFL encompasses several prominent algorithmic templates:
2.1 Likelihood-Free Gradient-Free Inference
The "GRAFFL" framework applies Approximate Bayesian Computation (ABC) in a federated manner, where each client shares only compressed sufficient statistics—computed via a tailored variational autoencoder architecture, "SuffiAE"—and discrepancy measures between statistics of real and synthetic data. The global parameter posterior is approximated by an accept/reject ABC loop, never requiring a gradient computation or sharing of raw data, gradients, or model weights (Hahn et al., 2020).
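The accept/reject loop can be illustrated with a minimal sketch. It assumes a toy Gaussian-mean model, a uniform prior, and the sample mean as a stand-in for the learned SuffiAE summary; the function names (`client_discrepancy`, `federated_abc`) and all numeric settings are hypothetical and are not taken from the GRAFFL paper. Clients return only a scalar discrepancy per proposal, so no raw data, gradients, or weights leave the client.

```python
import random

def client_discrepancy(local_data, theta, rng):
    """Client side: simulate data under theta and return one scalar discrepancy
    between the real and synthetic summary statistics."""
    n = len(local_data)
    synthetic = [rng.gauss(theta, 1.0) for _ in range(n)]
    real_summary = sum(local_data) / n      # stand-in for a learned summary (assumption)
    synth_summary = sum(synthetic) / n
    return abs(real_summary - synth_summary)

def federated_abc(clients, n_draws, threshold, seed=0):
    """Server side: propose theta from the prior and accept it when the
    mean client discrepancy falls below the threshold."""
    rng = random.Random(seed)  # one shared generator for brevity; real clients would use their own
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)      # assumed uniform prior
        mean_disc = sum(client_discrepancy(c, theta, rng) for c in clients) / len(clients)
        if mean_disc < threshold:
            accepted.append(theta)
    return accepted

# Toy setup: four clients, each holding samples from N(2, 1).
data_rng = random.Random(1)
clients = [[data_rng.gauss(2.0, 1.0) for _ in range(100)] for _ in range(4)]
posterior = federated_abc(clients, n_draws=1000, threshold=0.3)
estimate = sum(posterior) / len(posterior)
```

The accepted draws form an approximate posterior sample; a tighter threshold gives a sharper approximation at the cost of fewer accepted draws.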
2.2 Zeroth-Order and Randomized Optimization
Protocols such as "BAFFLE" and "FedZGE" leverage random finite-difference gradient estimators (zeroth-order schemes). Clients evaluate loss (or output) differences between perturbed parameterizations or input samples and return these differences to the server, which reconstructs an unbiased gradient estimate for parameter updates (Feng et al., 2023, Ma et al., 8 Mar 2025). This approach extends to settings where only black-box access to models is available (e.g., for knowledge distillation engines or generator training in data-free/black-box FL).
Randomized smoothing and two-point evaluation, either via $\ell_1$- or $\ell_2$-balls, are used for non-smooth convex stochastic optimization, leading to unbiased gradient estimators with dimension-dependent or dimension-free variance scaling (Lobanov et al., 2022).
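A two-point zeroth-order round can be sketched as follows. The sketch assumes directions drawn uniformly from the Euclidean unit sphere and the standard estimator g = d/(2τ) · (f(x + τe) − f(x − τe)) · e; the toy quadratic client losses, step size, and function names are illustrative assumptions rather than the BAFFLE or FedZGE protocols. Each client performs two forward evaluations per direction and returns one scalar.

```python
import math
import random

def sphere_direction(dim, rng):
    """Sample a direction uniformly from the Euclidean unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def client_difference(loss_fn, params, direction, tau):
    """Client side: two forward evaluations, one scalar returned."""
    plus = [p + tau * e for p, e in zip(params, direction)]
    minus = [p - tau * e for p, e in zip(params, direction)]
    return loss_fn(plus) - loss_fn(minus)

def server_estimate(client_losses, params, tau, n_dirs, rng):
    """Server side: average the scalar differences over clients and rebuild
    a gradient estimate from the directions it sampled itself."""
    dim = len(params)
    grad = [0.0] * dim
    for _ in range(n_dirs):
        e = sphere_direction(dim, rng)
        diff = sum(client_difference(f, params, e, tau) for f in client_losses) / len(client_losses)
        scale = dim * diff / (2.0 * tau * n_dirs)
        for i in range(dim):
            grad[i] += scale * e[i]
    return grad

def make_quadratic(center):
    """Toy client loss (assumption): squared distance to a client-specific center."""
    return lambda x: sum((xi - ci) ** 2 for xi, ci in zip(x, center))

centers = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
client_losses = [make_quadratic(c) for c in centers]
rng = random.Random(0)
params = [0.0, 0.0, 0.0]
for _ in range(300):
    g = server_estimate(client_losses, params, tau=1e-3, n_dirs=4, rng=rng)
    params = [p - 0.05 * gi for p, gi in zip(params, g)]
# The average loss is minimized at the mean of the client centers, [2, 1, 1].
```

Because the server samples the directions (or shares a seed with the clients), the uplink per direction is a single number regardless of model dimension; the price is the variance of the estimate, which grows with the dimension.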
2.3 Projection-Based and Manifold-Constrained Algorithms
For Riemannian optimization problems, such as low-rank matrix learning or learning under structural constraints, projection-based zeroth-order federated SGD can be employed. Euclidean sampling and projection constitute a computationally efficient alternative to tangent-space sampling on manifolds, yielding unbiased estimators of the Riemannian gradient with favorable bias-variance properties (Wang et al., 30 Jul 2025).
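The idea of sampling in the ambient Euclidean space and projecting onto the manifold can be illustrated on the unit sphere. This is a hedged sketch under simplifying assumptions, not the estimator analyzed by Wang et al.: the manifold (unit sphere), the linear alignment losses, the step size, and all function names are chosen only for illustration.

```python
import math
import random

def project_to_sphere(v):
    """Projection onto the unit sphere: rescale to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def random_unit_vector(dim, rng):
    """Euclidean (ambient-space) sampling; no tangent-space construction needed."""
    return project_to_sphere([rng.gauss(0.0, 1.0) for _ in range(dim)])

def client_difference(loss_fn, x, u, tau):
    """Client side: evaluate the loss at two projected perturbations."""
    plus = project_to_sphere([xi + tau * ui for xi, ui in zip(x, u)])
    minus = project_to_sphere([xi - tau * ui for xi, ui in zip(x, u)])
    return loss_fn(plus) - loss_fn(minus)

def zo_projected_step(client_losses, x, lr, tau, rng):
    """Server side: one zeroth-order step followed by projection back onto the sphere."""
    dim = len(x)
    u = random_unit_vector(dim, rng)
    diff = sum(client_difference(f, x, u, tau) for f in client_losses) / len(client_losses)
    scale = dim * diff / (2.0 * tau)
    return project_to_sphere([xi - lr * scale * ui for xi, ui in zip(x, u)])

def make_alignment_loss(target):
    """Toy client loss (assumption): negative inner product with a client target."""
    return lambda x: -sum(xi * ti for xi, ti in zip(x, target))

targets = [[2.0, 2.0, 2.0, 0.0], [0.0, 2.0, 2.0, 0.0]]
client_losses = [make_alignment_loss(t) for t in targets]
rng = random.Random(0)
x = [1.0, 0.0, 0.0, 0.0]
for _ in range(1000):
    x = zo_projected_step(client_losses, x, lr=0.02, tau=1e-3, rng=rng)
# The average loss on the sphere is minimized at the normalized mean target, [1/3, 2/3, 2/3, 0].
```

Every iterate stays on the manifold by construction, and the clients never need to know its tangent-space structure.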
2.4 Gradient-Free Rule or Model Composition
Approaches such as federated XGBoost with learnable learning rates ("FedXGBllr") eliminate per-split gradient exchange by collecting local ensembles and optimizing a global surrogate predictor formed by learning scalar weights (per-tree learning rates) via a lightweight federated CNN. All communication is decoupled from gradient statistics and per-split aggregation, yielding both privacy and communication improvements (Ma et al., 2023).
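The core idea of freezing local trees and learning only one scalar weight per tree can be sketched as below. Note the substitution: FedXGBllr learns the weights with a small federated CNN, whereas this sketch swaps in a derivative-free coordinate pattern search so that clients return only scalar losses. The decision stumps, the data, and all names are hypothetical.

```python
def make_stump(threshold):
    """Stand-in for a frozen, locally trained tree (assumption): a decision stump."""
    return lambda x: 1.0 if x > threshold else 0.0

def client_loss(data, trees, weights):
    """Client side: mean squared error of the weighted ensemble on local data."""
    total = 0.0
    for x, y in data:
        pred = sum(w * t(x) for w, t in zip(weights, trees))
        total += (pred - y) ** 2
    return total / len(data)

def federated_loss(clients, trees, weights):
    """Server side: average of the scalar losses reported by the clients."""
    return sum(client_loss(d, trees, weights) for d in clients) / len(clients)

def fit_tree_weights(clients, trees, step=0.5, tol=1e-4):
    """Coordinate pattern search over per-tree weights; the trees stay frozen."""
    weights = [1.0] * len(trees)            # start from a plain unweighted sum
    best = federated_loss(clients, trees, weights)
    while step >= tol:
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                loss = federated_loss(clients, trees, trial)
                if loss < best:
                    weights, best, improved = trial, loss, True
                    break
        if not improved:
            step /= 2.0                     # refine the search once no move helps
    return weights, best

trees = [make_stump(0.5), make_stump(0.25)]

def target(x):
    # Ground truth expressible by the frozen trees with weights (2.0, 0.5).
    return 2.0 * trees[0](x) + 0.5 * trees[1](x)

clients = [
    [(x, target(x)) for x in (0.1, 0.3, 0.6, 0.9)],
    [(x, target(x)) for x in (0.2, 0.4, 0.7, 0.8)],
]
initial_loss = federated_loss(clients, trees, [1.0, 1.0])
weights, final_loss = fit_tree_weights(clients, trees)
```

Only the weight vector and scalar losses cross the network in this sketch, which mirrors the decoupling of communication from per-split gradient statistics.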
Operator-theoretic frameworks deploy kernel machines and space folding measures, constructing global prediction rules—using only scalar summaries and avoiding backpropagation altogether—that can be deployed efficiently, privately, and with compatibility for cryptographic inference (Kumar et al., 30 Nov 2025).
3. Privacy, Security, and Communication Efficiency
GFFL achieves privacy benefits through design choices that restrict the type and granularity of exchanged information:
- No sharing of gradients, raw input data, or client-side model weights; only outputs, summary statistics, discrepancies, or fixed-feature representations may be exchanged (Hahn et al., 2020, Ma et al., 8 Mar 2025, Kumar et al., 30 Nov 2025).
- Sufficient dimensionality reduction (e.g., SuffiAE, randomized smoothing) hinders the possibility of reconstructing original data from compressed summaries (Hahn et al., 2020).
- Addition of noise for differential privacy or one-shot perturbation of client summaries eliminates the need for round-by-round privacy accounting (Kumar et al., 30 Nov 2025).
- Secure aggregation via client-specific zero-sum noise can protect additive updates from individual leakage (Feng et al., 2023).
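Zero-sum masking, as mentioned in the last item, can be illustrated with a short sketch. It assumes pairwise masks that cancel in the sum; in a deployed protocol each pair of clients would derive its mask from a shared secret rather than from one central generator, and the mask range and function names here are illustrative assumptions.

```python
import random

def pairwise_masks(n_clients, dim, rng):
    """Build one mask per client so that all masks sum to zero:
    for each pair (i, j), client i adds a random vector and client j subtracts it."""
    masks = [[0.0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]
                masks[j][k] -= shared[k]
    return masks

rng = random.Random(0)
updates = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.5, 0.2], [0.2, 0.1, 0.0]]
masks = pairwise_masks(len(updates), dim=3, rng=rng)

# Each client uploads only its masked update.
masked = [[u + m for u, m in zip(upd, msk)] for upd, msk in zip(updates, masks)]

# The server sums the masked uploads; the masks cancel up to rounding error.
aggregate = [sum(col) for col in zip(*masked)]
true_sum = [sum(col) for col in zip(*updates)]
```

The server recovers the sum of the updates while each individual upload is dominated by its mask.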
Communication-efficient protocols, such as the surrogate ensemble method for XGBoost or space-folding kernel machines, can reduce the number of communication rounds by factors of 25–700 relative to standard gradient-sharing approaches, and decrease transmitted data by orders of magnitude (Ma et al., 2023, Kumar et al., 30 Nov 2025).
4. Theoretical Guarantees and Convergence Analysis
- Gradient-free algorithms based on randomized smoothing and two-point finite differences match the communication and complexity guarantees of standard first-order stochastic methods (up to dimension-dependent factors) in non-smooth convex optimization, and these guarantees extend to minibatch-accelerated schemes (Lobanov et al., 2022).
- For projection-based zeroth-order optimization on Riemannian manifolds, sublinear convergence rates matching first-order federated rates (with linear speedup in number of clients) are achieved, provided the manifold is prox-smooth and the loss functions are sufficiently smooth (Wang et al., 30 Jul 2025).
- Operator-theoretic frameworks provide explicit finite-sample risk/error bounds, Rademacher complexities for the induced hypothesis space, and closed-form kernel or space-folding approximations with non-asymptotic rates. Differential privacy properties are maintained with one-shot noise-injected releases and kernel smoothing (Kumar et al., 30 Nov 2025).
5. Empirical Evaluation and Applications
Extensive empirical studies across GFFL approaches demonstrate:
- Near-optimal or state-of-the-art performance on tabular, vision, language, and time-series datasets, with minimal accuracy degradation relative to full-gradient or centralized training (Hahn et al., 2020, Feng et al., 2023, Ma et al., 8 Mar 2025, Ma et al., 2023, Kumar et al., 30 Nov 2025).
- Substantial resilience under data heterogeneity: gradient-free methods can sustain high accuracy even under highly non-IID client distributions, imbalanced or missing classes, and mixing of model architectures (Ma et al., 8 Mar 2025, Kumar et al., 30 Nov 2025).
- Communication and computation gains: reduction in memory use (e.g., BAFFLE achieves 5–10% of BP footprint), significant cuts in communication bytes/rounds, and compatibility with inference-optimized hardware (Feng et al., 2023, Ma et al., 2023).
- Flexible support for settings such as black-box model access (FedZGE), constrained/structured parameter spaces (projection-based on manifolds), operator-based prediction heads (kernel/space-folded rules), and federated ensemble tuning (FedXGBllr).
6. Limitations and Future Directions
- Stochastic zeroth-order optimization suffers from increased variance, requiring more function queries for large models; further, the convergence rate can be slower and the final accuracy may lag gradient-based baselines by several percent unless advanced variance-reduction or optimization strategies are used (Feng et al., 2023).
- ABC-based federated inference requires careful tuning of the acceptance threshold or the number of accepted draws; dynamic or adaptive ABC could ameliorate this (Hahn et al., 2020).
- Communication cost per iteration in ABC or zeroth-order methods may still scale with sample size or model dimensionality; batch techniques (e.g., sequential Monte Carlo/ABC, surrogate modeling) are under active investigation (Hahn et al., 2020, Kumar et al., 30 Nov 2025).
- Model flexibility after local ensemble freezing (as in gradient-free XGBoost) may be limited; all adaptation occurs by learning ensemble weights rather than exploring new model structure (Ma et al., 2023).
- While formal privacy guarantees are provided in operator-theoretic DP protocols, most GFFL approaches have not yet established universally tight DP-ε bounds for parameter or summary-statistic sharing, though the frameworks enable integration with established DP mechanisms (Kumar et al., 30 Nov 2025).
7. Representative Gradient-Free Federated Learning Methods
| Framework | Main Methodology | Key Features |
|---|---|---|
| GRAFFL (Hahn et al., 2020) | ABC via SuffiAE summaries | Likelihood-free, privacy via sufficiency, generative |
| BAFFLE (Feng et al., 2023) | Zeroth-order finite diff. | Inference-only hardware, client privacy, memory gains |
| FedZGE (Ma et al., 8 Mar 2025) | ZO black-box distillation | Data/model heterogeneity, communication efficiency |
| L1/L2 ZO FL (Lobanov et al., 2022) | $\ell_1$/$\ell_2$ randomization | Non-smooth convex optimization, high-dim performance |
| Riemannian ZO FL (Wang et al., 30 Jul 2025) | Projection-based ZO SGD | Manifold-constrained learning, sublinear convergence |
| FedXGBllr (Ma et al., 2023) | Surrogate ensemble weights | Gradient-free tabular FL, XGBoost, comm. savings |
| Operator-theoretic (Kumar et al., 30 Nov 2025) | RKHS/space folding | Scalar summaries, DP, FHE-compatible, fast convergence |
Each approach is tailored for distinct settings—privacy-sensitive generative modeling (Hahn et al., 2020), deep neural network training on resource-constrained clients (Feng et al., 2023), data-free black-box knowledge transfer (Ma et al., 8 Mar 2025), convex optimization on nonsmooth losses (Lobanov et al., 2022), optimization under manifold constraints (Wang et al., 30 Jul 2025), federated GBDT for tabular data (Ma et al., 2023), and universal operator-theoretic compositional heads supporting differential privacy and homomorphic encryption (Kumar et al., 30 Nov 2025).
For detailed pseudocode, risk bounds, or implementation-specific formulas, readers are referred to the respective source publications.