
Federated Learning With L0 Constraint Via Probabilistic Gates For Sparsity (2512.23071v1)

Published 28 Dec 2025 in stat.ML, cs.DC, cs.LG, and cs.SE

Abstract: Federated Learning (FL) is a distributed machine learning setting that requires multiple clients to collaborate on training a model while maintaining data privacy. The unaddressed inherent sparsity in data and models often results in overly dense models and poor generalizability under data and client participation heterogeneity. We propose FL with an L0 constraint on the density of non-zero parameters, achieved through a reparameterization using probabilistic gates and their continuous relaxation: originally proposed for sparsity in centralized machine learning. We show that the objective for L0 constrained stochastic minimization naturally arises from an entropy maximization problem of the stochastic gates and propose an algorithm based on federated stochastic gradient descent for distributed learning. We demonstrate that the target density (rho) of parameters can be achieved in FL, under data and client participation heterogeneity, with minimal loss in statistical performance for linear and non-linear models: Linear regression (LR), Logistic regression (LG), Softmax multi-class classification (MC), Multi-label classification with logistic units (MLC), Convolution Neural Network (CNN) for multi-class classification (MC). We compare the results with a magnitude pruning-based thresholding algorithm for sparsity in FL. Experiments on synthetic data with target density down to rho = 0.05 and publicly available RCV1, MNIST, and EMNIST datasets with target density down to rho = 0.005 demonstrate that our approach is communication-efficient and consistently better in statistical performance.

Summary

  • The paper introduces FLoPS, an L0-constrained framework that employs probabilistic gates to enforce sparsity in federated learning models.
  • It formulates a constrained optimization linking L0 constraints with entropy maximization, ensuring precise density control even under non-IID data and client heterogeneity.
  • The paper presents FLoPS-PA, a communication-efficient variant that compresses active parameter indices without compromising predictive performance.

Federated Learning With $L_0$ Constraint Via Probabilistic Gates for Sparsity

Introduction and Problem Setting

This paper addresses fundamental limitations in conventional Federated Learning (FL) concerning model sparsity. Despite FL's success in privacy-preserving distributed learning, classical FL approaches typically produce excessively dense models. This over-parameterization leads to poor generalization under non-IID data and heterogeneous client participation, and to excessive resource consumption on edge devices. The authors propose an $L_0$-constrained FL framework, FLoPS, that enforces a hard constraint on the proportion of active model parameters by leveraging reparameterization via continuous probabilistic gates.

$L_0$-Constrained Optimization Formulation

The paper outlines a constrained optimization problem over the global model's parameters, enforcing $\|\theta\|_0 / |\theta| \leq \rho$, where $\rho$ is a pre-specified target density. Handling the $L_0$ pseudo-norm in gradient-based frameworks is non-trivial due to its discontinuity. To circumvent this, the authors employ parameter reparameterization via a set of stochastic binary gates (following the Hard Concrete distribution), as proposed in the context of centralized $L_0$ regularization [Louizos et al., 2017].
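
To make the gating mechanism concrete, the following minimal sketch (NumPy, illustrative names) shows hard concrete gate sampling and the closed-form expected density it induces, following the standard parameterization of Louizos et al.; the paper's exact hyperparameters and implementation may differ.

```python
import numpy as np

def sample_hard_concrete(log_alpha, beta=2.0/3.0, gamma=-0.1, zeta=1.1, rng=None):
    """Sample relaxed binary gates z in [0, 1] from the hard concrete distribution.

    log_alpha : per-parameter location parameters (the learnable gate logits).
    beta      : temperature of the concrete relaxation.
    gamma,zeta: stretch interval endpoints (gamma < 0 < 1 < zeta).
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    # Reparameterized sample of the binary concrete variable.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
    # Stretch to (gamma, zeta) and clamp to [0, 1], giving exact zeros/ones
    # with non-zero probability.
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_density(log_alpha, beta=2.0/3.0, gamma=-0.1, zeta=1.1):
    """Expected fraction of non-zero gates, i.e. E[||z||_0] / |theta|."""
    p_nonzero = 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))
    return float(np.mean(p_nonzero))
```

In this scheme the effective weights are $\theta = \tilde{\theta} \odot z$, and the expected density above is the differentiable quantity on which a constraint of the form $\le \rho$ can be imposed.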

A key conceptual contribution is the connection between $L_0$-constrained optimization and entropy maximization over gate state distributions, formalized using Lagrangian duality. This formulation reveals an intrinsic relationship between the sparsity constraint and a variational free energy objective from statistical physics, further connecting FL with mean-field variational inference.
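
Written out, the constrained objective this refers to takes roughly the following form; this is a sketch reconstructed from the description above (notation and the exact entropy/free-energy formulation may differ from the paper), with the Lagrangian making the role of the dual variable $\lambda$ explicit.

```latex
% Reconstructed sketch of the constrained objective and its Lagrangian relaxation.
% q_\phi is the hard-concrete distribution over gates z; maximizing the gate
% entropy H(q_\phi) under the same density budget gives the dual/free-energy
% view mentioned in the text. Notation may differ from the paper.
\begin{align}
  \min_{\tilde{\theta},\,\phi}\;
    \mathbb{E}_{z \sim q_\phi}\!\big[\mathcal{L}(\tilde{\theta} \odot z)\big]
  \quad \text{s.t.} \quad
    \frac{\mathbb{E}_{z \sim q_\phi}\!\big[\|z\|_0\big]}{|\theta|} \le \rho,
  \\[4pt]
  \mathcal{J}(\tilde{\theta}, \phi, \lambda)
    = \mathbb{E}_{z \sim q_\phi}\!\big[\mathcal{L}(\tilde{\theta} \odot z)\big]
      + \lambda\!\left(\frac{\mathbb{E}_{z \sim q_\phi}\!\big[\|z\|_0\big]}{|\theta|} - \rho\right),
  \qquad \lambda \ge 0.
\end{align}
```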

Distributed Algorithms: FLoPS and FLoPS-PA

The core algorithm, FLoPS, utilizes server-coordinated, synchronous updates with reparameterized gradients propagated over both parameter and gate distributions. The updates to gates $\phi$ and model parameters $\tilde{\theta}$ are performed collaboratively, with the Lagrange dual $\lambda$ governing adherence to the sparsity target. The method includes a restart mechanism for $\lambda$ when the constraint is satisfied, ensuring stable enforcement of the density constraint.
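
A simplified, runnable sketch of one such round is given below. It replaces hard concrete sampling with mean-field sigmoid gates for brevity, uses a toy gated linear-regression client step, and should be read as an illustration of the aggregation-plus-dual-ascent-with-restart pattern described above rather than the authors' exact algorithm; all function and variable names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_update(theta, phi, lam, X, y, lr=0.05):
    """Toy client step for gated linear regression (mean-field sigmoid gates
    instead of hard-concrete sampling, purely for brevity). The dual variable
    lam weights a density penalty on the expected gate activations."""
    g = sigmoid(phi)                               # expected gate activations
    resid = X @ (theta * g) - y
    grad_theta = (X.T @ resid) / len(y) * g
    grad_phi = ((X.T @ resid) / len(y) * theta + lam / theta.size) * g * (1.0 - g)
    return theta - lr * grad_theta, phi - lr * grad_phi

def flops_round(clients, theta, phi, lam, rho, lr_dual=1.0, frac=0.5, rng=None):
    """One illustrative server round: partial client participation, FedAvg-style
    aggregation of weights and gate logits, then dual ascent on the density
    constraint with the restart rule described in the text."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(clients), size=max(1, int(frac * len(clients))), replace=False)
    updates = [local_update(theta, phi, lam, *clients[i]) for i in idx]
    theta = np.mean([u[0] for u in updates], axis=0)
    phi = np.mean([u[1] for u in updates], axis=0)
    density = float(np.mean(sigmoid(phi)))         # expected fraction of active gates
    lam = 0.0 if density <= rho else lam + lr_dual * (density - rho)  # restart on satisfaction
    return theta, phi, lam
```

Each entry of `clients` is assumed to be a local `(X, y)` pair, e.g. `clients = [(np.random.randn(50, 20), np.random.randn(50)) for _ in range(10)]`, and the round is iterated until the expected density settles near `rho`.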

Recognizing the communication bottleneck in FL, the paper introduces FLoPS-PA, a "parameter averaging" variant that dramatically reduces the communication cost by transmitting only indices corresponding to the top-m active parameters (per the learned sparsity), and compressing the associated gate values. This design achieves practical communication efficiency without sacrificing statistical performance.
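
As a rough illustration of this kind of index-based compression (not necessarily the paper's exact wire format), a client could transmit only the indices, weights, and gate values of the m most strongly activated gates:

```python
import numpy as np

def compress_update(theta, gate_prob, m):
    """Keep only the m parameters with the highest gate activation probability;
    transmit their integer indices plus the corresponding weight and gate values."""
    idx = np.argpartition(gate_prob, -m)[-m:]
    return (idx.astype(np.int32),
            theta[idx].astype(np.float32),
            gate_prob[idx].astype(np.float32))

def decompress_update(idx, values, gates, dim):
    """Server-side reconstruction of the sparse update into dense arrays."""
    theta = np.zeros(dim, dtype=np.float32)
    gate = np.zeros(dim, dtype=np.float32)
    theta[idx] = values
    gate[idx] = gates
    return theta, gate
```

With m on the order of $\rho\,|\theta|$ and fixed-width indices and values, the per-round upload scales with $\rho$ rather than with the full model size, which is consistent with the savings reported for large models and very low $\rho$.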

Experimental Evaluation

Extensive experiments are reported over both synthetic and real-world high-dimensional datasets (RCV1, MNIST, EMNIST), systematically varying data heterogeneity and client participation rates. The evaluation leverages metrics such as True Discovery Rate (TDR), $R^2$, accuracy, and cross-entropy to quantify both sparsity recovery and predictive performance.
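
For reference, one common way to compute a support-recovery TDR on synthetic data, where the true non-zero pattern is known, is sketched below; the paper's exact metric definition may differ.

```python
import numpy as np

def true_discovery_rate(estimated_support, true_support):
    """Proportion of selected (non-zero) parameters that are truly non-zero,
    i.e. the complement of the false discovery rate. Illustrative definition;
    the paper's exact formulation may differ."""
    est = set(np.flatnonzero(estimated_support))
    true = set(np.flatnonzero(true_support))
    return len(est & true) / max(1, len(est))
```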

The results demonstrate that:

  • FLoPS achieves precise control over model density, gradually matching the desired parameter sparsity during training, even for extreme $\rho$ as low as $0.005$ (Figure 1 tracks the evolution of the expected number of active gates during training in linear regression, highlighting convergence to the prescribed target density).
  • Sparsity recovery (TDR) and statistical performance are consistently superior to magnitude-based iterative thresholding (FedIter-HT), particularly under high data heterogeneity, low client participation, and high correlations in features.
  • Communication savings with FLoPS-PA are substantial, especially for large models and very low $\rho$. This variant achieves efficient training with only a minor tradeoff in performance, outperforming dense FedAvg approaches when magnitude pruning is deferred to post-training.
  • Dynamic patterns of sparsity emerge during training, as visualized by soft Jaccard/IoU heatmaps of the active gates (a small sketch of this similarity measure follows the list). FLoPS first explores the parameter space, then rapidly settles into a stable, high-performing configuration, in contrast to the more erratic patterns seen with FedIter-HT.
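
A soft Jaccard/IoU score between the gate-activation vectors of two training rounds can be computed as below (a standard soft IoU; the paper's heatmaps may use a different variant):

```python
import numpy as np

def soft_jaccard(g1, g2, eps=1e-8):
    """Soft Jaccard/IoU similarity between two vectors of gate activation
    probabilities in [0, 1]: sum of element-wise minima over sum of maxima."""
    return float(np.sum(np.minimum(g1, g2)) / (np.sum(np.maximum(g1, g2)) + eps))
```

Evaluating this score for every pair of rounds yields the round-by-round similarity heatmap described in the bullet above.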

Theoretical and Practical Implications

This work provides a formal bridge between energy-based modeling, variational inference, and sparsity-constrained distributed optimization. By embedding the $L_0$ constraint into the probabilistic gate mechanism, it enables precise, differentiable control over sparsity and offers a principled alternative to heuristic or magnitude-based pruning. The experimental findings suggest that enforcing explicit density constraints during FL training is substantially more robust than post hoc sparsification, which is especially important for resource-constrained clients and communication-efficient FL settings.

On a practical level, the methods are directly applicable to high-dimensional tasks (e.g., NLP, vision), and the compression mechanism can benefit deployment across severely bandwidth/compute-limited networks. The capacity of FLoPS/FLoPS-PA to maintain generalization under severe non-IID data and low active density thresholds indicates immediate relevance for on-device intelligence (e.g., smartphones, IoT).

Future Directions

Potential avenues for extension include:

  • Decentralized FL architectures: FLoPS currently assumes centralized orchestration; generalization to decentralized protocols promises broader applicability.
  • Richer model classes: Extension to structured sparsity, group sparsity, or convolutional architectures.
  • Adaptive/learned sparsity schedules: Dynamically adjusting ρ\rho during training for improved generalization or resource adaptation.
  • Theoretical analysis: Formal convergence and generalization bounds under probabilistic gating and nonconvex landscapes.

Conclusion

The paper introduces a theoretically principled, communication- and computation-efficient FL method enforcing hard sparsity constraints via probabilistic reparameterization. The FLoPS framework achieves significant improvements in statistical performance, precise sparsity, and communication efficiency over magnitude-based alternatives, particularly under heterogeneous and lossy participation. The theoretical unification with entropy maximization, variational inference, and the detailed empirical analysis together present a comprehensive advance for practical, scalable federated learning in resource-constrained environments.
