Resilient Aggregation in Distributed Systems

Updated 11 November 2025
  • Resilient aggregation is a set of robust methods that combine distributed data or model updates to counter Byzantine faults and ensure privacy.
  • It employs trust-weighted scoring, outlier filtering, and cryptographic techniques to mitigate malicious inputs and maintain convergence.
  • Empirical studies demonstrate that approaches like Multi-Krum and Lagrange Coded Computing significantly enhance scalability and reliability in federated learning.

Resilient aggregation encompasses algorithmic and cryptographic methodologies for combining data, model updates, or expert advice from multiple sources in a manner robust to arbitrary faults (Byzantine behavior), data or process heterogeneity, privacy constraints, and network unreliability. In distributed machine learning—particularly federated learning (FL)—resilient aggregation refers specifically to aggregation rules and systems that prevent malicious participants or adversaries from corrupting the global model, while simultaneously guaranteeing efficiency, privacy, and convergence. The concept further generalizes to aggregation of expert signals in game-theoretic mechanisms and to robust computation over distributed devices in networked systems.

1. Formal Threat Models and Problem Definitions

Resilient aggregation is defined by the combination of the following threat and system models:

  • Byzantine Clients: Up to $b$ participants may act arbitrarily maliciously, submitting corrupted or strategically crafted updates, colluding, or becoming silent (dropout).
  • Honest-but-Curious Aggregators or Servers: The server may honestly execute the protocol while attempting to infer individual user data.
  • Colluding Subsets: Up to $t$ honest-but-curious users may collude in an attempt to obtain information about other users' updates.
  • Data Heterogeneity: Clients' data distributions $\mathcal{D}_i$ are non-identical, introducing heterogeneous gradient distributions.
  • Dropout Tolerance: The system must remain correct and privacy-preserving even if up to $p$ (possibly adversarial) nodes drop out in a round.
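The failure mode motivating these models can be seen in a few lines: under plain averaging, a single Byzantine client can move the aggregate arbitrarily far, while even a naive robust rule such as the coordinate-wise median stays bounded. A minimal sketch (illustrative only, using scalar updates and hypothetical function names):

```python
import statistics

def mean_aggregate(updates):
    """Plain FedAvg-style average of scalar updates."""
    return sum(updates) / len(updates)

def median_aggregate(updates):
    """Coordinate-wise median (scalar case): a simple Byzantine-robust rule."""
    return statistics.median(updates)

honest = [0.9, 1.0, 1.05, 1.1]       # honest clients agree on roughly 1.0
updates = honest + [1e6]             # one Byzantine client sends a huge value

corrupted = mean_aggregate(updates)  # dragged far from the honest values
robust = median_aggregate(updates)   # stays at an honest client's value
```

With one adversary among five clients, the mean exceeds 200,000 while the median remains 1.05, which is exactly the gap that resilient aggregation rules are designed to close.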

Robustness, privacy, and correctness goals are formalized in information-theoretic terms. Typical requirements include:

  • Byzantine resilience: The final aggregate (e.g., the global model update) cannot be arbitrarily deviated by any set of up to $b$ malicious clients; this is typically quantified by a $(b,\lambda)$-robustness or resilience constant, or in the $(\alpha, A)$-resilience sense of [Blanchard et al., 2017].
  • Privacy: The mutual information between any honest user's update and the view of the federator, or of any coalition of up to $t$ participants, is zero.
  • Convergence: Global model iterates must remain within a small neighborhood of the optimal solution and maintain non-decreasing network utility under repeated rounds.

2. Algorithmic and Cryptographic Mechanisms

Resilient aggregation protocols employ a combination of robust statistical methods and cryptographic primitives:

2.1 Trust-Weighted and Outlier-Filtering Aggregators

  • Trust-Weighted Aggregation: FLTrust-style rules apply a trust score $\tau_i$ (typically $\tau_i = \mathrm{ReLU}(\cos \theta_i)$ for the angle $\theta_i$ between a "trusted" reference and user $i$'s update) to attenuate or amplify users' gradients.
    • In ByITFL (Xia et al., 14 May 2024), $\tau_i$ is approximated using a degree-$k$ polynomial in the inner product between the public "root" update and the user's update, facilitating secret-shared polynomial evaluation.
  • Outlier Detection: Algorithms filter out suspicious updates by clustering distances, Mahalanobis or Euclidean norms, or model-based likelihood scoring.
    • Multi-Krum [Blanchard et al., 2017]: Selects the update vectors minimizing the aggregated Euclidean distance to their closest non-outlier peers, tolerating up to $A$ Byzantine vectors if $N > 2A + 2$ (So et al., 2020, Xia et al., 14 May 2024).
    • Robust clustering: CenterwO and MeanwO achieve 2-approximation to the outlier-robust 1-center/1-mean and thus nearly optimal resilience in both homogeneous and heterogeneous error scenarios (Yi et al., 2023).
  • Layer- and Coordinate-Adaptive Filters: For high-dimensional models and heterogeneously layered architectures, per-layer or per-coordinate sparsification (Top-$k$ selection) combined with layer-wise median Z-score filtering improves robustness (Xu et al., 2 Sep 2024).
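A plaintext sketch of the FLTrust-style rule described above, with cosine-based trust scores and norm rescaling; the polynomial approximation and secret sharing used by ByITFL are omitted, and all function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fltrust_aggregate(root_update, user_updates):
    """Trust-weighted aggregation: tau_i = ReLU(cos theta_i); each update is
    rescaled to the root's norm and combined with normalized trust weights."""
    root_norm = math.sqrt(sum(a * a for a in root_update))
    taus = [max(0.0, cosine(root_update, u)) for u in user_updates]
    total = sum(taus)
    if total == 0:
        return root_update[:]  # no trusted client: fall back to the root update
    agg = [0.0] * len(root_update)
    for tau, u in zip(taus, user_updates):
        unorm = math.sqrt(sum(a * a for a in u)) or 1.0
        scale = tau / total * (root_norm / unorm)
        for j, x in enumerate(u):
            agg[j] += scale * x
    return agg
```

An update pointing opposite the trusted reference gets $\cos\theta_i < 0$ and therefore zero weight, so a sign-flipping attacker is silently excluded from the aggregate.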

2.2 Cryptographic Secure Aggregation

  • Verifiable Secret Sharing (VSS): Users share secret-masked updates using Lagrange-coded polynomials or Shamir secret sharing, with verifiable commitments enforcing information-theoretic privacy and consistency even in the presence of malicious sharing (Xia et al., 14 May 2024, Egger et al., 11 Jun 2025).
  • Lagrange Coded Computing (LCC): Encodes vector updates into random low-degree polynomials, distributing evaluations as shares. LCC generalizes to enable distributed multiparty computation of polynomials (for trust scores, validation, and final aggregation).
  • Re-randomization and Masking: Multiplying secret shares raises the degree of the underlying polynomials and introduces algebraic structure; perfect secrecy is restored using additional polynomial sharing (sub-sharing and linear mixing), followed by re-randomization.
  • Dropout Resilience: Schemes handle dropouts by having the remaining subset of users participate in a rekeyed or recomputed mask removal step, sometimes leveraging aggregated group shares or precomputed mask shares (Zheng et al., 2022).
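The linearity that makes secret sharing useful for secure aggregation can be sketched directly: shares of different users' updates are added pointwise, and the server reconstructs only the sum. A simplified single-scalar Shamir example over a prime field (illustrative only; the cited schemes use Lagrange coding over vectors with verifiable commitments):

```python
import random

P = 2**31 - 1  # a Mersenne prime defining the field GF(P)

def share(secret, t, n):
    """Split `secret` into n Shamir shares: any t+1 reconstruct it,
    while any t shares reveal nothing (degree-t random polynomial)."""
    coeffs = [secret % P] + [random.randrange(P) for _ in range(t)]
    return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total
```

Because sharing is linear, the server can add two users' share vectors pointwise and reconstruct the sum of their secrets without ever seeing either secret individually.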

3. Information-Theoretic Guarantees and Robustness Conditions

A central feature of state-of-the-art resilient aggregation is information-theoretic privacy and robustness. ByITFL (Xia et al., 14 May 2024) proves:

  • Privacy: If $n \ge 2b + (k+1)(m + t - 1) + p + 1$, then any $t$ colluding users and the federator learn nothing about other users' updates: $I(\text{honest updates};\ \text{federator/coalition's view}) = 0$.
  • Byzantine Resilience: Reed–Solomon decoding on the set of secret-shared polynomials corrects up to $b$ malicious submissions and $p$ dropouts.
  • Correctness: The final aggregate $G = B/A$, with $A$ and $B$ reconstructed from shares, is a trust-weighted combination nearly indistinguishable from the non-private robust baseline (e.g., FLTrust).
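The participation condition above can be evaluated numerically; a small helper (hypothetical name) shows how the required number of clients grows with the polynomial degree $k$ and the collusion threshold $t$:

```python
def min_clients(b, k, m, t, p):
    """Smallest n satisfying ByITFL's stated condition
    n >= 2b + (k+1)(m + t - 1) + p + 1."""
    return 2 * b + (k + 1) * (m + t - 1) + p + 1

# The (k+1)(m+t-1) term dominates: each extra polynomial degree adds
# m+t-1 required clients, which is the overhead of approximating the
# nonlinear trust score under secret sharing.
```

For instance, with one Byzantine client, one colluder, $m = 1$, and no dropouts, a degree-1 approximation needs 5 clients, while degree 3 already needs 7.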

For robust clustering-based aggregation (Yi et al., 2023), $(f, \lambda)$-resilient averaging and $(f, \kappa)$-robustness guarantee that the output is within $O(f/(n - f))$ of the honest mean, with high-probability selection of a benign subset even under strategic Byzantine placement (sneak/siege attacks).
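As a toy illustration of this style of guarantee (not the CenterwO/MeanwO algorithms themselves), a trimmed mean that discards the $f$ extreme values on each side keeps the aggregate close to the honest mean even when $f$ of the $n$ inputs are adversarial:

```python
def trimmed_mean(values, f):
    """Drop the f smallest and f largest values, then average the rest.
    Tolerates up to f arbitrarily placed outliers (requires len(values) > 2f)."""
    s = sorted(values)
    kept = s[f:len(s) - f]
    return sum(kept) / len(kept)
```

With five honest scalar inputs near 1.0 and one Byzantine input of $10^9$, `trimmed_mean(..., f=1)` returns roughly 1.05, whereas the plain mean would exceed $10^8$.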

4. Performance, Scalability, and Empirical Observations

Empirical and complexity analyses demonstrate practical constraints and trade-offs inherent in resilient aggregation:

  • Communication and Computation: ByITFL's per-user cost is $O((d/m)\, n^3 + n^4)$ (dominated by IT-VSS and re-randomization broadcasts) and scales poorly in $n$ (Xia et al., 14 May 2024). Alternative approaches such as CenterwO and MeanwO reduce the per-round cost to $O(n^2 d)$ (Yi et al., 2023).
  • Convergence and Model Accuracy: On MNIST with 50% Byzantine clients, ByITFL matches FLTrust in accuracy and outperforms FedAvg and less-robust baselines (Xia et al., 14 May 2024). Cluster-based aggregation achieves worst-case accuracy gains exceeding 0.4 compared to all previous aggregation baselines on heterogeneous and homogeneous benchmarks (Yi et al., 2023).
  • Empirical Robustness: Layer-wise adaptive protocols (e.g., LASA) attain robust test accuracies with high true positive rates for benign-layer selection under a wide attack spectrum (Xu et al., 2 Sep 2024).
  • Practical Limitations: Information-theoretically private schemes are communication-heavy due to polynomial degree, VSS, and group broadcast requirements. Votes and multiple candidate updates (in two-phase frameworks) increase round cost, but are essential to adaptively counter diverse attack vectors.

5. Extensions, Limitations, and Open Problems

  • Limitations: High communication and cryptographic overheads limit the scalability of fully robust, information-theoretic secret-sharing approaches as $n$ grows (Xia et al., 14 May 2024). Deployment in wireless or asynchronous networks (beyond fault-tolerant synchrony) is not yet addressed by current techniques.
  • Possible Improvements: Future directions include more efficient polynomial approximations, subvector partitioning, batching, refined role separation in trust frameworks, and the use of more communication-efficient IT primitives.
  • Open Problems: Optimal design of polynomial or other nonlinear approximations for trust-based scoring under targeted or adaptive attacks, fine-grained trade-off analysis of privacy versus robustness in settings with stronger Byzantine capabilities, and adaptation of hierarchical robust aggregation schemes to heterogeneous network security domains remain open (Xia et al., 14 May 2024, Yi et al., 2023).

Resilient aggregation operates at the intersection of distributed optimization, robust statistics, cryptographic multi-party computation, and information design. The techniques (Lagrange Coded Computing, IT-VSS, clustering-based robust statistics, re-randomization) have been generalized to privacy-preserving decentralized learning, hierarchical aggregation over networked topologies, and game-theoretic information fusion. Beyond federated learning, they apply to domains such as distributed consensus, sensor networks, peer-to-peer training, and decision support systems involving strategic agent reporting (Arieli et al., 2023, Viroli et al., 2016).

In summary, resilient aggregation denotes a class of methods, algorithms, and system designs for combining inputs from distributed, potentially adversarial or heterogeneous sources, achieving robust, privacy-preserving, and efficient consensus or model updates under a wide range of faults and attack strategies. The field synthesizes advances in robust machine learning, information theory, and distributed systems, with state-of-the-art protocols achieving provable security and Byzantine tolerance in large-scale federated learning.
