Comparative Elimination Filter
- The Comparative Elimination (CE) filter is a robust aggregation method that discards the highest-norm or furthest-distance reports to mitigate Byzantine faults in distributed and federated learning.
- It operates by sorting gradients by norm, eliminating the $f$ largest reports, and averaging the remainder, which keeps the per-round cost low while remaining provably robust under bounded-variance and redundancy conditions.
- Empirical evaluations show that the CE filter converges linearly to a small neighborhood of the optimum with minimal overhead, outperforming several alternative robust aggregation algorithms.
The Comparative Elimination (CE) filter is a norm-based robust aggregation mechanism designed for Byzantine fault tolerance in distributed machine learning systems. Its principal function is to eliminate the influence of agents that may send arbitrary, potentially adversarial information (so-called Byzantine agents) by discarding the largest-norm or furthest-distance reports before the model update; in the D-SGD setting it is also referred to as comparative gradient elimination (CGE). The CE filter has been shown to provide provable resilience to a bounded fraction of Byzantine faults under standard stochastic assumptions together with strong convexity or the Polyak–Łojasiewicz (PL) condition, in both convex and nonconvex optimization settings. Its computational simplicity and minimal distributional assumptions distinguish it within the robust aggregation literature (Gupta et al., 2020, Dutta et al., 2023).
1. Byzantine Fault Model in Distributed and Federated Learning
Distributed Stochastic Gradient Descent (D-SGD) and federated optimization frameworks comprise $n$ agents and a centralized, trusted server or coordinator. Each agent independently conducts stochastic optimization, sampling data points from an unknown common distribution per round. In the fault-tolerant setting, up to $f$ agents may be Byzantine: able to collude, deviate arbitrarily from the prescribed algorithm, or inject malicious vectors or models. The remaining $n - f$ honest agents sample data i.i.d. and compute stochastic gradients or local models. Typical goal statements require finding a minimizer of the aggregate objective: either the expected loss $\mathbb{E}_{z}[\ell(w; z)]$ for D-SGD, or the minimizer of $\sum_{i \in \mathcal{H}} Q_i(w)$ for local/federated optimization, where $\mathcal{H}$ is the set of honest agents and $Q_i$ is agent $i$'s local cost (Gupta et al., 2020, Dutta et al., 2023).
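To make this setting concrete, a toy simulation of the fault model is sketched below; the quadratic honest loss, the noise level, and the gradient-reverse attack are illustrative assumptions (the attack type is among those evaluated in Section 6), not the papers' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, d = 10, 3, 5  # agents, Byzantine budget, model dimension

def honest_gradient(w):
    # Stochastic gradient of an assumed quadratic honest loss
    # Q(w) = 0.5 * ||w - 1||^2, with additive noise of std 0.1.
    return (w - np.ones(d)) + 0.1 * rng.standard_normal(d)

def byzantine_gradient(w):
    # Gradient-reverse attack: a scaled, sign-flipped honest gradient.
    return -10.0 * honest_gradient(w)

def collect_reports(w):
    # The server receives n reports and cannot tell which are faulty.
    return np.stack([honest_gradient(w) for _ in range(n - f)] +
                    [byzantine_gradient(w) for _ in range(f)])
```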
2. Comparative Elimination Filter: Algorithmic Formulation
The CE filter operates by discarding the $f$ largest-norm reports at each aggregation round. In D-SGD, the server receives gradient vectors $g_1^t, \ldots, g_n^t$, computes their Euclidean norms, sorts them, and retains only the $n - f$ smallest-norm gradients. The update step computes the mean of the surviving gradients:

$$w^{t+1} = w^t - \eta_t \cdot \frac{1}{n-f} \sum_{j \in \mathcal{F}^t} g_j^t,$$

where $\mathcal{F}^t$ indexes the $n - f$ smallest-norm reports. In federated/local SGD settings, the CE filter instead discards the $f$ local models with the largest distance from the current global model $w^t$. The server update is then:

$$w^{t+1} = \frac{1}{n-f} \sum_{j \in \mathcal{F}^t} w_j^t,$$

where $\mathcal{F}^t$ now indexes the $n - f$ closest agents. The filter implicitly requires redundancy conditions (e.g., $n > 2f$ together with "$2f$-redundancy") for robust convergence (Dutta et al., 2023).
Pseudocode: D-SGD CE Filter
```
Input: current iterate w^t, claimed gradients {g_i^t}_{i=1}^n, elimination budget f
Step 1 (Receive): each agent i sends g_i^t.
Step 2 (CE filter): compute norms r_i = ∥g_i^t∥₂; sort and keep the n − f smallest.
Step 3 (Update): w^{t+1} ← w^t − η_t · (mean of kept gradients)
```
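A minimal runnable sketch of both CE variants in NumPy follows; the function names and the $(n, d)$ array layout are assumptions for illustration, not a reference implementation from the cited papers.

```python
import numpy as np

def ce_filter_gradients(gradients, f):
    """Norm-based CE step (D-SGD): keep the n - f smallest-norm gradient
    reports and return their mean. `gradients` is an (n, d) array of
    claimed gradients; `f` is the elimination budget."""
    norms = np.linalg.norm(gradients, axis=1)
    keep = np.argsort(norms)[: len(gradients) - f]  # n - f smallest norms
    return gradients[keep].mean(axis=0)

def ce_filter_models(local_models, global_model, f):
    """Distance-based CE step (federated variant): keep the n - f local
    models closest to the current global model and return their mean."""
    dists = np.linalg.norm(local_models - global_model, axis=1)
    keep = np.argsort(dists)[: len(local_models) - f]
    return local_models[keep].mean(axis=0)

# One D-SGD round, assuming `grads` stacks the agents' claimed gradients:
#   w = w - eta * ce_filter_gradients(grads, f)
```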
3. Fault Tolerance Conditions and Theoretical Guarantees
The CE filter guarantees fault tolerance against a bounded fraction $f/n$ of Byzantine agents under classical stochastic assumptions. Required conditions include bounded gradient-noise variance ($\mathbb{E}\|g_i^t - \nabla Q_i(w^t)\|^2 \le \sigma^2$ for honest agents $i$), Lipschitz gradients, and strong convexity or the PL inequality $\tfrac{1}{2}\|\nabla Q(w)\|^2 \ge \mu\,(Q(w) - Q^*)$. For D-SGD, the tolerable Byzantine fraction $f/n$ is bounded by a margin determined by the ratio of the strong-convexity (or PL) parameter to the Lipschitz constant. In federated optimization, "$2f$-redundancy" ensures that the minimizer of the honest aggregate remains unchanged after discarding any $f$ reports, provided $n > 2f$. A linear convergence rate holds in both convex and PL nonconvex regimes, up to a steady-state error that depends on the noise variance $\sigma^2$, the Byzantine fraction $f/n$, and the step size, with explicit expressions for the contraction factor and bias (Gupta et al., 2020, Dutta et al., 2023). Under the stochastic PL-rate theorem, federated/local SGD with the CE filter converges linearly to the honest optimum plus a variance-induced bias.
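Schematically, with constants left implicit (the exact expressions are derived in the cited papers), the guarantee takes the contraction form

$$\mathbb{E}\!\left[\|w^{t+1} - w^*\|^2\right] \;\le\; \rho\, \mathbb{E}\!\left[\|w^{t} - w^*\|^2\right] + B, \qquad \rho \in (0,1),$$

where the contraction factor $\rho$ shrinks with the strong-convexity/PL margin and the bias $B$ grows with $\sigma^2$ and $f/n$. Unrolling over $T$ rounds gives $\mathbb{E}\|w^{T} - w^*\|^2 \le \rho^T \|w^0 - w^*\|^2 + B/(1-\rho)$, i.e., linear convergence to a neighborhood of size proportional to $B/(1-\rho)$.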
4. Comparative Analysis: CE Filter and Alternative Robust Aggregators
The following table contrasts CE filter properties with other prominent Byzantine-robust aggregation algorithms:
| Filter | Per-round complexity | Fault-tolerance constraints |
|---|---|---|
| Multi-KRUM | $O(n^2 d)$ pairwise distances | $n \ge 2f + 3$ |
| Geometric median-of-means | Iterative (e.g., Weiszfeld) per block | $f < n/2$ |
| Spectral methods (SEVER) | Dominated by top singular-vector computation | Bounded corrupted fraction |
| CWTM, signSGD | $O(nd \log n)$ coordinate-wise sorting | Distributional assumptions (unimodal/symmetric) |
| Comparative Elimination (CE) | $O(nd + n \log n)$ norms plus sort | $f/n$ below a strong-convexity/PL margin; $2f$-redundancy (federated) |
The CE filter requires only norm computations and sorting per round, dispensing with geometric medians, blockwise operations, or spectral computations. It robustly tolerates Byzantine fractions up to theoretical margins set by strong convexity/PL parameters and the redundancy of honest cost functions.
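In fact, since the filter only needs the set of $n - f$ smallest statistics (their order is irrelevant once they are averaged), the full sort can be replaced by partial selection. The use of `np.argpartition` below is an implementation choice assumed here, not one prescribed by the papers:

```python
import numpy as np

def ce_keep_indices(norms, f):
    """Indices of the n - f smallest entries of `norms`, via O(n)
    average-case partial selection instead of an O(n log n) full sort."""
    k = len(norms) - f
    return np.argpartition(norms, k - 1)[:k]
```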
5. Hyper-Parameter Selection and Operating Conditions
CE filter operation depends critically on accurate hyper-parameter selection:
- Step size ($\eta$ or $\eta_t$): a sufficiently small fixed step size yields linear convergence to a neighborhood, while a diminishing schedule ($\eta_t \propto 1/t$) recovers standard SGD behavior. For PL nonconvex settings, the step size must additionally satisfy a bound involving the PL and Lipschitz constants.
- Batch size ($b$): larger batches decrease the per-agent gradient variance ($\sigma^2/b$), shrinking the neighborhood bias; typical values range up to $256$.
- Elimination budget ($f$): the server must estimate an upper bound on the number of Byzantine agents and set the CE filter to discard exactly $f$ reports per round.
- Exponential averaging: optional; a geometric moving average of each agent's claimed gradients with coefficient $\beta \in (0,1)$ lowers variance and stabilizes convergence at minimal computational cost (see the sketch at the end of this section).
- Redundancy constraints: for federated settings, $2f$-redundancy of the honest cost functions (with $n > 2f$) is required.
This configuration ensures the CE filter remains statistically efficient and algorithmically robust under adversarial conditions (Gupta et al., 2020, Dutta et al., 2023).
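A configuration sketch tying these knobs together is given below; the dataclass layout, the default values, and the geometric-moving-average form of the exponential averaging are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class CEConfig:
    eta: float = 0.05     # fixed step size, small enough for the linear-rate regime
    batch_size: int = 64  # larger batches shrink the variance term sigma^2 / b
    f: int = 3            # elimination budget: estimated bound on Byzantine agents
    beta: float = 0.9     # exponential-averaging coefficient in (0, 1)

def smoothed_reports(prev_avg, reports, beta):
    # Per-agent exponential averaging of claimed gradients (arrays):
    #   g_bar^t = beta * g_bar^{t-1} + (1 - beta) * g^t
    return beta * prev_avg + (1.0 - beta) * reports
```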
6. Empirical Performance and Evaluation
Empirical evaluations on neural networks (MNIST with LeNet) and synthetic regression/classification tasks highlight the CE filter's practical impact. Benchmarked against the geometric median (GeoMed), median-of-means (MoM), coordinate-wise trimmed mean (CWTM), and Multi-KRUM, CE achieves test accuracy within 1–3% of the best robust method under several Byzantine fault types (gradient-reverse, label-flip, “norm-confusing”). Representative per-iteration times: CE 0.57 s, versus GeoMed 2.47 s, MoM 1.15 s, Multi-KRUM 2.22 s, and CWTM 0.90 s. With exponential averaging, CE approaches its theoretical robustness limits (Gupta et al., 2020).
In nonconvex PL regression and classification settings, simulations show that CE outperforms Multi-KRUM and CWTM in proximity to the honest optimum. Increasing the number of local steps per round further accelerates convergence (Dutta et al., 2023).
7. Summary and Practical Implications
The Comparative Elimination filter robustifies distributed and federated SGD against Byzantine faults by eliminating the $f$ largest-norm reports per round. Its salient features include:
- Fault tolerance under bounded variance and convexity/PL growth conditions
- Linear convergence to a small neighborhood of the global optimum
- Minimal computational overhead, requiring only sorting and averaging
- Proven empirical efficacy under diverse fault models
A plausible implication is that CE filter methodology can be generally applied wherever per-round norm or distance statistics are easily computed and strong redundancy is present among honest agents, particularly in communication-limited and large-scale federated learning deployments (Gupta et al., 2020, Dutta et al., 2023).