Median of Means: Robust Statistical Estimator
- Median of Means is a robust estimator that partitions data into blocks and uses medians of block means to counteract outliers and heavy-tailed noise.
- It achieves minimax-optimal rates and non-asymptotic error bounds under minimal moment assumptions, ensuring reliable performance in noisy environments.
- Its algorithmic design facilitates integration into machine learning methods, supporting scalable, distributed, and online implementations for robust risk minimization.
The Median of Means (MoM) is a robust statistical estimator designed to achieve high-probability, near-optimal performance in the presence of heavy-tailed data and arbitrary contamination, notably in modern machine learning and statistical learning contexts. Unlike classical empirical means, which are highly sensitive to outliers, MoM methods rely on a blockwise median aggregation, conferring resistance to data corruption and providing well-characterized non-asymptotic guarantees under minimal moment assumptions.
1. Foundational Principles and Construction
The Median of Means methodology addresses the limitations of the standard mean estimator in both theoretical robustness and practical applications. For observations $X_1, \dots, X_n$, the MoM estimator operates as follows (a minimal code sketch appears after the list):
- Partition the dataset into $K$ (ideally equal-sized) disjoint blocks $B_1, \dots, B_K$.
- Compute the mean within each block: $\bar{X}_k = \frac{1}{|B_k|} \sum_{i \in B_k} X_i$ for $k = 1, \dots, K$.
- Aggregate the block means via the empirical median: $\hat{\mu}_{\mathrm{MoM}} = \mathrm{median}\big(\bar{X}_1, \dots, \bar{X}_K\big)$.
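A minimal sketch of this construction in Python is given below, assuming 1-D data; the function name `median_of_means`, the block count, and the synthetic heavy-tailed data are illustrative choices, not a reference implementation.

```python
import numpy as np

def median_of_means(x, n_blocks, rng=None):
    """Median-of-means estimate of the mean of a 1-D sample.

    The sample is shuffled, split into `n_blocks` (nearly) equal blocks,
    and the median of the block means is returned.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    perm = rng.permutation(len(x))               # random assignment to blocks
    blocks = np.array_split(x[perm], n_blocks)
    block_means = np.array([b.mean() for b in blocks])
    return np.median(block_means)

# Example: heavy-tailed data with a few gross outliers (illustrative sizes).
rng = np.random.default_rng(0)
data = rng.standard_t(df=2.5, size=10_000)       # finite variance, heavy tails
data[:20] = 1e6                                  # adversarial corruption
print(np.mean(data))                             # ruined by the outliers
print(median_of_means(data, n_blocks=50))        # close to the true mean 0
```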
In supervised learning, the same device applies to loss functions: for a predictor $f$ and loss $\ell$, form the blockwise average losses $\bar{\ell}_k(f) = \frac{1}{|B_k|} \sum_{i \in B_k} \ell\big(f(X_i), Y_i\big)$ and aggregate them by the median to define a robust risk estimator $\widehat{R}_{\mathrm{MoM}}(f) = \mathrm{median}\big(\bar{\ell}_1(f), \dots, \bar{\ell}_K(f)\big)$.
This procedure generalizes to more complex learning tasks, such as risk minimization, mean embedding estimation in kernel spaces, and robust empirical risk minimization (ERM) frameworks.
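The same blockwise logic gives a robust estimate of the risk of a fixed predictor, as in the sketch below; the helper name `mom_risk`, the block count, and the synthetic regression data are assumptions made for illustration only.

```python
import numpy as np

def mom_risk(losses, n_blocks, rng=None):
    """Median of blockwise average losses: a robust estimate of the risk of a fixed predictor."""
    rng = np.random.default_rng(rng)
    losses = np.asarray(losses, dtype=float)
    blocks = np.array_split(losses[rng.permutation(len(losses))], n_blocks)
    return np.median([b.mean() for b in blocks])

# Usage: evaluate a known linear predictor under squared loss on data with a
# few corrupted labels (all names and sizes below are illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 3))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + rng.normal(size=5_000)
y[:10] += 1e4                                 # a handful of gross outliers
losses = (X @ w - y) ** 2                     # per-sample squared errors
print(losses.mean())                          # empirical risk, blown up by outliers
print(mom_risk(losses, n_blocks=25))          # close to the noise level (about 1)
```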
2. Statistical Properties and Theoretical Guarantees
MoM estimators attain minimax-optimal rates of convergence under substantially weaker conditions than their classical counterparts. Key theoretical properties include:
- Minimal moment requirements: Only bounded second moments (finite variance) are needed for informative (inlier) data points, rather than sub-Gaussian tails or boundedness.
- Deviation inequalities: Non-asymptotic, high-probability error bounds hold even under heavy-tailed distributions. For univariate means with $K$ blocks,
$$\big|\hat{\mu}_{\mathrm{MoM}} - \mu\big| \le C\,\sigma\sqrt{\frac{K}{n}} \quad \text{with probability at least } 1 - e^{-cK},$$
for variance $\sigma^2$ and universal constants $C, c > 0$, provided a majority of blocks are not contaminated (a derivation sketch appears after this list).
- Robustness to arbitrary outliers: Outliers can be adversarial, with no assumption on their distribution or structure; the estimator's error is insensitive as long as less than half of the blocks are contaminated.
- Breakdown point and breakdown number: The effective breakdown "number" quantifies the maximal number of outliers tolerated while retaining optimal statistical accuracy. For estimation of a $d$-dimensional mean with noise variance $\sigma^2$, this is on the order of $d$; for $s$-sparse vectors, on the order of $s\log(ed/s)$. Beyond this, the estimation error increases at most linearly with the outlier fraction.
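The univariate deviation bound follows from a short, standard argument; the sketch below assumes i.i.d. data with finite variance $\sigma^2$, blocks of equal size $n/K$, and no contaminated blocks. Each block mean satisfies, by Chebyshev's inequality,
$$\mathbb{P}\Big(\big|\bar{X}_k - \mu\big| > 2\sigma\sqrt{K/n}\Big) \;\le\; \frac{\operatorname{Var}(\bar{X}_k)}{4\sigma^2 K/n} \;=\; \frac{\sigma^2 K/n}{4\sigma^2 K/n} \;=\; \frac{1}{4}.$$
The MoM estimate can deviate from $\mu$ by more than $2\sigma\sqrt{K/n}$ only if at least $K/2$ of the block means deviate by that amount on the same side, so Hoeffding's inequality applied to the corresponding Bernoulli indicators gives
$$\mathbb{P}\Big(\big|\hat{\mu}_{\mathrm{MoM}} - \mu\big| > 2\sigma\sqrt{K/n}\Big) \;\le\; 2\,\mathbb{P}\Big(\mathrm{Bin}\big(K, \tfrac{1}{4}\big) \ge \tfrac{K}{2}\Big) \;\le\; 2\,e^{-K/8}.$$
Choosing $K \asymp \log(1/\delta)$ therefore yields an error of order $\sigma\sqrt{\log(1/\delta)/n}$ with probability at least $1 - \delta$.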
3. Practical Computability and Algorithmic Integration
MoM estimators are naturally compatible with a wide range of machine learning algorithms and are scalable for large datasets:
- Blockwise algorithms: Standard optimization algorithms (gradient descent, subgradient methods, coordinate descent, proximal methods such as ISTA/FISTA, ADMM) can be adapted by using the "median block" data at each update step (a gradient-descent sketch follows this list).
- Block selection and shuffling: Re-randomization or shuffling of blocks between iterations can improve convergence and facilitate outlier detection.
- Distributed and parallel settings: MoM blockwise structure enables "map-reduce" style computation, well suited for distributed or memory-limited environments.
- Examples: For LASSO regression, the usual empirical risk is replaced by MoM-aggregated block risks, with robust parameter updates derived from the median block.
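The sketch below illustrates the median-block rule for plain least squares; the function name, step size, block count, and synthetic data are assumptions for illustration, not a reference implementation of any published algorithm.

```python
import numpy as np

def mom_gradient_descent(X, y, n_blocks=20, lr=0.05, n_iter=500, seed=0):
    """Least-squares fit by MoM gradient descent (illustrative sketch).

    At every step the data are reshuffled into blocks, the block whose average
    loss is the median is selected, and the gradient is taken on that block only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        idx_blocks = np.array_split(rng.permutation(n), n_blocks)
        losses = [np.mean((X[b] @ w - y[b]) ** 2) for b in idx_blocks]
        med = idx_blocks[np.argsort(losses)[n_blocks // 2]]   # median-loss block
        grad = 2 * X[med].T @ (X[med] @ w - y[med]) / len(med)
        w -= lr * grad
    return w

# Usage on data with a few grossly corrupted responses.
rng = np.random.default_rng(3)
X = rng.normal(size=(2_000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + rng.normal(size=2_000)
y[:5] = 1e5                                    # gross outliers
print(mom_gradient_descent(X, y).round(2))     # close to w_true = [1, 2, 3, 4, 5]
```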
4. Applications Across Machine Learning Domains
MoM principles have been extended and rigorously studied in diverse contexts:
- Classification: MoM minimizers for empirical risk (replacing mean losses with blockwise medians) yield classifiers provably robust to outliers and with performance matching ERM under weak assumptions, even for unbounded convex surrogates.
- Kernel methods: MoM-based mean embedding estimators robustify kernel mean and maximum mean discrepancy (MMD) estimators, as well as related algorithms for causality, distributional testing, and generative models.
- Clustering: Both classical and model-based clustering frameworks integrate MoM to estimate centroids, mitigating sensitivity to initialization and noise, with theoretical consistency and high-fidelity empirical recovery even at high outlier rates (a minimal Lloyd-style sketch follows this list).
- Density estimation and optimal transport: MoM-enhanced kernel density estimators deliver high-probability accuracy under broad contamination scenarios. In optimal transport and Wasserstein distance estimation, MoM enables construction of estimators with provable consistency and robustness, facilitating robust training in WGANs and similar generative models.
- Regression and high-dimensional estimation: MoM-based ensemble methods have been proposed for robust model selection, hyperparameter tuning, and risk minimization, even on data with heavy tails or adversarial salt-and-pepper noise.
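As one concrete illustration of the clustering point above, the sketch below plugs a coordinate-wise MoM centre estimate into plain Lloyd iterations; it is a simplified stand-in, not the specific MoM clustering procedures studied in the literature, and all names and defaults are assumptions.

```python
import numpy as np

def mom_centroid(points, n_blocks, rng):
    """Coordinate-wise median-of-means estimate of a cluster centre."""
    n = len(points)
    k = min(n_blocks, n)                        # never more blocks than points
    blocks = np.array_split(points[rng.permutation(n)], k)
    return np.median([b.mean(axis=0) for b in blocks], axis=0)

def mom_kmeans(X, n_clusters, n_blocks=10, n_iter=50, seed=0):
    """Lloyd-style k-means with a MoM centre-update step (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Standard assignment step; robustified centre-update step.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([
            mom_centroid(X[labels == c], n_blocks, rng) if np.any(labels == c)
            else centers[c]                      # keep an empty cluster's centre
            for c in range(n_clusters)
        ])
    return centers, labels
```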
5. Outlier Detection and Depth Measures
MoM methods provide integrated tools for outlier detection (a scoring sketch follows this list):
- Data points are assigned a "depth score" by tracking the frequency with which they appear in the block selected as median across multiple MoM iterations.
- Informative or central data repeatedly occur in median blocks; outliers are rarely selected.
- This mechanism yields a natural, task-relevant outlier ranking and can be exploited for unsupervised anomaly detection or pre-filtering contaminated data prior to model fitting.
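A minimal sketch of this depth-scoring idea for univariate data is shown below; the helper name `mom_depth_scores`, the number of repetitions, and the planted contamination are illustrative assumptions.

```python
import numpy as np

def mom_depth_scores(x, n_blocks, n_iter=200, seed=0):
    """Score each point by how often it lands in the median block.

    Points that rarely (or never) appear in the median block across shuffled
    repetitions are flagged as likely outliers.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    counts = np.zeros(len(x))
    for _ in range(n_iter):
        idx_blocks = np.array_split(rng.permutation(len(x)), n_blocks)
        means = [x[b].mean() for b in idx_blocks]
        med_block = idx_blocks[np.argsort(means)[n_blocks // 2]]
        counts[med_block] += 1
    return counts / n_iter                        # empirical "depth" in [0, 1]

# Usage: planted outliers receive depth scores near zero.
rng = np.random.default_rng(4)
x = rng.normal(size=1_000)
x[:10] = 50.0
scores = mom_depth_scores(x, n_blocks=40)
print(scores[:10].mean(), scores[10:].mean())     # near 0 vs. roughly 1/n_blocks
```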
6. Relationships to Other Robust Estimation Techniques
MoM estimators complement and, in many settings, surpass classical robust estimators (e.g., trimmed means, M-estimators, Huber estimators, Stahel–Donoho estimators):
- Unlike M-estimators, MoM does not require tuning influence functions or assumptions on the fraction or mechanism of contamination.
- MoM estimators offer non-asymptotic, minimax-optimal guarantees without strong tail or independence assumptions.
- In high-dimensional settings and structured tasks (e.g., sparse vectors, low-rank matrix estimation), MoM provides breakdown numbers and finite-sample error rates unattainable by classical approaches under minimal regularity.
7. Comparison Table: Key Features of MoM Estimators
| Feature | Median of Means (MoM) | Empirical Mean | Robust M-estimator |
|---|---|---|---|
| Requires sub-Gaussian tails | No | Yes | Varies |
| Adversarial outlier robustness | Yes (up to breakdown number) | No | Yes (typically assumes known contamination rate) |
| Minimax-optimal rate | Yes | Only in the sub-Gaussian case | Not always |
| Outlier detection | Natural (via block centrality and depth) | No | Usually not |
| Distributed/online compatibility | Yes | Yes | Often more complex |
| Computational complexity | Comparable to the mean for non-kernel methods | Lower (mean only) | Higher (iterative) |
Conclusion
Median of Means estimators constitute a central development in modern robust statistics and machine learning, synthesizing blockwise aggregation with median selection to provide resilience against both heavy-tailed noise and arbitrary contamination. The MoM framework is versatile, admitting a broad range of algorithmic adaptations for high- and low-dimensional data, supporting robust learning, estimation, and inference. It combines theoretical guarantees—minimax rates, well-defined breakdown thresholds, and explicit non-asymptotic error bounds—with ease of implementation and practical outlier detection, making it a foundational component in the design of robust algorithms for real-world data analysis.