Mean estimation and regression under heavy-tailed distributions--a survey
Abstract: We survey some of the recent advances in mean estimation and regression function estimation. In particular, we describe sub-Gaussian mean estimators for possibly heavy-tailed data both in the univariate and multivariate settings. We focus on estimators based on median-of-means techniques but other methods such as the trimmed mean and Catoni's estimator are also reviewed. We give detailed proofs for the cornerstone results. We dedicate a section on statistical learning problems--in particular, regression function estimation--in the presence of possibly heavy-tailed data.
Explain it Like I'm 14
What is this paper about?
This paper looks at a very basic task in statistics: estimating the average (the “mean”) of data, and building regression models (predicting one thing from another). It focuses on situations where the data can have “heavy tails”—that is, you sometimes see very large, rare values (outliers). In such cases, the usual average can be unreliable. The authors explain several smarter ways to estimate averages and do regression so that the answers are accurate and trustworthy even when outliers occur.
Questions the paper asks
- How can we estimate an average from data that might include extreme, rare values without being fooled by those outliers?
- Can we make estimators with high confidence guarantees (meaning they are accurate “with high probability”), similar to what we would get with perfectly nice, bell-shaped (Gaussian) data?
- How do we extend these ideas from one number (univariate data) to many numbers at once (vectors, or multivariate data)?
- How can these ideas help in machine learning, especially in building regression models when errors or inputs can be heavy-tailed?
- Are there limits—what is possible and what is not—if we only assume the data has a finite variance?
How did the authors approach the problem?
The main challenge is that outliers can pull the ordinary average far away from the true mean. The paper surveys several “robust” estimators—methods designed to resist the influence of outliers—then analyzes how accurate they can be, with clear probability guarantees.
Key idea: “Sub-Gaussian” performance
Imagine your data were perfectly nice and followed a bell curve. Then the error of the sample mean shrinks like:
- roughly $\sigma/\sqrt{n}$ on average, and
- with probability at least $1-\delta$ it is at most about $\sigma\sqrt{\log(1/\delta)/n}$,
where $n$ is the number of samples, $\sigma$ is the standard deviation, and $\delta$ is the allowed failure probability (say, $\delta = 0.05$ means 95% confidence). The paper asks: can we get this same kind of guarantee ("sub-Gaussian") even if the data are heavy-tailed? Surprisingly, yes, if we use the right estimators.
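Spelled out (this is a schematic version of the definition used in the survey; the constant $c$ differs between results), an estimator $\widehat{\mu}_n$ of the mean $\mu$ is called sub-Gaussian if

$$\mathbb{P}\left\{\, \left|\widehat{\mu}_n - \mu\right| > \frac{c\,\sigma\sqrt{\log(2/\delta)}}{\sqrt{n}} \,\right\} \le \delta .$$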
Median-of-means (MoM)
- Split your data into groups (blocks) of roughly equal size.
- Compute the average within each group.
- Take the median of those group averages.
Why this works: An outlier may ruin one group’s mean, but it’s unlikely to ruin most groups. The median ignores a few bad groups and keeps the typical ones. This simple trick gives strong accuracy guarantees even when the usual average fails.
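A minimal sketch of the procedure in Python (the block count, random seed, and Pareto test data are illustrative choices, not taken from the paper):

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Median-of-means: split the data into blocks of roughly equal size,
    average each block, and return the median of the block averages."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)      # fixed seed so the split is reproducible
    x = rng.permutation(x)              # shuffle so the blocks are exchangeable
    block_means = [block.mean() for block in np.array_split(x, n_blocks)]
    return float(np.median(block_means))

# Example with heavy-tailed (Pareto) data, where the plain average is easily thrown off.
rng = np.random.default_rng(1)
data = rng.pareto(2.5, size=10_000)     # true mean = 1 / (2.5 - 1) ≈ 0.667
print("sample mean     :", data.mean())
print("median of means :", median_of_means(data, n_blocks=30))
```

More blocks tolerate more contaminated groups, which is why the block count is tied to the confidence level discussed in the key points below.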
Key points:
- It achieves sub-Gaussian-like error: about $\sigma\sqrt{\log(1/\delta)/n}$ with probability at least $1-\delta$.
- You must choose the number of groups based on your desired confidence level $\delta$: roughly proportional to $\log(1/\delta)$ groups.
- It also works under different moment assumptions: if only a finite moment of order $1+\alpha$ exists (for some $0<\alpha\le 1$) you still get a valid, though slower, rate, and if a third moment exists the constants improve further.
Catoni’s estimator
- Instead of averaging raw values, it solves a modified equation that “down-weights” large deviations (outliers) using a carefully designed influence function.
- Think of it as a smart average that grows more slowly when confronted with big values.
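A minimal sketch of this estimator in Python, assuming a known bound on the variance (the influence function below is Catoni's standard choice; the simplified tuning rule for alpha and the t-distributed test data are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_psi(x):
    """Catoni's influence function: grows only logarithmically,
    so huge observations have limited pull on the estimate."""
    return np.where(x >= 0, np.log1p(x + 0.5 * x**2), -np.log1p(-x + 0.5 * x**2))

def catoni_mean(x, delta, var_bound):
    """Sketch of Catoni's M-estimator of the mean.

    delta     : target failure probability
    var_bound : a known (or assumed) upper bound on the variance
    alpha follows a common simplified recipe: sqrt(2 log(1/delta) / (n * var_bound))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * var_bound))
    # The estimate is the root (in theta) of sum_i psi(alpha * (x_i - theta)) = 0,
    # which is a decreasing function of theta, so bracketed root-finding works.
    f = lambda theta: np.sum(catoni_psi(alpha * (x - theta)))
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

# Illustrative use with heavy-tailed data (values chosen for the example only).
rng = np.random.default_rng(2)
data = rng.standard_t(df=3, size=5_000)   # t(3): heavy tails, finite variance = 3
print("sample mean :", data.mean())
print("Catoni mean :", catoni_mean(data, delta=0.01, var_bound=3.0))
```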
Key points:
- It can match optimal sub-Gaussian guarantees.
- It needs a tuning parameter related to $n$ and $\delta$ and, ideally, a bound on the variance $\sigma^2$.
- If you don't know $\sigma^2$, there are adaptive tricks (like Lepski's method) to pick a good parameter from the data.
Trimmed mean
- Sort the data.
- Chop off the largest and smallest few percent (potential outliers).
- Average the rest.
Key points:
- If you trim a fraction proportional to $\log(1/\delta)/n$ from each end, you get sub-Gaussian-like performance (this is the choice used in the sketch below).
- It is conceptually simple and robust, even if some data points are adversarially corrupted.
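A minimal sketch with the trimming level tied to $\log(1/\delta)/n$ (the constant 8 and the test data below are illustrative choices, not the paper's exact constants):

```python
import numpy as np

def trimmed_mean(x, delta, c=8.0):
    """Sketch of the trimmed mean: sort, drop an eps-fraction of the smallest
    and largest values with eps ~ c * log(1/delta) / n, and average the rest.
    The constant c is an illustrative choice."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    eps = min(0.25, c * np.log(1.0 / delta) / n)   # trimming fraction per side
    k = int(np.ceil(eps * n))                      # number of points dropped per side
    return float(x[k:n - k].mean())

# Illustrative use on heavy-tailed data with a finite variance.
rng = np.random.default_rng(3)
data = rng.pareto(2.5, size=10_000)                # true mean ≈ 0.667
print("sample mean  :", data.mean())
print("trimmed mean :", trimmed_mean(data, delta=0.01))
```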
Moving to multivariate data (vectors)
When each data point is a vector (many numbers at once), we want the estimate to be close in Euclidean distance. The paper adapts robust ideas to this case:
- Coordinate-wise median-of-means: take MoM for each coordinate separately. This is simple but can be less sharp in high dimensions.
- Geometric median-of-means: instead of per-coordinate medians, take the geometric median of block means (the point minimizing the sum of distances). This is a convex optimization problem, so it's computable and yields dimension-free guarantees (no explicit dependence on the number of coordinates). A code sketch of both median-of-means variants follows after this list.
- Thresholding the norm (Catoni–Giulini): shrink data points with very large norms toward zero before averaging. This is easy to compute, but needs a prior bound on how spread out the data are, and care to avoid issues like not being translation-invariant.
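To make the two median-of-means variants above concrete, here is a minimal sketch (Weiszfeld's iteration is a standard way to compute the geometric median; the block count, dimension, and heavy-tailed test data are illustrative assumptions):

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Geometric median via Weiszfeld's iterations: the point minimizing
    the sum of Euclidean distances to the given points."""
    y = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - y, axis=1), 1e-12)  # avoid division by zero
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

def multivariate_mom(X, n_blocks, geometric=True):
    """Median-of-means for vectors: average within blocks, then combine the
    block means either coordinate-wise or via the geometric median."""
    rng = np.random.default_rng(0)
    X = rng.permutation(np.asarray(X, dtype=float))
    block_means = np.array([b.mean(axis=0) for b in np.array_split(X, n_blocks)])
    if geometric:
        return geometric_median(block_means)
    return np.median(block_means, axis=0)   # coordinate-wise median-of-means

# Illustrative use: heavy-tailed coordinates in d = 20 dimensions, true mean 0.
rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=(20_000, 20))
print("error of sample mean    :", np.linalg.norm(X.mean(axis=0)))
print("error of geometric MoM  :", np.linalg.norm(multivariate_mom(X, n_blocks=30)))
```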
Median-of-means tournaments
This is a clever way to get truly sub-Gaussian behavior for vectors under very mild assumptions (just finite covariance). Imagine comparing candidate mean points by how well they fit the data across blocks, and choosing the one that consistently “wins” these comparisons:
- Partition data into blocks and compute block means.
- For any two candidate points $a$ and $b$, see which one is closer to the block means on most blocks ("$a$ defeats $b$" if $a$ is closer on more than half of the blocks).
- Choose a point whose "losing radius" is as small as possible: any candidate that defeats it has to lie close by, so beyond that small radius it wins against everything.
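The actual estimator minimizes this "losing radius" over all of $\mathbb{R}^d$ and is computationally delicate; the sketch below is only an illustration of the comparison rule, restricted (as an assumption made for illustration) to a finite candidate set, here the block means themselves:

```python
import numpy as np

def mom_tournament_finite(X, n_blocks, candidates):
    """Illustration of the median-of-means tournament comparison rule over a
    finite candidate set: return the candidate whose farthest "defeater"
    (a candidate closer to the block means on more than half of the blocks)
    is as close as possible."""
    rng = np.random.default_rng(0)
    X = rng.permutation(np.asarray(X, dtype=float))
    Z = np.array([b.mean(axis=0) for b in np.array_split(X, n_blocks)])  # block means

    def defeats(a, b):
        closer = np.linalg.norm(Z - a, axis=1) <= np.linalg.norm(Z - b, axis=1)
        return closer.sum() > n_blocks / 2

    radii = []
    for a in candidates:
        r = [np.linalg.norm(a - b) for b in candidates if defeats(b, a)]
        radii.append(max(r) if r else 0.0)
    return candidates[int(np.argmin(radii))]

# Illustrative use: candidates taken to be the block means themselves.
rng = np.random.default_rng(5)
X = rng.standard_t(df=3, size=(10_000, 10))      # heavy tails, true mean 0
Zc = np.array([b.mean(axis=0) for b in np.array_split(X, 25)])
est = mom_tournament_finite(X, n_blocks=25, candidates=Zc)
print("error of tournament estimate:", np.linalg.norm(est))
```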
Key points:
- It achieves a high-probability error bound of the form $\sqrt{\mathrm{Tr}(\Sigma)/n} + \sqrt{\lambda_{\max}(\Sigma)\,\log(1/\delta)/n}$ (up to a constant factor), where $\Sigma$ is the covariance matrix, $\mathrm{Tr}(\Sigma)$ is its trace (the sum of the coordinate variances), and $\lambda_{\max}(\Sigma)$ is its largest eigenvalue.
- This is “dimension-free” and mirrors what you would expect in the Gaussian case.
Main findings
- The usual sample mean can be unreliable with heavy-tailed data; its high-probability error often scales like $\sigma\sqrt{1/(n\delta)}$ (a much worse dependence on $\delta$, coming from Chebyshev's inequality) rather than the desired $\sigma\sqrt{\log(1/\delta)/n}$.
- There exist simple, robust estimators—median-of-means, Catoni’s estimator, trimmed mean—that recover sub-Gaussian-like guarantees even under heavy tails, assuming at least finite variance.
- In many cases, the estimator must depend on the target confidence level $\delta$ (for example, through the number of blocks in MoM). The paper proves you cannot have a single estimator that is optimally sub-Gaussian for a wide range of $\delta$ without extra information.
- For multivariate data, geometric median-of-means and MoM tournaments provide strong, dimension-free guarantees, very close to the ideal Gaussian case.
- These robust ideas extend to regression (learning a function that predicts outcomes). Using median-of-means style comparisons, one can build learning algorithms that remain accurate even when errors have heavy tails.
Why this matters
In real life—finance, web traffic, sensor networks, medical data—outliers happen. A simple average can be badly thrown off by a few extreme values. Robust estimators give you protection: they keep the estimate reliable, with clear high-probability guarantees. This is especially crucial in machine learning, where training data often contain noise, rare spikes, or even malicious corruption.
Takeaways and impact
- Robust mean estimators are practical and powerful. Median-of-means and trimmed mean are simple to implement and offer strong guarantees.
- Catoni-type methods achieve optimal accuracy with thoughtful tuning and can be adapted when variance is unknown.
- For high-dimensional data, geometric median-of-means and tournament-based methods provide “Gaussian-like” performance without assuming the data are Gaussian.
- You typically need to decide in advance how confident you want to be (choose $\delta$), and tune the estimator accordingly; this dependence is often unavoidable.
- These tools carry over to regression and learning, helping build models that remain dependable even when data are messy.
In short: the paper shows that, with the right robust methods, you can get the kind of reliability you expect from ideal bell-shaped data, even when your real data have heavy tails and outliers. This makes statistics and machine learning more trustworthy in the wild.