Mean estimation and regression under heavy-tailed distributions--a survey
Abstract: We survey some of the recent advances in mean estimation and regression function estimation. In particular, we describe sub-Gaussian mean estimators for possibly heavy-tailed data both in the univariate and multivariate settings. We focus on estimators based on median-of-means techniques but other methods such as the trimmed mean and Catoni's estimator are also reviewed. We give detailed proofs for the cornerstone results. We dedicate a section on statistical learning problems--in particular, regression function estimation--in the presence of possibly heavy-tailed data.
Explain it Like I'm 14
What is this paper about?
This paper looks at a very basic task in statistics: estimating the average (the “mean”) of data, and building regression models (predicting one thing from another). It focuses on situations where the data can have “heavy tails”—that is, you sometimes see very large, rare values (outliers). In such cases, the usual average can be unreliable. The authors explain several smarter ways to estimate averages and do regression so that the answers are accurate and trustworthy even when outliers occur.
Questions the paper asks
- How can we estimate an average from data that might include extreme, rare values without being fooled by those outliers?
- Can we make estimators with high confidence guarantees (meaning they are accurate “with high probability”), similar to what we would get with perfectly nice, bell-shaped (Gaussian) data?
- How do we extend these ideas from one number (univariate data) to many numbers at once (vectors, or multivariate data)?
- How can these ideas help in machine learning, especially in building regression models when errors or inputs can be heavy-tailed?
- Are there limits—what is possible and what is not—if we only assume the data has a finite variance?
How did the authors approach the problem?
The main challenge is that outliers can pull the ordinary average far away from the true mean. The paper surveys several “robust” estimators—methods designed to resist the influence of outliers—then analyzes how accurate they can be, with clear probability guarantees.
Key idea: “Sub-Gaussian” performance
Imagine your data were perfectly nice and followed a bell curve. Then the error of the sample mean shrinks like:
- roughly $\sigma/\sqrt{n}$ on average, and
- with probability at least $1-\delta$ it is at most about $\sigma\sqrt{\log(1/\delta)/n}$,
where $n$ is the number of samples, $\sigma$ is the standard deviation, and $\delta$ is the allowed failure probability (say, $\delta = 0.05$ means 95% confidence). The paper asks: can we get this same kind of guarantee ("sub-Gaussian") even if the data are heavy-tailed? Surprisingly, yes, if we use the right estimators.
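Spelled out (this is a schematic version of the definition used in the survey; the constant $c$ differs between results), an estimator $\widehat{\mu}_n$ of the mean $\mu$ is called sub-Gaussian if

$$\mathbb{P}\left\{\, \left|\widehat{\mu}_n - \mu\right| > \frac{c\,\sigma\sqrt{\log(2/\delta)}}{\sqrt{n}} \,\right\} \le \delta .$$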
Median-of-means (MoM)
- Split your data into groups (blocks) of roughly equal size.
- Compute the average within each group.
- Take the median of those group averages.
Why this works: An outlier may ruin one group’s mean, but it’s unlikely to ruin most groups. The median ignores a few bad groups and keeps the typical ones. This simple trick gives strong accuracy guarantees even when the usual average fails.
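A minimal sketch of the procedure in Python (the block count, random seed, and Pareto test data are illustrative choices, not taken from the paper):

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Median-of-means: split the data into blocks of roughly equal size,
    average each block, and return the median of the block averages."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)      # fixed seed so the split is reproducible
    x = rng.permutation(x)              # shuffle so the blocks are exchangeable
    block_means = [block.mean() for block in np.array_split(x, n_blocks)]
    return float(np.median(block_means))

# Example with heavy-tailed (Pareto) data, where the plain average is easily thrown off.
rng = np.random.default_rng(1)
data = rng.pareto(2.5, size=10_000)     # true mean = 1 / (2.5 - 1) ≈ 0.667
print("sample mean     :", data.mean())
print("median of means :", median_of_means(data, n_blocks=30))
```

More blocks tolerate more contaminated groups, which is why the block count is tied to the confidence level discussed in the key points below.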
Key points:
- It achieves sub-Gaussian-like error: about $\sigma\sqrt{\log(1/\delta)/n}$ with probability at least $1-\delta$.
- You must choose the number of groups based on your desired confidence level $\delta$: roughly proportional to $\log(1/\delta)$ groups.
- It also works under different moment assumptions: if only a finite moment of order $1+\alpha$ exists (for some $0<\alpha\le 1$) you still get a valid, though slower, rate, and if a third moment exists the constants improve further.
Catoni’s estimator
- Instead of averaging raw values, it solves a modified equation that “down-weights” large deviations (outliers) using a carefully designed influence function.
- Think of it as a smart average that grows more slowly when confronted with big values.
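A minimal sketch of this estimator in Python, assuming a known bound on the variance (the influence function below is Catoni's standard choice; the simplified tuning rule for alpha and the t-distributed test data are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_psi(x):
    """Catoni's influence function: grows only logarithmically,
    so huge observations have limited pull on the estimate."""
    return np.where(x >= 0, np.log1p(x + 0.5 * x**2), -np.log1p(-x + 0.5 * x**2))

def catoni_mean(x, delta, var_bound):
    """Sketch of Catoni's M-estimator of the mean.

    delta     : target failure probability
    var_bound : a known (or assumed) upper bound on the variance
    alpha follows a common simplified recipe: sqrt(2 log(1/delta) / (n * var_bound))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * var_bound))
    # The estimate is the root (in theta) of sum_i psi(alpha * (x_i - theta)) = 0,
    # which is a decreasing function of theta, so bracketed root-finding works.
    f = lambda theta: np.sum(catoni_psi(alpha * (x - theta)))
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

# Illustrative use with heavy-tailed data (values chosen for the example only).
rng = np.random.default_rng(2)
data = rng.standard_t(df=3, size=5_000)   # t(3): heavy tails, finite variance = 3
print("sample mean :", data.mean())
print("Catoni mean :", catoni_mean(data, delta=0.01, var_bound=3.0))
```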
Key points:
- It can match optimal sub-Gaussian guarantees.
- It needs a tuning parameter related to $n$ and $\delta$ and, ideally, a bound on the variance $\sigma^2$.
- If you don't know $\sigma^2$, there are adaptive tricks (like Lepski's method) to pick a good parameter from the data.
Trimmed mean
- Sort the data.
- Chop off the largest and smallest few percent (potential outliers).
- Average the rest.
Key points:
- If you trim a fraction proportional to $\log(1/\delta)/n$ from each end, you get sub-Gaussian-like performance (this is the choice used in the sketch below).
- It is conceptually simple and robust, even if some data points are adversarially corrupted.
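A minimal sketch with the trimming level tied to $\log(1/\delta)/n$ (the constant 8 and the test data below are illustrative choices, not the paper's exact constants):

```python
import numpy as np

def trimmed_mean(x, delta, c=8.0):
    """Sketch of the trimmed mean: sort, drop an eps-fraction of the smallest
    and largest values with eps ~ c * log(1/delta) / n, and average the rest.
    The constant c is an illustrative choice."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    eps = min(0.25, c * np.log(1.0 / delta) / n)   # trimming fraction per side
    k = int(np.ceil(eps * n))                      # number of points dropped per side
    return float(x[k:n - k].mean())

# Illustrative use on heavy-tailed data with a finite variance.
rng = np.random.default_rng(3)
data = rng.pareto(2.5, size=10_000)                # true mean ≈ 0.667
print("sample mean  :", data.mean())
print("trimmed mean :", trimmed_mean(data, delta=0.01))
```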
Moving to multivariate data (vectors)
When each data point is a vector (many numbers at once), we want the estimate to be close in Euclidean distance. The paper adapts robust ideas to this case:
- Coordinate-wise median-of-means: take MoM for each coordinate separately. This is simple but can be less sharp in high dimensions.
- Geometric median-of-means: instead of per-coordinate medians, take the geometric median of block means (the point minimizing the sum of distances). This is a convex optimization problem, so it's computable and yields dimension-free guarantees (no explicit dependence on the number of coordinates). A code sketch of both median-of-means variants follows after this list.
- Thresholding the norm (Catoni–Giulini): shrink data points with very large norms toward zero before averaging. This is easy to compute, but needs a prior bound on how spread out the data are, and care to avoid issues like not being translation-invariant.
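To make the two median-of-means variants above concrete, here is a minimal sketch (Weiszfeld's iteration is a standard way to compute the geometric median; the block count, dimension, and heavy-tailed test data are illustrative assumptions):

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Geometric median via Weiszfeld's iterations: the point minimizing
    the sum of Euclidean distances to the given points."""
    y = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - y, axis=1), 1e-12)  # avoid division by zero
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

def multivariate_mom(X, n_blocks, geometric=True):
    """Median-of-means for vectors: average within blocks, then combine the
    block means either coordinate-wise or via the geometric median."""
    rng = np.random.default_rng(0)
    X = rng.permutation(np.asarray(X, dtype=float))
    block_means = np.array([b.mean(axis=0) for b in np.array_split(X, n_blocks)])
    if geometric:
        return geometric_median(block_means)
    return np.median(block_means, axis=0)   # coordinate-wise median-of-means

# Illustrative use: heavy-tailed coordinates in d = 20 dimensions, true mean 0.
rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=(20_000, 20))
print("error of sample mean    :", np.linalg.norm(X.mean(axis=0)))
print("error of geometric MoM  :", np.linalg.norm(multivariate_mom(X, n_blocks=30)))
```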
Median-of-means tournaments
This is a clever way to get truly sub-Gaussian behavior for vectors under very mild assumptions (just finite covariance). Imagine comparing candidate mean points by how well they fit the data across blocks, and choosing the one that consistently “wins” these comparisons:
- Partition data into blocks and compute block means.
- For any two candidate points $a$ and $b$, see which one is closer to the block means on most blocks ("$a$ defeats $b$" if $a$ is closer on more than half of the blocks).
- Choose a point whose "losing radius" is as small as possible: any candidate that defeats it has to lie close by, so beyond that small radius it wins against everything.
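The actual estimator minimizes this "losing radius" over all of $\mathbb{R}^d$ and is computationally delicate; the sketch below is only an illustration of the comparison rule, restricted (as an assumption made for illustration) to a finite candidate set, here the block means themselves:

```python
import numpy as np

def mom_tournament_finite(X, n_blocks, candidates):
    """Illustration of the median-of-means tournament comparison rule over a
    finite candidate set: return the candidate whose farthest "defeater"
    (a candidate closer to the block means on more than half of the blocks)
    is as close as possible."""
    rng = np.random.default_rng(0)
    X = rng.permutation(np.asarray(X, dtype=float))
    Z = np.array([b.mean(axis=0) for b in np.array_split(X, n_blocks)])  # block means

    def defeats(a, b):
        closer = np.linalg.norm(Z - a, axis=1) <= np.linalg.norm(Z - b, axis=1)
        return closer.sum() > n_blocks / 2

    radii = []
    for a in candidates:
        r = [np.linalg.norm(a - b) for b in candidates if defeats(b, a)]
        radii.append(max(r) if r else 0.0)
    return candidates[int(np.argmin(radii))]

# Illustrative use: candidates taken to be the block means themselves.
rng = np.random.default_rng(5)
X = rng.standard_t(df=3, size=(10_000, 10))      # heavy tails, true mean 0
Zc = np.array([b.mean(axis=0) for b in np.array_split(X, 25)])
est = mom_tournament_finite(X, n_blocks=25, candidates=Zc)
print("error of tournament estimate:", np.linalg.norm(est))
```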
Key points:
- It achieves a high-probability error bound of the form $\sqrt{\mathrm{Tr}(\Sigma)/n} + \sqrt{\lambda_{\max}(\Sigma)\,\log(1/\delta)/n}$ (up to a constant factor), where $\Sigma$ is the covariance matrix, $\mathrm{Tr}(\Sigma)$ is its trace (the sum of the coordinate variances), and $\lambda_{\max}(\Sigma)$ is its largest eigenvalue.
- This is “dimension-free” and mirrors what you would expect in the Gaussian case.
Main findings
- The usual sample mean can be unreliable with heavy-tailed data; its high-probability error often scales like $\sigma\sqrt{1/(n\delta)}$ (a much worse dependence on $\delta$, coming from Chebyshev's inequality) rather than the desired $\sigma\sqrt{\log(1/\delta)/n}$.
- There exist simple, robust estimators—median-of-means, Catoni’s estimator, trimmed mean—that recover sub-Gaussian-like guarantees even under heavy tails, assuming at least finite variance.
- In many cases, the estimator must depend on the target confidence level $\delta$ (for example, through the number of blocks in MoM). The paper proves you cannot have a single estimator that is optimally sub-Gaussian for a wide range of $\delta$ without extra information.
- For multivariate data, geometric median-of-means and MoM tournaments provide strong, dimension-free guarantees, very close to the ideal Gaussian case.
- These robust ideas extend to regression (learning a function that predicts outcomes). Using median-of-means style comparisons, one can build learning algorithms that remain accurate even when errors have heavy tails.
Why this matters
In real life—finance, web traffic, sensor networks, medical data—outliers happen. A simple average can be badly thrown off by a few extreme values. Robust estimators give you protection: they keep the estimate reliable, with clear high-probability guarantees. This is especially crucial in machine learning, where training data often contain noise, rare spikes, or even malicious corruption.
Takeaways and impact
- Robust mean estimators are practical and powerful. Median-of-means and trimmed mean are simple to implement and offer strong guarantees.
- Catoni-type methods achieve optimal accuracy with thoughtful tuning and can be adapted when variance is unknown.
- For high-dimensional data, geometric median-of-means and tournament-based methods provide “Gaussian-like” performance without assuming the data are Gaussian.
- You typically need to decide in advance how confident you want to be (choose $\delta$), and tune the estimator accordingly; this dependence is often unavoidable.
- These tools carry over to regression and learning, helping build models that remain dependable even when data are messy.
In short: the paper shows that, with the right robust methods, you can get the kind of reliability you expect from ideal bell-shaped data, even when your real data have heavy tails and outliers. This makes statistics and machine learning more trustworthy in the wild.