Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means (2506.14673v3)

Published 17 Jun 2025 in stat.ML and cs.LG

Abstract: The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works.

Summary

  • The paper demonstrates that the median-of-means estimator overcomes key shortcomings of the empirical mean in heavy-tailed data settings.
  • The study introduces a novel analysis framework using ghost samples, providing explicit uniform convergence and sample complexity bounds.
  • The proposed methodology is applied to k-means clustering and linear regression, improving performance in scenarios with unbounded data features.

Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means

The paper "Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means" by Mikael Møller Høgsgaard and Andrea Paudice addresses the challenges of estimating the mean of functions under heavy-tailed distributions that exhibit only the first pp moments for p(1,2]p \in (1,2]. The authors leverage the Median-of-Means (MoM) estimator's robust properties, highlighting its advantages over traditional mean estimators, particularly in dealing with heavier-tailed data distributions. This work provides significant theoretical contributions to the uniform convergence of the MoM estimator, focusing on sample complexity and its applications to machine learning tasks such as kk-means clustering and linear regression with broad loss functions.

The paper opens with a critical observation: the sample mean can fail as a mean estimator under heavy-tailed distributions. When the data possesses only finite moments up to order $p \in (1,2]$, the sample mean achieves suboptimal deviation guarantees. The MoM estimator, by contrast, splits the data into disjoint batches, averages within each batch, and returns the median of the batch means, which yields superior concentration properties.
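For concreteness, the standard MoM construction can be sketched as follows (a minimal Python illustration; the random batch split, batch count, and the Pareto test distribution are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

def median_of_means(x, n_batches, seed=0):
    """Median-of-Means estimate of E[x].

    Randomly splits x into n_batches disjoint batches, averages each
    batch, and returns the median of the batch means.
    """
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    batch_means = [b.mean() for b in np.array_split(x, n_batches)]
    return float(np.median(batch_means))

# Example on heavy-tailed data: a Pareto distribution with tail index 1.5
# has a finite mean but infinite variance (finite moments only for p < 1.5).
rng = np.random.default_rng(1)
sample = rng.pareto(1.5, size=10_000) + 1.0  # true mean = 1.5 / (1.5 - 1) = 3
print("sample mean:   ", sample.mean())
print("median of means:", median_of_means(sample, n_batches=30))
```

Because the median is insensitive to a minority of corrupted batch means, a few extreme draws cannot drag the estimate far from the true mean, which is precisely the failure mode of the plain sample mean here.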

Theoretical Contributions

At the core of Høgsgaard and Paudice's work lies a novel analysis framework for the MoM estimator. The authors provide theoretical guarantees for uniform mean estimation, based on an innovative symmetrization technique involving two ghost samples that extends the classical symmetrization argument. This technique yields an explicit sample complexity bound, in contrast to earlier analyses whose bounds depend on quantities related to the Rademacher complexity that are typically hard to compute explicitly.

The paper's primary theorem demonstrates that the MoM estimator achieves a sample complexity of order $(v_p/\varepsilon^p)^{1/(p-1)}\left(\log(N_{\mathcal{D}}(\varepsilon/16, m)/\delta) + \kappa_{0}(\delta)\right)$, given a function class $\mathcal{F}$ that admits a suitable cover, where $N_{\mathcal{D}}(\varepsilon/16, m)$ denotes the relevant covering number. This sample complexity is derived without stringent boundedness assumptions on the functions, thereby extending the utility of MoM to a broader class of functions.
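Ignoring the covering-number and $\kappa_0$ terms, a small computation shows how the moment order $p$ drives the leading $(v_p/\varepsilon^p)^{1/(p-1)}$ factor (illustrative only; $v_p = 1$ and $\varepsilon = 0.1$ are arbitrary choices):

```python
def leading_term(v_p: float, eps: float, p: float) -> float:
    """Leading (v_p / eps^p)^(1/(p-1)) factor from the paper's bound."""
    assert 1.0 < p <= 2.0
    return (v_p / eps**p) ** (1.0 / (p - 1.0))

# Heavier tails (smaller p) require many more samples for the same accuracy:
for p in (1.1, 1.5, 2.0):
    print(f"p = {p}: leading term ~ {leading_term(1.0, 0.1, p):.3g}")
# p = 1.1 gives ~1e11, p = 1.5 gives 1e3, p = 2.0 gives 100.
```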

Applications

The authors illustrate their theoretical results with applications to $k$-means clustering and linear regression. For $k$-means clustering, they address scenarios where both the input data and the centers are unbounded, achieving an exponential improvement in the confidence term over previous works. For linear regression, the paper establishes new sample complexity bounds for continuous loss functions, providing results that align with those known under sub-exponential tail assumptions while requiring only finite $p$-th moments, hence broadening applicability to diverse real-world data scenarios.
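To see how uniform mean estimation enters the $k$-means application, note that for any fixed set of candidate centers, the expected clustering cost is the mean of a per-sample cost function, which MoM can estimate robustly. The sketch below is one plausible use of the guarantee, not the paper's exact construction (the function name and batching scheme are illustrative):

```python
import numpy as np

def mom_kmeans_risk(X, centers, n_batches, seed=0):
    """MoM estimate of the expected k-means cost E[min_j ||x - c_j||^2]
    for a fixed set of centers; a sketch, not the paper's algorithm."""
    # Per-sample cost: squared distance to the nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    costs = d2.min(axis=1)
    rng = np.random.default_rng(seed)
    costs = rng.permutation(costs)
    batch_means = [b.mean() for b in np.array_split(costs, n_batches)]
    return float(np.median(batch_means))
```

The paper's uniform guarantee is what makes such an estimate trustworthy simultaneously over all candidate center sets, which is exactly what empirical risk minimization over centers requires.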

Implications and Future Research

This paper lays the groundwork for extending the use of the MoM estimator in machine learning and statistics, especially for tasks where data distributions deviate from the ideal cases assumed by traditional estimators. The authors’ approach invites further exploration into other areas of learning where heavy-tailed data are prevalent. Moreover, the techniques introduced can inspire future refinements of uniform convergence analysis under non-standard probability distributions.

In conclusion, Høgsgaard and Paudice's work delivers a comprehensive analytical framework for utilizing the MoM estimator in heavy-tailed settings, addressing both theoretical and practical aspects pertinent to today's complex machine learning challenges. Their contributions enhance our understanding of mean estimation under non-ideal data conditions, with implications that span econometrics, network science, and beyond.
