
Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation (1905.13576v3)

Published 30 May 2019 in math.ST, cs.IT, math.IT, and stat.TH

Abstract: This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_\sigma$, for $\mathcal{N}_\sigma\triangleq\mathcal{N}(0,\sigma^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$, in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $\chi^2$-divergence becomes infinite, a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_\sigma)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach over general-purpose differential entropy estimators are provided.
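The plug-in estimator discussed in the abstract is $h(\hat{P}_n\ast\mathcal{N}_\sigma)$: the differential entropy of the Gaussian mixture obtained by placing a $\mathcal{N}(0,\sigma^2\mathrm{I}_d)$ kernel at each observed sample. Since that entropy has no closed form, the sketch below approximates it by Monte Carlo integration. This is a minimal illustration, not the authors' implementation; the function name, Monte Carlo scheme, and sample size are assumptions made here for concreteness.

```python
import numpy as np
from scipy.special import logsumexp

def plug_in_entropy(samples, sigma, num_mc=10_000, rng=None):
    """Plug-in estimate of h(P * N_sigma): the differential entropy of the
    Gaussian mixture (1/n) * sum_i N(x_i, sigma^2 I_d), approximated by
    Monte Carlo. `samples` is an (n, d) array of i.i.d. draws from the
    unknown P. (Illustrative sketch; not the paper's exact numerics.)"""
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape

    # Draw Monte Carlo points from the mixture hat{P}_n * N_sigma itself:
    # pick a random sample as the center, then add isotropic Gaussian noise.
    centers = samples[rng.integers(n, size=num_mc)]
    y = centers + sigma * rng.standard_normal((num_mc, d))

    # Log-density of the mixture at each Monte Carlo point:
    # log[(1/n) * sum_i N(y; x_i, sigma^2 I_d)], via logsumexp for stability.
    sq_dists = ((y[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -sq_dists / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_density = logsumexp(log_kernel, axis=1) - np.log(n)

    # Differential entropy: h = -E[log q(Y)] with Y ~ hat{P}_n * N_sigma.
    return -log_density.mean()
```

As a sanity check, for samples drawn from $P=\mathcal{N}(0,\mathrm{I}_d)$ the estimate should approach $\frac{d}{2}\log\big(2\pi e(1+\sigma^2)\big)$, the entropy of $\mathcal{N}(0,(1+\sigma^2)\mathrm{I}_d)$, as $n$ grows.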

Citations (60)
