Equivalence of distance-based and RKHS-based statistics in hypothesis testing (1207.6076v3)

Published 25 Jul 2012 in stat.ME, cs.LG, math.ST, stat.ML, and stat.TH

Abstract: We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

Citations (643)

Summary

  • The paper proves that energy distances and RKHS-based MMD are equivalent when computed with a semimetric of negative type.
  • It extends classical results from metrics to semimetrics of negative type, broadening the scope of two-sample and independence testing, and characterizes when the resulting tests are statistically consistent.
  • It offers improved test power and computational efficiency by leveraging alternative kernel choices and kernel eigenspectra in test design.

Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing

The paper "Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing" provides a comprehensive analysis of two statistical methodologies: energy distances and reproducing kernel Hilbert space (RKHS) methods. These methodologies are pivotal in two-sample and independence testing scenarios frequently encountered in statistics and machine learning.

Main Contributions

The authors establish a unifying framework that demonstrates the equivalence between distance-based statistics, such as energy distances and distance covariances, and RKHS-based statistics like maximum mean discrepancies (MMD) and the Hilbert–Schmidt independence criterion (HSIC).

  1. Equivalence of Methodologies: The paper proves that, when the energy distance is computed with a semimetric of negative type, it coincides with the squared MMD under an induced positive definite "distance kernel" (up to the kernel's scaling convention). Conversely, for any positive definite kernel, the MMD can be interpreted as an energy distance relative to a specific negative-type semimetric. A numerical check of this correspondence follows the list.
  2. Generalization to Semimetrics: Most existing results apply to metrics of negative type; this framework extends them to semimetrics, so the statistics remain well defined on a broader class of spaces.
  3. Statistical Consistency: It characterizes the probability distributions for which these test statistics are consistent against all alternatives, ensuring their reliability in practical applications.
  4. Improved Test Power: By situating the standard energy distance within a parametric family of distance kernels (for instance, kernels induced by the semimetrics ||x − y||^q with 0 < q < 2), the authors show that alternative members of this family can achieve greater statistical power than the traditional choice q = 1.
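To make the equivalence in item 1 concrete, here is a minimal numerical sketch (not code from the paper): with the Euclidean distance rho(x, y) = ||x − y|| and the induced distance kernel k(x, y) = [rho(x, x0) + rho(y, x0) − rho(x, y)] / 2 centred at an arbitrary point x0, the biased (V-statistic) estimate of the squared MMD equals exactly half the empirical energy distance. The sample sizes and distributions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(120, 2))   # sample from Q

def pdist(A, B):
    """Pairwise Euclidean distances rho(a, b) = ||a - b||."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

# Empirical energy distance (V-statistic):
#   2 E rho(X, Y) - E rho(X, X') - E rho(Y, Y')
energy = 2 * pdist(X, Y).mean() - pdist(X, X).mean() - pdist(Y, Y).mean()

# Distance-induced kernel centred at an arbitrary point x0:
#   k(x, y) = (rho(x, x0) + rho(y, x0) - rho(x, y)) / 2
x0 = np.zeros((1, 2))
def dist_kernel(A, B):
    return 0.5 * (pdist(A, x0) + pdist(B, x0).T - pdist(A, B))

# Biased (V-statistic) estimate of MMD^2 with the distance kernel
mmd2 = (dist_kernel(X, X).mean() + dist_kernel(Y, Y).mean()
        - 2 * dist_kernel(X, Y).mean())

print(energy, 2 * mmd2)   # identical up to floating-point rounding
```

Note that the centre x0 changes the kernel but not the resulting MMD value, and replacing rho with ||x − y||^q for 0 < q < 2 yields the parametric family of distance kernels referred to in item 4.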

Numerical and Theoretical Insights

The paper provides several quantitative and theoretical insights that are crucial for practitioners:

  • Numerical Comparisons: Through experiments on two-sample and independence testing problems, the paper compares the performance of different kernel choices, showing that suitable choices can substantially improve the sensitivity and robustness of the resulting tests.
  • Empirical Estimates: It discusses empirical estimates for MMD and HSIC, demonstrating how these can be implemented to construct reliable hypothesis tests (a short sketch of such an estimator follows this list).
  • Test Design: By leveraging kernel eigenspectra, the authors present methods for approximating the null distribution of the test statistics without resorting to bootstrapping, improving computational efficiency and scalability (see the second sketch below).
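As one concrete instance of the empirical estimates mentioned above, the standard biased HSIC estimator, trace(KHLH) / n^2 with centring matrix H, takes only a few lines. This is a generic sketch: the Gaussian kernels and the bandwidth sigma are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np

def hsic_biased(X, Y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels on both variables:
    HSIC_b = trace(K @ H @ L @ H) / n**2."""
    n = X.shape[0]

    def gram(A):
        sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma ** 2))

    H = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    K, L = gram(X), gram(Y)
    return np.trace(K @ H @ L @ H) / n ** 2

# Dependent samples give a clearly larger value than independent ones
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
print(hsic_biased(X, X + 0.1 * rng.normal(size=(200, 1))))  # dependent
print(hsic_biased(X, rng.normal(size=(200, 1))))            # independent
```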
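The eigenspectrum-based test design can be sketched as follows for the two-sample case: under the null, the scaled biased MMD statistic is asymptotically a weighted sum of chi-squared variables, and the weights can be estimated from the eigenvalues of the centred Gram matrix on the pooled sample, so no bootstrap resampling is required. The normalisation conventions below follow one common presentation of this spectral approach and should be read as an assumption, not the paper's exact recipe.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_biased(K, n):
    """Biased MMD^2 from a pooled Gram matrix whose first n rows are X."""
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def spectral_null(K, num_draws=5000, rng=None):
    """Draws from the approximate null of n * MMD^2_b: a weighted sum of
    chi-squared variables, weighted by empirical Gram-matrix eigenvalues."""
    rng = rng or np.random.default_rng()
    m = K.shape[0]                              # pooled sample size (2n)
    H = np.eye(m) - np.ones((m, m)) / m
    lam = np.linalg.eigvalsh(H @ K @ H) / m     # operator eigenvalue estimates
    lam = lam[lam > 1e-12]
    z = rng.normal(size=(num_draws, lam.size))
    return 2.0 * (z ** 2) @ lam                 # sum_l lam_l * z_l^2, z_l ~ N(0, 2)

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
Y = rng.normal(0.3, 1.0, size=(n, 2))
K = gaussian_gram(np.vstack([X, Y]))
stat = n * mmd2_biased(K, n)
null = spectral_null(K, rng=rng)
print("approx. p-value:", np.mean(null >= stat))
```

The eigendecomposition is computed once, after which arbitrarily many null samples are cheap, which is the source of the efficiency gain over resampling-based calibration.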

Implications and Future Directions

The equivalence established in this work has several notable implications:

  • Expanding Application Areas: Researchers can now apply these methodologies to more complex data types and structures, such as graphs and text strings, which are not naturally Euclidean.
  • Enhanced Exploratory Tools: The broadening of admissible kernels opens new frontiers for exploratory data analysis, potentially empowering domain-specific applications where domain knowledge can influence kernel choice.

This paper unifies two important approaches to statistical testing, offering practitioners flexible and powerful tools for hypothesis testing. Future research could extend these methods by exploring new kernel functions tailored to specific application areas, enriching the toolkit available for statistical and machine learning research.
