A Kernel Method for the Two-Sample Problem (0805.2368v1)

Published 15 May 2008 in cs.LG and cs.AI

Abstract: We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (eg. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

Citations (2,255)

Summary

  • The paper presents the Maximum Mean Discrepancy (MMD) in RKHS as a robust metric for distinguishing probability distributions.
  • It proposes both quadratic and linear time implementations, enabling efficient tests for high-dimensional and large-scale datasets.
  • Experiments demonstrate superior sensitivity and computational efficiency compared to classic multivariate statistical tests.

A Kernel Method for the Two-Sample Problem

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola present a rigorous methodology for comparing probability distributions using kernel methods, particularly when dealing with high-dimensional data. The proposed approach leverages Reproducing Kernel Hilbert Spaces (RKHS) to develop statistical tests for determining if two samples are drawn from different distributions, commonly referred to as the two-sample or homogeneity problem.

Formal Definition and Approach

The core idea is to use the Maximum Mean Discrepancy (MMD) as a test statistic. The MMD is defined as the largest difference in expectations over functions in the unit ball of an RKHS. The statistic can be computed exactly in time quadratic in the sample size; the authors also discuss efficient linear-time approximations, which make the method applicable to large datasets.
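To make the quadratic-time computation concrete, here is a minimal sketch of the biased MMD^2 estimate with a Gaussian RBF kernel; the kernel choice and bandwidth sigma are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # Biased quadratic-time MMD^2 estimate over samples X (m x d) and Y (n x d):
    # mean k(x, x') - 2 * mean k(x, y) + mean k(y, y').
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()
```

Computing the three kernel matrices dominates the cost, which is the source of the O((m+n)^2) scaling discussed below.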

Technical Contributions

1. MMD as a Metric on Distribution Space:

The authors establish that when the function space is the unit ball of a universal RKHS, the MMD is a metric on the space of probability distributions: it is zero if and only if the two distributions are identical. This makes it a sound basis for hypothesis testing.
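In kernel form, the squared population MMD reduces to expectations of the kernel alone, which is what the estimators sketched here exploit (with x, x' ~ p and y, y' ~ q drawn independently):

```latex
\mathrm{MMD}^2(p, q) = \mathbb{E}_{x,x'}\,k(x, x') - 2\,\mathbb{E}_{x,y}\,k(x, y) + \mathbb{E}_{y,y'}\,k(y, y')
```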

2. Statistical Tests Based on MMD:

Three distinct tests based on MMD are proposed:

  • Two tests leverage large deviation bounds on the test statistic, providing distribution-free, finite-sample performance guarantees.
  • The third test uses the asymptotic distribution of the MMD statistic and is shown to be more sensitive to distributional differences at small sample sizes; a generic resampling calibration is sketched after this list.
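For intuition, one generic way to calibrate such a test is to approximate the null distribution by resampling; this is a common stand-in, not the paper's specific large-deviation or asymptotic thresholds. The sketch reuses mmd2_biased from above.

```python
import numpy as np

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    # Pool the samples, repeatedly re-split them at random, and recompute the
    # statistic; reject H0 (same distribution) if the observed MMD^2 exceeds
    # the (1 - alpha) quantile of the permuted statistics.
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(X, Y, sigma)
    pooled = np.vstack([X, Y])
    m = len(X)
    null_stats = [
        mmd2_biased(pooled[p[:m]], pooled[p[m:]], sigma)
        for p in (rng.permutation(len(pooled)) for _ in range(n_perm))
    ]
    threshold = np.quantile(null_stats, 1 - alpha)
    return observed > threshold, observed, threshold
```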

3. Computational Efficiency:

The authors address the computational efficiency of their method. While the exact MMD computation requires O((m+n)^2) operations, they present a linear-time version that approximates the test statistic, making it feasible for applications requiring fast execution or operating in streaming data scenarios.
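A minimal sketch of the linear-time statistic described in the paper, which averages a kernel combination h over disjoint consecutive sample pairs (again with an assumed Gaussian kernel):

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    # Gaussian kernel evaluated on a single pair of vectors.
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    # Linear-time MMD^2: average of
    # h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    # over disjoint consecutive pairs, so each sample is touched only once.
    m = (min(len(X), len(Y)) // 2) * 2
    total = 0.0
    for i in range(0, m, 2):
        xi, xj, yi, yj = X[i], X[i + 1], Y[i], Y[i + 1]
        total += (gauss(xi, xj, sigma) + gauss(yi, yj, sigma)
                  - gauss(xi, yj, sigma) - gauss(xj, yi, sigma))
    return total / (m / 2)
```

Because each sample enters only one pair, the estimate is noisier than the quadratic-time version but can be computed in a single streaming pass over the data.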

4. Applications to Various Data Structures:

The method's versatility is demonstrated through applications in database attribute matching using the Hungarian marriage method and comparisons over graph-based distributions. These novel applications highlight the method's robustness in handling both structured and unstructured data.
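One plausible reading of the attribute-matching setup, sketched here with a hypothetical data layout: score every attribute pair by its MMD^2 and solve the resulting assignment problem. SciPy's linear_sum_assignment stands in for the Hungarian method, and mmd2_biased is reused from above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(db_a, db_b, sigma=1.0):
    # db_a, db_b: lists of 1-D sample arrays, one per attribute (hypothetical format).
    # Build a cost matrix of pairwise MMD^2 values, then find the
    # minimum-cost one-to-one matching between the two attribute sets.
    cost = np.array([[mmd2_biased(a[:, None], b[:, None], sigma) for b in db_b]
                     for a in db_a])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```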

Strong Numerical Results

The experiments conducted show that the MMD-based methodologies outperform several classical multivariate statistical tests, including the Friedman-Rafsky test and the Biau-Györfi test, in both sensitivity to distributional differences and computational efficiency. The methodology demonstrates superior performance in settings with high data dimensionality and low sample size, a common scenario in practical applications.

Implications for Practical and Theoretical Research

The practical implications of this research are significant. The ability to distinguish between samples from different distributions with high accuracy using computationally efficient methods opens up new avenues for data integration, bioinformatics, and signal processing.

Theoretically, the use of RKHS and kernel methods in defining and computing the MMD provides a strong foundation for extending this approach to other machine learning problems, such as domain adaptation, anomaly detection, and generative modeling.

Future Directions

The paper suggests several potential directions for further research. One area is the development of adaptive kernel selection strategies, optimizing the choice of kernel parameters for specific data types or distributions to improve test sensitivity. Another promising direction is the exploration of the MMD in different function classes beyond RKHS, potentially enhancing the method's applicability and performance in various domains.
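As a baseline against which such adaptive strategies would be compared, a widely used default is the median heuristic, which sets the Gaussian bandwidth from pairwise distances in the pooled sample (a common convention, not a recommendation made in this paper):

```python
import numpy as np

def median_heuristic_sigma(X, Y):
    # Bandwidth = median pairwise Euclidean distance over the pooled sample.
    Z = np.vstack([X, Y])
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    dists = np.sqrt(np.maximum(d2[np.triu_indices(len(Z), k=1)], 0))
    return float(np.median(dists))
```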

In summary, this paper provides a comprehensive, efficient, and theoretically sound approach to the two-sample problem, with demonstrated advantages in multiple application areas. The authors' contributions significantly enhance the toolkit available for researchers and practitioners working with high-dimensional and complex data distributions.