The Cramer Distance as a Solution to Biased Wasserstein Gradients (1705.10743v1)

Published 30 May 2017 in cs.LG and stat.ML

Abstract: The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.

Citations (330)

Summary

  • The paper introduces the Cramer distance to provide unbiased sample gradients, addressing the bias inherent in sample gradients of the Wasserstein metric.
  • It demonstrates through theoretical and empirical analyses that the Cramer distance improves stability in applications such as generative modeling and ordinal regression.
  • Key properties, namely sum invariance, scale sensitivity, and unbiased sample gradients, make it well suited to optimizing machine learning models with SGD.

The Cramer Distance as a Solution to Biased Wasserstein Gradients

This paper addresses a significant issue in the use of the Wasserstein metric in machine learning: its biased sample gradients. Although the Wasserstein metric reflects the underlying geometric relationships between outcomes and has proven useful in fields like ordinal regression and generative modeling, this bias complicates its application, particularly for optimization via stochastic gradient descent (SGD). The authors propose the Cramer distance as an alternative and justify it mathematically by showing that it fulfills several desired properties.

Core Properties and Analysis

Central to the discussion are three properties of divergences that are valuable in machine learning applications: sum invariance, scale sensitivity, and unbiased sample gradients. Roughly, a divergence is sum invariant if adding the same independent random variable to both of its arguments leaves it unchanged, and scale sensitive if rescaling both arguments by a factor c changes its value by at most a power of |c|. The KL divergence has unbiased sample gradients but is insensitive to the geometry of the outcome space, while the Wasserstein metric is sum invariant and scale sensitive but lacks unbiased sample gradients. Through theoretical and empirical analysis, the paper demonstrates that the Cramer distance, unlike the Wasserstein metric, possesses all three properties.
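
To make the gradient bias concrete, the following minimal numerical sketch (not taken from the paper; the constants and variable names are illustrative) contrasts the two distances on Bernoulli distributions, in the spirit of the paper's analysis:

```python
# Minimal sketch: compare single-sample gradient estimates of the 1-Wasserstein
# and squared Cramer distances between Bernoulli distributions on {0, 1}.
# Assumed setup (illustrative): target P = Bernoulli(p), model Q_theta = Bernoulli(theta).
import numpy as np

rng = np.random.default_rng(0)
p, theta, n_draws = 0.3, 0.1, 1_000_000

# Closed forms on {0, 1}:
#   W_1(P, Q_theta)    = |theta - p|     ->  d/dtheta = sign(theta - p)
#   l_2^2(P, Q_theta)  = (theta - p)^2   ->  d/dtheta = 2 * (theta - p)
true_w_grad = np.sign(theta - p)
true_cramer_grad = 2.0 * (theta - p)

# Replace P by the empirical distribution of a single draw x:
#   W_1(delta_x, Q_theta)   = |theta - x|    ->  gradient sign(theta - x)
#   l_2^2(delta_x, Q_theta) = (theta - x)^2  ->  gradient 2 * (theta - x)
x = rng.binomial(1, p, size=n_draws).astype(float)
sample_w_grads = np.sign(theta - x)
sample_cramer_grads = 2.0 * (theta - x)

print(f"Wasserstein: true grad {true_w_grad:+.3f}, "
      f"mean sample grad {sample_w_grads.mean():+.3f}  (biased)")
print(f"Cramer:      true grad {true_cramer_grad:+.3f}, "
      f"mean sample grad {sample_cramer_grads.mean():+.3f}  (unbiased)")
```

With theta below p, the averaged single-sample Wasserstein gradient even has the wrong sign, so SGD on sample losses drifts toward a degenerate, deterministic solution, whereas the averaged Cramér gradient matches the population gradient.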

Mathematical Formulation and Unbiased Sample Gradients

The Cramer distance is articulated through its formulation involving cumulative distribution functions (CDFs). For distributions P and Q, the Cramer distance is defined by integrating the squared difference between their CDFs. This distance notably provides unbiased sample gradients, avoiding the pitfalls associated with applying SGD to minimize Wasserstein losses. The paper mathematically establishes this unbiased nature, positioning the Cramer distance as capable of more stable, reliable optimizations.
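
For reference, and following the paper's notation up to minor differences, the squared Cramér distance between distributions P and Q with CDFs F_P and F_Q is

$$
\ell_2^2(P, Q) \;=\; \int_{-\infty}^{\infty} \big(F_P(x) - F_Q(x)\big)^2 \, dx,
$$

which coincides, up to a factor of two, with the energy distance

$$
\mathcal{E}(P, Q) \;=\; 2\,\mathbb{E}|X - Y| \;-\; \mathbb{E}|X - X'| \;-\; \mathbb{E}|Y - Y'| \;=\; 2\,\ell_2^2(P, Q),
$$

where X, X' ~ P and Y, Y' ~ Q are independent (the multivariate extension replaces |·| with the Euclidean norm). The unbiasedness argument is, roughly, that when P is replaced by an empirical distribution of samples, the only biased term in the resulting estimate, the within-sample term E|X − X'|, does not depend on the model parameters, so the gradient of the sample loss is an unbiased estimate of the gradient of the true loss.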

Practical Implications and Empirical Validation

The paper extends its analysis to practical applications, including ordinal regression and generative adversarial networks (GANs). It utilizes the Cramer distance to develop the Cramer GAN, which shows marked improvements over traditional Wasserstein GANs. The proposed Cramer GAN demonstrates more stable learning and enhanced diversity in generated samples, crucial for high-fidelity generative modeling tasks. Additionally, empirical results in image modeling and ordinal regression highlight the superiority of the Cramer distance in environments where similarity measurement between outcomes is pivotal.
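
As a rough illustration of the quantity the Cramer GAN optimizes, the sketch below (assuming PyTorch; the function names are illustrative, and the paper's surrogate losses and gradient penalty are omitted) computes a batch estimate of the energy distance between critic embeddings of real and generated samples:

```python
# Simplified sketch, not the authors' implementation: a generator objective based
# on the sample energy distance between critic embeddings of real and fake batches.
import torch

def mean_pairwise_l2(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Average Euclidean distance over all pairs of rows from a and b.
    return torch.cdist(a, b, p=2).mean()

def energy_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # E(X, Y) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, estimated from batches
    # (a V-statistic: the within-batch terms include zero self-distances).
    return 2 * mean_pairwise_l2(x, y) - mean_pairwise_l2(x, x) - mean_pairwise_l2(y, y)

def generator_loss(critic: torch.nn.Module,
                   real: torch.Tensor,
                   fake: torch.Tensor) -> torch.Tensor:
    # The generator is trained to reduce the energy distance between critic
    # embeddings of real and generated samples.
    return energy_distance(critic(real), critic(fake))
```

In the paper, the critic embedding and a gradient penalty stabilize training, and the actual Cramér GAN losses are surrogates built around this energy distance; the sketch only conveys the core objective.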

Theoretical Contribution and Future Directions

The introduction of the Cramer distance expands the toolkit available for machine learning tasks that require probability metrics sensitive to outcome geometry. Because its sample gradients are unbiased estimates of the true gradient, it is directly compatible with SGD, opening new avenues for tasks that depend on such metrics. The paper suggests further exploration of unbiased estimation techniques for the Wasserstein metric, as well as variance reduction techniques for the Cramer distance's sample gradients.

Conclusion

The paper effectively challenges the Wasserstein metric's dominance in certain machine learning contexts, presenting the Cramer distance as a superior choice when unbiased sample gradients are critical. Future developments may continue to explore and expand the theoretical underpinnings of this work, potentially influencing a broader array of probabilistic modeling techniques within AI and machine learning research.