Learning Generative Models with Sinkhorn Divergences (1706.00292v3)

Published 1 Jun 2017 in stat.ML

Abstract: The ability to compare two degenerate probability distributions (i.e. two probability distributions supported on two distinct low-dimensional manifolds living in a much higher-dimensional space) is a crucial problem arising in the estimation of generative models for high-dimensional observations such as those arising in computer vision or natural language. It is known that optimal transport metrics can represent a cure for this problem, since they were specifically designed as an alternative to information divergences to handle such problematic scenarios. Unfortunately, training generative machines using OT raises formidable computational and statistical challenges, because of (i) the computational burden of evaluating OT losses, (ii) the instability and lack of smoothness of these losses, (iii) the difficulty to estimate robustly these losses and their gradients in high dimension. This paper presents the first tractable computational method to train large scale generative models using an optimal transport loss, and tackles these three issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into one that can be computed using Sinkhorn fixed point iterations; (b) algorithmic (automatic) differentiation of these iterations. These two approximations result in a robust and differentiable approximation of the OT loss with streamlined GPU execution. Entropic smoothing generates a family of losses interpolating between Wasserstein (OT) and Maximum Mean Discrepancy (MMD), thus allowing to find a sweet spot leveraging the geometry of OT and the favorable high-dimensional sample complexity of MMD which comes with unbiased gradient estimates. The resulting computational architecture complements nicely standard deep network generative models by a stack of extra layers implementing the loss function.

Citations (584)

Summary

  • The paper introduces the Sinkhorn divergence, a novel OT-based loss that leverages entropic smoothing to enhance training robustness.
  • It employs GPU-enabled algorithmic differentiation to efficiently compute Sinkhorn iterations, bridging Wasserstein metrics with MMD.
  • Numerical experiments demonstrate improved gradient stability and computational efficiency, making the approach promising for scalable deep generative models.

Learning Generative Models with Sinkhorn Divergences

The paper "Learning Generative Models with Sinkhorn Divergences" by Aude Genevay, Gabriel Peyré, and Marco Cuturi addresses the computational and statistical challenges prevalent in training generative models through optimal transport (OT) metrics. The primary contribution is the introduction of the Sinkhorn divergence, a novel OT-based loss function intended to enhance the robustness and tractability of generative model training.

Core Concepts

The central problem is the comparison of two degenerate probability distributions, i.e. distributions supported on distinct low-dimensional manifolds embedded in a much higher-dimensional space. Optimal transport metrics are well suited to such cases, yet their computational cost, the instability and lack of smoothness of their gradients, and the difficulty of estimating them robustly in high dimension are substantial obstacles to practical use in learning tasks.

The Sinkhorn divergence, a differentiable and tractable OT-based loss, is the cornerstone of this research. It leverages two principal innovations:

  1. Entropic Smoothing: turns the original OT loss into one that can be computed with Sinkhorn fixed-point iterations, yielding a smoother and more robust objective.
  2. Algorithmic Differentiation: automatic differentiation through a fixed number of Sinkhorn iterations provides gradients of the loss, with an implementation that runs efficiently on GPUs (see the sketch after this list).
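
Below is a minimal sketch of these two ideas combined: a fixed number of log-domain Sinkhorn iterations written with standard PyTorch tensor operations, so that automatic differentiation can backpropagate through the iterations. It is an illustrative implementation under simplifying assumptions (uniform sample weights, a squared-Euclidean ground cost, a fixed iteration count), not the authors' reference code.

```python
# Sketch: differentiable entropic OT cost via log-domain Sinkhorn iterations.
# Assumptions (not from the paper's code): PyTorch, uniform weights,
# squared-Euclidean ground cost, fixed number of iterations.
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iters=100):
    """Entropy-regularized OT cost between empirical measures on x (n, d) and y (m, d)."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y, p=2) ** 2                      # ground cost C_ij = ||x_i - y_j||^2
    log_a = torch.log(torch.full((n,), 1.0 / n, device=x.device))
    log_b = torch.log(torch.full((m,), 1.0 / m, device=x.device))
    f = torch.zeros(n, device=x.device)                  # dual potentials
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        # Sinkhorn fixed-point updates, written in the log domain for numerical stability.
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    # Transport plan implied by the potentials, and the associated transport cost.
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    return torch.sum(torch.exp(log_P) * C)
```

Because every operation above is an ordinary tensor primitive, calling `backward()` on the returned value differentiates through all the iterations, which is precisely the algorithmic-differentiation ingredient.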

Moreover, this smoothing yields a family of losses that interpolates between the Wasserstein (OT) metric and Maximum Mean Discrepancy (MMD), with the regularization strength controlling where the loss sits and thereby how the geometric strengths of OT are traded against MMD's favorable high-dimensional sample complexity.
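
Schematically, and using notation standard in the entropic OT literature rather than quoting the paper, the regularized cost reads

$$ W_\varepsilon(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, \mathrm{d}\pi(x,y) \;+\; \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu). $$

As $\varepsilon \to 0$ this recovers the unregularized OT cost, while as $\varepsilon \to \infty$ the debiased quantity $2W_\varepsilon(\mu,\nu) - W_\varepsilon(\mu,\mu) - W_\varepsilon(\nu,\nu)$ converges to an MMD with kernel $-c$; the regularization strength $\varepsilon$ is thus the knob that positions the loss between the two regimes.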

Numerical Results and Claims

The numerical experiments support the efficacy of the Sinkhorn divergence. The loss interpolates between OT and MMD, balancing sample complexity against the bias of the gradient estimates, and its GPU implementation through automatic differentiation integrates cleanly with existing deep learning infrastructure.
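
To make the "extra loss layers on top of a generator" picture concrete, the following is a hypothetical minibatch training loop. The toy generator and synthetic data are placeholders, and `sinkhorn_cost` refers to the sketch above; the debiased combination of terms follows the Sinkhorn-divergence construction, but this is an assumed usage pattern, not the authors' published code.

```python
# Hypothetical usage: a Sinkhorn-type loss as a differentiable stack of layers
# on top of a generator network. The generator and data below are toy stand-ins.
import torch

latent_dim, data_dim = 2, 2
generator = torch.nn.Sequential(                         # toy generator mapping z -> x
    torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, data_dim))
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

for _ in range(1000):
    real_batch = torch.randn(128, data_dim) + 3.0        # stand-in for a minibatch of data
    z = torch.randn(128, latent_dim)
    fake_batch = generator(z)
    # Debiased Sinkhorn divergence: 2 W_eps(fake, real) - W_eps(fake, fake) - W_eps(real, real).
    loss = (2 * sinkhorn_cost(fake_batch, real_batch)
            - sinkhorn_cost(fake_batch, fake_batch)
            - sinkhorn_cost(real_batch, real_batch))
    optimizer.zero_grad()
    loss.backward()                                      # gradients flow through the Sinkhorn iterations
    optimizer.step()
```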

From a statistical perspective, the experiments show how the sample complexity of the Sinkhorn divergence varies with the entropic regularization parameter: for larger regularization the empirical rates approach the favorable ones of MMD, offering a practical trade-off for real-world applications.

Implications and Future Prospects

The theoretical implications of this paper are substantial. It introduces a divergence that potentially reshapes the landscape for generative modeling, especially where computational resources and stability are constrained. Practically, the proposed method enables seamless integration with standard neural network architectures, potentially influencing the development of scalable and efficient generative models.

Future avenues include establishing the precise conditions under which the Sinkhorn divergence is positive and vanishes only when the compared distributions coincide, a crucial property for its reliable use as a distance-like discrepancy. Further investigation of its sample complexity and empirical validation across varied datasets could yield additional insights and solidify its standing as a versatile tool for generative modeling.

In conclusion, by offering a robust alternative to traditional OT methods and bridging the gap to MMD approaches, the Sinkhorn divergence holds promise for advancing the theoretical and practical toolkit available to researchers and practitioners in the development of efficient generative models.