Assessing Generative Models via Precision and Recall (1806.00035v2)

Published 31 May 2018 in stat.ML and cs.LG

Abstract: Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as the Frechet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they only yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and Variational Autoencoders. In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.

Authors (5)
  1. Mehdi S. M. Sajjadi (28 papers)
  2. Olivier Bachem (52 papers)
  3. Olivier Bousquet (33 papers)
  4. Sylvain Gelly (43 papers)
  5. Mario Lucic (42 papers)
Citations (531)

Summary

Assessing Generative Models via Precision and Recall

The paper "Assessing Generative Models via Precision and Recall" introduces a nuanced approach to evaluating generative models by expanding the traditional metrics of precision and recall to distributions. This methodology addresses the limitations of commonly used one-dimensional metrics such as Inception Score (IS) and Fréchet Inception Distance (FID), which, while useful, fail to differentiate between various failure modes of generative models, like mode collapse or mode inventing.

Key Contributions

  1. Two-dimensional Evaluation Metric: The authors redefine precision and recall for distributions, enabling a two-dimensional assessment that captures both sample quality (precision) and coverage of the target distribution (recall). Separating these two failure dimensions provides a more complete picture of a model's performance.
  2. Efficient Computational Algorithm: The paper introduces an efficient algorithm for estimating the proposed precision-recall curves from samples of the target and model distributions (see the sketch after this list).
  3. Empirical Validation: The authors conduct extensive experiments on models including several GAN and VAE variants, demonstrating the practical utility of their approach. Unlike FID or IS, which might offer similar scores for different underlying failures, this method delineates between high-quality sample generation and broad coverage of the target distribution.
  4. Insights into Generative Models: The paper confirms widely held beliefs about generative model behavior. For instance, it verifies that GANs often favor precision over recall, producing sharper images but occasionally collapsing to a subset of modes, whereas VAEs tend to cover more modes but suffer from lower precision, typically producing blurrier images.
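
The estimation procedure can be sketched as follows. This is a minimal, illustrative re-implementation of the histogram-based approach described in the paper, not the authors' reference code: it assumes samples have already been embedded in a feature space, clusters the pooled features with k-means, and evaluates alpha(lambda) = sum_w min(lambda * P(w), Q(w)) and beta(lambda) = alpha(lambda) / lambda over a range of slopes lambda. The function names, cluster count, and angle resolution are placeholders.

```python
# Minimal sketch (not the authors' reference implementation) of histogram-based
# PRD estimation: embed real and generated samples in a feature space, cluster
# the pooled features, and compare the resulting cluster histograms over a
# range of slopes lambda.
import numpy as np
from sklearn.cluster import KMeans

def prd_from_histograms(p, q, num_angles=1001, epsilon=1e-10):
    """PRD curve for two discrete distributions p (reference) and q (model)
    over the same bins: alpha(l) = sum_w min(l*p(w), q(w)), beta = alpha / l."""
    lambdas = np.tan(np.linspace(epsilon, np.pi / 2 - epsilon, num_angles))
    precision = np.minimum(lambdas[:, None] * p[None, :], q[None, :]).sum(axis=1)
    recall = precision / lambdas
    return np.clip(precision, 0.0, 1.0), np.clip(recall, 0.0, 1.0)

def compute_prd(real_features, gen_features, num_clusters=20, num_angles=1001):
    """Cluster the union of feature vectors and compare the two cluster histograms."""
    data = np.vstack([real_features, gen_features])
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(data)
    real_labels = labels[:len(real_features)]
    gen_labels = labels[len(real_features):]
    p = np.bincount(real_labels, minlength=num_clusters) / len(real_labels)
    q = np.bincount(gen_labels, minlength=num_clusters) / len(gen_labels)
    return prd_from_histograms(p, q, num_angles)
```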

Methodology and Theoretical Insights

The new precision and recall concepts hinge on a probabilistic framework that decomposes the divergence between the learned and reference distributions into two components. Precision measures the fraction of the model distribution that can be accounted for by the reference distribution (sample quality), whereas recall measures the fraction of the reference distribution that the model distribution covers (diversity).
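
Concretely, the paper's definition can be paraphrased as follows: for α, β ∈ (0, 1], the model distribution Q has precision α at recall β with respect to the reference distribution P if there exist distributions μ, ν_P, ν_Q such that

$$
P = \beta\,\mu + (1-\beta)\,\nu_P, \qquad Q = \alpha\,\mu + (1-\alpha)\,\nu_Q .
$$

Here μ is the common component that Q successfully reproduces, ν_P is the part of P the model misses (mode dropping lowers recall), and ν_Q is the extraneous mass the model invents (mode inventing lowers precision).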

The work relates these notions to existing metrics such as total variation, IS, and FID, highlighting the shortcomings of relying on a single scalar score. It proposes using pre-trained feature extractors to map samples into a feature space in which comparisons better capture the statistical properties of the underlying data distributions, which is crucial for high-dimensional data such as images.
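
A hedged sketch of such a feature-extraction step is shown below; like FID, it uses a pre-trained Inception-v3 network, but the specific extractor, preprocessing, and helper names are illustrative assumptions rather than the paper's exact setup.

```python
# Hypothetical feature-extraction step: map images into the 2048-d pooled
# feature space of a pre-trained Inception-v3 (classifier head removed).
import torch
from torchvision import models, transforms

def make_inception_extractor(device="cpu"):
    net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    net.fc = torch.nn.Identity()  # expose the 2048-d average-pooled features
    return net.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),          # Inception-v3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(images, extractor, device="cpu"):
    """images: a preprocessed batch tensor of shape (N, 3, 299, 299)."""
    return extractor(images.to(device)).cpu().numpy()
```

The resulting feature arrays for real and generated samples can then be passed to compute_prd from the earlier sketch.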

Results and Observations

Through experiments on datasets such as MNIST, CIFAR-10, and CelebA, the paper shows that the proposed metric exposes precision-recall trade-offs in a controlled way. In particular, the authors demonstrate that common one-dimensional metrics cannot distinguish mode dropping (failing to generate some parts of the target distribution) from mode inventing (generating unrealistic samples outside it), while the proposed metric can; a toy illustration is sketched below.
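
The following toy illustration (reusing prd_from_histograms from the sketch above; the numbers are illustrative and not taken from the paper) shows how the two failure modes separate along the two axes:

```python
# Reference P is uniform over 10 "real" modes. One model drops half the real
# modes; the other spreads half its mass over 10 spurious modes.
import numpy as np

p = np.zeros(20); p[:10] = 0.1                  # reference: 10 real modes
q_drop = np.zeros(20); q_drop[:5] = 0.2         # mode dropping
q_invent = np.full(20, 0.05)                    # mode inventing

for name, q in [("mode dropping", q_drop), ("mode inventing", q_invent)]:
    precision, recall = prd_from_histograms(p, q)
    print(f"{name}: max precision {precision.max():.2f}, max recall {recall.max():.2f}")
# Expected behaviour: the mode-dropping model reaches high precision but recall
# capped near 0.5, while the mode-inventing model reaches high recall but
# precision capped near 0.5 -- two failures a single score would conflate.
```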

A key contribution is the ability to visualize PRD (precision-recall for distributions) curves, giving researchers a concrete way to identify and dissect model trade-offs. This enables a more informed discussion of the trade-offs inherent in different architectures and training paradigms, and provides clearer guidance for model selection and further development.
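
Plotting such curves is straightforward; the snippet below (assuming matplotlib and the toy histograms defined above) mirrors the kind of PRD plots shown in the paper:

```python
# Plot precision against recall for the two toy models defined above.
import matplotlib.pyplot as plt

for name, q in [("mode dropping", q_drop), ("mode inventing", q_invent)]:
    precision, recall = prd_from_histograms(p, q)
    plt.plot(recall, precision, label=name)
plt.xlabel("recall"); plt.ylabel("precision")
plt.xlim(0, 1.05); plt.ylim(0, 1.05)
plt.legend(); plt.title("PRD curves (toy example)")
plt.show()
```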

Implications and Future Directions

This work shifts how generative models should be evaluated, advocating a more detailed analysis of model performance along separate dimensions. It lays the groundwork for future research to refine these metrics or to develop analogous metrics for other kinds of distribution comparison.

As AI continues to evolve, this method could influence evaluation standards for other models that generate high-dimensional data, offering a framework that combines qualitative and quantitative assessment.

In summary, this paper provides a valuable, more granular perspective on the evaluation of generative models, challenging and expanding current evaluation methodologies and paving the way for more rigorously assessed generative modeling techniques.
