
Pros and Cons of GAN Evaluation Measures (1802.03446v5)

Published 9 Feb 2018 in cs.CV

Abstract: Generative models, in particular generative adversarial networks (GANs), have received significant attention recently. A number of GAN variants have been proposed and have been utilized in many applications. Despite large strides in terms of theoretical progress, evaluating and comparing GANs remains a daunting task. While several measures have been introduced, as of yet, there is no consensus as to which measure best captures strengths and limitations of models and should be used for fair model comparison. As in other areas of computer vision and machine learning, it is critical to settle on one or few good measures to steer the progress in this field. In this paper, I review and critically discuss more than 24 quantitative and 5 qualitative measures for evaluating generative models with a particular emphasis on GAN-derived models. I also provide a set of 7 desiderata followed by an evaluation of whether a given measure or a family of measures is compatible with them.

Authors: Ali Borji
Citations (837)

Summary

Overview of "Pros and Cons of GAN Evaluation Measures"

The paper "Pros and Cons of GAN Evaluation Measures" by Ali Borji critically discusses quantitative and qualitative evaluation metrics for Generative Adversarial Networks (GANs). The review covers more than 24 quantitative and 5 qualitative measures, analyzing their strengths, limitations, and suitability for different GAN evaluation scenarios. The central objective is to identify robust evaluation measures that enable fair model comparison and steer progress in the field of generative models.

Quantitative Measures

Inception Score (IS): This score leverages a pre-trained Inception model to evaluate generated samples based on their classifiability and diversity across class labels. It measures the KL divergence between the conditional label distribution p(y|x) and the marginal distribution p(y) over all labels. Despite its widespread adoption, IS does not account for whether the generated samples cover the true data distribution and can be gamed by generating a single realistic example per class.
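As an illustration, here is a minimal sketch of the IS computation, assuming the (N, C) array of softmax outputs from a pre-trained Inception classifier has already been computed upstream:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, C) array of class probabilities p(y|x).

    Obtaining `probs` from a pre-trained Inception network is assumed to
    happen upstream; this only computes exp(E_x[KL(p(y|x) || p(y))]).
    """
    p_y = probs.mean(axis=0)                      # marginal distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))
```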

Fréchet Inception Distance (FID): This metric computes the Fréchet distance (the 2-Wasserstein distance) between two multivariate Gaussians fitted to real and generated data embeddings obtained from a pre-trained Inception model. FID is considered superior to IS because it can detect intra-class mode dropping and mismatches with the true data distribution, while remaining computationally efficient and correlating well with human judgment.
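For Gaussians (mu_r, Sigma_r) and (mu_g, Sigma_g), the distance is ||mu_r − mu_g||^2 + Tr(Sigma_r + Sigma_g − 2(Sigma_r Sigma_g)^(1/2)). A minimal sketch, assuming the means and covariances have already been estimated from Inception embeddings:

```python
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g),
    fitted to Inception embeddings of real and generated images."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```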

Maximum Mean Discrepancy (MMD): MMD measures the dissimilarity between the real and generated distributions via a kernel two-sample test. It is a robust measure with low sample and computational complexity, making it practical for monitoring GAN training.
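A minimal sketch of the (biased) squared-MMD estimate with a Gaussian RBF kernel; the kernel choice and bandwidth `sigma` are illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased squared-MMD estimate between samples X (n, d) and Y (m, d)
    using an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())
```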

Classifier Two-Sample Tests (C2ST): This measure uses classifiers to distinguish samples from the real and generated datasets. A binary classifier is trained to differentiate between the two, and its accuracy serves as the evaluation metric. Variations such as using a 1-NN classifier simplify the process.
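A sketch of the 1-NN variant: real and generated features are pooled and labeled, and a 1-NN classifier's cross-validated accuracy is reported; accuracy near 0.5 means the two sets are indistinguishable. The 5-fold split here is an assumption standing in for the leave-one-out protocol:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def c2st_1nn(real_feats, gen_feats):
    """1-NN classifier two-sample test on pooled, labeled features."""
    X = np.vstack([real_feats, gen_feats])
    y = np.r_[np.zeros(len(real_feats)), np.ones(len(gen_feats))]
    clf = KNeighborsClassifier(n_neighbors=1)
    return float(cross_val_score(clf, X, y, cv=5).mean())
```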

Adversarial Accuracy and Divergence: These metrics evaluate how well the generated data matches the real data distributions in terms of conditional label distributions. Classifiers are trained on generated and real data separately, and their performance on validation sets is measured.
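A rough sketch of this protocol under stated assumptions (logistic regression as the probe classifier, which the paper does not prescribe): one classifier is trained on labeled generated data and one on labeled real data, and their accuracies on the same real validation set are compared:

```python
from sklearn.linear_model import LogisticRegression

def adversarial_accuracy(gen_X, gen_y, real_X, real_y, val_X, val_y):
    """Train one classifier on generated (X, y) pairs and one on real pairs,
    then compare their accuracies on a held-out real validation set; similar
    accuracies suggest matching conditional label distributions."""
    clf_gen = LogisticRegression(max_iter=1000).fit(gen_X, gen_y)
    clf_real = LogisticRegression(max_iter=1000).fit(real_X, real_y)
    return clf_gen.score(val_X, val_y), clf_real.score(val_X, val_y)
```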

Qualitative Measures

Human Perceptual Judgment: This includes user studies where participants are asked to distinguish between real and generated images or rate the quality of generated images. While this method relates closely to real-world applicability, it is subject to biases, high cost, and reproducibility challenges.

Nearest Neighbors Analysis: Generated samples are compared to their nearest neighbors in the training set to detect overfitting. This method can be deceived by small transformations of training images and may favor models that simply memorize training examples.
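A minimal sketch of the lookup step, assuming Euclidean distance in some feature space (pixel space or Inception features; the choice is an assumption), with visual inspection of the retrieved pairs left to the analyst:

```python
from sklearn.neighbors import NearestNeighbors

def nearest_training_neighbors(gen_feats, train_feats, k=3):
    """For each generated sample, return distances to and indices of its k
    nearest training samples; inspecting the pairs helps reveal memorization."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    dist, idx = nn.kneighbors(gen_feats)
    return dist, idx
```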

Rapid Scene Categorization: Inspired by psychophysical studies, this test involves showing images briefly to participants to see if they can distinguish generated images from real ones. This "Turing-like" test is intuitive but may not capture diversity adequately.

Evaluating Mode Collapse: Methods such as the Birthday Paradox Test and the analysis of mode distribution on synthetic datasets assess the diversity of generated samples. These methods can detect mode collapse but might be impractical for large datasets.
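For intuition on the Birthday Paradox Test: sample a batch of size s and look for near-duplicates; if duplicates appear reliably, the generator's support size is roughly s^2. A sketch of the automated candidate-pair search (the actual protocol confirms duplicates by human inspection):

```python
import numpy as np

def closest_pair(feats):
    """Find the most similar pair in a batch of generated-sample features
    (simple Euclidean distance); the pair is a duplicate *candidate* only."""
    d2 = ((feats[:, None, :] - feats[None, :, :])**2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    i, j = np.unravel_index(np.argmin(d2), d2.shape)
    return i, j, float(np.sqrt(d2[i, j]))
```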

Internal Network Dynamics: Exploring latent space interpolations, feature visualizations, and hierarchical representations helps reveal what the GAN has learned. These techniques provide insight into latent space continuity, making it possible to assess whether the model produces novel images or merely memorizes the training data.
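As an example of latent space interpolation, here is a minimal sketch of spherical interpolation (slerp) between two latent codes, often preferred over linear interpolation for Gaussian priors; decoding G(slerp(z0, z1, t)) as t sweeps 0 to 1 should yield smooth image transitions if the space is continuous:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between latent vectors z0 and z1 at t in [0, 1]."""
    omega = np.arccos(np.clip(np.dot(z0 / np.linalg.norm(z0),
                                     z1 / np.linalg.norm(z1)), -1.0, 1.0))
    if np.isclose(omega, 0.0):                    # nearly parallel: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```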

Implications and Future Directions

The detailed review places FID and IS at the forefront due to their discriminative power and computational efficiency. However, the quest for a definitive evaluation measure remains open: no single metric comprehensively covers all aspects of GAN performance, such as fidelity, diversity, and semantic quality. Measures often overlap in their conceptual foundations, employing classifiers, distance metrics, and feature comparisons to achieve their goals.

Future research should focus on:

  1. Creating a standardized repository of evaluation code to ensure consistency and reproducibility in GAN assessments.
  2. Conducting empirical and analytical analyses to compare and benchmark different measures under uniform conditions.
  3. Developing evaluation metrics that can encapsulate both fidelity and diversity while remaining computationally feasible.

In summary, "Pros and Cons of GAN Evaluation Measures" serves as a critical guide for researchers, offering detailed insights into existing evaluation techniques and paving the way for more robust and comprehensive methods in the future development of GANs.