Overview of "Classification Accuracy Score for Conditional Generative Models"
The paper introduces a novel evaluation metric, the Classification Accuracy Score (CAS), for assessing how well conditional generative models produce data that is useful for downstream tasks such as image classification. CAS trains a classifier on samples drawn from a class-conditional generative model and measures that classifier's accuracy on real test data. Although deep generative models (DGMs) have reached a stage where they can produce photorealistic samples, evaluating them beyond perceptual metrics such as the Fréchet Inception Distance (FID) remains challenging. The paper argues that these models should be evaluated on their utility in downstream tasks rather than exclusively on their ability to replicate the training data distribution.
The paper applies CAS to variational autoencoders, autoregressive models, and generative adversarial networks (GANs), testing how well classifiers trained on their class-conditional samples can label real data. Several notable findings emerge from this evaluation, challenging the reliability of conventional GAN metrics such as the Inception Score (IS) and FID.
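The CAS procedure described above can be sketched in a few lines: train a classifier on labeled samples from the generative model, then report its accuracy on real held-out data. The sketch below is a minimal toy illustration, not the paper's implementation; the Gaussian "generated" data, the `shift` mismatch parameter, and the logistic-regression classifier are all stand-ins for the paper's ImageNet samples and deep classifiers.

```python
# Toy sketch of the Classification Accuracy Score (CAS) idea.
# NOTE: all data here is synthetic stand-in; the paper trains deep
# classifiers on DGM samples and evaluates on real ImageNet data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_class_data(n, shift):
    # Two 2-D Gaussian classes; `shift` (hypothetical knob) mimics a
    # mismatch between the generator's distribution and the real one.
    x0 = rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=2.0 + shift, scale=1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

# "Real" held-out test set, and "generated" training set (shifted).
X_real_test, y_real_test = make_class_data(500, shift=0.0)
X_generated, y_generated = make_class_data(500, shift=0.5)

# CAS: fit on generated data, score on real data.
clf = LogisticRegression().fit(X_generated, y_generated)
cas = accuracy_score(y_real_test, clf.predict(X_real_test))
print(f"CAS (accuracy on real test data): {cas:.3f}")
```

A faithful generator (small distribution mismatch) yields a CAS close to the accuracy of a classifier trained on real data; a poor one drags CAS down, which is exactly the gap the paper measures for BigGAN-deep and other models.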
Key Findings
- Performance Gaps in Generative Models: The authors report significant performance deficits when training classifiers on samples from top-tier GANs such as BigGAN-deep. Relative to training on real data, CAS reveals substantial decreases in accuracy: Top-1 and Top-5 accuracies drop by 27.9% and 41.6%, respectively.
- Comparative Model Performance: Models such as the Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical Autoregressive Models (HAMs) outperform GANs on CAS despite scoring worse on established GAN metrics such as IS and FID. This suggests that IS and FID are poor indicators of a model's ability to capture the data distribution or to support downstream tasks.
- Automatic Detection of Model Failures: CAS provides the means to identify specific classes for which generative models fail to properly capture the data distribution. It surfaces previously undocumented deficiencies in generated data that generic GAN metrics do not reveal.
- Lack of Predictive Power of Traditional Metrics: The paper highlights the lack of correlation between IS/FID and CAS, indicating that conventional metrics do not predict generative model performance on CAS. This discrepancy emphasizes the need to evaluate models using metrics aligned with specific tasks they are intended to perform.
- Naive Augmentation Score (NAS): The paper also assesses a variant of CAS, the Naive Augmentation Score, in which model-generated data augments the original dataset rather than replacing it. This setting yields slight improvements in Top-5 accuracy, suggesting that carefully chosen synthetic data can enhance real training data, albeit under limited conditions.
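The augmentation setting behind NAS differs from CAS only in the training set: generated samples are concatenated with the real training data instead of substituting for it. The sketch below illustrates that one-line difference on the same kind of toy Gaussian data as before; the `shift` mismatch and logistic-regression classifier are illustrative assumptions, not the paper's setup.

```python
# Toy sketch contrasting a real-data baseline with the Naive
# Augmentation Score (NAS) setting: generated samples are ADDED to,
# not substituted for, the real training set. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_class_data(n, shift):
    # Two 2-D Gaussian classes; `shift` mimics generator mismatch.
    x0 = rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=2.0 + shift, scale=1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_real_train, y_real_train = make_class_data(500, shift=0.0)
X_real_test, y_real_test = make_class_data(500, shift=0.0)
X_gen, y_gen = make_class_data(500, shift=0.5)

# Baseline: classifier trained on real data only.
base = LogisticRegression().fit(X_real_train, y_real_train)
base_acc = accuracy_score(y_real_test, base.predict(X_real_test))

# NAS setting: real training data augmented with generated samples.
X_aug = np.vstack([X_real_train, X_gen])
y_aug = np.concatenate([y_real_train, y_gen])
aug = LogisticRegression().fit(X_aug, y_aug)
nas = accuracy_score(y_real_test, aug.predict(X_real_test))
print(f"real-only accuracy: {base_acc:.3f}  NAS: {nas:.3f}")
```

Whether `nas` exceeds `base_acc` depends on how faithful the generated samples are, which mirrors the paper's observation that augmentation helps only under limited conditions.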
Implications and Future Directions
The introduction of CAS underscores the need for evaluation metrics tailored to task performance, particularly as generative models advance and their applications extend to downstream use-cases beyond image generation. The results reveal a clear gap between models' generative capability as measured by conventional metrics and their utility in practical tasks, calling for tighter alignment between model evaluation methods and real-world applications.
For future development, researchers should consider evolving metrics that capture the nuances of task-specific performance. There is potential to refine such metrics to capture generalization beyond the training set, an increasingly important factor as generative models move into practical applications across domains.
In conclusion, this paper makes a substantial contribution to the evaluation of conditional generative models by introducing CAS, a metric that bridges the gap between perceptual fidelity and practical utility. As the field of AI continues to expand, methodologies such as this will be critical in ensuring that generative models are both innovative and applicable to tangible tasks.