An Empirical Study on Evaluation Metrics of Generative Adversarial Networks
This research paper presents an extensive empirical analysis of sample-based evaluation metrics for Generative Adversarial Networks (GANs). The authors address the challenge of evaluating the metrics themselves, scrutinizing each metric's ability to distinguish generated from real samples, its sensitivity to mode dropping and mode collapsing, and its ability to detect overfitting.
Key Metrics Evaluated
The paper focuses on several widely used metrics, including:
- Inception Score (IS): Evaluates the quality and diversity of generated images using an Inception network pretrained on ImageNet.
- Mode Score (MS): Extends IS by incorporating a comparison with the real distribution.
- Kernel Maximum Mean Discrepancy (MMD): Measures the discrepancy between the real and generated distributions using a kernel function; a minimal sketch appears after this list.
- Wasserstein Distance (WD): Computes the Earth Mover's Distance between two distributions.
- Fréchet Inception Distance (FID): Models distributions as Gaussians in feature space, capturing differences in mean and covariance.
- 1-Nearest Neighbor (1-NN) Two-Sample Test: Assesses how close the two distributions are via the leave-one-out accuracy of a 1-NN classifier on pooled real and generated samples; accuracy near 50% indicates the distributions are hard to tell apart.
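To make the kernel MMD concrete, below is a minimal NumPy sketch of a Gaussian-kernel (biased) estimator of squared MMD between two sets of feature vectors. The choice of bandwidth `sigma`, the placeholder arrays, and the assumption that features have already been extracted are illustrative only, not the paper's exact configuration.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of x and y,
    # passed through a Gaussian (RBF) kernel.
    d2 = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(real_feats, gen_feats, sigma=1.0):
    # Biased (V-statistic) estimate of squared MMD: mean within-set kernel
    # values minus twice the mean cross-set kernel value.
    k_rr = gaussian_kernel(real_feats, real_feats, sigma)
    k_gg = gaussian_kernel(gen_feats, gen_feats, sigma)
    k_rg = gaussian_kernel(real_feats, gen_feats, sigma)
    return k_rr.mean() + k_gg.mean() - 2 * k_rg.mean()

# Placeholder feature vectors standing in for real/generated image features.
real = np.random.randn(200, 64)
fake = np.random.randn(200, 64) + 0.5
print(mmd2(real, fake, sigma=8.0))
```

In practice the feature vectors would come from a pretrained convolutional network, consistent with the paper's finding that feature-space metrics are more robust than pixel-space ones.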
Significant Findings
- Discriminability: The paper identifies MMD and 1-NN as the most effective metrics in distinguishing between real and generated images, with superior sensitivity to mode dropping and collapsing.
- Robustness: The metrics were tested for robustness to image transformations such as small translations and rotations. Metrics computed in a convolutional feature space remained stable under these transformations, while metrics computed directly on pixel distances did not.
- Efficiency: MMD and 1-NN accuracy achieved discriminative power with relatively few samples. The Wasserstein Distance, in contrast, exhibited high sample complexity and computational cost.
- Overfitting Detection: MMD and 1-NN accuracy showed potential for detecting overfitting by measuring the score gap between the training and validation sets, highlighting generalization differences (see the sketch after this list).
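As a concrete illustration of the 1-NN two-sample test and the train/validation score gap used as an overfitting signal, here is a minimal NumPy sketch. The `one_nn_accuracy` helper and the placeholder feature arrays are hypothetical and stand in for features extracted from real and generated images; this is not the authors' released code.

```python
import numpy as np

def one_nn_accuracy(real_feats, gen_feats):
    # Pool the two sample sets and label them 0 (real) / 1 (generated).
    X = np.concatenate([real_feats, gen_feats], axis=0)
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(gen_feats))])
    # Pairwise squared distances; exclude each point itself for leave-one-out.
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    nearest = np.argmin(d2, axis=1)
    # Fraction of points whose nearest neighbor carries the same label;
    # values near 0.5 mean the two sets are hard to distinguish.
    return np.mean(y[nearest] == y)

# Placeholder features; in practice these would be convolutional features
# of training images, held-out validation images, and generated images.
train_feats = np.random.randn(300, 64)
val_feats   = np.random.randn(300, 64)
gen_feats   = np.random.randn(300, 64) * 1.2

gap = one_nn_accuracy(train_feats, gen_feats) - one_nn_accuracy(val_feats, gen_feats)
print(gap)  # a large gap between the two scores would hint at memorization of the training set
```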
Implications and Future Directions
The paper underscores the necessity of computing GAN evaluation metrics in a suitable feature space (e.g., convolutional features rather than raw pixels) for effective performance. It establishes MMD and 1-NN accuracy as preferable evaluation tools due to their robust performance across multiple tests. It also highlights inherent limitations of metrics such as the Inception Score, especially when applied to datasets dissimilar to ImageNet, on which its underlying classifier was trained.
Looking forward, the findings suggest opportunities for developing more refined metrics that offer even greater efficiency and accuracy in various GAN applications. Furthermore, the insights into overfitting raise intriguing questions about GAN generalization that may prompt new theoretical explorations in unsupervised learning scenarios.
The authors have contributed an open-source implementation of their metric evaluations, providing valuable resources for further advancements in GAN research and development.