An empirical study on evaluation metrics of generative adversarial networks (1806.07755v2)

Published 19 Jun 2018 in cs.LG, cs.CV, and stat.ML

Abstract: Evaluating generative adversarial networks (GANs) is inherently challenging. In this paper, we revisit several representative sample-based evaluation metrics for GANs, and address the problem of how to evaluate the evaluation metrics. We start with a few necessary conditions for metrics to produce meaningful scores, such as distinguishing real from generated samples, identifying mode dropping and mode collapsing, and detecting overfitting. With a series of carefully designed experiments, we comprehensively investigate existing sample-based metrics and identify their strengths and limitations in practical settings. Based on these results, we observe that kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbor (1-NN) two-sample test seem to satisfy most of the desirable properties, provided that the distances between samples are computed in a suitable feature space. Our experiments also unveil interesting properties about the behavior of several popular GAN models, such as whether they are memorizing training samples, and how far they are from learning the target distribution.

An Empirical Study on Evaluation Metrics of Generative Adversarial Networks

This paper presents an extensive empirical analysis of sample-based evaluation metrics for Generative Adversarial Networks (GANs). The authors address the challenge of evaluating the metrics themselves by scrutinizing how well each metric distinguishes generated from real samples, how sensitive it is to mode dropping and mode collapsing, and whether it can detect overfitting.

Key Metrics Evaluated

The paper focuses on several widely used metrics, including:

  • Inception Score (IS): Evaluates the quality and diversity of generated images via a pretrained Inception model on ImageNet.
  • Mode Score (MS): Extends IS by incorporating a comparison with the real distribution.
  • Kernel Maximum Mean Discrepancy (MMD): Measures the dissimilarity between distributions using a kernel function.
  • Wasserstein Distance (WD): Computes the Earth Mover's Distance between two distributions.
  • Fréchet Inception Distance (FID): Models distributions as Gaussians in feature space, capturing differences in mean and covariance.
  • 1-Nearest Neighbor (1-NN) Two-Sample Test: Assesses how close the two distributions are via the leave-one-out accuracy of a 1-NN classifier on the pooled real and generated samples (a code sketch of this test and of MMD follows this list).
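
The two metrics the paper ultimately favors are easy to state concretely. Below is a minimal NumPy sketch, not the authors' reference implementation, of a biased squared-MMD estimate with a Gaussian kernel and of the leave-one-out 1-NN accuracy. It assumes `real_feats` and `fake_feats` are feature vectors already extracted by a pretrained convolutional network (as the paper recommends), and the bandwidth `sigma` is an illustrative placeholder rather than the paper's setting.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two batches of feature vectors."""
    d2 = (np.sum(x ** 2, axis=1)[:, None]
          + np.sum(y ** 2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(real_feats, fake_feats, sigma=1.0):
    """Biased (V-statistic) estimate of squared kernel MMD between the two sets."""
    k_rr = gaussian_kernel(real_feats, real_feats, sigma)
    k_ff = gaussian_kernel(fake_feats, fake_feats, sigma)
    k_rf = gaussian_kernel(real_feats, fake_feats, sigma)
    return k_rr.mean() + k_ff.mean() - 2.0 * k_rf.mean()

def one_nn_accuracy(real_feats, fake_feats):
    """Leave-one-out 1-NN accuracy on the pooled real/generated samples."""
    X = np.concatenate([real_feats, fake_feats], axis=0)
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(X ** 2, axis=1)[None, :]
          - 2.0 * X @ X.T)
    np.fill_diagonal(d2, np.inf)   # a sample cannot be its own neighbor
    nn = np.argmin(d2, axis=1)     # index of each sample's nearest neighbor
    return float(np.mean(y[nn] == y))
```

A 1-NN accuracy close to 0.5 (chance level) suggests the real and generated sets are statistically indistinguishable, while values near 1 mean they separate easily; likewise, smaller MMD values indicate a closer match between the two distributions.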

Significant Findings

  1. Discriminability: The paper identifies MMD and 1-NN as the most effective metrics in distinguishing between real and generated images, with superior sensitivity to mode dropping and collapsing.
  2. Robustness: The metrics were tested for robustness to image transformations like translations and rotations. Metrics computed in convolutional feature spaces showed robustness, while those based on pixel distances did not.
  3. Efficiency: MMD and 1-NN accuracy demonstrated sample efficiency, reaching discriminative power with relatively few samples, whereas the Wasserstein Distance required far more samples and was computationally expensive.
  4. Overfitting Detection: MMD and 1-NN accuracy showed potential for detecting overfitting by measuring the gap between scores computed against the training set and against a held-out validation set, highlighting differences in generalization (a sketch of this check follows the list).
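
As a concrete illustration of the overfitting check in point 4, one can compute the same metric twice, once against features of the training images and once against a held-out validation split, and inspect the gap. This is a minimal sketch reusing the `mmd2` helper above; `train_feats`, `val_feats`, and `fake_feats` are hypothetical, pre-extracted feature arrays.

```python
# Hypothetical inputs: train_feats / val_feats are features of real images from
# the training and validation splits; fake_feats are features of generated samples.
mmd_train = mmd2(train_feats, fake_feats)
mmd_val = mmd2(val_feats, fake_feats)

# A generator that memorizes training images scores noticeably better (lower MMD)
# against the training split than against held-out data, so a large positive gap
# is a warning sign of overfitting rather than genuine generalization.
gap = mmd_val - mmd_train
print(f"train MMD = {mmd_train:.4f}, val MMD = {mmd_val:.4f}, gap = {gap:.4f}")
```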

Implications and Future Directions

The paper underscores the necessity of computing GAN evaluation metrics in a suitable feature space rather than in pixel space. It establishes MMD and 1-NN accuracy as preferable evaluation tools due to their robust performance across the tests. Additionally, it highlights inherent limitations of metrics such as the Inception Score, especially when applied to datasets dissimilar to the one it was originally designed for.

Looking forward, the findings suggest opportunities for developing more refined metrics that offer even greater efficiency and accuracy in various GAN applications. Furthermore, the insights into overfitting raise intriguing questions about GAN generalization that may prompt new theoretical explorations in unsupervised learning scenarios.

The authors have contributed an open-source implementation of their metric evaluations, providing valuable resources for further advancements in GAN research and development.

Authors (7)
  1. Qiantong Xu
  2. Gao Huang
  3. Yang Yuan
  4. Chuan Guo
  5. Yu Sun
  6. Felix Wu
  7. Kilian Weinberger
Citations (250)