Rethinking FID: Towards a Better Evaluation Metric for Image Generation (2401.09603v2)

Published 30 Nov 2023 in cs.CV

Abstract: As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.

Authors (6)
  1. Sadeep Jayasumana (19 papers)
  2. Srikumar Ramalingam (40 papers)
  3. Andreas Veit (29 papers)
  4. Daniel Glasner (7 papers)
  5. Ayan Chakrabarti (42 papers)
  6. Sanjiv Kumar (123 papers)
Citations (58)

Summary

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

The paper "Rethinking FID: Towards a Better Evaluation Metric for Image Generation" authors propose a critical reevaluation of the popular Fréchet Inception Distance (FID) used in assessing the quality of generated images. The authors highlight multiple shortcomings of FID and introduce an alternative, the CLIP-MMD (CMMD) metric, which they argue addresses these problems more effectively.

Background and Motivation

Image generation models, especially text-to-image models, have advanced significantly, making robust evaluation metrics crucial. The FID metric, based on Inception-v3 embeddings, has been widely adopted to measure the discrepancy between distributions of real and generated images. However, FID rests on assumptions, and suffers from limitations, that the authors argue render it inadequate for modern generative models.

Limitations of FID

Inception Embeddings

FID is computed from Inception-v3 embeddings. The Inception-v3 network was trained on the ImageNet dataset, which contains about 1 million images covering only 1,000 distinct classes. This limited scope inadequately represents the complex and diverse content produced by contemporary text-to-image models.
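
To make this dependence concrete, below is a minimal sketch of extracting 2048-dimensional Inception-v3 features with torchvision. Reference FID implementations use a specific Inception port and preprocessing pipeline, so this is illustrative rather than exact.

```python
# Minimal sketch: 2048-d Inception-v3 pool features via torchvision.
# Assumes preprocessed (normalized, 299x299) RGB batches; reference FID
# implementations use a specific Inception port, so treat as illustrative.
import torch
import torchvision

model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # drop the classifier head; outputs 2048-d features
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 299, 299)  # stand-in for preprocessed images
    feats = model(batch)                 # shape: (4, 2048)
```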

Normality Assumption

A key assumption underlying FID is that the distributions of Inception embeddings are multivariate normal. The authors empirically demonstrate that Inception embeddings for real and generated image sets are far from normally distributed. Using statistical tests (Mardia's skewness and kurtosis tests and the Henze-Zirkler test), they show convincingly that Inception embeddings do not fit a multivariate normal distribution, undermining the core assumption of FID.
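
As a rough illustration of such a check, the sketch below implements Mardia's skewness and kurtosis tests in NumPy/SciPy; the exact test configuration used in the paper may differ, and the Henze-Zirkler test is omitted here.

```python
# Minimal sketch of Mardia's multivariate skewness and kurtosis tests,
# applied to an (n, p) array of embeddings. Small p-values reject normality;
# the paper reports that Inception embeddings fail such tests decisively.
import numpy as np
from scipy import stats

def mardia_tests(x: np.ndarray):
    n, p = x.shape
    xc = x - x.mean(axis=0)                        # center the sample
    s_inv = np.linalg.pinv(np.cov(x, rowvar=False))
    d = xc @ s_inv @ xc.T                          # Mahalanobis cross-products
    b1 = (d ** 3).sum() / n ** 2                   # multivariate skewness
    b2 = (np.diag(d) ** 2).sum() / n               # multivariate kurtosis
    p_skew = stats.chi2.sf(n * b1 / 6.0, p * (p + 1) * (p + 2) / 6.0)
    z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    p_kurt = 2 * stats.norm.sf(abs(z))             # two-sided kurtosis test
    return p_skew, p_kurt

# Truly Gaussian data should yield large p-values:
rng = np.random.default_rng(0)
print(mardia_tests(rng.standard_normal((500, 8))))
```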

Sample Inefficiency

Computing FID requires estimating a covariance matrix in a high-dimensional space (2048 dimensions), which demands a large number of samples. This sample inefficiency leads to high computational cost and to unreliable estimates when only small image sets are available.
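
For reference, here is a minimal NumPy/SciPy sketch of the FID formula itself, assuming two arrays of Inception features; the 2048x2048 covariance matrices and the cubic-cost matrix square root are where the expense lies.

```python
# Minimal sketch of the Frechet distance between two Gaussians fitted to
# (n, 2048) feature arrays: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)  # 2048 x 2048 covariance
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2).real   # matrix square root, O(d^3)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```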

Empirical Evidence Against FID

The authors present empirical evidence showcasing FID's limitations, particularly:

  1. Contradiction with human raters: FID often does not align with human subjective evaluations of image quality.
  2. Inaccurate reflection of progressive improvements: FID fails to improve monotonically across iterative refinement steps in models like Muse and Stable Diffusion, incorrectly suggesting quality degradation.
  3. Inadequate handling of complex distortions: FID fails to capture quality degradation under complex distortions applied in the latent space of VQGAN.

The CMMD Metric

To address the limitations of FID, the authors propose CMMD (CLIP-MMD). This metric leverages CLIP embeddings and the Maximum Mean Discrepancy (MMD) distance with a Gaussian RBF kernel.
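
A minimal sketch of the unbiased MMD^2 estimator with a Gaussian RBF kernel appears below, taking two arrays of CLIP embeddings; the bandwidth value here is an illustrative placeholder rather than the setting fixed by the authors.

```python
# Minimal sketch: unbiased MMD^2 with a Gaussian RBF kernel between (n, d)
# and (m, d) embedding arrays. The bandwidth sigma is a placeholder value.
import numpy as np

def mmd2_unbiased(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    def rbf(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-sq / (2.0 * sigma ** 2))
    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    n, m = len(x), len(y)
    # Excluding the diagonal self-similarity terms makes the estimator unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())
```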

Key Advantages of CMMD

  1. Rich Embeddings: CLIP was trained on 400 million image-text pairs, so its embeddings provide a richer representation of the diverse content generated by modern models.
  2. Distribution-Free: By using MMD, CMMD avoids any assumptions about the distribution of embeddings, making it theoretically sound and robust.
  3. Unbiased and Sample Efficient: CMMD uses an unbiased MMD estimator and demonstrates high sample efficiency in practice, making it less computationally expensive than FID.
  4. Human alignment: CMMD shows strong agreement with human subjective evaluations, making it a practical choice for evaluating image quality.

Experiments and Results

The authors conduct extensive experiments to validate their claims:

  • Human Evaluation: In scenarios where human raters significantly favored one model over another, CMMD aligned with these preferences, whereas FID did not.
  • Image Distortions: CMMD correctly identified quality degradation in images progressively distorted in the latent space, while FID failed to do so.
  • Sample Efficiency: CMMD provided reliable estimates even with smaller sample sizes, whereas FID required very large sample sets to stabilize; a simple way to probe this behavior is sketched below.
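
As a rough way to observe this sensitivity, one could evaluate a metric on random subsamples of increasing size and watch when the value stabilizes. The sketch below assumes feature arrays and a metric callable, such as the illustrative fid or mmd2_unbiased functions sketched above.

```python
# Minimal sketch of a sample-efficiency probe: evaluate `metric` on random
# subsamples of increasing size; a flat curve indicates sample efficiency.
import numpy as np

def stability_curve(metric, feats_real, feats_gen, sizes=(100, 500, 1000, 5000)):
    rng = np.random.default_rng(0)
    curve = []
    for n in sizes:
        idx_r = rng.choice(len(feats_real), size=n, replace=False)
        idx_g = rng.choice(len(feats_gen), size=n, replace=False)
        curve.append((n, metric(feats_real[idx_r], feats_gen[idx_g])))
    return curve
```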

Implications and Future Work

The adoption of CMMD for evaluating image generation models has both practical and theoretical implications. Practically, it offers a more reliable and computationally efficient metric that aligns better with human perception. Theoretically, it underscores the importance of choosing appropriate embeddings and distance metrics in the evaluation of machine learning models.

Future research could explore further applications of MMD and CLIP embeddings in other domains of generative modeling and conduct additional comparisons with other emerging metrics.

Conclusion

The paper makes a compelling case for reevaluating FID and adopting CMMD as a more robust and reliable metric for evaluating image generation models. By addressing key limitations of FID, the CMMD metric promises to provide a more accurate reflection of the quality improvements in modern image generation techniques. Researchers in the field are encouraged to adopt CMMD for a more reliable and consistent evaluation framework.
