Rethinking FID: Towards a Better Evaluation Metric for Image Generation (2401.09603v2)
Abstract: As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content produced by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size varies. We also propose CMMD, an alternative metric based on richer CLIP embeddings and the maximum mean discrepancy (MMD) distance with a Gaussian RBF kernel. CMMD is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
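The MMD computation the abstract describes can be sketched as follows. This is a minimal illustration of the standard unbiased squared-MMD estimator with a Gaussian RBF kernel applied to precomputed embedding matrices; the CLIP feature-extraction step is omitted, and the bandwidth `sigma` and toy Gaussian data are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    # Gaussian RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Drop diagonal (self-similarity) terms so the within-sample
    # averages are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

# Toy check: matched distributions should score lower than mismatched ones.
rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 8)),
                     rng.normal(size=(500, 8)))
shifted = mmd2_unbiased(rng.normal(size=(500, 8)),
                        rng.normal(loc=3.0, size=(500, 8)))
assert same < shifted
```

Because the estimator is unbiased, `same` can be slightly negative for matched distributions; it concentrates near zero as the sample size grows, which is the sample-efficiency property the abstract contrasts with FID.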
- Sadeep Jayasumana
- Srikumar Ramalingam
- Andreas Veit
- Daniel Glasner
- Ayan Chakrabarti
- Sanjiv Kumar