- The paper proposes effectively unbiased alternatives, FID∞ and IS∞, which use extrapolation and Quasi-Monte Carlo (QMC) methods to overcome finite-sample bias.
- It demonstrates a linear relationship between metric scores and 1/N, validating the extrapolation technique for more reliable model comparisons.
- Rigorous experiments show that Sobol sequence integrators substantially lower variance, enhancing evaluation fidelity of deep generative models.
An Analysis of Bias in Fréchet Inception Distance and Inception Score Metrics
The paper by Chong and Forsyth from the University of Illinois tackles the prevalent issue of bias in two widely used evaluation metrics for deep generative models: Fréchet Inception Distance (FID) and Inception Score (IS). This bias is intrinsic to the computation of FID and IS over finite sample sizes and leads to unreliable comparisons between models. The authors propose alternative metrics, FID∞ and IS∞, designed to provide effectively unbiased estimates of the scores that would be obtained with an infinite number of samples. Their work introduces Quasi-Monte Carlo (QMC) methods as a means to improve the accuracy and reliability of these metrics.
Bias Challenges with FID and IS
The paper begins by systematically demonstrating that both FID and IS are biased when computed from a finite set of samples. The bias arises because the expected value of a score computed on N generated samples does not equal the score's true (infinite-sample) value, and the size of the bias depends on both the generator and N. Consequently, models can be ranked incorrectly simply because of differences in their biased estimates rather than differences in actual performance.
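This finite-sample bias can be seen in a toy computation. The sketch below is an illustration, not the paper's code: it uses low-dimensional Gaussian features in place of real Inception activations, and fits Gaussians to "real" and "generated" features drawn from the same distribution, so the true FID is exactly zero. The finite-N estimate is nonetheless positive, and it shrinks as N grows:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians fitted to feature sets."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

rng = np.random.default_rng(0)
d = 8  # stand-in for the 2048-d Inception features
# "real" and "generated" features come from the same Gaussian,
# so the true (infinite-sample) FID is 0
real = rng.normal(size=(10_000, d))
mu_r, s_r = real.mean(axis=0), np.cov(real, rowvar=False)

def fid_at(n):
    gen = rng.normal(size=(n, d))
    mu_g, s_g = gen.mean(axis=0), np.cov(gen, rowvar=False)
    return fid(mu_r, s_r, mu_g, s_g)

# averaged over trials, the small-N estimate is biased well above 0,
# and the bias shrinks as N grows
small = np.mean([fid_at(100) for _ in range(20)])
large = np.mean([fid_at(5_000) for _ in range(20)])
print(small, large)
```

Averaging over repeated draws makes the systematic (upward) bias visible separately from per-draw noise, which is exactly why a single finite-N FID value can mis-rank models.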
Proposed Remedies: FID∞ and IS∞
The authors introduce FID∞ and IS∞ as effectively unbiased metrics, using extrapolation to overcome the bias present at any finite N. Scores are computed at several sample sizes and extrapolated to estimate the value as N approaches infinity, removing the dominant bias term. They additionally employ QMC methods, specifically Sobol sequence integrators, to reduce the variance of the finite-sample estimates and thereby stabilize the extrapolation.
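The extrapolation itself amounts to a small linear regression. The following sketch is hypothetical code, not the authors' implementation: `score_fn` is a stand-in for evaluating FID or IS at a given sample size N, and the toy score has an artificial 1/N bias term so the intercept can be checked against a known true value:

```python
import numpy as np

def score_infinity(score_fn, batch_sizes):
    """Fit score vs. 1/N with a line; the intercept is the N -> inf estimate."""
    inv_n = np.array([1.0 / n for n in batch_sizes])
    scores = np.array([score_fn(n) for n in batch_sizes])
    slope, intercept = np.polyfit(inv_n, scores, deg=1)
    return intercept

# toy score: true value 3.0 plus a 500/N bias term and a little noise
rng = np.random.default_rng(1)
toy = lambda n: 3.0 + 500.0 / n + rng.normal(scale=0.01)

est = score_infinity(toy, [5_000, 10_000, 20_000, 50_000])
print(round(est, 2))  # close to the true value 3.0
```

Because the fit extrapolates to 1/N = 0, noise in the individual scores propagates into the intercept; this is where the paper's QMC variance reduction pays off.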
Evaluation of Methodology
In their experiments, the authors convincingly show that FID and IS scores vary linearly with $1/N$, which justifies the proposed linear extrapolation. By reducing variance with QMC methods, the methodology yields higher-fidelity estimates of both FID∞ and IS∞. The paper includes rigorous comparisons of standard i.i.d. sampling against QMC approaches, demonstrating lower standard deviations and improved accuracy with Sobol sequences.
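The variance reduction from Sobol sequences can be illustrated with SciPy's QMC module. In the paper, Sobol points serve as latent inputs to the generator; the toy example below replaces that setup with a simple integrand over the unit cube (f(z) = sum of coordinates, whose true mean is d/2) and compares the error of i.i.d. Monte Carlo sampling against a scrambled Sobol sequence at the same sample count:

```python
import numpy as np
from scipy.stats import qmc

d, n, trials = 8, 1024, 10  # n is a power of 2, as Sobol sequences prefer
true_value = d / 2.0        # E[sum(z)] for z uniform on the unit cube

mc_errs, qmc_errs = [], []
for seed in range(trials):
    # plain i.i.d. Monte Carlo estimate of the mean
    mc = np.random.default_rng(seed).random((n, d)).sum(axis=1).mean()
    # scrambled Sobol (QMC) estimate with the same sample budget
    sob = qmc.Sobol(d=d, scramble=True, seed=seed).random(n).sum(axis=1).mean()
    mc_errs.append(abs(mc - true_value))
    qmc_errs.append(abs(sob - true_value))

print(np.mean(mc_errs), np.mean(qmc_errs))  # Sobol error is far smaller
```

The Monte Carlo error decays like O(1/√N), while the Sobol estimate's error decays close to O(1/N) for smooth integrands, which is the mechanism behind the lower standard deviations the paper reports.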
Practical Implications and Next Steps
The implications of refining the measurement criteria using FID∞ and IS∞ are profound for both theoretical research and practical implementations within the field of generative models. Researchers can make more reliable comparisons between models, potentially influencing the landscape of model development and architectural choices. The paper sets the stage for further investigation into more complex integrator designs and broader application of QMC techniques to diverse model architectures. Additionally, future work could examine how strongly these effectively unbiased metrics correlate with perceptual evaluations such as HYPE.
Conclusion
Chong and Forsyth's paper addresses inherent biases in established metrics for generative models and proposes a robust solution that promises more accurate comparisons of model performance. Their combination of extrapolation and QMC methods is not only an academic advance: it could change the standard practice for evaluating generative models, and it points toward reducing similar estimator biases in other computational fields.