FID and IS Metrics Overview

Updated 23 June 2026

FID and IS are key metrics for evaluating generative image models using deep features: FID compares the distributions of real and generated images, while IS scores generated images alone.
They are built on the 2-Wasserstein (Fréchet) distance and the KL divergence, respectively, capturing differences in image fidelity and diversity.
Extensions such as R-FID, CLIP-based FID, and flow-based methods address issues like sample bias and adversarial manipulation for improved robustness.

The Fréchet Inception Distance (FID) and Inception Score (IS) are the most widely used metrics for evaluating generative models, especially generative adversarial networks and diffusion models. FID quantifies the similarity between distributions of real and generated images in the feature space of a deep convolutional network (typically Inception-V3), modeling both sets as multivariate Gaussians and calculating the 2-Wasserstein (Fréchet) distance between them. IS evaluates the quality and diversity of generated samples by computing how confidently a fixed classifier assigns labels to generated images and how broadly the generated samples cover the label space. These metrics are computationally tractable and correlate with human judgment in specific settings, but both present challenges regarding interpretation, implementation, sample size dependence, feature-space choice, and susceptibility to manipulation.

1. Formal Definitions and Computation

Fréchet Inception Distance (FID):

Given distributions of real ( $P_r$ ) and generated ( $P_g$ ) images, compute the empirical mean and covariance of their feature embeddings ( $\mu_r, \Sigma_r$ for real; $\mu_g, \Sigma_g$ for generated), typically using the 2048-dimensional activations from the pool3 (pre-logits) layer of Inception-V3. The FID is given by:

$\mathrm{FID}(P_r, P_g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$

Feature extraction uses a fixed, ImageNet-pretrained Inception-V3. Best practices require matching sample sizes for real and generated sets (e.g., $N=50\,000$ ), official weights, and standard preprocessing and resizing protocols to reduce metric variance and ensure reproducibility (Kynkäänniemi et al., 2022).
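As a minimal sketch of the formula above, the following NumPy function computes the Fréchet distance from two feature arrays that are assumed to have been extracted already (the Inception-V3 forward pass is not shown). The function name `frechet_distance` is illustrative, not part of any library. Reference implementations commonly take a matrix square root of $\Sigma_r \Sigma_g$; this sketch instead uses the equivalent identity that the trace of that square root equals the sum of the square roots of the eigenvalues of $\Sigma_r \Sigma_g$, which avoids a SciPy dependency.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between two (N, D) feature arrays, D >= 2.

    Both sets are summarized by their empirical mean and covariance,
    i.e. treated as multivariate Gaussians in feature space.
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of sqrt of eigenvalues of Sigma_r Sigma_g.
    # The eigenvalues are real and non-negative in exact arithmetic; numerical
    # noise is removed by dropping imaginary parts and clipping at zero.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

With identical inputs the distance is zero up to floating-point error, and a pure translation of one set contributes only the squared norm of the mean shift.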

Inception Score (IS):

Given a pretrained Inception-V3 network, IS is defined as:

$\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g} \left[ \mathrm{KL}\left( p(y|x) \;\|\; p(y) \right) \right] \right)$

where $p(y|x)$ is the predicted class distribution for image $x$ and $p(y) = \mathbb{E}_{x \sim p_g}[p(y|x)]$ is the marginal predicted label distribution (Betzalel et al., 2022, Alfarra et al., 2022).
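The definition above can be sketched directly from an array of classifier outputs. This example assumes the softmax probabilities $p(y|x)$ have already been computed for each generated image; the function name `inception_score` and the smoothing constant `eps` are illustrative choices, not taken from the cited works.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, C) array of softmax outputs p(y|x)."""
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal label distribution p(y), averaged over generated images.
    p_y = probs.mean(axis=0, keepdims=True)
    # Per-image KL(p(y|x) || p(y)); eps guards against log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Two limiting cases illustrate the range: perfectly confident predictions spread evenly over $C$ classes give a score of about $C$, while identical uniform predictions for every image give a score of 1.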

2. Theoretical Properties and Metric Interpretation

FID estimates how close generated data is to real data in the high-level feature space, penalizing both mean shifts ("feature drift") and loss of covariance structure ("mode collapse" or poor diversity). IS combines sample quality (low entropy per-image label predictions) and diversity (high entropy over aggregate labels) into a single score, rewarding both high-confidence images and broad class coverage.

A key difference is that IS operates solely on generated images, while FID directly compares generated and real data. FID aligns more closely with classical $f$-divergences in probabilistic models (Betzalel et al., 2022).

However, FID's reliance on a fixed Gaussian approximation in ImageNet-trained feature space introduces domain dependence and may not respect semantic similarity in out-of-domain contexts. Both FID and IS, as single-score metrics, conflate fidelity and diversity, making it difficult to diagnose whether a model's gains reflect more realistic outputs, better coverage, or simply adaptation to embedding peculiarities (Naeem et al., 2020, Kim et al., 2023).

3. Limitations: Feature Dependence, Manipulability, and Implementation

Both metrics are sensitive to the properties of the underlying feature extractor. FID, in particular, is tightly coupled to the ImageNet class embedding structure:

Manipulating class-histogram alignment (e.g., by matching the distribution of Top-$N$ ImageNet classes) can dramatically lower FID without improving perceptual quality; empirical reductions as large as 60% are achieved purely by resampling generated image sets to align predicted class frequencies (Kynkäänniemi et al., 2022).
A Grad-CAM-style sensitivity analysis reveals that FID responds disproportionately to image regions associated with strong ImageNet class responses, often ignoring the main subject's features when these do not align with ImageNet classes.
Adversarial attacks—both in pixel space and generator latent space—can increase or decrease FID and IS with imperceptible image changes. For FID, this is achieved by maximizing feature-space distances, while for IS, maximizing entropy or classifier confusion produces low scores for visually plausible images and vice versa (Alfarra et al., 2022).
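The class-histogram manipulation described in the list above can be illustrated with a simplified resampler. This is only a sketch under stated assumptions: it uses Top-1 predicted labels and plain importance resampling, which stands in for (and is not) the procedure of Kynkäänniemi et al.; the function name and arguments are hypothetical, and it assumes the real and generated sets share at least one predicted class.

```python
import numpy as np

def resample_to_match_histogram(gen_classes, real_classes, num_classes, size, rng):
    """Draw indices into the generated set so that the Top-1 class histogram
    of the resampled set approximates that of the real set."""
    gen_classes = np.asarray(gen_classes)
    real_classes = np.asarray(real_classes)
    real_hist = np.bincount(real_classes, minlength=num_classes) / len(real_classes)
    gen_hist = np.bincount(gen_classes, minlength=num_classes) / len(gen_classes)
    # Importance weight per generated image: target class frequency divided by
    # current class frequency (never zero for classes present in the set).
    weights = real_hist[gen_classes] / gen_hist[gen_classes]
    weights = weights / weights.sum()
    return rng.choice(len(gen_classes), size=size, replace=True, p=weights)
```

No image is altered by this step; only the selection of images changes, which is why a lower score obtained this way does not reflect better perceptual quality.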

Finite-sample bias is present in both FID and IS:

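$\mathrm{FID}_N \approx \mathrm{FID}_\infty + \frac{K_{\mathrm{FID}}}{N}, \qquad \mathrm{IS}_N \approx \mathrm{IS}_\infty + \frac{K_{\mathrm{IS}}}{N}$

with model- and distribution-dependent constants $K_{\mathrm{FID}}$ and $K_{\mathrm{IS}}$, where $N$ is the number of samples, meaning scores are not directly comparable across different sample sizes or even across models at the same sample size (Chong et al., 2019). This can be mitigated by extrapolating to infinite sample size ( $N \to \infty$ ), for example via regression across multiple sample sizes.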
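A minimal sketch of the regression-based extrapolation, assuming the score is approximately linear in $1/N$ and that scores have already been measured at several sample sizes. The function name `extrapolate_to_infinite_n` is illustrative; the fit is an ordinary least-squares line via `np.polyfit`.

```python
import numpy as np

def extrapolate_to_infinite_n(sample_sizes, scores):
    """Fit score ~ intercept + slope / N and return (intercept, slope).

    The intercept is the extrapolated score as N grows without bound.
    """
    inv_n = 1.0 / np.asarray(sample_sizes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    # polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(inv_n, scores, deg=1)
    return float(intercept), float(slope)
```

In practice each score would be computed on a random subset of the generated images, and averaging several subsets per sample size reduces the noise of the fit.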

4. Extensions and Robust Alternatives

A variety of methods extend or address limitations in FID/IS:

Robust FID (R-FID): Uses adversarially trained (robust) Inception-V3 networks for feature extraction, dramatically increasing pixel- and latent-space robustness against adversarial attacks. R-FID strictly increases as image quality degrades and is substantially more difficult to manipulate artificially (Alfarra et al., 2022).
Domain-General Features: Substituting CLIP features, learned on diverse multimodal data, for Inception features aligns the metric with a broader range of image domains, improves the Gaussianity of the embeddings, and reduces outlier effects (Betzalel et al., 2022).
Compound FID (CFID): Simultaneously computes FID at different levels (low, mid, high) within Inception-V3, returning the maximum of the three, to better capture both local (texture) and global (semantic) aberrations in generated results (Nunn et al., 2021).

Precision and Recall Variants: Address the fidelity/diversity conflation with support-based statistics, such as k-nearest-neighbor (kNN) overlap between real and generated samples. However, early kNN-based methods suffer from hyperparameter sensitivity, outlier vulnerability, and lack of statistical consistency (Naeem et al., 2020). Topological Precision and Recall (TopP&R) offers statistically consistent, robust estimation of supports via KDE and bootstrap confidence bands, maintaining bounded, interpretable metrics and robustness to non-IID noise (Kim et al., 2023). Density and Coverage metrics refine fidelity/diversity estimation further, providing analytic tractability when the real and generated distributions coincide ( $P_r = P_g$ ) (Naeem et al., 2020).

Conditional Metrics: For class-conditional generation, the conditional Inception Score (cIS) and conditional FID (cFID) decompose scores into between-class and within-class components, yielding more granular diagnosis of class coverage and mode collapse (Benny et al., 2020).

Normalizing Flow Methods: Flow-based Likelihood Distance (FLD), Dual-Flow Likelihood Distance (D-FLD), and FLD+ leverage normalizing flows to model the true data distribution's likelihood directly, bypassing Gaussian assumptions and reducing sample size requirements by orders of magnitude compared to FID (hundreds of images vs. tens of thousands). These methods exhibit strong monotonicity under a broad array of distortions, faster convergence, efficient computation, and easy adaptation to new domains (e.g., medical imaging) (Jeevan et al., 2024, Jeevan et al., 2024).
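Returning to the support-based precision and recall variants listed above, the basic kNN construction can be sketched as follows. This is a simplified illustration on precomputed feature arrays, not the TopP&R or Density/Coverage estimators; the function names and the default `k=3` are assumptions, and the dense pairwise distance matrices make it suitable only for small sets.

```python
import numpy as np

def knn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # After sorting, column 0 is the zero distance to the point itself.
    return np.sort(d, axis=1)[:, k]

def covered_fraction(queries, support, radii):
    """Fraction of queries lying inside at least one kNN ball of the support."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean(np.any(d <= radii[None, :], axis=1)))

def knn_precision_recall(real, gen, k=3):
    # Precision: generated samples that fall within the estimated real support.
    precision = covered_fraction(gen, real, knn_radii(real, k))
    # Recall: real samples that fall within the estimated generated support.
    recall = covered_fraction(real, gen, knn_radii(gen, k))
    return precision, recall
```

Because a single outlier enlarges its own kNN ball, this construction shows where the outlier vulnerability of early kNN-based estimators comes from.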

5. Practical Implications and Best Practices

The primary strengths of FID are computational simplicity, widespread adoption, and rough correlation with human judgment when comparing closely related models. However, its reliance on a fixed feature extractor (ImageNet-trained Inception-V3), a single scalar score, and Gaussian embedding assumptions introduces critical pitfalls: FID and IS can be arbitrarily manipulated, may disagree with human perceptual quality, and confound fidelity and diversity (Kynkäänniemi et al., 2022, Alfarra et al., 2022).

Best practices include:

Never relying on FID alone for model selection; always complement with human studies and domain-adapted metrics.
Reporting unbiased FID estimates ( $\mathrm{FID}_\infty$ ), multiple metrics (e.g., KID, Clean FID, FLD/D-FLD, TopP&R), and using robust or domain-adapted feature extractors (CLIP, robust Inception).
Retraining or fine-tuning the backbone or normalizing-flow representation when working in non-ImageNet or low-data domains, for more meaningful evaluation (Jeevan et al., 2024, Jeevan et al., 2024, Kim et al., 2023).
Standardizing experimental protocols (fixed sample size $N$, deterministic preprocessing, official weights) for reproducibility, and verifying monotonicity and behavior under known degradations (Kynkäänniemi et al., 2022, Jeevan et al., 2024).
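The monotonicity check in the last item can be sketched as a small harness. Everything here is hypothetical scaffolding: `metric` is any callable that takes reference and degraded data and returns a distance-like score (lower is better), and `degrade` is any callable that applies a distortion of a given strength.

```python
import numpy as np

def check_monotone_under_degradation(metric, reference, degrade, levels):
    """Score increasingly degraded copies of `reference` and report whether
    the metric never decreases as the degradation level grows."""
    scores = [float(metric(reference, degrade(reference, lvl))) for lvl in levels]
    is_monotone = bool(np.all(np.diff(scores) >= 0))
    return is_monotone, scores

# Toy usage: a mean-shift degradation scored by the squared mean difference.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
toy_metric = lambda a, b: np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
toy_degrade = lambda x, level: x + level
ok, scores = check_monotone_under_degradation(
    toy_metric, data, toy_degrade, levels=[0.0, 0.5, 1.0, 2.0]
)
```

For a real evaluation, the toy metric would be replaced by the metric under test and the degradation by image-level distortions such as blur or noise applied before feature extraction.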

6. Comparative Overview of Major Metrics

Metric	Embedding Type	Typical Sample Size (images)	Domain Adaptability	Handles Outliers	Decomposes Fidelity/Diversity
FID	InceptionV3 (ImageNet)	>20,000	Poor	No	No
IS	InceptionV3 (ImageNet)	>2,000	Poor	No	No
R-FID	Robust Inception	>20,000	Poor	Strong	No
FID (CLIP)	CLIP	~20,000	High	Better	No
TopP&R	Any (KDE-based)	10,000	Good	Strong	Yes
FLD/D-FLD	Normalizing Flow	~200	Excellent	Yes	Yes
FLD+	Flow on pooled features	~200	Excellent	Yes	Yes
CFID	InceptionV3 multi-level	>20,000	Poor	Weak	Partially

7. Outlook and Recommendations

The field is moving toward metrics that are robust, data-efficient, and domain-adaptable, and that disentangle sample fidelity from diversity. FID and IS, while historically central, are increasingly seen as baseline metrics. A plausible implication is that future evaluation methodology will universally incorporate both precision/recall-based and likelihood-based (flow) metrics, with a strong preference for those exhibiting statistical consistency, interpretability, and robustness to both outliers and dataset or domain shifts (Jeevan et al., 2024, Jeevan et al., 2024, Kim et al., 2023). Employing a rigorous, multi-metric evaluation suite, complemented by human perception studies and standardized protocols, remains essential for credible generative model assessment.