Monge Inception Distance (MIND)

Updated 4 July 2026
  • Monge Inception Distance (MIND) is a metric that evaluates the similarity between generated and real data distributions using one-dimensional optimal transport.
  • It computes averaged Wasserstein distances by projecting high-dimensional embeddings onto random unit vectors, avoiding full covariance estimation.
  • MIND offers significant gains in speed, memory efficiency, and robustness compared to FID, achieving reliable evaluations with fewer samples.

Monge Inception Distance (MIND) is a metric for evaluating generative models, proposed to address key limitations of the widely adopted Fréchet Inception Distance (FID). It replaces the Gaussian approximation used by FID with a sliced Wasserstein construction that compares distributions through averaged one-dimensional optimal transport distances, which are computed efficiently by sorting. In the formulation reported in "MIND: Monge Inception Distance for Generative Models Evaluation," MIND is presented as more sample-efficient by one order of magnitude, faster to compute by two orders of magnitude, and more robust to adversarial attacks such as moment-matching, while maintaining high correlation with the standard FID benchmark (Berthet et al., 7 May 2026).

1. Formal definition

MIND is defined on embedding distributions. Let $p_\theta$ be the distribution of Inception-v3 embeddings of samples from the generative model, let $p_{\mathrm{data}}$ be the distribution of embeddings of real data, let $d$ be the embedding dimension, with $d = 2048$ for Inception-v3, let $S=\{u\in\mathbb{R}^d:\|u\|_2=1\}$ be the unit sphere, and let $U(S)$ denote the uniform distribution on $S$.

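The construction begins with the squared $2$-Wasserstein distance between one-dimensional distributions $\mu$ and $\nu$, both supported on $\mathbb{R}$:

$$W_2^2(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathbb{R}\times\mathbb{R}} (x-y)^2 \,\mathrm{d}\pi(x,y),$$

where $\Pi(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$.

For two empirical samples $x_1,\dots,x_n$ and $y_1,\dots,y_n$ of equal size $n$, with empirical measures $\hat\mu_n$ and $\hat\nu_n$, the same quantity has the closed-form empirical expression

$$W_2^2(\hat\mu_n,\hat\nu_n) = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_{(i)}-y_{(i)}\bigr)^2,$$

where $x_{(1)}\le\dots\le x_{(n)}$ and $y_{(1)}\le\dots\le y_{(n)}$ are the sorted samples.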
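The closed-form expression for equal-size samples can be illustrated with a minimal sketch. It assumes NumPy, and the helper name `w2_squared_1d` is hypothetical rather than taken from the paper: both samples are sorted, and the squared differences of matching ranks are averaged.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two 1D samples of equal size")
    # Sorting aligns the i-th smallest value of x with the i-th smallest of y.
    return float(np.mean((x - y) ** 2))

# Sorted samples are [1, 2, 3] and [0, 2, 4]; squared gaps are 1, 0, 1, so the mean is 2/3.
print(w2_squared_1d([3.0, 1.0, 2.0], [2.0, 4.0, 0.0]))
```

The cost is dominated by the two sorts, which is why the one-dimensional case is cheap compared with general optimal transport.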

The sliced Wasserstein distance between two $d$-dimensional distributions $p$ and $q$ is then defined as the average of these one-dimensional distances over random directions $u\sim U(S)$:

$$\mathrm{SW}_2^2(p,q) = \mathbb{E}_{u\sim U(S)}\bigl[W_2^2(p_u,q_u)\bigr],$$

where $p_u$ denotes the distribution of the dot-product $\langle u,x\rangle$ when $x\sim p$.

With this notation, the Monge Inception Distance is defined, up to a constant $c$ chosen to put MIND on a similar numerical scale as FID, by

$$\mathrm{MIND}(p_\theta,p_{\mathrm{data}}) = c\cdot\mathrm{SW}_2^2(p_\theta,p_{\mathrm{data}}).$$

This definition makes explicit that MIND compares the full empirical distributions in projection space rather than only low-order moments. The paper’s summary further states that sliced-Wasserstein is a metric; this is the basis for the claim that MIND is a proper distance and for its reported robustness to moment-matching attacks (Berthet et al., 7 May 2026).

2. Empirical estimator and computational procedure

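In practical evaluation, one draws $K$ independent random directions $u_1,\dots,u_K\sim U(S)$ and $n$ samples from each distribution, yielding the estimator

$$\widehat{\mathrm{MIND}} = \frac{c}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}\Bigl(\langle u_k,x\rangle_{(i)}-\langle u_k,y\rangle_{(i)}\Bigr)^2,$$

where $x_1,\dots,x_n$ and $y_1,\dots,y_n$ are the generated and real embeddings, $\langle u_k,x\rangle_{(i)}$ denotes the $i$-th smallest of the projections $\langle u_k,x_1\rangle,\dots,\langle u_k,x_n\rangle$, and $c$ is the scaling constant from the definition. In the reported implementation, d=2048d = 20482 for d=2048d = 20483.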

The high-level computation consists of repeatedly sampling a unit vector $u$, projecting the embedding matrices $X$ and $Y$ onto that direction, sorting the two resulting length-$n$ vectors, and accumulating the mean squared difference between the aligned sorted projections. The stated complexity is $O(K(nd+n\log n))$ time and d=2048d = 20489 extra memory for projections, explicitly avoiding any $d\times d$ covariance.
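The loop described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the function name `mind_estimate`, the default of 128 directions, and the default `scale=1.0` are hypothetical placeholders, and directions are drawn by normalizing Gaussian vectors, which yields the uniform distribution on the sphere.

```python
import numpy as np

def mind_estimate(X, Y, num_directions=128, scale=1.0, seed=0):
    """Monte Carlo sliced-Wasserstein estimate between two (n, d) embedding arrays."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim != 2 or X.shape != Y.shape:
        raise ValueError("expected two arrays of identical shape (n, d)")
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(num_directions):
        # A normalized standard Gaussian vector is uniform on the unit sphere.
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        # Project, sort, and compare values of matching rank.
        px = np.sort(X @ u)
        py = np.sort(Y @ u)
        total += np.mean((px - py) ** 2)
    # `scale` stands in for the constant that aligns the value with FID magnitudes.
    return scale * total / num_directions

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8))
print(mind_estimate(X, X))        # identical samples give exactly zero
print(mind_estimate(X, X + 1.0))  # a shifted copy gives a positive value
```

Only one projected vector per sample set is held at a time, so the extra memory of this sequential variant is linear in the number of samples.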

The summary also notes that the procedure is easily batched on GPUs or TPUs: one stores $X$ and $Y$, generates a batch of $K$ directions in $\mathbb{R}^{d\times K}$, performs projection in one matrix multiply, and then sorts each of the $K$ projected vectors of length $n$. This makes the algorithm operationally distinct from covariance-based FID pipelines, because the core numerical primitives are matrix multiplication and sorting rather than covariance estimation and matrix square roots (Berthet et al., 7 May 2026).
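A batched variant of the same computation might look like the sketch below. It uses NumPy for portability rather than an accelerator framework, and the name `mind_estimate_batched` and its defaults are assumptions made for illustration: all directions are stacked as columns of one matrix, so projection is a single matrix multiply and sorting runs along the sample axis.

```python
import numpy as np

def mind_estimate_batched(X, Y, num_directions=128, scale=1.0, seed=0):
    """Batched sliced-Wasserstein estimate: one matmul, then column-wise sorts."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim != 2 or X.shape != Y.shape:
        raise ValueError("expected two arrays of identical shape (n, d)")
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Columns of U are independent unit vectors, uniform on the sphere.
    U = rng.standard_normal((d, num_directions))
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    # Shape (n, num_directions): every column is one sorted 1D projection.
    PX = np.sort(X @ U, axis=0)
    PY = np.sort(Y @ U, axis=0)
    # Averaging over all entries averages over both samples and directions.
    return scale * float(np.mean((PX - PY) ** 2))

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 6))
Y = rng.standard_normal((50, 6)) + 0.5
print(mind_estimate_batched(X, Y))
```

The trade-off relative to a sequential loop is memory: the two projected matrices hold one column per direction, in exchange for using only large, accelerator-friendly primitives.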

3. Complexity and statistical behavior

The reported comparison to FID is both computational and statistical. For FID, the summary states that one must estimate a $d$-dimensional mean in $O(nd)$ time and a $d\times d$ covariance matrix in $O(nd^2)$ time, then compute the matrix square-root of the $d\times d$ covariance product in $O(d^3)$ time, with total time $O(nd^2+d^3)$ and memory $O(d^2)$.
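For comparison, the covariance-based pipeline can be sketched with the standard Fréchet distance between Gaussians, $\|\mu_1-\mu_2\|^2+\mathrm{Tr}\bigl(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}\bigr)$. The version below is a simplified illustration and not a reference FID implementation: the function name is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_1\Sigma_2$ instead of an explicit square root.

```python
import numpy as np

def fid_from_embeddings(X, Y):
    """Fréchet distance between Gaussians fitted to two (n, d) embedding arrays."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)             # O(n d)
    cov_x = np.atleast_2d(np.cov(X, rowvar=False))          # O(n d^2) time, O(d^2) memory
    cov_y = np.atleast_2d(np.cov(Y, rowvar=False))
    # Tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_x @ cov_y, which are real and non-negative for PSD inputs.  O(d^3)
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 4))
print(fid_from_embeddings(X, X + 2.0))  # covariances match, so only the mean shift contributes
```

Every step after the mean involves a $d\times d$ matrix, which is where the quadratic memory and cubic time terms come from.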

For MIND, the summary states that evaluation requires $K$ projections, each with $O(nd)$ cost for projection and $O(n\log n)$ cost for sorting, with no matrix square-root. The total time is therefore $O(K(nd+n\log n))$, and memory is U(S)U(S)9.
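The two asymptotic costs can be compared with a rough operation count. The sketch below ignores constant factors and hardware effects, and the sample sizes and number of directions in the usage lines are hypothetical values chosen only for illustration, not the settings reported in the paper.

```python
import math

def fid_operation_count(n, d):
    """Leading-order cost of FID: covariance estimation plus a d x d square root."""
    return n * d * d + d ** 3

def mind_operation_count(n, d, num_directions):
    """Leading-order cost of sliced evaluation: projection plus sorting per direction."""
    return num_directions * (n * d + n * math.log2(n))

# Hypothetical settings, for illustration only.
d = 2048
fid_cost = fid_operation_count(n=50_000, d=d)
mind_cost = mind_operation_count(n=5_000, d=d, num_directions=1_000)
print(f"approximate cost ratio: {fid_cost / mind_cost:.1f}")
```

Such counts only indicate scaling behavior; measured speedups also depend on how well matrix multiplication, sorting, and matrix square roots map onto the hardware in use.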

The statistical comparison is central to the motivation. The summary states that covariance estimation error for FID scales as SS0, so SS1 is required for stability, and practitioners use SS2. By contrast, each one-dimensional Wasserstein estimate used by MIND converges at SS3, independently of dimension, and the Monte Carlo average over SS4 directions converges at SS5. In practice, the reported stable regime is SS6 and SS7.
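The effect of the Monte Carlo average over directions can be checked empirically. The following sketch holds two synthetic sample sets fixed and repeats a sliced estimate with different random directions; the data, the helper names, and the choices of 16 and 256 directions are assumptions made for illustration. For a standard Monte Carlo average the spread across repetitions should shrink roughly in proportion to the inverse square root of the number of directions.

```python
import numpy as np

def sliced_w2(X, Y, num_directions, seed):
    """Sliced squared 2-Wasserstein estimate with a given direction seed."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], num_directions))
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    PX = np.sort(X @ U, axis=0)
    PY = np.sort(Y @ U, axis=0)
    return float(np.mean((PX - PY) ** 2))

def spread_over_seeds(X, Y, num_directions, num_repeats=20):
    """Standard deviation of the estimate when only the directions are resampled."""
    values = [sliced_w2(X, Y, num_directions, seed) for seed in range(num_repeats)]
    return float(np.std(values))

# Synthetic data: Y is X shifted along one coordinate, so the per-direction
# distance depends strongly on the sampled direction.
data_rng = np.random.default_rng(123)
X = data_rng.standard_normal((256, 32))
Y = X.copy()
Y[:, 0] += 3.0

spread_small = spread_over_seeds(X, Y, num_directions=16)
spread_large = spread_over_seeds(X, Y, num_directions=256)
print(spread_small, spread_large)  # the second value should be noticeably smaller
```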

The summary therefore concludes that, relative to FID at SS8 and SS9, MIND at $2$0 is over $2$1 faster to compute and uses more than $2$2 less extra memory. This suggests that the principal gain is not merely constant-factor optimization, but the replacement of high-dimensional moment estimation by repeated one-dimensional OT computations (Berthet et al., 7 May 2026).

4. Empirical evaluation against FID

The reported empirical results are organized around several evaluation tasks.

In the generated-versus-true discrimination task, the probability of mis-classifying real versus generated samples drops below $2$3 for MIND with $2$4, whereas FID requires $2$5.

For agreement with the conventional benchmark, the reported scatter of $2$6 versus $2$7 over $2$8 training checkpoints shows a near-linear relation, with Pearson $2$9 after rescaling by μ\mu0. The reported correlation of these metrics with the number of training steps is

  • μ\mu1: μ\mu2,
  • μ\mu3: μ\mu4,
  • μ\mu5: μ\mu6.

In a checkpoint-ordering task involving five diffusion-model checkpoints, MIND at μ\mu7 is reported to achieve the same near-perfect ordering error rate as MMD or sliced-FID, whereas FID needs μ\mu8 to match.

In controlled perturbation experiments involving Gaussian blur, blocked rectangles, and dataset mixtures, MIND at μ\mu9 is reported to reliably order perturbation levels with error below ν\nu0, matching or exceeding MMD, while FID remains above ν\nu1 error even at ν\nu2.

Taken together, these experiments support the paper’s claim that MIND with ν\nu3 samples can replace the evaluation performance of FID with ν\nu4 samples, and that even smaller sample sizes such as ν\nu5 or ν\nu6 remain highly informative for rapid model iteration (Berthet et al., 7 May 2026).

5. Robustness to moment-matching attacks

A central empirical claim concerns adversarial robustness. The summary describes a synthetic batch that is adversarially optimized to match the first two moments of the real data, thereby driving FID toward zero. This directly targets the Gaussian moment structure on which FID depends.

The reported fraction of original metric value remaining after such hacking is as follows:

Metric Fraction remaining
FIDν\nu7 11.2%
ν\nu8-FID (means only) 2.6%
ν\nu9-FID (variances only) 4.2%
MMD 12.2%
MIND 31.1%

These values are used to support the claim that MIND is more robust to moment-matching attacks than FID and related moment-based decompositions. The reported interpretation is that matching first and second moments is insufficient to collapse a sliced-Wasserstein comparison in the same way that it can collapse a Gaussian-approximation metric. A plausible implication is that MIND preserves more sensitivity to higher-order discrepancies in the embedding distribution, although the summary does not formalize this beyond the reported attack results (Berthet et al., 7 May 2026).
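The intuition behind this robustness can be illustrated with a toy example. The sketch below does not reproduce the adversarial optimization described in the summary; instead it assumes a simpler construction in which a non-Gaussian sample is affinely transformed so that its empirical mean and covariance equal those of a Gaussian sample. All names and sizes are hypothetical. The Fréchet distance then vanishes up to floating-point error, while the sliced Wasserstein value stays clearly positive.

```python
import numpy as np

def fid(X, Y):
    """Fréchet distance between Gaussians fitted to X and Y."""
    cov_x = np.atleast_2d(np.cov(X, rowvar=False))
    cov_y = np.atleast_2d(np.cov(Y, rowvar=False))
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)

def sliced_w2(X, Y, num_directions=256, seed=0):
    """Sliced squared 2-Wasserstein estimate via projection and sorting."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], num_directions))
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    return float(np.mean((np.sort(X @ U, axis=0) - np.sort(Y @ U, axis=0)) ** 2))

def match_first_two_moments(source, target):
    """Affinely map `source` so its empirical mean and covariance equal those of `target`."""
    L_src = np.linalg.cholesky(np.cov(source, rowvar=False))
    L_tgt = np.linalg.cholesky(np.cov(target, rowvar=False))
    # Whiten the centered source, then recolor with the target covariance factor.
    whitened = np.linalg.solve(L_src, (source - source.mean(axis=0)).T).T
    return whitened @ L_tgt.T + target.mean(axis=0)

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 2))                 # Gaussian "real" sample
signs = rng.choice([-1.0, 1.0], size=(2000, 2))       # four-point, clearly non-Gaussian sample
fake = match_first_two_moments(signs, real)

fid_value = fid(real, fake)
sw_value = sliced_w2(real, fake)
print(fid_value, sw_value)  # first value is near zero, second is not
```

A low dimension is used deliberately: projections of many independent coordinates tend toward Gaussian shapes, so the gap between the two metrics in this toy setting is easiest to see when $d$ is small.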

6. Advantages, scope, and limitations

The reported advantages of MIND over FID are listed explicitly. They include sample efficiency, with stable scores at pdatap_{\mathrm{data}}00 samples and informative coarse comparisons even at pdatap_{\mathrm{data}}01 or pdatap_{\mathrm{data}}02; computational speed, with pdatap_{\mathrm{data}}03 faster evaluation enabling real-time scoring during training; memory efficiency, with pdatap_{\mathrm{data}}04 less peak memory and no pdatap_{\mathrm{data}}05 covariances; robustness, because sliced-Wasserstein is a metric; discriminative power, because it reliably orders models and detects subtle distortions at low pdatap_{\mathrm{data}}06; and embedding-agnostic applicability, because it can be used with any learned representation, including CLIP, DINO, and audio or video embeddings.

The stated limitations are equally important. Like FID, MIND measures distribution-level similarity in a fixed embedding space and does not capture perceptual qualities beyond those encoded by the embedding function, with examples in the summary including typography legibility and high-frequency details not reflected in embeddings. MIND also requires choosing two parameters, the sample count $n$ and the number of projections $K$, as well as a scale $c$ to match FID magnitudes. In addition, the Monte Carlo average over directions introduces small additional variance, although the summary states that pdatap_{\mathrm{data}}10 suffices.

The runtime and memory measurements illustrate these trade-offs concretely. On TPUv4, the reported wall-clock time is approximately pdatap_{\mathrm{data}}11 s for pdatap_{\mathrm{data}}12 and approximately pdatap_{\mathrm{data}}13 s for pdatap_{\mathrm{data}}14, corresponding to roughly a pdatap_{\mathrm{data}}15 speedup. The reported peak memory is below pdatap_{\mathrm{data}}16 GB for pdatap_{\mathrm{data}}17 and above pdatap_{\mathrm{data}}18 GB for pdatap_{\mathrm{data}}19, corresponding to more than a pdatap_{\mathrm{data}}20 reduction. Within the scope defined by embedding-based evaluation, these measurements position MIND as a sample-based alternative to covariance-based FID with substantially lower computational burden (Berthet et al., 7 May 2026).
