Monge Inception Distance (MIND)
- Monge Inception Distance (MIND) is a metric that evaluates the similarity between generated and real data distributions using one-dimensional optimal transport.
- It computes averaged Wasserstein distances by projecting high-dimensional embeddings onto random unit vectors, avoiding full covariance estimation.
- MIND offers significant gains in speed, memory efficiency, and robustness compared to FID, achieving reliable evaluations with fewer samples.
Monge Inception Distance (MIND) is a metric for evaluating generative models that was proposed to address key limitations of the widely adopted Fréchet Inception Distance (FID). It replaces the Gaussian approximation used by FID with a sliced Wasserstein construction that compares distributions through averaged one-dimensional optimal transport distances, efficiently computed via sorting. In the formulation reported in "MIND: Monge Inception Distance for Generative Models Evaluation," MIND is presented as more sample-efficient by one order of magnitude, faster to compute by two orders of magnitude, and more robust to adversarial attacks such as moment-matching, while maintaining high correlation with the standard FID benchmark (Berthet et al., 7 May 2026).
1. Formal definition
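MIND is defined on embedding distributions. Let $\mu$ be the distribution of Inception-v3 embeddings of samples from the generative model, let $\nu$ be the distribution of embeddings of real data, let $d$ be the embedding dimension, with $d = 2048$ for Inception-v3, let $\mathbb{S}^{d-1}$ be the unit sphere in $\mathbb{R}^d$, and let $\sigma$ denote the uniform distribution on $\mathbb{S}^{d-1}$.
The construction begins with the squared $2$-Wasserstein distance between one-dimensional distributions $\alpha$ and $\beta$, both supported on $\mathbb{R}$:
$$W_2^2(\alpha, \beta) = \inf_{\pi \in \Pi(\alpha, \beta)} \int_{\mathbb{R} \times \mathbb{R}} (x - y)^2 \, d\pi(x, y),$$
where $\Pi(\alpha, \beta)$ is the set of couplings of $\alpha$ and $\beta$.
For two empirical samples $x_1, \dots, x_n$ and $y_1, \dots, y_n$ of equal size $n$, the same quantity has the closed-form empirical expression
$$W_2^2(\hat{\alpha}_n, \hat{\beta}_n) = \frac{1}{n} \sum_{i=1}^{n} \left( x_{(i)} - y_{(i)} \right)^2,$$
where $x_{(1)} \le \dots \le x_{(n)}$ and $y_{(1)} \le \dots \le y_{(n)}$ are the sorted samples.
The sliced Wasserstein distance between two $d$-dimensional distributions $\mu$ and $\nu$ is then defined as the average of these one-dimensional distances over random directions $\theta \sim \sigma$:
$$SW_2^2(\mu, \nu) = \mathbb{E}_{\theta \sim \sigma} \left[ W_2^2(\theta_\# \mu, \theta_\# \nu) \right],$$
where $\theta_\# \mu$ denotes the distribution of the dot-product $\langle \theta, X \rangle$ when $X \sim \mu$.
With this notation, the Monge Inception Distance is defined, up to a constant $c$ chosen to put MIND on a similar numerical scale as FID, by
$$\mathrm{MIND}(\mu, \nu) = c \cdot SW_2^2(\mu, \nu).$$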
This definition makes explicit that MIND compares the full empirical distributions in projection space rather than only low-order moments. The paper’s summary further states that sliced-Wasserstein is a metric; this is the basis for the claim that MIND is a proper distance and for its reported robustness to moment-matching attacks (Berthet et al., 7 May 2026).
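The closed-form empirical expression for equal-size one-dimensional samples can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `w2_squared_1d` is a label chosen here, not one taken from the paper.

```python
def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling pairs the i-th smallest value of
    one sample with the i-th smallest value of the other, so sorting both
    samples is enough.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    if len(xs) == 0:
        raise ValueError("samples must be non-empty")
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    n = len(xs_sorted)
    return sum((a - b) ** 2 for a, b in zip(xs_sorted, ys_sorted)) / n


# Every point of the second sample is the first sample shifted by 1,
# so the squared distance is 1 regardless of the input order.
print(w2_squared_1d([3.0, 1.0, 2.0], [2.0, 4.0, 3.0]))  # prints 1.0
```

Because the result depends only on the sorted values, permuting either sample leaves it unchanged.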
2. Empirical estimator and computational procedure
In practical evaluation, one draws $m$ independent random directions $\theta_1, \dots, \theta_m$ uniformly on the unit sphere and $n$ samples from each distribution, yielding the estimator
$$\widehat{\mathrm{MIND}} = \frac{c}{m} \sum_{k=1}^{m} \frac{1}{n} \sum_{i=1}^{n} \left( u^{(k)}_{(i)} - v^{(k)}_{(i)} \right)^2, \qquad u^{(k)}_i = \langle \theta_k, x_i \rangle, \quad v^{(k)}_i = \langle \theta_k, y_i \rangle,$$
where $x_1, \dots, x_n$ and $y_1, \dots, y_n$ are the generated and real embeddings, $c$ is the scaling constant, and the subscript $(i)$ denotes the $i$-th smallest value. In the reported implementation, […] for […].
The high-level computation consists of repeatedly sampling a unit vector $\theta$, projecting the $n \times d$ embedding matrices $X$ and $Y$ onto that direction, sorting the two resulting length-$n$ vectors, and accumulating the mean squared difference between the aligned sorted projections. The stated complexity is $O(m(nd + n \log n))$ time and $O(n)$ extra memory per projection, explicitly avoiding any $d \times d$ covariance.
The summary also notes that the procedure is easily batched on GPUs or TPUs: one stores $X$ and $Y$, generates a batch of $B$ directions as a $d \times B$ matrix, performs projection in one matrix multiply, and then sorts each of the $B$ projected vectors of length $n$. This makes the algorithm operationally distinct from covariance-based FID pipelines, because the core numerical primitives are matrix multiplication and sorting rather than covariance estimation and matrix square roots (Berthet et al., 7 May 2026).
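The batched procedure described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `mind_estimate`, the default values of `num_directions`, `batch_size`, and `scale`, and the use of normalized Gaussian vectors to draw uniform directions are all choices made here.

```python
import numpy as np


def mind_estimate(x, y, num_directions=256, scale=1.0, batch_size=64, seed=0):
    """Sliced squared 2-Wasserstein estimate between two embedding matrices.

    x, y: arrays of shape (n, d) holding generated and real embeddings.
    scale: hypothetical stand-in for the scaling constant; the paper's
           value is not reproduced here.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("x and y must both have shape (n, d)")
    if num_directions <= 0 or batch_size <= 0:
        raise ValueError("num_directions and batch_size must be positive")

    _, d = x.shape
    rng = np.random.default_rng(seed)
    total = 0.0
    done = 0
    while done < num_directions:
        b = min(batch_size, num_directions - done)
        # Normalized Gaussian vectors are uniformly distributed on the sphere.
        theta = rng.standard_normal((d, b))
        theta /= np.linalg.norm(theta, axis=0, keepdims=True)
        # One matrix multiply projects all n samples onto all b directions.
        px = np.sort(x @ theta, axis=0)
        py = np.sort(y @ theta, axis=0)
        # Mean squared gap between aligned sorted projections, per direction.
        total += np.mean((px - py) ** 2, axis=0).sum()
        done += b
    return scale * total / num_directions


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    real = rng.standard_normal((500, 32))
    fake = rng.standard_normal((500, 32)) + 0.5  # hypothetical shifted samples
    print(mind_estimate(fake, real))
```

Processing directions in batches keeps the extra memory at two `(n, batch_size)` arrays while still using one matrix multiply per batch.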
3. Complexity and statistical behavior
The reported comparison to FID is both computational and statistical. For FID with $n$ samples in dimension $d$, the summary states that one must estimate a $d$-dimensional mean in $O(nd)$ time and a $d \times d$ covariance matrix in $O(nd^2)$ time, then compute the matrix square-root of the product of the two covariance matrices in $O(d^3)$ time, with total time $O(nd^2 + d^3)$ and memory $O(d^2)$.
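For reference, the standard definition of FID shows where each of these cost terms comes from. The symbols $m_g, m_r$ (embedding means) and $\Sigma_g, \Sigma_r$ (embedding covariances) for generated and real data are notation introduced here for illustration.

```latex
\mathrm{FID}
  = \lVert m_g - m_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r
      - 2 \left( \Sigma_g \Sigma_r \right)^{1/2} \right)
```

The mean term costs $O(nd)$, forming each covariance costs $O(nd^2)$, and the matrix square root of the $d \times d$ product costs $O(d^3)$.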
For MIND, the summary states that evaluation requires $m$ projections with $O(nd)$ cost for projection and $O(n \log n)$ cost for sorting, with no matrix square-root. The total time is therefore $O(m(nd + n \log n))$, and the extra memory is $O(n)$ per projection.
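The two asymptotic expressions can be compared with a rough leading-order count. The sketch below is an assumption-laden illustration: it ignores constant factors and hardware effects, the function names are chosen here, and the sample sizes and direction count in the example are hypothetical rather than the paper's settings, so it is no substitute for the reported wall-clock measurements.

```python
import math


def fid_cost(n, d):
    """Leading-order operation count for FID: mean, covariance, matrix sqrt."""
    return n * d + n * d * d + d ** 3


def mind_cost(n, d, m):
    """Leading-order operation count for MIND: m projections plus m sorts."""
    return m * (n * d + n * math.log2(n))


if __name__ == "__main__":
    d = 2048  # Inception-v3 embedding dimension
    # Hypothetical settings, chosen only to exercise the formulas.
    print(fid_cost(n=50_000, d=d))
    print(mind_cost(n=5_000, d=d, m=1_000))
```

The count makes the structural difference visible: the FID cost has $nd^2$ and $d^3$ terms, whereas the MIND cost is linear in $d$ and grows with the number of directions $m$.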
The statistical comparison is central to the motivation. The summary states that covariance estimation error for FID scales as […], so […] is required for stability, and practitioners use […]. By contrast, each one-dimensional Wasserstein estimate used by MIND converges at […], independently of dimension, and the Monte Carlo average over $m$ directions converges at $O(m^{-1/2})$. In practice, the reported stable regime is […] and […].
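The Monte Carlo rate over directions follows from a standard variance argument, sketched below. The symbols $Z_k$ (the one-dimensional squared distance along the $k$-th direction, for fixed samples) and $s^2$ (its variance over random directions) are notation introduced here.

```latex
\bar{Z}_m = \frac{1}{m} \sum_{k=1}^{m} Z_k,
\qquad
\operatorname{Var}\!\left( \bar{Z}_m \right) = \frac{s^2}{m},
\qquad
\sqrt{\operatorname{Var}\!\left( \bar{Z}_m \right)} = \frac{s}{\sqrt{m}}
```

Since the directions are drawn independently, the $Z_k$ are independent and identically distributed given the samples, so quadrupling the number of directions halves the Monte Carlo standard deviation.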
The summary therefore concludes that, relative to FID at […] and […], MIND at […] is over […] faster to compute and uses more than […] less extra memory. This suggests that the principal gain is not merely constant-factor optimization, but the replacement of high-dimensional moment estimation by repeated one-dimensional OT computations (Berthet et al., 7 May 2026).
4. Empirical evaluation against FID
The reported empirical results are organized around several evaluation tasks.
In the generated-versus-true discrimination task, the probability of misclassifying real versus generated samples drops below […] for MIND with […], whereas FID requires […].
For agreement with the conventional benchmark, the reported scatter of MIND versus FID over […] training checkpoints shows a near-linear relation, with a Pearson correlation of […] after rescaling by the constant $c$. The reported correlation of these metrics with the number of training steps is
- 1: 2,
- 3: 4,
- 5: 6.
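The Pearson correlation used in this comparison can be computed as in the sketch below. The score lists are hypothetical values invented for illustration, not data from the paper. The example also checks a general property of the Pearson correlation: it is unchanged by a positive rescaling of either variable, so a scaling constant affects only the numerical scale of a metric.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-checkpoint scores, for illustration only.
    fid_like = [50.0, 30.0, 20.0, 12.0, 9.0]
    mind_like = [5.2, 3.1, 1.9, 1.3, 0.8]
    r = pearson(fid_like, mind_like)
    r_rescaled = pearson(fid_like, [100.0 * v for v in mind_like])
    print(r, r_rescaled)  # the two values agree up to rounding
```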
In a checkpoint-ordering task involving five diffusion-model checkpoints, MIND at […] is reported to achieve the same near-perfect ordering error rate as MMD or sliced-FID, whereas FID needs […] to match.
In controlled perturbation experiments involving Gaussian blur, blocked rectangles, and dataset mixtures, MIND at […] is reported to reliably order perturbation levels with error below […], matching or exceeding MMD, while FID remains above […] error even at […].
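The summary does not specify how the ordering error is computed. One plausible reading, assumed here purely for illustration, is the fraction of pairs whose metric scores are not ordered the same way as their known perturbation levels (or checkpoint indices). The function name and the example values are hypothetical.

```python
from itertools import combinations


def ordering_error_rate(levels, scores):
    """Fraction of pairs with distinct levels whose scores disagree in order.

    A pair counts as an error when the higher level does not receive the
    strictly higher score (ties in score count as errors).
    """
    if len(levels) != len(scores):
        raise ValueError("levels and scores must have equal length")
    pairs = [
        (i, j)
        for i, j in combinations(range(len(levels)), 2)
        if levels[i] != levels[j]
    ]
    if not pairs:
        raise ValueError("need at least one pair with distinct levels")
    errors = sum(
        1
        for i, j in pairs
        if (levels[i] - levels[j]) * (scores[i] - scores[j]) <= 0
    )
    return errors / len(pairs)


if __name__ == "__main__":
    # Hypothetical perturbation levels and metric scores.
    levels = [0, 1, 2, 3]
    scores = [0.1, 0.4, 0.3, 0.9]  # levels 1 and 2 are swapped
    print(ordering_error_rate(levels, scores))  # 1 of 6 pairs is misordered
```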
Taken together, these experiments support the paper’s claim that MIND with […] samples can match the evaluation performance of FID with […] samples, and that even smaller sample sizes such as […] or […] remain highly informative for rapid model iteration (Berthet et al., 7 May 2026).
5. Robustness to moment-matching attacks
A central empirical claim concerns adversarial robustness. The summary describes a synthetic batch that is adversarially optimized to match the first two moments of the real data, thereby driving FID toward zero. This directly targets the Gaussian moment structure on which FID depends.
The reported fraction of original metric value remaining after such hacking is as follows:
| Metric | Fraction remaining |
|---|---|
| FID | 11.2% |
| FID (means only) | 2.6% |
| FID (variances only) | 4.2% |
| MMD | 12.2% |
| MIND | 31.1% |
These values are used to support the claim that MIND is more robust to moment-matching attacks than FID and related moment-based decompositions. The reported interpretation is that matching first and second moments is insufficient to collapse a sliced-Wasserstein comparison in the same way that it can collapse a Gaussian-approximation metric. A plausible implication is that MIND preserves more sensitivity to higher-order discrepancies in the embedding distribution, although the summary does not formalize this beyond the reported attack results (Berthet et al., 7 May 2026).
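A one-dimensional toy case makes the mechanism concrete. In the sketch below, two small samples share the same mean and variance, so a Gaussian (Fréchet) comparison sees no difference, while the sorting-based Wasserstein distance remains clearly positive. The samples and function names are constructed here for illustration and are unrelated to the paper's attack experiment.

```python
import math


def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between equal-size 1D samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def gaussian_frechet_1d(xs, ys):
    """Squared Frechet distance between 1D Gaussians fitted to each sample.

    In one dimension this reduces to (m1 - m2)^2 + (s1 - s2)^2, where m and
    s are the sample mean and (population) standard deviation.
    """
    def mean_std(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        return m, math.sqrt(var)

    m1, s1 = mean_std(xs)
    m2, s2 = mean_std(ys)
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


if __name__ == "__main__":
    # Both samples have mean 0 and variance 1, but different shapes.
    two_point = [-1.0, -1.0, 1.0, 1.0]
    three_point = [-math.sqrt(2.0), 0.0, 0.0, math.sqrt(2.0)]
    print(gaussian_frechet_1d(two_point, three_point))  # zero up to rounding
    print(w2_squared_1d(two_point, three_point))        # 2 - sqrt(2), about 0.586
```

The first two moments agree exactly, so only a comparison of the full distributions detects the mismatch.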
6. Advantages, scope, and limitations
The reported advantages of MIND over FID are listed explicitly. They include sample efficiency, with stable scores at […] samples and informative coarse comparisons even at […] or […]; computational speed, with […] faster evaluation enabling real-time scoring during training; memory efficiency, with […] less peak memory and no $d \times d$ covariances; robustness, because sliced-Wasserstein is a metric; discriminative power, because it reliably orders models and detects subtle distortions at low sample counts $n$; and embedding-agnostic applicability, because it can be used with any learned representation, including CLIP, DINO, and audio or video embeddings.
The stated limitations are equally important. Like FID, MIND measures distribution-level similarity in a fixed embedding space and does not capture perceptual qualities beyond those encoded by the embedding function, with examples in the summary including typography legibility and high-frequency details not reflected in embeddings. MIND also requires choosing two parameters, the sample count $n$ and the number of projections $m$, as well as a scale $c$ to match FID magnitudes. In addition, the Monte Carlo average over directions introduces small additional variance, although the summary states that […] suffices.
The runtime and memory measurements illustrate these trade-offs concretely. On TPUv4, the reported wall-clock time is approximately […] s for […] and approximately […] s for […], corresponding to roughly a […] speedup. The reported peak memory is below […] GB for […] and above […] GB for […], corresponding to more than a […] reduction. Within the scope defined by embedding-based evaluation, these measurements position MIND as a sample-based alternative to covariance-based FID with substantially lower computational burden (Berthet et al., 7 May 2026).