
CLIP-based Maximum Mean Discrepancy

Updated 3 January 2026
  • CMMD is a technique that integrates CLIP's semantically rich embeddings with the maximum mean discrepancy framework to assess distributional differences in image data.
  • It employs a Gaussian RBF kernel on normalized CLIP embeddings, ensuring robust, distribution-free testing and improved sensitivity compared to traditional metrics like FID.
  • CMMD is applicable in generative model evaluation and legal originality testing, providing reliable metrics even with low sample sizes and diverse image modalities.

CLIP-based Maximum Mean Discrepancy (CMMD) is a technique for quantitatively assessing distributional differences between sets of image data, particularly for evaluating generative models and measuring semantic originality. CMMD integrates the Maximum Mean Discrepancy (MMD) statistical test with the powerful, semantically rich image embeddings produced by Contrastive Language–Image Pretraining (CLIP) models. This approach overcomes limitations of earlier metrics such as Fréchet Inception Distance (FID), offering improved reliability, distributional sensitivity, and robustness to sample size and variations in data modality (Jayasumana et al., 2023; Mukherjee et al., 2025).

1. Motivation and Background

Conventional evaluation of image generation models frequently relies on distance metrics comparing real and generated image distributions in a feature space. FID, based on the distance between multivariate Gaussian fits to Inception-v3 activations, has been widely used for this purpose. However, contemporary image generative models create content that exceeds the domain of Inception-v3, rendering its features insufficiently expressive. Three principal shortcomings arise for FID: the restrictive Inception-v3 feature space, its core assumption of normality (multivariate Gaussianity), and poor sample efficiency coupled with estimator bias. These flaws motivate the adoption of CMMD, which leverages CLIP's broader coverage of visual semantics and MMD's distribution-free properties (Jayasumana et al., 2023).

2. Mathematical Formalism

CMMD is constructed by embedding image sets into a vector space via a pre-trained CLIP encoder and then applying the squared MMD, computed using a Gaussian RBF kernel over these embeddings. Let $P$ and $Q$ denote two probability distributions over $\mathbb{R}^d$ (the CLIP embedding space). The mean embedding in a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$ is given by

$$\mu_P = \mathbb{E}_{x \sim P}[\phi(x)]$$

where $\phi$ is the feature map induced by $k$.

The squared MMD is defined as

$$\mathrm{MMD}_k^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

When $k$ is a characteristic kernel (as the Gaussian RBF is), this is a true metric on distributions: it vanishes if and only if $P = Q$.

For finite samples $X = \{x_i\}_{i=1}^m$ and $Y = \{y_j\}_{j=1}^n$, the unbiased U-statistic estimator is:

$$\widehat{\mathrm{MMD}_u^2}(X, Y) = \frac{1}{m(m-1)} \sum_{i \ne j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \ne j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j)$$

The kernel $k$ in CMMD is the Gaussian RBF,

$$k(u, v) = \exp\!\left(-\frac{\|\psi(u) - \psi(v)\|_2^2}{2\sigma^2}\right)$$

where $\psi(\cdot)$ denotes the normalized CLIP embedding and $\sigma$ is the bandwidth parameter. In practice, $\sigma$ can be fixed (e.g., $\sigma = 10$ for the ViT-L/14@336px model) (Jayasumana et al., 2023) or set using the median pairwise distance heuristic (Mukherjee et al., 2025).
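To make the definitions concrete, the following is a minimal NumPy sketch of the estimator, assuming the embeddings are already available as arrays; the function names are illustrative, and the unit-normalization step plays the role of $\psi(\cdot)$. This is a sketch of the formulas above, not the reference implementation.

```python
import numpy as np

def pairwise_sq_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Squared Euclidean distances between rows of a (m, d) and b (n, d)."""
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v, clipped to avoid tiny negatives.
    aa = np.sum(a ** 2, axis=1)[:, None]
    bb = np.sum(b ** 2, axis=1)[None, :]
    return np.maximum(aa + bb - 2.0 * a @ b.T, 0.0)

def median_heuristic_sigma(x: np.ndarray, y: np.ndarray) -> float:
    """Bandwidth set to the median pairwise distance over the pooled sample."""
    z = np.vstack([x, y])
    d2 = pairwise_sq_dists(z, z)
    off_diag = d2[~np.eye(len(z), dtype=bool)]  # drop zero self-distances
    return float(np.sqrt(np.median(off_diag)))

def cmmd(x: np.ndarray, y: np.ndarray, sigma: float | None = None) -> float:
    """Unbiased squared-MMD U-statistic with a Gaussian RBF kernel on
    unit-normalized embeddings (rows of x and y are CLIP embeddings)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # psi(.): normalize
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    if sigma is None:
        sigma = median_heuristic_sigma(x, y)
    gamma = 1.0 / (2.0 * sigma ** 2)
    k_xx = np.exp(-gamma * pairwise_sq_dists(x, x))
    k_yy = np.exp(-gamma * pairwise_sq_dists(y, y))
    k_xy = np.exp(-gamma * pairwise_sq_dists(x, y))
    m, n = len(x), len(y)
    # U-statistic: the i = j diagonal terms are excluded within each set.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())
```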

3. Algorithmic Procedure and Implementation

The CMMD calculation consists of the following steps:

  1. Feature Extraction (a minimal embedding sketch follows this list)
    • Apply a pre-trained CLIP visual encoder (e.g., ViT-L/14@336px or ViT-H-14-quickgelu) to both image sets, producing $d$-dimensional embeddings. Embeddings may be normalized to unit norm.
  2. Kernelization
    • Compute the pairwise squared Euclidean distances between all embeddings from both sets.
    • Apply the Gaussian RBF kernel with bandwidth $\sigma$, either fixed or determined via the median heuristic.
  3. MMD Estimation
    • Compute the unbiased U-statistic as above. For ease of reporting, the value is conventionally multiplied by a constant (e.g., $1000$).
  4. Permutation Testing and Power Estimation (where hypothesis testing between sets is desired; see the sketch after the outline table below)
    • Pool all embeddings.
    • Shuffle and split into groups, repeatedly recomputing the MMD statistic to build its distribution under the null.
    • Compare the observed statistic to this distribution to obtain a $p$-value.
  5. Computational Considerations
    • CLIP inference can be batched for efficiency (e.g., $\sim 2$ ms per image on TPU).
    • Kernel and U-statistic computation is $O(N^2 d)$, amenable to batching and parallelization. In experiments, evaluation over 30,000 images requires $\sim 70$ ms, which compares favorably to FID computation (Jayasumana et al., 2023).
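To make step 1 concrete, here is a sketch of batched embedding extraction. It assumes the Hugging Face transformers and Pillow packages and the openai/clip-vit-large-patch14-336 checkpoint (corresponding to the ViT-L/14@336px variant); these are illustrative choices, not necessarily the tooling used in the cited papers.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for the ViT-L/14@336px variant referenced above.
MODEL_ID = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_embeddings(paths: list[str], batch_size: int = 32) -> np.ndarray:
    """Return an (N, d) array of CLIP image embeddings for the given files."""
    feats = []
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=images, return_tensors="pt")
        feats.append(model.get_image_features(**inputs).cpu().numpy())
    return np.concatenate(feats, axis=0)
```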

An outline of the CMMD test:

Step | Description | Key Details
Embed | CLIP encoder, normalize to unit length | Suitable variant is dataset-dependent
Kernelize | Gaussian RBF on pairwise distances | $\sigma$ by median heuristic or fixed
Estimate MMD | Unbiased U-statistic as above | Supports direct distance estimation
Permute/Test | Permutation test for statistical hypothesis testing | Robust in low-$n$ regime, distribution-free
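Step 4 (the Permute/Test row above) reduces to a Monte Carlo permutation test. The sketch below reuses the cmmd() helper from Section 2; it is one straightforward realization of the procedure, not necessarily the exact protocol of the cited papers.

```python
import numpy as np

def permutation_test(x: np.ndarray, y: np.ndarray, n_perm: int = 1000,
                     sigma: float | None = None, seed: int = 0) -> float:
    """p-value for the null hypothesis that x and y share a distribution."""
    rng = np.random.default_rng(seed)
    observed = cmmd(x, y, sigma=sigma)
    pooled = np.vstack([x, y])
    m = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))  # random relabeling under the null
        if cmmd(pooled[perm[:m]], pooled[perm[m:]], sigma=sigma) >= observed:
            exceed += 1
    # Add-one smoothing keeps the Monte Carlo p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Test power at a chosen significance level can then be estimated by repeating this test over independently resampled groups and recording the rejection rate.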

4. Comparative Performance and Empirical Results

CMMD demonstrates both statistical and practical advantages over FID and similar metrics. Empirical findings include:

  • Alignment with Human Judgment: CMMD correctly ranked a full text-to-image model above its early-stopped variant in 92.5% of cases, matching human raters. FID and its unbiased form, by contrast, failed this test (Jayasumana et al., 2023).
  • Monotonicity During Progressive Refinement: For models using diffusion or autoregressive refinement, FID scores exhibit oscillatory or degraded behavior, while CMMD decreases monotonically, reflecting the qualitative improvement (Jayasumana et al., 2023).
  • Sensitivity to Distortions: When generated images were perturbed via VQGAN token corruption, CMMD tracked the degradation reliably, whereas FID scores fluctuated and could paradoxically improve under certain distortions (Jayasumana et al., 2023).
  • Sample Efficiency: CMMD achieves statistically stable estimates with as few as 1,000 samples, unlike FID, which requires more than 20,000 images for stability. In legal and policy applications (e.g., art originality assessment), CMMD can distinguish human from AI-generated art at over 95% power with only 7–10 samples per group (Mukherjee et al., 2025); a stability check of this kind is sketched below.
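The sample-efficiency behavior described above can be probed directly. The helper below is a hypothetical sketch (reusing cmmd() from Section 2) that recomputes the estimate over random subsamples of increasing size so its spread can be inspected:

```python
import numpy as np

def stability_curve(x: np.ndarray, y: np.ndarray,
                    sizes=(250, 500, 1000, 2000), n_rep: int = 20,
                    seed: int = 0) -> dict[int, tuple[float, float]]:
    """Mean and std of cmmd() over n_rep random subsamples at each size."""
    rng = np.random.default_rng(seed)
    curve = {}
    for n in sizes:
        vals = [cmmd(x[rng.choice(len(x), n, replace=False)],
                     y[rng.choice(len(y), n, replace=False)])
                for _ in range(n_rep)]
        curve[n] = (float(np.mean(vals)), float(np.std(vals)))
    return curve
```

A flat mean and shrinking standard deviation across sizes indicates the estimate has stabilized at that sample budget.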

5. Theoretical Properties

Key theoretical aspects of CMMD include:

  • Metric Validity: With a characteristic kernel and injective embedding (guaranteed by CLIP's expressive power), MMD is a metric on probability distributions.
  • No Distributional Assumptions: Unlike FID, which relies on multivariate normality, CMMD imposes no such constraints, retaining full sensitivity to non-Gaussian and multimodal structure in the embedding distribution.
  • Sample Efficiency and Bias: The U-statistic estimator is unbiased, with convergence at rate $O(1/\sqrt{N})$ independently of the embedding dimension, a marked contrast to the $O(d^2/N)$ scaling required for FID covariance estimation (Jayasumana et al., 2023).
  • Generalizability: Use of CLIP broadens the domain of image types (including stylized, compositional, and non-photorealistic images) for which quality evaluations remain theoretically justified and practically meaningful.

6. Applications and Interpretability

CMMD serves in multiple domains where quantification of distributional discrepancy in high-dimensional, semantically structured data is required:

  • Text-to-Image Model Evaluation: CMMD provides robust assessment for next-generation generative models, particularly where traditional metrics underperform due to domain shift.
  • Intellectual Property and Originality Testing: By enabling sensitive, distribution-level comparison, CMMD assists in distinguishing genuine creative novelty in AI outputs, with direct applicability to legal originality and authorship inquiries (Mukherjee et al., 2025).
  • Statistical Hypothesis Testing: The permutation test framework equipped with CMMD enables nonparametric two-sample testing, with well-calibrated type I error even in small-sample regimes.
  • Interpretability: Lower CMMD scores imply higher similarity; the absolute value depends on $\sigma$ and the embedding normalization, so comparative use is recommended. All evaluations should report sample size, $\sigma$, and CLIP variant for reproducibility.

7. Implementation and Usage Considerations

CMMD implementations are available in JAX and Python, with code supporting embedding calculation, kernelization, MMD estimation, and permutation testing (Jayasumana et al., 2023; Mukherjee et al., 2025). The CLIP encoder choice, embedding normalization, and kernel bandwidth are programmable. For large $n$, practitioners may adopt kernel approximations or subsampling, but the standard $O(n^2 d)$ computation remains practical for commonly used batch sizes, especially when the Gram matrix is accumulated in blocks, as sketched below.
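As one concrete instance of such batching, the kernel sums can be accumulated block by block so the full Gram matrix is never materialized. The helper below is an illustrative sketch, not part of any published CMMD implementation; note that for within-set terms the diagonal contributes exactly $m$ ones (since $k(x, x) = 1$ for the RBF kernel), which are subtracted before forming the U-statistic.

```python
import numpy as np

def blocked_kernel_sum(a: np.ndarray, b: np.ndarray, gamma: float,
                       block: int = 1024) -> float:
    """Sum of exp(-gamma * ||u - v||^2) over all row pairs (u in a, v in b),
    processing a in blocks to keep peak memory at O(block * len(b))."""
    bb = np.sum(b ** 2, axis=1)[None, :]
    total = 0.0
    for i in range(0, len(a), block):
        blk = a[i:i + block]
        d2 = np.sum(blk ** 2, axis=1)[:, None] + bb - 2.0 * blk @ b.T
        total += float(np.exp(-gamma * np.maximum(d2, 0.0)).sum())
    return total

# Example: the within-set U-statistic term for a sample x of size m becomes
#   (blocked_kernel_sum(x, x, gamma) - m) / (m * (m - 1))
# and the cross term is 2 * blocked_kernel_sum(x, y, gamma) / (m * n).
```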

CMMD's robustness to sample size, statistical power with few examples, freedom from strict distributional assumptions, and demonstrated calibration in practical and legal test scenarios establish it as a comprehensive alternative to existing image distribution comparison techniques for both technical benchmarking and broader societal applications (Jayasumana et al., 2023; Mukherjee et al., 2025).
