Multi-Population-aware MMD (MMD-MP)
- The paper introduces MMD-MP, which omits the intra-machine similarity term to reduce variance in statistical tests for detecting heterogeneous machine-generated texts.
- It leverages deep kernel training with robust variance estimation, achieving significant improvements in test power and AUROC over standard MMD methods.
- MMD-MP is applicable at both paragraph and sentence levels, enabling reliable and transferable detection across various large language models and decoding strategies.
Multi-Population-aware Maximum Mean Discrepancy (MMD-MP) is an optimization method for distributional two-sample tests that addresses the challenge of detecting machine-generated texts originating from diverse LLMs. By modeling population structure among generated texts, MMD-MP produces a highly stable and powerful statistical test, significantly improving over standard MMD-based approaches when the “machine” class itself comprises heterogeneous subpopulations. The method is particularly effective for distinguishing between human- and machine-generated texts when the latter are drawn from multiple LLMs or decoding strategies (Zhang et al., 2024).
1. Fundamentals of Maximum Mean Discrepancy
Maximum Mean Discrepancy (MMD) quantifies the difference between two distributions $\mathbb{P}$ and $\mathbb{Q}$ over a domain $\mathcal{X}$ by embedding them in a reproducing-kernel Hilbert space (RKHS) with kernel $k$. The squared MMD is

$$\mathrm{MMD}^2(\mathbb{P},\mathbb{Q};k)=\mathbb{E}_{x,x'\sim\mathbb{P}}[k(x,x')]+\mathbb{E}_{y,y'\sim\mathbb{Q}}[k(y,y')]-2\,\mathbb{E}_{x\sim\mathbb{P},\,y\sim\mathbb{Q}}[k(x,y)].$$

With samples $S_{\mathbb{P}}=\{x_i\}_{i=1}^{n}\sim\mathbb{P}$ and $S_{\mathbb{Q}}=\{y_j\}_{j=1}^{m}\sim\mathbb{Q}$, the unbiased U-statistic estimate is

$$\widehat{\mathrm{MMD}}_u^2=\frac{1}{n(n-1)}\sum_{i\neq j}k(x_i,x_j)+\frac{1}{m(m-1)}\sum_{i\neq j}k(y_i,y_j)-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k(x_i,y_j).$$
MMD has desirable theoretical properties for non-parametric hypothesis testing and is widely used for two-sample problems in text and vision domains.
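As a concrete illustration, the unbiased U-statistic above can be computed in a few lines. This is a minimal sketch: the RBF kernel, its bandwidth, and the synthetic Gaussian "features" stand in for the paper's trained deep kernel on text embeddings.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel Gram matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_u(X, Y, bandwidth=1.0):
    """Unbiased U-statistic estimate of squared MMD (diagonal terms excluded)."""
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))  # stand-in for "human" features
Y = rng.normal(1.0, 1.0, size=(200, 5))  # stand-in for shifted "machine" features
print(mmd2_u(X, Y))
```

For distinct distributions the estimate is clearly positive, while for two samples from the same distribution it fluctuates around zero.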
2. Variance Inflation due to Multiple Populations
When employing a deep kernel $k_\omega$ (parameterized, e.g., via a neural network atop a pretrained encoder such as RoBERTa) for MMD-based detection, one optimizes a test-power proxy,

$$\hat{J}(\mathbb{P},\mathbb{Q};k_\omega)=\frac{\widehat{\mathrm{MMD}}_u^2(\mathbb{P},\mathbb{Q};k_\omega)}{\hat{\sigma}_{\mathcal{H}_1}(\mathbb{P},\mathbb{Q};k_\omega)},$$

where $\hat{\sigma}_{\mathcal{H}_1}$ estimates the asymptotic standard deviation of the statistic under the alternative. In the context of machine-generated texts, the “machine” sample may comprise outputs from a variety of LLMs and sampling settings, rendering $\mathbb{Q}$ a mixture of subpopulations. This population heterogeneity makes the intra-machine term $\mathbb{E}_{y,y'\sim\mathbb{Q}}[k(y,y')]$ unstable and difficult to optimize. Empirically, this leads to increased sample variance in the MMD statistic during kernel learning; as shown on synthetic and real data (e.g., Figure 1 in the paper), this instability can impair the reliability of hypothesis tests. A detailed decomposition attributes the variance escalation mainly to the kernel terms between machine-generated samples drawn from different subpopulations (Zhang et al., 2024).
3. The Multi-Population-Aware Objective (MMD-MP)
Removal of the Intra-Machine Term
MMD-MP introduces the Multi-Population Proxy (MPP), which omits the problematic intra-machine term:

$$\mathrm{MPP}(\mathbb{P},\mathbb{Q};k)=\mathbb{E}_{x,x'\sim\mathbb{P}}[k(x,x')]-2\,\mathbb{E}_{x\sim\mathbb{P},\,y\sim\mathbb{Q}}[k(x,y)].$$

The unbiased U-statistic estimator for equal sample sizes $n$ is

$$\widehat{\mathrm{MPP}}_u=\frac{1}{n(n-1)}\sum_{i\neq j}H_{ij},\qquad H_{ij}=k(x_i,x_j)-k(x_i,y_j)-k(y_i,x_j).$$
By bypassing the generator–generator similarity term, MMD-MP directly targets human–machine discrepancies.
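A minimal numerical sketch of the $\widehat{\mathrm{MPP}}_u$ estimator, again assuming an illustrative RBF kernel on synthetic features rather than the paper's deep kernel:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mpp_u(X, Y, bandwidth=1.0):
    """Unbiased MPP estimate: the MMD U-statistic without the machine-machine
    similarity term. H[i, j] = k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j)."""
    n = len(X)
    assert len(Y) == n, "equal sample sizes assumed"
    Kxx = rbf_kernel(X, X, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    H = Kxx - Kxy - Kxy.T  # Kxy.T[i, j] = k(y_i, x_j) by symmetry of k
    np.fill_diagonal(H, 0.0)  # U-statistic: average over i != j only
    return H.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(1.0, 1.0, size=(200, 5))
print(mpp_u(X, Y))
```

Note that, unlike squared MMD, the MPP need not vanish when the two distributions coincide; it serves as a kernel-learning objective (larger when human and machine samples are better separated) rather than as a calibrated test statistic.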
Variance Estimation and Optimization
Under the alternative, the asymptotic distribution is

$$\sqrt{n}\left(\widehat{\mathrm{MPP}}_u-\mathrm{MPP}\right)\xrightarrow{d}\mathcal{N}\!\left(0,\sigma_{\mathcal{H}_1}^2\right),$$

where

$$\sigma_{\mathcal{H}_1}^2=4\left(\mathbb{E}[H_{12}H_{13}]-\mathbb{E}[H_{12}]^2\right).$$

The objective optimized is

$$\hat{J}_\lambda(\mathbb{P},\mathbb{Q};k_\omega)=\frac{\widehat{\mathrm{MPP}}_u}{\sqrt{\hat{\sigma}_{\mathcal{H}_1}^2+\lambda}},$$

where $\lambda$ is a small ridge parameter for numerical stability. This construction yields lower variance and increased stability during kernel training, particularly in multi-generator contexts.
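Putting these pieces together, the criterion $\hat{J}_\lambda$ can be sketched as below. The RBF kernel and Gaussian data are illustrative, and the row-mean plug-in used for $\hat{\sigma}_{\mathcal{H}_1}^2$ is the standard U-statistic variance estimate, assumed here as a stand-in for the paper's exact estimator.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mpp_objective(X, Y, bandwidth=1.0, lam=1e-8):
    """Return (MPP-hat, variance estimate, J-hat = MPP-hat / sqrt(var + lam))."""
    n = len(X)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    H = Kxx - Kxy - Kxy.T
    np.fill_diagonal(H, 0.0)
    mpp = H.sum() / (n * (n - 1))
    row_means = H.sum(1) / (n - 1)
    # Plug-in estimate of 4 * (E[H_12 H_13] - E[H_12]^2), clipped at zero
    var = 4.0 * max(np.mean(row_means**2) - mpp**2, 0.0)
    return mpp, var, mpp / np.sqrt(var + lam)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(1.0, 1.0, size=(200, 5))
mpp, var, J = mpp_objective(X, Y)
print(mpp, var, J)
```

During training, $\hat{J}_\lambda$ (not $\widehat{\mathrm{MPP}}_u$ alone) is maximized, so kernels that inflate the variance estimate are penalized.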
4. Algorithmic Structure
Training the Deep Kernel
- Initialize with human samples $S_{\mathbb{P}}$, machine samples $S_{\mathbb{Q}}$, a fixed pretrained encoder, kernel parameters $\omega$, and hyperparameters including the ridge constant $\lambda$ and the learning rate.
- For each iteration: build minibatches from $S_{\mathbb{P}}$ and $S_{\mathbb{Q}}$, compute $\widehat{\mathrm{MPP}}_u$ and the variance estimate $\hat{\sigma}_{\mathcal{H}_1}^2$, and update $\omega$ to maximize $\hat{J}_\lambda$ via Adam.
- Output is an optimized deep kernel $k_\omega$.
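The paper updates deep-kernel parameters with Adam; as a dependency-free sketch of the same maximize-$\hat{J}_\lambda$ principle, the loop below selects an RBF bandwidth from a grid instead of taking gradient steps on $\omega$ (a simplified, hypothetical stand-in). The two-component "machine" sample mimics a multi-population mixture.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def j_hat(X, Y, bandwidth, lam=1e-8):
    """Ratio objective MPP-hat / sqrt(var-hat + lam) for one kernel setting."""
    n = len(X)
    H = rbf_kernel(X, X, bandwidth) - rbf_kernel(X, Y, bandwidth) \
        - rbf_kernel(Y, X, bandwidth)
    np.fill_diagonal(H, 0.0)
    mpp = H.sum() / (n * (n - 1))
    row_means = H.sum(1) / (n - 1)
    var = 4.0 * max(np.mean(row_means**2) - mpp**2, 0.0)
    return mpp / np.sqrt(var + lam)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(150, 8))                    # "human" features
Y = np.concatenate([rng.normal(1.0, 1.0, size=(75, 8)),    # machine population 1
                    rng.normal(-1.0, 1.2, size=(75, 8))])  # machine population 2
grid = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
best_bw = max(grid, key=lambda bw: j_hat(X, Y, bw))  # "training" = pick argmax J
print(best_bw)
```

Replacing the grid search with Adam on neural-network kernel parameters recovers the structure of the paper's training loop.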
Paragraph-Level Detection
- Given test sets $S_{\mathbb{P}}^{te}$ and $S_{\mathbb{Q}}^{te}$, compute $\widehat{\mathrm{MMD}}_u^2(S_{\mathbb{P}}^{te},S_{\mathbb{Q}}^{te};k_\omega)$ under the trained kernel (MPP is used for training; the standard MMD estimate serves as the test statistic).
- Generate the null distribution via permutation, and calculate the $p$-value as the fraction of permuted MMD values exceeding the observed one.
- Suitable for batch, paragraph-based detection scenarios.
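The permutation step above can be sketched as follows, again with an illustrative RBF kernel and synthetic features in place of the trained deep kernel:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_u(X, Y, bandwidth=1.0):
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def permutation_pvalue(X, Y, bandwidth=1.0, n_perm=200, seed=0):
    """p-value = fraction of permuted statistics >= observed (with the usual
    +1 smoothing so the p-value is never exactly zero)."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    n = len(X)
    observed = mmd2_u(X, Y, bandwidth)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))  # reshuffle the pooled sample
        count += mmd2_u(Z[perm[:n]], Z[perm[n:]], bandwidth) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(100, 5))
Y = rng.normal(1.0, 1.0, size=(100, 5))
p = permutation_pvalue(X, Y)
print(p)
```

A clearly shifted machine sample yields a small $p$-value, rejecting the null at $\alpha=0.05$.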
Sentence-Level Detection
- Fix a reference set of human sentences $S_{\mathbb{P}}^{ref}$.
- For each candidate sentence $\tilde{y}$, compute the “biased” MMD estimate $\widehat{\mathrm{MMD}}_b^2(S_{\mathbb{P}}^{ref},\{\tilde{y}\};k_\omega)$ as its detection score.
- Use the resulting scores to evaluate AUROC for distinguishing single machine-generated sentences from human ones.
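The single-instance scoring and AUROC evaluation can be sketched as below. The biased (V-statistic) MMD keeps diagonal terms, so it is defined even for a one-element set; the RBF kernel, Gaussian features, and rank-based AUROC helper are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_b_single(S_ref, y, bandwidth=1.0):
    """Biased MMD^2 between a reference set and one candidate point."""
    Kss = rbf_kernel(S_ref, S_ref, bandwidth)
    Ksy = rbf_kernel(S_ref, y[None, :], bandwidth)
    return Kss.mean() + 1.0 - 2.0 * Ksy.mean()  # k(y, y) = 1 for the RBF kernel

def auroc(scores_pos, scores_neg):
    """Rank-based AUROC: P(positive score > negative score), ties count half."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

rng = np.random.default_rng(2)
S_ref = rng.normal(0.0, 1.0, size=(200, 5))   # "human" reference features
human = rng.normal(0.0, 1.0, size=(50, 5))    # human test sentences
machine = rng.normal(1.0, 1.0, size=(50, 5))  # shifted machine test sentences
s_h = [mmd2_b_single(S_ref, x) for x in human]
s_m = [mmd2_b_single(S_ref, x) for x in machine]
a = auroc(s_m, s_h)  # machine sentences should score higher
print(a)
```

Candidates far from the human reference cloud receive larger scores, so a well-separated machine distribution produces a high AUROC.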
5. Theoretical Guarantees
- The estimator is asymptotically normal:

$$\sqrt{n}\left(\widehat{\mathrm{MPP}}_u-\mathrm{MPP}\right)\xrightarrow{d}\mathcal{N}\!\left(0,\sigma_{\mathcal{H}_1}^2\right).$$

- For large $n$, the test power satisfies

$$\Pr_{\mathcal{H}_1}\!\left(\text{reject}\right)\approx\Phi\!\left(\frac{\sqrt{n}\,\mathrm{MPP}}{\sigma_{\mathcal{H}_1}}-\frac{r}{\sqrt{n}\,\sigma_{\mathcal{H}_1}}\right),$$

where $r$ is the rejection threshold and $\Phi$ is the standard normal CDF, showing that maximizing $\mathrm{MPP}/\sigma_{\mathcal{H}_1}$ aligns with statistical power maximization.
- Uniform convergence (Theorem 1) yields

$$\sup_{\omega}\left|\hat{J}_\lambda(\mathbb{P},\mathbb{Q};k_\omega)-J(\mathbb{P},\mathbb{Q};k_\omega)\right|=O_p\!\left(n^{-1/2}\right)$$

under standard kernel regularity assumptions, confirming consistency of the learning objective.
6. Empirical Evaluation
Data and Benchmarks
- Paragraph detection: HC3 (Q&A, ChatGPT vs. human), XSum (news).
- Sentence detection: same sources.
- Machine-generated text sources include GPT-2 small/medium, GPT-3 small (∼550M), GPT-Neo small/large, GPT-j-6B, ChatGPT (GPT-3.5), and GPT4All-j.
Baselines
| Method | Detection Type | Kernel Type |
|---|---|---|
| MMD-O | Paragraph/Sentence | RBF kernel |
| MMD-D | Paragraph/Sentence | Deep kernel |
| C2ST-S/L | Paragraph/Sentence | Classifier two-sample |
| DetectGPT/OpenAI-D/CE-Clf | Single-instance | Direct/Classifier features |
Metrics
- Paragraph-level: test power at α = 0.05.
- Sentence-level: AUROC.
Performance Summary
- On synthetic 4-Gaussian mixtures, MMD-MP outperforms MMD-D by up to +9 points in test power as sample variance grows.
- On HC3 (3,100 paragraphs), MMD-MP achieves 93.2% power vs. 91.8% for MMD-D; similar gains are observed on GPT3-S, Neo-S, and mixed settings.
- For HC3 (1,000 paragraphs), an average improvement of +2–6 points over MMD-D across both single- and multi-generator configurations.
- In unbalanced settings (2000 human vs. 400 machine): +7–14 point test power and +4–9 point AUROC improvement over MMD-D.
- For sentence-level detection, MMD-MP surpasses DetectGPT, ChatGPT-D, and the CE-classifier by 1–2 points AUROC on ChatGPT, and by 5–15 points AUROC on more challenging models.
- Transfer experiments (trained on ChatGPT+GPT-2, tested on GPT-Neo-L, GPT-j, or GPT4All-j) show gains of +23–28 points test power and +3–5 points AUROC over MMD-D.
- t-SNE visualizations show that MMD-MP yields more compactly clustered human texts and more clearly separated machine-generated clusters, consistent with reduced multi-population variance.
7. Practical Implementation and Significance
MMD-MP is deployable across both batch (paragraph-level) and real-time (sentence-level) machine-generated text detection scenarios. The method first compiles datasets representing both human and potentially multi-population machine-generated text. The deep kernel is trained via the MMD-MP criterion, omitting the intra-machine similarity term to control variance. Detection applies the 2-sample permutation framework for group content, or the single-instance “biased” MMD estimate for individual sentences. MMD-MP demonstrates consistent advantages in statistical power, stability, and transferability across unseen LLMs, establishing its efficacy as a robust kernel optimization method specifically designed for the multi-distributional landscape of modern text generation (Zhang et al., 2024).