
Maximum Mean Discrepancy Split Criterion

Updated 11 December 2025
  • The Maximum Mean Discrepancy (MMD) split criterion is a nonparametric approach that partitions data by maximizing divergence between empirical distributions.
  • It employs kernel methods to compute divergence, guiding both optimal train/validation splits for robust model selection and recursive clustering procedures.
  • Empirical validations show that MMD-based splits enhance out-of-distribution model performance and achieve near-perfect clustering metrics.

The Maximum Mean Discrepancy (MMD) split criterion is a data-partitioning principle designed to maximize distributional separation between subsets, primarily motivated by domain shift and unsupervised, data-driven clustering challenges. The MMD statistic serves as a nonparametric measure of divergence between probability distributions and underpins principled split criteria for tasks including out-of-distribution model selection and unsupervised functional data clustering. Recent major developments formalize both split-maximization and recursive clustering approaches based on the maximization of MMD and its weighted variants, delivering theoretically grounded algorithms with empirical and oracle guarantees (Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025).

1. Definition of Maximum Mean Discrepancy and its Empirical Estimation

Let $P$ and $Q$ be probability distributions over a space $\mathcal{X}$ and let $\kappa:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ denote a positive-definite kernel with feature map $\phi:\mathcal{X}\to\mathcal{H}$ into an RKHS $\mathcal{H}$. The squared Maximum Mean Discrepancy is defined as

$$\mathrm{MMD}^2(P,Q)=\|\mu_P-\mu_Q\|^2_{\mathcal H}=\Big\|\mathbb{E}_{x\sim P}[\phi(x)] - \mathbb{E}_{y\sim Q}[\phi(y)]\Big\|^2_{\mathcal H}.$$

For i.i.d. samples $\{x_i\}_{i=1}^n$ from $P$ and $\{y_j\}_{j=1}^m$ from $Q$, the plug-in empirical estimator is

$$\widehat{\mathrm{MMD}}^2(P,Q)=\frac{1}{n^2}\sum_{i,i'}\kappa(x_i,x_{i'})+\frac{1}{m^2}\sum_{j,j'}\kappa(y_j,y_{j'})-\frac{2}{nm}\sum_{i,j}\kappa(x_i,y_j).$$

(Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025)
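
As a concrete illustration, the following minimal NumPy sketch computes the plug-in estimator above from a precomputed kernel matrix over the pooled sample. The Gaussian kernel choice and all names are illustrative assumptions, not taken from either paper.

```python
import numpy as np

def rbf_kernel_matrix(Z, sigma):
    """Gaussian kernel matrix over pooled data Z of shape (n_total, d)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(K, idx_a, idx_b):
    """Plug-in MMD^2 between the empirical distributions of two index sets."""
    Kaa = K[np.ix_(idx_a, idx_a)].mean()  # (1/n^2)  sum_{i,i'} kappa(x_i, x_i')
    Kbb = K[np.ix_(idx_b, idx_b)].mean()  # (1/m^2)  sum_{j,j'} kappa(y_j, y_j')
    Kab = K[np.ix_(idx_a, idx_b)].mean()  # (1/(nm)) sum_{i,j}  kappa(x_i, y_j)
    return Kaa + Kbb - 2.0 * Kab

# Example: two Gaussian samples with shifted means.
rng = np.random.default_rng(0)
X, Y = rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (60, 2))
K = rbf_kernel_matrix(np.vstack([X, Y]), sigma=1.0)
print(mmd2(K, list(range(50)), list(range(50, 110))))
```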

For data splitting, a weighted-MMD form is introduced:

$$d_w(A,B)=\frac{n_1 n_2}{n_1+n_2}\,\mathrm{MMD}^2(\widehat{P}_A,\widehat{P}_B),$$

where $A$ and $B$ are disjoint data subsets of sizes $n_1$ and $n_2$. The prefactor penalizes severely imbalanced splits (Chakrabarty et al., 15 Jul 2025).
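
Reusing `mmd2` from the sketch above, the weighted form is a one-line extension (again an illustrative sketch, not the papers' code):

```python
def weighted_mmd(K, idx_a, idx_b):
    """d_w(A, B): MMD^2 scaled by n1*n2/(n1+n2) to penalize imbalanced splits."""
    n1, n2 = len(idx_a), len(idx_b)
    if n1 == 0 or n2 == 0:
        return 0.0  # a degenerate split carries no separation
    return (n1 * n2) / (n1 + n2) * mmd2(K, idx_a, idx_b)
```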

2. MMD as a Data Splitting and Clustering Objective

Split Maximization under Domain Shift

For model selection robust to domain shift, the optimal train/validation partition maximizes the MMD between the two sets in feature space. Let $S$ index the dataset, $V\subset S$ designate the validation subset (a fraction $h$ of the data), and $T=S\setminus V$ the training subset. With constraints ensuring class/domain representativity ($|V_{(y,d)=g}|=h\,|S_{(y,d)=g}|$ for every class/domain group $g$), the split maximizes

$$\max_{T,V}\;\|\mu_{\mathbb{P}_T}-\mu_{\mathbb{P}_V}\|_{\mathcal H},$$

subject to cardinality constraints (Napoli et al., 29 May 2024).

Binary Splitting for Maximum Cluster Separation

For unlabelled data, the task is to find a partition $(A,B)$ of a dataset $\mathcal{E}$ so that the weighted MMD $d_w(A,B)$ is maximized. The binary-splitting ("BS") procedure greedily transfers, one at a time, the point whose move maximizes $d_w$, continuing until the overall maximal separation is found (Chakrabarty et al., 15 Jul 2025).

3. Algorithmic Realization: Kernel k-means and Recursive Binary Splitting

Constrained Kernel k-means for MMD Splits

Maximizing the between-set MMD is equivalent (via the law of total variance in the RKHS) to minimizing the within-cluster variance for $k=2$ kernel k-means, under the same constraints. The optimal partition is found by the following steps (see the sketch after this list):

  • Alternating centroid updates in feature space and constrained assignments.
  • Assignments are posed as linear programs: let $U\in\{0,1\}^{n\times 2}$ assign each point to one cluster, constrained to have fixed (class/domain) cluster sizes.
  • By the Hoffman–Kruskal total-unimodularity result, the LP relaxation is exact for this setup.
  • For large $n$, the kernel matrix can be approximated efficiently (Nyström approximation) (Napoli et al., 29 May 2024).
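
A hedged sketch of this alternating scheme, restricted for brevity to a single cardinality constraint (a fixed validation size `n_v`): with one such constraint, the integral LP optimum reduces to a sort, whereas the full class/domain-balanced version would solve a small LP (e.g. with `scipy.optimize.linprog`). All function and variable names are illustrative assumptions.

```python
import numpy as np

def kernel_dist_to_centroid(K, idx_cluster):
    """||phi(x_i) - mu_C||^2 for every point i, via the kernel trick."""
    c = np.asarray(idx_cluster)
    cross = K[:, c].mean(axis=1)        # (1/|C|)   sum_{j in C} K_ij
    within = K[np.ix_(c, c)].mean()     # (1/|C|^2) sum_{j,j' in C} K_jj'
    return np.diag(K) - 2.0 * cross + within

def constrained_kernel_2means(K, n_v, n_iter=20, seed=0):
    """k=2 kernel k-means with |V| fixed to n_v (maximizes between-set MMD)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    V = rng.choice(n, size=n_v, replace=False)
    for _ in range(n_iter):
        T = np.setdiff1d(np.arange(n), V)
        # Assignment step: the LP optimum under one cardinality constraint
        # is the n_v points relatively closest to V's centroid.
        margin = kernel_dist_to_centroid(K, V) - kernel_dist_to_centroid(K, T)
        new_V = np.argsort(margin)[:n_v]
        if set(new_V) == set(V):
            break                        # assignments converged
        V = new_V
    return np.sort(V), np.setdiff1d(np.arange(n), V)
```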

Recursive Binary Splitting and Clustering

The BS algorithm for functional data is defined as follows:

  1. Initialize one cluster as empty ($C_1^{(0)}=\emptyset$) and the other as the full set ($C_2^{(0)}=\mathcal{E}$).
  2. Iteratively move the point maximizing the incremental $d_w$ until all $n-1$ possible splittings have been examined.
  3. Select the split (iteration) where the weighted MMD is maximized.
  4. Efficient update formulas ensure $O(n^2)$ complexity for $n$ data points (Chakrabarty et al., 15 Jul 2025).

This procedure forms the backbone of both single-split and recursive (multiway) clusterings.
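
A naive sketch of one BS pass, reusing `weighted_mmd` from the Section 1 sketches: it recomputes $d_w$ from scratch for every candidate move, so it runs far slower than the paper's $O(n^2)$ incremental updates, and is meant only to make the search explicit.

```python
def binary_split(K):
    """One BS pass: greedily grow A from empty, track the best d_w split."""
    n = K.shape[0]
    A, B = [], list(range(n))
    best_val, best_split = -np.inf, None
    for _ in range(n - 1):  # examine all n-1 possible splittings
        # Move the point from B to A whose transfer maximizes d_w.
        val, p = max((weighted_mmd(K, A + [p], [q for q in B if q != p]), p)
                     for p in B)
        A.append(p)
        B.remove(p)
        if val > best_val:
            best_val, best_split = val, (A.copy(), B.copy())
    return best_split, best_val
```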

4. Extension to Unknown or Fixed Number of Clusters

With unknown $K$, an additional "single cluster check" (SCC) is implemented:

  • Let $V$ be the ratio of the maximum to the minimum weighted MMD observed over all possible splits.
  • The null $H_0$ ("single cluster") is accepted if $|V-1| < |V-R/H|$ for ratios $R,\,H$ computed from extremal MMDs; otherwise, it is rejected and the cluster is split.
  • The CURBS-I algorithm combines SCC and recursive BS, producing $\widehat K$ clusters when further splits are unwarranted (Chakrabarty et al., 15 Jul 2025).

For fixed $K=J$ ("CURBS-II"), after recursive bisection, closest-pair merging via $d_w$ controls the final cluster count. This merging phase preserves the perfect-order-preserving (POP) property: no cluster ever splits a true class.

5. Theoretical Guarantees and Oracle Analysis

In the oracle model, every observation’s true population label is known and used to define “representative” empirical measures per cluster:

$$\widetilde P_S = \sum_{j=1}^K \left(\frac{|S\cap\mathcal{D}_j|}{|S|}\right)\widehat P_j.$$

  • Under mixture models, MMD contracts strictly unless clusters are pure.
  • CURBS-I* (oracle variant) perfectly recovers the true clusters for balanced sizes as $n\to\infty$.
  • CURBS-II* guarantees perfect-order-preserving (POP) clustering for all $J$: if the true number of clusters is $K$, setting $J<K$ merges whole classes, while $J>K$ only splits classes (Chakrabarty et al., 15 Jul 2025).

6. Empirical Validation and Applications

Comparative studies across domain generalization (DG), unsupervised domain adaptation (UDA), and functional-data clustering demonstrate:

  • Cluster-based splits (linear or RBF kernel) close approximately 50% of the gap (in normalized accuracy) between random split and oracle split, outperforming metadata-based or leave-one-domain-out strategies.
  • Average test domain accuracy gains are +3 pp over random splitting (Napoli et al., 29 May 2024).
  • Direct analysis yields a Spearman rank correlation of $\rho=0.63$ ($p \ll 10^{-3}$) and a Pearson correlation of $r\approx0.61$ ($p\approx3.8\times10^{-4}$) between achieved MMD and held-out test accuracy, supporting the notion that maximizing MMD between train and validation sets benefits out-of-distribution model selection.
  • In functional data clustering, the weighted-MMD/BS framework yields near-perfect Rand indices—empirically outperforming alternative methods, even where clusters differ by only subtle scale changes (Chakrabarty et al., 15 Jul 2025).

7. Practical Considerations and Implementational Choices

  • Kernels: Gaussian, Laplacian, and linear kernels are all viable; bandwidths are chosen in practice by the median pairwise-distance heuristic (see the sketch after this list).
  • Computational complexity: $O(n^2)$ pairwise kernel updates are needed, but the BS incremental-update formulas and kernel approximations (e.g., Nyström) improve scalability.
  • The algorithms require no metadata beyond (class, domain) labels, used when available for constrained splitting or for the representative-measure computation in oracle results, and can be deployed in both unlabelled and semi-supervised settings.
  • The linear programming formulation allows enforcement or relaxation of class/domain balance constraints as required (Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025).
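
For the bandwidth selection mentioned in the first bullet, a common implementation of the median heuristic (an assumption; the papers may use a variant) is:

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_sigma(Z):
    """Gaussian kernel bandwidth set to the median pairwise distance."""
    return np.median(pdist(Z))
```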
References (2)

  • Napoli et al., 29 May 2024.
  • Chakrabarty et al., 15 Jul 2025.