Maximum Mean Discrepancy Split Criterion
- The Maximum Mean Discrepancy (MMD) split criterion is a nonparametric approach that partitions data by maximizing divergence between empirical distributions.
- It employs kernel methods to compute this divergence, which guides both optimal train/validation splits for improved robustness and recursive clustering procedures.
- Empirical validations show that MMD-based splits enhance out-of-distribution model performance and achieve near-perfect clustering metrics.
The Maximum Mean Discrepancy (MMD) split criterion is a data-partitioning principle designed to maximize distributional separation between subsets, primarily motivated by domain shift and unsupervised/data-driven clustering challenges. The MMD statistic serves as a nonparametric measure of divergence between probability distributions and underpins principled split criteria for tasks including out-of-distribution model selection and unsupervised functional data clustering. Recent major developments formalize both split-maximization and recursive clustering approaches based on the maximization of MMD and its weighted variants, delivering theoretically-grounded algorithms with empirical and oracle guarantees (Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025).
1. Definition of Maximum Mean Discrepancy and its Empirical Estimation
Let $P$ and $Q$ be probability distributions over a space $\mathcal{X}$, and let $k(\cdot,\cdot)$ denote a positive-definite kernel with feature map $\phi$ into an RKHS $\mathcal{H}$. The squared Maximum Mean Discrepancy is defined as
$$\mathrm{MMD}^2(P, Q) \;=\; \big\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \big\|_{\mathcal{H}}^2 \;=\; \mathbb{E}[k(x, x')] - 2\,\mathbb{E}[k(x, y)] + \mathbb{E}[k(y, y')].$$
For i.i.d. samples $\{x_i\}_{i=1}^m$ from $P$ and $\{y_j\}_{j=1}^n$ from $Q$, the unbiased empirical estimator is
$$\widehat{\mathrm{MMD}}^2_u \;=\; \frac{1}{m(m-1)} \sum_{i \neq i'} k(x_i, x_{i'}) \;+\; \frac{1}{n(n-1)} \sum_{j \neq j'} k(y_j, y_{j'}) \;-\; \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$
(Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025).
For data splitting, a weighted-MMD form is introduced:
$$\mathrm{MMD}_w(S_1, S_2) \;=\; w(n_1, n_2)\,\widehat{\mathrm{MMD}}^2(S_1, S_2),$$
where $S_1$ and $S_2$ are discrete data subsets of sizes $n_1$ and $n_2$. The prefactor $w(n_1, n_2)$ penalizes severely imbalanced splits (Chakrabarty et al., 15 Jul 2025).
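As a concrete illustration, the following NumPy sketch (not the papers' code) computes the unbiased MMD² estimate with a Gaussian kernel together with a weighted variant; the prefactor $n_1 n_2/(n_1+n_2)$ used for the weight is an assumed illustrative choice rather than the exact form from the paper.

```python
# Minimal sketch: unbiased MMD^2 with an RBF kernel, plus a weighted variant
# whose prefactor penalizes imbalanced splits (the weight n1*n2/(n1+n2) is an
# illustrative assumption, not necessarily the paper's exact form).
import numpy as np

def rbf_kernel(X, Y, bandwidth):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples X (m x d) and Y (n x d)."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # drop diagonal: unbiased
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

def weighted_mmd2(X, Y, bandwidth=1.0):
    """Imbalance-penalizing weight times the unbiased MMD^2 estimate."""
    n1, n2 = len(X), len(Y)
    return (n1 * n2) / (n1 + n2) * mmd2_unbiased(X, Y, bandwidth)
```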
2. MMD as a Data Splitting and Clustering Objective
Split Maximization under Domain Shift
For model selection robust to domain shift, the optimal train/validation partition maximizes the MMD between the two sets in feature space. Let $\{1, \dots, N\}$ index the dataset, $V$ designate the (fractional) validation subset, and $T = \{1, \dots, N\} \setminus V$ the training subset. With constraints ensuring class/domain representativity (a fixed proportion of every class and domain assigned to $V$), the split maximizes
$$\max_{V}\; \widehat{\mathrm{MMD}}^2\big(\{\phi(x_i)\}_{i \in T},\, \{\phi(x_j)\}_{j \in V}\big)$$
subject to cardinality constraints (Napoli et al., 29 May 2024).
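A minimal sketch of evaluating this objective for one candidate split is shown below; it assumes feature rows Phi and class labels y (hypothetical names), reuses the mmd2_unbiased helper sketched above, and enforces representativity by drawing a fixed fraction of every class into the validation set.

```python
# Illustrative only: score one class-stratified candidate split by the achieved
# MMD^2 between training and validation features (assumes mmd2_unbiased from the
# sketch above; Phi, y, and val_frac are hypothetical names).
import numpy as np

def stratified_split_mmd(Phi, y, val_frac=0.2, seed=0, bandwidth=1.0):
    rng = np.random.default_rng(seed)
    val_mask = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):                       # per-class representativity
        members = np.flatnonzero(y == c)
        n_val = max(1, int(round(val_frac * len(members))))
        val_mask[rng.choice(members, size=n_val, replace=False)] = True
    return val_mask, mmd2_unbiased(Phi[~val_mask], Phi[val_mask], bandwidth)
```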
Binary Splitting for Maximum Cluster Separation
For unlabelled data, the task is to find a partition $(S_1, S_2)$ of a dataset so that the weighted-MMD is maximized. The "BS" procedure recursively seeks the point addition that maximizes the weighted-MMD, continuing until overall maximal separation is found (Chakrabarty et al., 15 Jul 2025).
3. Algorithmic Realization: Kernel k-means and Recursive Binary Splitting
Constrained Kernel k-means for MMD Splits
Maximizing the between-set MMD is equivalent (via the law of total variance in the RKHS) to minimizing the within-cluster variance for kernel k-means, under the same constraints. The optimal partition is found by:
- Alternating centroid updates in feature space and constrained assignments.
- Assignments are posed as linear programs: let binary indicators $z_{ik} \in \{0, 1\}$ assign each point $i$ to one cluster $k$, constrained to have fixed (class/domain) cluster sizes.
- By Hoffman’s total-unimodularity result, the LP relaxation is exact for this setup.
- For large $n$, the kernel matrix can be approximated efficiently (e.g., via the Nyström approximation) (Napoli et al., 29 May 2024).
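The constrained assignment step can be illustrated as a small linear program. The sketch below is an assumed interface, not the authors' implementation: it uses explicit feature rows and scipy.optimize.linprog with two clusters and fixed per-class validation counts, whereas the papers operate with kernel-induced distances; the LP structure is the same.

```python
# Assign each point to one of two clusters, minimizing distance to the current
# centroids, with every class contributing a fixed number of points to cluster 1
# ("validation"). Total unimodularity makes the LP relaxation return 0/1 values.
import numpy as np
from scipy.optimize import linprog

def constrained_assignment(Phi, y, centroids, val_counts):
    """Phi: (n, d) features; centroids: (2, d); val_counts[c]: class-c points in cluster 1."""
    n = len(Phi)
    cost = ((Phi[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, 2)
    c = cost.reshape(-1)                   # variable z[i, k] sits at index 2*i + k

    A_eq, b_eq = [], []
    for i in range(n):                     # each point in exactly one cluster
        row = np.zeros(2 * n)
        row[2 * i] = row[2 * i + 1] = 1
        A_eq.append(row); b_eq.append(1)
    for cls, count in val_counts.items():  # fixed per-class size of cluster 1
        row = np.zeros(2 * n)
        row[2 * np.flatnonzero(y == cls) + 1] = 1
        A_eq.append(row); b_eq.append(count)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
    return res.x.reshape(n, 2).argmax(axis=1)  # 0 = train, 1 = validation
```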
Recursive Binary Splitting and Clustering
The BS algorithm for functional data is defined as follows:
- Initialize one cluster as empty ($S_1 = \emptyset$) and the other as the full dataset ($S_2 = \{x_1, \dots, x_n\}$).
- Iteratively move the point whose transfer maximizes the incremental weighted-MMD, until all possible splittings have been examined.
- Select the split (iteration) where weighted-MMD is maximized.
- Efficient incremental-update formulas keep the procedure computationally tractable for $n$ data points by avoiding recomputation of the kernel sums at each move (Chakrabarty et al., 15 Jul 2025).
This procedure forms the backbone of both single-split and recursive (multiway) clusterings.
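A naive sketch of the BS idea is given below (without the efficient incremental-update formulas, and assuming an mmd_fn that remains well-defined for singleton sets, e.g., a biased/V-statistic weighted-MMD estimate).

```python
# Greedy binary splitting: move one point at a time from S2 to S1, always
# choosing the transfer that maximizes the weighted-MMD, and keep the best
# partition seen over all iterations. mmd_fn(A, B) must accept singleton sets.
import numpy as np

def binary_split(X, mmd_fn):
    n = len(X)
    s1, s2 = [], list(range(n))                  # S1 starts empty, S2 is the full set
    best_score, best_partition = -np.inf, None
    for _ in range(n - 1):                       # keep at least one point in S2
        scores = [mmd_fn(X[s1 + [j]], X[[i for i in s2 if i != j]]) for j in s2]
        k = int(np.argmax(scores))
        s1.append(s2.pop(k))                     # best single-point transfer
        if scores[k] > best_score:
            best_score, best_partition = scores[k], (list(s1), list(s2))
    return best_partition, best_score
```

This naive version re-evaluates the weighted-MMD for every candidate transfer; the incremental-update formulas referenced above avoid that recomputation.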
4. Extension to Unknown or Fixed Number of Clusters
With an unknown number of clusters $K$, an additional "single cluster check" (SCC) is implemented:
- Let $R$ be the ratio of the maximum to the minimum weighted-MMD observed over all possible splits.
- The null hypothesis ("single cluster") is accepted if this ratio of extremal MMDs falls below a prescribed threshold; otherwise it is rejected and the split is carried out.
- The CURBS-I algorithm combines SCC and recursive BS, producing $\widehat{K}$ clusters once further splits are unwarranted (Chakrabarty et al., 15 Jul 2025).
For a fixed number of clusters $K$ ("CURBS-II"), after recursive bisection, closest-pair merging controls the final cluster count. This merging phase preserves the perfect-order-preserving (POP) property: no cluster ever splits a true class.
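A hedged sketch of such a merging phase is given below, assuming the "closest pair" is the pair of clusters with the smallest weighted-MMD between them (the published CURBS-II may use a different criterion) and representing clusters as index lists.

```python
# Repeatedly merge the pair of clusters with the smallest weighted-MMD between
# them until exactly K clusters remain.
import numpy as np

def merge_to_k(clusters, X, mmd_fn, K):
    clusters = [list(c) for c in clusters]
    while len(clusters) > K:
        best = None                                   # (distance, a, b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mmd_fn(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]                    # merge the closest pair
        del clusters[b]
    return clusters
```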
5. Theoretical Guarantees and Oracle Analysis
In the oracle model, every observation’s true population label is known and used to define “representative” empirical measures per cluster:
- Under mixture models, MMD contracts strictly unless clusters are pure.
- CURBS-I* (the oracle variant) perfectly recovers the true clusters for balanced cluster sizes as $n \to \infty$.
- CURBS-II* guarantees perfect-order-preserving (POP) clustering for all $K$: if the true number of clusters is $K_0$, choosing $K < K_0$ merges whole classes, while $K > K_0$ only splits classes (Chakrabarty et al., 15 Jul 2025).
6. Empirical Validation and Applications
Comparative studies across domain generalization (DG), unsupervised domain adaptation (UDA), and functional-data clustering demonstrate:
- Cluster-based splits (linear or RBF kernel) close approximately 50% of the gap (in normalized accuracy) between random split and oracle split, outperforming metadata-based or leave-one-domain-out strategies.
- Average test domain accuracy gains are +3 pp over random splitting (Napoli et al., 29 May 2024).
- Direct analysis shows statistically significant Spearman rank and Pearson correlations (p < 10⁻³ for the Spearman test) between the achieved MMD and held-out test accuracy, supporting the notion that maximizing MMD between train and validation sets benefits out-of-distribution model selection.
- In functional data clustering, the weighted-MMD/BS framework yields near-perfect Rand indices—empirically outperforming alternative methods, even where clusters differ by only subtle scale changes (Chakrabarty et al., 15 Jul 2025).
7. Practical Considerations and Implementational Choices
- Kernels: Gaussian, Laplacian, and linear kernels are used in practice, with the bandwidth selected by the median pairwise-distance heuristic (see the sketch after this list).
- Computational complexity: pairwise kernel evaluations scale quadratically in the number of points, but the BS incremental-update formulas and kernel approximations (e.g., Nyström) improve scalability.
- The algorithms require no explicit metadata beyond (class, domain) labels, when available, for splitting or for the representative-measure computation in the oracle analysis, and can be deployed in both unlabelled and semi-supervised settings.
- The linear programming formulation allows enforcement or relaxation of class/domain balance constraints as required (Napoli et al., 29 May 2024, Chakrabarty et al., 15 Jul 2025).
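The median pairwise-distance heuristic mentioned above is straightforward to implement; a minimal sketch:

```python
# Median of all pairwise Euclidean distances, a common bandwidth choice for
# Gaussian/Laplacian kernels.
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X):
    return np.median(pdist(X))
```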