Two-Sample Neural Classifier Test
- The test recasts two-sample hypothesis testing as a supervised classification problem that estimates likelihood ratios via neural network outputs.
- It employs strategies like data splitting, permutation calibration, and sequential e-values to maintain valid inference and optimize sample usage.
- Continuous witness functions and deep kernel variants enhance performance in high-dimensional settings compared to traditional tests.
A Two-Sample Neural Classifier Test is a two-sample procedure that recasts the hypothesis $H_0: P = Q$ versus $H_1: P \neq Q$ as a supervised prediction problem on pooled, artificially labeled data. Given a sample from $P$ and a sample from $Q$, one pools the observations, assigns a binary label indicating source membership, and trains a neural network or other probabilistic classifier to predict that label from the observation. Under $H_0$, the pooled feature-label pair $(X, Y)$ satisfies $X \perp\!\!\!\perp Y$; under $H_1$, the label is predictable from the feature, so $X \not\perp\!\!\!\perp Y$. In this framework, held-out accuracy, logit or probability scores, witness mean discrepancies, likelihood-ratio-style statistics, and e-values become test statistics, while calibration may proceed through exact binomial laws, permutation, asymptotic approximations, or e-processes with optional stopping guarantees (Lopez-Paz et al., 2016, Pandeva et al., 2022).
1. Formal setup and decision-theoretic interpretation
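The formal two-sample problem is to test
$$H_0: P = Q \quad \text{versus} \quad H_1: P \neq Q.$$
The classifier formulation pools the two samples and assigns label $Y = 1$ to points from $P$ and $Y = 0$ to points from $Q$. The resulting regression function is
$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x) = \frac{\pi\, p(x)}{\pi\, p(x) + (1 - \pi)\, q(x)},$$
where $\pi = \mathbb{P}(Y = 1)$, $p$ is the density of $P$, and $q$ is the density of $Q$. This yields the odds-likelihood identity
$$\frac{\eta(x)}{1 - \eta(x)} = \frac{\pi}{1 - \pi} \cdot \frac{p(x)}{q(x)}.$$
Accordingly, a probabilistic classifier estimating $\eta$ is estimating a surrogate for the likelihood ratio. In the balanced case $\pi = 1/2$, the Bayes-optimal score is
$$\eta(x) = \frac{p(x)}{p(x) + q(x)}.$$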
These identities make classifier-based testing simultaneously a density-ratio, independence-testing, and discriminative inference procedure (Cai et al., 2019, Lopez-Paz et al., 2016).
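The odds-likelihood relation can be checked numerically. The sketch below is illustrative only: it assumes label 1 marks draws from $P$ and label 0 draws from $Q$, uses two unit-variance Gaussians as stand-in densities, and all function names are hypothetical.

```python
import math

def normal_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_label_one(x, prior_one, p_pdf, q_pdf):
    """P(label = 1 | x), assuming label 1 marks P and label 0 marks Q."""
    num = prior_one * p_pdf(x)
    return num / (num + (1.0 - prior_one) * q_pdf(x))

def density_ratio_from_posterior(eta, prior_one):
    """Invert the odds identity: p(x)/q(x) = [eta/(1-eta)] * [(1-prior)/prior]."""
    return (eta / (1.0 - eta)) * ((1.0 - prior_one) / prior_one)

def p_pdf(x):
    return normal_pdf(x, 0.0)  # assumed stand-in for P

def q_pdf(x):
    return normal_pdf(x, 1.0)  # assumed stand-in for Q

# A classifier output eta, corrected for the class prior, recovers p/q.
eta = posterior_label_one(0.3, 0.3, p_pdf, q_pdf)
recovered_ratio = density_ratio_from_posterior(eta, 0.3)
true_ratio = p_pdf(0.3) / q_pdf(0.3)
```

In practice the posterior comes from a trained classifier rather than from known densities, so the recovered ratio is only as good as the classifier's calibration.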
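The same perspective underlies witness-function formulations. If $h: \mathcal{X} \to \mathbb{R}$ is a measurable witness, then the population discrepancy is
$$\tau(h) = \mathbb{E}_{X \sim P}[h(X)] - \mathbb{E}_{X \sim Q}[h(X)],$$
with empirical counterpart
$$\hat\tau(h) = \frac{1}{n_P} \sum_{i=1}^{n_P} h(x_i^{P}) - \frac{1}{n_Q} \sum_{j=1}^{n_Q} h(x_j^{Q}),$$
where $x_i^{P}$ and $x_j^{Q}$ are the evaluation points drawn from $P$ and $Q$.
Under $H_0$, $\tau(h) = 0$ for every witness $h$. This places neural classifier tests in the broader family of mean-discrepancy and integral-probability-metric procedures, but with the witness learned from data rather than fixed a priori (Kübler et al., 2022).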
2. Classical classifier two-sample tests
The standard classifier two-sample test, often abbreviated C2ST, splits the pooled labeled dataset into training and test subsets, trains a classifier on the training portion, and computes a test statistic on the untouched test portion. The paper "Revisiting Classifier Two-Sample Tests" (Lopez-Paz et al., 2016) focuses on hard accuracy, but the literature also uses soft probabilities, logits, cross-entropy, and AUC. For balanced classes and a held-out test set $\{(x_i, y_i)\}_{i=1}^{n_{\mathrm{te}}}$, the canonical hard-accuracy statistic is
$$\hat t = \frac{1}{n_{\mathrm{te}}} \sum_{i=1}^{n_{\mathrm{te}}} \mathbb{1}\big[\mathbb{1}[f(x_i) > 1/2] = y_i\big],$$
where $f(x)$ is the classifier's predicted probability of label 1 and the threshold $1/2$ corresponds to equal priors.
Under $H_0$ and balanced classes, $n_{\mathrm{te}}\,\hat t \sim \mathrm{Binomial}(n_{\mathrm{te}}, 1/2)$, so the null is exact and simple. More generally, permutation calibration is standard: labels on the held-out set are shuffled, the statistic is recomputed, and the p-value is the fraction of permuted statistics at least as large as the observed one, with the observed statistic counted among them. Kim, Ramdas, Singh, and Wasserman showed that this permutation-based approach is finite-sample valid for either half-permutation or full-permutation procedures, and that a Gaussian approximation based on held-out class-wise errors is asymptotically valid for any classifier satisfying their stability assumptions (Kim et al., 2016).
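The split, train, evaluate, and calibrate steps can be sketched end to end. This is a minimal illustration, not the procedure of any cited paper: a nearest-centroid rule stands in for the neural classifier, label 1 is assumed to mark the first sample, and `gaussian_sample` is a hypothetical toy-data helper. The binomial p-value is computed as a reference; it is exact only when held-out labels behave like independent fair coin flips, whereas the permutation p-value relies only on exchangeability of the held-out labels.

```python
import math
import random

def binomial_upper_tail(k, n, p=0.5):
    """P(Bin(n, p) >= k): reference null p-value for k correct out of n."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def fit_centroids(points, labels):
    """Toy stand-in for a neural classifier: one mean vector per class."""
    cents = {}
    for c in (0, 1):
        rows = [x for x, y in zip(points, labels) if y == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Predict the class whose centroid is closest in squared distance."""
    d = {c: sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in cents.items()}
    return 1 if d[1] < d[0] else 0

def c2st(sample_p, sample_q, n_perm=500, seed=0):
    """Return (held-out accuracy, binomial p-value, permutation p-value)."""
    rng = random.Random(seed)
    half = min(len(sample_p), len(sample_q)) // 2
    p, q = sample_p[:], sample_q[:]
    rng.shuffle(p)
    rng.shuffle(q)
    # Strict split: the classifier never sees the held-out half.
    train_x, train_y = p[:half] + q[:half], [1] * half + [0] * half
    test_x = p[half:2 * half] + q[half:2 * half]
    test_y = [1] * half + [0] * half
    cents = fit_centroids(train_x, train_y)
    preds = [predict(cents, x) for x in test_x]
    correct = sum(int(a == b) for a, b in zip(preds, test_y))
    acc = correct / len(test_y)
    p_binom = binomial_upper_tail(correct, len(test_y))
    # Permutation calibration: shuffle held-out labels, keep predictions fixed.
    exceed, perm_y = 0, test_y[:]
    for _ in range(n_perm):
        rng.shuffle(perm_y)
        exceed += sum(int(a == b) for a, b in zip(preds, perm_y)) >= correct
    p_perm = (1 + exceed) / (1 + n_perm)  # observed statistic counted once
    return acc, p_binom, p_perm

def gaussian_sample(n, mean, dim, rng):
    """Hypothetical toy data: n points from an isotropic unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]
```

For example, `c2st(gaussian_sample(200, 0.0, 2, rng), gaussian_sample(200, 2.0, 2, rng))` with `rng = random.Random(1)` compares two well-separated Gaussians; swapping in a neural network only changes `fit_centroids` and `predict`.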
Before later variants are introduced, three classical facts are central. First, the test must preserve a strict split between training and evaluation data; otherwise Type I error control is compromised. Second, the method is consistent if the classifier approaches Bayes risk, because the held-out accuracy then exceeds chance by a fixed effect size. Third, accuracy is interpretable: for the Bayes classifier with balanced classes,
$$\mathrm{acc}^\star = \frac{1}{2} + \frac{1}{2}\,\mathrm{TV}(P, Q),$$
so accuracy above $1/2$ directly quantifies total variation distance (Lopez-Paz et al., 2016).
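The link between Bayes accuracy and total variation can be derived in a few lines. The derivation assumes balanced classes, densities $p$ and $q$ for the two distributions, and the convention $\mathrm{TV}(P, Q) = \tfrac12 \int |p - q|$; the symbol $\mathrm{acc}^\star$ for the Bayes accuracy is notation chosen here.

```latex
\begin{aligned}
\mathrm{acc}^\star
  &= \int \max\Big\{\tfrac12\, p(x),\ \tfrac12\, q(x)\Big\}\, dx \\
  &= \tfrac12 \int \Big( \tfrac{p(x) + q(x)}{2} + \tfrac{|p(x) - q(x)|}{2} \Big)\, dx \\
  &= \tfrac12 + \tfrac14 \int |p(x) - q(x)|\, dx \\
  &= \tfrac12 + \tfrac12\, \mathrm{TV}(P, Q).
\end{aligned}
```

The second line uses $\max\{a, b\} = \tfrac{a+b}{2} + \tfrac{|a-b|}{2}$, and the third uses that $p$ and $q$ each integrate to one.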
| Statistic | Form | Typical calibration |
|---|---|---|
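| Hard accuracy | Mean of $\mathbb{1}[\hat y_i = y_i]$ on held-out labels | Exact binomial under balanced classes |
| Soft score | Mean predicted probability of the true label | Permutation |
| Logit mean difference | Mean held-out logit on points from $P$ minus mean held-out logit on points from $Q$ | Permutation |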
| AUC | ROC-area from held-out scores | Permutation |
A recurrent limitation of this classical form is that a single split wastes data: only the test split contributes directly to the final statistic, while the training split is “used up” in learning. Standard p-values are also not anytime valid, so repeated looks at accumulating data inflate Type I error unless additional machinery is introduced (Pandeva et al., 2022).
3. E-values, sequential validity, and E-C2ST
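The E-value Classifier Two-Sample Test (E-C2ST) replaces the one-shot p-value logic of classical C2ST with batchwise e-factors whose product forms an e-process. A conditional e-variable with respect to a null class $\mathcal{H}_0$ is a nonnegative measurable function $E$ such that
$$\mathbb{E}_{P_0}[E(Z) \mid W = w] \le 1$$
for all $P_0 \in \mathcal{H}_0$ and all conditioning values $w$. If data arrive sequentially and each $E_t$ is an e-variable conditional on the past observations, then
$$S_T = \prod_{t=1}^{T} E_t$$
is a nonnegative supermartingale under $\mathcal{H}_0$. Ville’s inequality yields
$$\mathbb{P}_{P_0}\big(\exists\, T \ge 1 : S_T \ge 1/\alpha\big) \le \alpha,$$
so the rule “reject $H_0$ when $S_T \ge 1/\alpha$” is $\alpha$-safe under optional stopping (Pandeva et al., 2022).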
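The stopping rule itself is short. The sketch below assumes the e-values are supplied in arrival order by some upstream procedure; the function name and return format are illustrative.

```python
def e_process_test(e_values, alpha=0.05):
    """Multiply e-values in arrival order; stop at the first crossing of 1/alpha.

    Returns (rejected, steps_used, final_product).
    """
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        if e < 0:
            raise ValueError("e-values must be nonnegative")
        wealth *= e
        if wealth >= 1.0 / alpha:
            return True, t, wealth
    return False, len(e_values), wealth
```

With `alpha = 0.05` the threshold is 20, so five consecutive e-values of 2 (product 32) trigger rejection at the fifth step, whereas e-values that hover around 1 never do.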
E-C2ST specializes this construction to binary source labels. Under the independence formulation $H_0: X \perp\!\!\!\perp Y$ versus $H_1: X \not\perp\!\!\!\perp Y$, the null model is Bernoulli with unknown class prior and the alternative is Bernoulli with neural-network logit $g_\theta$:
$$p_\theta(y \mid x) = \sigma(g_\theta(x))^{y}\,\big(1 - \sigma(g_\theta(x))\big)^{1-y}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$
After partitioning the data into batches $B_1, \dots, B_K$, batch $k$ uses a classifier with parameters $\theta_{k-1}$ trained on previous batches and the current-batch null maximum-likelihood estimate $\hat\pi_k$, the fraction of label-1 points in $B_k$. The batch e-factor is the per-batch likelihood ratio
$$E_k = \prod_{(x, y) \in B_k} \frac{p_{\theta_{k-1}}(y \mid x)}{\hat\pi_k^{\,y}\,(1 - \hat\pi_k)^{1-y}},$$
and the cumulative evidence is $S_k = \prod_{j \le k} E_j$. A bounded mixture alternative ensures bounded log-e-values and retains e-variable validity via convexity. The training protocol rotates batches through training and validation roles, updates $\theta$ by maximizing previous-batch log-evidence, and stops as soon as $S_k \ge 1/\alpha$ (Pandeva et al., 2022).
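The prequential loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: a linear logit trained by SGD replaces the neural network, probability clipping is swapped in for the paper's bounded mixture alternative, each batch is scored before it is used for training (with no rotation through validation roles), and `make_batches` is a hypothetical toy-data helper.

```python
import math
import random

def predict_proba(w, b, x, eps=0.05):
    """Clipped P(y=1 | x) from a linear logit; clipping bounds each log e-factor."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))
    p = 1.0 / (1.0 + math.exp(-z))
    return min(1.0 - eps, max(eps, p))

def null_log_lik(labels):
    """Bernoulli log-likelihood at the batch MLE of the class prior (0*log 0 = 0)."""
    n, k = len(labels), sum(labels)
    ll = 0.0
    if k:
        ll += k * math.log(k / n)
    if n - k:
        ll += (n - k) * math.log((n - k) / n)
    return ll

def sgd_update(w, b, batch, lr=0.1, epochs=20):
    """A few passes of logistic-regression SGD on one batch."""
    for _ in range(epochs):
        for x, y in batch:
            g = predict_proba(w, b, x, eps=0.0) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def e_c2st(batches, alpha=0.05):
    """Return (rejected, batches_used, log of cumulative evidence)."""
    dim = len(batches[0][0][0])
    w, b = [0.0] * dim, 0.0
    log_wealth = 0.0
    for k, batch in enumerate(batches, start=1):
        # Score the new batch with the model fitted on earlier batches only.
        alt_ll = 0.0
        for x, y in batch:
            p1 = predict_proba(w, b, x)
            alt_ll += math.log(p1 if y == 1 else 1.0 - p1)
        log_wealth += alt_ll - null_log_lik([y for _, y in batch])
        if log_wealth >= math.log(1.0 / alpha):
            return True, k, log_wealth
        w, b = sgd_update(w, b, batch)  # only now may the batch train the model
    return False, len(batches), log_wealth

def make_batches(shift, n_batches=10, batch_size=40, seed=0):
    """Hypothetical toy data: label 1 ~ N(+shift, 1), label 0 ~ N(-shift, 1)."""
    rng = random.Random(seed)
    batches = []
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            y = rng.randint(0, 1)
            batch.append(([rng.gauss(shift if y == 1 else -shift, 1.0)], y))
        batches.append(batch)
    return batches
```

The first batch is scored by an untrained model that predicts 1/2 everywhere, so its e-factor is at most one; evidence can only accumulate from the second batch onward. Working in log space avoids overflow when many batches are multiplied.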
This construction changes the role of sample splitting. Rather than a single train/test split, data are consumed prequentially, and every new batch can both improve the predictor and contribute multiplicative evidence. The paper reports that this multiple-batch strategy increases power while keeping Type I error well below the desired significance level, thereby addressing both data inefficiency and optional-stopping failure modes of standard C2ST (Pandeva et al., 2022).
4. Continuous witnesses, probability statistics, conformalization, and label efficiency
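A common misconception is that classifier two-sample tests are fundamentally tests of held-out accuracy. Several later developments instead treat the classifier output as a continuous witness. In the AutoML two-sample test, one trains a witness $h$ by minimizing weighted squared loss
$$\hat L(h) = \sum_{i} w_{y_i}\,\big(h(x_i) - y_i\big)^2,$$
where $y_i \in \{0, 1\}$ marks the source sample and $w_0, w_1$ are class weights, and then tests using the mean discrepancy of $h$ or of the centered score $h - 1/2$. Any minimizer $h^\star$ of this squared loss maximizes the signal-to-noise ratio of the mean-discrepancy statistic, and in the balanced case
$$h^\star(x) = \frac{p(x)}{p(x) + q(x)},$$
with $p$ and $q$ the densities of $P$ and $Q$.
Plugging $h^\star$ into the population discrepancy yields, up to a constant factor, the triangular discrimination $\int (p - q)^2 / (p + q)$. The same paper shows that, on a balanced held-out set with $n$ points per sample, accuracy is just the mean discrepancy of a binary witness,
$$\mathrm{acc} = \frac{1}{2} + \frac{1}{2}\Big(\frac{1}{n}\sum_{i=1}^{n} h_{\mathrm{bin}}(x_i^{P}) - \frac{1}{n}\sum_{j=1}^{n} h_{\mathrm{bin}}(x_j^{Q})\Big), \qquad h_{\mathrm{bin}}(x) = \mathbb{1}[h(x) > 1/2],$$
so replacing binary outputs with continuous scores reduces variance and improves power (Kübler et al., 2022).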
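The relation between accuracy and a thresholded witness can be verified numerically. The sketch assumes a balanced held-out set and a hypothetical score function; the identity it checks, accuracy equals one half plus half the mean discrepancy of the thresholded score, follows from counting correct predictions per class.

```python
import math

def mean_discrepancy(witness, xs_p, xs_q):
    """Mean of the witness on the P-sample minus its mean on the Q-sample."""
    return sum(map(witness, xs_p)) / len(xs_p) - sum(map(witness, xs_q)) / len(xs_q)

def balanced_accuracy(score, xs_p, xs_q, threshold=0.5):
    """Accuracy of 'predict P when score > threshold' on equal-sized samples."""
    if len(xs_p) != len(xs_q):
        raise ValueError("the identity requires a balanced held-out set")
    correct = sum(score(x) > threshold for x in xs_p)
    correct += sum(score(x) <= threshold for x in xs_q)
    return correct / (len(xs_p) + len(xs_q))

def logistic_score(x):
    """Hypothetical continuous witness: a fixed logistic score."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_witness(x):
    """Thresholded version of the same score."""
    return 1.0 if logistic_score(x) > 0.5 else 0.0

xs_p = [0.2, 1.5, -0.3, 2.0]   # toy held-out points labelled as P
xs_q = [-1.0, 0.4, -2.0, -0.1]  # toy held-out points labelled as Q

acc = balanced_accuracy(logistic_score, xs_p, xs_q)
tau_bin = mean_discrepancy(binary_witness, xs_p, xs_q)
tau_cont = mean_discrepancy(logistic_score, xs_p, xs_q)
```

On these toy points three of four predictions are correct in each class, so `acc` is 0.75 and `tau_bin` is 0.5. The continuous discrepancy `tau_cont` uses the same scores without discarding their magnitudes.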
A closely related probability-based construction uses log-odds directly. If $\hat\eta(x)$ estimates the class probability, then a statistic built from the held-out log-odds $\log\{\hat\eta(x) / (1 - \hat\eta(x))\}$ approximates a likelihood-ratio statistic, while a second statistic tests whether the classification probability is constant in $x$. Under uniform consistency of $\hat\eta$, the test built on $\hat\eta$ is asymptotically most powerful in the sense stated in the paper (Cai et al., 2019).
Conformal variants shift the emphasis from probability calibration to score ranking. In the conformal C2ST for neural posterior validation, a classifier score $s(x)$ is converted into per-point conformal p-values by ranking each test score among calibration scores from a reference sample. Under exchangeability, these p-values are exactly uniform on $[0, 1]$ under $H_0$, so any one-sample uniformity test yields exact finite-sample Type-I control. The paper further proves that power degrades gently with score error: if the density-ratio estimate has error $\epsilon$ in the norm used there, then the expected conformal p-values differ from the oracle ones by an amount controlled by $\epsilon$ (Bansal et al., 22 Jul 2025).
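The ranking step can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the paper's procedure: it uses the standard randomized (tie-broken) conformal p-value, treats larger scores as more extreme, and pairs it with a Kolmogorov-Smirnov distance to the uniform law as one possible uniformity statistic. All names are hypothetical.

```python
import random

def conformal_p_values(test_scores, calib_scores, rng=None):
    """Randomized conformal p-value of each test score among calibration scores.

    p = (#{calib > s} + U * (#{calib == s} + 1)) / (n_calib + 1), U ~ Uniform(0, 1).
    """
    rng = rng or random.Random(0)
    n = len(calib_scores)
    p_values = []
    for s in test_scores:
        greater = sum(c > s for c in calib_scores)
        ties = sum(c == s for c in calib_scores)
        p_values.append((greater + rng.random() * (ties + 1)) / (n + 1))
    return p_values

def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the p-values and Uniform(0, 1)."""
    ps = sorted(p_values)
    n = len(ps)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ps))
```

A test score larger than every calibration score receives a p-value of at most `1 / (n_calib + 1)`, so the attainable resolution is set by the size of the calibration set.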
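Label-costly settings admit a further extension. The label-efficient framework begins with a uniformly labeled seed set, trains a probabilistic classifier $\hat p(y \mid x)$, then uses bimodal querying to request labels for items with the largest $\hat p(y = 1 \mid x)$ and the largest $\hat p(y = 0 \mid x)$. The final test can be a batch permutation C2ST or a sequential likelihood-ratio-style statistic
$$W_t = \prod_{i=1}^{t} \frac{\hat p(y_i \mid x_i)}{\hat\pi(y_i)},$$
where $\hat\pi$ is the estimated class prior, with anytime-valid guarantee $\mathbb{P}_{H_0}(\exists\, t : W_t \ge 1/\alpha) \le \alpha$ (Li et al., 7 Jan 2025).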
5. Relations to kernels, IPMs, learned representations, and interpretable variants
Classifier tests are not isolated from kernel and IPM methodology; several papers make the equivalence explicit. In "Learning Deep Kernels for Non-Parametric Two-Sample Tests" (Liu et al., 2020), accuracy-based C2ST is shown to be exactly an MMD test with a sign kernel,
$$k_f^{(S)}(x, x') = \tfrac{1}{4}\,\mathbb{1}[f(x) > 0]\,\mathbb{1}[f(x') > 0],$$
while logit-mean C2ST is MMD with the linear kernel
$$k_f^{(L)}(x, x') = f(x)\, f(x'),$$
where $f$ denotes the classifier logit.
The same paper argues that deep-kernel MMD strictly generalizes C2ST by learning a spatially non-homogeneous kernel $k_\omega$ and optimizing the power proxy $\hat J = \widehat{\mathrm{MMD}}^2 / \hat\sigma_{H_1}$ directly rather than cross-entropy.
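The linear-kernel case is easy to check by hand. The sketch below assumes the biased (V-statistic) MMD estimate and a hypothetical scalar logit function; with the product kernel on logits, the squared MMD collapses to the squared difference of mean logits.

```python
def mmd2_biased(xs, ys, kernel):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

def product_kernel(f):
    """Linear kernel on a scalar feature: k(a, b) = f(a) * f(b)."""
    return lambda a, b: f(a) * f(b)

def toy_logit(x):
    """Hypothetical stand-in for a trained classifier's logit."""
    return 2.0 * x - 1.0

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0]
mmd2 = mmd2_biased(xs, ys, product_kernel(toy_logit))
mean_gap = sum(map(toy_logit, xs)) / len(xs) - sum(map(toy_logit, ys)) / len(ys)
```

Here the mean logits are 1 and 3, so `mean_gap` is -2 and `mmd2` is 4. A richer kernel replaces `product_kernel` while the estimator stays the same, which is the sense in which kernel tests extend the logit-mean statistic.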
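An IPM formulation is developed for data supported on a low-dimensional manifold. There the test statistic is
$$\sup_{f \in \mathcal{F}} \Big| \frac{1}{n_P} \sum_{i=1}^{n_P} f(x_i^{P}) - \frac{1}{n_Q} \sum_{j=1}^{n_Q} f(x_j^{Q}) \Big|,$$
with $\mathcal{F}$ either a Hölder class or a ReLU network class approximating it. The resulting neural-network IPM test attains type-II risk of order $n^{-(s+\beta)/d}$, where $s$ and $\beta$ are the Hölder smoothness indices of the setting, and the performance depends on intrinsic dimension $d$ rather than ambient dimension $D$ (Wang et al., 2022).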
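A distinct branch uses learned deep representations but abandons the classifier statistic itself. "Two-sample Testing Using Deep Learning" (Kirchler et al., 2019) trains a feature map $\phi$ on auxiliary supervised or unsupervised tasks and then applies asymptotic location tests on hidden-layer means: a mean-difference statistic based on $\lVert \bar\phi_P - \bar\phi_Q \rVert^2$ and the covariance-normalized statistic
$$\frac{n_P\, n_Q}{n_P + n_Q}\,(\bar\phi_P - \bar\phi_Q)^\top \hat\Sigma^{-1} (\bar\phi_P - \bar\phi_Q),$$
where $\bar\phi_P$ and $\bar\phi_Q$ are the sample means of $\phi$ on the two samples and $\hat\Sigma$ is a pooled covariance estimate.
These statistics are linear-time in sample size and asymptotically control the Type I error rate.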
Interpretability has motivated further alternatives. The self-organizing-map two-sample test trains a SOM on pooled unlabeled data, projects both samples to the grid, and compares per-neuron hit histograms by
8
Because the map supports hit histograms, component planes, and U-matrices, it can reveal where and how $P$ and $Q$ differ rather than merely rejecting equality (Álvarez-Ayllón et al., 2022).
The same reduction-to-two-samples idea has also been used for evaluating black-box multiclass classifiers. One samples 01, trains a distinguisher between 02 and 03, and evaluates separability by a rank-sum or AUC statistic under cross-fitting and stability conditions (Chen et al., 7 Apr 2026).
6. Empirical behavior, training dynamics, and limitations
Across the literature, a stable empirical pattern is that richer continuous statistics usually outperform hard accuracy, provided validity is preserved. For E-C2ST, the reported results are explicit: on Blob, the method reaches maximum power with fewer samples while maintaining type I error strictly below 04; on KDEF, it achieves 05 fastest and keeps type I error lower than baselines; and on Corrupted MNIST, it exhibits superior power across corruption levels while maintaining type I error below 06 in the 07 case (Pandeva et al., 2022). AutoML witness tests likewise show that continuous witness statistics outperform binary-output variants; on distribution-shift benchmarks, AutoML (bin) consistently underperforms, while continuous-witness AutoML tests outperform MMDAgg and MMD-D in most regimes except the very smallest 08 (Kübler et al., 2022).
Deep-kernel and deep-representation results reinforce the same conclusion in different form. Deep kernels generally outperform C2ST variants on Blob, HDGM, Higgs, MNIST, and CIFAR-10.1, especially when differences are subtle, local, or highly structured (Liu et al., 2020). Deep representation tests on audio, images, and MRI report decreases in type-II error rate of up to 35 percentage points relative to kernel methods and classifier two-sample tests (Kirchler et al., 2019).
Training dynamics have also become an object of theory. The NTK analysis of neural network C2ST derives a theoretical minimum training time needed to detect a deviation-level and a theoretical maximum training time before the NTK test detects that deviation-level. In the resulting small-time regime, the times needed to detect the same deviation-level in the null and alternative scenarios are well-separated, which justifies early-stopping strategies rather than training to convergence (Khurana et al., 2024).
The principal limitations remain consistent across papers. Standard one-shot C2ST wastes data through a single split and is not anytime valid. All classifier-based procedures are sensitive to representation quality, overfitting, and leakage between training and evaluation. Batchwise class imbalance must be handled through the null class prior rather than by fixing $\pi = 1/2$ without justification. Very small batches can destabilize prequential training. Permutation calibration, cross-fitting, or conformalization usually improves validity, but at additional computational cost. Finally, in simple, well-specified low-dimensional settings, classical parametric tests or carefully tuned kernel tests may remain preferable; neural classifier tests are primarily designed for complex, high-dimensional regimes in which representation learning or adaptive scoring is a substantive advantage (Pandeva et al., 2022, Kim et al., 2016, Kübler et al., 2022).
In contemporary usage, the term therefore denotes not a single test but a family of related procedures: hard-accuracy C2ST, logit and probability tests, continuous witness tests, e-value and sequential tests, conformal score-rank tests, and kernel or IPM formulations that reinterpret the classifier as a learned witness. Their unifying principle is the same: two-sample inference is reduced to source discrimination, and the statistical problem becomes one of turning predictive signal into valid evidence for $P \neq Q$.