
Two-Sample Neural Classifier Test

Updated 4 July 2026
  • The test recasts two-sample hypothesis testing as a supervised classification problem that estimates likelihood ratios via neural network outputs.
  • It employs strategies like data splitting, permutation calibration, and sequential e-values to maintain valid inference and optimize sample usage.
  • Continuous witness functions and deep kernel variants enhance performance in high-dimensional settings compared to traditional tests.

A Two-Sample Neural Classifier Test is a two-sample procedure that recasts the hypothesis $H_0: P = Q$ versus $H_1: P \neq Q$ as a supervised prediction problem on pooled, artificially labeled data. Given samples $X_1,\dots,X_n \sim P$ and $Y_1,\dots,Y_m \sim Q$, one pools the observations, assigns a binary label indicating source membership, and trains a neural network or other probabilistic classifier to predict that label from the observation. In the independence formulation, where $X$ denotes a pooled observation and $Y$ its source label, the pooled feature-label pair satisfies $X \perp\!\!\!\perp Y$ under $H_0$; under $H_1$, the label is predictable from the feature, so $X \not\perp\!\!\!\perp Y$. In this framework, held-out accuracy, logit or probability scores, witness mean discrepancies, likelihood-ratio-style statistics, and e-values become test statistics, while calibration may proceed through exact binomial laws, permutation, asymptotic approximations, or e-processes with optional stopping guarantees (Lopez-Paz et al., 2016, Pandeva et al., 2022).

1. Formal setup and decision-theoretic interpretation

The formal two-sample problem is to test

$$H_0: P = Q \qquad \text{versus} \qquad H_1: P \neq Q.$$

The classifier formulation pools the two samples and assigns labels $L=1$ to points from $P$ and $L=0$ to points from $Q$. The resulting regression function is

$$\eta(x) = \Pr(L = 1 \mid X = x) = \frac{\pi\, p(x)}{\pi\, p(x) + (1-\pi)\, q(x)},$$

where $\pi = \Pr(L = 1)$, $p$ is the density of $P$, and $q$ is the density of $Q$. This yields the odds-likelihood identity

$$\frac{\eta(x)}{1-\eta(x)} = \frac{\pi}{1-\pi}\cdot\frac{p(x)}{q(x)}.$$

Accordingly, a probabilistic classifier estimating $\eta$ is estimating a surrogate for the likelihood ratio. In the balanced case, the Bayes-optimal score is

$$\eta^*(x) = \frac{p(x)}{p(x)+q(x)}.$$

These identities make classifier-based testing simultaneously a density-ratio, independence-testing, and discriminative inference procedure (Cai et al., 2019, Lopez-Paz et al., 2016).
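The odds-likelihood identity can be checked numerically. The sketch below is illustrative only: it writes `eta` for the conditional probability that a pooled point carries label 1 and `prior` for the share of pooled points drawn from $P$ (both names are choices made here), and it assumes a hypothetical toy pair $P = N(0,1)$, $Q = N(1,1)$ with known densities in place of a trained classifier.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_label_prob(x, prior, p, q):
    """eta(x) = P(L=1 | x) when L=1 marks draws from P and L=0 draws from Q."""
    return prior * p(x) / (prior * p(x) + (1.0 - prior) * q(x))

def density_ratio_from_prob(eta, prior):
    """Invert the odds identity: p(x)/q(x) = [eta/(1-eta)] * [(1-prior)/prior]."""
    return (eta / (1.0 - eta)) * ((1.0 - prior) / prior)

# Hypothetical toy pair (an assumption of this sketch): P = N(0, 1), Q = N(1, 1),
# with 30% of the pooled points drawn from P.
p = lambda x: normal_pdf(x, 0.0)
q = lambda x: normal_pdf(x, 1.0)
prior = 0.3

for x in (-1.0, 0.5, 2.0):
    eta = posterior_label_prob(x, prior, p, q)
    # The two printed ratios should agree up to floating-point error.
    print(x, density_ratio_from_prob(eta, prior), p(x) / q(x))
```

In practice `eta` would come from a fitted classifier, so the recovered ratio is only as accurate as the classifier's probability estimates.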

The same perspective underlies witness-function formulations. If $f: \mathcal{X} \to \mathbb{R}$ is a measurable witness, then the population discrepancy is

$$\tau(f) = \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)],$$

with empirical counterpart

$$\hat\tau(f) = \frac{1}{n}\sum_{i=1}^{n} f(X_i) - \frac{1}{m}\sum_{j=1}^{m} f(Y_j).$$

Under $H_0$, $\tau(f) = 0$. This places neural classifier tests in the broader family of mean-discrepancy and integral-probability-metric procedures, but with the witness learned from data rather than fixed a priori (Kübler et al., 2022).

2. Classical classifier two-sample tests

The standard classifier two-sample test, often abbreviated C2ST, splits the pooled labeled dataset into training and test subsets, trains a classifier on the training portion, and computes a test statistic on the untouched test portion. The paper "Revisiting Classifier Two-Sample Tests" (Lopez-Paz et al., 2016) focuses on hard accuracy, but the literature also uses soft probabilities, logits, cross-entropy, and AUC. For balanced classes and a held-out test set $D_{\mathrm{te}} = \{(z_i, l_i)\}_{i=1}^{n_{\mathrm{te}}}$, the canonical hard-accuracy statistic is

$$\hat t = \frac{1}{n_{\mathrm{te}}} \sum_{i=1}^{n_{\mathrm{te}}} \mathbf{1}\{\hat l_i = l_i\},$$

with predicted label $\hat l_i = \mathbf{1}\{\hat\eta(z_i) > 1/2\}$ under equal priors, where $\hat\eta(z)$ is the classifier's predicted probability that $z$ came from $P$.

Under $H_0$ and balanced classes, $n_{\mathrm{te}}\,\hat t \sim \mathrm{Binomial}(n_{\mathrm{te}}, 1/2)$, so the null is exact and simple. More generally, permutation calibration is standard: labels on the held-out set are shuffled, the statistic is recomputed, and the p-value is the fraction of permuted statistics at least as large as the observed one. Kim, Ramdas, Singh, and Wasserman showed that this permutation-based approach is finite-sample valid for either half-permutation or full-permutation procedures, and that a Gaussian approximation based on held-out class-wise errors is asymptotically valid for any classifier satisfying their stability assumptions (Kim et al., 2016).
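The split, train, and calibrate steps can be sketched end to end. This is a minimal illustration under stated assumptions, not the procedure of any cited paper: the neural network is replaced by a one-dimensional midpoint-threshold rule (`fit_midpoint_classifier`, a hypothetical stand-in), the data are a toy Gaussian pair, and the permutation p-value uses the common "+1" correction. The binomial p-value treats the held-out labels as fair coin flips, which is only approximate when a random split leaves the test set slightly unbalanced.

```python
import math
import random

def binomial_upper_tail(k, n, prob=0.5):
    """Exact P(Binomial(n, prob) >= k): one-sided p-value for k correct out of n."""
    return sum(math.comb(n, j) * prob**j * (1.0 - prob)**(n - j)
               for j in range(k, n + 1))

def accuracy(predict, points, labels):
    """Fraction of held-out points whose predicted label matches the given label."""
    return sum(predict(z) == l for z, l in zip(points, labels)) / len(points)

def permutation_p_value(predict, points, labels, n_perm=500, seed=0):
    """Shuffle held-out labels, recompute accuracy, count values >= the observed one."""
    rng = random.Random(seed)
    observed = accuracy(predict, points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += accuracy(predict, points, shuffled) >= observed
    return (1 + hits) / (1 + n_perm)

def fit_midpoint_classifier(points, labels):
    """Stand-in for a neural network: threshold at the midpoint of the class means."""
    ones = [z for z, l in zip(points, labels) if l == 1]
    zeros = [z for z, l in zip(points, labels) if l == 0]
    mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    mid = 0.5 * (mean1 + mean0)
    sign = 1.0 if mean1 >= mean0 else -1.0
    return lambda z: 1 if sign * (z - mid) > 0 else 0

# Toy data (assumed): P = N(0, 1) with label 1, Q = N(1, 1) with label 0.
rng = random.Random(1)
pooled = [(rng.gauss(0.0, 1.0), 1) for _ in range(200)]
pooled += [(rng.gauss(1.0, 1.0), 0) for _ in range(200)]
rng.shuffle(pooled)
train, test = pooled[:200], pooled[200:]   # strict train/test split

predict = fit_midpoint_classifier(*zip(*train))
test_points, test_labels = zip(*test)
acc = accuracy(predict, test_points, test_labels)
n_correct = round(acc * len(test_points))
print("held-out accuracy:", acc)
print("binomial p-value:", binomial_upper_tail(n_correct, len(test_points)))
print("permutation p-value:", permutation_p_value(predict, test_points, test_labels))
```

The classifier is fit once and only the held-out labels are permuted, which keeps the cost of calibration at `n_perm` evaluations rather than `n_perm` retrainings.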

Before later variants are introduced, three classical facts are central. First, the test must preserve a strict split between training and evaluation data; otherwise Type I error control is compromised. Second, the method is consistent if the classifier approaches Bayes risk, because the held-out accuracy then exceeds chance by a fixed effect size. Third, accuracy is interpretable: for the Bayes classifier,

$$\mathrm{acc}^* = \frac{1}{2} + \frac{1}{2}\,\mathrm{TV}(P, Q),$$

so accuracy above $1/2$ directly quantifies total variation distance (Lopez-Paz et al., 2016).
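As a numerical check of the link between Bayes accuracy and total variation, the following sketch integrates both quantities for a hypothetical balanced pair $P = N(0,1)$, $Q = N(\mu,1)$ (the distributions, the midpoint-rule integrator, and the integration limits are all assumptions of the example). For this pair the Bayes accuracy also has the closed form $\Phi(\mu/2)$, which gives a second point of comparison.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

mu = 1.0                                   # assumed mean shift between P and Q
p = lambda x: normal_pdf(x, 0.0)
q = lambda x: normal_pdf(x, mu)

# TV(P, Q) = 0.5 * integral of |p - q|.
tv = 0.5 * integrate(lambda x: abs(p(x) - q(x)), -10.0, 11.0)
# Balanced Bayes accuracy = 0.5 * integral of max(p, q).
bayes_acc = 0.5 * integrate(lambda x: max(p(x), q(x)), -10.0, 11.0)
# Closed form for this Gaussian pair: Phi(mu / 2).
closed_form = 0.5 * (1.0 + math.erf(mu / (2.0 * math.sqrt(2.0))))

print("TV:", tv)
print("Bayes accuracy:", bayes_acc, "vs 1/2 + TV/2:", 0.5 + 0.5 * tv)
print("closed form:", closed_form)
```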

| Statistic | Form | Typical calibration |
| --- | --- | --- |
| Hard accuracy | $\hat t$ on held-out labels | Exact binomial under balanced classes |
| Soft score | Mean predicted probability of the true label | Permutation |
| Logit mean difference | Mean held-out logit on points from $P$ minus mean held-out logit on points from $Q$ | Permutation |
| AUC | ROC-area from held-out scores | Permutation |

A recurrent limitation of this classical form is that a single split wastes data: only the test split contributes directly to the final statistic, while the training split is “used up” in learning. Standard p-values are also not anytime valid, so repeated looks at accumulating data inflate Type I error unless additional machinery is introduced (Pandeva et al., 2022).

3. E-values, sequential validity, and E-C2ST

The E-value Classifier Two-Sample Test (E-C2ST) replaces the one-shot p-value logic of classical C2ST with batchwise e-factors whose product forms an e-process. A conditional e-variable with respect to a null class $\mathcal{H}_0$ is a nonnegative measurable function $E$ such that

$$\mathbb{E}_{\mathbb{P}}[E \mid Z = z] \le 1$$

for all $\mathbb{P} \in \mathcal{H}_0$ and all conditioning values $z$. If data arrive sequentially and $E_t$ is a conditional e-variable for the $t$-th batch given the past, then

$$E^{(T)} = \prod_{t=1}^{T} E_t$$

is a nonnegative supermartingale under $H_0$. Ville’s inequality yields

$$\mathbb{P}_{H_0}\bigl(\exists\, T \ge 1 : E^{(T)} \ge 1/\alpha\bigr) \le \alpha,$$

so the rule “reject $H_0$ when $E^{(T)} \ge 1/\alpha$” is $\alpha$-safe under optional stopping (Pandeva et al., 2022).

E-C2ST specializes this construction to binary source labels. Under the independence formulation $H_0: X \perp\!\!\!\perp Y$ versus $H_1: X \not\perp\!\!\!\perp Y$, with $X$ the pooled observation and $Y \in \{0,1\}$ its source label, the null model is Bernoulli with unknown class prior and the alternative is Bernoulli with neural-network logit $g_\theta(x)$:

$$p_\theta(y \mid x) = \sigma\bigl(g_\theta(x)\bigr)^{y}\,\bigl(1 - \sigma(g_\theta(x))\bigr)^{1-y}.$$

After partitioning the data into batches $D_1,\dots,D_B$, batch $b$ uses a classifier with parameters $\theta_b$ trained on previous batches and the current-batch null maximum-likelihood estimate $\hat\pi_b$. The batch e-factor is the per-batch likelihood ratio

$$E_b = \prod_{(x,y) \in D_b} \frac{p_{\theta_b}(y \mid x)}{\hat\pi_b^{\,y}\,(1-\hat\pi_b)^{1-y}},$$

and the cumulative evidence is $E^{(b)} = \prod_{k \le b} E_k$. A bounded mixture alternative ensures bounded log-e-values and retains e-variable validity via convexity. The training protocol rotates batches through training and validation roles, updates $\theta$ by maximizing previous-batch log-evidence, and stops as soon as $E^{(b)} \ge 1/\alpha$ (Pandeva et al., 2022).

This construction changes the role of sample splitting. Rather than a single train/test split, data are consumed prequentially, and every new batch can both improve the predictor and contribute multiplicative evidence. The paper reports that this multiple-batch strategy increases power while keeping Type I error well below the desired significance level, thereby addressing both data inefficiency and optional-stopping failure modes of standard C2ST (Pandeva et al., 2022).
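The prequential idea can be illustrated with a simplified sketch. It is not the E-C2ST training protocol: the neural network is replaced by a hypothetical one-dimensional Gaussian plug-in predictor (`fit_gaussian_plugin`), predicted probabilities are clipped instead of using the paper's bounded mixture, and there is no rotation of batches between training and validation roles. What it does keep is the structure described above: each batch is scored by a predictor fit only on earlier batches, divided by the null likelihood at the current-batch class-prior MLE, and the running product is compared with $1/\alpha$.

```python
import math
import random

def fit_gaussian_plugin(points, labels, eps=0.05):
    """Stand-in for the network: unit-variance Gaussian plug-in estimate of P(y=1 | x)."""
    ones = [x for x, y in zip(points, labels) if y == 1]
    zeros = [x for x, y in zip(points, labels) if y == 0]
    if not ones or not zeros:
        return lambda x: 0.5                      # no information yet
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    prior = len(ones) / len(points)
    def predict(x):
        logit = -0.5 * (x - m1) ** 2 + 0.5 * (x - m0) ** 2 + math.log(prior / (1.0 - prior))
        logit = max(-30.0, min(30.0, logit))      # avoid overflow in exp
        prob = 1.0 / (1.0 + math.exp(-logit))
        return min(max(prob, eps), 1.0 - eps)     # clipping keeps log-e-values bounded
    return predict

def batch_log_e_factor(predict, points, labels):
    """Log likelihood ratio: predictor vs. Bernoulli null at the batch MLE of the prior."""
    n, k = len(labels), sum(labels)
    log_alt = sum(math.log(predict(x) if y == 1 else 1.0 - predict(x))
                  for x, y in zip(points, labels))
    log_null = 0.0                                # convention 0 * log(0) = 0
    if 0 < k < n:
        pi_hat = k / n
        log_null = k * math.log(pi_hat) + (n - k) * math.log(1.0 - pi_hat)
    return log_alt - log_null

def sequential_test(batches, alpha=0.05):
    """Return (rejected, batches_used, log_evidence); stop once evidence >= 1/alpha."""
    log_e, seen_x, seen_y = 0.0, [], []
    for b, (points, labels) in enumerate(batches, start=1):
        predict = fit_gaussian_plugin(seen_x, seen_y)   # trained on past batches only
        log_e += batch_log_e_factor(predict, points, labels)
        if log_e >= math.log(1.0 / alpha):
            return True, b, log_e
        seen_x += list(points)
        seen_y += list(labels)
    return False, len(batches), log_e

rng = random.Random(0)
def make_batch(shift, size=40):
    """Toy batch (assumed): label-1 points ~ N(0, 1), label-0 points ~ N(shift, 1)."""
    labels = [rng.randint(0, 1) for _ in range(size)]
    points = [rng.gauss(0.0 if y == 1 else shift, 1.0) for y in labels]
    return points, labels

print("alternative:", sequential_test([make_batch(1.5) for _ in range(20)]))
print("null:       ", sequential_test([make_batch(0.0) for _ in range(20)]))
```

Because the batch MLE maximizes the null likelihood, dividing by it can only shrink each factor relative to the true null, so each factor has null expectation at most 1 whatever the true class prior is.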

4. Continuous witnesses, probability statistics, conformalization, and label efficiency

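A common misconception is that classifier two-sample tests are fundamentally tests of held-out accuracy. Several later developments instead treat the classifier output as a continuous witness. In the AutoML two-sample test, one trains a witness $h$ by minimizing a weighted squared loss

$$\hat{\mathcal{L}}(h) = \frac{w_P}{n}\sum_{i=1}^{n}\bigl(1 - h(X_i)\bigr)^2 + \frac{w_Q}{m}\sum_{j=1}^{m} h(Y_j)^2,$$

and then tests using the mean discrepancy of $h$ or of the centered score $h - 1/2$. Any minimizer $h^*$ of this squared loss maximizes the signal-to-noise ratio of the mean discrepancy, and in the balanced case

$$h^*(x) = \frac{p(x)}{p(x) + q(x)},$$

where $p$ and $q$ are the densities of $P$ and $Q$. Plugging $h^*$ into the population discrepancy yields, up to a factor of $1/2$, the triangular discrimination $\int (p-q)^2/(p+q)$. The same paper shows that, on a balanced held-out set, accuracy is just the mean discrepancy of a binary witness,

$$\mathrm{acc} = \frac{1}{2} + \frac{1}{2}\,\hat\tau(h_{\mathrm{bin}}), \qquad h_{\mathrm{bin}}(x) = \mathbf{1}\{h(x) > 1/2\},$$

where $\hat\tau(f)$ denotes the difference between the held-out sample means of $f$ on the two samples, so replacing binary outputs with continuous scores reduces variance and improves power (Kübler et al., 2022).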
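The relation between held-out accuracy and the mean discrepancy of a thresholded witness is easy to verify on a small example. In the sketch below the fitted score is a hypothetical logistic function and the two held-out samples are made-up numbers; `xs` stands for held-out points from $P$ (label 1) and `ys` for held-out points from $Q$ (label 0), with equal sample sizes so that balanced and plain accuracy coincide.

```python
import math

def mean_discrepancy(witness, xs, ys):
    """Empirical witness discrepancy: mean of witness on xs minus mean on ys."""
    return (sum(witness(x) for x in xs) / len(xs)
            - sum(witness(y) for y in ys) / len(ys))

def balanced_accuracy(score, xs, ys):
    """Average of per-class accuracies when xs carry label 1 and ys label 0."""
    hit_x = sum(score(x) > 0.5 for x in xs) / len(xs)
    hit_y = sum(score(y) <= 0.5 for y in ys) / len(ys)
    return 0.5 * (hit_x + hit_y)

# Hypothetical fitted score (an assumption): estimated probability of label 1.
score = lambda z: 1.0 / (1.0 + math.exp(-(1.0 - 2.0 * z)))
binary_witness = lambda z: 1.0 if score(z) > 0.5 else 0.0

xs = [-0.8, -0.1, 0.3, 0.9, 0.2, -1.4]   # made-up held-out points from P
ys = [0.4, 1.1, 0.7, 1.9, -0.2, 1.3]     # made-up held-out points from Q

acc = balanced_accuracy(score, xs, ys)
tau_bin = mean_discrepancy(binary_witness, xs, ys)
tau_cont = mean_discrepancy(score, xs, ys)
print("accuracy:", acc, "and 1/2 + tau_bin/2:", 0.5 + 0.5 * tau_bin)
print("continuous-witness discrepancy:", tau_cont)
```

The continuous discrepancy uses the magnitude of each score rather than only its side of the threshold, which is the source of the variance reduction described above.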

A closely related probability-based construction uses log-odds directly. If $\hat\eta$ estimates the class probability, then a statistic built from the held-out log-odds $\log\{\hat\eta(z)/(1-\hat\eta(z))\}$ approximates a likelihood-ratio statistic, while a second statistic, based on the deviation of $\hat\eta(z)$ from the class prior, tests whether the classification probability is constant. Under uniform consistency of $\hat\eta$, the likelihood-ratio-based test is asymptotically most powerful in the sense stated in the paper (Cai et al., 2019).

Conformal variants shift the emphasis from probability calibration to score ranking. In the conformal C2ST for neural posterior validation, a classifier score $s(x)$ is converted into per-point conformal p-values by ranking each test score among calibration scores from a reference sample. Under exchangeability, these p-values are exactly uniform under $H_0$, so any one-sample uniformity test yields exact finite-sample Type-I control. The paper further proves that power degrades gently with score error: the gap between the expected conformal p-values and their oracle counterparts is bounded in terms of the error of the density-ratio estimate (Bansal et al., 22 Jul 2025).
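The ranking step can be sketched as follows. This is a generic split-conformal construction, not necessarily the exact variant in the cited paper: the p-value formula with the "+1" correction, the choice that larger scores count as more extreme, and the Kolmogorov-Smirnov distance used as the uniformity check are all assumptions of the example. The sketch also ignores the dependence that arises when many test points share one calibration set.

```python
import random

def conformal_p_value(test_score, calibration_scores):
    """Rank-based p-value: (1 + #{calibration score >= test score}) / (n_cal + 1)."""
    n_ge = sum(c >= test_score for c in calibration_scores)
    return (1 + n_ge) / (len(calibration_scores) + 1)

def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    ps = sorted(p_values)
    n = len(ps)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ps))

# Toy scores (assumed): calibration and null test scores share one distribution,
# while shifted test scores mimic a detectable difference.
rng = random.Random(0)
calibration = [rng.gauss(0.0, 1.0) for _ in range(500)]
null_scores = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted_scores = [rng.gauss(1.0, 1.0) for _ in range(200)]

null_p = [conformal_p_value(s, calibration) for s in null_scores]
shifted_p = [conformal_p_value(s, calibration) for s in shifted_scores]
print("KS distance under the null:", ks_uniform_statistic(null_p))
print("KS distance under a shift: ", ks_uniform_statistic(shifted_p))
```

Only the ordering of the scores matters here, so any monotone transformation of the classifier output gives the same p-values.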

Label-costly settings admit a further extension. The label-efficient framework begins with a uniformly labeled seed set, trains a probabilistic classifier $\hat\eta(x) \approx \Pr(L = 1 \mid x)$, then uses bimodal querying to request labels for items with the largest $\hat\eta(x)$ and the largest $1 - \hat\eta(x)$. The final test can be a batch permutation C2ST or a sequential likelihood-ratio-style statistic

$$W_t = \prod_{i=1}^{t} \frac{\hat\eta(x_i)^{l_i}\,\bigl(1-\hat\eta(x_i)\bigr)^{1-l_i}}{\pi^{l_i}\,(1-\pi)^{1-l_i}},$$

where $l_i$ is the queried label of $x_i$ and $\pi$ is the class prior, with anytime-valid guarantee $\mathbb{P}_{H_0}\bigl(\exists\, t : W_t \ge 1/\alpha\bigr) \le \alpha$ (Li et al., 7 Jan 2025).
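The bimodal querying step amounts to picking the unlabeled items the classifier is most confident about in each direction. A minimal sketch, assuming the scores are predicted probabilities of one class and that the label budget is split evenly between the two ends (the function name `bimodal_query` and the even split are choices made for this example):

```python
def bimodal_query(scores, budget):
    """Indices of items with the smallest and the largest predicted probabilities.

    Half of the budget (rounded down) goes to the lowest scores, i.e. the largest
    1 - score, and the remainder to the highest scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_low = budget // 2
    n_high = budget - n_low
    low = order[:n_low]
    high = order[len(order) - n_high:] if n_high else []
    return sorted(set(low) | set(high))

# Made-up predicted probabilities for five unlabeled items.
scores = [0.1, 0.9, 0.5, 0.2, 0.8]
print(bimodal_query(scores, 2))
print(bimodal_query(scores, 4))
```

Items near 0.5, where the classifier is least informative about source membership, are the last to be queried under this rule.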

5. Relations to kernels, IPMs, learned representations, and interpretable variants

Classifier tests are not isolated from kernel and IPM methodology; several papers make the equivalence explicit. In "Learning Deep Kernels for Non-Parametric Two-Sample Tests" (Liu et al., 2020), accuracy-based C2ST is shown to be exactly an MMD test with a sign kernel,

$$k_{\mathrm{sign}}(x, y) = \tfrac{1}{4}\,\mathbf{1}\{f(x) > 0\}\,\mathbf{1}\{f(y) > 0\},$$

while logit-mean C2ST is MMD with the linear kernel

$$k_{\mathrm{lin}}(x, y) = f(x)\,f(y),$$

where $f$ is the classifier logit. The same paper argues that deep-kernel MMD strictly generalizes C2ST by learning a spatially non-homogeneous kernel $k_\omega$ and optimizing the power proxy $\widehat{\mathrm{MMD}}^2 / \hat\sigma_{H_1}$ directly rather than cross-entropy.
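Both kernel correspondences can be confirmed numerically with a plug-in (V-statistic) MMD. In this sketch the logit `f` is a hypothetical affine function and the samples are made-up numbers; the factor of one quarter in the sign kernel is the normalization under which the MMD equals the excess of balanced accuracy over one half in absolute value.

```python
import math

def biased_mmd(kernel, xs, ys):
    """Plug-in (V-statistic) MMD between two samples for a given kernel."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return math.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

def sign_kernel(logit):
    """k(a, b) = (1/4) * 1[logit(a) > 0] * 1[logit(b) > 0]."""
    return lambda a, b: 0.25 * (logit(a) > 0) * (logit(b) > 0)

def linear_kernel(logit):
    """k(a, b) = logit(a) * logit(b)."""
    return lambda a, b: logit(a) * logit(b)

f = lambda z: 2.0 * z - 0.5              # hypothetical classifier logit
xs = [-0.5, 0.2, 1.0, 1.5]               # made-up held-out points, label 1
ys = [-1.0, -0.3, 0.1, 0.4]              # made-up held-out points, label 0

# Balanced accuracy of the rule "predict label 1 when f > 0".
acc = 0.5 * (sum(f(x) > 0 for x in xs) / len(xs)
             + sum(f(y) <= 0 for y in ys) / len(ys))
logit_gap = sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

print("sign-kernel MMD:", biased_mmd(sign_kernel(f), xs, ys), "|acc - 1/2|:", abs(acc - 0.5))
print("linear-kernel MMD:", biased_mmd(linear_kernel(f), xs, ys), "|logit gap|:", abs(logit_gap))
```

Both kernels have rank one, so the MMD collapses to the absolute difference of a single feature mean, which is why these tests are special cases of a richer learned kernel.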

An IPM formulation is developed for data supported on a low-dimensional manifold. There the test statistic is

$$\hat T = \sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} f(X_i) - \frac{1}{m}\sum_{j=1}^{m} f(Y_j) \right|,$$

with $\mathcal{F}$ either a Hölder class or a ReLU network class approximating it. The resulting neural-network IPM test attains a type-II risk whose rate depends on intrinsic dimension $d$ rather than ambient dimension $D$ (Wang et al., 2022).

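A distinct branch uses learned deep representations but abandons the classifier statistic itself. "Two-sample Testing Using Deep Learning" (Kirchler et al., 2019) trains a feature map $\phi$ on auxiliary supervised or unsupervised tasks and then applies asymptotic location tests on hidden-layer means: a squared mean-difference statistic $\frac{nm}{n+m}\,\lVert \bar\phi_X - \bar\phi_Y \rVert^2$ and its covariance-normalized counterpart

$$\frac{nm}{n+m}\,(\bar\phi_X - \bar\phi_Y)^\top \hat\Sigma^{-1} (\bar\phi_X - \bar\phi_Y),$$

where $\bar\phi_X$ and $\bar\phi_Y$ are the sample means of the features on the two samples and $\hat\Sigma$ is a pooled covariance estimate. These statistics are linear-time in sample size and asymptotically control the Type I error rate.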

Interpretability has motivated further alternatives. The self-organizing-map two-sample test trains a SOM on pooled unlabeled data, projects both samples to the grid, and compares per-neuron hit histograms with a histogram-comparison statistic computed over the map's neurons. Because the map supports hit histograms, component planes, and U-matrices, it can reveal where and how $P$ and $Q$ differ rather than merely rejecting equality (Álvarez-Ayllón et al., 2022).

The same reduction-to-two-samples idea has also been used for evaluating black-box multiclass classifiers. One samples synthetic labels $\tilde Y$ from the classifier's predictive distribution given $X$, trains a distinguisher between the real pairs $(X, Y)$ and the synthetic pairs $(X, \tilde Y)$, and evaluates separability by a rank-sum or AUC statistic under cross-fitting and stability conditions (Chen et al., 7 Apr 2026).

6. Empirical behavior, training dynamics, and limitations

Across the literature, a stable empirical pattern is that richer continuous statistics usually outperform hard accuracy, provided validity is preserved. For E-C2ST, the reported results are explicit: on Blob, the method reaches maximum power with fewer samples while maintaining type I error strictly below the significance level $\alpha$; on KDEF, it reaches maximum power fastest and keeps type I error lower than baselines; and on Corrupted MNIST, it exhibits superior power across corruption levels while maintaining type I error below $\alpha$ in the null case (Pandeva et al., 2022). AutoML witness tests likewise show that continuous witness statistics outperform binary-output variants; on distribution-shift benchmarks, AutoML (bin) consistently underperforms, while continuous-witness AutoML tests outperform MMDAgg and MMD-D in most regimes except the very smallest sample sizes (Kübler et al., 2022).

Deep-kernel and deep-representation results reinforce the same conclusion in different form. Deep kernels generally outperform C2ST variants on Blob, HDGM, Higgs, MNIST, and CIFAR-10.1, especially when differences are subtle, local, or highly structured (Liu et al., 2020). Deep representation tests on audio, images, and MRI report decreases in type-II error rate of up to 35 percentage points relative to kernel methods and classifier two-sample tests (Kirchler et al., 2019).

Training dynamics have also become an object of theory. The NTK analysis of neural network C2ST derives a theoretical minimum training time needed to detect a deviation-level and a theoretical maximum training time before the NTK test detects that deviation-level. In the resulting small-time regime, the times needed to detect the same deviation-level in the null and alternative scenarios are well-separated, which justifies early-stopping strategies rather than training to convergence (Khurana et al., 2024).

The principal limitations remain consistent across papers. Standard one-shot C2ST wastes data through a single split and is not anytime valid. All classifier-based procedures are sensitive to representation quality, overfitting, and leakage between training and evaluation. Batchwise class imbalance must be handled through the null class prior rather than by fixing $\pi = 1/2$ without justification. Very small batches can destabilize prequential training. Permutation calibration, cross-fitting, or conformalization usually improves validity, but at additional computational cost. Finally, in simple, well-specified low-dimensional settings, classical parametric tests or carefully tuned kernel tests may remain preferable; neural classifier tests are primarily designed for complex, high-dimensional regimes in which representation learning or adaptive scoring is a substantive advantage (Pandeva et al., 2022, Kim et al., 2016, Kübler et al., 2022).

In contemporary usage, the term therefore denotes not a single test but a family of related procedures: hard-accuracy C2ST, logit and probability tests, continuous witness tests, e-value and sequential tests, conformal score-rank tests, and kernel or IPM formulations that reinterpret the classifier as a learned witness. Their unifying principle is the same: two-sample inference is reduced to source discrimination, and the statistical problem becomes one of turning predictive signal into valid evidence against $H_0: P = Q$.
