Optimal subsample size for subsampling-based critical value estimation

Determine a principled method for choosing the subsample size n_B in the subsampling algorithm that generates null-distribution critical values for kernel-based quadratic distance (KBQD) two-sample and k-sample tests. The method should specify how n_B depends on sample size, dimensionality, and test settings, so that practitioners have clear, reproducible guidance.

Background

The paper computes critical values for kernel-based quadratic distance (KBQD) tests using nonparametric sampling algorithms (bootstrap, subsampling, and permutation) applied to the pooled sample under the null hypothesis.

In the subsampling approach, new samples are drawn without replacement and typically have a smaller size, n_B = b * n with b in (0,1]. The authors note that the choice of subsample size affects both computational cost and test performance, but that there is no established rule for selecting n_B; the existing literature approaches the question through criteria based on asymptotic considerations.
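To make the role of n_B concrete, the following is a minimal Python sketch of subsampling-based critical value estimation for a two-sample test. It is not the QuadratiK implementation. The Gaussian-kernel V-statistic used as the test statistic, the function names, the proportional split of each subsample into two groups, the reading of n as the pooled sample size, and the values b = 0.8 and 150 replications are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, h):
    # Pairwise Gaussian kernel values exp(-||a_i - b_j||^2 / (2 h^2)).
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h**2))


def quadratic_distance(x, y, h=1.0):
    # Stand-in statistic (an assumption, not the package's statistic):
    # V-statistic kernel distance between the two empirical distributions.
    return (gaussian_kernel(x, x, h).mean()
            + gaussian_kernel(y, y, h).mean()
            - 2.0 * gaussian_kernel(x, y, h).mean())


def subsampling_critical_value(x, y, b=0.8, n_rep=150, level=0.05,
                               h=1.0, seed=0):
    # Estimate a critical value by subsampling the pooled sample.
    if not 0.0 < b <= 1.0:
        raise ValueError("b must lie in (0, 1]")
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])           # pooling reflects the null hypothesis
    n = pooled.shape[0]                  # assumption: n is the pooled size
    n_b = int(np.floor(b * n))           # subsample size n_B = b * n
    if n_b < 2:
        raise ValueError("subsample too small to split into two groups")
    # Split each subsample in the same proportion as the original groups.
    n_b_x = min(max(1, int(round(n_b * x.shape[0] / n))), n_b - 1)
    stats = np.empty(n_rep)
    for r in range(n_rep):
        idx = rng.choice(n, size=n_b, replace=False)  # without replacement
        stats[r] = quadratic_distance(pooled[idx[:n_b_x]],
                                      pooled[idx[n_b_x:]], h)
    # Upper (1 - level) quantile of the subsampled statistics.
    return float(np.quantile(stats, 1.0 - level)), n_b


# Illustrative use: the same data with several values of b.
demo_rng = np.random.default_rng(1)
x_demo = demo_rng.normal(size=(60, 2))
y_demo = demo_rng.normal(size=(40, 2))
for b_demo in (0.5, 0.8, 1.0):
    cv_demo, n_b_demo = subsampling_critical_value(x_demo, y_demo, b=b_demo)
    print(f"b={b_demo:.1f}  n_B={n_b_demo}  critical value={cv_demo:.4f}")
```

In this sketch the statistic is computed on n_B points rather than n, so its null distribution generally depends on n_B, and no rescaling is attempted. The cost of each replication also grows with n_B, since the kernel matrices have on the order of n_B^2 entries. Both effects illustrate why the choice of b matters and why a rule relating it to sample size and dimension would be useful.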

Clear guidance on n_B would improve the stability and reproducibility of the tests implemented in the QuadratiK package and address a recognized gap in the subsampling methodology for KBQD goodness-of-fit testing.

References

There is no clear guidance for the choice of the "optimal" subsample size $n_B$ and, the literature investigates this aspect according to optimal subsampling probabilities formulated by minimizing some function of the asymptotic distribution.

Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python (2402.02290 - Saraceno et al., 3 Feb 2024) in Subsection k-Sample Tests, Section 3 (Multivariate KBQD tests)