Incomplete U-Statistics

Updated 11 November 2025

Incomplete U-statistics are estimators that average a subset of kernel evaluations, offering a tractable alternative to full U-statistics.
They achieve key statistical properties like unbiasedness, consistency, and asymptotic normality through both randomized (Bernoulli) and design-based sampling methods.
These methods balance computational efficiency and statistical accuracy, enabling scalable inference in high-dimensional, heavy-tailed, and degenerate settings.

Incomplete U-statistics constitute a broad class of estimators and test statistics that sum over a computationally tractable, randomly or deterministically selected subset of the kernel evaluations constituting a U-statistic. They address the intractability of high-order, high-dimensional, or high-volume U-statistics while often retaining desirable statistical properties such as unbiasedness, consistency, and valid inference, even in complex or irregular models. Modern developments have established explicit Berry–Esseen bounds, Edgeworth expansions, central limit theorems, non-asymptotic tail inequalities, and precise trade-offs between statistical accuracy and computational budget for a wide variety of incomplete U-statistic formulations, including randomized and combinatorially structured designs.

1. Formal Definition and Types of Incomplete U-Statistics

Given i.i.d. observations $X_1,\dots,X_n$ taking values in a space $\mathcal{X}$ and a symmetric kernel $h:\mathcal{X}^m\rightarrow\mathbb{R}^p$, the complete U-statistic is

$U_n = \frac{1}{\binom{n}{m}} \sum_{1\leq i_1 < \dots < i_m \leq n} h(X_{i_1},\dots,X_{i_m}).$
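As a concrete reference point, the complete U-statistic can be computed by brute force for small $n$. The sketch below is illustrative only: the function names `complete_u_statistic` and `variance_kernel` are hypothetical, the kernel is assumed scalar ($p=1$), and the order-2 kernel $h(a,b)=(a-b)^2/2$ is chosen because its U-statistic is the unbiased sample variance.

```python
from itertools import combinations
from math import comb


def complete_u_statistic(xs, kernel, m):
    """Average a symmetric kernel over all size-m subsets of xs."""
    total = 0.0
    for subset in combinations(xs, m):
        total += kernel(*subset)
    return total / comb(len(xs), m)


def variance_kernel(a, b):
    # Symmetric order-2 kernel; its U-statistic is the unbiased sample variance.
    return 0.5 * (a - b) ** 2


data = [1.0, 2.0, 4.0, 7.0]
u_n = complete_u_statistic(data, variance_kernel, m=2)
print(u_n)  # 7.0, the unbiased sample variance of data
```

The loop visits all $\binom{n}{m}$ subsets, which is exactly the cost that incomplete U-statistics are designed to avoid.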

For computational reasons, the incomplete U-statistic averages the kernel over a subset $D$ of $N\ll\binom{n}{m}$ $m$-tuples. The two standard constructions are:

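Randomized/Bernoulli Sampling: Let $I_{n,m}$ denote the set of all $m$-element index subsets of $\{1,\dots,n\}$, and write $X_\iota=(X_{i_1},\dots,X_{i_m})$ for $\iota=(i_1,\dots,i_m)$. For each $\iota\in I_{n,m}$, independently draw $Z_\iota\sim\mathrm{Bernoulli}(p_n)$ with $p_n=N/\binom{n}{m}$, and set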

$U'_{n,N} = \frac{1}{\hat N} \sum_{\iota} Z_\iota\, h(X_{\iota}),$

where $\hat N = \sum_\iota Z_\iota$ is the random number of included terms.

Deterministic/Design-based Sampling: Choose $D\subset I_{n,m}$ of size $N$ (randomly or by a structural algorithm), and compute

$U_{n,D} = \frac{1}{N} \sum_{S\in D} h(X_S).$
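The two constructions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from any of the cited papers: all function names are hypothetical, the kernel is assumed scalar, and the Bernoulli version still enumerates every tuple (only the kernel evaluations are skipped), so it is suitable for small $n$ only. The design-based version draws a uniformly random design of distinct tuples by rejection.

```python
import random
from itertools import combinations
from math import comb


def bernoulli_incomplete_u(xs, kernel, m, budget, rng):
    """Bernoulli sampling: keep each m-tuple independently with probability p_n."""
    total_tuples = comb(len(xs), m)
    if not 0 < budget <= total_tuples:
        raise ValueError("budget must lie in (0, C(n, m)]")
    p_n = budget / total_tuples
    total, n_hat = 0.0, 0
    for subset in combinations(xs, m):
        if rng.random() < p_n:  # Z_iota = 1
            total += kernel(*subset)
            n_hat += 1
    if n_hat == 0:
        raise ValueError("no tuple was selected; increase the budget")
    return total / n_hat, n_hat


def random_design(n, m, size, rng):
    """Draw `size` distinct m-subsets of range(n) uniformly at random."""
    if not 0 < size <= comb(n, m):
        raise ValueError("size must lie in (0, C(n, m)]")
    design = set()
    while len(design) < size:
        design.add(tuple(sorted(rng.sample(range(n), m))))
    return sorted(design)


def design_u_statistic(xs, kernel, design):
    """Average the kernel over the index tuples in a fixed design D."""
    total = sum(kernel(*(xs[i] for i in block)) for block in design)
    return total / len(design)


def half_squared_diff(a, b):
    return 0.5 * (a - b) ** 2


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(30)]
u_bernoulli, n_hat = bernoulli_incomplete_u(data, half_squared_diff, 2, 60, rng)
design = random_design(len(data), 2, 60, rng)
u_design = design_u_statistic(data, half_squared_diff, design)
```

In the Bernoulli scheme the number of evaluated tuples `n_hat` is random with mean equal to the budget, whereas the design-based scheme evaluates exactly `len(design)` tuples.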

Further extensions include sampling with replacement, stratified sampling, and combinatorial designs (e.g., equireplicate, orthogonal array).

Classical U-statistics are unbiased and, for fixed $m\ll n$, have variance $O(1/n)$ under non-degeneracy. Incomplete U-statistics typically remain unbiased, at the price of a variance increase governed by the number of sampled tuples $N$ and, for structured designs, by the overlap between the selected tuples.
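The variance cost can be made explicit in one simple case, stated here as an illustration for a scalar kernel and a design $D$ of $N$ tuples drawn uniformly with replacement (an assumption; other schemes give different finite-sample constants). Conditional on the data, $U_{n,D}$ is a mean of $N$ i.i.d. draws from the kernel values with conditional mean $U_n$, so the law of total variance gives

$\mathrm{Var}(U_{n,D}) = \frac{\sigma_h^2}{N} + \Bigl(1-\frac{1}{N}\Bigr)\mathrm{Var}(U_n), \qquad \sigma_h^2 = \mathrm{Var}\,h(X_1,\dots,X_m).$

In the non-degenerate case $\mathrm{Var}(U_n)\approx m^2\sigma_g^2/n$, where $\sigma_g^2$ is the variance of the first-order projection $\mathbb{E}[h(X_1,\dots,X_m)\mid X_1]$, so

$\mathrm{Var}(U_{n,D}) \approx \frac{m^2\sigma_g^2}{n} + \frac{\sigma_h^2}{N}.$

The first term is the cost of having only $n$ observations; the second is the additional cost of evaluating only $N$ tuples.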

2. Statistical Properties and Limit Theory

Recent research has established strong distributional approximations and concentration for incomplete U-statistics in both non-degenerate and degenerate settings, even when the underlying kernels are high-degree, high-dimensional, or heavy-tailed:

Asymptotic Normality and Berry–Esseen Bounds: In the standard non-degenerate regime, with $N\asymp n$ , the scaled incomplete U-statistic obeys a CLT:

$\sqrt n\,(U'_{n,N} - \mu) \ \rightsquigarrow\ N_p(0,\ m^2\,\Gamma_g + \alpha\,\Gamma_h), \qquad \alpha = \lim_{n\to\infty} n/N,$

where $\Gamma_g$ is the covariance of the Hájek projection $g(x) = \mathbb{E}[h(x, X_2, \dots, X_m)]$, and $\Gamma_h$ is the covariance of the full kernel; in finite samples the limiting covariance is approximated by $m^2\,\Gamma_g + (n/N)\,\Gamma_h$. The Berry–Esseen rate is $O(n^{-1/2})$ up to polylogarithmic factors under appropriate moment and tail conditions (Sturma et al., 2022, Leung, 2024, Song et al., 2019, Miglioli et al., 23 Oct 2025).

Regimes of Computational Budget:
- $N \gg n$: The full U-statistic CLT is recovered.
- $N \ll n$: Variance is dominated by kernel variability; under $\sqrt N$ scaling, the CLT limit reflects only $\Gamma_h$.
- $N \asymp n$: Both kernel and projection variability matter; Berry–Esseen rates are uniform across regular/irregular/singular nulls (Leung et al., 2024, Leung, 2024).
Degeneracy and Mixed Degeneracy: When the kernel is degenerate (projection vanishes), the limiting law and validity depend on higher-order projections. The "mixed-degenerate" regime (some coordinate projections vanish, others don't) is covered by refined Gaussian approximation theorems for incomplete U-statistics (Sturma et al., 2022).
Consistency Under Heavy Tails: Even when the kernel has finite moments only up to order $p\geq1$, incomplete U-statistics are $L_1$-consistent with explicit power-law rates in $\min(n,N)$; variance saturation is avoided by balancing truncation and sampling error (Dürre et al., 2021).
Design-based and Equireplicate Sampling: For deterministic designs, new Berry–Esseen bounds are available. If the maximal replication $\Delta$ is $O(\log n)$ and the kernel order $m$ grows at most logarithmically, the variance is minimized, and central limit theorems hold even in degenerate regimes (Miglioli et al., 23 Oct 2025).
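The variance term $m^2\,\Gamma_g + (n/N)\,\Gamma_h$ in the non-degenerate CLT can be checked with a small Monte Carlo experiment. The sketch below rests on several assumptions that are not part of the cited results: the kernel is $h(a,b)=(a-b)^2/2$ with standard normal data, for which a hand calculation gives $\Gamma_g=1/2$ and $\Gamma_h=2$; pairs are sampled with replacement rather than by Bernoulli selection; and all function names are hypothetical.

```python
import random
from statistics import pvariance


def half_squared_diff(a, b):
    return 0.5 * (a - b) ** 2


def incomplete_u_with_replacement(xs, n_pairs, rng):
    """Average the kernel over n_pairs index pairs drawn with replacement."""
    n = len(xs)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)  # one uniformly random pair, i != j
        total += half_squared_diff(xs[i], xs[j])
    return total / n_pairs


def scaled_variance(n, n_pairs, reps, seed=0):
    """Monte Carlo estimate of n * Var(incomplete U-statistic)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(incomplete_u_with_replacement(xs, n_pairs, rng))
    return n * pvariance(estimates)


def predicted_scaled_variance(n, n_pairs, m=2, gamma_g=0.5, gamma_h=2.0):
    """Asymptotic value m^2 * Gamma_g + (n / N) * Gamma_h for this kernel."""
    return m ** 2 * gamma_g + (n / n_pairs) * gamma_h
```

Calling `scaled_variance(50, 50, 3000)` and comparing with `predicted_scaled_variance(50, 50)` (which equals 4.0) gives a quick sanity check; lowering `n_pairs` relative to `n` inflates the second term, as in the $N\ll n$ regime.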

3. Computational and Statistical Trade-offs

The principal motivation for incomplete U-statistics is computational: evaluating all $\binom{n}{m}$ terms is infeasible even for moderate $n$ and $m$.

Complexity:
- Complete: $O(n^m)$ kernel evaluations and memory.
- Incomplete (random sampling): $O(N)$ , independent of $m$ .
- Equireplicate designs: $O(n r)$ when every data point occurs in exactly $r$ blocks; this allows $r\ll n$, for example $r=O(\log^q n)$.
- Design-based algorithms: $O(N)$ for structured arrays (e.g., orthogonal arrays (Kong et al., 2020), perfect matchings, cyclic hypergraphs (Miglioli et al., 23 Oct 2025)).
Variance and Rate Control:
- For $N = c\,n$ (randomized) or $r = O(\log^q n)$ (design-based), one typically achieves $O(1/n)$ variance and $O(n^{-1/2})$ error rates.
- In degenerate/high-order cases, precise trade-offs involve higher-order projections and sample overlap structure; variance reductions are guaranteed for optimal equireplicate cyclic/hypergraph designs (Miglioli et al., 23 Oct 2025, Kong et al., 2020).
Accuracy–Speed Frontier:
- Increasing $N$ (or $r$ ) improves statistical accuracy, but practical experiments confirm that $N\asymp n$ suffices for size and power accuracy in kernel tests, even as $m$ grows (Schrab et al., 2022, Miglioli et al., 23 Oct 2025).
- Returns diminish quickly: once $r\geq2$, no further benefit is observed in higher-order Edgeworth expansions (Shao et al., 2023).
- Practical recommendations are to select $N=2n$ or $r=O(\log^q n)$ to balance computational cost and statistical power.
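To make the budget comparison concrete, the number of kernel evaluations under each scheme can be tabulated directly. The helper below is a hypothetical illustration; it assumes the $N=2n$ recommendation for random sampling and uses the counting identity that an $r$-equireplicate design with blocks of size $m$ has $nr/m$ blocks.

```python
from math import comb


def evaluation_counts(n, m, r):
    """Kernel evaluations needed by each scheme for sample size n and order m."""
    if (n * r) % m != 0:
        raise ValueError("an r-equireplicate design needs m to divide n * r")
    return {
        "complete": comb(n, m),        # all m-subsets
        "random_N_eq_2n": 2 * n,       # random sampling with N = 2n
        "equireplicate": n * r // m,   # each point lies in exactly r blocks
    }


counts = evaluation_counts(n=1000, m=3, r=6)
print(counts)
# {'complete': 166167000, 'random_N_eq_2n': 2000, 'equireplicate': 2000}
```

For $n=1000$ and $m=3$, the complete statistic needs about $1.7\times10^8$ evaluations, while both incomplete schemes need 2000.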

4. Limitations, Robustness, and Extensions

Singular/Irregular Models: In classical U-statistics, regularity (nonvanishing projection) is required for CLTs and accurate size in hypothesis testing. Incomplete U-statistics, due to reduced dependence among the averaged terms, achieve "singularity-agnostic" Berry–Esseen bounds, allowing valid inference in singular or near-singular situations (e.g., nested models, boundary constraints, algebraic singularities). The bounds contain no term analogous to the ratio $\sigma_h/\sigma_g$, which diverges as the projection variance vanishes, so the normal approximation is uniformly accurate regardless of proximity to irregularity (Leung et al., 2024).
Heavy-tailed and Infinite-variance Regimes: $L_1$ -consistency can hold under finite $p$ -th moment for any $p>1$ , with nonparametric rates derived for various sampling schemes (with/without replacement, Bernoulli), covering settings where classical theory is inapplicable (Dürre et al., 2021).
Banach- and Hilbert-space Kernels: Deviations and high-moment inequalities for incomplete U-statistics in Banach spaces have been developed under minimal smoothness/degeneracy, supporting functional data (Giraudo, 2024, Giraudo, 2024).
High-dimensional and Infinite-order U-Statistics: Non-asymptotic bounds and bootstrap results control error for high-dimensional kernels and diverging order (e.g., random forests, subbagging) (Song et al., 2019, Chen et al., 2017). Data-driven bootstrap and wild/MG bootstrap techniques are effective and computationally feasible for incomplete U-statistics (Sturma et al., 2022, Schrab et al., 2022, Chen et al., 2017).

5. Applications in Testing and Machine Learning

Incomplete U-statistics are foundational for scalable inference and high-dimensional testing:

Testing with Many or Polynomial Constraints: In situations where the number of constraints $p$ is large compared to $n$, incomplete U-statistics allow testing of (in)equalities via bootstrap-calibrated statistics, providing uniform type I error control without constrained optimization even in "singular" hypotheses (Sturma et al., 2022).
Goodness-of-fit in Latent Structure Models: Testing large families of algebraically-defined constraints (e.g., tetrads in latent tree models) is made tractable by incomplete U-statistics (e.g., $O(\ell^4)$ constraints with $N=2n$ evaluations) (Sturma et al., 2022).
Kernel Methods and Two-sample/Independence Testing: MMDAggInc, HSICAggInc, and KSDAggInc tests utilize incomplete U-statistics for the Maximum Mean Discrepancy, Hilbert–Schmidt Independence Criterion, and Kernel Stein Discrepancy. These achieve minimax-optimal uniform separation rates, and wild bootstrap calibration matches permutation-based approaches in power, but with reduced complexity (Schrab et al., 2022, Miglioli et al., 23 Oct 2025).
Empirical Risk Minimization: In metric learning, clustering, ranking, and robust estimation, incomplete U-statistics provide $O(n^{-1/2})$ learning rates for ERM with $O(n)$ kernel evaluations, outperforming naive subsampling (Clémençon et al., 2015).
Network Method of Moments and Graph Statistics: Higher-order moments and motif counts in network models can be computed and tested via incomplete U-statistics; regime-specific inference is informed by the budget exponent $\alpha$ and network sparsity parameters (Shao et al., 2023).
Functional Data and Infinite-dimensional Kernels: Balanced incomplete designs and exponential inequalities for Hilbert- and Banach-valued data ensure statistical error control with fixed or slowly growing design sizes (Duembgen et al., 2022, Giraudo, 2024, Giraudo, 2024).
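As an illustration of the kernel-testing application, an incomplete U-statistic estimate of the squared Maximum Mean Discrepancy can be sketched as below. This shows the estimator only, not the MMDAggInc procedure: there is no kernel aggregation or wild bootstrap calibration, and the one-dimensional Gaussian kernel, the fixed bandwidth, the with-replacement pair sampling, and all function names are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_pair_kernel(z_i, z_j, bandwidth=1.0):
    """Order-2 U-statistic kernel for MMD^2 on paired points z = (x, y)."""
    (x_i, y_i), (x_j, y_j) = z_i, z_j
    return (
        gaussian_kernel(x_i, x_j, bandwidth)
        + gaussian_kernel(y_i, y_j, bandwidth)
        - gaussian_kernel(x_i, y_j, bandwidth)
        - gaussian_kernel(x_j, y_i, bandwidth)
    )


def incomplete_mmd(xs, ys, n_pairs, rng, bandwidth=1.0):
    """Average the MMD kernel over n_pairs random index pairs (O(n_pairs) cost)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    zs = list(zip(xs, ys))
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(zs)), 2)
        total += mmd_pair_kernel(zs[i], zs[j], bandwidth)
    return total / n_pairs


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [rng.gauss(3.0, 1.0) for _ in range(200)]  # shifted alternative
mmd_shifted = incomplete_mmd(xs, ys, n_pairs=400, rng=rng)
```

With `n_pairs` equal to $2n$ the cost is linear in the sample size, compared with the quadratic cost of the complete MMD U-statistic.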

6. Algorithmic Construction and Practical Implementation

Advanced combinatorial and design-based sampling algorithms enable optimal variance and coverage properties:

Equireplicate Designs: For $m=2$ , partitioning of $K_n$ into perfect matchings, Hamiltonian cycles, or cyclic hypergraphs yields $r$ -equireplicate block collections that are optimal or near-optimal for variance and Berry–Esseen error (Miglioli et al., 23 Oct 2025).
Orthogonal Arrays: OA-based selection of sample blocks eliminates lower-order projections and achieves asymptotic efficiency with far fewer kernel evaluations than random subsampling; in some non-degenerate cases a budget of $N\gg\sqrt n$ evaluations suffices (Kong et al., 2020).
Cyclic Designs for $m>2$ : Construction using appropriately shifted index blocks supports higher-order settings and deterministic computational control (Miglioli et al., 23 Oct 2025).
Divide-and-conquer and Local Jackknife: For estimating nuisance terms (e.g., projections $g(x)$ ), DC or local-jackknife schemes provide estimator splitting suitable for bootstrapping (Sturma et al., 2022, Song et al., 2019).
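For $m=2$, one standard way to obtain an $r$-equireplicate design from perfect matchings is the round-robin (circle) method, which splits $K_n$ into $n-1$ perfect matchings when $n$ is even. The sketch below uses this textbook construction for illustration; it is not claimed to be the specific algorithm of the cited papers, and the function names are hypothetical.

```python
def round_robin_matchings(n, rounds=None):
    """First `rounds` perfect matchings of K_n (n even) via the circle method."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be even and at least 2")
    rounds = n - 1 if rounds is None else rounds
    if not 1 <= rounds <= n - 1:
        raise ValueError("rounds must lie between 1 and n - 1")
    fixed = n - 1  # vertex held fixed while the others rotate
    result = []
    for k in range(rounds):
        matching = [(k, fixed)]
        for j in range(1, n // 2):
            a = (k + j) % (n - 1)
            b = (k - j) % (n - 1)
            matching.append((min(a, b), max(a, b)))
        result.append(matching)
    return result


def equireplicate_pairs(n, r):
    """Union of r disjoint perfect matchings: every index lies in exactly r pairs."""
    return [pair for matching in round_robin_matchings(n, r) for pair in matching]


def design_average(xs, kernel, design):
    """Incomplete U-statistic over a fixed design of index pairs."""
    return sum(kernel(xs[i], xs[j]) for i, j in design) / len(design)


design = equireplicate_pairs(n=6, r=2)
print(len(design))  # 6 pairs, i.e. n * r / 2
```

Taking all $n-1$ matchings recovers every pair exactly once, so `design_average` then coincides with the complete U-statistic.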

Empirical demonstration on large-scale datasets (e.g., CIFAR-10) confirms that such designs can deliver speedup factors of $10^2$ – $10^3$ over permutation-based or full U-statistic computations, while maintaining power and error control (Miglioli et al., 23 Oct 2025).

Summary Table: Statistical/Computational Properties of Incomplete U-Statistics

Feature	Complete U-Statistic	Random Incomplete	Equireplicate/Design-based
Complexity	$O(n^m)$	$O(N)$	$O(nr)$
Variance (non-degenerate)	$O(1/n)$	$O(1/n + 1/N)$	$O(1/(nr))$
Asymptotic Normality	CLT if projection $\neq0$	Uniform by Berry–Esseen bounds	Uniform even if kernel order grows
Control under degeneracy	Non-Gaussian/Breakdown	Valid for mixed-degenerate	Valid for all designs
Limiting distribution rate	$O(n^{-1/2})$	$O(N^{-1/2})$	$O((nr)^{-1/2})$

7. Limitations and Open Questions

While uniform Berry–Esseen rates and consistency guarantees exist, some questions remain open, including sharp lower bounds in heavy-tailed regimes, extensions of $L_1$ -consistency to full central limit theorems under $p < 2$ moments, and the characterization of optimal designs for degenerate kernels as kernel order diverges.
The limits of inference with highly aggressive pruning or extremely degenerate test statistics involve trade-offs between statistical accuracy, coverage, and computational savings that may depend intricately on network structure or kernel smoothness.

A plausible implication is that the structure of the design (random or combinatorial) can be explicitly tuned to match the required accuracy regime for specific inferential tasks, offering adaptive and scalable methodology extendable across nonparametric functionals, kernel tests, and large-scale machine learning.