
Incomplete U-Statistics

Updated 11 November 2025
  • Incomplete U-statistics are estimators that average a subset of kernel evaluations, offering a tractable alternative to full U-statistics.
  • They achieve key statistical properties like unbiasedness, consistency, and asymptotic normality through both randomized (Bernoulli) and design-based sampling methods.
  • These methods balance computational efficiency and statistical accuracy, enabling scalable inference in high-dimensional, heavy-tailed, and degenerate settings.

Incomplete U-statistics constitute a broad class of estimators and test statistics that average over a computationally tractable subset, selected randomly or deterministically, of the kernel evaluations that make up a U-statistic. They address the intractability of high-order, high-dimensional, or high-volume U-statistics while often retaining desirable statistical properties such as unbiasedness, consistency, and valid inference, even in complex or irregular models. Modern developments have established explicit Berry–Esseen bounds, Edgeworth expansions, central limit theorems, non-asymptotic tail inequalities, and precise trade-offs between statistical accuracy and computational budget for a wide variety of incomplete U-statistic formulations, including randomized and combinatorially structured designs.

1. Formal Definition and Types of Incomplete U-Statistics

Given i.i.d. observations $X_1,\dots,X_n$ and a symmetric kernel $h:X^m\rightarrow\mathbb{R}^p$, the complete U-statistic is

$$U_n = \frac{1}{\binom{n}{m}} \sum_{1\leq i_1 < \dots < i_m \leq n} h(X_{i_1},\dots,X_{i_m}).$$
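
As a concrete illustration of the definition, the following minimal sketch evaluates a complete U-statistic by brute force. The function names and the choice of the order-2 kernel $h(x,y)=(x-y)^2/2$ (whose U-statistic is the unbiased sample variance) are illustrative assumptions, not taken from the cited papers.

```python
from itertools import combinations
from math import comb


def complete_u_statistic(data, kernel, m):
    """Average `kernel` over all m-subsets of `data` (binom(n, m) evaluations)."""
    n = len(data)
    total = sum(kernel(*subset) for subset in combinations(data, m))
    return total / comb(n, m)


def variance_kernel(x, y):
    # Symmetric order-2 kernel; its U-statistic is the unbiased sample variance.
    return 0.5 * (x - y) ** 2


sample = [1.0, 2.0, 4.0, 7.0]
u_n = complete_u_statistic(sample, variance_kernel, m=2)
print(u_n)  # 7.0, the unbiased sample variance of `sample`
```

The loop visits every one of the $\binom{n}{m}$ subsets, which is exactly the cost that incomplete versions are designed to avoid.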

For computational reasons, the incomplete U-statistic averages the kernel over a subset $D$ of $N\ll\binom{n}{m}$ $m$-tuples. The two standard constructions are:

  • Randomized/Bernoulli Sampling: For each $m$-tuple $\iota\in I_{n,m}$, draw $Z_\iota\sim\mathrm{Bernoulli}(p_n)$ with $p_n=N/\binom{n}{m}$, and set

$$U'_{n,N} = \frac{1}{\hat N}\sum_{\iota\in I_{n,m}} Z_\iota\, h(X_\iota),$$

where $X_\iota=(X_{i_1},\dots,X_{i_m})$ for $\iota=(i_1,\dots,i_m)$ and $\hat N=\sum_{\iota\in I_{n,m}} Z_\iota$ is the random number of included terms.

  • Deterministic/Design-based Sampling: Choose $D\subset I_{n,m}$ of size $N$ (randomly or by a structural algorithm), and compute

$$U_{n,D} = \frac{1}{N}\sum_{\iota\in D} h(X_\iota).$$

  • Further extensions include sampling with replacement, stratified sampling, and combinatorial designs (e.g., equireplicate, orthogonal array).
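
The two constructions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the variance kernel, and the way the explicit design is drawn (uniform pairs, repeats allowed) are hypothetical choices, not the implementation of any cited paper.

```python
from itertools import combinations
from math import comb
import random


def bernoulli_incomplete_u(data, kernel, m, budget, rng):
    """Keep each m-subset independently with probability budget / binom(n, m)."""
    p_n = budget / comb(len(data), m)
    total, n_hat = 0.0, 0
    for subset in combinations(data, m):
        if rng.random() < p_n:
            total += kernel(*subset)
            n_hat += 1
    if n_hat == 0:
        raise ValueError("no tuple was selected; increase the budget")
    return total / n_hat, n_hat


def fixed_design_incomplete_u(data, kernel, design):
    """Average the kernel over an explicit list of index tuples (the design)."""
    return sum(kernel(*(data[i] for i in idx)) for idx in design) / len(design)


def variance_kernel(x, y):
    return 0.5 * (x - y) ** 2


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(60)]
estimate, n_hat = bernoulli_incomplete_u(data, variance_kernel, 2, budget=300, rng=rng)
design = [tuple(rng.sample(range(len(data)), 2)) for _ in range(300)]
estimate_fixed = fixed_design_incomplete_u(data, variance_kernel, design)
```

Note that this Bernoulli sketch still loops over all $\binom{n}{m}$ index tuples and only saves kernel evaluations; a scalable version would draw the selected tuples directly instead of enumerating them.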

Classical U-statistics are unbiased for $\theta=\mathbb{E}[h(X_1,\dots,X_m)]$ and, for a square-integrable kernel of fixed order $m$, have variance of order $1/n$ under non-degeneracy. Incomplete U-statistics typically remain unbiased, at the price of a variance increase that depends on how far the number of evaluated tuples is reduced and on the overlap structure of the design.
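
The variance increase can be written out exactly in one simple case. The derivation below is a sketch under assumptions that are not part of the text above: a scalar, square-integrable kernel, and $N$ tuples drawn uniformly with replacement from all $\binom{n}{m}$ tuples, independently of the data. The symbols $U_{n,N}$ (the resulting incomplete U-statistic) and $\sigma_h^2=\operatorname{Var}\,h(X_1,\dots,X_m)$ are notation introduced here. Conditioning on the data and using the law of total variance:

```latex
\begin{aligned}
\operatorname{Var}(U_{n,N})
  &= \mathbb{E}\bigl[\operatorname{Var}(U_{n,N}\mid X_1,\dots,X_n)\bigr]
   + \operatorname{Var}\bigl(\mathbb{E}[U_{n,N}\mid X_1,\dots,X_n]\bigr) \\
  &= \frac{\sigma_h^2-\operatorname{Var}(U_n)}{N} + \operatorname{Var}(U_n) \\
  &= \frac{\sigma_h^2}{N} + \Bigl(1-\frac{1}{N}\Bigr)\operatorname{Var}(U_n).
\end{aligned}
```

The first term is the price of subsampling and vanishes as $N$ grows; the second is essentially the variance of the complete U-statistic.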

2. Statistical Properties and Limit Theory

Recent research has established strong distributional approximations and concentration for incomplete U-statistics in both non-degenerate and degenerate settings, even when the underlying kernels are high-degree, high-dimensional, or heavy-tailed:

  • Asymptotic Normality and Berry–Esseen Bounds: In the standard non-degenerate regime, with $\alpha_n=n/N$, the scaled incomplete U-statistic $U'_{n,N}$ obeys a CLT:

$$\sqrt{n}\,\bigl(U'_{n,N}-\theta\bigr) \approx \mathcal{N}\bigl(0,\; m^2\Gamma_g+\alpha_n\Gamma_h\bigr),$$

where $\theta=\mathbb{E}[h(X_1,\dots,X_m)]$, $\Gamma_g$ is the covariance of the Hájek projection $g(x)=\mathbb{E}[h(x,X_2,\dots,X_m)]$, and $\Gamma_h$ is the covariance of the full kernel. Berry–Esseen bounds for this approximation hold, up to polylogarithmic factors, under appropriate moment and tail conditions (Sturma et al., 2022, Leung, 2024, Song et al., 2019, Miglioli et al., 23 Oct 2025).

  • Regimes of Computational Budget:
    • $N\gg n$ ($\alpha_n\to 0$): Recovery of the full U-statistic CLT.
    • $N\ll n$ ($\alpha_n\to\infty$): Variance is dominated by kernel variability, and the CLT limit (after rescaling by $\sqrt{N}$) reflects only $\Gamma_h$.
    • $N\asymp n$: Both kernel and projection variability matter; Berry–Esseen rates are uniform across regular/irregular/singular nulls (Leung et al., 2024, Leung, 2024).
  • Degeneracy and Mixed Degeneracy: When the kernel is degenerate (projection vanishes), the limiting law and validity depend on higher-order projections. The "mixed-degenerate" regime (some coordinate projections vanish, others don't) is covered by refined Gaussian approximation theorems for incomplete U-statistics (Sturma et al., 2022).
  • Consistency Under Heavy Tails: Even when the kernel has finite moments only up to some order $p\ge 1$, so that second moments may be infinite, incomplete U-statistics are $L^1$-consistent with explicit power-law rates; variance-saturation is avoided by balancing truncation and sampling error (Dürre et al., 2021).
  • Design-based and Equireplicate Sampling: For deterministic designs, new Berry–Esseen bounds are available. Under a growth condition on the maximal replication of the design (the largest number of blocks containing any single observation), and provided the kernel order $m$ grows at most logarithmically, the variance is minimized and central limit theorems hold even in degenerate regimes (Miglioli et al., 23 Oct 2025).
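
To make the budget regimes concrete, the sketch below splits the first-order variance of a randomized incomplete U-statistic into a projection part and a sampling part. It assumes a scalar, non-degenerate kernel and uses the standard approximation $m^2\sigma_g^2/n+\sigma_h^2/N$, where $\sigma_g^2$ and $\sigma_h^2$ denote the variances of the Hájek projection and of the kernel. The function names and the factor-of-ten threshold used to label a regime are arbitrary illustrative choices.

```python
def variance_contributions(n, N, m, sigma_g2, sigma_h2):
    """First-order variance split: (projection part, sampling part)."""
    projection_part = m ** 2 * sigma_g2 / n
    sampling_part = sigma_h2 / N
    return projection_part, sampling_part


def budget_regime(n, N, m, sigma_g2, sigma_h2, ratio=10.0):
    """Label the regime by which variance part dominates (threshold is arbitrary)."""
    projection_part, sampling_part = variance_contributions(n, N, m, sigma_g2, sigma_h2)
    if sampling_part * ratio <= projection_part:
        return "projection-dominated"
    if projection_part * ratio <= sampling_part:
        return "sampling-dominated"
    return "balanced"


# Variance kernel h(x, y) = (x - y)^2 / 2 with standard normal data:
# sigma_g2 = 0.5 and sigma_h2 = 2.0.
for budget in (10, 1_000, 100_000):
    print(budget, budget_regime(n=1_000, N=budget, m=2, sigma_g2=0.5, sigma_h2=2.0))
```

With $n=1000$, a budget of 10 tuples is sampling-dominated, a budget of 1000 is balanced, and a budget of 100000 is projection-dominated, matching the three regimes listed above.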

3. Computational and Statistical Trade-offs

The principal motivation for incomplete U-statistics is computational: evaluating all $\binom{n}{m}$ terms is infeasible even for moderate $n$ and $m$.

  • Complexity:
    • Complete: $\binom{n}{m}=O(n^m)$ kernel evaluations and memory.
    • Incomplete (random sampling): $O(N)$ kernel evaluations, set by the budget rather than by $\binom{n}{m}$.
    • Equireplicate designs: $nr/m$ blocks when every data point occurs in exactly $r$ blocks of size $m$.
    • Design-based algorithms: cost set by the size of the structured array (e.g., orthogonal arrays (Kong et al., 2020), perfect matchings, cyclic hypergraphs (Miglioli et al., 23 Oct 2025)).
  • Variance and Rate Control:
    • With a budget of order $n$ kernel evaluations (randomized or design-based), one typically achieves $O(1/n)$ variance and $n^{-1/2}$ error rates.
    • In degenerate/high-order cases, precise trade-offs involve higher-order projections and sample overlap structure; variance reductions are guaranteed for optimal equireplicate cyclic/hypergraph designs (Miglioli et al., 23 Oct 2025, Kong et al., 2020).
  • Accuracy–Speed Frontier:
    • Increasing the budget $N$ (or the replication of a design) improves statistical accuracy, but practical experiments confirm that a budget far below $\binom{n}{m}$ suffices for size and power accuracy in kernel tests, even as the problem size grows (Schrab et al., 2022, Miglioli et al., 23 Oct 2025).
    • There is a regime of diminishing returns: beyond a threshold budget, no further benefit is observed in higher-order Edgeworth expansions (Shao et al., 2023).
    • Practical recommendations therefore tie the choice of $N$ (or of the design's replication) to the balance between computational cost and statistical power.
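
The gap between these costs is easy to quantify. The helper below is a small worked example, not part of any cited method: it counts kernel evaluations for the complete statistic, for a random incomplete statistic with a given budget, and for an equireplicate design in which each observation lies in `replication` blocks of size `m` (so there are `n * replication / m` blocks). The function name and the parameter values are illustrative assumptions.

```python
from math import comb


def evaluation_counts(n, m, budget, replication):
    """Kernel evaluations needed by three ways of computing the statistic."""
    if (n * replication) % m != 0:
        raise ValueError("n * replication must be divisible by m")
    return {
        "complete": comb(n, m),
        "random_incomplete": budget,
        "equireplicate": n * replication // m,
    }


counts = evaluation_counts(n=10_000, m=3, budget=10_000, replication=6)
print(counts)
# {'complete': 166616670000, 'random_incomplete': 10000, 'equireplicate': 20000}
```

For $n=10^4$ and $m=3$ the complete statistic needs about $1.7\times 10^{11}$ kernel evaluations, while both incomplete variants stay at the order of $n$.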

4. Limitations, Robustness, and Extensions

  • Singular/Irregular Models: In classical U-statistics, regularity (nonvanishing projection) is required for CLTs and accurate size in hypothesis testing. Incomplete U-statistics, due to reduced dependence, achieve "singularity-agnostic" Berry–Esseen bounds, allowing valid inference in singular or near-singular situations (e.g., nested models, boundary constraints, algebraic singularities). The bounds contain no term that diverges as the projection degenerates, so the normal approximation is uniformly accurate regardless of proximity to irregularity (Leung et al., 2024).
  • Heavy-tailed and Infinite-variance Regimes: $L^1$-consistency can hold under a finite $p$-th moment for any $p\ge 1$, with nonparametric rates derived for various sampling schemes (with/without replacement, Bernoulli), covering settings where classical theory is inapplicable (Dürre et al., 2021).
  • Banach- and Hilbert-space Kernels: Deviations and high-moment inequalities for incomplete U-statistics in Banach spaces have been developed under minimal smoothness/degeneracy, supporting functional data (Giraudo, 2024, Giraudo, 2024).
  • High-dimensional and Infinite-order U-Statistics: Non-asymptotic bounds and bootstrap results control error for high-dimensional kernels and diverging order (e.g., random forests, subbagging) (Song et al., 2019, Chen et al., 2017). Data-driven bootstrap and wild/MG bootstrap techniques are effective and computationally feasible for incomplete U-statistics (Sturma et al., 2022, Schrab et al., 2022, Chen et al., 2017).

5. Applications in Testing and Machine Learning

Incomplete U-statistics are foundational for scalable inference and high-dimensional testing:

  • Testing with Many or Polynomial Constraints: In situations where the number of constraints $p$ is large compared to $n$, incomplete U-statistics allow testing of (in)equalities via bootstrap-calibrated statistics, providing uniform type I error control without constrained optimization even in "singular" hypotheses (Sturma et al., 2022).
  • Goodness-of-fit in Latent Structure Models: Testing large families of algebraically defined constraints (e.g., tetrads in latent tree models) is made tractable by incomplete U-statistics (Sturma et al., 2022).
  • Kernel Methods and Two-sample/Independence Testing: MMDAggInc, HSICAggInc, and KSDAggInc tests utilize incomplete U-statistics for the Maximum Mean Discrepancy, Hilbert–Schmidt Independence Criterion, and Kernel Stein Discrepancy. These achieve minimax-optimal uniform separation rates, and wild bootstrap calibration matches permutation-based approaches in power, but with reduced complexity (Schrab et al., 2022, Miglioli et al., 23 Oct 2025).
  • Empirical Risk Minimization: In metric learning, clustering, ranking, and robust estimation, incomplete U-statistics provide root-$n$-type learning rates (up to logarithmic factors) for ERM with a number of kernel evaluations of order $n$, outperforming naive subsampling (Clémençon et al., 2015).
  • Network Method of Moments and Graph Statistics: Higher-order moments and motif counts in network models can be computed and tested via incomplete U-statistics; regime-specific inference is informed by the budget exponent and network sparsity parameters (Shao et al., 2023).
  • Functional Data and Infinite-dimensional Kernels: Balanced incomplete designs and exponential inequalities for Hilbert- and Banach-valued data ensure statistical error control with fixed or slowly growing design sizes (Duembgen et al., 2022, Giraudo, 2024, Giraudo, 2024).
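
As an illustration of the kernel-testing application, the sketch below computes an incomplete U-statistic estimate of the squared Maximum Mean Discrepancy from paired scalar samples. It is a simplified stand-in, not the MMDAggInc procedure: there is no bandwidth aggregation and no bootstrap calibration, pairs are drawn uniformly with replacement, and the function names, the Gaussian kernel, and the bandwidth are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_kernel(z1, z2, bandwidth=1.0):
    """Order-2 U-statistic kernel for squared MMD on pairs z = (x, y)."""
    (x1, y1), (x2, y2) = z1, z2
    return (gaussian_kernel(x1, x2, bandwidth) + gaussian_kernel(y1, y2, bandwidth)
            - gaussian_kernel(x1, y2, bandwidth) - gaussian_kernel(x2, y1, bandwidth))


def incomplete_mmd2(xs, ys, n_pairs, rng, bandwidth=1.0):
    """Average the MMD kernel over `n_pairs` randomly chosen index pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct indices
        total += mmd_kernel((xs[i], ys[i]), (xs[j], ys[j]), bandwidth)
    return total / n_pairs


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
ys_shifted = [rng.gauss(3.0, 1.0) for _ in range(300)]
mmd_shifted = incomplete_mmd2(xs, ys_shifted, n_pairs=2000, rng=rng)
mmd_same = incomplete_mmd2(xs, xs, n_pairs=2000, rng=rng)
```

Here 2000 kernel evaluations replace the 44850 pairs of the complete statistic; the estimate is clearly positive when the second sample is shifted and exactly zero when both samples coincide.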

6. Algorithmic Construction and Practical Implementation

Advanced combinatorial and design-based sampling algorithms enable optimal variance and coverage properties:

  • Equireplicate Designs: Partitioning the index tuples into perfect matchings, Hamiltonian cycles, or cyclic hypergraphs yields equireplicate block collections that are optimal or near-optimal for variance and Berry–Esseen error (Miglioli et al., 23 Oct 2025).
  • Orthogonal Arrays: OA-based selection of sample blocks eliminates lower-order projections and achieves asymptotic efficiency with much smaller sample sizes in non-degenerate cases (Kong et al., 2020).
  • Cyclic Designs for Higher-order Kernels: Construction using appropriately shifted index blocks supports higher-order settings and deterministic computational control (Miglioli et al., 23 Oct 2025).
  • Divide-and-conquer and Local Jackknife: For estimating nuisance terms (e.g., projections of the kernel), DC or local-jackknife schemes provide estimator splitting suitable for bootstrapping (Sturma et al., 2022, Song et al., 2019).
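
A simple equireplicate design for pairwise kernels can be built from cyclic shifts. The sketch below is one elementary construction chosen for illustration and is not claimed to be the design of the cited papers: for each shift $s$ it forms the pairs $\{i, (i+s) \bmod n\}$, so a single shift $s=1$ gives a Hamiltonian cycle and $k$ distinct shifts below $n/2$ give a design in which every index lies in exactly $2k$ pairs. Function names are hypothetical.

```python
def circulant_pair_design(n, shifts):
    """Pairs (i, (i + s) mod n) for each shift s; replication is 2 * len(shifts)."""
    if len(set(shifts)) != len(shifts) or any(not (1 <= s < n / 2) for s in shifts):
        raise ValueError("shifts must be distinct integers with 1 <= s < n/2")
    return [(i, (i + s) % n) for s in shifts for i in range(n)]


def replication_counts(n, design):
    """Number of blocks containing each index."""
    counts = [0] * n
    for block in design:
        for i in block:
            counts[i] += 1
    return counts


def design_u_statistic(data, kernel, design):
    """Average the kernel over the blocks of the design."""
    return sum(kernel(*(data[i] for i in block)) for block in design) / len(design)


design = circulant_pair_design(n=10, shifts=[1, 2, 3])
counts = replication_counts(10, design)
print(len(design), set(counts))  # 30 {6}
```

The design uses $nr/2 = 30$ kernel evaluations with replication $r=6$, and every observation contributes to the same number of terms, which is the balance property that equireplicate designs are meant to guarantee.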

Empirical demonstrations on large-scale datasets (e.g., CIFAR-10) confirm that such designs can deliver substantial speedups over permutation-based or full U-statistic computations while maintaining power and error control (Miglioli et al., 23 Oct 2025).


Summary Table: Statistical/Computational Properties of Incomplete U-Statistics

| Feature | Complete U-Statistic | Random Incomplete | Equireplicate/Design-based |
|---|---|---|---|
| Complexity (kernel evaluations) | $\binom{n}{m}$ | $N$ | $nr/m$ for an $r$-equireplicate design |
| Variance (non-degenerate) | $O(1/n)$ | $O(1/n+1/N)$ | depends on replication and block overlap |
| Asymptotic Normality | CLT if projection is non-degenerate | Uniform by Berry–Esseen bounds | Uniform even if kernel order grows |
| Control under degeneracy | Non-Gaussian/Breakdown | Valid for mixed-degenerate | Valid for all designs |
| Limiting distribution rate | | | |

7. Limitations and Open Questions

  • While uniform Berry–Esseen rates and consistency guarantees exist, some questions remain open, including sharp lower bounds in heavy-tailed regimes, extensions of $L^1$-consistency to full central limit theorems under infinite-variance moment conditions, and the characterization of optimal designs for degenerate kernels as kernel order diverges.
  • The limits of inference with highly aggressive pruning or extremely degenerate test statistics involve trade-offs between statistical accuracy, coverage, and computational savings that may depend intricately on network structure or kernel smoothness.

A plausible implication is that the structure of the design (random or combinatorial) can be explicitly tuned to match the required accuracy regime for specific inferential tasks, offering adaptive and scalable methodology extendable across nonparametric functionals, kernel tests, and large-scale machine learning.
