
Incomplete U-Statistics

Updated 11 November 2025
  • Incomplete U-statistics are estimators that average a subset of kernel evaluations, offering a tractable alternative to full U-statistics.
  • They retain key statistical properties such as unbiasedness, consistency, and asymptotic normality under both randomized (Bernoulli) and design-based sampling schemes.
  • These methods balance computational efficiency and statistical accuracy, enabling scalable inference in high-dimensional, heavy-tailed, and degenerate settings.

Incomplete U-statistics constitute a broad class of estimators and test statistics that sum over a computationally tractable, randomly or deterministically selected subset of the kernel evaluations that make up a full U-statistic. They address the intractability of high-order, high-dimensional, or high-volume U-statistics while often retaining desirable statistical properties such as unbiasedness, consistency, and valid inference, even in complex or irregular models. Modern developments have established explicit Berry–Esseen bounds, Edgeworth expansions, central limit theorems, non-asymptotic tail inequalities, and precise trade-offs between statistical accuracy and computational budget for a wide variety of incomplete U-statistic formulations, including randomized and combinatorially structured designs.

1. Formal Definition and Types of Incomplete U-Statistics

Given i.i.d. observations $X_1,\dots,X_n$ and a symmetric kernel $h:\mathcal{X}^m\rightarrow\mathbb{R}^p$, the complete U-statistic is

$$U_n = \frac{1}{\binom{n}{m}} \sum_{1\leq i_1 < \dots < i_m \leq n} h(X_{i_1},\dots,X_{i_m}).$$

For computational reasons, the incomplete U-statistic averages the kernel over a subset $D$ of $N\ll\binom{n}{m}$ $m$-tuples. The two standard constructions are:

  • Randomized/Bernoulli Sampling: For each $m$-tuple $\iota\in I_{n,m}$, draw $Z_\iota\sim\mathrm{Bernoulli}(p_n)$ with $p_n=N/\binom{n}{m}$, and set

$$U'_{n,N} = \frac{1}{\hat N} \sum_{\iota} Z_\iota\, h(X_{\iota}),$$

where $\hat N = \sum_\iota Z_\iota$ is the random number of included terms.

  • Deterministic/Design-based Sampling: Choose $D\subset I_{n,m}$ of size $N$ (randomly or by a structural algorithm), and compute

$$U_{n,D} = \frac{1}{N} \sum_{S\in D} h(X_S).$$

  • Further extensions include sampling with replacement, stratified sampling, and combinatorial designs (e.g., equireplicate, orthogonal array).

Classical U-statistics are unbiased and, with $m\ll n$, have variance $O(1/n)$ under non-degeneracy. Incomplete U-statistics typically retain unbiasedness (exactly, or conditionally on the realized design) and incur a variance inflation proportional to the reduction in the number of retained terms and to the overlap structure of the design.
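The following minimal NumPy sketch illustrates both constructions for the order-2 variance kernel $h(x,y)=\tfrac12(x-y)^2$; the kernel, sample sizes, and the exhaustive enumeration of pairs are illustrative choices made for this overview, not part of any cited method.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def h(x, y):
    # Order-2 kernel whose complete U-statistic is the unbiased sample variance.
    return 0.5 * (x - y) ** 2

def complete_u(x):
    # Averages h over all C(n, 2) pairs -- infeasible for large n.
    return np.mean([h(x[i], x[j]) for i, j in combinations(range(len(x)), 2)])

def incomplete_u_bernoulli(x, N):
    # Randomized (Bernoulli) design: keep each pair independently with
    # probability p_n = N / C(n, 2) and average the kept evaluations.
    n = len(x)
    p = N / (n * (n - 1) / 2)
    kept = [h(x[i], x[j]) for i, j in combinations(range(n), 2) if rng.random() < p]
    return np.mean(kept)

def incomplete_u_design(x, N):
    # Design-based sampling: a fixed-size subset D of N pairs, here drawn
    # uniformly without replacement (structured alternatives appear in Section 6).
    n = len(x)
    all_pairs = list(combinations(range(n), 2))
    chosen = rng.choice(len(all_pairs), size=N, replace=False)
    return np.mean([h(x[i], x[j]) for i, j in (all_pairs[k] for k in chosen)])

x = rng.standard_normal(500)
# All three estimate Var(X) = 1; the incomplete versions use ~N = 1000 of the
# 124,750 possible kernel evaluations (the full enumeration above is for clarity only).
print(complete_u(x), incomplete_u_bernoulli(x, 1000), incomplete_u_design(x, 1000))
```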

2. Statistical Properties and Limit Theory

Recent research has established strong distributional approximations and concentration for incomplete U-statistics in both non-degenerate and degenerate settings, even when the underlying kernels are high-degree, high-dimensional, or heavy-tailed:

  • Asymptotic Normality and Berry–Esseen Bounds: In the standard non-degenerate regime, with $N\asymp n$, the scaled incomplete U-statistic obeys a CLT:

$$\sqrt{n}\,(U'_{n,N} - \mu)\ \rightsquigarrow\ N_p\big(0,\; m^2\,\Gamma_g + (n/N)\,\Gamma_h\big),$$

where $\Gamma_g$ is the covariance of the Hájek projection $g(x) = \mathbb{E}[h(x, X_2, \dots, X_m)]$ and $\Gamma_h$ is the covariance of the full kernel. The Berry–Esseen rate matches $O(n^{-1/2})$ up to polylogarithmic factors under appropriate moment and tail conditions (Sturma et al., 2022, Leung, 8 Jun 2024, Song et al., 2019, Miglioli et al., 23 Oct 2025). A simulation check of this variance decomposition is sketched after this list.

  • Regimes of Computational Budget:
    • $N \gg n$: Recovery of the full U-statistic CLT.
    • $N \ll n$: Variance is dominated by kernel variability, and the CLT limit reflects only $\Gamma_h$.
    • $N \asymp n$: Both kernel and projection variability matter; Berry–Esseen rates are uniform across regular/irregular/singular nulls (Leung et al., 4 Jan 2024, Leung, 8 Jun 2024).
  • Degeneracy and Mixed Degeneracy: When the kernel is degenerate (the Hájek projection vanishes), the limiting law, and hence the validity of the normal approximation, depends on higher-order projections. The "mixed-degenerate" regime (some coordinate projections vanish, others do not) is covered by refined Gaussian approximation theorems for incomplete U-statistics (Sturma et al., 2022).
  • Consistency Under Heavy Tails: Even when kernel moments are finite only up to order $p\geq 1$, incomplete U-statistics are $L_1$-consistent with explicit power-law rates in $\min(n,N)$; variance saturation is avoided by balancing truncation and sampling error (Dürre et al., 2021).
  • Design-based and Equireplicate Sampling: For deterministic designs, new Berry–Esseen bounds are available. If the maximal replication $\Delta$ is $O(\log n)$ and the kernel order $m$ grows at most logarithmically, the variance is minimized and central limit theorems hold even in degenerate regimes (Miglioli et al., 23 Oct 2025).
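To make the covariance formula above concrete, the short simulation below (an illustration constructed for this overview, not taken from the cited papers) compares the Monte Carlo variance of a randomized incomplete U-statistic with the approximation $m^2\Gamma_g/n + \Gamma_h/N$, using the order-2 variance kernel and standard normal data, for which $\mathrm{Var}(g)=1/2$ and $\mathrm{Var}(h)=2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def incomplete_u(x, N):
    # Randomized design: N pairs (i, j), i != j, sampled uniformly with
    # replacement; h is the order-2 variance kernel h(x, y) = (x - y)^2 / 2.
    n = len(x)
    i = rng.integers(0, n, size=N)
    j = rng.integers(0, n - 1, size=N)
    j = np.where(j >= i, j + 1, j)           # shift to avoid i == j
    return np.mean(0.5 * (x[i] - x[j]) ** 2)

n, N, reps = 400, 800, 2000
vals = np.array([incomplete_u(rng.standard_normal(n), N) for _ in range(reps)])

# CLT approximation from above: Var(U') ~ m^2 Var(g)/n + Var(h)/N,
# with m = 2, Var(g) = 1/2, Var(h) = 2 for N(0, 1) data.
approx = 4 * 0.5 / n + 2.0 / N
print(f"Monte Carlo variance: {vals.var():.5f}, approximation: {approx:.5f}")
```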

3. Computational and Statistical Trade-offs

The principal motivation for incomplete U-statistics is computational: evaluating all $\binom{n}{m}$ terms is infeasible for even moderate $n$ and $m$.

  • Complexity:
    • Complete: $O(n^m)$ kernel evaluations and memory.
    • Incomplete (random sampling): $O(N)$, independent of $m$.
    • Equireplicate designs: $O(nr)$ when every data point occurs in exactly $r$ blocks; this supports $r\ll n$, e.g. $r=O(\log^q n)$.
    • Design-based algorithms: $O(N)$ for structured arrays (e.g., orthogonal arrays (Kong et al., 2020), perfect matchings, cyclic hypergraphs (Miglioli et al., 23 Oct 2025)).
  • Variance and Rate Control:
    • For $N = c\,n$ (randomized) or $r = O(\log^q n)$ (design-based), one typically achieves $O(1/n)$ variance and $O(n^{-1/2})$ error rates.
    • In degenerate/high-order cases, precise trade-offs involve higher-order projections and sample-overlap structure; variance reductions are guaranteed for optimal equireplicate cyclic/hypergraph designs (Miglioli et al., 23 Oct 2025, Kong et al., 2020).
  • Accuracy–Speed Frontier:
    • Increasing $N$ (or $r$) improves statistical accuracy, but practical experiments confirm that $N\asymp n$ suffices for accurate size and power in kernel tests, even as $m$ grows (Schrab et al., 2022, Miglioli et al., 23 Oct 2025).
    • There is a regime of diminishing returns: beyond $r\geq 2$, no further benefit is observed in higher-order Edgeworth expansions (Shao et al., 2023).
    • Practical recommendations are to select $N=2n$ or $r=O(\log^q n)$ to balance computational cost and statistical power; the arithmetic sketched after this list illustrates the scale of the savings.
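As a concrete illustration of the savings implied by the $N=2n$ recommendation (a worked example for this overview, with $n$ and $m$ chosen arbitrarily):

```python
from math import comb

n, m = 1000, 4
full = comb(n, m)      # kernel evaluations needed by the complete U-statistic
budget = 2 * n         # the N = 2n incomplete-U budget recommended above
print(full, budget, full // budget)
# 41417124750 evaluations vs. 2000: roughly a 2 x 10^7-fold reduction.
```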

4. Limitations, Robustness, and Extensions

  • Singular/Irregular Models: In classical U-statistics, regularity (nonvanishing projection) is required for CLTs and accurate size in hypothesis testing. Incomplete U-statistics, due to reduced dependence, achieve "singularity-agnostic" Berry–Esseen bounds, allowing valid inference in singular or near-singular situations (e.g., nested models, boundary constraints, algebraic singularities). There is no dependency-diverging term analogous to $\sigma_h/\sigma_g$, so the normal approximation is uniformly accurate regardless of proximity to irregularity (Leung et al., 4 Jan 2024).
  • Heavy-tailed and Infinite-variance Regimes: $L_1$-consistency can hold under finite $p$-th moments for any $p>1$, with nonparametric rates derived for various sampling schemes (with/without replacement, Bernoulli), covering settings where classical theory is inapplicable (Dürre et al., 2021).
  • Banach- and Hilbert-space Kernels: Deviation and higher-moment inequalities for incomplete U-statistics in Banach spaces have been developed under minimal smoothness and degeneracy assumptions, supporting functional data (Giraudo, 3 May 2024, Giraudo, 18 Sep 2024).
  • High-dimensional and Infinite-order U-Statistics: Non-asymptotic bounds and bootstrap results control error for high-dimensional kernels and diverging order (e.g., random forests, subbagging) (Song et al., 2019, Chen et al., 2017). Data-driven bootstrap and wild/MG bootstrap techniques are effective and computationally feasible for incomplete U-statistics (Sturma et al., 2022, Schrab et al., 2022, Chen et al., 2017).

5. Applications in Testing and Machine Learning

Incomplete U-statistics are foundational for scalable inference and high-dimensional testing:

  • Testing with Many Polynomial Constraints: In situations where the number of constraints $p$ is large compared to $n$, incomplete U-statistics allow testing of (in)equalities via bootstrap-calibrated statistics, providing uniform type I error control without constrained optimization, even under "singular" hypotheses (Sturma et al., 2022).
  • Goodness-of-fit in Latent Structure Models: Testing large families of algebraically defined constraints (e.g., tetrads in latent tree models) is made tractable by incomplete U-statistics (e.g., $O(\ell^4)$ constraints with $N=2n$ evaluations) (Sturma et al., 2022).
  • Kernel Methods and Two-sample/Independence Testing: The MMDAggInc, HSICAggInc, and KSDAggInc tests use incomplete U-statistics for the Maximum Mean Discrepancy, Hilbert–Schmidt Independence Criterion, and Kernel Stein Discrepancy. These achieve minimax-optimal uniform separation rates, and wild bootstrap calibration matches permutation-based approaches in power at reduced computational cost (Schrab et al., 2022, Miglioli et al., 23 Oct 2025); a sketch of an incomplete MMD estimate follows this list.
  • Empirical Risk Minimization: In metric learning, clustering, ranking, and robust estimation, incomplete U-statistics provide $O(n^{-1/2})$ learning rates for ERM with $O(n)$ kernel evaluations, outperforming naive subsampling (Clémençon et al., 2015).
  • Network Method of Moments and Graph Statistics: Higher-order moments and motif counts in network models can be computed and tested via incomplete U-statistics; regime-specific inference is informed by the budget exponent $\alpha$ and network sparsity parameters (Shao et al., 2023).
  • Functional Data and Infinite-dimensional Kernels: Balanced incomplete designs and exponential inequalities for Hilbert- and Banach-valued data ensure statistical error control with fixed or slowly growing design sizes (Duembgen et al., 2022, Giraudo, 3 May 2024, Giraudo, 18 Sep 2024).
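As an illustration of the kernel-testing bullet above, the sketch below computes an incomplete U-statistic estimate of the squared MMD with a Gaussian kernel from paired samples; it follows the generic randomized construction of Section 1 rather than the specific MMDAggInc procedure, and the bandwidth, sample sizes, and mean shift are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_k(a, b, bw=1.0):
    # Gaussian (RBF) kernel evaluated row-wise between two (N, d) arrays.
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * bw ** 2))

def incomplete_mmd2(x, y, N, bw=1.0):
    # Incomplete U-statistic estimate of MMD^2(P, Q) from paired samples
    # x_i ~ P, y_i ~ Q: the order-2 MMD kernel
    #   h_ij = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    # is averaged over N random index pairs (i, j), i != j, instead of
    # over all C(n, 2) pairs.
    n = x.shape[0]
    i = rng.integers(0, n, size=N)
    j = rng.integers(0, n - 1, size=N)
    j = np.where(j >= i, j + 1, j)
    h = (gauss_k(x[i], x[j], bw) + gauss_k(y[i], y[j], bw)
         - gauss_k(x[i], y[j], bw) - gauss_k(x[j], y[i], bw))
    return h.mean()

n, d = 2000, 5
x = rng.standard_normal((n, d))           # P = N(0, I_5)
y = rng.standard_normal((n, d)) + 0.5     # Q: mean-shifted alternative
print(incomplete_mmd2(x, y, N=2 * n))     # clearly positive; ~0 under P = Q
```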

6. Algorithmic Construction and Practical Implementation

Advanced combinatorial and design-based sampling algorithms enable optimal variance and coverage properties:

  • Equireplicate Designs: For $m=2$, partitioning $K_n$ (the complete graph on the $n$ indices) into perfect matchings, Hamiltonian cycles, or cyclic hypergraphs yields $r$-equireplicate block collections that are optimal or near-optimal for variance and Berry–Esseen error (Miglioli et al., 23 Oct 2025); an elementary cyclic construction is sketched after this list.
  • Orthogonal Arrays: OA-based selection of sample blocks eliminates lower-order projections and achieves asymptotic efficiency with much smaller sample sizes ($m\gg\sqrt n$ in non-degenerate cases) (Kong et al., 2020).
  • Cyclic Designs for $m>2$: Construction using appropriately shifted index blocks supports higher-order settings and deterministic computational control (Miglioli et al., 23 Oct 2025).
  • Divide-and-conquer and Local Jackknife: For estimating nuisance terms (e.g., projections $g(x)$), divide-and-conquer or local-jackknife schemes provide estimator splitting suitable for bootstrapping (Sturma et al., 2022, Song et al., 2019).
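To illustrate the cyclic equireplicate idea for $m=2$, the sketch below uses one elementary construction (an illustrative choice, not the specific optimal designs of the cited papers): for each shift $s$ with $1\le s < n/2$, include every pair $\{i, (i+s)\bmod n\}$, so each index appears in exactly two pairs per shift and the design consists of $N = n\cdot|\text{shifts}|$ deterministic blocks.

```python
import numpy as np

def cyclic_pairs(n, shifts):
    # Cyclic m = 2 design: for each shift s, take the n pairs {i, (i+s) mod n}.
    # For distinct shifts with 1 <= s < n/2 no pair repeats, and every index
    # appears in exactly 2 * len(shifts) pairs (an equireplicate design).
    return np.array([(i, (i + s) % n) for s in shifts for i in range(n)])

def incomplete_u_cyclic(x, h, shifts):
    # Deterministic incomplete U-statistic averaged over the cyclic design.
    idx = cyclic_pairs(len(x), shifts)
    return np.mean(h(x[idx[:, 0]], x[idx[:, 1]]))

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
var_kernel = lambda a, b: 0.5 * (a - b) ** 2   # order-2 variance kernel again
# N = 3n = 3000 pairs instead of C(1000, 2) = 499,500; estimates Var(X) = 1.
print(incomplete_u_cyclic(x, var_kernel, shifts=[1, 2, 3]))
```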

Empirical demonstrations on large-scale datasets (e.g., CIFAR-10) confirm that such designs can deliver speedup factors of $10^2$–$10^3$ over permutation-based or full U-statistic computations, while maintaining power and error control (Miglioli et al., 23 Oct 2025).


Summary Table: Statistical/Computational Properties of Incomplete U-Statistics

| Feature | Complete U-Statistic | Random Incomplete | Equireplicate/Design-based |
| --- | --- | --- | --- |
| Complexity | $O(n^m)$ | $O(N)$ | $O(nr)$ |
| Variance (non-degenerate) | $O(1/n)$ | $O(1/N)$ | $O(1/(nr))$ |
| Asymptotic normality | CLT if projection $\neq 0$ | Uniform via Berry–Esseen bounds | Uniform even if kernel order grows |
| Control under degeneracy | Non-Gaussian limit / breakdown | Valid in mixed-degenerate settings | Valid for all designs |
| Limiting distribution rate | $O(n^{-1/2})$ | $O(N^{-1/2})$ | $O((nr)^{-1/2})$ |

7. Limitations and Open Questions

  • While uniform Berry–Esseen rates and consistency guarantees exist, some questions remain open, including sharp lower bounds in heavy-tailed regimes, extensions of $L_1$-consistency to full central limit theorems under $p<2$ moments, and the characterization of optimal designs for degenerate kernels as the kernel order diverges.
  • The limits of inference with highly aggressive pruning or extremely degenerate test statistics involve trade-offs between statistical accuracy, coverage, and computational savings that may depend intricately on network structure or kernel smoothness.

A plausible implication is that the structure of the design (random or combinatorial) can be explicitly tuned to match the required accuracy regime for specific inferential tasks, offering adaptive and scalable methodology extendable across nonparametric functionals, kernel tests, and large-scale machine learning.
