Projected Empirical Processes
- Projected empirical processes are a statistical method that uses one-dimensional projections to reduce high-dimensional data and facilitate effective regression model testing.
- They construct empirical processes indexed by projected covariates, achieving computational scalability and weak convergence to Gaussian processes under regularity conditions.
- Multiple projections and p-value combination methods, such as the Cauchy statistic, enhance test power while mitigating the curse of dimensionality in complex regression settings.
A projected empirical process is a statistical methodology that extends classical empirical process theory to high- or infinite-dimensional covariate spaces by leveraging random or data-driven one-dimensional projections. These techniques construct goodness-of-fit tests for regression models—such as the functional linear model (FLM) or sparse high-dimensional regressions—by analyzing empirical processes indexed by projected covariates. Projected empirical processes address the curse of dimensionality, improve computational scalability, and provide theoretically sound testing procedures even in settings where the dimension of the covariate space competes with or exceeds the sample size.
1. Definition and Construction of Projected Empirical Processes
Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be independent observations, where $X_i$ lies in a high- or infinite-dimensional space $\mathcal{H}$ (such as a separable Hilbert space or $\mathbb{R}^p$ for large $p$), and $Y_i$ is a scalar response. In the functional linear model,

$$Y_i = \langle X_i, \beta \rangle + \varepsilon_i,$$

with $\beta$ in $\mathcal{H}$, a projected empirical process is constructed by first drawing a projection direction $\gamma$ (often randomly from a Gaussian measure when $\mathcal{H}$ is infinite-dimensional) and computing the scalar projections $\langle X_i, \gamma \rangle$.
The marked empirical process indexed by $u \in \mathbb{R}$ is then given by

$$R_n(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl(Y_i - \hat{m}(X_i)\bigr)\,\mathbf{1}\{\langle X_i, \gamma \rangle \le u\},$$

where $\hat{m}$ is a fitted or candidate regression function, and the normalizing factor is typically $n^{-1/2}$, ensuring the process is scaled for weak convergence analysis (Cuesta-Albertos et al., 2017).
In ultra-high-dimensional parametric regressions, a similar process is defined for a unit direction $\alpha$ as

$$V_n(u, \alpha) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl(Y_i - m(X_i, \hat{\theta})\bigr)\,\mathbf{1}\{\alpha^{\top} X_i \le u\},$$

with $\hat{\theta}$ a fitted parameter (Tan et al., 2 Jan 2026).
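As a concrete illustration, the construction above can be sketched in a few lines of NumPy for the finite-dimensional case. The names (`projected_empirical_process`, `m_hat`, `gamma`) are hypothetical, and the least-squares fit stands in for whichever null model is actually used:

```python
import numpy as np

def projected_empirical_process(X, Y, m_hat, gamma, grid):
    """Marked empirical process R_n(u) for one projection direction gamma.

    X: (n, p) covariate matrix, Y: (n,) responses, m_hat: candidate
    regression function, grid: evaluation points u for the process.
    """
    n = len(Y)
    proj = X @ gamma                    # scalar projections <X_i, gamma>
    resid = Y - m_hat(X)                # residuals under the candidate model
    # R_n(u) = n^{-1/2} * sum_i resid_i * 1{proj_i <= u}
    return np.array([resid[proj <= u].sum() for u in grid]) / np.sqrt(n)

# Toy usage: a correctly specified linear model
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.r_[1.0, np.zeros(p - 1)]
Y = X @ beta + 0.5 * rng.normal(size=n)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
gamma = rng.normal(size=p)
gamma /= np.linalg.norm(gamma)          # unit projection direction
grid = np.sort(X @ gamma)               # evaluate at the projected sample
Rn = projected_empirical_process(X, Y, lambda Z: Z @ beta_hat, gamma, grid)
```

Evaluating the process at the projected sample points, as here, is a common discretization that makes downstream KS/CvM functionals straightforward.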
2. Theoretical Foundations and Weak Convergence
The central theoretical advance is the reduction from high-dimensional covariates to scalar projections, allowing the application of classical empirical process tools. Under regularity assumptions—moment restrictions, estimator consistency, and suitable regularization—the projected empirical processes, under the null hypothesis, converge weakly to Gaussian processes conditionally on the projection direction.
In the FLM with estimation error controlled by Cardot–Mas–Sarda (CMS) regularization, the projected process converges weakly to a mean-zero Gaussian process with explicit covariance structure (Cuesta-Albertos et al., 2017). Similarly, for finite-dimensional sparse regressions (possibly with dimension exceeding the sample size), a martingale transformation can be applied to the projected process to remove the impact of the estimator's lack of asymptotic linearity, yielding a process converging in law to standard Brownian motion after a proper time change (Tan et al., 2 Jan 2026).
Almost sure equivalence theorems ensure that testing the projected process for a single randomly drawn direction $\gamma$ is almost surely equivalent to testing the full functional covariate (Cuesta-Albertos et al., 2017):
| Statement | Reference | Details |
|---|---|---|
| The null model holds iff $\mathbb{E}\bigl[(Y - m_0(X))\,\mathbf{1}\{\langle X, \gamma \rangle \le u\}\bigr] = 0$ for all directions $\gamma$ and all $u$ | (Cuesta-Albertos et al., 2017) | "No randomness" proposition |
| The null model holds iff the same moment condition is satisfied for a single $\gamma$ drawn from a nondegenerate Gaussian measure (almost surely) | (Cuesta-Albertos et al., 2017) | "Random projection" theorem |
3. Test Statistics, Martingale Transform, and Distribution-Free Inference
Classical continuous functionals such as Kolmogorov–Smirnov (KS) and Cramér–von Mises (CvM) statistics are employed on the projected processes:
- KS: $\mathrm{KS}_n = \sup_u |R_n(u)|$
- CvM: $\mathrm{CvM}_n = \int R_n(u)^2 \, dF_{n,\gamma}(u)$, where $F_{n,\gamma}$ is the empirical distribution of the projected covariates (Cuesta-Albertos et al., 2017)
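A minimal sketch of both functionals, assuming the process has been evaluated at the projected sample points so that integration against $F_{n,\gamma}$ reduces to an average over the grid (the function name is hypothetical):

```python
import numpy as np

def ks_cvm(Rn):
    """KS and CvM functionals of a projected process on a grid.

    KS = sup_u |R_n(u)|; CvM integrates R_n(u)^2 against the empirical
    law of the projected covariate, which is a plain average when the
    grid is the projected sample itself.
    """
    ks = float(np.max(np.abs(Rn)))
    cvm = float(np.mean(Rn**2))
    return ks, cvm

# e.g. for a process evaluated at three projected sample points:
ks, cvm = ks_cvm(np.array([0.1, -0.4, 0.2]))
```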
In ultra-high-dimensional regression (with dimension possibly far exceeding the sample size), nuisance-parameter shifts in the projected process are eliminated via a martingale transform defined in terms of Radon–Nikodym derivatives of conditional expectations and variances. The resulting martingale-transformed process is asymptotically distribution-free, and the CvM-type statistic converges in law to $\int_0^1 B(t)^2\,dt$ (where $B$ is standard Brownian motion), independently of the dimension and the regression parameters (Tan et al., 2 Jan 2026).
These advances enable practical tests with explicit limiting null distributions, circumventing parameter estimation challenges that preclude asymptotic normality or linearization in high dimensions.
4. Multiple Projections and p-Value Combination
Because a single projection can lose power against alternatives that are nearly orthogonal to the chosen direction, multiple independent projections are adopted. In functional settings, several directions $\gamma_1, \ldots, \gamma_K$ are drawn (often aligned with leading principal components), with test statistics and bootstrapped $p$-values computed for each. These are aggregated using procedures controlling the false discovery rate (FDR), such as the Benjamini–Hochberg rule (Cuesta-Albertos et al., 2017).
In ultra-high dimensions, $p$-values $p_1, \ldots, p_K$ from $K$ projections are combined using the Cauchy combination statistic:

$$T_{\mathrm{Cauchy}} = \frac{1}{K} \sum_{k=1}^{K} \tan\bigl\{\pi(0.5 - p_k)\bigr\},$$

with null distribution determined asymptotically by the standard Cauchy law, enabling valid inference even under dependence among projections (Tan et al., 2 Jan 2026).
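The combination rule is simple to implement; the sketch below (hypothetical function name) maps $K$ individual $p$-values to the statistic and then to a combined $p$-value via the standard Cauchy distribution function:

```python
import numpy as np

def cauchy_combination(pvals):
    """Cauchy combination of K p-values.

    T = (1/K) * sum_k tan(pi * (0.5 - p_k)); under H0, T is approximately
    standard Cauchy even when the p-values are dependent, so the combined
    p-value is the standard Cauchy survival function at T.
    """
    p = np.asarray(pvals, dtype=float)
    T = np.mean(np.tan(np.pi * (0.5 - p)))
    return 0.5 - np.arctan(T) / np.pi

# One strong signal among otherwise unremarkable projections still
# yields a small combined p-value:
combined = cauchy_combination([0.001, 0.6, 0.7, 0.4])
```

The heavy Cauchy tail is what lets one small $p_k$ dominate the average, which is exactly the behavior wanted when only a few projections align with the signal.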
Moreover, to address frequency sensitivity (empirical-process tests detect low-frequency alternatives, while local-smoothing tests detect high-frequency ones), hybrid tests are constructed by combining both types of $p$-values through the same Cauchy mechanism. This joint approach controls Type I error and improves detection across a spectrum of alternatives (Tan et al., 2 Jan 2026).
5. Implementation Strategies and Calibration
Practical implementation requires consistent estimation of projection directions and residuals, as well as calibration of critical values. In functional regression, wild bootstrap procedures—employing Rademacher multipliers to resample residuals—are used to estimate the null distribution of the projected test statistics (Cuesta-Albertos et al., 2017). The steps involve:
- Fit the FLM under the null with regularization.
- Compute test statistic from the observed data.
- Perform bootstrap resampling of the residuals, recompute the statistic, and estimate the empirical $p$-value.
- For multiple projections, combine $p$-values using FDR controls.
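The steps above can be sketched as follows; `fit` and `statistic` are hypothetical placeholders for the regularized null fit and the projected test functional:

```python
import numpy as np

def wild_bootstrap_pvalue(X, Y, fit, statistic, B=200, seed=1):
    """Wild-bootstrap calibration with Rademacher multipliers (sketch).

    `fit` returns a fitted regression function under the null model and
    `statistic` maps (X, Y, m_hat) to a scalar test statistic; both are
    placeholders for whichever null fit and functional are in use.
    """
    rng = np.random.default_rng(seed)
    m_hat = fit(X, Y)
    T_obs = statistic(X, Y, m_hat)
    resid = Y - m_hat(X)
    exceed = 0
    for _ in range(B):
        V = rng.choice([-1.0, 1.0], size=len(Y))   # Rademacher multipliers
        Y_star = m_hat(X) + V * resid              # perturbed responses
        m_star = fit(X, Y_star)                    # refit under the null
        exceed += statistic(X, Y_star, m_star) >= T_obs
    return (1 + exceed) / (B + 1)                  # finite-sample p-value
```

Multiplying residuals by random signs preserves their conditional second moments under the null, which is why this scheme reproduces the null distribution of the projected statistic.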
In ultra-high dimensions, the distribution-free theory eliminates the need for resampling, as the null distribution of the martingale-transformed statistic is universal, calibrated via the law of $\int_0^1 B(t)^2\,dt$ (Tan et al., 2 Jan 2026). This enables computational efficiency and scalability to very large dimensions and sample sizes.
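A Monte Carlo sketch of this calibration, discretizing Brownian motion on a grid to tabulate quantiles of $\int_0^1 B(t)^2\,dt$ once and for all (function name hypothetical):

```python
import numpy as np

def cvm_limit_quantile(alpha=0.05, M=10000, steps=400, seed=0):
    """Monte Carlo (1 - alpha)-quantile of integral_0^1 B(t)^2 dt,
    the universal null law of the martingale-transformed CvM statistic."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    # Brownian paths: cumulative sums of independent N(0, dt) increments
    B = np.cumsum(rng.normal(scale=np.sqrt(dt), size=(M, steps)), axis=1)
    integrals = (B**2).sum(axis=1) * dt        # Riemann sums of B(t)^2
    return float(np.quantile(integrals, 1.0 - alpha))

crit = cvm_limit_quantile(0.05)  # reject when the observed CvM exceeds crit
```

Because the limit law is free of the data-generating parameters, this table needs to be computed only once, regardless of the dimension or the model being tested.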
6. Power Properties and Finite-Sample Performance
Simulation studies in FLM settings demonstrate that projected CvM tests consistently achieve high power, outperforming KS-based approaches and maintaining Type I error for modest numbers of projections (up to $5$). Increasing the number of projections further leads to over-conservative behavior, due to the discreteness of bootstrap $p$-values under FDR correction, with a moderate number offering a robust compromise (Cuesta-Albertos et al., 2017). Compared to previously proposed tests that average over fixed, non-random projections, randomly projected procedures offer a significant reduction in computational complexity, with only a modest sacrifice in power in small samples.
In high-dimensional regression, the martingale-transformed, projected empirical process tests maintain correct size and detect a wide class of alternatives provided that at least one projection is sensitive to the signal. The hybridization with local-smoothing statistics broadens power across oscillatory alternatives (Tan et al., 2 Jan 2026).
7. Broader Impact, Limitations, and Comparative Perspective
Projected empirical processes constitute a robust and scalable paradigm for hypothesis testing in both functional data analysis and high-dimensional statistics. They mitigate the curse of dimensionality by dimension reduction through projections, leverage theoretical properties such as almost sure equivalence of projected and full-covariate nulls, and facilitate computationally efficient inference.
However, some limitations remain. Power loss can occur for alternatives nearly orthogonal to all chosen projections, motivating aggregation across multiple projections. For FDR-type combination rules in the functional setting, a large number of projections leads to over-conservative performance due to the discrete distribution of bootstrap-based $p$-values (Cuesta-Albertos et al., 2017). In the high-dimensional context, sufficiently accurate rate-bounded estimators (rather than asymptotically normal ones) are still required, though this condition is notably weaker than classical assumptions (Tan et al., 2 Jan 2026).
The theoretical and empirical advances summarized here delineate a comprehensive and flexible framework for goodness-of-fit testing in modern regression settings, with implications for a wide range of applications in statistics, econometrics, and machine learning.