Empirical Distribution Functions (EDFs)
- EDFs are nonparametric estimators of the cumulative distribution function that compute the proportion of sample data at or below any given value, forming a basis for statistical inference.
- Classical results like the Glivenko–Cantelli theorem and the Dvoretzky–Kiefer–Wolfowitz inequality establish strong convergence properties and finite-sample confidence bands for EDFs.
- Recent extensions include applications to dependent data, ranked set sampling, time series analysis, and differentially private mechanisms, enhancing both theoretical insight and practical implementation.
An empirical distribution function (EDF) is a nonparametric estimator of the cumulative distribution function (CDF) of a population based on sampled data. Formally, for a sample $X_1, \dots, X_n$ from an unknown distribution $F$, the EDF is defined by
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},$$
where $\mathbf{1}\{\cdot\}$ is the indicator function. EDFs are foundational in nonparametric inference, providing pointwise unbiased estimators of $F$ with well-understood convergence and limiting properties. They serve as the basis for classical goodness-of-fit testing, resampling techniques, quantile estimation, empirical process theory, and the analysis of complex sampling designs and dependent data.
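As a minimal illustration of the definition, the sketch below evaluates the EDF at a point as the fraction of observations less than or equal to that point. The helper name `make_edf` is hypothetical and chosen only for this example.

```python
from bisect import bisect_right


def make_edf(sample):
    """Return F_n, the empirical distribution function of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def edf(x):
        # bisect_right counts the observations that are <= x.
        return bisect_right(xs, x) / n

    return edf


F_n = make_edf([2.0, 0.5, 1.5, 0.5])
print(F_n(0.5), F_n(1.0), F_n(2.0))  # 0.5 0.5 1.0
```

Sorting once and using binary search keeps each evaluation at logarithmic cost, which matters when the EDF is queried on a fine grid.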
1. Classical Results: Law of Large Numbers and Limiting Theory
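The central result governing the EDF is the Glivenko–Cantelli theorem, which states that for i.i.d. $X_1, \dots, X_n \sim F$,
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{almost surely as } n \to \infty.$$
This uniform law of large numbers underpins the strong consistency of the EDF as an estimator of $F$ (Coulibaly et al., 27 Feb 2025, Zähle, 2013). Finite-sample deviation bounds are given by the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality:
$$P\Bigl(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon\Bigr) \le 2 e^{-2 n \varepsilon^2}, \quad \varepsilon > 0,$$
and the corresponding uniform confidence band at level $1 - \alpha$:
$$F_n(x) \pm \sqrt{\frac{\ln(2/\alpha)}{2n}}, \quad x \in \mathbb{R}.$$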
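The band half-width follows from setting the DKW bound $2 e^{-2 n \varepsilon^2}$ equal to $\alpha$ and solving for $\varepsilon$. The sketch below computes that half-width and the band at a single point; the function names are hypothetical, and clipping the band to $[0, 1]$ is a common convention assumed here rather than something stated in the text.

```python
import math
from bisect import bisect_right


def dkw_halfwidth(n, alpha=0.05):
    """Half-width eps such that the DKW bound 2*exp(-2*n*eps**2) equals alpha."""
    if n <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need n > 0 and 0 < alpha < 1")
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def dkw_band(sample, x, alpha=0.05):
    """Lower and upper band values at x, clipped to [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    eps = dkw_halfwidth(n, alpha)
    f_n = bisect_right(xs, x) / n  # EDF value at x
    return max(f_n - eps, 0.0), min(f_n + eps, 1.0)


print(round(dkw_halfwidth(100, 0.05), 4))  # 0.1358
```

With 100 observations the 95% band is roughly 0.136 wide on each side, and the width shrinks at rate $n^{-1/2}$.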
The functional central limit theorem (Donsker's theorem) asserts that $\sqrt{n}\,(F_n - F)$ converges in distribution (in $\ell^\infty(\mathbb{R})$) to a mean-zero Gaussian process with covariance kernel $F(s \wedge t) - F(s)F(t)$, characterizing the limiting empirical process (Phandoidaen et al., 2021).
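For i.i.d. data, the covariance kernel of the limiting process can be checked directly from the indicator representation of the EDF. The short derivation below is a sketch of that standard computation, with $X$ denoting a generic observation with CDF $F$:

```latex
\operatorname{Cov}\bigl(\mathbf{1}\{X \le s\}, \mathbf{1}\{X \le t\}\bigr)
  = P(X \le s \wedge t) - F(s)\,F(t)
  = F(s \wedge t) - F(s)\,F(t),
\qquad
\operatorname{Var}\bigl(\sqrt{n}\,(F_n(t) - F(t))\bigr) = F(t)\bigl(1 - F(t)\bigr).
```

Setting $s = t$ recovers the binomial variance, so the pointwise fluctuations are largest near the median and vanish in the tails.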
2. Extensions to Dependent Data: Mixing and Functional Dependence
When data exhibit dependence, EDFs retain strong law and CLT properties under suitable mixing conditions. For strictly stationary, $\alpha$-mixing or $\beta$-mixing sequences with summable mixing coefficients,
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{almost surely},$$
and analogous results hold under related mixing conditions. Convergence rates and uniform limit laws depend on the decay rate of the mixing coefficients (Coulibaly et al., 27 Feb 2025, Zähle, 2013).
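A small simulation can make the dependent-data statement concrete. The sketch below generates a stationary Gaussian AR(1) path, which is a standard example of a geometrically mixing sequence, and measures the uniform distance between its EDF and the true stationary CDF. The function names, the coefficient `phi = 0.5`, and the seed are illustrative assumptions.

```python
import math
import random


def normal_cdf(x, sd):
    """CDF of a centered normal distribution with standard deviation sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def ar1_sup_deviation(n, phi=0.5, seed=0):
    """sup_x |F_n(x) - F(x)| for a stationary Gaussian AR(1) path of length n."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(1.0 - phi * phi)  # stationary standard deviation
    x = rng.gauss(0.0, sd)                 # start in the stationary law
    path = []
    for _ in range(n):
        path.append(x)
        x = phi * x + rng.gauss(0.0, 1.0)
    path.sort()
    # The supremum is attained at a jump of F_n, so checking both sides
    # of every order statistic is enough.
    return max(
        max((i + 1) / n - normal_cdf(v, sd), normal_cdf(v, sd) - i / n)
        for i, v in enumerate(path)
    )


for n in (100, 1000, 10000):
    print(n, ar1_sup_deviation(n))
```

The printed deviations should shrink as `n` grows, although more slowly than for independent data with the same marginal distribution because positive serial correlation inflates the variance of the EDF.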
Functional dependence, captured via the Wu–Shen or Berbee coupling constructions, allows even broader applicability. For a stationary or locally stationary process, a functional CLT holds for the EDF under polynomial decay of functional dependence measures. The empirical process limit has covariance incorporating both marginal and serial dependencies; in the stationary case it takes the long-run form $\sum_{k \in \mathbb{Z}} \operatorname{Cov}\bigl(\mathbf{1}\{X_0 \le s\}, \mathbf{1}\{X_k \le t\}\bigr)$ (Phandoidaen et al., 2021).
For high-dimensional vector-valued data, as in the Gaussian setting with general covariance matrix $\Gamma^{(m)}$ in dimension $m$, the behavior of the EDF of vector components is governed by an average off-diagonal covariance parameter $\gamma_m$: $$\gamma_m = m^{-2} \sum_{i \ne j} \Gamma^{(m)}_{i,j}.$$ Under "vanishing second order" assumptions, both LLN and CLT results extend, with the limit depending on the covariance only through $\gamma_m$ (Delattre et al., 2012).
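The average off-diagonal covariance parameter is simple to compute from a covariance matrix. The sketch below sums the off-diagonal entries and divides by $m^2$; the function name `gamma_m` and the $m^{-2}$ normalization are assumptions made for this illustration.

```python
def gamma_m(cov):
    """Sum of off-diagonal covariance entries divided by m**2."""
    m = len(cov)
    if m == 0 or any(len(row) != m for row in cov):
        raise ValueError("cov must be a non-empty square matrix")
    off_diag = sum(cov[i][j] for i in range(m) for j in range(m) if i != j)
    return off_diag / m ** 2


# Equicorrelated example: unit variances, common correlation rho.
rho, m = 0.2, 4
cov = [[1.0 if i == j else rho for j in range(m)] for i in range(m)]
print(gamma_m(cov))  # close to rho * (m - 1) / m = 0.15
```

In the equicorrelated case the parameter approaches the common correlation as the dimension grows, which gives a quick sanity check on any implementation.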
3. EDFs under Complex and Ranked Set Sampling Designs
In settings where direct measurement is costly and auxiliary information is available, design-based EDF estimators are constructed using ranked set sampling (RSS). McIntyre's RSS and its finite-population generalizations (level-0, level-1, level-2) yield EDFs with complex inclusion probabilities and correspondingly involved variance expressions. Explicit formulas for the required inclusion probabilities are derived for each design (Sevil et al., 2022). The resulting design-based EDF is unbiased or approximately unbiased for the finite-population distribution function.
Efficiency analyses using asymptotic relative efficiency (ARE) show that level-2 RSS-based EDFs dominate both simple random sampling (SRS) and other RSS designs in terms of variance reduction, especially with perfect or high-quality ranking. This structure allows practitioners to estimate distributional quantities and quantiles (e.g., medians) with improved precision in finite populations, provided the auxiliary ranking variable is sufficiently correlated with the study variable (Sevil et al., 2022).
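To show the mechanics of ranked set sampling, the sketch below draws a balanced ranked set sample from a finite population: for each rank, a fresh set is drawn, ordered by the auxiliary variable, and only the unit at that rank is measured. This is a simplified illustration that returns every set to the population before the next draw and uses an equal-weight EDF; it is not the design-based, inclusion-probability-weighted estimator of the cited work. All names are hypothetical.

```python
import random
from bisect import bisect_right


def ranked_set_sample(population, set_size, cycles, rng):
    """Balanced RSS from a list of (y, aux) pairs; returns the measured y values."""
    measured = []
    for _ in range(cycles):
        for rank in range(set_size):
            # Draw a fresh set without replacement and rank it by the
            # inexpensive auxiliary variable, not by y itself.
            group = rng.sample(population, set_size)
            group.sort(key=lambda unit: unit[1])
            measured.append(group[rank][0])  # measure only one unit per set
    return measured


def edf_at(values, x):
    """Equal-weight EDF of `values` evaluated at x."""
    return bisect_right(sorted(values), x) / len(values)


rng = random.Random(1)
# Hypothetical population: aux is a noisy proxy for the study variable y.
population = []
for _ in range(500):
    y = rng.gauss(0.0, 1.0)
    population.append((y, y + rng.gauss(0.0, 0.5)))

sample = ranked_set_sample(population, set_size=4, cycles=10, rng=rng)
print(len(sample), edf_at(sample, 0.0))
```

Only `set_size * cycles` units are actually measured, while `set_size` times as many are ranked, which is where the efficiency gain over simple random sampling comes from when ranking is cheap and reasonably accurate.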
4. EDFs in Sequential Testing, Time Series, and Multivariate Settings
EDFs are central to nonparametric sequential change-point detection in both univariate and multivariate time series. For a sequence $X_1, X_2, \dots$, the moving-window EDFs
$$F_{j:k}(x) = \frac{1}{k - j + 1} \sum_{i=j}^{k} \mathbf{1}\{X_i \le x\}, \quad 1 \le j \le k,$$
(with $\le$ understood componentwise for vectors) enable the construction of monitoring statistics sensitive to distributional changes (CUSUM-type, Cramér–von Mises-type, Anderson–Darling-type). Under strong mixing, these detectors have known limits, and thresholds can be calibrated via the dependent multiplier bootstrap (Kojadinovic et al., 2020, Holmes et al., 2022).
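The idea behind EDF-based change detection can be sketched by comparing the EDF of the observations before a candidate change point with the EDF of those after it. The code below is a minimal univariate sketch: the weight $k(n-k)/n^{3/2}$ is one common CUSUM-type choice assumed here, and the function names are hypothetical. It is not the exact detector of the cited papers and omits threshold calibration.

```python
def edf_segment(xs, j, k, x):
    """EDF of the observations xs[j-1], ..., xs[k-1] (1-based j..k) at x."""
    if not 1 <= j <= k <= len(xs):
        raise ValueError("need 1 <= j <= k <= len(xs)")
    window = xs[j - 1:k]
    return sum(1 for v in window if v <= x) / len(window)


def cusum_edf_statistic(xs):
    """Max over split points k of weight(k) * sup_x |F_{1:k}(x) - F_{k+1:n}(x)|."""
    n = len(xs)
    best = 0.0
    for k in range(1, n):
        weight = k * (n - k) / n ** 1.5
        # Both EDFs only jump at observed values, so the supremum over x
        # can be taken over the data points.
        gap = max(
            abs(edf_segment(xs, 1, k, x) - edf_segment(xs, k + 1, n, x))
            for x in xs
        )
        best = max(best, weight * gap)
    return best


shifted = [0.0] * 10 + [1.0] * 10    # abrupt change after observation 10
print(cusum_edf_statistic(shifted))  # equals 100 / 20**1.5 for this series
```

The brute-force evaluation is cubic in the series length and is meant only to mirror the formula; a practical monitor would update counts incrementally.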
In open-end and closed-end monitoring designs, the key innovation is the use of covariance-estimated Mahalanobis norms over finite grids, which retains sensitivity to all types of distributional changes—location, scale, tail, or dependence structure—regardless of coordinate labeling or dimension (Holmes et al., 2022).
5. EDFs in Nonparametric and Goodness-of-Fit Inference
Classical goodness-of-fit tests (Kolmogorov–Smirnov, Cramér–von Mises, Anderson–Darling, and their multivariate extensions) are based directly on functionals of the EDF, or on the difference between the EDF and a parametric (or null) CDF (Milošević et al., 2021). In the two-sample setting, tests compare two EDFs $F_n$ and $G_m$, e.g., through the two-sample Anderson–Darling statistic
$$A^2 = \frac{nm}{N} \int \frac{\bigl(F_n(x) - G_m(x)\bigr)^2}{H_N(x)\bigl(1 - H_N(x)\bigr)} \, dH_N(x), \quad N = n + m,$$
where $H_N = (n F_n + m G_m)/N$ is the pooled EDF. Recent developments integrate variance-stabilizing weights and $L^1$/Wasserstein-distance concepts to provide uniformly strong power in both classical (e.g., mean and variance shift) and challenging mixture/shape-change alternatives (Dowd, 2020).
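Two-sample EDF statistics are straightforward to compute by walking through the pooled sample. The sketch below returns the Kolmogorov–Smirnov distance (the largest gap between the two EDFs) and an Anderson–Darling-type statistic that weights squared gaps by the inverse variance factor of the pooled EDF. The function name is hypothetical, and the tie handling is the simplest possible one rather than a tie-adjusted variant.

```python
from bisect import bisect_right


def two_sample_edf_stats(x, y):
    """Return (KS distance, Anderson-Darling-type statistic) for samples x and y."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    total = n + m
    xs, ys = sorted(x), sorted(y)
    ks, ad, prev_count = 0.0, 0.0, 0
    for z in sorted(set(xs) | set(ys)):
        cx, cy = bisect_right(xs, z), bisect_right(ys, z)
        gap = cx / n - cy / m              # F_n(z) - G_m(z)
        ks = max(ks, abs(gap))
        count = cx + cy                    # pooled EDF times (n + m)
        if count < total:                  # weight is undefined where H = 1
            h = count / total
            mass = (count - prev_count) / total  # jump of the pooled EDF at z
            ad += gap * gap / (h * (1.0 - h)) * mass
        prev_count = count
    return ks, ad * n * m / total


print(two_sample_edf_stats([1, 2, 3], [4, 5, 6]))
```

For the fully separated samples in the example the Kolmogorov–Smirnov distance is 1, its maximum, while identical samples give 0 for both statistics.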
EDF-based tests exhibit well-understood local and asymptotic Bahadur efficiencies, with integral-type statistics (Cramér–von Mises, Anderson–Darling) outperforming supremum-type statistics (Kolmogorov–Smirnov) for normality testing in composite settings. Anderson–Darling generally offers the highest local efficiency (Milošević et al., 2021).
6. Extensions: Goodness-of-Fit for Latent Processes and High-Frequency Data
In high-frequency financial econometrics and stochastic process inference, EDFs are employed to estimate time-occupation and marginal CDFs of latent processes such as spot volatility. The realized EDF (REDF) is constructed from bias-corrected local estimators of volatility, each a pre-averaged, noise-robust estimator of instantaneous variance. Uniform consistency and functional CLTs for the REDF are established under microstructure noise and stochastic volatility/jump regimes (Christensen et al., 28 Jan 2026). This enables the construction of realized goodness-of-fit tests for volatility models, with critical values obtained via parametric bootstrap due to dependence in the limiting Gaussian process.
Simulation and empirical results confirm that the REDF-based procedure accurately recovers the true CDF of latent variance and enables powerful, well-sized testing for model fit in realistic high-frequency data settings (Christensen et al., 28 Jan 2026).
7. Differential Privacy, Applications, and Modern Directions
The release of entire EDFs under strong privacy constraints is addressed via differentially private mechanisms. DP-EDFs are constructed using dyadic tree-based Laplace mechanisms or function secret sharing to achieve $\varepsilon$-differential privacy. For a fixed set of query points, the DP-ECDF is obtained from the ECDF by adding Laplace noise along root-to-leaf paths in the query tree. As is standard for dyadic tree mechanisms, the expected squared error per query grows only polylogarithmically with the number of query points (Barczewski et al., 10 Feb 2025).
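A dyadic tree mechanism can be sketched as follows: bin the data by the query thresholds, build a binary tree of counts over the bins, add Laplace noise to every node, and answer each cumulative query by summing the few noisy nodes that tile the corresponding prefix. The code below is an illustrative sketch, not the construction of the cited paper. It assumes a power-of-two number of sorted thresholds, replace-one neighbouring datasets (so the sample size is fixed and the tree of counts has $L^1$ sensitivity at most twice the number of levels), and hypothetical function names.

```python
import random
from bisect import bisect_left


def laplace(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_ecdf(sample, thresholds, epsilon, rng=None):
    """Noisy ECDF values at sorted `thresholds` via a dyadic tree of counts."""
    rng = rng or random.Random()
    m, n = len(thresholds), len(sample)
    if m < 1 or m & (m - 1):
        raise ValueError("number of thresholds must be a power of two")
    if n == 0 or epsilon <= 0 or list(thresholds) != sorted(thresholds):
        raise ValueError("need data, epsilon > 0 and sorted thresholds")
    depth = m.bit_length() - 1
    leaves = [0] * m
    for x in sample:
        j = bisect_left(thresholds, x)   # first threshold that is >= x
        if j < m:                        # values above the last threshold are dropped
            leaves[j] += 1
    levels = [leaves]                    # levels[l][i] sums 2**l consecutive leaves
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[2 * i] + prev[2 * i + 1] for i in range(len(prev) // 2)])
    # Replacing one record changes at most two counts per level by one each.
    scale = 2.0 * (depth + 1) / epsilon
    noisy = [[c + laplace(scale, rng) for c in level] for level in levels]
    out = []
    for j in range(1, m + 1):            # prefix made of the first j leaves
        acc, start = 0.0, 0
        for l in range(depth, -1, -1):
            if j & (1 << l):             # dyadic block of 2**l leaves at `start`
                acc += noisy[l][start >> l]
                start += 1 << l
        out.append(acc / n)
    return out


data = [0.1, 0.4, 0.6, 0.9]
grid = [0.25, 0.5, 0.75, 1.0]
print(dp_ecdf(data, grid, epsilon=1.0, rng=random.Random(0)))
```

Each query touches at most one node per level, so the noise in any single answer is a sum of at most `depth + 1` Laplace terms. The raw output need not be monotone or lie in $[0, 1]$, which is why post-processing is usually applied afterwards.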
Applications include private release of ROC curves and calibration statistics (e.g., Hosmer–Lemeshow), with empirical studies demonstrating the effectiveness of post-processing techniques (such as isotonic regression) in reducing DP error without inflating privacy loss. Modern EDF methodology thus extends into machine learning, robust inference, and federated or distributed frameworks with strong privacy and communication guarantees.
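Isotonic post-processing can be sketched with the pool-adjacent-violators algorithm: neighbouring values that violate monotonicity are replaced by their average until the sequence is non-decreasing, and the result is clipped to $[0, 1]$. Because it operates only on the already-released noisy values, it does not consume additional privacy budget. The function name `isotonic_clip` is hypothetical.

```python
def isotonic_clip(values):
    """Least-squares non-decreasing fit (PAVA), then clipping to [0, 1]."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the current block mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([min(max(s / c, 0.0), 1.0)] * c)
    return fitted


noisy_ecdf = [0.2, 0.1, 0.5, 1.2]
print(isotonic_clip(noisy_ecdf))  # approximately [0.15, 0.15, 0.5, 1.0]
```

The first two values are pooled because they are out of order, and the last is clipped because a distribution function cannot exceed one.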
References:
- (Sevil et al., 2022) "Design-based estimators of distribution function in ranked set sampling with an application"
- (Coulibaly et al., 27 Feb 2025) "On the Glivenko-Cantelli theorem for real-valued empirical functions of stationary α-mixing and β-mixing sequences"
- (Phandoidaen et al., 2021) "Empirical process theory for nonsmooth functions under functional dependence"
- (Delattre et al., 2012) "On empirical distribution function of high-dimensional Gaussian vector components with an application to multiple testing"
- (Christensen et al., 28 Jan 2026) "The realized empirical distribution function of stochastic variance with application to goodness-of-fit testing"
- (Holmes et al., 2022) "Multi-purpose open-end monitoring procedures for multivariate observations based on the empirical distribution function"
- (Kojadinovic et al., 2020) "Nonparametric sequential change-point detection for multivariate time series based on empirical distribution functions"
- (Milošević et al., 2021) "Bahadur efficiency of EDF based normality tests when parameters are estimated"
- (Dowd, 2020) "A New ECDF Two-Sample Test Statistic"
- (Barczewski et al., 10 Feb 2025) "Differentially Private Empirical Cumulative Distribution Functions"
- (Zähle, 2013) "Marcinkiewicz-Zygmund and ordinary strong laws for empirical distribution functions and plug-in estimators"