
Weighted ECDF: Theory, Methods, and Applications

Updated 22 November 2025
  • The weighted ECDF is a nonparametric estimator for data whose observations carry unequal weights; it captures the empirical distribution under non-uniform sampling.
  • Its asymptotic theory, including central limit theorems and Edgeworth expansions, supports accurate quantile and risk estimation in non-i.i.d. settings.
  • Its applications span survey sampling, missing-data corrections, and time-dependent analysis, supporting valid estimation under informative selection and complex dependence structures.

A weighted empirical cumulative distribution function (weighted ECDF) is a central nonparametric estimator used when observations in a dataset are assigned non-uniform, often data- or design-driven, weights. This generalization of the classical ECDF is instrumental in survey sampling, dealing with missing data, informative selection, and non-i.i.d. data. The weighted ECDF captures the empirical distribution of a population under non-uniform selection or contribution scenarios, and serves as the foundation for risk estimation, quantile inference, and asymptotic distributional theory under complex data-generating processes.

1. Formal Definition and Principal Forms

Let $X_1, \dots, X_n$ be independent (not necessarily identically distributed) random variables on $\mathbb{R}^s$, and let $w_1, \dots, w_n$ be associated deterministic, non-negative weights satisfying $\sum_{i=1}^n w_i = n$. The weighted ECDF is defined as

$$\widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n w_i \, \mathbf{1}\{X_i \le x\}, \quad x \in \mathbb{R}^s,$$

with expectation

$$F_n(x) = \frac{1}{n}\sum_{i=1}^n w_i F_{i}(x),$$

where $F_i$ is the distribution function of $X_i$ (Withers et al., 2010). Design-weighted ECDFs arise naturally with Horvitz–Thompson weights in survey sampling, as well as under length-biased or informative sampling (Bonnéry et al., 2012; Gharbi et al., 8 Oct 2025). In time-dependent or functional data regimes, the weighted ECDF generalizes further by introducing a weight function $w(x)$ applied to process values, leading to empirical processes indexed by both $t$ and $x$ (Yang, 2014).
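The definition above can be sketched in a few lines of NumPy for the scalar case $s = 1$. This is a minimal illustration, not code from any of the cited papers: the function name `weighted_ecdf` and its arguments are hypothetical, and the weights are rescaled internally so that they sum to $n$.

```python
import numpy as np

def weighted_ecdf(x, weights, points):
    """Evaluate (1/n) * sum_i w_i * 1{X_i <= t} for each t in `points` (scalar data)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    if x.ndim != 1 or x.shape != w.shape:
        raise ValueError("x and weights must be 1-D arrays of equal length")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    n = x.size
    # Assumption: rescale so that sum(w) == n, matching the normalization in the text.
    w = w * (n / w.sum())
    order = np.argsort(x, kind="stable")
    xs = x[order]
    # Cumulative weight up to each sorted observation, with a leading 0 for t < min(x).
    cum = np.concatenate(([0.0], np.cumsum(w[order]) / n))
    # side="right" counts observations with X_i <= t (ties included).
    idx = np.searchsorted(xs, np.asarray(points, dtype=float), side="right")
    return cum[idx]
```

With all weights equal to 1 this reduces to the classical ECDF.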

2. Asymptotic Theory: Weak Convergence, CLT, Edgeworth Expansions

The asymptotic distribution of the weighted ECDF and its functionals depends crucially on the weighting structure and the (non-)identical distribution of the data. Under suitable regularity, a central limit theorem (CLT) applies to smooth functionals $T(\widehat{F}_n)$, yielding

$$\sqrt{n} \, \big(T(\widehat{F}_n) - T(F_n)\big) \to_d \mathcal{N}\big(0,\,a_{2,1}\big),$$

where $a_{2,1}$ is the asymptotic variance involving the first von Mises derivative of $T$ and the weights (Withers et al., 2010).
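As a worked check in the simplest case, take $s = 1$ and the mean functional $T(F) = \int x \, dF(x)$, assuming each $X_i$ has finite mean $\mu_i$ and variance $\sigma_i^2$ (notation introduced here for illustration only). Independence gives the finite-sample variance directly. Identifying its limit with $a_{2,1}$ is an assumption of this sketch, not a statement taken from the cited paper.

```latex
\begin{aligned}
T(\widehat{F}_n) &= \frac{1}{n}\sum_{i=1}^n w_i X_i,
\qquad
T(F_n) = \frac{1}{n}\sum_{i=1}^n w_i \mu_i, \\
\operatorname{Var}\!\Big(\sqrt{n}\,\big(T(\widehat{F}_n) - T(F_n)\big)\Big)
&= \frac{1}{n}\sum_{i=1}^n w_i^2 \sigma_i^2 .
\end{aligned}
```

With $w_i = 1$ and a common variance $\sigma^2$ this reduces to the classical value $\sigma^2$. Since $\sum_i w_i = n$, the Cauchy–Schwarz inequality gives $\sum_i w_i^2 \ge n$, so under a common variance unequal weights can only inflate the variance.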

For finer finite-sample accuracy, a third-order Edgeworth–Cornish–Fisher expansion provides percentile corrections for the standardized statistic $Y_n$:

$$P(Y_n \le x) = \Phi(x) + n^{-1/2} h_1(x)\varphi(x) + n^{-1} h_2(x)\varphi(x) + O(n^{-3/2}),$$

with explicit Hermite-polynomial-based terms $h_1$, $h_2$ and cumulant expressions involving the weights and higher-order von Mises derivatives (Withers et al., 2010). This allows construction of confidence intervals and approximations with coverage error $O(n^{-3/2})$, rather than the $O(n^{-1/2})$ of basic CLT-based inference.
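The expansion can be evaluated numerically once the correction terms are known. The sketch below assumes the caller supplies $h_1$ and $h_2$ as Python callables. The function names are hypothetical, and the weighted-case expressions for $h_1$ and $h_2$ from the cited paper are not reproduced here.

```python
import math

def std_normal_pdf(x):
    """Standard normal density, written phi(x) in the text."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal distribution function, written Phi(x) in the text."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def edgeworth_cdf(x, n, h1, h2):
    """Phi(x) + n^{-1/2} h1(x) phi(x) + n^{-1} h2(x) phi(x), dropping the O(n^{-3/2}) remainder."""
    phi = std_normal_pdf(x)
    return std_normal_cdf(x) + h1(x) * phi / math.sqrt(n) + h2(x) * phi / n

# Illustrative use with the classical i.i.d. skewness term h1(x) = -kappa3 * (x^2 - 1) / 6,
# NOT the weighted-case term; kappa3 = 1.0 is an arbitrary value chosen for the example.
kappa3 = 1.0
approx = edgeworth_cdf(0.0, 100, lambda x: -kappa3 * (x * x - 1.0) / 6.0, lambda x: 0.0)
```

Setting both correction terms to zero recovers the plain normal approximation, which makes the size of the corrections easy to inspect.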

In time-dependent and functional settings, the weak convergence of the weighted empirical process to a Gaussian limit in $\ell^\infty(E \times [0,1])$ depends on the regularity and decay of $w(x)$, tail conditions, and local oscillation (WL-condition) (Yang, 2014). For instance, if $w(x)$ is regularly varying at zero and other constraints hold, the process converges to a mean-zero Gaussian process with

$$\operatorname{Cov}(G(s, x), G(t, y)) = w(x)w(y) \big\{ P(X(s) \le x, X(t) \le y) - x y \big\}.$$

3. Weighted ECDF under Informative Selection and Complex Sampling

Informative selection from finite populations results in empirical CDFs that do not converge to the true superpopulation CDF, but rather to a weighted version that reflects the selection mechanism. Under such sampling designs, the limiting CDF is

$$F_s(\alpha) = \frac{\int_{-\infty}^\alpha m(y)f(y)\,dy}{\int_{-\infty}^\infty m(y)f(y)\,dy},$$

where $m(y)$ is the asymptotic inclusion propensity as a function of the outcome variable, and $f(y)$ is the superpopulation density (Bonnéry et al., 2012). Uniform $L_2$ and almost sure convergence are obtained under weak dependence and moment conditions on the sampling indicators, even in the presence of dependence among sampled units.

This weighting construction subsumes classical Glivenko–Cantelli convergence (when $m(y)$ is constant), and allows modeling of length-biased, PPS, and stratified sampling designs. The limit theory provides explicit guidelines for the validity of empirical process inference under complex survey and observational study designs.
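The limiting CDF $F_s$ can be evaluated numerically for a given propensity $m$ and density $f$. The sketch below uses a plain trapezoid rule on a truncated interval. The function name, the truncation bounds `lower` and `upper`, and the grid size are assumptions made for illustration, and `m` and `f` must accept NumPy arrays.

```python
import numpy as np

def selection_limit_cdf(alpha, m, f, lower, upper, num=20001):
    """Approximate F_s(alpha) = int_{-inf}^{alpha} m f / int m f on [lower, upper]."""
    grid = np.linspace(lower, upper, num)
    g = m(grid) * f(grid)
    # Cumulative trapezoid rule: cum[j] approximates the integral of g from lower to grid[j].
    steps = 0.5 * (g[1:] + g[:-1]) * np.diff(grid)
    cum = np.concatenate(([0.0], np.cumsum(steps)))
    # Interpolate the running integral at alpha, then normalize by the total mass.
    return np.interp(alpha, grid, cum) / cum[-1]

# Length-biased sampling of a unit exponential: m(y) = y, f(y) = exp(-y) on y >= 0.
# The exact limit is the Gamma(2, 1) CDF, 1 - (1 + alpha) * exp(-alpha).
biased = selection_limit_cdf(1.0, lambda y: y, lambda y: np.exp(-y), 0.0, 50.0)

# A constant propensity recovers the superpopulation CDF, 1 - exp(-alpha).
unbiased = selection_limit_cdf(1.0, lambda y: np.ones_like(y), lambda y: np.exp(-y), 0.0, 50.0)
```

In this example the length-biased limit at $\alpha = 1$ is well below the superpopulation value, which is the gap an unweighted ECDF would silently inherit.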

4. Weighted ECDFs and Missing Data: Inverse Probability Weighting and Smoothing

When responses are subject to missingness under a missing at random (MAR) assumption, the inverse probability weighted (IPW) ECDF is formulated as

$$\widetilde{F}_n(y) = \frac{1}{n}\sum_{i=1}^n \frac{\delta_i}{\pi_i} \, \mathbf{1}\{Y_i \le y\},$$

where $\delta_i$ is the response indicator and $\pi_i$ is the probability that $Y_i$ is observed (the response propensity) (Gharbi et al., 8 Oct 2025). Bernstein polynomial smoothing of $\widetilde{F}_n$ yields monotone, boundary-corrected estimators which achieve optimal mean integrated squared error with an explicit, sample-size-dependent smoothing degree:

$$m_{\mathrm{opt}} = n^{2/3}\left[\frac{4 \int_0^1 B(y)^2\,dy}{\int_0^1 V(y)\,dy}\right]^{2/3},$$

where $B(y)$ and $V(y)$ are explicit bias and variance functionals depending on the unknown true CDF and the weighting structure. Feasible estimators, with propensities estimated from auxiliary variables, show reduced variance relative to pseudo (oracle) versions (Gharbi et al., 8 Oct 2025).
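A minimal sketch of these three ingredients follows, assuming responses rescaled to $[0, 1]$ and propensities supplied by the caller. All function names are hypothetical. The smoother is the standard Bernstein form $\sum_{k=0}^m F(k/m)\binom{m}{k} y^k (1-y)^{m-k}$, which may differ in detail from the cited estimator, and the integrals of $B^2$ and $V$ are taken as given inputs rather than estimated.

```python
from math import comb
import numpy as np

def ipw_ecdf(y, delta, pi, points):
    """(1/n) * sum_i (delta_i / pi_i) * 1{Y_i <= t} for each t in `points`."""
    y = np.asarray(y, dtype=float)          # missing responses may be stored as NaN
    delta = np.asarray(delta, dtype=float)  # 1 if observed, 0 if missing
    pi = np.asarray(pi, dtype=float)
    points = np.atleast_1d(np.asarray(points, dtype=float))
    if np.any(pi <= 0) or np.any(pi > 1):
        raise ValueError("propensities must lie in (0, 1]")
    w = delta / pi
    # NaN <= t evaluates to False, so missing responses contribute nothing.
    indicator = y[:, None] <= points[None, :]
    return (w[:, None] * indicator).sum(axis=0) / y.size

def bernstein_smooth(cdf_fn, m, y):
    """Degree-m Bernstein polynomial of a CDF on [0, 1], evaluated at y."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    k = np.arange(m + 1)
    coef = np.array([comb(m, j) for j in k], dtype=float)
    basis = coef * y[:, None] ** k * (1.0 - y[:, None]) ** (m - k)
    return basis @ cdf_fn(k / m)

def optimal_degree(n, int_b2, int_v):
    """m_opt = n^(2/3) * (4 * int B^2 / int V)^(2/3), with both integrals supplied."""
    return n ** (2.0 / 3.0) * (4.0 * int_b2 / int_v) ** (2.0 / 3.0)
```

A smoothed IPW estimate on a grid `t` is then `bernstein_smooth(lambda u: ipw_ecdf(y, delta, pi, u), m, t)`. Because this form divides by $n$ rather than by the sum of the weights, the unsmoothed estimate is not guaranteed to equal 1 at the right end of the range.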

5. Functional Weighted ECDFs and Empirical Processes in Dependent Data

Weighted ECDFs extend to time-dependent or functional data by assigning location- and value-dependent weights. For stochastic processes $X(t)$, the empirical process constructed as

$$U_n(t,x) = n^{-1/2} \sum_{i=1}^n w(x)\left( \mathbf{1}\{X_i(t) \le x\} - x \right)$$

provides a general framework for weighted Donsker-type theorems under broad regularity. The limit Gaussian process is characterized through process-level covariances determined by the weights and the temporal covariance structure of $X$ (Yang, 2014). Applications include distributional analysis of copula processes, inference on smoothed quantile curves, and high-dimensional (functional) data settings.
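The process $U_n(t, x)$ can be computed on a grid from $n$ sampled paths. The sketch below assumes the paths are stored as an array of shape `(n, T)` and that each $X_i(t)$ has a uniform marginal on $[0, 1]$, which is what centering by $x$ implies. The function name and array layout are illustrative choices.

```python
import numpy as np

def weighted_empirical_process(paths, x_grid, w):
    """Return U_n(t_j, x_k) as an array of shape (T, K).

    paths  : (n, T) array, paths[i, j] = X_i(t_j)
    x_grid : (K,) array of levels x in [0, 1]
    w      : callable weight function, applied elementwise to x_grid
    """
    paths = np.asarray(paths, dtype=float)
    x_grid = np.asarray(x_grid, dtype=float)
    n = paths.shape[0]
    # indicator[i, j, k] = 1{X_i(t_j) <= x_k}
    indicator = paths[:, :, None] <= x_grid[None, None, :]
    # Sum over paths and subtract n * x (assumes uniform marginals).
    centered = indicator.sum(axis=0) - n * x_grid
    return w(x_grid) * centered / np.sqrt(n)
```

Passing `w = lambda x: np.ones_like(x)` gives the unweighted empirical process, a convenient baseline when experimenting with weight functions.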

6. Practical Implications and Examples

Weighted ECDFs arise across survey methodology, regression diagnostics, Bayesian estimation, and non-i.i.d. inference. In survey sampling, Horvitz–Thompson and design-based weighted ECDFs ensure unbiasedness under complex sampling. In missing data, IPW and smoothed estimators correct for selection bias and are validated via simulation and real-world applications such as NHANES plasma glucose estimation (Gharbi et al., 8 Oct 2025). For dependent or time-indexed data, weighted empirical processes underpin advanced central limit theorems for functional statistics (Yang, 2014).

Rigorous asymptotics (CLT and higher-order expansions) validate quantile inference, confidence interval construction, and bias correction for functionals in weighted and/or non-i.i.d. data (Withers et al., 2010). Empirical convergence results imply that practitioners must carefully account for the weighting structure—whether arising from design, missingness, or dependence—for correct inferential procedures. Failure to do so leads to invalid conclusions about superpopulation distributions, particularly under informative designs (Bonnéry et al., 2012).
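Quantile inference from a weighted ECDF typically starts from its generalized inverse, the smallest observed value at which the weighted ECDF reaches a level $p$. A minimal sketch for scalar data follows. The function name is hypothetical and the weights are normalized by their sum, which is an assumption of this example.

```python
import numpy as np

def weighted_quantile(x, weights, p):
    """Smallest observed value whose normalized cumulative weight is at least p, for p in (0, 1]."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    order = np.argsort(x, kind="stable")
    xs = x[order]
    cum = np.cumsum(w[order]) / w.sum()
    # First index at which the cumulative weight reaches p.
    idx = np.searchsorted(cum, p, side="left")
    # Guard against floating-point shortfall in cum[-1] when p == 1.
    idx = min(int(idx), xs.size - 1)
    return xs[idx]
```

With design weights $w_k = 1/\pi_k$ this gives a design-weighted sample quantile. Ignoring the weights amounts to calling it with all weights equal, which targets the selection-biased limit rather than the superpopulation quantile.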

7. Summary Table: Main Regimes for Weighted ECDFs

| Scenario | Weighted ECDF Formulation | Key Reference |
|---|---|---|
| Non-i.i.d., arbitrary $w$ | $\widehat F_n(x) = \frac{1}{n}\sum w_i \mathbf{1}\{X_i \le x\}$ | (Withers et al., 2010) |
| Survey sampling | $F^w(\alpha) = \frac{1}{N} \sum w_k \mathbf{1}\{Y_k \le \alpha\}$, $w_k = 1/\pi_k$ | (Bonnéry et al., 2012) |
| Missing data (IPW) | $\widetilde F_n(y) = \frac{1}{n}\sum \frac{\delta_i}{\pi_i} \mathbf{1}\{Y_i \le y\}$ | (Gharbi et al., 8 Oct 2025) |
| Time-dependent process | $U_n(t,x) = n^{-1/2}\sum w(x)(\mathbf{1}\{X_i(t)\le x\} - x)$ | (Yang, 2014) |

The weighted ECDF unifies and extends classical empirical process theory, underpinning modern statistical methodologies for biased, dependent, and incomplete data. Comprehensive asymptotic analysis and practical simulation results guide applied usage and further theoretical development across domains where sample representativity is nontrivial.
