Weighted Empirical Distribution Methods
- The weighted empirical distribution generalizes the classical ECDF by using non-uniform, possibly data-dependent weights to adjust for heterogeneity and bias.
- The methodology incorporates higher-order asymptotic expansions, such as Edgeworth–Cornish–Fisher expansions, to improve quantile estimation and confidence interval accuracy.
- This framework enhances robust statistical inference and nonparametric analysis by enabling corrections for non-identically distributed data and complex sampling designs.
A weighted empirical distribution generalizes the classical empirical distribution by assigning non-uniform, possibly data-dependent, and potentially predetermined weights to individual observations. This concept is crucial in contemporary statistical inference, nonparametric theory, large-scale resampling, importance sampling, transfer learning, optimal risk minimization under dataset shift, and robust and distributionally-constrained modeling. The weighted empirical distribution appears both as an explicit object of inference—enabling corrections for heterogeneity or bias—and as an implicit tool for constructing statistics, likelihoods, and loss functions in high-dimensional and complex data scenarios.
1. Definition and Fundamental Properties
Given independent observations $X_1, \dots, X_n$ (not necessarily identically distributed) and nonnegative weights $w_1, \dots, w_n$ normalized so that $\sum_{i=1}^{n} w_i = 1$ (other normalizations are used in some conventions), the weighted empirical distribution is
$$F_{n,w}(x) = \sum_{i=1}^{n} w_i \, \mathbf{1}\{X_i \le x\}.$$
When $w_i = 1/n$ for all $i$, $F_{n,w}$ reduces to the classical empirical cumulative distribution function (ECDF). In general, $F_{n,w}$ is a discrete probability measure supported on the sample points but with non-uniform masses.
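As a minimal sketch of the definition above, the following Python function evaluates a weighted empirical CDF on a grid. The function name `weighted_ecdf` and the toy data are illustrative assumptions, not part of the cited work.

```python
import numpy as np

def weighted_ecdf(sample, weights, x):
    """Evaluate sum_i w_i * 1{X_i <= x} at each point of x."""
    sample = np.asarray(sample, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if sample.shape != weights.shape:
        raise ValueError("sample and weights must have the same shape")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # indicator[j, i] equals 1.0 when X_i <= x_j and 0.0 otherwise
    indicator = (sample[None, :] <= x[:, None]).astype(float)
    return indicator @ weights

# Hypothetical toy data and weights.
data = np.array([2.0, 0.5, 1.5, 3.0])
uniform = np.full(data.size, 1.0 / data.size)   # recovers the classical ECDF
tilted = np.array([0.1, 0.4, 0.3, 0.2])         # non-uniform masses
grid = np.array([0.0, 1.0, 2.0, 3.0])

print(weighted_ecdf(data, uniform, grid))
print(weighted_ecdf(data, tilted, grid))
```

With uniform weights the output coincides with the ordinary ECDF; with the tilted weights the jumps at the sample points are the assigned masses rather than $1/n$.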
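The expectation of $F_{n,w}(x)$ under sampling is
$$E\left[F_{n,w}(x)\right] = \sum_{i=1}^{n} w_i \, F_i(x),$$
where $F_i$ is the true distribution function of $X_i$ and $w_i$ is the weight attached to $X_i$.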
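The expectation identity can be checked by simulation. The sketch below assumes, purely for illustration, that each observation is normal with its own mean and unit variance; the means, weights, and evaluation point are hypothetical choices.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
means = np.array([-1.0, 0.0, 0.5, 2.0])    # hypothetical location of each observation
weights = np.array([0.1, 0.4, 0.3, 0.2])   # hypothetical weights, summing to 1
x0 = 0.75                                  # point at which the CDF is evaluated
reps = 100_000

# Each row is one independent sample; column i is drawn from N(means[i], 1).
samples = rng.normal(loc=means, scale=1.0, size=(reps, means.size))

# Weighted empirical CDF at x0 for every replication, then its Monte Carlo mean.
mc_mean = ((samples <= x0).astype(float) @ weights).mean()

# Weighted mixture of the true distribution functions at x0.
exact = sum(w * NormalDist(mu=m, sigma=1.0).cdf(x0) for w, m in zip(weights, means))

print(mc_mean, exact)
```

The two printed values should agree up to Monte Carlo error, which shows that the weighted empirical CDF targets the weighted mixture of the individual distribution functions rather than any single one of them.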
Weighted empirical distributions facilitate correct representation of data under non-identically distributed samples, enable efficiency improvements through deterministic or data-adaptive reweighting, and underpin modern procedures for robust modeling and inference in heterogeneous data environments (Withers et al., 2010).
2. Edgeworth, Cornish–Fisher, and Higher Order Expansions
Weighted empirical distributions support not only classical Central Limit Theorem (CLT) results for plug-in estimators $T(F_{n,w})$, but also refined, higher-order distributional approximations. For any smooth functional $T$ (e.g., mean, variance, quantiles, or functionals arising in statistical estimation), the sampling distribution of the plug-in estimator can be expanded to third order as
$$P(Y_n \le x) = \Phi(x) - \phi(x)\left[ n^{-1/2} h_1(x) + n^{-1} h_2(x) \right] + O(n^{-3/2}),$$
where $Y_n$ is the standardized plug-in estimator, $\Phi$ and $\phi$ are the standard normal CDF and density, and $h_1$, $h_2$ are Hermite-polynomial-based correction terms determined by cumulants involving von Mises derivatives and the weight structure.
These expansions provide accurate quantile inference and coverage for confidence intervals beyond first-order CLT normality, correcting both bias and variance due to heterogeneity and weighting, and yielding meaningful improvements in classical statistical inferential tasks (Withers et al., 2010).
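To illustrate what a higher-order correction buys, the sketch below uses the simplest special case, which is an assumption chosen for checkability and not the weighted setting of the cited paper: the standardized mean of $n$ i.i.d. Exp(1) variables with equal weights. Here the exact distribution is available in closed form, and the classical one-term Edgeworth correction $\Phi(x) - \phi(x)\,\gamma\,(x^2 - 1)/(6\sqrt{n})$, with skewness $\gamma = 2$, can be compared against the plain CLT approximation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def exact_cdf_standardized_mean(x, n):
    """P(sqrt(n) * (mean - 1) <= x) for n i.i.d. Exp(1) draws; their sum is Gamma(n, 1)."""
    t = n + math.sqrt(n) * x
    if t <= 0:
        return 0.0
    # Gamma(n, 1) CDF for integer n: 1 - exp(-t) * sum_{k < n} t^k / k!
    tail = sum(math.exp(k * math.log(t) - t - math.lgamma(k + 1)) for k in range(n))
    return 1.0 - tail

def edgeworth_one_term(x, n, skewness):
    """CLT approximation plus the first (n^{-1/2}) Edgeworth correction term."""
    correction = skewness * (x * x - 1.0) / (6.0 * math.sqrt(n))
    return STD_NORMAL.cdf(x) - STD_NORMAL.pdf(x) * correction

n = 10
for x in (-2.0, 0.0, 2.0):
    exact = exact_cdf_standardized_mean(x, n)
    clt_error = abs(STD_NORMAL.cdf(x) - exact)
    edgeworth_error = abs(edgeworth_one_term(x, n, skewness=2.0) - exact)
    print(x, clt_error, edgeworth_error)
```

At these evaluation points the corrected approximation is noticeably closer to the exact value than the normal approximation, even for $n = 10$. In the weighted, non-identically distributed setting the correction terms additionally depend on the weights and on the individual distributions.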
3. Cumulant Expansions and von Mises Derivatives
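To obtain higher-order expansions for $T(F_{n,w})$, Withers et al. (2010) employ von Mises–Taylor expansions for functionals defined not only on probability measures but also on signed measures of total measure $1$. For two such measures $F$ and $G$, the general expansion is
$$T(G) = T(F) + \sum_{r \ge 1} \frac{1}{r!} \int \cdots \int T_F(x_1, \dots, x_r) \prod_{j=1}^{r} d(G - F)(x_j),$$
where $T_F(x_1, \dots, x_r)$ is the $r$-th order von Mises derivative, taken to be symmetric in its arguments and defined so that $\int T_F(x_1, \dots, x_r) \, dF(x_1) = 0$ for all $r \ge 1$.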
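A concrete case helps fix ideas. For the variance functional $T(F) = \int x^2 \, dF - (\int x \, dF)^2$, the standard first and second von Mises derivatives are $T_F(x) = (x - \mu)^2 - \sigma^2$ and $T_F(x, y) = -2 (x - \mu)(y - \mu)$, and the second-order expansion $T(F) + \int T_F \, d(G - F) + \tfrac{1}{2} \iint T_F \, d(G - F) \, d(G - F)$ is exact. These derivative formulas are textbook facts about the variance, not expressions quoted from the cited paper. The sketch below checks the identity numerically for a discrete $F$ and a signed $G$ of total mass 1; all names and numbers are hypothetical.

```python
import numpy as np

def variance_functional(points, masses):
    """T(G) = int x^2 dG - (int x dG)^2 for a discrete measure of total mass 1."""
    first_moment = masses @ points
    return masses @ points**2 - first_moment**2

def second_order_expansion(points, f_masses, g_masses):
    """Second-order von Mises expansion of the variance functional around F."""
    mu = f_masses @ points
    sigma2 = variance_functional(points, f_masses)
    d1 = (points - mu) ** 2 - sigma2                 # first derivative T_F(x)
    d2 = -2.0 * np.outer(points - mu, points - mu)   # second derivative T_F(x, y)
    delta = g_masses - f_masses                      # G - F, a signed measure of mass 0
    return sigma2 + delta @ d1 + 0.5 * delta @ d2 @ delta, d1, d2

points = np.array([0.0, 1.0, 3.0, 4.0, 7.0])
f_masses = np.full(points.size, 0.2)                 # F: uniform probability measure
g_masses = np.array([0.5, 0.3, -0.1, 0.2, 0.1])      # G: signed, total mass 1

approx, d1, d2 = second_order_expansion(points, f_masses, g_masses)
print(approx, variance_functional(points, g_masses))
print(f_masses @ d1, d2 @ f_masses)                  # both normalizations vanish
```

The first line of output shows the expansion matching the functional evaluated directly at $G$, and the second shows that both derivatives integrate to zero against $F$, which is the normalization used in the definition.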
Cumulant expansions for $T(F_{n,w})$ take the form
$$\kappa_r\big(T(F_{n,w})\big) \approx \sum_{j \ge r-1} a_{rj} \, n^{-j}, \qquad r \ge 1,$$
with explicit, weight-dependent coefficients $a_{rj}$. These coefficients are built from expectations of products of first and second von Mises derivatives, averaged against the weight profile $(w_1, \dots, w_n)$. The $r$-th order moment structure further involves cross-product terms coupling distinct sample points, reflecting both heterogeneity and weighting (Withers et al., 2010).
4. Practical Applications: Estimators and Edge Cases
a) Sample Mean and Variance
For the mean, $T(F) = \int x \, dF(x)$, the influence function is $T_F(x) = x - \mu$ with $\mu = T(F)$, and the cumulant coefficients reduce to scaled moments of the underlying distributions combined with the $r$-th moment of the weights. For the sample variance, higher-order derivatives and mixed moments appear, which is essential under non-identical distributions; the resulting coefficients are expressed in terms of empirical moments.
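For the weighted mean the link between cumulants and weight moments can be made fully explicit: by independence and homogeneity of cumulants, $\kappa_r\big(\sum_i w_i X_i\big) = \sum_i w_i^r \, \kappa_r(X_i)$. The sketch below checks this exactly for independent Bernoulli variables with different success probabilities by enumerating all outcomes. The Bernoulli choice, the function names, and the numbers are illustrative assumptions.

```python
import itertools
import numpy as np

def cumulants_by_enumeration(weights, probs):
    """First three cumulants of sum_i w_i X_i with X_i ~ Bernoulli(p_i), independent."""
    values, masses = [], []
    for outcome in itertools.product([0, 1], repeat=len(weights)):
        values.append(sum(w * b for w, b in zip(weights, outcome)))
        masses.append(np.prod([p if b else 1.0 - p for p, b in zip(probs, outcome)]))
    values, masses = np.array(values), np.array(masses)
    mean = masses @ values
    centered = values - mean
    # The second and third cumulants equal the second and third central moments.
    return mean, masses @ centered**2, masses @ centered**3

def cumulants_by_formula(weights, probs):
    """kappa_r = sum_i w_i^r * kappa_r(X_i) for r = 1, 2, 3."""
    w, p = np.asarray(weights), np.asarray(probs)
    k1 = p
    k2 = p * (1 - p)
    k3 = p * (1 - p) * (1 - 2 * p)
    return w @ k1, w**2 @ k2, w**3 @ k3

weights = [0.1, 0.4, 0.3, 0.2]   # hypothetical weights, summing to 1
probs = [0.2, 0.5, 0.7, 0.9]     # a different distribution for each observation

print(cumulants_by_enumeration(weights, probs))
print(cumulants_by_formula(weights, probs))
```

Both lines print the same three numbers. The $r$-th cumulant picks up the weights raised to the power $r$, which is why concentrated weight profiles slow the approach to normality.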
b) Studentized Mean, Coefficient of Variation
The framework extends directly to plug-in functionals such as the Studentized mean and sample coefficient of variation, where the higher order corrections account for weighted, heterogeneous sampling. In each case, practical inference—confidence intervals, hypothesis tests—benefits from improved asymptotics reflecting both the weights and the varying sampling distributions (Withers et al., 2010).
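The plug-in functionals themselves are straightforward to compute from a weighted sample. The sketch below evaluates a weighted coefficient of variation and one possible Studentized mean. The scaling used for Studentization, the weighted variance times $\sum_i w_i^2$, is an assumption that presumes a common variance across observations; it is not taken from the cited paper, and the higher-order corrections discussed above are not included.

```python
import numpy as np

def weighted_moments(sample, weights):
    """Mean and variance of the weighted empirical distribution."""
    sample = np.asarray(sample, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = weights @ sample
    variance = weights @ (sample - mean) ** 2
    return mean, variance

def weighted_cv(sample, weights):
    """Plug-in coefficient of variation: standard deviation divided by mean."""
    mean, variance = weighted_moments(sample, weights)
    return np.sqrt(variance) / mean

def studentized_mean(sample, weights, mu0):
    """(weighted mean - mu0) / estimated standard error, under the common-variance assumption."""
    mean, variance = weighted_moments(sample, weights)
    weights = np.asarray(weights, dtype=float)
    standard_error = np.sqrt(variance * np.sum(weights**2))
    return (mean - mu0) / standard_error

data = np.array([1.0, 2.0, 3.0, 4.0, 10.0])      # hypothetical observations
uniform = np.full(data.size, 1.0 / data.size)
tilted = np.array([0.3, 0.3, 0.2, 0.15, 0.05])   # down-weights the largest value

print(weighted_cv(data, uniform), weighted_cv(data, tilted))
print(studentized_mean(data, uniform, mu0=3.0), studentized_mean(data, tilted, mu0=3.0))
```

With uniform weights both statistics reduce to their classical plug-in versions (with divisor $n$ in the variance), so the weighted forms are direct generalizations.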
5. Edgeworth–Cornish–Fisher Expansions: Quantiles and Distributional Approximations
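The explicit cumulant expansion allows the use of Edgeworth–Cornish–Fisher (ECF) formulas to correct the estimated distribution and quantiles of the standardized statistic $Y_n$ built from $T(F_{n,w})$:
$$P(Y_n \le x) \approx \Phi(x) - \phi(x) \sum_{r \ge 1} n^{-r/2} h_r(x), \qquad P\Big(Y_n \le x + \sum_{r \ge 1} n^{-r/2} g_r(x)\Big) \approx \Phi(x),$$
where $\Phi$ and $\phi$ are the standard normal CDF and density, and the polynomials $h_r$ and $g_r$ are determined via Hermite expansions in terms of the standardized cumulant coefficients. These expansions yield more accurate p-values and critical values for functionals under both heterogeneity and weighting, and directly extend classical asymptotic inference.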
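The quantile side of the correction can be illustrated in the same simplified setting as a classical textbook example: the standardized mean of $n$ i.i.d. Exp(1) variables with equal weights, which is an assumption made here so that the exact distribution is known. The one-term Cornish–Fisher quantile is $z_p + \gamma (z_p^2 - 1)/(6 \sqrt{n})$, with skewness $\gamma = 2$; the sketch measures how close the exact coverage is to the nominal level $p$ with and without the correction.

```python
import math
from statistics import NormalDist

def exact_cdf_standardized_mean(x, n):
    """P(sqrt(n) * (mean - 1) <= x) for n i.i.d. Exp(1) draws; their sum is Gamma(n, 1)."""
    t = n + math.sqrt(n) * x
    if t <= 0:
        return 0.0
    tail = sum(math.exp(k * math.log(t) - t - math.lgamma(k + 1)) for k in range(n))
    return 1.0 - tail

def cornish_fisher_one_term(p, n, skewness):
    """Normal quantile plus the first (n^{-1/2}) Cornish-Fisher correction."""
    z = NormalDist().inv_cdf(p)
    return z + skewness * (z * z - 1.0) / (6.0 * math.sqrt(n))

n = 10
for p in (0.05, 0.95):
    z = NormalDist().inv_cdf(p)
    corrected = cornish_fisher_one_term(p, n, skewness=2.0)
    # Exact probability actually attained at each candidate critical value.
    print(p, exact_cdf_standardized_mean(z, n), exact_cdf_standardized_mean(corrected, n))
```

In this example the corrected critical values attain probabilities much closer to the nominal levels than the uncorrected normal quantiles do, in both tails.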
6. Methodological Extensions: Non-i.i.d. Data and Signed Measures
A crucial methodological development is the extension of von Mises functional derivatives to signed measures of total measure $1$. This step is necessary because the weighted empirical distribution may involve negative or non-uniform weighting (e.g., in leave-one-out or importance resampling). The normalization $\int T_F(x_1, \dots, x_r) \, dF(x_1) = 0$ for all $r \ge 1$ ensures that the expansion remains valid in this generality.
This approach is essential for robust inference under
- Non-identical distributions (a distinct $F_i$ for each $i$),
- Preassigned or data-adaptive weighting (e.g., regression, survey sampling, sandwich estimation),
- Settings where classical plug-in theory fails.
Thus, the framework generalizes both the sample measures themselves and the associated inferential expansions (Withers et al., 2010).
7. Implications for Nonparametric and Robust Inference
The explicit third order expansions and cumulant-based inferential methodology provide a substantial improvement over traditional CLT-level inference. The ability to model and correct for both heterogeneity and sampling design is fundamental for
- Nonparametric inference with complex surveys or regression residuals,
- Robust estimation, model validation, and hypothesis testing in high-variance environments,
- Bayesian and likelihood-based procedures requiring accurate quantile or tail approximations.
By integrating weighting into both the empirical measure and its higher order derivative structure, the approach provides a rigorous, systematic framework for precision inference in real-world, heterogeneous data scenarios.
Summary Table: Core Components of Weighted Empirical Distribution Results
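Component | Mathematical Expression / Role | Key Implications |
---|---|---|
Weighted empirical CDF | $F_{n,w}(x) = \sum_{i} w_i \, \mathbf{1}\{X_i \le x\}$ | Accommodates weights and heterogeneity |
Mean under weights | $E[F_{n,w}(x)] = \sum_{i} w_i F_i(x)$ | Accurate expectation under non-i.i.d. sampling |
Von Mises expansion | $T(G) = T(F) + \sum_{r \ge 1} \frac{1}{r!} \int \cdots \int T_F(x_1, \dots, x_r) \prod_{j} d(G - F)(x_j)$ | Generalizes Taylor expansion for functionals |
Cumulant expansion | $\kappa_r(T(F_{n,w})) \approx \sum_{j \ge r-1} a_{rj} \, n^{-j}$ | Enables Edgeworth and Cornish–Fisher expansions (finite-sample correction) |
Higher order asymptotics | Edgeworth–Cornish–Fisher expansions to $O(n^{-3/2})$ | Improved confidence intervals, p-values |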
References
- For third order asymptotic expansions and cumulants for weighted empirical distributions: "The distribution and quantiles of functionals of weighted empirical distributions when observations have different distributions" (Withers et al., 2010).
- For applications in regression, rank statistics, and Bayes theory: see Withers et al. (2010) and the references therein.