Truncated Horvitz–Thompson Estimator
- The truncated Horvitz–Thompson estimator trades a small bias for lower variance by replacing extremely small inclusion probabilities with a preset threshold, which stabilizes the weights.
- It reduces mean squared error substantially, by 20% to 70% in simulations and by 16% to 55% in real-data applications, while introducing only a small bias.
- Its straightforward implementation and adaptability for ratio estimation and zero-truncated count data make it a robust choice in complex survey designs.
The truncated Horvitz–Thompson estimator refers to a family of modifications of the classical Horvitz–Thompson (HT) estimator that accept a small bias in exchange for lower variance. They are relevant mainly when survey inclusion probabilities are highly heterogeneous or when the data are subject to truncation (e.g., studies reporting only nonzero counts). The core strategy is either to replace small inclusion probabilities with a fixed threshold (the "hard-threshold" or "truncated" HT) to control estimator variance, or to model the truncation explicitly and estimate latent totals via inverse probability weighting. Both approaches aim for a lower mean squared error (MSE) at the cost of a small bias, and both are supported by theoretical results and empirical studies.
1. Classical Horvitz–Thompson Estimator and Its Limitations
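Let $U = \{1, \dots, N\}$ be a finite population, with variable of interest $y$ taking value $y_i$ for unit $i$. Under an unequal-probability, without-replacement sampling design, each unit $i$ is included in the sample $s$ with first-order probability $\pi_i$, and each pair of units $(i, j)$ with second-order probability $\pi_{ij}$.
The classical Horvitz–Thompson estimator for the population total $Y = \sum_{i \in U} y_i$ is:
$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}.$$
It is unbiased: $E(\hat{Y}_{HT}) = Y$. Its variance is
$$\mathrm{Var}(\hat{Y}_{HT}) = \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \, \frac{y_i}{\pi_i} \, \frac{y_j}{\pi_j},$$
where $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$ and $\pi_{ii} = \pi_i$.
A key limitation arises when some $\pi_i$ are very small: the terms $y_i / \pi_i$ can become extreme, inflating the variance and leading to unstable estimation, particularly when the population exhibits strong heterogeneity in inclusion probabilities.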
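The instability can be made concrete with a small sketch. It assumes Poisson sampling (independent inclusion of each unit, so that $\pi_{ij} = \pi_i \pi_j$ for $i \ne j$ and the variance reduces to $\sum_i (1 - \pi_i) y_i^2 / \pi_i$). The toy population and the function names are hypothetical and chosen only for illustration.

```python
import random


def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate of the total from the sampled units."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))


def ht_variance_terms_poisson(y, pi):
    """Per-unit contributions (1 - pi_i) * y_i^2 / pi_i to the exact HT
    variance. Valid only under Poisson sampling (an assumption here)."""
    return [(1.0 - p) / p * v * v for v, p in zip(y, pi)]


# Hypothetical population: similar y values, one very small inclusion probability.
y = [10.0, 12.0, 9.0, 11.0, 10.0]
pi = [0.01, 0.4, 0.5, 0.6, 0.7]

terms = ht_variance_terms_poisson(y, pi)
print("variance contribution per unit:", [round(t, 2) for t in terms])
print("share of unit 0:", round(terms[0] / sum(terms), 3))

# Draw one Poisson sample and compute the HT estimate.
rng = random.Random(0)
sample = [i for i, p in enumerate(pi) if rng.random() < p]
print("HT estimate:", ht_total([y[i] for i in sample], [pi[i] for i in sample]))
```

In this toy population, the unit with $\pi_i = 0.01$ accounts for almost all of the variance even though its $y$ value is unremarkable.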
2. Hard-Threshold (Truncated) Horvitz–Thompson Estimator: Formulation
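The improved Horvitz–Thompson (IHT) estimator modifies the classical estimator by introducing a hard threshold on the inclusion probabilities:
$$\pi_i^* = \max\{\pi_i, \pi_{(K)}\},$$
where the threshold is typically taken to be the $K$th-smallest inclusion probability, $\pi_{(K)}$, with $K$ subject to the selection condition given by Zong et al. (2018).
The IHT estimator is then:
$$\hat{Y}_{IHT} = \sum_{i \in s} \frac{y_i}{\pi_i^*}.$$
This shrinkage strategy dampens extreme weights, controls variance, and introduces a bias that is asymptotically negligible under regularity conditions on the sampling design.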
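A minimal sketch of the thresholding step follows. It assumes that the inclusion probabilities of all population units are known (as they are for a fully specified design) and that $K$ is supplied by the user; the rule for choosing $K$ is not implemented here. The function names are hypothetical.

```python
def truncate_probabilities(pi, K):
    """Raise every inclusion probability below the K-th smallest one to that
    value. `pi` holds the probabilities of all population units; K is 1-based."""
    if not 1 <= K <= len(pi):
        raise ValueError("K must be between 1 and the population size")
    threshold = sorted(pi)[K - 1]
    return [max(p, threshold) for p in pi], threshold


def iht_total(y, pi, sample, K):
    """Hard-threshold (truncated) HT estimate of the total.
    `sample` is a list of indices of the sampled units."""
    pi_star, _ = truncate_probabilities(pi, K)
    return sum(y[i] / pi_star[i] for i in sample)


# Hypothetical toy population and sample.
y = [10.0, 12.0, 9.0, 11.0, 10.0]
pi = [0.01, 0.4, 0.5, 0.6, 0.7]
sample = [0, 2]

print("HT :", sum(y[i] / pi[i] for i in sample))
print("IHT:", iht_total(y, pi, sample, K=2))
```

With $K = 1$ the threshold equals the smallest inclusion probability, nothing is changed, and the sketch reduces to the classical HT estimator.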
3. Bias, Variance, and Exact MSE
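Define $M = \{i \in U : \pi_i < \pi_{(K)}\}$ (units with thresholded probabilities) and $M^c = U \setminus M$. Since $\pi_i^* = \pi_{(K)}$ for $i \in M$ and $\pi_i^* = \pi_i$ for $i \in M^c$, the estimator's bias is:
$$B(\hat{Y}_{IHT}) = E(\hat{Y}_{IHT}) - Y = \sum_{i \in M} \left( \frac{\pi_i}{\pi_{(K)}} - 1 \right) y_i.$$
The exact variance is:
$$\mathrm{Var}(\hat{Y}_{IHT}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \, \frac{y_i}{\pi_i^*} \, \frac{y_j}{\pi_j^*}, \qquad \pi_{ii} = \pi_i.$$
Therefore, the mean squared error is
$$\mathrm{MSE}(\hat{Y}_{IHT}) = \mathrm{Var}(\hat{Y}_{IHT}) + B(\hat{Y}_{IHT})^2.$$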
An unbiased estimator of the MSE can be constructed from the sample data and the partition of the units into thresholded and non-thresholded sets, with terms involving the inclusion probabilities and the corresponding sample means.
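The bias, variance, and MSE can be evaluated exactly for a known population in the special case of Poisson sampling, where $\pi_{ij} = \pi_i \pi_j$ for $i \ne j$ and the double sum in the variance collapses to its diagonal. The sketch below makes that assumption; it computes population-level quantities, not the sample-based unbiased MSE estimator, and its names and data are hypothetical.

```python
def iht_exact_poisson(y, pi, K):
    """Exact bias, variance and MSE of the truncated HT estimator under
    Poisson sampling (independent inclusion; an assumption of this sketch).
    K is the 1-based rank of the inclusion probability used as threshold."""
    threshold = sorted(pi)[K - 1]
    pi_star = [max(p, threshold) for p in pi]
    # E[estimator] - Y = sum_i (pi_i / pi_i^* - 1) * y_i; zero for untouched units.
    bias = sum((p / ps - 1.0) * v for v, p, ps in zip(y, pi, pi_star))
    # Only diagonal terms survive: pi_i * (1 - pi_i) * y_i^2 / pi_i^*^2.
    variance = sum(p * (1.0 - p) * v * v / (ps * ps)
                   for v, p, ps in zip(y, pi, pi_star))
    return bias, variance, variance + bias * bias


# Hypothetical toy population.
y = [10.0, 12.0, 9.0, 11.0, 10.0]
pi = [0.01, 0.4, 0.5, 0.6, 0.7]

for K in (1, 2, 3):
    bias, variance, mse = iht_exact_poisson(y, pi, K)
    print(f"K={K}: bias={bias:.3f} variance={variance:.3f} mse={mse:.3f}")
```

Setting $K = 1$ leaves every probability unchanged, so the first row gives the classical HT estimator as a baseline.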
4. Theoretical Comparison and Asymptotics
The MSE of the IHT estimator is of the same order as that of the classical HT estimator, and its bias is negligible compared with the variance for large samples. For Poisson sampling, Zong et al. (2018) show that the IHT estimator yields a uniform improvement in MSE over the classical HT estimator whenever a condition on the inclusion probabilities of two units holds. In all cases, the benefit of truncation becomes more pronounced as the heterogeneity of the $\pi_i$ increases.
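As a worked illustration of why truncation can help, consider Poisson sampling, for which $\pi_{ij} = \pi_i \pi_j$ when $i \ne j$. The following is a direct calculation from the definitions, not a restatement of the theorem in Zong et al. (2018). Write $\pi_i^* = \max\{\pi_i, \pi_{(K)}\}$ and let $M$ be the set of units with $\pi_i < \pi_{(K)}$. Then

$$\mathrm{MSE}(\hat{Y}_{HT}) = \sum_{i \in U} \frac{(1 - \pi_i)\, y_i^2}{\pi_i}, \qquad \mathrm{MSE}(\hat{Y}_{IHT}) = \sum_{i \in U} \frac{\pi_i (1 - \pi_i)\, y_i^2}{(\pi_i^*)^2} + \left( \sum_{i \in M} \left( \frac{\pi_i}{\pi_{(K)}} - 1 \right) y_i \right)^2,$$

so that

$$\mathrm{MSE}(\hat{Y}_{HT}) - \mathrm{MSE}(\hat{Y}_{IHT}) = \sum_{i \in M} (1 - \pi_i)\, y_i^2 \left( \frac{1}{\pi_i} - \frac{\pi_i}{\pi_{(K)}^2} \right) - \left( \sum_{i \in M} \left( 1 - \frac{\pi_i}{\pi_{(K)}} \right) y_i \right)^2.$$

The first term is the variance saved on the thresholded units and grows like $1/\pi_i$ as $\pi_i \to 0$, whereas the squared bias is bounded by $\left( \sum_{i \in M} |y_i| \right)^2$. Truncation lowers the MSE whenever the first term exceeds the second.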
5. Practical Selection of Threshold and Implementation
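Choose the threshold as follows. Arrange the inclusion probabilities in increasing order, $\pi_{(1)} \le \pi_{(2)} \le \dots \le \pi_{(N)}$. Starting from an initial value of $K$, step through the ordered probabilities and increment $K$ while the selection condition of Zong et al. (2018) holds. Finally, take $\pi_{(K)}$ as the threshold.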
The estimator is readily implemented: replace each $\pi_i$ with $\pi_i^* = \max\{\pi_i, \pi_{(K)}\}$, compute weights $w_i = 1 / \pi_i^*$, and estimate the total as the weighted sum $\sum_{i \in s} w_i y_i$. Standard software for survey analysis can accommodate weight adjustments; the unbiased MSE estimator requires access to pairwise inclusion probabilities.
Empirically, reductions in relative MSE range from 20% to 70% in simulations; real data applications show efficiency gains of 16–55% (Zong et al., 2018). In scenarios with extremely small inclusion probabilities, the variance control afforded by truncation can be critical.
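A comparison of this kind can be set up with a short Monte Carlo sketch. The population, the choice of $K$, and the use of Poisson sampling below are all assumptions made for illustration; this does not reproduce the simulation design of Zong et al. (2018), and no particular size of improvement is implied.

```python
import random


def simulate_mse(y, pi, K, reps=2000, seed=0):
    """Monte Carlo MSE of the HT and truncated HT estimators of the total
    under Poisson sampling. Both estimators are evaluated on the same samples."""
    rng = random.Random(seed)
    threshold = sorted(pi)[K - 1]
    pi_star = [max(p, threshold) for p in pi]
    total = sum(y)
    se_ht = se_iht = 0.0
    for _ in range(reps):
        # Poisson sampling: each unit enters independently with probability pi_i.
        sample = [i for i, p in enumerate(pi) if rng.random() < p]
        ht = sum(y[i] / pi[i] for i in sample)
        iht = sum(y[i] / pi_star[i] for i in sample)
        se_ht += (ht - total) ** 2
        se_iht += (iht - total) ** 2
    return se_ht / reps, se_iht / reps


# Hypothetical population: 10 units with tiny inclusion probabilities,
# 190 units with moderate ones; y is generated independently of pi.
gen = random.Random(1)
N = 200
pi = [0.005] * 10 + [gen.uniform(0.2, 0.6) for _ in range(N - 10)]
y = [gen.gauss(50.0, 10.0) for _ in range(N)]

mse_ht, mse_iht = simulate_mse(y, pi, K=11)
print("Monte Carlo MSE, HT :", round(mse_ht, 1))
print("Monte Carlo MSE, IHT:", round(mse_iht, 1))
print("relative change     :", round((mse_iht - mse_ht) / mse_ht, 3))
```

Here $K = 11$ raises exactly the ten tiny probabilities to the smallest of the moderate ones. Varying $K$ and the number of repetitions shows how sensitive the comparison is to the threshold.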
6. Extension to Ratio Estimation
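When an auxiliary variable $x$ with known population total $X$ is available, the improved ratio estimator is obtained by replacing all $\pi_i$ by $\pi_i^*$ in the classical ratio estimator:
$$\hat{Y}_{IR} = X \, \frac{\sum_{i \in s} y_i / \pi_i^*}{\sum_{i \in s} x_i / \pi_i^*}.$$
Under smoothness and higher-order inclusion probability conditions, asymptotic expressions for the MSE of the improved ratio estimators are available (Zong et al., 2018).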
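The ratio form can be sketched as follows, assuming the usual ratio estimator of a total (known auxiliary total times the ratio of two weighted sums) with truncated probabilities in both numerator and denominator. The function name, the user-supplied $K$, and the toy data are hypothetical.

```python
def improved_ratio_total(y, x, pi, sample, K, x_total):
    """Ratio estimate of the total of y using truncated inclusion probabilities.
    `sample` holds indices of sampled units; `x_total` is the known total of x."""
    threshold = sorted(pi)[K - 1]
    num = sum(y[i] / max(pi[i], threshold) for i in sample)
    den = sum(x[i] / max(pi[i], threshold) for i in sample)
    if den == 0:
        raise ValueError("weighted auxiliary total is zero; ratio undefined")
    return x_total * num / den


# Hypothetical toy data in which y is roughly proportional to x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
pi = [0.01, 0.4, 0.5, 0.6, 0.7]
sample = [0, 2, 4]

print("improved ratio estimate:",
      improved_ratio_total(y, x, pi, sample, K=2, x_total=sum(x)))
print("true total             :", sum(y))
```

When $y$ is exactly proportional to $x$, the weights cancel and the estimate equals the true total for any nonempty sample, which is a convenient sanity check.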
7. Applicability, Robustness, and Empirical Performance
The truncated HT/IHT estimator is most advantageous in survey designs or population structures where certain units have very low selection probabilities, leading to instability in the classical estimator. Its effectiveness is largely independent of the relationship between $y_i$ and $\pi_i$: the reduction in MSE persists even when $y$ is not related to selection. The bias introduced is typically small relative to the variance reduction, and practical implementation is straightforward, including for ratio estimation.
Empirical analysis of real datasets (e.g., the "Lucy" firm data) demonstrates robustness and efficiency: with sample sizes up to 30% of the population, MSE reductions of 16–55% were observed, and the optimal threshold adapts as the sampling fraction increases. The estimator comes with an unbiased and tractable MSE estimator, and operational guidelines for threshold choice have emerged from both theoretical and empirical investigations.
A plausible implication is that in large, highly stratified or unequal-probability sampling, routine adoption of the IHT estimator may offer substantial efficiency gains with negligible bias, especially in applications where maximal control of estimator variance is sought.
For zero-truncated count data meta-analysis, as in studies of post-bariatric-surgery suicide, the truncated HT framework is adapted to estimate the total number of studies, including those systematically missing because they observed zero outcomes. Inclusion probabilities are estimated with a zero-truncated count regression, and the HT formula is then applied to infer the unobserved population size. Provided the count model is adequate, this corrects for the exclusion mechanism, and a parametric bootstrap quantifies the total uncertainty (Dennett et al., 2023).
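The idea can be sketched in its simplest form. The code below assumes a homogeneous zero-truncated Poisson model with no covariates and no exposure term, which is a deliberate simplification of the regression approach described above and not the model of Dennett et al. (2023). Each observed study then has inclusion probability $1 - e^{-\lambda}$, and the HT estimate of the number of studies is the sum of the inverse probabilities. Function names and counts are hypothetical.

```python
import math


def fit_zero_truncated_poisson(counts):
    """Maximum-likelihood rate of a zero-truncated Poisson, found by bisection
    on the mean equation lam / (1 - exp(-lam)) = sample mean."""
    mean = sum(counts) / len(counts)
    if min(counts) < 1 or mean <= 1.0:
        raise ValueError("need positive counts with mean above 1")
    lo, hi = 1e-12, mean  # the left-hand side is increasing and exceeds mean at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate_total_studies(counts):
    """HT-type estimate of the number of studies, including unobserved
    zero-count studies: sum over observed studies of 1 / P(count > 0)."""
    lam = fit_zero_truncated_poisson(counts)
    p_observed = -math.expm1(-lam)  # 1 - exp(-lam)
    return sum(1.0 / p_observed for _ in counts)


# Hypothetical event counts from studies that reported at least one event.
counts = [1, 1, 2, 1, 3, 1, 2, 1]
print("observed studies :", len(counts))
print("estimated total  :", round(estimate_total_studies(counts), 2))
```

A parametric bootstrap for the uncertainty would repeatedly simulate counts from the fitted model, discard zeros, and re-run both steps; that part is omitted here.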