Truncated Horvitz–Thompson Estimator

Updated 12 November 2025

The truncated Horvitz–Thompson estimator is a bias–variance trade-off method that replaces extremely small inclusion probabilities with a preset threshold to stabilize the survey weights.
It substantially reduces mean squared error, with reported reductions of 20% to 70% in simulations and 16% to 55% in real-data applications, at the cost of only a small bias.
Its straightforward implementation and its extensions to ratio estimation and zero-truncated count data make it a practical choice in complex survey designs.

The truncated Horvitz–Thompson estimator refers to a family of modifications of the classical Horvitz–Thompson (HT) estimator that accept a small bias in exchange for lower variance. They are primarily relevant when survey inclusion probabilities are highly heterogeneous or when data are subject to truncation (e.g., studies reporting only nonzero counts). The core strategy is either to threshold small inclusion probabilities at a fixed value (the "hard-threshold" or "truncated" HT) to control estimator variance, or to explicitly model truncation and estimate latent totals via inverse probability weighting. These approaches yield estimators with controlled or reduced mean squared error (MSE) at the cost of a small bias, and they are supported by both theoretical results and empirical evidence.

1. Classical Horvitz–Thompson Estimator and Its Limitations

Let $U = \{1, \ldots, N\}$ be a finite population, with variable of interest $Y$ taking values $y_i$ for unit $i$. Under an unequal-probability, without-replacement sampling design, each unit $i$ is included in the sample $s$ with first-order probability $\pi_i = P(i \in s)$ and second-order probability $\pi_{ij} = P(i, j \in s)$.

The classical Horvitz–Thompson estimator for the population total $T = \sum_{i \in U} y_i$ is

$\widehat T_{\rm HT} = \sum_{i \in s} \frac{y_i}{\pi_i}$

It is unbiased, $E[\widehat T_{\rm HT}] = T$, and its variance is

$\operatorname{Var}(\widehat T_{\rm HT}) = \sum_{i \in U} \frac{\Delta_{ii}}{\pi_i^2} y_i^2 + \sum_{i \neq j} \frac{\Delta_{ij}}{\pi_i \pi_j} y_i y_j$

where $\Delta_{ii} = \pi_i(1-\pi_i)$ and $\Delta_{ij} = \pi_{ij} - \pi_i\pi_j$.
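The estimator and its variance formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `ht_total` and `ht_variance`, the list-based inputs, and the four-unit population are assumptions made here for the example.

```python
def ht_total(y_s, pi_s):
    """Horvitz-Thompson estimate of the total from sampled y_i and their pi_i."""
    return sum(y / p for y, p in zip(y_s, pi_s))


def ht_variance(y, pi, pi_joint):
    """Exact design variance of the HT total, following the formula above.

    y and pi cover the whole population; pi_joint[i][j] is pi_ij for i != j
    (diagonal entries are ignored).
    """
    n_pop = len(y)
    var = 0.0
    for i in range(n_pop):
        for j in range(n_pop):
            if i == j:
                delta = pi[i] * (1.0 - pi[i])           # Delta_ii
            else:
                delta = pi_joint[i][j] - pi[i] * pi[j]  # Delta_ij
            var += delta * y[i] * y[j] / (pi[i] * pi[j])
    return var


# Hypothetical 4-unit population under Poisson sampling (independent draws),
# for which pi_ij = pi_i * pi_j and the cross terms vanish.
y = [3.0, 1.0, 4.0, 2.0]
pi = [0.1, 0.4, 0.7, 0.9]
pi_joint = [[a * b for b in pi] for a in pi]
print(ht_variance(y, pi, pi_joint))
```

In this toy population the unit with $\pi_i = 0.1$ contributes the largest share of the variance, which is the instability discussed next.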

A key limitation arises when some $\pi_i$ are very small: the $y_i / \pi_i$ terms can become extreme, inflating variance and leading to unstable estimation, particularly when the population exhibits strong heterogeneity in inclusion probabilities.

2. Hard-Threshold (Truncated) Horvitz–Thompson Estimator: Formulation

The improved Horvitz–Thompson (IHT) estimator modifies the classical estimator by introducing a hard threshold $\tau > 0$ on the inclusion probabilities:

$\pi_i^* = \max(\pi_i, \tau), \qquad i = 1, \ldots, N$

Typically, $\tau$ is chosen as the $K$th-smallest inclusion probability, $\pi_{(K)}$, subject to $\pi_{(K)} \leq 1/(K+1)$.

The IHT estimator is then

$\widehat T_{\rm IHT} = \sum_{i \in s} \frac{y_i}{\pi_i^*}$

This shrinkage strategy dampens extreme weights, controls variance, and introduces a bias that is, under regular sampling conditions, asymptotically negligible.
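As a quick illustration with made-up numbers (not taken from the source), suppose a sample contains three units with $y = (10, 20, 30)$ and $\pi = (0.01, 0.2, 0.5)$, and the threshold is fixed at $\tau = 0.05$. Only the first probability is raised, so the two estimates are

$\widehat T_{\rm HT} = \frac{10}{0.01} + \frac{20}{0.2} + \frac{30}{0.5} = 1000 + 100 + 60 = 1160$

$\widehat T_{\rm IHT} = \frac{10}{0.05} + \frac{20}{0.2} + \frac{30}{0.5} = 200 + 100 + 60 = 360$

The single unit with $\pi_i = 0.01$ accounts for most of the HT estimate; thresholding caps its weight at $1/\tau = 20$ instead of $100$.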

3. Bias, Variance, and Exact MSE

Define $U_2 = \{i : \pi_i \leq \tau\}$ (units with thresholded probabilities) and $U_1 = U \setminus U_2$. The estimator's bias is

$\operatorname{Bias}(\widehat T_{\rm IHT}) = \sum_{i \in U_2} \left( \frac{\pi_i}{\tau} - 1 \right) y_i$

The exact variance is

$\operatorname{Var}(\widehat T_{\rm IHT}) = \sum_{i \in U} \frac{\Delta_{ii}}{(\pi_i^*)^2} y_i^2 + \sum_{i \neq j} \frac{\Delta_{ij}}{\pi_i^* \pi_j^*} y_i y_j$

Therefore, the mean squared error is

$\operatorname{MSE}(\widehat T_{\rm IHT}) = \left[\sum_{i \in U_2} \left(\frac{\pi_i}{\tau} - 1\right) y_i \right]^2 + \sum_{i \in U} \frac{\Delta_{ii}}{(\pi_i^*)^2} y_i^2 + \sum_{i \neq j} \frac{\Delta_{ij}}{\pi_i^* \pi_j^*} y_i y_j$
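The three quantities above can be evaluated directly when the full population and the joint inclusion probabilities are known, for instance in a simulation study. The sketch below is illustrative only: the function name `iht_bias_var_mse`, the nested-list layout of `pi_joint`, and the toy population are assumptions introduced here.

```python
def iht_bias_var_mse(y, pi, pi_joint, tau):
    """Exact bias, variance and MSE of the thresholded HT total.

    y and pi cover the whole population; pi_joint[i][j] is pi_ij for i != j
    (diagonal entries are ignored); tau is the threshold.
    """
    n_pop = len(y)
    pi_star = [max(p, tau) for p in pi]
    # Units outside U_2 have pi_i / pi_i^* = 1, so they add nothing to the bias.
    bias = sum((pi[i] / pi_star[i] - 1.0) * y[i] for i in range(n_pop))
    var = 0.0
    for i in range(n_pop):
        for j in range(n_pop):
            if i == j:
                delta = pi[i] * (1.0 - pi[i])           # Delta_ii
            else:
                delta = pi_joint[i][j] - pi[i] * pi[j]  # Delta_ij
            var += delta * y[i] * y[j] / (pi_star[i] * pi_star[j])
    return bias, var, bias ** 2 + var


# Hypothetical population under Poisson sampling, where pi_ij = pi_i * pi_j.
y = [3.0, 1.0, 4.0, 2.0]
pi = [0.1, 0.4, 0.7, 0.9]
pi_joint = [[a * b for b in pi] for a in pi]
bias, var, mse = iht_bias_var_mse(y, pi, pi_joint, tau=0.4)
print(bias, var, mse)
```

Setting `tau` at or below the smallest $\pi_i$ leaves every probability unchanged, so the bias is zero and the function returns the classical HT variance as the MSE.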

An unbiased estimator of the MSE can be constructed using sample data and the partition $s_2 = s \cap U_2$, with terms involving $\check\Delta_{ii} = \Delta_{ii}/\pi_i$, $\check\Delta_{ij} = \Delta_{ij}/\pi_{ij}$, and the corresponding sample means.

4. Theoretical Comparison and Asymptotics

For the normalized estimator $N^{-1}\widehat T_{\rm IHT}$, the MSE is of order $O(n^{-1})$, matching that of the classical HT estimator. Its bias is also $O(n^{-1})$, so the squared bias is $O(n^{-2})$ and thus negligible relative to the variance for large samples. Theoretical results establish

$\operatorname{MSE}(N^{-1} \widehat T_{\rm IHT}) \leq \operatorname{MSE}(N^{-1} \widehat T_{\rm HT}) + o(n^{-1})$

For Poisson sampling, the IHT estimator yields uniform improvement in MSE over the classical HT estimator as long as there exist two units $i, j$ such that $(\pi_i - \tau)y_i \neq (\pi_j - \tau)y_j$ (Zong et al., 2018). In all cases, as the heterogeneity of the $\pi_i$ increases, the benefit of truncation becomes more pronounced.
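To see where the gain comes from under Poisson sampling, the following decomposition is worked out from the formulas in Sections 1 and 3 (it is a derivation sketch, not a statement quoted from Zong et al.). Units are selected independently, so $\pi_{ij} = \pi_i\pi_j$ and $\Delta_{ij} = 0$ for $i \neq j$; moreover $\pi_i^* = \pi_i$ on $U_1$, so only units in $U_2$ contribute to the difference:

$\operatorname{MSE}(\widehat T_{\rm HT}) - \operatorname{MSE}(\widehat T_{\rm IHT}) = \sum_{i \in U_2} \pi_i(1-\pi_i)\left(\frac{1}{\pi_i^2} - \frac{1}{\tau^2}\right) y_i^2 - \left[\sum_{i \in U_2} \left(\frac{\pi_i}{\tau} - 1\right) y_i\right]^2$

The first sum is the variance saved by thresholding and is nonnegative because $\pi_i \leq \tau$ on $U_2$; the second term is the squared bias. Truncation lowers the MSE whenever the saving exceeds the squared bias.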

5. Practical Selection of Threshold and Implementation

Set $\tau$ as follows. Arrange the inclusion probabilities in increasing order, $\pi_{(1)} \leq \cdots \leq \pi_{(N)}$. Initialize $K = 0$; then, for $j = 1, 2, \ldots$, set $K = j$ as long as $\pi_{(j)} \leq 1/(j+1)$, stopping at the first $j$ for which the inequality fails. Finally, take $\tau = \pi_{(K)}$. For large populations, $K/N = O(n^{-1})$.
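The selection rule above can be sketched as a short scan over the sorted probabilities. The function name `select_threshold` is hypothetical, and the handling of $K = 0$ (returning `None` and applying no truncation) is an assumption, since the text does not define $\pi_{(0)}$.

```python
def select_threshold(pi):
    """Return (K, tau) following the scan described above.

    K counts the leading order statistics pi_(j) with pi_(j) <= 1 / (j + 1);
    the scan stops at the first j where the inequality fails.
    tau is pi_(K), or None when K == 0 (assumption: no truncation in that case).
    """
    ordered = sorted(pi)
    k = 0
    for j, p in enumerate(ordered, start=1):
        if p <= 1.0 / (j + 1):
            k = j
        else:
            break
    tau = ordered[k - 1] if k > 0 else None
    return k, tau


# Hypothetical inclusion probabilities for a five-unit population.
print(select_threshold([0.5, 0.01, 0.9, 0.02, 0.3]))  # (2, 0.02)
```

In the example, $\pi_{(1)} = 0.01 \leq 1/2$ and $\pi_{(2)} = 0.02 \leq 1/3$, but $\pi_{(3)} = 0.3 > 1/4$, so the scan stops with $K = 2$ and $\tau = 0.02$.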

The estimator is readily implemented: replace all $\pi_i < \tau$ with $\tau$, compute weights $1 / \pi_i^*$, and estimate the total as a weighted sum of $y_i$. Standard software for survey analysis can accommodate weight adjustments; the unbiased MSE estimator requires access to pairwise inclusion probabilities.
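The three steps above can be sketched as follows. This is a minimal illustration under assumed names: `iht_total`, its argument layout, and the sample values are introduced here and do not come from any particular survey package.

```python
def iht_total(y_s, pi_s, tau):
    """Thresholded HT estimate of the total from the sampled units.

    y_s, pi_s: values and first-order inclusion probabilities of sampled units.
    tau: positive threshold; probabilities below it are raised to tau.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    pi_star = [max(p, tau) for p in pi_s]            # step 1: replace pi_i < tau
    weights = [1.0 / p for p in pi_star]             # step 2: weights 1 / pi_i^*
    return sum(w * y for w, y in zip(weights, y_s))  # step 3: weighted sum


# Hypothetical sample: the first unit has a very small inclusion probability.
y_s = [12.0, 7.0, 30.0]
pi_s = [0.004, 0.25, 0.6]
print(iht_total(y_s, pi_s, tau=0.02))
```

With these numbers the classical weights would give $3000 + 28 + 50 = 3078$, while thresholding at $\tau = 0.02$ gives $600 + 28 + 50 = 678$ up to floating-point rounding.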

Empirically, the relative MSE reduction $(\operatorname{MSE}_{\rm HT} - \operatorname{MSE}_{\rm IHT})/\operatorname{MSE}_{\rm HT}$ ranges from 20% to 70% in simulations, and real-data applications show efficiency gains of 16–55% (Zong et al., 2018). In scenarios with extremely small inclusion probabilities, the variance control afforded by truncation can be critical.

6. Extension to Ratio Estimation

When an auxiliary variable $Z$ with known population total $t_z$ is available, the improved ratio estimator is obtained by replacing all $\pi_i$ by $\pi_i^*$:

$\widehat R^* = \frac{\widehat T_{y,\rm IHT}}{\widehat T_{z,\rm IHT}}, \qquad \widehat Y_R^* = t_z \widehat R^*$

Under smoothness and higher-order inclusion probability conditions, the MSE for these improved ratio estimators satisfies

$\operatorname{MSE}(\widehat R^*) \leq \operatorname{MSE}(\widehat R) + o(n^{-1})$

where $\widehat R$ denotes the ratio estimator built from the classical HT estimators; the same bound holds for $\widehat Y_R^*$ (Zong et al., 2018).
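A minimal sketch of the improved ratio estimator follows, assuming the threshold has already been chosen. The function name `iht_ratio_total` and the sample values are hypothetical and serve only to show the substitution of $\pi_i^*$ in both numerator and denominator.

```python
def iht_ratio_total(y_s, z_s, pi_s, tau, t_z):
    """Return (R_star, Y_R_star) using thresholded inclusion probabilities.

    y_s, z_s: sampled values of the study and auxiliary variables.
    pi_s: first-order inclusion probabilities of the sampled units.
    tau: threshold; t_z: known population total of the auxiliary variable.
    """
    pi_star = [max(p, tau) for p in pi_s]
    t_y_hat = sum(y / p for y, p in zip(y_s, pi_star))  # thresholded total of y
    t_z_hat = sum(z / p for z, p in zip(z_s, pi_star))  # thresholded total of z
    if t_z_hat == 0:
        raise ZeroDivisionError("estimated auxiliary total is zero")
    r_star = t_y_hat / t_z_hat
    return r_star, t_z * r_star


# Hypothetical sample in which y is exactly twice z, so the ratio is 2
# whatever the weights are.
r_star, y_total = iht_ratio_total(
    y_s=[2.0, 4.0, 6.0], z_s=[1.0, 2.0, 3.0],
    pi_s=[0.01, 0.2, 0.5], tau=0.05, t_z=100.0,
)
print(r_star, y_total)
```

Because the same weights appear in numerator and denominator, the ratio is less sensitive to the choice of $\tau$ than the total itself; in the proportional example above it does not depend on $\tau$ at all.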

7. Applicability, Robustness, and Empirical Performance

The truncated HT/IHT estimator is most advantageous in survey designs or population structures where certain units have very low selection probabilities, leading to instability in the classical estimator. Its effectiveness is largely independent of the relationship between $Y$ and $\pi$ : the reduction in MSE persists even when $Y$ is not related to selection. The bias introduced is typically insignificant relative to variance reduction, and practical implementation is straightforward, including for ratio estimation.

Empirical analysis on real datasets (e.g., the "Lucy" firm data) demonstrates robustness and efficiency: with sample sizes up to 30% of the population, MSE reductions of 16–55% were observed, and the optimal $K$ threshold adapts as the sampling fraction increases. The estimator is accompanied by an unbiased and tractable MSE estimator, and operational guidelines for threshold choice have emerged from both theoretical and empirical investigations.

A plausible implication is that in large, highly stratified or unequal-probability sampling, routine adoption of the IHT estimator may offer substantial efficiency gains with negligible bias, especially in applications where maximal control of estimator variance is sought.

For meta-analysis of zero-truncated count data, as in studies of post-bariatric-surgery suicide, the truncated HT framework is adapted to estimate the total number of studies, including those systematically missing due to zero outcomes. Inclusion probabilities are estimated via a zero-truncated count regression, and the HT formula is then applied to infer the unobserved population size. This approach enables unbiased estimation even under exclusion mechanisms, and a parametric bootstrap is used to quantify the total uncertainty (Dennett et al., 2023).
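The idea can be illustrated with a deliberately simplified model. The sketch below assumes an intercept-only zero-truncated Poisson distribution with a common rate for all studies, which is a stand-in for, not a reproduction of, the regression model of Dennett et al.; the function names and the counts are hypothetical, and no bootstrap step is included. Under this assumption the maximum-likelihood rate $\lambda$ solves $\bar x = \lambda / (1 - e^{-\lambda})$, each observed study has inclusion probability $1 - e^{-\lambda}$, and the HT formula sums the inverse probabilities.

```python
import math


def zt_poisson_rate(counts, iterations=200):
    """MLE of the Poisson rate from zero-truncated counts (all counts >= 1).

    Solves mean(counts) = lam / (1 - exp(-lam)) by bisection.
    """
    if not counts or min(counts) < 1:
        raise ValueError("counts must be non-empty and all at least 1")
    mean = sum(counts) / len(counts)
    if mean <= 1.0:
        raise ValueError("all counts equal 1: the rate is not estimable")
    # lam / (1 - exp(-lam)) increases from 1 and exceeds lam, so the root
    # lies strictly between 0 and the sample mean.
    lo, hi = 1e-12, mean
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ht_population_size(counts):
    """HT estimate of the number of studies, observed and unobserved."""
    lam = zt_poisson_rate(counts)
    p_observed = -math.expm1(-lam)  # P(count > 0), the inclusion probability
    return sum(1.0 / p_observed for _ in counts)


# Hypothetical event counts from eight studies that reported at least one event.
counts = [1, 1, 2, 1, 3, 1, 2, 1]
print(ht_population_size(counts))
```

The estimate always exceeds the number of observed studies, and the excess is the implied number of studies omitted because they recorded zero events.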