Difference-in-Means Estimator: Definition & Efficiency

Updated 30 June 2025
  • The difference-in-means estimator is an unbiased estimator of population means and average treatment effects under simple random sampling.
  • It serves as a baseline method in survey sampling and causal inference, enabling comparisons with more sophisticated estimators like PEML and GREG.
  • Incorporating auxiliary information can improve efficiency, as alternative estimators often achieve lower mean squared error than the basic difference-in-means estimator.

The difference-in-means estimator is a foundational tool for estimating the population mean or average treatment effect in survey sampling and experimental design. It is classically associated with the sample mean under simple random sampling without replacement (SRSWOR) and coincides with the Horvitz-Thompson estimator for equal-inclusion probability designs. The estimator continues to play a central role in modern causal inference and survey methodology, both as a baseline estimator and as a standard for gauging the efficiency of more sophisticated estimators that exploit auxiliary information.

1. Definition and Purpose

The difference-in-means estimator (often called the sample mean in the SRSWOR context) is defined for a finite population $\{Y_1, Y_2, \dots, Y_N\}$ and a sample $s$ of size $n$ as

$$\hat{\overline{Y}} = \frac{1}{n} \sum_{i \in s} Y_i.$$

Its purpose is to provide an unbiased estimate of the population mean

$$\overline{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i.$$

This estimator is fundamental in descriptive statistics, survey sampling, and causal inference (as the average treatment effect estimator in randomized experiments).
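
As a quick illustration (a minimal simulation sketch, not taken from the paper), the estimator's unbiasedness under SRSWOR can be checked empirically: averaging the sample mean over many independent draws recovers the population mean.

```python
import random
import statistics

def sample_mean_srswor(population, n, rng):
    """Draw a simple random sample without replacement and return its mean."""
    sample = rng.sample(population, n)
    return sum(sample) / n

rng = random.Random(0)
# Hypothetical finite population of N = 1000 values.
population = [rng.gauss(50.0, 10.0) for _ in range(1000)]
true_mean = statistics.mean(population)

# Averaging the estimator over many independent samples approximates its
# expectation, which should match the population mean (unbiasedness).
estimates = [sample_mean_srswor(population, 100, rng) for _ in range(5000)]
print(abs(statistics.mean(estimates) - true_mean))  # small, near 0
```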

2. Comparison with Other Estimators

The difference-in-means estimator is evaluated in the context of various estimators, including:

  • PEML (Pseudo Empirical Likelihood) Estimator: Leverages auxiliary information to potentially improve efficiency using empirical likelihood.
  • GREG (Generalized Regression) Estimator: Utilizes regression adjustment with auxiliary variables.
  • Plug-in Estimators: Estimate functions (variance, correlation, regression coefficients) of the mean by substituting estimates.
  • Ratio/Product Estimators: Use auxiliary variables to adjust the estimate when correlation is strong with the target variable.
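
To make the contrast with auxiliary-information estimators concrete, here is a minimal sketch (an illustrative toy example, not code from the paper) of the classical ratio estimator, which adjusts for a known auxiliary population mean:

```python
def ratio_estimate(y_sample, x_sample, x_pop_mean):
    """Classical ratio estimator of the population mean of Y: scale the
    known auxiliary population mean by the sample-level ratio of Y to X.
    Effective when Y and X are strongly positively correlated."""
    r = sum(y_sample) / sum(x_sample)
    return r * x_pop_mean

# Toy figures: sampled Y totals run at twice the sampled X totals,
# so the estimate is 2 * x_pop_mean.
print(ratio_estimate([4.0, 6.0], [2.0, 3.0], 10.0))  # → 20.0
```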

Key findings indicate:

  • The PEML estimator under SRSWOR achieves the lowest asymptotic MSE for the mean among considered estimators.
  • Plug-in PEML estimators outperform plug-in difference-in-means estimators for variance, correlation, and regression coefficients when $Y$ and the auxiliary variable $X$ are related.
  • Under high-entropy $\pi$PS designs (probability proportional to size), the plug-in Hájek estimator is optimal.

The difference-in-means estimator is asymptotically efficient under SRSWOR when no informative auxiliary information is available, but PEML and GREG estimators offer improvements when such information is pertinent.

3. Performance Metrics

The central performance metric for the difference-in-means and related estimators is the asymptotic mean squared error (MSE):

$$\mathrm{MSE}(\hat{\overline{Y}}) = E\left[\big(\hat{\overline{Y}} - \overline{Y}\big)^2\right].$$

The asymptotic variance under SRSWOR is given by

$$\Delta_1^2 = (1-\lambda) \lim_{\nu \rightarrow \infty} \left( S_w^2 - \left( \frac{S_{xw}}{S_x} \right)^2 \right),$$

where $\lambda = \lim n/N$, and $S_w^2$ and $S_{xw}$ denote the (super)population variance and covariance associated with $W$ (a function of $Y$).

Relative efficiency is assessed by

$$\mathrm{RE}(\hat{\theta}_1, P_1 \mid \hat{\theta}_2, P_2) = \frac{\mathrm{MSE}_{P_2}(\hat{\theta}_2)}{\mathrm{MSE}_{P_1}(\hat{\theta}_1)},$$

where values greater than 1 indicate greater efficiency of estimator 1.
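
The RE metric can be approximated by Monte Carlo. The sketch below (a hypothetical setup, not the paper's simulation study) takes a ratio estimator as estimator 1 and the difference-in-means estimator as estimator 2 under SRSWOR; because $Y$ is strongly related to the auxiliary $X$ here, the ratio of estimated MSEs comes out well above 1.

```python
import random
import statistics

rng = random.Random(1)
N, n, reps = 2000, 100, 3000

# Hypothetical superpopulation where Y is strongly related to auxiliary X.
x = [rng.uniform(1.0, 5.0) for _ in range(N)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
y_bar, x_bar = statistics.mean(y), statistics.mean(x)

mse_mean, mse_ratio = 0.0, 0.0
for _ in range(reps):
    idx = rng.sample(range(N), n)
    ys = [y[i] for i in idx]
    xs = [x[i] for i in idx]
    # Squared error of the difference-in-means (sample mean) estimator.
    mse_mean += (statistics.mean(ys) - y_bar) ** 2
    # Squared error of the ratio estimator using the known auxiliary mean.
    mse_ratio += (sum(ys) / sum(xs) * x_bar - y_bar) ** 2

# RE > 1 means estimator 1 (ratio) is more efficient in this setup.
re = (mse_mean / reps) / (mse_ratio / reps)
print(re)
```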

4. Sampling Designs

The estimator's properties are examined across sampling designs:

  • Simple Random Sampling Without Replacement (SRSWOR): The estimator is optimal and unbiased.
  • Lahiri-Midzuno-Sen (LMS): Probability proportional to size design.
  • High-Entropy $\pi$PS (HE$\pi$PS): General high-entropy designs (e.g., Rao-Sampford).
  • Rao-Hartley-Cochran (RHC): Stratified probability proportional to size.

Under SRSWOR, the difference-in-means estimator coincides with the HT estimator. In designs exploiting auxiliary information, estimators that ignore that information can be inefficient; plug-in variants (e.g., the Hájek estimator under HE$\pi$PS) are more appropriate.

5. Mathematical Formulation

The sample mean under SRSWOR:

$$\hat{\overline{Y}} = \frac{1}{n} \sum_{i \in s} Y_i$$

The general Horvitz-Thompson estimator for an arbitrary design:

$$\hat{\overline{Y}}_{HT} = \sum_{i \in s} \frac{Y_i}{N \pi_i}$$

Asymptotic distribution (for a function $g$ of the sample mean):

$$\sqrt{n}\,\big[g(\hat{\overline{h}}) - g(\overline{h})\big] \xrightarrow{\mathcal{L}} N(0, \Delta^2)$$

Equivalence classes are constructed to identify estimators with equivalent asymptotic MSEs under different designs.
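
A minimal sketch of the Horvitz-Thompson estimator (illustrative code assuming known first-order inclusion probabilities, not an implementation from the paper): under an equal-probability design with $\pi_i = n/N$ it reduces to the plain sample mean, consistent with the coincidence noted above.

```python
import random

def horvitz_thompson_mean(y_sample, pi_sample, N):
    """Horvitz-Thompson estimator of the population mean: each sampled
    value is weighted by the inverse of its inclusion probability."""
    return sum(yi / pi for yi, pi in zip(y_sample, pi_sample)) / N

N, n = 1000, 50
rng = random.Random(2)
y = [rng.gauss(10.0, 2.0) for _ in range(N)]
idx = rng.sample(range(N), n)
ys = [y[i] for i in idx]

# With equal inclusion probabilities pi_i = n/N, the HT estimator
# coincides with the ordinary sample mean.
ht = horvitz_thompson_mean(ys, [n / N] * n, N)
plain = sum(ys) / n
print(abs(ht - plain))  # essentially zero
```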

6. Applications and Implications

The difference-in-means estimator is widely used for:

  • Survey Sampling: Estimating averages, proportions, and more complex statistics in finite populations.
  • Causal Inference: As the canonical estimator for the average treatment effect in randomized studies.
  • Official Statistics, Public Health, and the Social Sciences: When inferring population-level averages from sampled data.

Implications derived from the paper include:

  • The difference-in-means estimator is robust, simple, and unbiased under SRSWOR, but does not utilize auxiliary information.
  • If auxiliary variables are strongly correlated with outcomes, PEML and GREG type estimators offer significant efficiency gains.
  • For secondary parameters beyond the mean (variance, correlation, regression), plug-in PEML estimators are consistently more efficient and do not suffer from issues (e.g., negative variance estimates) encountered by some alternatives.
  • The efficiency of the difference-in-means estimator may deteriorate under complex designs with unequal sampling probabilities, especially if auxiliary variables are ignored.
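
As an illustration of the plug-in idea mentioned above (a generic sketch, not the paper's PEML construction), a plug-in estimator of a function of means substitutes sample moments into the population-level formula, e.g. for the variance $\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2$:

```python
def plugin_variance(y_sample):
    """Plug-in estimator of the population variance: substitute sample
    moments for population moments in Var(Y) = E[Y^2] - (E[Y])^2."""
    n = len(y_sample)
    m1 = sum(y_sample) / n
    m2 = sum(yi * yi for yi in y_sample) / n
    return m2 - m1 * m1

print(plugin_variance([1.0, 2.0, 3.0, 4.0]))  # → 1.25
```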

Conclusion

The difference-in-means estimator is a foundational, unbiased estimator of the population mean under simple random sampling, with well-characterized asymptotic properties and robust performance in the absence of auxiliary information. Its efficiency is, however, superseded by contemporary estimators such as PEML and GREG in the presence of informative auxiliary variables and alternative sampling designs. The referenced paper rigorously develops comparative asymptotic MSE results, equivalence classes, and practical guidance for estimator selection in finite population settings.


Key LaTeX Formulas

Sample mean (difference-in-means):

$$\hat{\overline{Y}} = \frac{1}{n} \sum_{i \in s} Y_i$$

Horvitz-Thompson estimator (general design):

$$\hat{\overline{Y}}_{HT} = \sum_{i \in s} \frac{Y_i}{N \pi_i}$$

Asymptotic variance:

$$\Delta_1^2 = (1-\lambda) \lim_{\nu \rightarrow \infty} S_Y^2$$

Relative efficiency:

$$\mathrm{RE}(\hat{\overline{Y}}_1, P_1 \mid \hat{\overline{Y}}_2, P_2) = \frac{\mathrm{MSE}_{P_2}(\hat{\overline{Y}}_2)}{\mathrm{MSE}_{P_1}(\hat{\overline{Y}}_1)}$$


All content presented is based on the findings and results from "A comparison of estimators of mean and its functions in finite populations" by Anurag Dey and Probal Chaudhuri.
