
Combining Data from Surveys and Related Sources

Published 19 Oct 2022 in stat.ME and stat.AP | (2210.10830v1)

Abstract: To improve the precision of inferences and reduce costs there is considerable interest in combining data from several sources such as sample surveys and administrative data. Appropriate methodology is required to ensure satisfactory inferences since the target populations and methods for acquiring data may be quite different. To provide improved inferences we use methodology that has a more general structure than the ones in current practice. We start with the case where the analyst has only summary statistics from each of the sources. In our primary method, uncertain pooling, it is assumed that the analyst can regard one source, survey $r$, as the single best choice for inference. This method starts with the data from survey $r$ and adds data from those other sources that are shown to form clusters that include survey $r$. We also consider Dirichlet process mixtures, one of the most popular nonparametric Bayesian methods. We use analytical expressions and the results from numerical studies to show properties of the methodology.

Summary

  • The paper introduces "uncertain pooling," which combines summary data from multiple sources to improve inference precision while accounting for differences between sources.
  • Analyses show uncertain pooling provides substantial precision gains for a primary source's estimates, with up to a 29% reduction in estimate variability demonstrated by combining data from other sources.
  • The uncertain pooling method offers fully Bayesian inferences for combining data sources and is particularly applicable when covariate data is limited, providing a robust approach to account for differences across diverse sources.

The paper introduces a methodology for combining data from multiple sources, such as sample surveys and administrative data, to improve the precision of inferences while accounting for potential differences in target populations and data acquisition methods. The authors address the scenario where only summary statistics are available from each source. They introduce a more general structure than current survey sampling methods to provide improved inferences. The approach focuses on "uncertain pooling," where one data source is considered the single best choice for inference and is augmented with data from other sources that form clusters including the primary survey. The authors also explore Dirichlet process mixtures (DPM), a non-parametric Bayesian method. The properties of the methodology are demonstrated through analytical expressions and numerical studies.

The motivation stems from a study of health insurance coverage in Florida counties, where significantly different estimates were observed across three surveys. The methodology is applicable when covariate data is limited. The authors assume that the survey estimates $Y_i$ are independent and normally distributed, $Y_i \sim N(\mu_i, V_i)$, where $V_i$ is known. A common prior distribution expresses similarity among the $\mu_i$: $\mu_i \mid \nu, \delta^2 \sim N(\nu, \delta^2)$, independently for each $i$, with $\nu$ and $\delta$ assigned locally uniform prior distributions.

Here is a list of the variables in the previous equations:

  • $Y_i$: Survey estimates
  • $\mu_i$: Mean of the normal distribution for survey $i$
  • $V_i$: Variance of the normal distribution for survey $i$
  • $\nu$: Mean of the prior distribution
  • $\delta^2$: Variance of the prior distribution

The resulting posterior expected value of $\mu_i$ is a convex combination of the estimate $Y_i$ and a weighted average of $\{Y_1, \ldots, Y_L\}$. The authors argue that the standard approach's assumption of independent sampling from a common distribution may lead to unsatisfactory inferences due to its inflexibility. The paper uses more flexible prior distributions to allow the sample data to determine the degree and nature of pooling.
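As a concrete illustration, the pool-all posterior mean can be sketched in a few lines, conditioning on a fixed value of $\delta^2$ (the full method treats $\delta^2$ as unknown and integrates over it; the function name and example numbers below are hypothetical, not from the paper):

```python
import numpy as np

def pool_all_posterior_mean(y, V, delta2):
    """Posterior means of mu_i under the basic 'pool-all' model,
    conditional on a fixed delta^2 (an illustration only: the paper
    treats delta^2 as unknown and integrates over it)."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    lam = delta2 / (delta2 + V)                # shrinkage weights
    mu_hat = np.sum(lam * y) / np.sum(lam)     # precision-weighted pooled mean
    return lam * y + (1.0 - lam) * mu_hat      # convex combination

# hypothetical estimates and known sampling variances for three surveys
post = pool_all_posterior_mean([0.20, 0.24, 0.31], [0.004, 0.001, 0.002],
                               delta2=0.002)
```

Each posterior mean lies between the survey's own estimate and the pooled mean, with the weight on the survey's own estimate growing as $V_i$ shrinks.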

The uncertain pooling method builds upon prior work and assumes that subsets of $\mu = (\mu_1, \mu_2, \ldots, \mu_L)$ are "similar" and that there is uncertainty about the composition of these subsets. Given $G$ total partitions of the set $\pounds = \{1, \ldots, L\}$, a particular partition $g$, the number of subsets $d(g)$ in the $g^{th}$ partition, and $S_k(g)$ as the set of survey labels in subset $k$, the authors condition on $g$. They assume independence between subsets, and within $S_k(g)$ the $\mu_i$ are independent with $\mu_i \mid \nu_k(g) \sim N(\nu_k(g), \delta^2_k(g))$, $i \in S_k(g)$. The $\nu_k(g)$ are mutually independent with $\nu_k(g) \mid \theta_k(g) \sim N(\theta_k(g), \gamma^2(g))$, where $\theta_k(g)$ and $\gamma^2(g)$ are hyperparameters. The $\delta^2_k(g)$ are also hyperparameters with assigned prior distributions.

Conditioning on $\theta_k(g)$ and $\gamma^2(g)$ and letting $\gamma^2(g)$ approach infinity, the expected posterior moments conditional on partition $g$ are derived. Letting $y = (Y_1, \ldots, Y_L)$, $\Delta^2 = \{\delta^2_k(g) : k = 1, \ldots, d(g);\ g = 1, \ldots, G\}$, and assuming $Y_i = \hat{Y}_i$, the conditional posterior expectation is:

$E(\mu_i \mid y, g, \Delta^2) = \{\lambda_i(g)\} \hat{Y}_i + \{1 - \lambda_i(g)\} \hat{\mu}_k(g), \quad i \in S_k(g)$

Here is a list of the variables in the previous equations:

  • $E(\mu_i \mid y, g, \Delta^2)$: Conditional posterior expectation of $\mu_i$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $\hat{Y}_i$: Estimate from survey $i$
  • $\hat{\mu}_k(g)$: Estimate of the mean for cluster $k$ in partition $g$

The posterior covariance is:

$$\mathrm{cov}(\mu_i, \mu_j \mid y, g, \Delta^2) = \begin{cases} \delta^2(g)\{1 - \lambda_i(g)\} + \{1 - \lambda_i(g)\}^2 \dfrac{\delta^2(g)}{\sum_{i' \in S_k(g)} \lambda_{i'}(g)}, & i = j;\ i \in S_k(g) \\ \{1 - \lambda_i(g)\}\{1 - \lambda_j(g)\} \dfrac{\delta^2(g)}{\sum_{i' \in S_k(g)} \lambda_{i'}(g)}, & i \neq j;\ i, j \in S_k(g) \\ 0, & i \in S_{k_1}(g),\ j \in S_{k_2}(g),\ k_1 \neq k_2 \end{cases}$$

Here is a list of the variables in the previous equations:

  • $\mathrm{cov}(\mu_i, \mu_j \mid y, g, \Delta^2)$: Conditional posterior covariance between $\mu_i$ and $\mu_j$
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $S_k(g)$: Set of survey labels in subset $k$

where

$\lambda_i(g) = \dfrac{\delta^2(g)}{\delta^2(g) + V_i}$

$\hat{\mu}_k(g) = \dfrac{\sum_{j \in S_k(g)} \lambda_j(g) Y_j}{\sum_{j \in S_k(g)} \lambda_j(g)}$

Here is a list of the variables in the previous equations:

  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $V_i$: Variance for survey $i$
  • $\hat{\mu}_k(g)$: Estimated mean for cluster $k$ in partition $g$
  • $Y_j$: Estimate from survey $j$
  • $S_k(g)$: Set of survey labels in subset $k$
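Putting the formulas above together, a minimal sketch computes $\lambda_i(g)$, $\hat{\mu}_k(g)$, and the conditional posterior mean and variance for a given partition (the partition encoding, the common $\delta^2$ per partition, and the example values are simplifying assumptions, not the authors' code):

```python
import numpy as np

def conditional_posterior(y, V, partition, delta2):
    """Conditional posterior moments of mu given a partition g and a
    common delta^2(g): a sketch following the displayed formulas, not
    the authors' code. `partition` is a list of clusters of indices."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    mean, var = np.empty_like(y), np.empty_like(y)
    for S in partition:
        S = np.asarray(S)
        lam = delta2 / (delta2 + V[S])                # lambda_i(g)
        mu_k = np.sum(lam * y[S]) / np.sum(lam)       # mu_hat_k(g)
        mean[S] = lam * y[S] + (1 - lam) * mu_k       # E(mu_i | y, g, Delta^2)
        # delta^2(g){1 - lambda_i(g)} is the within-cluster variance; the
        # second term propagates uncertainty about the cluster mean
        var[S] = delta2 * (1 - lam) + (1 - lam) ** 2 * delta2 / np.sum(lam)
    return mean, var

# hypothetical data: cluster surveys 0 and 1 together, survey 2 alone
m, v = conditional_posterior([0.20, 0.24, 0.31], [0.004, 0.001, 0.002],
                             partition=[[0, 1], [2]], delta2=0.002)
```

A useful check: for a singleton cluster the formulas collapse to mean $Y_i$ and variance $V_i$, i.e. no pooling.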

The authors note that the basic model corresponds to the "pool-all" partition, $\{g = 1\}$, where all $L$ surveys are in a single cluster. The inference about $\mu$ incorporates uncertainty about the value of $g$:

$f(\mu \mid y) = \int \int f(\mu \mid y, g, \Delta^2) f(g, \Delta^2 \mid y) \, dg \, d\Delta^2$

Here is a list of the variables in the previous equations:

  • $f(\mu \mid y)$: Posterior distribution of $\mu$ given data $y$
  • $f(\mu \mid y, g, \Delta^2)$: Conditional posterior distribution of $\mu$
  • $f(g, \Delta^2 \mid y)$: Posterior distribution of partition $g$ and variance parameters $\Delta^2$ given data $y$

The paper addresses the challenge of specifying the rate at which $\gamma^2(g)$ approaches infinity when evaluating $f(g \mid \Delta^2, y)$. The authors use a fully Bayesian alternative that postulates little prior information about the $\nu_k(g)$ and is invariant to changes in the scale of $Y$. Given a prior $f(g, \Delta^2) = f(g) f(\Delta^2)$ and letting $\gamma^2(g)$ approach infinity subject to a constant Kullback-Leibler information about $\nu(g)$, the posterior distribution is proportional to:

$f(g, \Delta^2 \mid y) \propto f(\Delta^2) f(g) \exp\{-d(g)/2\} \prod_{k=1}^{d(g)} \prod_{i \in S_k(g)} \{1 - \lambda_i(g)\}^{1/2} \times \exp\left(-\frac{1}{2} \sum_{k=1}^{d(g)} \frac{1}{\delta^2(g)} \sum_{i \in S_k(g)} \lambda_i(g) \{Y_i - \hat{\mu}_k(g)\}^2\right)$

Here is a list of the variables in the previous equations:

  • $f(g, \Delta^2 \mid y)$: Posterior distribution of partition $g$ and variance parameters $\Delta^2$ given data $y$
  • $f(\Delta^2)$: Prior distribution of the variance parameters $\Delta^2$
  • $f(g)$: Prior distribution of partition $g$
  • $d(g)$: Number of subsets in partition $g$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $Y_i$: Estimate from survey $i$
  • $\hat{\mu}_k(g)$: Estimated mean for cluster $k$ in partition $g$
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $S_k(g)$: Set of survey labels in subset $k$

The term in the exponent, $Q\{d(g)\} = \sum_{k=1}^{d(g)} \frac{1}{\delta^2(g)} \sum_{i \in S_k(g)} \lambda_i(g) \{Y_i - \hat{\mu}_k(g)\}^2$, tends to decrease as $d(g)$ increases. The term $\exp\{-d(g)/2\}$ penalizes partitions with larger values of $d(g)$. The authors assume $f(g)$ is constant and use an Inverse Beta prior for $\delta^2$: $f(\delta^2) \propto 1/(1 + \delta^2)$, $0 < \delta^2 < \infty$.
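A direct transcription of the unnormalized posterior is straightforward to evaluate for small $L$; the sketch below enumerates all set partitions and computes $\log f(g, \delta^2 \mid y)$ on a $\delta^2$ grid (the grid range and example data are hypothetical assumptions, not the authors' code):

```python
import numpy as np

def partitions(idx):
    """Enumerate all set partitions of a list of labels."""
    if not idx:
        yield []
        return
    first, rest = idx[0], idx[1:]
    for p in partitions(rest):
        for k in range(len(p)):
            yield p[:k] + [[first] + p[k]] + p[k + 1:]
        yield [[first]] + p

def log_post(y, V, part, delta2):
    """Unnormalized log f(g, delta^2 | y) with f(g) constant and the
    Inverse Beta prior f(delta^2) = 1/(1 + delta^2): a transcription
    of the displayed formula, not the authors' code."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    lp = -np.log1p(delta2) - 0.5 * len(part)      # prior on delta2, exp{-d(g)/2}
    for S in part:
        lam = delta2 / (delta2 + V[S])
        mu_k = np.sum(lam * y[S]) / np.sum(lam)
        lp += 0.5 * np.sum(np.log(1.0 - lam))     # prod {1 - lambda_i(g)}^{1/2}
        lp -= 0.5 * np.sum(lam * (y[S] - mu_k) ** 2) / delta2  # Q{d(g)} term
    return lp

# hypothetical data: three survey estimates with known variances
y, V = np.array([0.20, 0.24, 0.31]), np.array([0.004, 0.001, 0.002])
parts = list(partitions([0, 1, 2]))               # 5 partitions when L = 3
grid = np.linspace(1e-4, 0.05, 200)               # delta^2 grid (assumed range)
logp = np.array([[log_post(y, V, p, d2) for p in parts] for d2 in grid])
w = np.exp(logp - logp.max())
w /= w.sum()                                       # normalized joint posterior
```

Summing the normalized weights over the grid rows or columns gives the marginal approximations to $f(g \mid y)$ and $f(\delta^2 \mid y)$ described next.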

The authors approximate $f(g, \delta^2 \mid y)$ by evaluating the right side of the equation above for a grid of values and standardizing. A random sample of size $B$ is drawn from the normalized values, and for each selection $(g^*, \delta^{2*})$, $\mu$ is sampled from $f(\mu \mid y, g^*, \delta^{2*})$. The marginal posterior distributions, $f(g \mid y)$ and $f(\delta^2 \mid y)$, are approximated directly from the grid.

Assuming survey rr is the single best choice for inference, the posterior distribution corresponding to survey rr is the object of inference:

$E(\mu_r \mid y) = E_{g,\delta^2 \mid y}\, E(\mu_r \mid y, g, \delta^2)$

Here is a list of the variables in the previous equations:

  • $E(\mu_r \mid y)$: Expected value of $\mu_r$ given data $y$
  • $E_{g,\delta^2 \mid y}$: Expectation over $g$ and $\delta^2$ given data $y$
  • $E(\mu_r \mid y, g, \delta^2)$: Conditional expectation of $\mu_r$

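On the discrete grid, the iterated expectation reduces to a weighted average of conditional moments; a minimal sketch with hypothetical grid weights and conditional moments (the law of total variance gives the matching posterior variance):

```python
import numpy as np

# hypothetical normalized grid weights f(g, delta^2 | y) over three
# (partition, delta^2) grid points, with matching conditional moments
weights    = np.array([0.50, 0.30, 0.20])
cond_means = np.array([0.215, 0.222, 0.240])   # E(mu_r | y, g, delta^2)
cond_vars  = np.array([0.0009, 0.0011, 0.0016])

post_mean_r = np.sum(weights * cond_means)     # E(mu_r | y)
# law of total variance: mean of variances plus variance of means
post_var_r = np.sum(weights * (cond_vars + cond_means ** 2)) - post_mean_r ** 2
```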
The authors compare their uncertain pooling method with the model by Chakraborty et al. (2014), noting that while the latter can treat outliers, it does not exploit potential clustering of the μi\mu_i.

The paper then introduces the DPM as an alternative. The model is specified as $y_i \mid \theta_i \stackrel{iid}{\sim} f_{\theta_i}$ and $\theta_i \stackrel{iid}{\sim} H$, with $H \sim DP(M, H_0)$. Here, $y_i = Y_i$, $\theta_i = \mu_i$, $f_{\theta_i}$ is the pdf of a $N(\mu_i, V_i)$ random variable with fixed $V_i$, and $H_0 = N(\eta, \tau^2)$. The hyperparameters are assigned independent distributions: $M \mid a_0, b_0 \sim Gamma(a_0, b_0)$, $\eta \mid n_b, S_t \sim N(n_b, S_t)$, and $\tau^{-2} \mid \phi_1, \phi_2 \sim Gamma(\phi_1/2, \phi_2/2)$. The authors note the value of extending typical random effects models by using a DPM.

Here is a list of the variables in the previous equations:

  • $y_i$: Observed data point for individual $i$
  • $\theta_i$: Parameter associated with individual $i$
  • $f_{\theta_i}$: Probability density function of the data given the parameter
  • $H$: Distribution from which the parameters are drawn
  • $DP(M, H_0)$: Dirichlet process with base distribution $H_0$ and concentration parameter $M$
  • $M$: Concentration parameter of the Dirichlet process
  • $H_0$: Base distribution of the Dirichlet process
  • $\eta$: Mean of the base distribution
  • $\tau^2$: Variance of the base distribution
  • $a_0, b_0$: Parameters of the Gamma distribution for $M$
  • $n_b, S_t$: Parameters of the Normal distribution for $\eta$
  • $\phi_1, \phi_2$: Parameters of the Gamma distribution for $\tau^{-2}$

Unlike uncertain pooling, DPmeta requires substantial prior input, which can be problematic with a small number of surveys. The authors follow Escobar (1994) and make inferences for a selected set of values of MM. They replace nbn_b, StS_t, ϕ1\phi_1, and ϕ2\phi_2 with their maximum a posteriori probability estimates.

The authors use data from Ha and Sedransk (2019) on health insurance coverage in Florida counties, along with modifications of these data, to demonstrate the benefits of their methodology. The three data sources are the Small Area Health Insurance Estimates Program (SAHIE), a survey by Ha and Sedransk (HS), and the Centers for Disease Control and Prevention (CDC). The SAHIE program uses point estimates from the American Community Survey (ACS) along with administrative data. The HS and CDC surveys use unit-level models based on data from the Behavioral Risk Factor Surveillance System (BRFSS).

The authors analyze the data using both uncertain pooling and DPmeta. They also conduct a simulation study to establish sampling properties. Each analysis is based only on data from a single county.

In the data-based analyses, a summary of results for Dixie County using uncertain pooling indicates little support for pooling data from all three surveys. The posterior probability for the "pool-all" partition is very low. Assuming a common source and using a locally uniform prior on $\nu$ and the Inverse Beta prior on $\delta^2$, inferences based on the posterior distribution of $\nu$ are inconsistent with the notion that any one of the three surveys is the nominal "gold standard." The uncertain pooling methodology can provide substantial gains in precision, measured by the posterior standard deviation, compared to using only the data from a specific survey.

The authors note that the small standard errors for each of the surveys mean that the "all singletons" partition has a relatively large posterior probability. When the CDC standard error is $k$ times the HS standard error, the reductions in the posterior standard deviation for HS are 29%, 18%, and 7% for $k = 0.5, 1, 2$, respectively.

Due to the need to specify numerous hyperparameters in DPmeta and the lack of prior information, the authors replace some hyperparameters with their maximum a posteriori probability estimates and consider a range of values for MM. Results from uncertain pooling and DPmeta are generally in close agreement, with similar posterior means and standard deviations.

To address the limitation of small SAHIE standard errors, the authors modify the data sets and perform additional analyses. These analyses demonstrate that the uncertain pooling methodology appropriately accounts for the increased variability associated with the SAHIE estimates. Comparing uncertain pooling and DPmeta results with modified data reveals greater differences than in the original analyses. DPmeta exhibits greater pooling of data from survey 1 with survey 3 and smaller posterior standard deviations for survey 1.

The simulation study, based on modifications of the Orange County data, generates data from three normal distributions with parameters chosen to represent a common situation where survey 1 is a probability sample with relatively large sample variance, while surveys 2 and 3 are non-probability samples with much smaller sample variances. The simulation results indicate that the medians of the posterior means are close to the true values, the coverages are close to the nominal 95%, and there are significant reductions in the posterior standard deviation for survey 1.

The authors conclude that both uncertain pooling and DPM methods provide appropriate inferences, but their analyses based on uncertain pooling are fully Bayes, while those from DPmeta are empirical Bayes. The uncertain pooling method also provides additional information in the form of posterior probabilities for the partitions. They also discuss the challenges of making inferences for the sample variances and suggest avenues for future research, including extending the uncertain pooling methodology to small area inference, particularly when data from several surveys are available.
