
Combining Data from Surveys and Related Sources

Published 19 Oct 2022 in stat.ME and stat.AP | (2210.10830v1)

Abstract: To improve the precision of inferences and reduce costs there is considerable interest in combining data from several sources such as sample surveys and administrative data. Appropriate methodology is required to ensure satisfactory inferences since the target populations and methods for acquiring data may be quite different. To provide improved inferences we use methodology that has a more general structure than the ones in current practice. We start with the case where the analyst has only summary statistics from each of the sources. In our primary method, uncertain pooling, it is assumed that the analyst can regard one source, survey $r$, as the single best choice for inference. This method starts with the data from survey $r$ and adds data from those other sources that are shown to form clusters that include survey $r$. We also consider Dirichlet process mixtures, one of the most popular nonparametric Bayesian methods. We use analytical expressions and the results from numerical studies to show properties of the methodology.

Summary

  • The paper introduces "uncertain pooling," which combines summary data from multiple sources to improve inference precision while accounting for differences between sources.
  • Analyses show uncertain pooling provides substantial precision gains for a primary source's estimates, with up to a 29% reduction in estimate variability demonstrated by combining data from other sources.
  • The uncertain pooling method offers fully Bayesian inferences for combining data sources and is particularly applicable when covariate data is limited, providing a robust approach to account for differences across diverse sources.

The paper introduces a methodology for combining data from multiple sources, such as sample surveys and administrative data, to improve the precision of inferences while accounting for potential differences in target populations and data acquisition methods. The authors address the scenario where only summary statistics are available from each source. They introduce a more general structure than current survey sampling methods to provide improved inferences. The approach focuses on "uncertain pooling," where one data source is considered the single best choice for inference and is augmented with data from other sources that form clusters including the primary survey. The authors also explore Dirichlet process mixtures (DPM), a non-parametric Bayesian method. The properties of the methodology are demonstrated through analytical expressions and numerical studies.

The motivation stems from a study of health insurance coverage in Florida counties, where significantly different estimates were observed across three surveys. The methodology is applicable when covariate data is limited. The authors assume that the survey estimates $Y_i$ are independent and normally distributed, $Y_i \sim N(\mu_i, V_i)$, where $V_i$ is known. A common prior distribution expresses similarity among the $\mu_i$: $\mu_i \mid \nu, \delta^2 \sim N(\nu, \delta^2)$, independently for each $i$, with $\nu$ and $\delta$ assigned locally uniform prior distributions.

Here is a list of the variables in the previous equations:

  • $Y_i$: Survey estimates
  • $\mu_i$: Mean of the normal distribution for survey $i$
  • $V_i$: Variance of the normal distribution for survey $i$
  • $\nu$: Mean of the prior distribution
  • $\delta^2$: Variance of the prior distribution

The resulting posterior expected value of $\mu_i$ is a convex combination of the estimate $Y_i$ and a weighted average of $\{Y_1, \ldots, Y_L\}$. The authors argue that the standard approach's assumption of independent sampling from a common distribution may lead to unsatisfactory inferences due to its inflexibility. The paper uses more flexible prior distributions to allow the sample data to determine the degree and nature of pooling.
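As a concrete illustration, the pool-all posterior mean can be sketched in a few lines, conditioning on a fixed value of $\delta^2$ (the full method treats $\delta^2$ as unknown and integrates over it; the function name and example numbers below are hypothetical, not from the paper):

```python
import numpy as np

def pool_all_posterior_mean(y, V, delta2):
    """Posterior means of mu_i under the basic 'pool-all' model,
    conditional on a fixed delta^2 (an illustration only: the paper
    treats delta^2 as unknown and integrates over it)."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    lam = delta2 / (delta2 + V)                # shrinkage weights
    mu_hat = np.sum(lam * y) / np.sum(lam)     # precision-weighted pooled mean
    return lam * y + (1.0 - lam) * mu_hat      # convex combination

# hypothetical estimates and known sampling variances for three surveys
post = pool_all_posterior_mean([0.20, 0.24, 0.31], [0.004, 0.001, 0.002],
                               delta2=0.002)
```

Each posterior mean lies between the survey's own estimate and the pooled mean, with the weight on the survey's own estimate growing as $V_i$ shrinks.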

The uncertain pooling method builds upon prior work and assumes that subsets of $\mu = (\mu_1, \mu_2, \ldots, \mu_L)$ are "similar" and that there is uncertainty about the composition of these subsets. Given $G$ total partitions of the set $\pounds = \{1, \ldots, L\}$, a particular partition $g$, the number of subsets $d(g)$ in the $g^{th}$ partition, and $S_k(g)$ as the set of survey labels in subset $k$, the authors condition on $g$. They assume independence between subsets, and within $S_k(g)$ the $\mu_i$ are independent with $\mu_i \mid \nu_k(g) \sim N(\nu_k(g), \delta^2_k(g))$, $i \in S_k(g)$. The $\nu_k(g)$ are mutually independent with $\nu_k(g) \mid \theta_k(g) \sim N(\theta_k(g), \gamma^2(g))$, where $\theta_k(g)$ and $\gamma^2(g)$ are hyperparameters. The $\delta^2_k(g)$ are also hyperparameters with assigned prior distributions.

Conditioning on $\theta_k(g)$ and $\gamma^2(g)$ and letting $\gamma^2(g)$ approach infinity, the expected posterior moments conditional on partition $g$ are derived. Letting $y = (Y_1, \ldots, Y_L)$, $\Delta^2 = \{\delta^2_k(g) : k = 1, \ldots, d(g);\ g = 1, \ldots, G\}$, and assuming $Y_i = \hat{Y}_i$, the conditional posterior expectation is:

$E(\mu_i \mid y, g, \Delta^2) = \{\lambda_i(g)\} \hat{Y}_i + \{1 - \lambda_i(g)\} \hat{\mu}_k(g), \quad i \in S_k(g)$

Here is a list of the variables in the previous equations:

  • $E(\mu_i \mid y, g, \Delta^2)$: Conditional posterior expectation of $\mu_i$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $\hat{Y}_i$: Estimate from survey $i$
  • $\hat{\mu}_k(g)$: Estimate of the mean for cluster $k$ in partition $g$

The posterior covariance is:

$$\mathrm{cov}(\mu_i, \mu_j \mid y, g, \Delta^2) = \begin{cases} \delta^2(g)\{1 - \lambda_i(g)\} + \{1 - \lambda_i(g)\}^2 \dfrac{\delta^2(g)}{\sum_{i' \in S_k(g)} \lambda_{i'}(g)}, & i = j;\ i \in S_k(g) \\ \{1 - \lambda_i(g)\}\{1 - \lambda_j(g)\} \dfrac{\delta^2(g)}{\sum_{i' \in S_k(g)} \lambda_{i'}(g)}, & i \neq j;\ i, j \in S_k(g) \\ 0, & i \in S_{k_1}(g),\ j \in S_{k_2}(g),\ k_1 \neq k_2 \end{cases}$$

Here is a list of the variables in the previous equations:

  • $\mathrm{cov}(\mu_i, \mu_j \mid y, g, \Delta^2)$: Conditional posterior covariance between $\mu_i$ and $\mu_j$
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $S_k(g)$: Set of survey labels in subset $k$

where

$\lambda_i(g) = \dfrac{\delta^2(g)}{\delta^2(g) + V_i}$

$\hat{\mu}_k(g) = \dfrac{\sum_{j \in S_k(g)} \lambda_j(g) Y_j}{\sum_{j \in S_k(g)} \lambda_j(g)}$

Here is a list of the variables in the previous equations:

  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $V_i$: Variance for survey $i$
  • $\hat{\mu}_k(g)$: Estimated mean for cluster $k$ in partition $g$
  • $Y_j$: Estimate from survey $j$
  • $S_k(g)$: Set of survey labels in subset $k$
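Putting the formulas above together, a minimal sketch computes $\lambda_i(g)$, $\hat{\mu}_k(g)$, and the conditional posterior mean and variance for a given partition (the partition encoding, the common $\delta^2$ per partition, and the example values are simplifying assumptions, not the authors' code):

```python
import numpy as np

def conditional_posterior(y, V, partition, delta2):
    """Conditional posterior moments of mu given a partition g and a
    common delta^2(g): a sketch following the displayed formulas, not
    the authors' code. `partition` is a list of clusters of indices."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    mean, var = np.empty_like(y), np.empty_like(y)
    for S in partition:
        S = np.asarray(S)
        lam = delta2 / (delta2 + V[S])                # lambda_i(g)
        mu_k = np.sum(lam * y[S]) / np.sum(lam)       # mu_hat_k(g)
        mean[S] = lam * y[S] + (1 - lam) * mu_k       # E(mu_i | y, g, Delta^2)
        # delta^2(g){1 - lambda_i(g)} is the within-cluster variance; the
        # second term propagates uncertainty about the cluster mean
        var[S] = delta2 * (1 - lam) + (1 - lam) ** 2 * delta2 / np.sum(lam)
    return mean, var

# hypothetical data: cluster surveys 0 and 1 together, survey 2 alone
m, v = conditional_posterior([0.20, 0.24, 0.31], [0.004, 0.001, 0.002],
                             partition=[[0, 1], [2]], delta2=0.002)
```

A useful check: for a singleton cluster the formulas collapse to mean $Y_i$ and variance $V_i$, i.e. no pooling.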

The authors note that the basic model corresponds to the "pool-all" partition, $\{g = 1\}$, where all $L$ surveys are in a single cluster. The inference about $\mu$ incorporates uncertainty about the value of $g$:

$f(\mu \mid y) = \int \int f(\mu \mid y, g, \Delta^2) f(g, \Delta^2 \mid y) \, dg \, d\Delta^2$

Here is a list of the variables in the previous equations:

  • $f(\mu \mid y)$: Posterior distribution of $\mu$ given data $y$
  • $f(\mu \mid y, g, \Delta^2)$: Conditional posterior distribution of $\mu$
  • $f(g, \Delta^2 \mid y)$: Posterior distribution of partition $g$ and variance parameters $\Delta^2$ given data $y$

The paper addresses the challenge of specifying the rate at which $\gamma^2(g)$ approaches infinity when evaluating $f(g \mid \Delta^2, y)$. The authors use a fully Bayesian alternative that postulates little prior information about the $\nu_k(g)$ and is invariant to changes in the scale of $Y$. Given a prior $f(g, \Delta^2) = f(g) f(\Delta^2)$ and letting $\gamma^2(g)$ approach infinity subject to a constant Kullback-Leibler information about $\nu(g)$, the posterior distribution is proportional to:

$f(g, \Delta^2 \mid y) \propto f(\Delta^2) f(g) \exp\{-d(g)/2\} \prod_{k=1}^{d(g)} \prod_{i \in S_k(g)} \{1 - \lambda_i(g)\}^{1/2} \times \exp\left(-\frac{1}{2} \sum_{k=1}^{d(g)} \frac{1}{\delta^2(g)} \sum_{i \in S_k(g)} \lambda_i(g) \{Y_i - \hat{\mu}_k(g)\}^2\right)$

Here is a list of the variables in the previous equations:

  • $f(g, \Delta^2 \mid y)$: Posterior distribution of partition $g$ and variance parameters $\Delta^2$ given data $y$
  • $f(\Delta^2)$: Prior distribution of the variance parameters $\Delta^2$
  • $f(g)$: Prior distribution of partition $g$
  • $d(g)$: Number of subsets in partition $g$
  • $\lambda_i(g)$: Weighting factor for the individual survey estimate
  • $Y_i$: Estimate from survey $i$
  • $\hat{\mu}_k(g)$: Estimated mean for cluster $k$ in partition $g$
  • $\delta^2(g)$: Variance parameter for partition $g$
  • $S_k(g)$: Set of survey labels in subset $k$

The term in the exponent, $Q\{d(g)\} = \sum_{k=1}^{d(g)} \frac{1}{\delta^2(g)} \sum_{i \in S_k(g)} \lambda_i(g) \{Y_i - \hat{\mu}_k(g)\}^2$, tends to decrease as $d(g)$ increases. The term $\exp\{-d(g)/2\}$ penalizes partitions with larger values of $d(g)$. The authors assume $f(g)$ is constant and use an Inverse Beta prior for $\delta^2$: $f(\delta^2) \propto 1/(1 + \delta^2)$, $0 < \delta^2 < \infty$.
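A direct transcription of the unnormalized posterior is straightforward to evaluate for small $L$; the sketch below enumerates all set partitions and computes $\log f(g, \delta^2 \mid y)$ on a $\delta^2$ grid (the grid range and example data are hypothetical assumptions, not the authors' code):

```python
import numpy as np

def partitions(idx):
    """Enumerate all set partitions of a list of labels."""
    if not idx:
        yield []
        return
    first, rest = idx[0], idx[1:]
    for p in partitions(rest):
        for k in range(len(p)):
            yield p[:k] + [[first] + p[k]] + p[k + 1:]
        yield [[first]] + p

def log_post(y, V, part, delta2):
    """Unnormalized log f(g, delta^2 | y) with f(g) constant and the
    Inverse Beta prior f(delta^2) = 1/(1 + delta^2): a transcription
    of the displayed formula, not the authors' code."""
    y, V = np.asarray(y, float), np.asarray(V, float)
    lp = -np.log1p(delta2) - 0.5 * len(part)      # prior on delta2, exp{-d(g)/2}
    for S in part:
        lam = delta2 / (delta2 + V[S])
        mu_k = np.sum(lam * y[S]) / np.sum(lam)
        lp += 0.5 * np.sum(np.log(1.0 - lam))     # prod {1 - lambda_i(g)}^{1/2}
        lp -= 0.5 * np.sum(lam * (y[S] - mu_k) ** 2) / delta2  # Q{d(g)} term
    return lp

# hypothetical data: three survey estimates with known variances
y, V = np.array([0.20, 0.24, 0.31]), np.array([0.004, 0.001, 0.002])
parts = list(partitions([0, 1, 2]))               # 5 partitions when L = 3
grid = np.linspace(1e-4, 0.05, 200)               # delta^2 grid (assumed range)
logp = np.array([[log_post(y, V, p, d2) for p in parts] for d2 in grid])
w = np.exp(logp - logp.max())
w /= w.sum()                                       # normalized joint posterior
```

Summing the normalized weights over the grid rows or columns gives the marginal approximations to $f(g \mid y)$ and $f(\delta^2 \mid y)$ described next.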

The authors approximate $f(g, \delta^2 \mid y)$ by evaluating the right side of the equation above for a grid of values and standardizing. A random sample of size $B$ is drawn from the normalized values, and for each selection $(g^*, \delta^{2*})$, $\mu$ is sampled from $f(\mu \mid y, g^*, \delta^{2*})$. The marginal posterior distributions, $f(g \mid y)$ and $f(\delta^2 \mid y)$, are approximated directly from the grid.

Assuming survey rr is the single best choice for inference, the posterior distribution corresponding to survey rr is the object of inference:

$E(\mu_r \mid y) = E_{g,\delta^2 \mid y}\, E(\mu_r \mid y, g, \delta^2)$

Here is a list of the variables in the previous equations:

  • $E(\mu_r \mid y)$: Expected value of $\mu_r$ given data $y$
  • $E_{g,\delta^2 \mid y}$: Expectation over $g$ and $\delta^2$ given data $y$
  • $E(\mu_r \mid y, g, \delta^2)$: Conditional expectation of $\mu_r$

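On the discrete grid, the iterated expectation reduces to a weighted average of conditional moments; a minimal sketch with hypothetical grid weights and conditional moments (the law of total variance gives the matching posterior variance):

```python
import numpy as np

# hypothetical normalized grid weights f(g, delta^2 | y) over three
# (partition, delta^2) grid points, with matching conditional moments
weights    = np.array([0.50, 0.30, 0.20])
cond_means = np.array([0.215, 0.222, 0.240])   # E(mu_r | y, g, delta^2)
cond_vars  = np.array([0.0009, 0.0011, 0.0016])

post_mean_r = np.sum(weights * cond_means)     # E(mu_r | y)
# law of total variance: mean of variances plus variance of means
post_var_r = np.sum(weights * (cond_vars + cond_means ** 2)) - post_mean_r ** 2
```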
The authors compare their uncertain pooling method with the model by Chakraborty et al. (2014), noting that while the latter can treat outliers, it does not exploit potential clustering of the μi\mu_i.

The paper then introduces the DPM as an alternative. The model is specified as $y_i \mid \theta_i \stackrel{iid}{\sim} f_{\theta_i}$ and $\theta_i \stackrel{iid}{\sim} H$, with $H \sim DP(M, H_0)$. Here, $y_i = Y_i$, $\theta_i = \mu_i$, $f_{\theta_i}$ is the pdf of a $N(\mu_i, V_i)$ random variable with fixed $V_i$, and $H_0 = N(\eta, \tau^2)$. The hyperparameters are assigned independent distributions: $M \mid a_0, b_0 \sim Gamma(a_0, b_0)$, $\eta \mid n_b, S_t \sim N(n_b, S_t)$, and $\tau^{-2} \mid \phi_1, \phi_2 \sim Gamma(\phi_1/2, \phi_2/2)$. The authors note the value of extending typical random effects models by using a DPM.

Here is a list of the variables in the previous equations:

  • $y_i$: Observed data point for individual $i$
  • $\theta_i$: Parameter associated with individual $i$
  • $f_{\theta_i}$: Probability density function of the data given the parameter
  • $H$: Distribution from which the parameters are drawn
  • $DP(M, H_0)$: Dirichlet process with base distribution $H_0$ and concentration parameter $M$
  • $M$: Concentration parameter of the Dirichlet process
  • $H_0$: Base distribution of the Dirichlet process
  • $\eta$: Mean of the base distribution
  • $\tau^2$: Variance of the base distribution
  • $a_0, b_0$: Parameters of the Gamma distribution for $M$
  • $n_b, S_t$: Parameters of the Normal distribution for $\eta$
  • $\phi_1, \phi_2$: Parameters of the Gamma distribution for $\tau^{-2}$

Unlike uncertain pooling, DPmeta requires substantial prior input, which can be problematic with a small number of surveys. The authors follow Escobar (1994) and make inferences for a selected set of values of MM. They replace nbn_b, StS_t, ϕ1\phi_1, and ϕ2\phi_2 with their maximum a posteriori probability estimates.

The authors use data from Ha and Sedransk (2019) on health insurance coverage in Florida counties, along with modifications of these data, to demonstrate the benefits of their methodology. The three data sources are the Small Area Health Insurance Estimates Program (SAHIE), a survey by Ha and Sedransk (HS), and the Centers for Disease Control and Prevention (CDC). The SAHIE program uses point estimates from the American Community Survey (ACS) along with administrative data. The HS and CDC surveys use unit-level models based on data from the Behavioral Risk Factor Surveillance System (BRFSS).

The authors analyze the data using both uncertain pooling and DPmeta. They also conduct a simulation study to establish sampling properties. Each analysis is based only on data from a single county.

In the data-based analyses, a summary of results for Dixie County using uncertain pooling indicates little support for pooling data from all three surveys. The posterior probability for the "pool-all" partition is very low. Assuming a common source and using a locally uniform prior on $\nu$ and the Inverse Beta prior on $\delta^2$, inferences based on the posterior distribution of $\nu$ are inconsistent with the notion that any one of the three surveys is the nominal "gold standard." The uncertain pooling methodology can provide substantial gains in precision, measured by the posterior standard deviation, compared to using only the data from a specific survey.

The authors note that the small standard errors for each of the surveys mean that the "all singletons" partition has a relatively large posterior probability. When the CDC standard error is $k$ times the HS standard error, the reductions in the posterior standard deviation for HS are 29%, 18%, and 7% for $k = 0.5, 1, 2$, respectively.

Due to the need to specify numerous hyperparameters in DPmeta and the lack of prior information, the authors replace some hyperparameters with their maximum a posteriori probability estimates and consider a range of values for MM. Results from uncertain pooling and DPmeta are generally in close agreement, with similar posterior means and standard deviations.

To address the limitation of small SAHIE standard errors, the authors modify the data sets and perform additional analyses. These analyses demonstrate that the uncertain pooling methodology appropriately accounts for the increased variability associated with the SAHIE estimates. Comparing uncertain pooling and DPmeta results with modified data reveals greater differences than in the original analyses. DPmeta exhibits greater pooling of data from survey 1 with survey 3 and smaller posterior standard deviations for survey 1.

The simulation study, based on modifications of the Orange County data, generates data from three normal distributions with parameters chosen to represent a common situation where survey 1 is a probability sample with relatively large sample variance, while surveys 2 and 3 are non-probability samples with much smaller sample variances. The simulation results indicate that the medians of the posterior means are close to the true values, the coverages are close to the nominal 95%, and there are significant reductions in the posterior standard deviation for survey 1.

The authors conclude that both uncertain pooling and DPM methods provide appropriate inferences, but their analyses based on uncertain pooling are fully Bayes, while those from DPmeta are empirical Bayes. The uncertain pooling method also provides additional information in the form of posterior probabilities for the partitions. They also discuss the challenges of making inferences for the sample variances and suggest avenues for future research, including extending the uncertain pooling methodology to small area inference, particularly when data from several surveys are available.
