Data-Driven Sampling Network
- Data-Driven Sampling Network is a framework that uses empirical network data and ERGM-based simulations to drive sampling design.
- It integrates model-assisted estimation with respondent-driven sampling to adjust for seed bias and account for network dependencies.
- Extensive simulations and tailored bootstrap methods demonstrate significant improvements in bias correction and variance control.
A data-driven sampling network is a framework in which the selection, weighting, and inferential procedures of a sampling design are explicitly driven by empirical network data—structural properties, observed response patterns, or features inferred from partial observations. This approach is particularly pertinent for sampling and making inference in complex networked populations, especially when standard random sampling is infeasible and the underlying network exerts strong effects on inclusion probabilities, dependency structure, or estimator properties. In the context of link-tracing designs such as Respondent-Driven Sampling (RDS), model-assisted and data-driven methods tightly integrate network models, simulation, and estimation strategies to yield improved inference and robust quantification of uncertainty.
1. Model-Assisted Network-Based Inference
A central concept in data-driven sampling networks is the use of a working statistical model—such as an exponential-family random graph model (ERGM)—to represent the (partially observed or unobserved) network over which sampling is conducted. In the context of RDS, where recruitment begins with a convenience sample of seeds and propagates via respondent-driven branching, the estimator requires information about the network structure that cannot be directly observed. The model-assisted approach proceeds as follows:
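- Specify a working ERGM, where the probability of observing a network $y$ given degrees $d$, node-level covariates $z$, and parameter $\eta$ takes the form
$$P(Y = y \mid d, z; \eta) = \frac{\exp\{\eta^{\top} g(y, z)\}}{c(\eta, d, z)},$$
with $g(y, z)$ typically encoding homophily or other mesoscale structure, and $c(\eta, d, z)$ the normalizing constant.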
- Simulate a set of networks from the ERGM, conditioned on the observed summary statistics (e.g., degree sequence, infection status).
- For each simulated network, replicate the original RDS process (starting from the actual seed sample), producing empirical estimates $\hat{\pi}_i$ of the inclusion probability of each node $i$.
- These inclusion probabilities are then used in design-based Horvitz–Thompson–type estimators, allowing for incorporation of both structural bias (homophily, degree effects) and nonrandom seed selection.
This method contrasts with standard RDS estimators relying on with-replacement random walk assumptions, which generally ignore seed bias and finite-population effects.
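The steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: `simulate_network` is a hypothetical stand-in that draws independent ties with simple homophily instead of sampling from a fitted ERGM conditioned on degrees, and the function names, tie probabilities, coupon count, and toy population are all invented for the example.

```python
import random

def simulate_network(n, infected, p_same, p_diff, rng):
    """Stand-in for one ERGM draw: independent ties, denser within infection groups."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_same if infected[i] == infected[j] else p_diff
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def simulate_rds(adj, seeds, sample_size, coupons, rng):
    """Trace links without replacement from the given seeds, wave by wave."""
    sampled = list(seeds)
    in_sample = set(seeds)
    queue = list(seeds)  # FIFO queue, so earlier waves recruit first
    while queue and len(sampled) < sample_size:
        recruiter = queue.pop(0)
        candidates = sorted(adj[recruiter] - in_sample)
        rng.shuffle(candidates)
        for v in candidates[:coupons]:
            if len(sampled) >= sample_size:
                break
            sampled.append(v)
            in_sample.add(v)
            queue.append(v)
    return sampled

def estimate_inclusion_probs(infected, seeds, sample_size, n_sims=200,
                             coupons=2, p_same=0.08, p_diff=0.02, seed=0):
    """Fraction of simulated RDS runs in which each node is sampled."""
    rng = random.Random(seed)
    n = len(infected)
    counts = [0] * n
    for _ in range(n_sims):
        adj = simulate_network(n, infected, p_same, p_diff, rng)
        for v in simulate_rds(adj, seeds, sample_size, coupons, rng):
            counts[v] += 1
    return [c / n_sims for c in counts]

infected = [i < 20 for i in range(60)]   # toy population: 20 infected, 40 not
pi_hat = estimate_inclusion_probs(infected, seeds=[0, 1, 2], sample_size=15)
```

Because every simulated recruitment starts from the same seeds, the seeds receive an estimated inclusion probability of 1, and nodes that share the seeds' infection status tend to receive larger values than the rest. That is how seed composition and homophily enter the weights.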
2. Design-Based Estimator and Bias Adjustment
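The model-assisted estimator adopts a generalized Horvitz–Thompson (Hájek) form:
$$\hat{\mu} = \frac{\sum_{i \in S} z_i / \hat{\pi}_i}{\sum_{i \in S} 1 / \hat{\pi}_i},$$
where $S$ is the set of sampled nodes, $z_i$ encodes the outcome (e.g., infection status), and $\hat{\pi}_i$ are the empirically estimated inclusion probabilities.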
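As a small worked illustration of this ratio form, the following sketch computes the weighted estimate from outcomes and inclusion probabilities. The function name `hajek_estimate` and the numbers are hypothetical; the inclusion probabilities would in practice come from the simulation step.

```python
def hajek_estimate(outcomes, inclusion_probs):
    """Ratio estimator: inverse-probability-weighted mean of the outcomes."""
    if len(outcomes) != len(inclusion_probs):
        raise ValueError("outcomes and inclusion_probs must have equal length")
    if any(p <= 0 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must be positive")
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * z for w, z in zip(weights, outcomes)) / sum(weights)

# Two infected respondents who were very likely to be sampled,
# two uninfected respondents who were unlikely to be sampled.
outcomes = [1, 1, 0, 0]
inclusion_probs = [0.8, 0.8, 0.2, 0.2]
estimate = hajek_estimate(outcomes, inclusion_probs)  # about 0.2, versus a raw mean of 0.5
```

Respondents who were easy to reach are down-weighted, so an over-sampled group contributes less to the estimate than its share of the sample would suggest.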
A particular innovation is explicit adjustment for bias induced by convenience seeds. Since RDS generally begins with non-random initial samples, the downstream recruitment process can substantially over- or under-represent certain subpopulations, especially when homophily is strong or seed composition is skewed. By conditioning the simulation on the observed seed characteristics, matching degree and infection status, the estimator corrects the initial imbalance, rather than passively assuming long recruitment chains eliminate bias. This methodology enables accurate adjustment even if the number of waves is limited or if the network contains bottlenecks or strong modularity.
3. Comparative Performance in Simulation
Extensive simulation reveals substantial improvements in estimator performance relative to standard alternatives—the naive sample mean, Volz–Heckathorn (VH), Salganik–Heckathorn (SH), and successive sampling (SS) estimators:
- In scenarios with small sample fraction and no seed bias, all estimators behave similarly.
- When seed bias is present or when outcome groups have differential degree or homophily, conventional estimators exhibit notable bias (positive or negative), whereas the model-assisted (MA) estimator essentially eliminates bias.
- For example, with all-infected seeds and high homophily, only the MA estimator returns nearly unbiased prevalence estimates; others are strongly distorted.
- Variance of the MA estimator is not increased and may be reduced compared to conventional estimators, due to more precise modeling of the inclusion probabilities.
This improvement is attributed not only to mean bias correction but also to improved variance control under complex seed-selection and branching dynamics.
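To make the comparison concrete, the sketch below contrasts two of the conventional estimators on a toy sample: the naive sample mean and the Volz–Heckathorn estimator, which weights each respondent by the inverse of reported degree on the assumption that inclusion probability is proportional to degree. The data are invented for illustration. Neither estimator uses any information about the seeds, which is the gap the model-assisted approach is meant to close.

```python
def naive_mean(outcomes):
    """Unweighted sample proportion."""
    return sum(outcomes) / len(outcomes)

def volz_heckathorn(outcomes, degrees):
    """Inverse-degree-weighted mean (inclusion probability taken as proportional to degree)."""
    if any(d <= 0 for d in degrees):
        raise ValueError("degrees must be positive")
    weights = [1.0 / d for d in degrees]
    return sum(w * z for w, z in zip(weights, outcomes)) / sum(weights)

# Toy sample: infected respondents report many more contacts.
outcomes = [1, 1, 0, 0]
degrees = [10, 10, 2, 2]
naive = naive_mean(outcomes)              # 0.5
vh = volz_heckathorn(outcomes, degrees)   # 0.2 / 1.2, about 0.167
```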
4. Sensitivity Analysis: Population Size and Model Specification
The methodology’s robustness was assessed along two primary axes:
- Unknown Population Size ($N$): The ERGM-conditional simulation requires an assumed $N$. Results show that as long as $N$ is not grossly underestimated, moderate errors in $N$ have limited impact on the estimate when the sample fraction is small. Sensitivity grows as the sample size approaches the population size.
- Model Misspecification: Using true networks with elevated triadic transitivity (i.e., higher geometrically weighted edgewise shared partners, GWESP), simulations show that misspecification of higher-order structure in the working model induces limited additional bias (e.g., 0.46% in extreme cases), and has minor effects on variance. This suggests the MA estimator is robust to certain forms of misspecification, provided degree and primary homophily are accurately modeled.
Figures (e.g., Figs. 5, 6 in the source) provide empirical support under various generated scenarios.
5. Bootstrap for Uncertainty Quantification
A parametric bootstrap, tailored to the model-assisted estimation process, is used for standard error estimation and uncertainty quantification:
- After fitting the working ERGM and estimating its parameters, one generates bootstrap replicates by simulating new networks from the fitted ERGM and carrying out the full RDS process for each replicate.
- Each bootstrap replicate yields an estimate $\hat{\mu}^{*}$, and the collection of these estimates forms a bootstrap distribution for inference.
- This approach captures both design-based sampling variability and model-based uncertainty, providing more trustworthy coverage properties than resampling methods that ignore recruitment structure.
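The bootstrap loop itself is short once a replicate-generating function is available. The sketch below is a generic skeleton under stated assumptions: `replicate_estimate` is a hypothetical callable that, in the real procedure, would simulate a network from the fitted ERGM, rerun RDS, and recompute the model-assisted estimate. Here `toy_replicate` merely draws a binomial proportion so that the example runs on its own. The percentile interval is one common choice and is not confirmed as the one used in the source.

```python
import random
import statistics

def parametric_bootstrap(replicate_estimate, n_boot=500, alpha=0.05, seed=0):
    """Return the bootstrap standard error and a percentile interval."""
    rng = random.Random(seed)
    reps = sorted(replicate_estimate(rng) for _ in range(n_boot))
    se = statistics.stdev(reps)
    lo = reps[int((alpha / 2) * (n_boot - 1))]
    hi = reps[int((1 - alpha / 2) * (n_boot - 1))]
    return se, (lo, hi)

def toy_replicate(rng, n=100, prevalence=0.3):
    """Placeholder for 'simulate network, rerun RDS, re-estimate'."""
    return sum(rng.random() < prevalence for _ in range(n)) / n

se, (lo, hi) = parametric_bootstrap(toy_replicate)
```

Passing the replicate generator as a function keeps the resampling logic separate from the network and recruitment model, so the same loop can be reused when the recruitment rules change.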
6. Application: HIV Prevalence Estimation in Hidden Populations
The methodology is applied to estimation of HIV prevalence among injecting drug users (IDU) in Mykolaiv, Ukraine:
- Data comprised 260 participants reached via 6 HIV-positive seeds over 10 RDS waves.
- The working model includes estimated HIV prevalence, activity differentials (degree differences by infection status), and empirical homophily. A population size $N$ is assumed, and sensitivity to this assumption is analyzed.
- The recruitment process was further modified in the model to allow for differential recruitment effectiveness by infection status and wave, informed by domain knowledge (e.g., uninfected recruits in early waves were less likely to recruit).
- The modified MA estimator, incorporating this application-specific branching, estimates HIV prevalence at 0.817, considerably lower than the estimates from uncorrected estimators, and more credible given the observed recruitment patterns.
This demonstrates not just suitability for general RDS inference, but flexibility in incorporating nuanced, context-specific features.
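One way such a context-specific rule could enter the simulated recruitment step is sketched below. Everything here is hypothetical: the function name, the coupon limit, the cutoff for "early" waves, and the recruitment probabilities are illustrative values and are not taken from the Mykolaiv analysis.

```python
import random

def draw_recruit_count(infected, wave, rng, max_coupons=3,
                       early_waves=2, p_base=0.6, p_uninfected_early=0.3):
    """Number of coupons redeemed; uninfected early-wave respondents recruit less."""
    if not infected and wave <= early_waves:
        p = p_uninfected_early
    else:
        p = p_base
    return sum(rng.random() < p for _ in range(max_coupons))

rng = random.Random(1)
early_uninfected = [draw_recruit_count(False, 1, rng) for _ in range(2000)]
early_infected = [draw_recruit_count(True, 1, rng) for _ in range(2000)]
mean_uninfected = sum(early_uninfected) / 2000   # near 3 * 0.3
mean_infected = sum(early_infected) / 2000       # near 3 * 0.6
```

A function like this would replace a fixed coupon count inside the simulated RDS process, so the estimated inclusion probabilities reflect the differential recruitment.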
7. Summary and Implications
In summary, the data-driven sampling network paradigm introduced in (Gile et al., 2011) integrates a working network model (ERGM) with simulation-based estimation of inclusion probabilities in link-tracing network sampling. The resulting estimator provides effective bias correction for nonrandom seeds and differential nodal activity, is robust to moderate misspecification of the population size and of the working model, and supports rigorous inference via a tailored bootstrap simulation.
Simulation and applied evidence show substantial bias reduction relative to standard estimators, with variance that is no larger and sometimes smaller. This methodology broadens the applicability of RDS and similar sampling frameworks in hard-to-reach populations and hidden networks, and supports integration of application-specific recruitment rules. It sets a foundation for future work on network-aware adaptive sampling, joint modeling of network and outcome processes, and robust extrapolation to unobserved portions of network structure.