
N-Ball Sampling for Unbiased Graph Estimation

Updated 12 December 2025
  • N-ball Sampling Technique is a probabilistic graph sampling method that employs BFS expansion from a randomized seed set to produce unbiased estimates of subgraph motif totals.
  • It generalizes classical designs such as random-node, multiplicity, and adaptive cluster sampling while effectively capturing local subgraph densities.
  • The technique's flexibility in adjusting the number of expansion waves enables a balance between computational cost and variance reduction for accurate motif counts.

N-ball sampling, also referred to as T-wave snowball sampling (SBS), is a probabilistic graph sampling technique defined by breadth-first expansion from a randomized initial node sample, aiming to unbiasedly estimate finite-order subgraph totals such as triangles, cycles, or stars. The methodology generalizes multiple classical sampling paradigms, including random-node, random-edge, multiplicity, and adaptive cluster sampling, by constructing “N-balls” (subgraphs induced by nodes within T steps from seed nodes) and employing Horvitz–Thompson normalization for unbiased inference (Oguz-Alper et al., 2020).

1. Formal Definition

Let $G=(V,E)$ denote a simple undirected graph with $|V|=N$ nodes. The N-ball (or T-wave SBS) protocol comprises the following steps:

  • Initial Sample and Observation: Select an initial seed set $s_0 \subset V$ (typically of fixed size $n$) using any probability design admitting known first- and higher-order inclusion probabilities.
  • Reciprocal Incident Observation Procedure (RIOP): Whenever $u \in V$ enters the sample, observe all incident edges $(u,v) \in E$ and their endpoints $v$.
  • Wave Expansion (N-balls): For step $t \geq 1$, define

$$\alpha(s_{t-1}) = \{\, v \in V \setminus \bigcup_{r<t} s_r : \exists\, u \in s_{t-1},\ (u,v) \in E \,\}$$

and set the new sample increment as

$$\Delta_t = \alpha(s_{t-1}) \setminus \bigcup_{r<t} s_r.$$

The process is iterated up to $T$ waves (or until $\Delta_t = \emptyset$), yielding the final sample $s = s_0 \cup \Delta_1 \cup \cdots \cup \Delta_T$.

  • Final Sample Graph: The discovered sample graph is $G_s = (V_s, E_s)$, consisting of the observed node set $V_s$ and all edges $E_s$ exposed during RIOP.

This framework yields an “N-ball of radius $T$ around $s_0$” in the breadth-first search (BFS) sense.
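The wave-expansion protocol above can be sketched directly as breadth-first waves. The following minimal Python sketch is illustrative only: the function name, the dict-of-sets adjacency representation, and the choice to pass the seed set $s_0$ explicitly (rather than draw it internally) are assumptions, not part of the original formulation.

```python
def t_wave_sbs(adj, s0, T):
    """T-wave snowball sample: the N-ball of radius T around seed set s0.

    adj maps each node to its set of neighbours; s0 is the initial
    probability sample (drawn externally, e.g. by SRS without replacement).
    Returns the sampled node set and the edge set observed under RIOP.
    """
    sampled = set(s0)
    observed_edges = set()
    entering = set(s0)
    for t in range(T + 1):
        # RIOP: every node entering the sample exposes all incident edges
        for u in entering:
            for v in adj[u]:
                observed_edges.add(frozenset((u, v)))
        if t == T:
            break
        # wave t+1: neighbours of the current wave not yet sampled
        nxt = {v for u in entering for v in adj[u]} - sampled
        if not nxt:  # Delta_t empty: expansion terminates early
            break
        sampled |= nxt
        entering = nxt
    return sampled, observed_edges
```

Passing $s_0$ explicitly keeps the sketch deterministic; any probability design with known inclusion probabilities can supply it.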

2. Inclusion Probabilities and Ancestor Sets

An essential component of unbiased graph motif estimation under SBS is the computation of inclusion probabilities, leveraging the structure of SBS-ancestor sets:

  • For a motif $\kappa$ (e.g., node, edge, triangle), define its ancestor set as:

$$\beta_\kappa = \{\, i \in V : \text{if } s_0 = \{i\} \text{ then } \kappa \text{ is observed in } T \text{ waves} \,\}$$

Typically, a recognized subset $\beta_\kappa^* \subseteq \beta_\kappa$ is used, identifiable from the observed $G_s$ when $\kappa$ is seen.

  • The first-order inclusion probability is:

$$\pi_{(\kappa)} = \Pr(\beta_\kappa^* \cap s_0 \neq \emptyset)$$

For a single-node motif $i$ and initial simple random sampling (SRS) without replacement of size $n$:

$$\pi_i = 1 - \frac{C(N - m_i,\, n)}{C(N,\, n)}$$

where $m_i = |\beta_i^*|$ is the number of ancestors of node $i$.

  • Higher-order (joint) inclusion probabilities for any motif set $(\kappa_1, \dots, \kappa_r)$ are:

$$\pi_{(\kappa_1 \cdots \kappa_r)} = \Pr\left( \bigcap_{j=1}^{r} \{ \beta_{\kappa_j}^* \cap s_0 \neq \emptyset \} \right)$$

In SRS, these probabilities admit closed-form combinatorial representation in terms of $\beta$-set overlaps; by inclusion–exclusion, e.g., $\pi_{(\kappa\ell)} = 1 - \frac{C(N - m_\kappa, n) + C(N - m_\ell, n) - C(N - |\beta_\kappa^* \cup \beta_\ell^*|, n)}{C(N, n)}$, where $m_\kappa = |\beta_\kappa^*|$.
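Under SRS seeds, these probabilities are directly computable from ancestor-set sizes. A small sketch (the function names are illustrative assumptions; the joint probability is obtained by inclusion–exclusion on the ancestor sets):

```python
from math import comb

def incl_prob(N, n, m):
    """First-order inclusion probability under SRS seeds of size n:
    P(at least one of a motif's m ancestors lands in s_0)."""
    return 1 - comb(N - m, n) / comb(N, n)

def joint_incl_prob(N, n, beta_k, beta_l):
    """Joint inclusion probability of two motifs from their ancestor sets,
    via P(A and B) = P(A) + P(B) - P(A or B)."""
    p_k = incl_prob(N, n, len(beta_k))
    p_l = incl_prob(N, n, len(beta_l))
    p_union = incl_prob(N, n, len(beta_k | beta_l))
    return p_k + p_l - p_union
```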

3. Unbiased Estimation via Horvitz–Thompson

Given a motif set $\Omega$ with associated values $y_\kappa$, the inferential target is $\theta = \sum_{\kappa \in \Omega} y_\kappa$. The canonical unbiased estimator (Horvitz–Thompson) is

$$\hat{\theta}_{HT} = \sum_{\kappa \in \Omega_s} \frac{y_\kappa}{\pi_{(\kappa)}}$$

where $\Omega_s$ is the set of motifs observed in $G_s$ (those for which $\beta_\kappa^* \cap s_0 \neq \emptyset$). This estimator is unbiased for $\theta$:

$$E[\hat{\theta}_{HT}] = \theta,$$

since $E[I\{\beta_\kappa^* \cap s_0 \neq \emptyset\}] = \pi_{(\kappa)}$, and summing over $\kappa$ yields the exact expectation.

This formalism encompasses node totals ($\Omega = V$), edge totals ($\Omega = E$), triangle counts, or totals over arbitrary motifs.
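Unbiasedness can be checked exactly on a toy instance of the node-total case ($\Omega = V$, $y_i = 1$). The 5-node path, single-seed SRS ($n = 1$), and $T = 1$ below are illustrative assumptions:

```python
from math import comb

# Toy check: estimate the node total N of a 5-node path with T = 1 waves
# and a single SRS seed (n = 1).  Graph and parameters are illustrative.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
N, n = len(adj), 1

# Under T = 1, a node's ancestor set is its closed neighbourhood:
# node i is observed iff the seed is i or one of its neighbours.
beta = {i: {i} | adj[i] for i in adj}
pi = {i: 1 - comb(N - len(beta[i]), n) / comb(N, n) for i in adj}

def ht_node_total(seed):
    """HT estimate of N from one seed: sum of 1/pi_i over observed nodes."""
    return sum(1 / pi[i] for i in adj if seed in beta[i])

# Averaging the estimator over the uniform seed gives exactly theta = N = 5.
expectation = sum(ht_node_total(s) for s in adj) / N
```

Averaging over all equally likely seeds reproduces $\theta = N$ exactly, illustrating $E[\hat{\theta}_{HT}] = \theta$ on this instance.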

4. Sampling Variance Analysis

The variance of the Horvitz–Thompson estimator under an SBS design on the seed set $s_0$ is

$$\operatorname{Var}(\hat{\theta}_{HT}) = \sum_{\kappa} \sum_{\ell} \left( \frac{\pi_{(\kappa\ell)}}{\pi_{(\kappa)} \pi_{(\ell)}} - 1 \right) y_\kappa y_\ell$$

For node-sum estimation,

$$\operatorname{Var}(\hat{S}) = \sum_{i} \sum_{j} \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_i y_j.$$

Approximate forms arise when all $\pi_i$ are small or the sampling fraction $n/N$ is small, in which case the inclusion indicators are treated as approximately independent:

$$\operatorname{Var}_{HT} \approx \sum_i \left( \frac{1 - \pi_i}{\pi_i} \right) y_i^2.$$

The HT methodology supports unbiased variance estimation:

$$\hat{V} = \sum_{\kappa \in \Omega_s} \sum_{\ell \in \Omega_s} \frac{ (\pi_{(\kappa\ell)} - \pi_{(\kappa)} \pi_{(\ell)})\, y_\kappa y_\ell }{ \pi_{(\kappa\ell)}\, \pi_{(\kappa)}\, \pi_{(\ell)} }.$$
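The unbiased variance estimator translates directly into code. The dictionary-based interface below (observed motif values, first-order, and joint inclusion probabilities) is an illustrative assumption:

```python
def ht_variance_est(observed, pi, pi_joint):
    """Unbiased HT variance estimator over pairs of observed motifs.

    observed: {motif: y_value} for motifs in Omega_s;
    pi: first-order inclusion probabilities;
    pi_joint: joint inclusion probabilities, with pi_joint[(k, k)] = pi[k].
    """
    v = 0.0
    for k, yk in observed.items():
        for l, yl in observed.items():
            pkl = pi_joint[(k, l)]
            v += (pkl - pi[k] * pi[l]) * yk * yl / (pkl * pi[k] * pi[l])
    return v
```

For a single observed motif with value $y$ the sum collapses to $(1 - \pi)\, y^2 / \pi^2$, the familiar one-unit HT variance estimate.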

5. Special Cases and Theoretical Generalizations

T-wave SBS unifies several classical sampling designs in a single graph-theoretic formalism:

  • Random-node (T=0) sampling: No waves. The ancestor set reduces to $\beta_i^* = \{i\}$; $\pi_i$ is the initial inclusion probability.
  • Random-edge sampling: A bipartite T=0 design on the incidence graph (edges treated as “nodes”).
  • Multiplicity sampling (Birnbaum–Sirken): The bipartite sampling-unit/measurement-unit relation forms a graph $G = (F \cup \Omega, E)$; 1-wave SBS from $F$ with RIOP reconstructs classical multiplicity estimators through known ancestor sets and yields direct application of the HT formula.
  • Adaptive cluster sampling (Thompson): Treating the spatial grid and neighbour criteria as an infinite-wave ($T \to \infty$) directed procedure, clusters form connected components detected via suitable $\beta_i^*$.

6. Methodological Advantages and Practical Considerations

T-wave SBS possesses several notable advantages:

  • Flexibility: Adjusting $T$ controls the trade-off between traversal cost and variance.
  • Motif Generality: Any finite-order subgraph motif can be estimated within the same HT framework.
  • Sampling Efficiency: Breadth-first “waves” tend to oversample high-degree regions, increasing efficiency for motifs whose values $y_\kappa$ are associated with local subgraph density.
  • Unification: SBS instantiates a general theory encompassing otherwise disparate sampling regimes, allowing analytical connection between motif estimation and finite-population sampling theory.

The key operational principle is that T-wave snowball sampling equates to BFS truncated at depth $T$ from a randomly sampled seed set, with motif inclusion determined by ancestor-set intersection with $s_0$. Once inclusion probabilities for motifs are characterized, unbiased estimation and closed-form variance expressions are immediate (Oguz-Alper et al., 2020).

References

  • Oguz-Alper et al. (2020).
