N-Ball Sampling for Unbiased Graph Estimation

Updated 12 December 2025

N-ball sampling is a probabilistic graph sampling method that uses breadth-first search (BFS) expansion from a randomized seed set to produce unbiased estimates of subgraph motif totals.
It generalizes classical designs such as random-node, multiplicity, and adaptive cluster sampling while effectively capturing local subgraph densities.
The technique's flexibility in adjusting the number of expansion waves enables a balance between computational cost and variance reduction for accurate motif counts.

N-ball sampling, also referred to as T-wave snowball sampling (SBS), is a probabilistic graph sampling technique defined by breadth-first expansion from a randomized initial node sample, aiming to unbiasedly estimate finite-order subgraph totals such as triangles, cycles, or stars. The methodology generalizes multiple classical sampling paradigms, including random-node, random-edge, multiplicity, and adaptive cluster sampling, by constructing “N-balls” (subgraphs induced by nodes within T steps from seed nodes) and employing Horvitz–Thompson normalization for unbiased inference (Oguz-Alper et al., 2020).

1. Formal Definition

Let $G=(V,E)$ denote a simple undirected graph with $|V|=N$ nodes. The N-ball (or T-wave SBS) protocol comprises the following steps:

Initial Sample and Observation: Select an initial seed set $s_0\subset V$ (typically of fixed size $n$ ) using any probability design admitting known first- and higher-order inclusion probabilities.
Reciprocal Incident Observation Procedure (RIOP): Whenever $u\in V$ enters the sample, observe all incident edges $(u,v)\in E$ and their endpoints $v$ .
Wave Expansion (N-balls): For step $t\geq1$ , define

$\alpha(s_{t-1}) = \{ v\in V \setminus \bigcup_{r<t} s_r : \exists u\in s_{t-1}, (u,v)\in E \}$

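and set the new sample increment, which serves as the wave-$t$ sample $s_t$ , as

$s_t = \Delta_t = \alpha(s_{t-1}) \setminus \bigcup_{r< t}s_r.$

The process is iterated up to $T$ waves (or until $\Delta_t = \emptyset$ ), yielding the final sample $s = s_0 \cup \Delta_1 \cup \ldots \cup \Delta_{T-1}$ of nodes whose incident edges are observed. Nodes first reached in wave $T$ are observed through RIOP but are not themselves expanded.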

Final Sample Graph: The discovered sample graph is $G_s = (V_s, E_s)$ , consisting of the observed node set $V_s$ and all edges $E_s$ exposed during RIOP.

This framework yields an “N-ball of radius $T$ around $s_0$ ” in the breadth-first search (BFS) sense.
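The wave expansion above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `snowball_sample`, the adjacency-dictionary representation `adj`, and the returned triple are all assumptions made for this example.

```python
def snowball_sample(adj, seeds, T):
    """T-wave snowball sampling, i.e. BFS truncated at depth T.

    adj   : dict mapping each node to the set of its neighbours (undirected graph)
    seeds : iterable of seed nodes s_0
    T     : number of waves
    Returns (s, V_s, E_s): expanded nodes, observed nodes, observed edges.
    """
    wave = set(seeds)      # current wave, starting with s_0
    s = set()              # nodes whose incident edges have been observed
    V_s = set(wave)        # every node seen so far
    E_s = set()            # every edge seen so far, stored as frozensets
    for _ in range(T):
        if not wave:       # stop early once a wave comes back empty
            break
        s |= wave
        next_wave = set()
        for u in wave:
            for v in adj[u]:               # RIOP: observe all edges incident to u
                E_s.add(frozenset((u, v)))
                V_s.add(v)
                if v not in s:             # keep only nodes not sampled earlier
                    next_wave.add(v)
        wave = next_wave                   # this is the increment Delta_t
    return s, V_s, E_s


# Hypothetical toy input: the path graph 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
s, V_s, E_s = snowball_sample(path, [0], T=2)
```

On this path graph with seed node 0 and two waves, the expanded set is {0, 1}, while node 2 is observed as an endpoint but not expanded.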

2. Inclusion Probabilities and Ancestor Sets

An essential component of unbiased graph motif estimation under SBS is the computation of inclusion probabilities, leveraging the structure of SBS-ancestor sets:

For a motif $\kappa$ (e.g., node, edge, triangle), define its ancestor set as:

$\beta_\kappa = \{ i \in V : \text{if } s_0 = \{i\} \text{ then } \kappa \text{ is observed in }T \text{ waves}\}$

Typically, a recognized subset $\beta_\kappa^* \subseteq \beta_\kappa$ is used, identifiable from the observed $G_s$ when $\kappa$ is seen.

The first-order inclusion probability is:

$\pi_{(\kappa)} = \Pr(\beta_\kappa^* \cap s_0 \neq \emptyset)$

For a single-node motif $i$ and initial simple random sampling (SRS) without replacement of size $n$ :

$\pi_i = 1 - \frac{C(N - m_i, n)}{C(N, n)}$

where $m_i = |\beta_i^*|$ is the number of ancestors of node $i$ .

Higher-order (joint) inclusion probabilities for any motif set $(\kappa_1,\dots,\kappa_r)$ are:

$\pi_{(\kappa_1 \ldots \kappa_r)} = \Pr\left(\bigcap_{j=1}^r \{\beta_{\kappa_j}^* \cap s_0 \neq \emptyset\}\right)$

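In SRS, these probabilities admit closed-form combinatorial representation in terms of $\beta$ -set overlaps. For a pair of motifs, inclusion-exclusion on the complementary events gives $\pi_{(\kappa \ell)} = 1 - \frac{C(N - m_\kappa, n)}{C(N,n)} - \frac{C(N - m_\ell, n)}{C(N,n)} + \frac{C(N - |\beta_\kappa^*\cup\beta_\ell^*|, n)}{C(N,n)}$ , where $m_\kappa = |\beta_\kappa^*|$ . The quantity $1 - \frac{C(N - |\beta_\kappa^*\cup\beta_\ell^*|, n)}{C(N,n)}$ on its own is the probability that at least one of the two motifs is observed, not both.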
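Under SRS without replacement, both first-order and pairwise inclusion probabilities reduce to binomial-coefficient arithmetic. The sketch below assumes ancestor sets are given as Python sets; the helper names `pi_first` and `pi_joint` are illustrative only. The joint probability is obtained by inclusion-exclusion on the events "no seed falls in the ancestor set".

```python
from math import comb


def pi_first(N, n, m):
    """P(at least one of m ancestors is among n SRS seeds drawn from N nodes)."""
    return 1 - comb(N - m, n) / comb(N, n)


def pi_joint(N, n, beta_k, beta_l):
    """P(both ancestor sets contain at least one seed), by inclusion-exclusion."""
    def miss(m):
        # Probability that none of m given nodes is selected as a seed.
        return comb(N - m, n) / comb(N, n)

    return 1 - miss(len(beta_k)) - miss(len(beta_l)) + miss(len(beta_k | beta_l))


# Two single-node motifs with disjoint singleton ancestor sets (the T = 0 case).
p_single = pi_first(5, 2, 1)
p_pair = pi_joint(5, 2, {0}, {1})
```

With $N = 5$ and $n = 2$ , the two values agree with the textbook SRS results $n/N$ and $n(n-1)/(N(N-1))$ , and passing the same set twice to `pi_joint` returns the first-order probability.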

3. Unbiased Estimation via Horvitz–Thompson

Given motif set $\Omega$ , with associated values $y_\kappa$ , the inferential target is $\theta = \sum_{\kappa \in \Omega} y_\kappa$ . The canonical unbiased estimator (Horvitz–Thompson) is

$\hat{\theta}_{HT} = \sum_{\kappa\in\Omega_s} \frac{y_\kappa}{\pi_{(\kappa)}}$

where $\Omega_s$ is the set of motifs observed in $G_s$ (those for which $\beta_\kappa^*\cap s_0\neq\emptyset$ ). This estimator is unbiased for $\theta$ :

$E[\hat{\theta}_{HT}] = \theta,$

since $E[I\{\beta_\kappa^*\cap s_0\neq\emptyset\}] = \pi_{(\kappa)}$ for each $\kappa \in \Omega$ , so taking expectations term by term recovers $\theta$ , provided $\pi_{(\kappa)} > 0$ for every $\kappa \in \Omega$ .

This formalism encompasses node totals ( $\Omega=V$ ), edge totals ( $\Omega=E$ ), triangle counts, or totals over arbitrary motifs.
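Unbiasedness can be checked by brute force on a small graph. The sketch below makes several simplifying assumptions that are not part of the source: a hypothetical five-node toy graph, edge motifs with $y_\kappa = 1$ , a single wave ( $T=1$ ), and SRS seeds, so that an edge is observed exactly when one of its two endpoints is a seed and its ancestor set has size 2.

```python
from itertools import combinations
from math import comb


def ht_edge_total(edges, seeds, N):
    """HT estimate of the edge count under 1-wave SBS with SRS seeds."""
    n = len(seeds)
    # The ancestor set of edge (u, v) is {u, v}, so m = 2 for every edge.
    pi = 1 - comb(N - 2, n) / comb(N, n)
    observed = [e for e in edges if e[0] in seeds or e[1] in seeds]
    return len(observed) / pi


# Hypothetical toy graph on nodes 0..4 (node 4 is isolated).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N, n = 5, 2

# Average the estimator over every possible seed set of size n.
estimates = [ht_edge_total(edges, set(s0), N) for s0 in combinations(range(N), n)]
mean_estimate = sum(estimates) / len(estimates)
```

Averaged over all ten equally likely seed sets, `mean_estimate` matches the true edge count of 5 up to floating-point error, even though individual estimates vary from sample to sample.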

4. Sampling Variance Analysis

The variance of the Horvitz–Thompson estimator under SBS design on the seed set $s_0$ is

$\operatorname{Var}(\hat{\theta}_{HT}) = \sum_{\kappa\in\Omega}\sum_{\ell\in\Omega} \left( \frac{\pi_{(\kappa\ell)}}{\pi_{(\kappa)}\pi_{(\ell)}} - 1 \right) y_\kappa y_\ell,$

with the convention $\pi_{(\kappa\kappa)} = \pi_{(\kappa)}$ for the diagonal terms.

For node-sum estimation,

$\operatorname{Var}(\hat{S}) = \sum_{i}\sum_{j} \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_i y_j.$

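Approximate forms arise when all $\pi_i$ are small or the sampling fraction $n/N$ is small. Inclusions of distinct nodes are then treated as approximately independent ( $\pi_{ij} \approx \pi_i \pi_j$ for $i \neq j$ ), so the off-diagonal terms drop out and only the diagonal terms with $\pi_{ii} = \pi_i$ remain: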

$\operatorname{Var}_{HT} \approx \sum_i \left( \frac{1 - \pi_i}{\pi_i} \right) y_i^2.$

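Provided all joint inclusion probabilities $\pi_{(\kappa\ell)}$ are positive, the HT methodology supports unbiased variance estimation from the observed motifs alone:

$\hat{V} = \sum_{\kappa \in \Omega_s} \sum_{\ell \in \Omega_s} \frac{ (\pi_{(\kappa\ell)} - \pi_{(\kappa)}\pi_{(\ell)}) y_\kappa y_\ell }{ \pi_{(\kappa\ell)} \pi_{(\kappa)} \pi_{(\ell)} }.$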
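The double-sum variance formula can be compared with a direct enumeration over all seed sets. As a sketch under stated assumptions (a hypothetical toy graph, edge motifs with $y_\kappa = 1$ , one wave, SRS seeds, and illustrative helper names `miss`, `pi_joint`, and `ht_variance`), the two computations should agree:

```python
from itertools import combinations
from math import comb


def miss(N, n, m):
    """P(none of m given nodes is among n SRS seeds drawn from N nodes)."""
    return comb(N - m, n) / comb(N, n)


def pi_joint(N, n, a, b):
    """P(both ancestor sets a and b are hit); reduces to pi when a == b."""
    return 1 - miss(N, n, len(a)) - miss(N, n, len(b)) + miss(N, n, len(a | b))


def ht_variance(betas, y, N, n):
    """Design variance of the HT estimator from the double-sum formula."""
    pi = [1 - miss(N, n, len(b)) for b in betas]
    return sum(
        (pi_joint(N, n, bk, bl) / (pi[k] * pi[l]) - 1) * y[k] * y[l]
        for k, bk in enumerate(betas)
        for l, bl in enumerate(betas)
    )


# Hypothetical toy graph on nodes 0..4 (node 4 is isolated), 1-wave SBS.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N, n = 5, 2
betas = [frozenset(e) for e in edges]   # ancestor set of an edge = its endpoints
y = [1] * len(edges)                    # y = 1 for each edge, so theta = |E|

var_formula = ht_variance(betas, y, N, n)

# Brute-force check: compute the HT estimate for every seed set of size n.
pi_edge = 1 - miss(N, n, 2)
estimates = [
    sum(1 for b in betas if b & set(s0)) / pi_edge
    for s0 in combinations(range(N), n)
]
mean = sum(estimates) / len(estimates)
var_enumerated = sum((e - mean) ** 2 for e in estimates) / len(estimates)
```

Here `var_formula` and `var_enumerated` coincide up to rounding. Enumeration is only feasible for tiny graphs, which is why the closed-form inclusion probabilities matter in practice.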

5. Special Cases and Theoretical Generalizations

T-wave SBS unifies several classical sampling designs in a single graph-theoretic formalism:

Random-node (T=0) sampling: No waves. The ancestor set reduces to $\beta_i^*=\{i\}$ ; $\pi_i$ is the initial inclusion probability.
Random-edge sampling: A T=0 design applied to the bipartite incidence graph, with the edges of the original graph playing the role of “nodes”.
Multiplicity sampling (Birnbaum–Sirken): The bipartite sampling-unit/measurement-unit relation forms a graph $G=(F\cup\Omega, E)$ ; 1-wave SBS from $F$ with RIOP reconstructs classical multiplicity estimators through known ancestor sets and yields direct application of the HT formula.
Adaptive cluster sampling (Thompson): Treating the spatial grid and neighbor-criteria as an infinite-wave ( $T\to\infty$ ) directed procedure, clusters form connected components detected via suitable $\beta_i^*$ .
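
As a consistency check on the random-node case, assume SRS without replacement of size $n$ . Substituting $m_i = 1$ into the single-node inclusion probability of Section 2 recovers the familiar sampling fraction:

$\pi_i = 1 - \frac{C(N-1, n)}{C(N, n)} = 1 - \frac{N-n}{N} = \frac{n}{N},$

so the HT estimator of a node total reduces to the usual expansion estimator

$\hat{\theta}_{HT} = \frac{N}{n} \sum_{i \in s_0} y_i.$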

6. Methodological Advantages and Practical Considerations

T-wave SBS possesses several notable advantages:

Flexibility: Adjusting $T$ controls the trade-off between traversal cost and variance.
Motif Generality: Any finite-order subgraph motif can be estimated within the same HT framework.
Sampling Efficiency: Breadth-first “waves” tend to oversample high-degree regions, increasing efficiency for motifs $y_\kappa$ associated with local subgraph density.
Unification: SBS instantiates a general theory encompassing otherwise disparate sampling regimes, allowing analytical connection between motif estimation and finite-population sampling theory.

The key operational principle is that T-wave snowball sampling equates to BFS truncated at depth $T$ from a randomly sampled seed set, with motif inclusion determined by ancestor-set intersection with $s_0$ . Once inclusion probabilities for motifs are characterized, unbiased estimation and closed-form variance expressions are immediate (Oguz-Alper et al., 2020).

References

Oguz-Alper et al. (2020). Snowball sampling from graphs.
