N-Ball Sampling for Unbiased Graph Estimation
- N-ball Sampling Technique is a probabilistic graph sampling method that employs BFS expansion from a randomized seed to unbiasedly estimate subgraph motifs.
- It generalizes classical designs such as random-node, multiplicity, and adaptive cluster sampling while effectively capturing local subgraph densities.
- The technique's flexibility in adjusting the number of expansion waves enables a balance between computational cost and variance reduction for accurate motif counts.
N-ball sampling, also referred to as T-wave snowball sampling (SBS), is a probabilistic graph sampling technique defined by breadth-first expansion from a randomized initial node sample, aiming to unbiasedly estimate finite-order subgraph totals such as triangles, cycles, or stars. The methodology generalizes multiple classical sampling paradigms, including random-node, random-edge, multiplicity, and adaptive cluster sampling, by constructing “N-balls” (subgraphs induced by nodes within T steps from seed nodes) and employing Horvitz–Thompson normalization for unbiased inference (Oguz-Alper et al., 2020).
1. Formal Definition
Let $G = (U, A)$ denote a simple undirected graph with $N = |U|$ nodes. The N-ball (or T-wave SBS) protocol comprises the following steps:
- Initial Sample and Observation: Select an initial seed set $s_0 \subseteq U$ (typically of fixed size $n$) using any probability design admitting known first- and higher-order inclusion probabilities.
- Reciprocal Incident Observation Procedure (RIOP): Whenever a node $i$ enters the sample, observe all incident edges $(i, j) \in A$ and their endpoints $j$.
- Wave Expansion (N-balls): For step $t = 1, \dots, T$, define
$$\alpha(s_{t-1}) = \bigcup_{i \in s_{t-1}} \{\, j \in U : (i, j) \in A \,\}$$
and set the new sample increment as
$$s_t = \alpha(s_{t-1}) \setminus \bigcup_{r=0}^{t-1} s_r.$$
The process is iterated up to $T$ waves (or until $s_t = \emptyset$), yielding the final sample $s = \bigcup_{t=0}^{T} s_t$.
- Final Sample Graph: The discovered sample graph is $G_s = (U_s, A_s)$, consisting of the observed node set $U_s$ and all edges $A_s$ exposed during RIOP.
This framework yields an “N-ball of radius $T$ around $s_0$” in the breadth-first search (BFS) sense.
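The wave expansion can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary representation and the function name `snowball_waves` are assumptions made for the example.

```python
def snowball_waves(adj, seeds, T):
    """Return [s_0, s_1, ...]: the seed set followed by each new wave.

    adj   : dict mapping node -> iterable of neighbours (undirected graph)
    seeds : iterable of seed nodes (the initial sample s_0)
    T     : maximum number of waves
    """
    waves = [set(seeds)]
    seen = set(seeds)
    for _ in range(T):
        # RIOP: every node of the latest wave reveals all its neighbours.
        frontier = {j for i in waves[-1] for j in adj[i]}
        new_nodes = frontier - seen      # keep only nodes not sampled before
        if not new_nodes:                # expansion has exhausted the component
            break
        waves.append(new_nodes)
        seen |= new_nodes
    return waves


# Hypothetical example: a path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(snowball_waves(path, seeds=[0], T=2))   # [{0}, {1}, {2}]
```

The union of the returned sets is the final node sample; stopping early when a wave is empty mirrors the "or until the increment is empty" condition above.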
2. Inclusion Probabilities and Ancestor Sets
An essential component of unbiased graph motif estimation under SBS is the computation of inclusion probabilities, leveraging the structure of SBS-ancestor sets:
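- For a motif $\kappa$ (e.g., node, edge, triangle), define its ancestor set as:
$$\beta_\kappa = \{\, i \in U : i \in s_0 \text{ leads to the observation of } \kappa \text{ in } G_s \,\}.$$
Typically, a recognized subset $\beta_\kappa^* \subseteq \beta_\kappa$ is used, identifiable from the observed $G_s$ when $\kappa$ is seen.
- The first-order inclusion probability is:
$$\pi_\kappa = \Pr(\beta_\kappa \cap s_0 \neq \emptyset).$$
For a single-node motif $\kappa = \{i\}$ and initial simple random sampling (SRS) without replacement of size $n$ from the $N$ nodes:
$$\pi_i = 1 - \binom{N - |\beta_i|}{n} \Big/ \binom{N}{n},$$
where $|\beta_i|$ is the number of ancestors of node $i$.
- Higher-order (joint) inclusion probabilities for any motif set $\{\kappa_1, \dots, \kappa_m\}$ are:
$$\pi_{\kappa_1 \cdots \kappa_m} = \Pr(\beta_{\kappa_1} \cap s_0 \neq \emptyset, \, \dots, \, \beta_{\kappa_m} \cap s_0 \neq \emptyset).$$
In SRS, these probabilities admit closed-form combinatorial representation in terms of $\beta$-set overlaps (e.g., $|\beta_\kappa \cap \beta_{\kappa'}|$).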
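As an illustration of these probabilities under SRS seeds, the sketch below computes node-level ancestor sets and first- and second-order inclusion probabilities. It assumes the whole graph is available, which holds only in simulation, and that the ancestor set of a node in an undirected graph is its radius-T ball. The helper names (`ball`, `incl_prob_srs`, `joint_incl_prob_srs`) are hypothetical.

```python
from collections import deque
from math import comb


def ball(adj, i, T):
    """Nodes within T steps of i (BFS), used here as the ancestor set of node i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == T:                 # do not expand beyond radius T
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def incl_prob_srs(N, n, m):
    """P(an SRS of n seeds from N nodes hits an ancestor set of size m)."""
    return 1 - comb(N - m, n) / comb(N, n)


def joint_incl_prob_srs(N, n, beta_a, beta_b):
    """P(both ancestor sets are hit), by inclusion-exclusion on the 'missed' events."""
    def miss(m):
        return comb(N - m, n) / comb(N, n)
    return 1 - miss(len(beta_a)) - miss(len(beta_b)) + miss(len(beta_a | beta_b))


# Hypothetical example: path graph 0 - 1 - 2 - 3 - 4, two seeds, one wave.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
N, n, T = len(path), 2, 1
beta_0, beta_2 = ball(path, 0, T), ball(path, 2, T)
pi_0 = incl_prob_srs(N, n, len(beta_0))
pi_2 = incl_prob_srs(N, n, len(beta_2))
pi_02 = joint_incl_prob_srs(N, n, beta_0, beta_2)
```

The joint probability depends on the two ancestor sets only through their sizes and the size of their union, which is why overlaps between ancestor sets determine the closed forms.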
3. Unbiased Estimation via Horvitz–Thompson
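Given motif set $\Omega$, with associated values $y_\kappa$, the inferential target is $\theta = \sum_{\kappa \in \Omega} y_\kappa$. The canonical unbiased estimator (Horvitz–Thompson) is
$$\hat{\theta}_{HT} = \sum_{\kappa \in \Omega_s} \frac{y_\kappa}{\pi_\kappa},$$
where $\Omega_s$ is the set of motifs observed in $G_s$ (those for which $\beta_\kappa \cap s_0 \neq \emptyset$). This estimator is unbiased for $\theta$:
$$E[\hat{\theta}_{HT}] = \sum_{\kappa \in \Omega} \frac{y_\kappa}{\pi_\kappa} \, E[\mathbb{I}(\kappa \in \Omega_s)] = \sum_{\kappa \in \Omega} y_\kappa = \theta,$$
since $E[\mathbb{I}(\kappa \in \Omega_s)] = \pi_\kappa$ and summing over $\Omega$ yields exact expectation.
This formalism encompasses node totals ($\Omega = U$), edge totals ($\Omega = A$), triangle counts, or totals over arbitrary motifs.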
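The unbiasedness claim can be checked numerically on a toy graph. The sketch below estimates the number of edges and averages the Horvitz–Thompson estimate over every possible seed set. It rests on simplifying assumptions: seeds are drawn by SRS without replacement, the full graph is known (as in a simulation), and an edge counts as observed when either endpoint lies within T steps of a seed, so its ancestor set is the union of the two endpoint balls. The graph and function names are hypothetical.

```python
from itertools import combinations
from math import comb


def ball(adj, i, T):
    """Nodes within T steps of node i."""
    reached, frontier = {i}, {i}
    for _ in range(T):
        frontier = {v for u in frontier for v in adj[u]} - reached
        reached |= frontier
    return reached


def ht_edge_count(adj, seeds, T):
    """Horvitz-Thompson estimate of the edge count from one T-wave sample."""
    N, n = len(adj), len(seeds)
    seeds = set(seeds)
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    estimate = 0.0
    for u, v in edges:
        beta = ball(adj, u, T) | ball(adj, v, T)    # assumed ancestor set of the edge
        if beta & seeds:                            # the edge is observed
            pi = 1 - comb(N - len(beta), n) / comb(N, n)
            estimate += 1 / pi                      # y = 1 for a count
    return estimate


# Hypothetical graph: a triangle 0-1-2 with a tail 2-3-4 (5 edges).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
n, T = 2, 1
estimates = [ht_edge_count(adj, s, T) for s in combinations(adj, n)]
mean_estimate = sum(estimates) / len(estimates)     # equals 5 up to rounding
```

Because all seed sets are equally likely under SRS, the plain average over all of them is the exact expectation of the estimator, and it matches the true edge count.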
4. Sampling Variance Analysis
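The variance of the Horvitz–Thompson estimator under SBS design on the seed set $s_0$ is
$$V(\hat{\theta}_{HT}) = \sum_{\kappa \in \Omega} \sum_{\kappa' \in \Omega} \left( \frac{\pi_{\kappa\kappa'}}{\pi_\kappa \pi_{\kappa'}} - 1 \right) y_\kappa y_{\kappa'}, \qquad \pi_{\kappa\kappa} = \pi_\kappa.$$
For node-sum estimation,
$$V(\hat{\theta}_{HT}) = \sum_{i \in U} \sum_{j \in U} \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_i y_j.$$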
Approximate forms arise if all $\pi_i$ are small or the sampling fraction $n/N$ is small, in which case independence is assumed:
$$V(\hat{\theta}_{HT}) \approx \sum_{i \in U} \frac{1 - \pi_i}{\pi_i} \, y_i^2.$$
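The HT methodology supports unbiased variance estimation:
$$\hat{V}(\hat{\theta}_{HT}) = \sum_{\kappa \in \Omega_s} \sum_{\kappa' \in \Omega_s} \frac{1}{\pi_{\kappa\kappa'}} \left( \frac{\pi_{\kappa\kappa'}}{\pi_\kappa \pi_{\kappa'}} - 1 \right) y_\kappa y_{\kappa'}.$$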
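To make the variance expression concrete, the following sketch evaluates the double-sum formula for a node total and compares it with a brute-force variance over all seed sets. It is a simulation-style check under assumptions that are not part of the source: SRS seeds, a fully known graph, node ancestor sets equal to radius-T balls, and node degree as the study variable. All names are hypothetical.

```python
from itertools import combinations
from math import comb


def ball(adj, i, T):
    """Nodes within T steps of node i."""
    reached, frontier = {i}, {i}
    for _ in range(T):
        frontier = {v for u in frontier for v in adj[u]} - reached
        reached |= frontier
    return reached


def miss_prob(N, n, beta):
    """P(no seed of an SRS of size n falls in beta)."""
    return comb(N - len(beta), n) / comb(N, n)


def ht_variance_srs(adj, y, T, n):
    """Double-sum HT variance using closed-form joint inclusion probabilities."""
    N = len(adj)
    beta = {i: ball(adj, i, T) for i in adj}
    pi = {i: 1 - miss_prob(N, n, beta[i]) for i in adj}
    var = 0.0
    for i in adj:
        for j in adj:
            pi_ij = (1 - miss_prob(N, n, beta[i]) - miss_prob(N, n, beta[j])
                     + miss_prob(N, n, beta[i] | beta[j]))   # equals pi[i] when i == j
            var += (pi_ij / (pi[i] * pi[j]) - 1) * y[i] * y[j]
    return var


def brute_force_variance(adj, y, T, n):
    """Mean squared deviation of the HT estimate over all equally likely seed sets."""
    N = len(adj)
    beta = {i: ball(adj, i, T) for i in adj}
    pi = {i: 1 - miss_prob(N, n, beta[i]) for i in adj}
    total = sum(y.values())
    sq_dev = [
        (sum(y[i] / pi[i] for i in adj if beta[i] & set(s)) - total) ** 2
        for s in combinations(adj, n)
    ]
    return sum(sq_dev) / len(sq_dev)


# Hypothetical graph: a triangle 0-1-2 with a tail 2-3-4; y_i is the degree of node i.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
y = {i: len(adj[i]) for i in adj}
n, T = 2, 1
v_formula = ht_variance_srs(adj, y, T, n)
v_brute = brute_force_variance(adj, y, T, n)   # agrees with v_formula up to rounding
```

The two values coincide because the estimator is unbiased, so its mean squared deviation from the true total is exactly its variance.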
5. Special Cases and Theoretical Generalizations
T-wave SBS unifies several classical sampling designs in a single graph-theoretic formalism:
- Random-node (T=0) sampling: No waves. The ancestor set reduces to $\beta_i = \{i\}$; $\pi_i$ is the initial inclusion probability.
- Random-edge sampling: A bipartite T=0 design applied to the incidence graph, with edges playing the role of “nodes”.
- Multiplicity sampling (Birnbaum–Sirken): The bipartite sampling-unit/measurement-unit relation forms a graph; 1-wave SBS from the initial sample of sampling units with RIOP reconstructs classical multiplicity estimators through known ancestor sets and yields direct application of the HT formula.
- Adaptive cluster sampling (Thompson): When the spatial grid and neighbor criteria are treated as an infinite-wave ($T = \infty$) directed procedure, clusters correspond to connected components detected via suitably defined ancestor sets.
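As a quick check of the random-node case, assume SRS without replacement of $n$ seeds from $N$ nodes. With no waves, the ancestor set of node $i$ is the node itself, so $|\beta_i| = 1$ and the single-node inclusion probability from Section 2 collapses to the familiar SRS value:

$$\pi_i = 1 - \binom{N - 1}{n} \Big/ \binom{N}{n} = 1 - \frac{N - n}{N} = \frac{n}{N}.$$

The HT estimator of a node total then reduces to the usual expansion estimator $(N/n) \sum_{i \in s_0} y_i$.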
6. Methodological Advantages and Practical Considerations
T-wave SBS possesses several notable advantages:
- Flexibility: Adjusting the number of waves $T$ controls the trade-off between traversal cost and variance.
- Motif Generality: Any finite-order subgraph motif can be estimated within the same HT framework.
- Sampling Efficiency: Breadth-first “waves” tend to oversample high-degree regions, increasing efficiency for motifs associated with local subgraph density.
- Unification: SBS instantiates a general theory encompassing otherwise disparate sampling regimes, allowing analytical connection between motif estimation and finite-population sampling theory.
The key operational principle is that T-wave snowball sampling equates to BFS truncated at depth $T$ from a randomly sampled seed set, with motif inclusion determined by ancestor-set intersection with the seed set $s_0$. Once inclusion probabilities for motifs are characterized, unbiased estimation and closed-form variance expressions are immediate (Oguz-Alper et al., 2020).