Circulated Neighbors Random Walk (CNRW)
- CNRW is a higher-order Markov chain Monte Carlo algorithm that enhances uniform node sampling by tracking visited neighbors per entry edge.
- It systematically circulates through all available neighbors before repeating transitions, reducing variance and shortening burn-in time.
- Empirical results show a 30–50% reduction in sampling steps needed to achieve target estimator accuracy on large social networks.
Circulated Neighbors Random Walk (CNRW) is a higher-order Markov chain Monte Carlo sampling algorithm for graphs, introduced to accelerate uniform node sampling over large online social networks while accessing only local neighborhood information through restricted web or API interfaces. CNRW modifies the simple random walk (SRW) paradigm by deterministically circulating over neighbors of each node—conditioned on the entry edge—before repeating neighbors, thereby improving the mixing rate and reducing variance without altering the stationary distribution. Extensively evaluated on large-scale real-world and synthetic networks, CNRW achieves a 30–50% reduction in burn-in and sampling cost for fixed estimation accuracy relative to SRW (Zhou et al., 2015).
1. CNRW Algorithmic Design and Intuition
Traditional SRW generates a node sequence on a graph $G=(V,E)$, choosing each next node uniformly at random from the neighbors $N(v)$ of the current node $v$, disregarding prior visit history. In contrast, CNRW extends this to a higher-order chain by tracking, for each directed edge $(u,v)$, which neighbors of $v$ have already been selected when traversing from $u$ to $v$. Traversal from $v$ to a neighbor uniformly samples from $N(v) \setminus B_{(u,v)}$, where $B_{(u,v)}$ captures already-used neighbors; upon exhausting all neighbors, $B_{(u,v)}$ resets to $\emptyset$ and selection proceeds afresh. This discourages repeated transitions along the same localized paths (which can trap the walk in subgraphs), fostering better exploration and expedited mixing.
The intuition is that CNRW avoids local trapping by forcing the walk to exploit all outgoing possibilities from $v$, conditioned on how $v$ was entered, before any repetition, thus systematically enhancing coverage of the network's topology.
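Concretely, the circulation mechanism admits a compact implementation. The sketch below is a minimal illustration, not the paper's reference code; the names `adj`, `cnrw_next`, and `cnrw_walk` are chosen here for exposition:

```python
import random

def cnrw_next(adj, prev, cur, used, rng):
    """One CNRW transition out of `cur`, conditioned on the entry edge (prev, cur).

    `used` maps each directed entry edge to the set B of neighbors of `cur`
    already chosen after arriving via that edge.
    """
    blocked = used.setdefault((prev, cur), set())
    candidates = [w for w in adj[cur] if w not in blocked]
    if not candidates:            # every neighbor circulated once: new round
        blocked.clear()
        candidates = list(adj[cur])
    nxt = rng.choice(candidates)  # uniform over the not-yet-used neighbors
    blocked.add(nxt)
    return nxt

def cnrw_walk(adj, start, steps, seed=0):
    """Run a CNRW walk of `steps` transitions from `start` on adjacency dict `adj`."""
    rng = random.Random(seed)
    used, path = {}, [start]
    prev, cur = None, start
    for _ in range(steps):
        nxt = cnrw_next(adj, prev, cur, used, rng)
        prev, cur = cur, nxt
        path.append(cur)
    return path
```

Because selection is without replacement per entry edge, any $d(v)$ consecutive transitions keyed on the same $(u,v)$ visit each neighbor of $v$ exactly once before the blocked set resets.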
2. Formal Specification and Transition Probabilities
Let the time index be $t$, with $X_t$ denoting the walk location and $(X_0, \dots, X_t)$ the path history. For each traversal of a directed edge $(u,v)$, $B_{(u,v)}$ records the set of neighbors of $v$ that have already been chosen immediately after arriving via $(u,v)$. The formal transition rule is: if $N(v) \setminus B_{(u,v)} = \emptyset$, $B_{(u,v)}$ resets to $\emptyset$, and

$$\Pr\big(X_{t+1} = w \mid X_t = v,\ X_{t-1} = u\big) = \frac{1}{\lvert N(v) \setminus B_{(u,v)} \rvert}, \qquad w \in N(v) \setminus B_{(u,v)},$$

which equals $1/d(v)$ immediately after a reset, where $d(v) = \lvert N(v) \rvert$ is the degree of $v$.
The algorithm uses a hash-map data structure for each distinct entry edge $(u,v)$ to store already-used neighbors. Each step requires expected $O(1)$ time using dynamic hashing. The storage requirement grows as $O(n)$ over $n$ steps, corresponding to the distinct traversed edges and their blocked-neighbor lists.
3. Stationary Distribution and Variance Properties
CNRW preserves the stationary distribution of SRW: $\pi(v) = d(v)/2|E|$, where $\pi(v)$ is the stationary probability of visiting node $v$, $d(v)$ its degree, and $|E|$ is the total number of edges. The proof constructs the infinite walk trace as a concatenation of path-blocks—subpaths initiated by traversing $(u,v)$—showing that under CNRW, these path-blocks cycle through all possible neighbor-rooted block types in a without-replacement fashion, but with unchanged internal structure compared to SRW. The ergodic occupation time is thus invariant.
A key property is variance reduction: block-stratification (cycling through all neighbor options before repeating) guarantees that CNRW's asymptotic variance for ergodic averages is no higher than SRW's. This result extends Neal’s (2004) findings on variance minimization under deterministic stratification (Zhou et al., 2015).
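The degree-proportional stationary distribution can be checked empirically on a toy graph. The sketch below (standard-library Python; the adjacency list and all names are illustrative, not from the paper) runs a long seeded CNRW walk and compares visit frequencies against $d(v)/2|E|$:

```python
import random
from collections import Counter

def cnrw_walk(adj, start, steps, seed=1):
    # Minimal CNRW: per entry edge, circulate neighbors without replacement.
    rng, used = random.Random(seed), {}
    prev, cur, path = None, start, [start]
    for _ in range(steps):
        blocked = used.setdefault((prev, cur), set())
        cands = [w for w in adj[cur] if w not in blocked]
        if not cands:             # all neighbors circulated: reset the set
            blocked.clear()
            cands = list(adj[cur])
        nxt = rng.choice(cands)
        blocked.add(nxt)
        prev, cur = cur, nxt
        path.append(cur)
    return path

# A small irregular graph with degrees 3, 2, 3, 2.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
two_m = sum(len(nbrs) for nbrs in adj.values())    # 2|E| = 10
target = {v: len(adj[v]) / two_m for v in adj}     # pi(v) = d(v) / 2|E|

path = cnrw_walk(adj, 0, 200_000)
freq = Counter(path)
```

On this graph the empirical occupation frequencies should settle near the degree-proportional target, matching the invariance claim above.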
4. Burn-in, Mixing, and Efficiency Gains
Although no closed-form mixing time bound is derived, theoretical and empirical evidence demonstrates significantly accelerated mixing and reduced burn-in:
- On structured bottlenecked graphs (e.g., barbell graphs comprising two cliques), the probability of crossing sparse bridges is substantially enhanced under CNRW, yielding dramatically faster escapes from bottlenecks compared to SRW.
- Asymptotic variance analyses confirm reduced estimator variance for attribute aggregates due to stratified coverage.
- Empirical results reveal that CNRW and its extension GNRW require 30–50% fewer steps to reach target estimator accuracy, or to achieve comparable bias, relative to SRW and non-backtracking SRW (NB-SRW).
Practical experiments further demonstrate for real-world social network datasets (Google+, Yelp, Facebook, YouTube) that estimator relative error and measures of sampling bias (KL-divergence, distance to stationary) fall more rapidly for CNRW than for traditional schemes, across sampling budgets from 100 to 1,000 steps.
5. Query and Space Complexity
CNRW maintains efficient per-step computational complexity, with random neighbor selection implemented via hash-maps tracking used neighbors for each traversed edge. Total storage overhead is $O(n)$ for $n$-step runs, scaling linearly with the number of distinct edge traversals. Query complexity—the number of unique neighborhood retrievals—matches the number of walk steps, with redundant neighborhood requests resolved by local caching. The overall convergence is such that, for any target estimator variance or bias, CNRW never requires more steps, and typically fewer, than SRW.
Implementation over real-world APIs necessitates local cache management of node neighborhoods and modest additional memory for blocked-neighbor tracking per directed edge. This overhead remains practical for typical sampling budgets.
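The caching requirement can be sketched with a wrapper that counts unique neighborhood retrievals. The `CachedGraphAPI` class below is a hypothetical stand-in for a restricted web/API interface, not part of the original work:

```python
import random

class CachedGraphAPI:
    """Hypothetical local-access interface: neighborhoods are fetched one
    node at a time (as through a web/API endpoint) and cached locally, so
    revisiting a node incurs no extra query."""
    def __init__(self, adj):
        self._adj = adj       # stands in for the remote graph
        self._cache = {}
        self.queries = 0      # unique neighborhood retrievals issued

    def neighbors(self, v):
        if v not in self._cache:
            self.queries += 1
            self._cache[v] = list(self._adj[v])
        return self._cache[v]

def cnrw_with_cache(api, start, steps, seed=0):
    """CNRW driven through the cached API rather than a local adjacency dict."""
    rng, used = random.Random(seed), {}
    prev, cur, visited = None, start, [start]
    for _ in range(steps):
        nbrs = api.neighbors(cur)
        blocked = used.setdefault((prev, cur), set())
        cands = [w for w in nbrs if w not in blocked]
        if not cands:
            blocked.clear()
            cands = list(nbrs)
        nxt = rng.choice(cands)
        blocked.add(nxt)
        prev, cur = cur, nxt
        visited.append(cur)
    return visited

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
api = CachedGraphAPI(adj)
walk = cnrw_with_cache(api, 0, 50)
```

With the cache in place, the number of queries is bounded by the number of distinct nodes the walk visits rather than by the number of steps.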
6. Practical Considerations and Extensions
CNRW and its generalization, Groupby Neighbors Random Walk (GNRW), function as drop-in replacements for SRW, requiring no global access to network topology. GNRW further partitions neighbors by attribute (such as node degree or review count) and cycles within strata, yielding further improvements for attribute-specific estimations—an effect analogous to stratified sampling. The grouping function in GNRW is application-dependent; the selection and potential automation of such grouping remains an open area for research.
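The stratified variant can be sketched as follows. This is one plausible reading of the group-wise circulation idea, not a transcription of the paper's exact GNRW procedure; the grouping function, the size-proportional group draw, and all names are assumptions made here for illustration:

```python
import random

def gnrw_next(neighbors, group_of, state, rng):
    """One GNRW-style transition (sketch): neighbors are stratified by a
    user-supplied grouping function, a group is drawn with probability
    proportional to its size, and members circulate without replacement
    inside each group.

    `state` holds per-group sets of already-used members for the current
    entry edge (the caller keeps one such dict per entry-edge key).
    """
    groups = {}
    for w in neighbors:
        groups.setdefault(group_of(w), []).append(w)
    labels = list(groups)
    weights = [len(groups[g]) for g in labels]
    g = rng.choices(labels, weights=weights, k=1)[0]
    used = state.setdefault(g, set())
    cands = [w for w in groups[g] if w not in used]
    if not cands:                 # stratum exhausted: start a new round
        used.clear()
        cands = list(groups[g])
    nxt = rng.choice(cands)
    used.add(nxt)
    return nxt
```

Grouping by an attribute correlated with the estimation target (e.g., degree buckets) is what yields the stratified-sampling effect described above.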
Directed graphs may be handled by symmetrization of the adjacency structure or by tracking in- and out-neighbors separately for each edge. Memory scalability limits applicability in extremely large-diameter graphs unless walk budgets remain moderate.
A principal open question is the derivation of explicit mixing-time bounds for CNRW relative to core graph metrics such as conductance. Notably, empirical results suggest substantial improvement, but formal characterization is pending.
7. Experimental Results on Network Datasets
Extensive experiments document the efficiency and accuracy of CNRW and GNRW:
- For Google+ (240K nodes), Yelp (120K nodes), and other major social network snapshots:
- Estimator relative error for average degree drops below $0.06$ after approximately $450$ steps for CNRW/GNRW, compared to about $800$ steps for SRW/NB-SRW.
- KL-divergence and distance to the stationary distribution remain 30–50% lower for CNRW/GNRW across sample budgets of 100–1,000 steps.
- On synthetic clustered and barbell graphs, CNRW/GNRW reduce statistical bias by a factor of $1.5$–$2$ compared to SRW.
- GNRW yields further error reduction for aggregates over grouped attributes, demonstrating the benefits of stratification.
These results confirm the practical advantage of incorporating history-aware transition mechanisms such as those in CNRW for network sampling in applications constrained by API access or local-neighborhood visibility (Zhou et al., 2015).