Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks

Published 25 May 2016 in cs.SI, cs.DS, and physics.soc-ph | (1605.07990v3)

Abstract: Influence Maximization (IM), that seeks a small set of key users who spread the influence widely into the network, is a core problem in multiple domains. It finds applications in viral marketing, epidemic control, and assessing cascading failures within complex systems. Despite the huge amount of effort, IM in billion-scale networks such as Facebook, Twitter, and World Wide Web has not been satisfactorily solved. Even the state-of-the-art methods such as TIM+ and IMM may take days on those networks. In this paper, we propose SSA and D-SSA, two novel sampling frameworks for IM-based viral marketing problems. SSA and D-SSA are up to 1200 times faster than the SIGMOD'15 best method, IMM, while providing the same $(1-1/e-ε)$ approximation guarantee. Underlying our frameworks is an innovative Stop-and-Stare strategy in which they stop at exponential check points to verify (stare) if there is adequate statistical evidence on the solution quality. Theoretically, we prove that SSA and D-SSA are the first approximation algorithms that use (asymptotically) minimum numbers of samples, meeting strict theoretical thresholds characterized for IM. The absolute superiority of SSA and D-SSA are confirmed through extensive experiments on real network data for IM and another topic-aware viral marketing problem, named TVM. The source code is available at https://github.com/hungnt55/Stop-and-Stare


Authors (3)

Citations (368)


Summary

The paper’s main contribution is the development of SSA and D-SSA algorithms that achieve near-optimal sample complexity for influence maximization.
It introduces a sequential, adaptive stop-and-stare procedure using reverse reachable sets with tight statistical bounds to verify seed set quality.
Experiments on massive networks show up to 1200x speedup and 10–100x memory reduction, demonstrating practical scalability in viral marketing.

Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks — Expert Essay

Overview and Problem Formulation

The paper "Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks" (1605.07990) addresses the computational bottleneck in Influence Maximization (IM) over large-scale social networks under the Independent Cascade (IC) and Linear Threshold (LT) models. IM, as originally formalized, asks for a size-$k$ seed set that maximizes the expected influence spread. The problem is NP-hard, and even computing the exact influence of a single seed set is #P-hard. Despite previous progress, state-of-the-art RIS (Reverse Influence Sampling) algorithms such as IMM and TIM+ may still take days on networks with billions of edges. The paper's core contribution is two new sampling-based approximation algorithms, SSA (Stop-and-Stare Algorithm) and D-SSA (Dynamic SSA), which use an asymptotically minimal number of samples and empirically outperform existing methods by up to three orders of magnitude in runtime (up to 1200x) and by 10–100x in memory use.
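To make the objective concrete, the following is a minimal sketch of estimating the expected spread of a seed set under the IC model by plain Monte Carlo simulation. It is not the paper's method (the paper avoids this kind of forward simulation in favor of RIS); the graph representation and the names `simulate_ic` and `estimate_spread` are assumptions made for illustration.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade simulation and return the number of active nodes.

    graph maps a node to a list of (out_neighbor, activation_probability) pairs.
    This adjacency format is an illustrative assumption.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    total = sum(simulate_ic(graph, seeds, rng) for _ in range(runs))
    return total / runs
```

Each call to `estimate_spread` costs `runs` full cascades, and a greedy IM algorithm would need such an estimate for many candidate seed sets, which is why simulation-based methods do not scale to billion-edge graphs.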

Unified RIS Framework and Minimality

The paper provides a rigorous unification and extension of previous RIS sampling results. The RIS paradigm reduces IM to a maximum coverage problem over a collection of random Reverse Reachable (RR) sets: the influence of a seed set is proportional to the probability that it intersects a random RR set, so a seed set that covers many sampled RR sets has high estimated influence. However, the sampling threshold (i.e., how many RR sets are necessary for the approximation guarantee) remained loosely bounded in prior work. The authors formalize two classes of minimal RIS thresholds:

Type-1 minimal threshold: The minimal number of RR sets needed to guarantee a $(1-1/e-\epsilon)$-approximate size-$k$ seed set with high probability, under a fixed split of the error budget into estimation and coverage components.
Type-2 minimal threshold: The global minimum over all feasible splits of the error budget, providing the tightest guarantee achievable by RIS-based methods.

The paper proves that previous thresholds (e.g., in IMM and TIM+) are not minimal and are not guaranteed to be within a constant factor of the minimal ones, and it quantifies the improvement made by its new algorithms.
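The RIS reduction described above can be sketched as follows for the IC model: sample RR sets by a randomized reverse traversal from a uniformly random root, then pick seeds by greedy maximum coverage. This is a simplified illustration, not the authors' implementation; the edge-list format and all function names are assumptions.

```python
import random


def reverse_graph(edges):
    """Build v -> [(u, p)] from an iterable of directed edges (u, v, p)."""
    rev = {}
    for u, v, p in edges:
        rev.setdefault(v, []).append((u, p))
    return rev


def sample_rr_set(nodes, rev, rng):
    """Sample one RR set under IC: nodes that reach a random root via live edges."""
    root = rng.choice(nodes)
    rr = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in rev.get(v, []):
            # Edge (u, v) is kept ("live") with probability p.
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def greedy_max_coverage(rr_sets, k):
    """Pick up to k nodes greedily; return (seeds, number of RR sets covered)."""
    node_to_sets = {}
    for i, rr in enumerate(rr_sets):
        for u in rr:
            node_to_sets.setdefault(u, []).append(i)
    covered = [False] * len(rr_sets)
    seeds = []
    for _ in range(k):
        best, best_gain = None, 0
        for u, idxs in node_to_sets.items():
            gain = sum(1 for i in idxs if not covered[i])
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:  # every RR set is already covered
            break
        seeds.append(best)
        for i in node_to_sets[best]:
            covered[i] = True
    return seeds, sum(covered)


def estimate_influence(n, rr_sets, seeds):
    """RIS estimator: n times the fraction of RR sets hit by the seed set."""
    seed_set = set(seeds)
    hits = sum(1 for rr in rr_sets if not seed_set.isdisjoint(rr))
    return n * hits / len(rr_sets)
```

The open question that the thresholds formalize is how many RR sets to draw before calling `greedy_max_coverage`: too few and the guarantee fails, too many and time and memory are wasted.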

Stop-and-Stare Algorithm (SSA)

SSA introduces a sequential, statistically-driven sampling procedure that adaptively determines when its estimate is acceptable, avoiding superfluous sample generation. The algorithm's operation is as follows:

In exponentially increasing rounds, SSA generates RR sets and runs greedy maximum coverage on them to select a candidate seed set.
At each checkpoint, it verifies, using a statistically tight stopping rule, whether there is sufficient evidence to guarantee the candidate's quality with high probability.
If the evidence is insufficient, the algorithm doubles the number of RR sets and repeats.
The number of RR sets used by this "stop-and-stare" protocol is provably within a constant factor of the type-1 minimal RIS threshold.
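The loop above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's SSA: the evidence threshold `min_cover`, the single `eps` used for every check, and the use of an equally sized fresh sample for verification are all illustrative choices, and the paper's exact constants and conditions differ.

```python
import math
import random


def sample_rr_set(nodes, rev, rng):
    """Sample one RR set under IC; rev maps v -> [(u, p)] for edges (u, v, p)."""
    root = rng.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u, p in rev.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def greedy_cover(rr_sets, k):
    """Greedy maximum coverage; returns (seeds, number of RR sets covered)."""
    node_to_sets = {}
    for i, rr in enumerate(rr_sets):
        for u in rr:
            node_to_sets.setdefault(u, []).append(i)
    covered, seeds = [False] * len(rr_sets), []
    for _ in range(k):
        best, best_gain = None, 0
        for u, idxs in node_to_sets.items():
            gain = sum(1 for i in idxs if not covered[i])
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break
        seeds.append(best)
        for i in node_to_sets[best]:
            covered[i] = True
    return seeds, sum(covered)


def stop_and_stare(nodes, rev, k, eps=0.2, delta=0.05, max_samples=1 << 16, seed=0):
    """Simplified stop-and-stare loop; returns (seeds, estimate, samples used)."""
    rng = random.Random(seed)
    n = len(nodes)
    # Illustrative evidence threshold from a Chernoff-style bound (an assumption).
    min_cover = (2 + eps) * math.log(1 / delta) / eps ** 2
    rr_sets = [sample_rr_set(nodes, rev, rng) for _ in range(math.ceil(min_cover))]
    while True:
        # Stop: pick a candidate on the current sample.
        seeds, cov = greedy_cover(rr_sets, k)
        est_find = n * cov / len(rr_sets)  # optimistic: same samples chose the seeds
        if cov >= min_cover:
            # Stare: re-estimate the candidate on fresh, independent RR sets.
            fresh = [sample_rr_set(nodes, rev, rng) for _ in range(len(rr_sets))]
            seed_set = set(seeds)
            hits = sum(1 for rr in fresh if not seed_set.isdisjoint(rr))
            est_check = n * hits / len(fresh)
            if est_find <= (1 + eps) * est_check:
                return seeds, est_check, len(rr_sets)
        if len(rr_sets) >= max_samples:
            return seeds, est_find, len(rr_sets)  # cap reached, no verified guarantee
        extra = len(rr_sets)  # double the sample and try again
        rr_sets.extend(sample_rr_set(nodes, rev, rng) for _ in range(extra))
```

The design point the sketch tries to show is that the sample size is not fixed in advance from a worst-case bound; it grows only until the independent check agrees with the optimistic estimate.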

Parametric tuning of error splits is explicitly addressed. Both detailed theoretical analysis and concrete parameter prescriptions for practical networks are given. A technical highlight is the proof that, under reasonable assumptions on network size and influence spread, the Chernoff-Hoeffding bounds are tight and the algorithm cannot be significantly improved in terms of RIS sample economy.
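As a worked illustration of why the required number of RR sets depends on the influence itself, the sketch below inverts the standard multiplicative Chernoff upper-tail bound, $\Pr[X \ge (1+\epsilon)T\mu] \le \exp(-\epsilon^2 T\mu/(2+\epsilon))$, for $T$ independent Bernoulli($\mu$) samples. Here $\mu$ plays the role of the probability that a random RR set intersects the seed set. The function name is hypothetical, and this textbook bound is not necessarily the exact inequality used in the paper.

```python
import math


def chernoff_samples(eps, delta, mu):
    """Smallest T with exp(-eps^2 * T * mu / (2 + eps)) <= delta.

    eps: relative error, delta: failure probability, mu: per-sample hit probability.
    """
    if not (eps > 0 and 0 < delta < 1 and 0 < mu <= 1):
        raise ValueError("need eps > 0, 0 < delta < 1, 0 < mu <= 1")
    return math.ceil((2 + eps) * math.log(1 / delta) / (eps ** 2 * mu))
```

Halving `mu` roughly doubles the result, so seed sets with small influence need many more samples. Since the optimal influence is unknown in advance, an adaptive check can help avoid sizing the sample for the worst case.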

Dynamic Stop-and-Stare Algorithm (D-SSA)

D-SSA extends SSA by automatically and adaptively selecting the most advantageous split of the error budget at runtime, thereby achieving a constant-factor approximation to the type-2 minimal RIS threshold. Technically, D-SSA maintains two pools of RR sets per round: one for candidate solution discovery, one for rigorous, independent verification. Previous RR sets are re-used in subsequent rounds, optimizing sample utilization. Parameters $(\epsilon_1, \epsilon_2, \epsilon_3)$ governing solution, coverage, and estimation errors are adaptively monitored and adjusted, ensuring D-SSA always terminates with provable $(1-1/e-\epsilon)$ approximation using near-minimal samples.
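The two-pool idea can be sketched as below: the first half of the RR sets is used to find a candidate, the second half to verify it, and on failure the whole pool is kept and doubled so that earlier samples are reused. This is a simplified illustration under stated assumptions. In particular, the acceptance test here only measures the gap `eps1` between the two estimates and compares it to `eps`, whereas the paper derives all three error terms from the data, and `min_cover` is an illustrative constant.

```python
import math
import random


def sample_rr_set(nodes, rev, rng):
    """Sample one RR set under IC; rev maps v -> [(u, p)] for edges (u, v, p)."""
    root = rng.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u, p in rev.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def greedy_cover(rr_sets, k):
    """Greedy maximum coverage; returns (seeds, number of RR sets covered)."""
    node_to_sets = {}
    for i, rr in enumerate(rr_sets):
        for u in rr:
            node_to_sets.setdefault(u, []).append(i)
    covered, seeds = [False] * len(rr_sets), []
    for _ in range(k):
        best, best_gain = None, 0
        for u, idxs in node_to_sets.items():
            gain = sum(1 for i in idxs if not covered[i])
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break
        seeds.append(best)
        for i in node_to_sets[best]:
            covered[i] = True
    return seeds, sum(covered)


def d_ssa_sketch(nodes, rev, k, eps=0.2, delta=0.05, max_samples=1 << 16, seed=0):
    """Simplified two-pool loop; returns (seeds, verified estimate, pool size)."""
    rng = random.Random(seed)
    n = len(nodes)
    min_cover = (2 + eps) * math.log(1 / delta) / eps ** 2  # illustrative assumption
    pool = [sample_rr_set(nodes, rev, rng) for _ in range(2 * math.ceil(min_cover))]
    while True:
        half = len(pool) // 2
        find_pool, verify_pool = pool[:half], pool[half:]
        seeds, cov = greedy_cover(find_pool, k)
        est_find = n * cov / len(find_pool)
        seed_set = set(seeds)
        hits = sum(1 for rr in verify_pool if not seed_set.isdisjoint(rr))
        est_verify = n * hits / len(verify_pool)
        if hits >= min_cover:
            eps1 = est_find / est_verify - 1  # measured gap, not fixed in advance
            if eps1 <= eps:
                return seeds, est_verify, len(pool)
        if len(pool) >= max_samples:
            return seeds, est_verify, len(pool)  # cap reached, no verified guarantee
        extra = len(pool)
        # Old verification sets become part of the next round's finding half.
        pool.extend(sample_rr_set(nodes, rev, rng) for _ in range(extra))
```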

Experimental Evaluation

Extensive empirical studies validate both the theoretical findings and the practical superiority of SSA and D-SSA. Experiments were performed on networks with up to 65.6 million nodes and billions of edges (e.g., Friendster, Twitter), across both IC and LT models. Numerical highlights:

Runtime: On the Friendster network (3.6B edges), SSA and D-SSA select a $k=500$ seed set in about 3.5 seconds, while IMM takes over an hour—up to 1200x speedup. On Twitter (41M nodes, 1.5B edges), D-SSA is $2\times10^9$ times faster than CELF++.
Memory and Sample Efficiency: Both SSA and D-SSA require 10–100x fewer RR sets (and hence RAM) than IMM in practical regimes.
Solution Quality: Across all datasets and parameter settings, seed set influence attained by SSA and D-SSA matches that of IMM and CELF++, verifying no significant loss in effectiveness.

The methods also generalize efficiently to Targeted Viral Marketing (TVM) via integration with WRIS, maintaining guarantees and yielding similar (500x) speedup over KB-TIM.

Theoretical and Practical Implications

On the theoretical front, this work establishes that RIS-based approximation algorithms for IM can reach, up to constant factors, the limit on sample complexity dictated by concentration inequalities and the analysis of the associated coverage problem. Furthermore, D-SSA shows that the error budget can be split adaptively at runtime, keeping the number of samples within a constant factor of the type-2 minimal threshold.

Practically, these results enable IM on graphs at least an order of magnitude larger than previously feasible, unlocking real-time viral marketing, online epidemiological intervention, and near-instant evaluation for analytics on massive social networks. The fact that SSA/D-SSA do not presuppose particular network structure and remain distribution-agnostic further enhances their applicability.

It is also noteworthy that the generic Stop-and-Stare approach could extend to other sample-based optimization problems on sketches or summaries, not just IM.

Future Directions

Given these achievements, parallel and distributed implementations of SSA/D-SSA would be a logical next step for exploiting modern large-memory, multi-core, or distributed computing environments. Another promising avenue is extending the stop-and-stare verification paradigm to non-submodular diffusion models or competitive influence settings. Dynamic networks and settings with temporal or contextual constraints could also benefit from minimal-sample adaptive algorithms built on the presented framework.

Conclusion

The paper's main contributions lie in explicitly characterizing the minimal sample thresholds for RIS-based IM, and in providing the first algorithms, SSA and D-SSA, that (i) meet those thresholds up to constant factors and (ii) verify their own approximation guarantees online. The combination of statistical optimality, empirical performance, and broad generality represents a significant advance in scalable viral marketing and social-influence optimization. The Stop-and-Stare strategy is a useful technique for provable, scalable sample-based optimization over massive graphs.
