A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs (2403.13286v1)
Abstract: Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
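To make the idea of sampling-based testing of a node hypothesis concrete, the following is a minimal sketch, not the paper's framework and not PHASE. It assumes a hypothetical node hypothesis ("the mean of a numeric node attribute exceeds a threshold"), uses uniform random node sampling as a stand-in for a hypothesis-agnostic sampler, and applies a one-sided z-test. The function names, the toy attribute values, and the threshold are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def sample_nodes(attrs, k, rng):
    """Hypothesis-agnostic sampler (assumed): k distinct nodes, uniformly at random."""
    return rng.sample(sorted(attrs), k)


def one_sided_z_test(values, mu0):
    """Test H0: mean <= mu0 against H1: mean > mu0.

    Returns (z, p_value) using a normal approximation, which is
    reasonable only for a fairly large sample.
    """
    n = len(values)
    standard_error = stdev(values) / n ** 0.5
    z = (mean(values) - mu0) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy attributed graph: only node attributes matter for a node hypothesis,
    # so edges are omitted here. Values are synthetic, not from any dataset.
    attrs = {v: rng.gauss(3.5, 1.0) for v in range(10_000)}

    nodes = sample_nodes(attrs, 500, rng)
    z, p = one_sided_z_test([attrs[v] for v in nodes], mu0=3.0)
    print(f"z = {z:.2f}, p = {p:.4g}, reject H0 at 0.05: {p < 0.05}")
```

Edge and path hypotheses would need a sampler that returns edges or paths rather than isolated nodes, which is the gap a hypothesis-aware sampler is meant to address.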