Cardinality Estimation of Subgraph Matching: A Filtering-Sampling Approach (2309.15433v2)
Abstract: Subgraph counting is a fundamental problem in understanding and analyzing graph-structured data, yet it is computationally challenging. This calls for an accurate and efficient algorithm for Subgraph Cardinality Estimation, the problem of estimating the number of all isomorphic embeddings of a query graph in a data graph. We present FaSTest, a novel algorithm that combines (1) a powerful filtering technique that significantly reduces the sample space, (2) an adaptive tree sampling algorithm for accurate and efficient estimation, and (3) a worst-case optimal stratified graph sampling algorithm for difficult instances. Extensive experiments on real-world datasets show that, in terms of accuracy, FaSTest outperforms state-of-the-art sampling-based methods by up to two orders of magnitude and GNN-based methods by up to three orders of magnitude.
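To make the quantity being estimated concrete, the following is a minimal brute-force sketch of exact embedding counting. It is not FaSTest and not taken from the paper: the function name `count_embeddings`, the edge-list representation, and the restriction to unlabeled, undirected graphs are assumptions made for illustration. An embedding is taken here to be an injective mapping of query vertices to data vertices that preserves every query edge. The exhaustive enumeration grows factorially with query size, which is the cost that estimation methods aim to avoid.

```python
from itertools import permutations


def count_embeddings(query_edges, data_edges):
    """Count injective vertex mappings that preserve every query edge.

    Illustrative assumption: graphs are undirected, unlabeled, and given
    as lists of (u, v) pairs; isolated vertices are not represented.
    """
    query_vertices = sorted({v for edge in query_edges for v in edge})
    data_vertices = sorted({v for edge in data_edges for v in edge})
    # Store data edges as unordered pairs for O(1) membership checks.
    data_adjacency = {frozenset(edge) for edge in data_edges}

    count = 0
    # Each permutation is one candidate injective mapping.
    for image in permutations(data_vertices, len(query_vertices)):
        mapping = dict(zip(query_vertices, image))
        if all(
            frozenset((mapping[u], mapping[v])) in data_adjacency
            for u, v in query_edges
        ):
            count += 1
    return count


# Hypothetical toy inputs: a triangle query in the complete graph K4.
triangle = [(0, 1), (1, 2), (0, 2)]
k4 = [(a, b) for a in range(4) for b in range(a + 1, 4)]
print(count_embeddings(triangle, k4))  # 24 = 4 * 3 * 2 ordered vertex triples
```

Each of the four triangles in K4 is counted six times, once per ordering of its vertices, because embeddings are counted as mappings rather than as distinct subgraphs.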