Color: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation (2405.06767v1)
Abstract: Graph workloads pose a particularly challenging problem for query optimizers. They typically feature large queries made up entirely of many-to-many joins with complex correlations. This puts significant stress on traditional cardinality estimation methods, which generally see catastrophic errors when estimating the size of queries with only a handful of joins. To overcome this, we propose COLOR, a framework for subgraph cardinality estimation which applies insights from graph compression theory to produce a compact summary that captures the global topology of the data graph. Further, we identify several key optimizations that enable tractable estimation over this summary even for large query graphs. We then evaluate several designs within this framework and find that they improve accuracy by up to $10^3$x over all competing methods while maintaining fast inference, a small memory footprint, efficient construction, and graceful degradation under updates.
- 2024. COLOR Tech Report & Repository. Technical Report. https://anonymous.4open.science/r/Cardinality-with-Colors-4333/README.md