Rapidash: Efficient Constraint Discovery via Rapid Verification (2309.12436v1)
Abstract: Denial Constraint (DC) is a well-established formalism that captures a wide range of integrity constraints commonly encountered, including candidate keys, functional dependencies, and ordering constraints, among others. Given their significance, there has been considerable research interest in achieving fast verification and discovery of exact DCs within the database community. Despite the significant advancements in the field, prior work exhibits notable limitations when confronted with large-scale datasets. The current state-of-the-art exact DC verification algorithm demonstrates a quadratic (worst-case) time complexity relative to the dataset's number of rows. In the context of DC discovery, existing methodologies rely on a two-step algorithm that commences with an expensive data structure-building phase, often requiring hours to complete even for datasets containing only a few million rows. Consequently, users are left without any insights into the DCs that hold on their dataset until this lengthy building phase concludes. In this paper, we introduce Rapidash, a comprehensive framework for DC verification and discovery. Our work makes a dual contribution. First, we establish a connection between orthogonal range search and DC verification. We introduce a novel exact DC verification algorithm that demonstrates near-linear time complexity, representing a theoretical improvement over prior work. Second, we propose an anytime DC discovery algorithm that leverages our novel verification algorithm to gradually provide DCs to users, eliminating the need for the time-intensive building phase observed in prior work. To validate the effectiveness of our algorithms, we conduct extensive evaluations on four large-scale production datasets. Our results reveal that our DC verification algorithm achieves up to 40 times faster performance compared to state-of-the-art approaches.
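To make the quadratic versus near-linear distinction concrete, here is a minimal sketch that checks one simple ordering-style DC in two ways. It is not the Rapidash algorithm: it only covers the hypothetical two-column constraint ¬(t.a < s.a ∧ t.b > s.b), where a single sort and a running maximum suffice. The abstract's general approach relies on orthogonal range search. The function names and the tuple layout `(a, b)` are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def violates_naive(rows):
    """Quadratic baseline: test every ordered pair (t, s) against the DC."""
    return any(t[0] < s[0] and t[1] > s[1] for t in rows for s in rows)


def violates_sorted(rows):
    """O(n log n) check of the same DC: one sort, then a single pass."""
    max_b_before = None  # largest b among rows with strictly smaller a
    ordered = sorted(rows, key=itemgetter(0))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)  # rows sharing the same a cannot violate each other
        # A row s violates the DC if some earlier row t has a larger b.
        if max_b_before is not None and min(b for _, b in group) < max_b_before:
            return True
        group_max = max(b for _, b in group)
        if max_b_before is None or group_max > max_b_before:
            max_b_before = group_max
    return False


# Hypothetical data: (income, tax). The DC says a higher income never pays less tax.
clean = [(1, 10), (2, 10), (2, 15), (3, 20)]
dirty = [(1, 10), (2, 5)]
print(violates_sorted(clean), violates_sorted(dirty))
```

Rows are grouped by equal `a` so that ties are never reported as violations of the strict predicate t.a < s.a. Both functions return the same answer on any input, but the sorted version avoids comparing all pairs of rows.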