WaZI: A Learned and Workload-aware Z-Index (2310.04268v3)
Abstract: Learned indexes fit ML models to the data and use them to make query operations more time- and space-efficient. Recent work proposes learned spatial indexes that improve spatial query performance by optimizing the storage layout or internal search structures according to the data distribution. However, only a few learned indexes exploit the query workload distribution to enhance their performance. In addition, building and updating learned spatial indexes is often costly on large datasets because (re)training ML models is inefficient. In this paper, we present WaZI, a learned and workload-aware variant of the Z-index that jointly optimizes the storage layout and search structures to address these challenges. Specifically, we first formulate a cost function that measures the performance of a Z-index on a dataset for a given range-query workload. We then optimize the Z-index structure by minimizing this cost function through adaptive partitioning and ordering during index construction. Moreover, we design a novel page-skipping mechanism that improves the query performance of WaZI by reducing access to irrelevant data pages. Our extensive experiments show that WaZI improves range query time by 40% on average over the baselines while always performing better than or comparably to state-of-the-art spatial indexes. It also maintains good point query performance. Overall, WaZI provides favorable tradeoffs among query latency, construction time, and index size.
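For readers unfamiliar with the base structure, the following is a minimal sketch of a conventional, non-learned Z-index over 2D integer points. It is not WaZI's adaptive partitioning, cost function, or page-skipping mechanism. The function names (`z_value`, `build_index`, `range_query`) and the 16-bit coordinate width are illustrative assumptions. The sketch relies on the standard property that every point inside a rectangle has a Z-value between those of the rectangle's lower-left and upper-right corners, so a range query scans that Z-interval and filters out false positives.

```python
from bisect import bisect_left, bisect_right


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


def build_index(points, bits: int = 16):
    """Return (z, point) pairs sorted by Z-value (hypothetical helper)."""
    return sorted((z_value(x, y, bits), (x, y)) for x, y in points)


def range_query(index, lo, hi, bits: int = 16):
    """Return points inside the rectangle [lo, hi] and the number scanned.

    The scan covers every entry whose Z-value lies between z(lo) and z(hi);
    entries outside the rectangle are discarded after being read.
    """
    keys = [z for z, _ in index]
    start = bisect_left(keys, z_value(lo[0], lo[1], bits))
    end = bisect_right(keys, z_value(hi[0], hi[1], bits))
    scanned = index[start:end]
    hits = [
        p for _, p in scanned
        if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]
    ]
    return hits, len(scanned)


if __name__ == "__main__":
    grid = [(x, y) for x in range(8) for y in range(8)]
    index = build_index(grid)
    hits, scanned = range_query(index, (1, 1), (2, 2))
    print(f"{len(hits)} results, {scanned} entries scanned")
```

On a full 8×8 grid, the query rectangle from (1, 1) to (2, 2) contains 4 points, yet the scan reads 10 entries. Gaps of this kind between entries read and entries returned are the sort of irrelevant access that the abstract's page-skipping mechanism is meant to reduce.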