How Good Are Multi-dimensional Learned Indices? An Experimental Survey (2405.05536v1)
Abstract: Efficient indexing is fundamental for multi-dimensional data management and analytics. An emerging trend is to directly learn the storage layout of multi-dimensional data with simple machine learning models, yielding the concept of the learned index. Compared with the conventional indices used for decades (e.g., kd-tree and R-tree variants), learned indices have been empirically shown to be both space- and time-efficient on modern architectures. However, a comprehensive evaluation of existing multi-dimensional learned indices under a unified benchmark is still lacking, which makes it difficult to choose a suitable index for specific data and queries and further hinders the deployment of learned indices in real application scenarios. In this paper, we present the first in-depth empirical study to answer the question of how good multi-dimensional learned indices are. Six recently published indices are evaluated under a unified experimental configuration including index implementation, datasets, query workloads, and evaluation metrics. We thoroughly investigate the evaluation results and discuss findings that may provide insights for future learned index design.