HTAP Databases: A Survey (2404.15670v1)

Published 24 Apr 2024 in cs.DB

Abstract: Since Gartner coined the term Hybrid Transactional and Analytical Processing (HTAP), numerous HTAP databases have been proposed to combine transactions with analytics in order to enable real-time data analytics for various data-intensive applications. HTAP databases typically process mixed workloads of transactions and analytical queries in a unified system by leveraging both a row store and a column store. Because different storage architectures and processing techniques serve the differing requirements of diverse applications, it is critical to summarize the pros and cons of these key techniques. This paper offers a comprehensive survey of HTAP databases. We mainly classify state-of-the-art HTAP databases according to four storage architectures: (a) Primary Row Store and In-Memory Column Store; (b) Distributed Row Store and Column Store Replica; (c) Primary Row Store and Distributed In-Memory Column Store; and (d) Primary Column Store and Delta Row Store. We then review the key techniques in HTAP databases, including hybrid workload processing, data organization, data synchronization, query optimization, and resource scheduling. We also discuss existing HTAP benchmarks. Finally, we outline research challenges and opportunities for HTAP techniques.
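
To make the row-store and column-store pairing mentioned in the abstract concrete, the following is a minimal Python sketch, not taken from the paper or from any surveyed system. The class `ToyHTAPStore` and its methods (`upsert`, `get`, `sync`, `column_sum`) are hypothetical names invented for illustration. The sketch assumes a single process with no concurrency control: writes go to a row store and a delta log, and an explicit synchronization step merges the delta into per-column arrays used by analytical scans.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHTAPStore:
    """Hypothetical toy model: row store for writes, column store for scans."""
    columns: tuple                                      # column names, at least one
    rows: dict = field(default_factory=dict)            # row store: key -> tuple of values
    delta: list = field(default_factory=list)           # writes not yet merged into columns
    column_store: dict = field(default_factory=dict)    # column name -> list of values
    positions: dict = field(default_factory=dict)       # key -> offset in the column lists

    def __post_init__(self):
        # Create one empty value list per column.
        for name in self.columns:
            self.column_store.setdefault(name, [])

    def upsert(self, key, **values):
        """Transactional path: update the row store and log the change."""
        row = tuple(values[name] for name in self.columns)
        self.rows[key] = row
        self.delta.append((key, row))

    def get(self, key):
        """Point lookup served from the row store."""
        return dict(zip(self.columns, self.rows[key]))

    def sync(self):
        """Data synchronization: merge logged changes into the column store."""
        for key, row in self.delta:
            if key in self.positions:
                # Existing key: overwrite its slot in every column.
                offset = self.positions[key]
                for name, value in zip(self.columns, row):
                    self.column_store[name][offset] = value
            else:
                # New key: append one value to every column.
                self.positions[key] = len(self.column_store[self.columns[0]])
                for name, value in zip(self.columns, row):
                    self.column_store[name].append(value)
        self.delta.clear()

    def column_sum(self, name, fresh=True):
        """Analytical path: scan one column; sync first if fresh data is required."""
        if fresh:
            self.sync()
        return sum(self.column_store[name])
```

Calling `column_sum(name, fresh=False)` skips the merge and may return a stale result, which is one simple way to see the trade-off between data freshness and synchronization cost. Real HTAP systems handle this with far more machinery (versioning, replication, scheduling) than this sketch attempts.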
