vLSM: Low tail latency and I/O amplification in LSM-based KV stores (2407.15581v1)
Abstract: LSM-based key-value (KV) stores are an important component in modern data infrastructures. However, they suffer from high tail latency, on the order of several seconds, making them less attractive for user-facing applications. In this paper, we introduce the notion of compaction chains and analyse how they affect tail latency. We then show that modern designs reduce tail latency either by trading off I/O amplification or by requiring large amounts of memory. Based on our analysis, we present vLSM, a new KV store design that improves tail latency significantly without compromising on memory or I/O amplification. vLSM reduces (a) compaction chain width, by using small SSTs and eliminating the tiering compaction required in L0 by modern systems, and (b) compaction chain length, by using a larger than typical growth factor between L1 and L2 and introducing overlap-aware SSTs in L1. We implement vLSM in RocksDB and evaluate it using db_bench and YCSB. Our evaluation highlights the underlying trade-off among memory requirements, I/O amplification, and tail latency, as well as the advantage of vLSM over current approaches. At the same memory budget, vLSM improves P99 tail latency by up to 4.8x for writes and by up to 12.5x for reads, and reduces cumulative write stalls by up to 60%, while also slightly improving I/O amplification.
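As a rough illustration of point (b), the sketch below counts how many levels a leveled LSM tree needs to hold a dataset when the growth factor between L1 and L2 differs from the factor used between deeper levels. It is a simplification under stated assumptions, not the paper's model: the function name `levels_needed`, the sizes, and the growth factors are all hypothetical, and the number of levels is used only as a stand-in for how long a chain of dependent compactions could become.

```python
def levels_needed(dataset_bytes, l1_bytes, first_growth, growth):
    """Return the index of the deepest level needed to hold dataset_bytes.

    Assumptions (hypothetical, for illustration only):
    - L1 has capacity l1_bytes,
    - L2 is first_growth times larger than L1,
    - every deeper level is `growth` times larger than the level above it,
    - the dataset fits once the deepest level's capacity reaches its size.
    """
    level = 1
    capacity = l1_bytes
    while capacity < dataset_bytes:
        # Only the L1 -> L2 step uses the special growth factor.
        factor = first_growth if level == 1 else growth
        capacity *= factor
        level += 1
    return level


DATASET = 2**40   # 1 TiB, hypothetical
L1_SIZE = 2**28   # 256 MiB, hypothetical

# Uniform growth factor of 10 between all levels.
print(levels_needed(DATASET, L1_SIZE, first_growth=10, growth=10))   # 5

# Larger growth factor (100, hypothetical) between L1 and L2 only.
print(levels_needed(DATASET, L1_SIZE, first_growth=100, growth=10))  # 4
```

With these example numbers, enlarging only the L1-to-L2 step removes one level from the tree, so a write that triggers cascading compactions has one fewer level to pass through. The actual growth factors and the cost of the larger L1-to-L2 step are what the paper itself analyses.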