
Lightweight Frequency-Based Tiering for CXL Memory Systems (2312.04789v1)

Published 8 Dec 2023 in cs.DC and cs.OS

Abstract: Modern workloads demand increasingly large memory capacity. Compute Express Link (CXL)-based memory tiering has emerged as a promising solution for addressing this trend by utilizing traditional DRAM alongside slow-tier CXL-memory devices in the same system. Unfortunately, most prior tiering systems are recency-based and cannot accurately identify hot and cold pages, since a recently accessed page is not necessarily a hot page. On the other hand, more accurate frequency-based systems suffer from high memory and runtime overhead as a result of tracking large memories. In this paper, we propose FreqTier, a fast and accurate frequency-based tiering system for CXL memory. We observe that memory tiering systems can tolerate a small amount of tracking inaccuracy without compromising overall application performance. Based on this observation, FreqTier probabilistically tracks the access frequency of each page, enabling accurate identification of hot and cold pages while maintaining minimal memory overhead. Finally, FreqTier adjusts the intensity of tiering operations based on the application's memory access behavior, thereby significantly reducing the amount of migration traffic and application interference. We evaluate FreqTier on two emulated CXL memory devices with different bandwidths. On the high-bandwidth CXL device, FreqTier can outperform state-of-the-art tiering systems while using 4$\times$ less local DRAM for in-memory caching workloads. On GAP graph analytics and XGBoost workloads with a 1:32 local DRAM to CXL-memory ratio, FreqTier outperforms prior work by 1.04$\times$ to 2.04$\times$ (1.39$\times$ on average). Even on the low-bandwidth CXL device, FreqTier outperforms AutoNUMA by 1.14$\times$ on average.
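
The abstract does not say which data structure FreqTier uses for probabilistic frequency tracking. As a rough illustration of the general idea only, the sketch below assumes a counting-Bloom-filter-style array of small saturating counters shared by all pages. The class name `ApproxPageCounter`, its parameters, the hash choice, and the halving `decay` step are all hypothetical and are not taken from the paper.

```python
import hashlib


class ApproxPageCounter:
    """Illustrative frequency sketch; NOT the FreqTier implementation."""

    def __init__(self, num_counters=1024, num_hashes=4, max_count=15):
        # Fixed-size counter array: memory use does not grow with page count.
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes
        self.max_count = max_count  # assumed saturating 4-bit counters

    def _slots(self, page):
        # Derive num_hashes counter indices for a page number.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{i}:{page}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "little") % len(self.counters)

    def record_access(self, page):
        # Increment every slot for this page, saturating at max_count.
        for slot in self._slots(page):
            if self.counters[slot] < self.max_count:
                self.counters[slot] += 1

    def estimate(self, page):
        # Collisions can only inflate a slot, so the minimum over the
        # page's slots never undercounts (below the saturation limit).
        return min(self.counters[slot] for slot in self._slots(page))

    def is_hot(self, page, threshold):
        # Hypothetical hot/cold rule: compare the estimate to a threshold.
        return self.estimate(page) >= threshold

    def decay(self):
        # Halve all counters so that old accesses fade over time.
        self.counters = [c >> 1 for c in self.counters]


tracker = ApproxPageCounter()
for _ in range(5):
    tracker.record_access(42)
print(tracker.estimate(42), tracker.is_hot(42, threshold=4))
```

The trade-off this sketch shows is the one the abstract describes: hash collisions can make a page look hotter than it is, but the memory cost is a fixed number of small counters rather than one exact counter per page.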

Authors (4)
  1. Kevin Song (9 papers)
  2. Jiacheng Yang (11 papers)
  3. Sihang Liu (14 papers)
  4. Gennady Pekhimenko (52 papers)