
NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering (2403.18702v2)

Published 27 Mar 2024 in cs.AR

Abstract: The Compute Express Link (CXL) interconnect makes it feasible to integrate diverse types of memory into servers via its byte-addressable SerDes links. Because these memory types differ in access latency, harnessing the full potential of CXL-based heterogeneous memory systems requires efficient memory tiering. However, prior work has made little fundamental progress, owing to low-resolution and high-overhead memory access profiling techniques. To address this critical challenge, we propose a novel memory tiering solution called NeoMem, which features a hardware/software co-design. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf. NeoProf monitors memory accesses and provides the OS with crucial page hotness statistics and other useful system state information. On the OS kernel side, we design a revamped memory-tiering strategy that enables accurate and timely hot page promotion based on NeoProf statistics. We implement NeoMem on a real FPGA-based CXL memory platform and Linux kernel v6.3. Comprehensive evaluations demonstrate that NeoMem achieves 32% to 67% geomean speedup over several existing memory tiering solutions.
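
To make the idea of device-side hotness profiling concrete, here is a minimal software sketch of how per-page access counts could be approximated in bounded memory and then thresholded to pick promotion candidates. It is illustrative only: the abstract does not describe NeoProf's internal data structure, so the use of a count-min sketch, the 4 KiB page size, and every name here (`HotPageSketch`, `hot_pages`, `PAGE_SHIFT`, `width`, `depth`, `threshold`) are assumptions, not the authors' design.

```python
import hashlib
from typing import Iterable, List

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class HotPageSketch:
    """Count-min sketch over page frame numbers (hypothetical, for illustration)."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, pfn: int) -> int:
        # Derive an independent hash per row by salting with the row number.
        digest = hashlib.blake2b(
            pfn.to_bytes(8, "little"), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def record(self, addr: int) -> None:
        """Count one access to the page containing physical address `addr`."""
        pfn = addr >> PAGE_SHIFT
        for row in range(self.depth):
            self.rows[row][self._index(row, pfn)] += 1

    def estimate(self, pfn: int) -> int:
        """Return an access-count estimate; it never undercounts."""
        return min(
            self.rows[row][self._index(row, pfn)] for row in range(self.depth)
        )


def hot_pages(sketch: HotPageSketch, candidates: Iterable[int], threshold: int) -> List[int]:
    """Return the candidate page frame numbers whose estimate reaches `threshold`."""
    return [pfn for pfn in candidates if sketch.estimate(pfn) >= threshold]


sketch = HotPageSketch()
for _ in range(100):
    sketch.record(0x5000)  # frequently accessed page (pfn 5)
for _ in range(2):
    sketch.record(0x9000)  # rarely accessed page (pfn 9)
hot = hot_pages(sketch, [5, 9], threshold=50)
```

In this sketch the counters play the role of the statistics a profiler would expose, and `hot_pages` stands in for the OS-side decision of which pages to promote to the faster tier. The memory cost is fixed at `width * depth` counters regardless of how many pages are tracked, at the price of occasional overestimates from hash collisions.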

Authors (11)
  1. Zhe Zhou (33 papers)
  2. Yiqi Chen (17 papers)
  3. Tao Zhang (481 papers)
  4. Yang Wang (670 papers)
  5. Ran Shu (7 papers)
  6. Shuotao Xu (3 papers)
  7. Peng Cheng (229 papers)
  8. Lei Qu (12 papers)
  9. Yongqiang Xiong (10 papers)
  10. Guangyu Sun (47 papers)
  11. Jie Zhang (846 papers)