Trimma: Trimming Metadata Storage and Latency for Hybrid Memory Systems (2402.16343v2)
Abstract: Hybrid main memory systems combine the performance and capacity advantages of heterogeneous memory technologies. With larger capacities, higher associativities, and finer granularities, hybrid memory systems currently incur significant metadata storage and lookup overheads to flexibly remap data blocks between the two memory tiers. To alleviate these inefficiencies, we propose Trimma, which combines a multi-level metadata structure with an efficient metadata cache design. Trimma uses a multi-level metadata table to track only the address remap entries that are truly necessary. The saved memory space is used as extra DRAM cache capacity to improve performance. Trimma also stores entries with non-identity and identity address mappings in separate formats, which improves the overall remap cache hit rate and further boosts performance. Trimma is transparent to software and compatible with various types of hybrid memory systems. When evaluated on a representative hybrid memory system with HBM3 and DDR5, Trimma achieves up to 1.68$\times$ and on average 1.33$\times$ speedup compared to state-of-the-art hybrid memory designs. These results show that Trimma effectively addresses metadata management overheads, especially for future large-scale hybrid memory architectures.
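The idea of tracking only the necessary remap entries can be illustrated with a small software sketch. The Python below is a toy model, not Trimma's hardware design: the class name `MultiLevelRemapTable`, the `leaf_bits` parameter, and the two-level dictionary layout are all assumptions made for illustration. It shows only the general principle that blocks left at their original location (identity mappings) need no stored entry, and that a leaf table can be allocated lazily and freed once it becomes empty.

```python
from typing import Dict, Tuple


class MultiLevelRemapTable:
    """Toy two-level remap table (hypothetical, for illustration only).

    Only non-identity mappings consume storage; any block without an
    entry is assumed to map to itself.
    """

    def __init__(self, leaf_bits: int = 4) -> None:
        self.leaf_bits = leaf_bits
        self.leaf_mask = (1 << leaf_bits) - 1
        # Top level: upper index -> lazily allocated leaf table.
        self.root: Dict[int, Dict[int, int]] = {}

    def _split(self, block: int) -> Tuple[int, int]:
        # Upper bits select the leaf table, lower bits select the entry.
        return block >> self.leaf_bits, block & self.leaf_mask

    def remap(self, block: int, target: int) -> None:
        """Record that `block` now lives at `target`."""
        if target == block:
            # An identity mapping needs no entry at all.
            self.clear(block)
            return
        hi, lo = self._split(block)
        self.root.setdefault(hi, {})[lo] = target

    def clear(self, block: int) -> None:
        """Drop the entry for `block`, freeing the leaf if it becomes empty."""
        hi, lo = self._split(block)
        leaf = self.root.get(hi)
        if leaf is None:
            return
        leaf.pop(lo, None)
        if not leaf:
            del self.root[hi]

    def lookup(self, block: int) -> int:
        """Return the current location of `block` (itself if unmapped)."""
        hi, lo = self._split(block)
        leaf = self.root.get(hi)
        if leaf is None:
            return block
        return leaf.get(lo, block)

    def stored_entries(self) -> int:
        """Count the entries that actually occupy storage."""
        return sum(len(leaf) for leaf in self.root.values())


table = MultiLevelRemapTable(leaf_bits=4)
table.remap(0x123, 0x7)
print(table.lookup(0x123))     # remapped block
print(table.lookup(0x456))     # untouched block falls back to identity
print(table.stored_entries())  # only the non-identity mapping is stored
```

In this sketch, storage grows with the number of remapped blocks rather than with total memory capacity, which is the property the abstract attributes to the multi-level metadata table. How Trimma encodes the levels and caches entries in hardware is not specified here.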
Authors: Yiwei Li, Boyu Tian, Mingyu Gao