Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory (2403.09358v1)

Published 14 Mar 2024 in cs.AR

Abstract: We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache needs to be carefully designed to address the latency and BW limitations of the SCM while minimizing cost overhead and considering GPU's characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probes and increase effective DRAM BW with minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. The AMIL also retains the full ECC protection, unlike prior DRAM cache's Tag-And-Data (TAD) organization. Additionally, we propose SCM throttling to curtail power and exploiting SCM's SLC/MLC modes to adapt to workload's memory footprint. While our techniques can be used for different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (156)
  1. “Graph500 Benchmark specification.” [Online]. Available: https://graph500.org/?page_id=12
  2. “Nvidia tensor cores.” [Online]. Available: https://www.nvidia.com/en-us/data-center/tensor-cores
  3. “High bandwidth memory (hbm) dram,” JEDEC Standard, 2013.
  4. “Nvidia tesla p100,” NVIDIA whitepaper, 2016.
  5. “Nvidia tesla v100 gpu architecture,” NVIDIA whitepaper, 2017.
  6. “Nvidia nvswitch: The world’s highest-bandwidth on-node switch,” NVIDIA Whitepaper, 2018.
  7. “Introducing amd cdna architecture,” AMD whitepaper, 2020.
  8. “Nvidia a100 tensor core gpu architecture,” NVIDIA Whitepaper, 2020.
  9. “Nvidia grace hopper superchip architecture,” 2020. [Online]. Available: https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper
  10. “Compute express link specification 3.0,” CXL Consortium, 2022.
  11. “Nvidia h100 tensor core gpu architecture,” NVIDIA whitepaper, 2022.
  12. “High bandwidth memory (hbm2e) interface intel agilex® 7 m-series fpga ip user guide,” July 2023. [Online]. Available: https://cdrdv2-public.intel.com/781867/ug-773264-781867.pdf
  13. N. Agarwal and T. F. Wenisch, “Thermostat: Application-transparent page management for two-tiered main memory,” in Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
  14. T. Allen and R. Ge, “Demystifying gpu uvm cost with deep runtime and workload analysis,” in Proceedings of the 33rd International Parallel and Distributed Processing Symposium (IPDPS), 2021.
  15. T. Allen and R. Ge, “In-depth analyses of unified virtual memory system for gpu accelerated computing,” in Proceedings of the 34th International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
  16. AMD, “Amd instinct™ mi100 accelerator.” [Online]. Available: https://www.amd.com/en/products/server-accelerators/instinct-mi100
  17. AMD, “Amd instinct™ mi250x accelerator.” [Online]. Available: https://www.amd.com/en/products/server-accelerators/instinct-mi250x
  18. AMD, “Amd instinct™ mi300x accelerator.” [Online]. Available: https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html
  19. AMD, “Amd radeon instinct™ mi25 accelerator.” [Online]. Available: https://www.amd.com/ko/products/professional-graphics/instinct-mi25
  20. V.-G. Anghel, “Exploring current immersion cooling deployments,” March 2023. [Online]. Available: https://www.datacenterdynamics.com/en/analysis/exploring-current-immersion-cooling-deployments/
  21. A. Azad, M. M. Aznaveh, S. Beamer, M. Blanco, J. Chen, L. D’Alessandro, R. Dathathri, T. Davis, K. Deweese, J. Firoz, H. A. Gabb, G. Gill, B. Hegyi, S. Kolodziej, T. M. Low, A. Lumsdaine, T. Manlaibaatar, T. G. Mattson, S. McMillan, R. Peri, K. Pingali, U. Sridhar, G. Szarnyas, Y. Zhang, and Y. Zhang, “Evaluation of graph analytics frameworks using the gap benchmark suite,” in 2020 IEEE International Symposium on Workload Characterization (IISWC), 2020, pp. 216–227.
  22. A. Bakhoda, J. Kim, and T. M. Aamodt, “Throughput-effective on-chip networks for manycore accelerators,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.
  23. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing cuda workloads using a detailed gpu simulator,” in Proceedings of the 2nd International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
  24. R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” in Proceedings of the 14th Transactions on Architecture and Code Optimization (TACO), 2017.
  25. P. Behnam and M. N. Bojnordi, “Redcache: Reduced dram caching,” in Proceedings of the 57th Design Automation Conference (DAC), 2020.
  26. G. Boeing, “Osmnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks,” Computers, Environment and Urban Systems, vol. 65, pp. 126–139, 2017.
  27. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS), 2020.
  28. N. Chatterjee, M. O’Connor, D. Lee, D. R. Johnson, S. W. Keckler, M. Rhu, and W. J. Dally, “Architecting an energy-efficient dram system for gpus,” in Proceedings of the 23rd International Symposium on High Performance Computer Architecture (HPCA), 2017.
  29. S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia: Understanding irregular gpgpu graph applications,” in Proceedings of the 16th International Symposium on Workload Characterization (IISWC), 2013.
  30. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Proceedings of the 12th International Symposium on Workload Characterization (IISWC), 2009.
  31. X. Chen, R. Dathathri, G. Gill, and K. Pingali, “Pangolin: An efficient and flexible graph mining system on cpu and gpu,” Proc. VLDB Endow., vol. 13, no. 8, p. 1190–1205, apr 2020.
  32. W.-C. Chien, C.-W. Yeh, R. L. Bruce, H.-Y. Cheng, I. T. Kuo, C.-H. Yang, A. Ray, H. Miyazoe, W. Kim, F. Carta, E.-K. Lai, M. J. BrightSky, and H.-L. Lung, “A study on ots-pcm pillar cell for 3-d stackable memory,” IEEE Transactions on Electron Devices, vol. 65, no. 11, pp. 5172–5179, 2018.
  33. J. Choe, “Intel’s 2nd generation xpoint memory - will it be worth the long wait ahead?” 2021. [Online]. Available: https://www.techinsights.com/blog/memory/intels-2nd-generation-xpoint-memory
  34. C. C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the 47th International Symposium on Microarchitecture (MICRO), 2014.
  35. C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), 2015.
  36. E. Choukse, M. B. Sullivan, M. O’Connor, M. Erez, J. Pool, D. Nellans, and S. W. Keckler, “Buddy compression: Enabling larger memory for deep learning and hpc workloads on gpus,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020.
  37. B. Dally, “GTC china 2020 keynote,” NVIDIA GPU Technology Conference, 2020. [Online]. Available: https://s201.q4cdn.com/141608511/files/doc_presentations/2020/12/GTC-China_2020_FINAL-(with-FLS).pdf
  38. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 19th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019.
  39. Z. Duan, J. Yao, H. Liu, X. Liao, H. Jin, and Y. Zhang, “Revisiting log-structured merging for kv stores in hybrid memory systems,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023.
  40. A. Eisenman, D. Gardner, I. AbdelRahman, J. Axboe, S. Dong, K. Hazelwood, C. Petersen, A. Cidon, and S. Katti, “Reducing dram footprint with nvm in facebook,” in Proceedings of the Thirteenth EuroSys Conference (EuroSys), 2018.
  41. A. Fazio, “Advanced technology and systems of cross point memory,” in Proceedings of the 65th International Electron Devices Meeting (IEDM), 2020.
  42. D. Foley and J. Danskin, “Ultra-performance pascal gpu and nvlink interconnect,” IEEE Micro, vol. 37, no. 2, pp. 7–17, 2017.
  43. S. W. Fong, C. M. Neumann, and H.-S. P. Wong, “Phase-change memory—towards a storage-class memory,” IEEE Transactions on Electron Devices, vol. 64, no. 11, pp. 4374–4385, 2017.
  44. S. Franey and M. Lipasti, “Tag tables,” in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), 2015.
  45. D. Ganguly, “Uvm smart,” 2019. [Online]. Available: https://github.com/DebashisGanguly/gpgpu-sim_UVMSmart
  46. D. Ganguly, Z. Zhang, J. Yang, and R. Melhem, “Interplay between hardware prefetcher and page eviction policy in cpu-gpu unified virtual memory,” in Proceedings of the 46th International Symposium on Computer Architecture (ISCA), 2019.
  47. D. Ganguly, Z. Zhang, J. Yang, and R. Melhem, “Adaptive page migration for irregular data-intensive applications under gpu memory oversubscription,” in Proceedings of the 32nd International Parallel and Distributed Processing Symposium (IPDPS), 2020.
  48. P. Gera, H. Kim, P. Sao, H. Kim, and D. Bader, “Traversing large graphs on gpus with unified memory,” Proc. VLDB Endow., vol. 13, no. 7, p. 1119–1133, mar 2020.
  49. A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer, “Ai and memory wall,” 2021. [Online]. Available: https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8
  50. B. Gopireddy and J. Torrellas, “Designing vertical processors in monolithic 3d,” in Proceedings of the 46th International Symposium on Computer Architecture (ISCA), 2019.
  51. S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, “Auto-tuning a high-level language targeted to gpu codes,” in Proceedings of the 1st Innovative Parallel Computing (InPar), 2012.
  52. Y. Gu, W. Wu, Y. Li, and L. Chen, “Uvmbench: A comprehensive benchmark suite for researching unified virtual memory in gpus,” in International Conference on Scientific Computing, 2021.
  53. T. Haruta, T. Nakajima, J. Hashizume, T. Umebayashi, H. Takahashi, K. Taniguchi, M. Kuroda, H. Sumihiro, K. Enoki, T. Yamasaki, K. Ikezawa, A. Kitahara, M. Zen, M. Oyama, H. Koga, H. Tsugawa, T. Ogita, T. Nagano, S. Takano, and T. Nomoto, “4.6 a 1/2.3inch 20mpixel 3-layer stacked cmos image sensor with dram,” in Proceedings of the 62nd International Solid-State Circuits Conference (ISSCC), 2017.
  54. A. Hay, K. Strauss, T. Sherwood, G. H. Loh, and D. Burger, “Preventing pcm banks from seizing too much power,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44.   Association for Computing Machinery, 2011, p. 186–195.
  55. M. Hildebrand, J. T. Angeles, J. Lowe-Power, and V. Akella, “A case against hardware managed dram caches for nvram based systems,” in Proceedings of the 14th International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021.
  56. S. Hong, H. Choi, J. Park, Y. Bae, K. Kim, W. Lee, S. Lee, H. Lee, S. Cho, J. Ahn, S. Kim, T. Kim, M. Na, and S. Cha, “Extremely high performance, high density 20nm self-selecting cross-point memory for compute express link,” in 2022 International Electron Devices Meeting (IEDM), 2022.
  57. C.-C. Huang, R. Kumar, M. Elver, B. Grot, and V. Nagarajan, “C3d: Mitigating the numa bottleneck via coherent dram caches,” in Proceedings of the 49th International Symposium on Microarchitecture (MICRO), 2016.
  58. C.-C. Huang and V. Nagarajan, “Atcache: Reducing dram cache latency via a small sram tag cache,” in Proceedings of the 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), 2014.
  59. T.-H. Hung, Y.-M. Pan, and K.-N. Chen, “Stress issue of vertical connections in 3d integration for high-bandwidth memory applications,” Memories - Materials, Devices, Circuits and Systems, vol. 4, p. 100024, 2023.
  60. D. Ielmini and S. Ambrogio, “Emerging neuromorphic devices,” Nanotechnology, vol. 31, no. 9, p. 092001, dec 2019.
  61. J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor et al., “Basic performance measurements of the intel optane dc persistent memory module,” arXiv preprint arXiv:1903.05714, 2019.
  62. H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless dram caches,” in Proceedings of the 22nd International Symposium on High Performance Computer Architecture (HPCA), 2016.
  63. V. Jatala, R. Dathathri, G. Gill, L. Hoang, V. K. Nandivada, and K. Pingali, “A study of graph analytics for massive datasets on distributed multi-gpus,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 84–94.
  64. D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Proceedings of the 47th International Symposium on Microarchitecture (MICRO), 2014.
  65. D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013.
  66. X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, Y. Solihin, and R. Balasubramonian, “Chop: Adaptive filter-based dram caching for cmp server platforms,” in Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA), 2010.
  67. H. Jun, J. Cho, K. Lee, H.-Y. Son, K. Kim, H. Jin, and K. Kim, “Hbm (high bandwidth memory) dram technology and architecture,” in 2017 IEEE International Memory Workshop (IMW), 2017.
  68. H. Jun, J. Cho, K. Lee, H.-Y. Son, K. Kim, H. Jin, and K. Kim, “Hbm (high bandwidth memory) dram technology and architecture,” in Proceedings of the 9th International Memory Workshop (IMW), 2017.
  69. V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “Accelwattch: A power modeling framework for modern gpus,” in Proceedings of the 54th International Symposium on Microarchitecture (MICRO), 2021.
  70. S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, “Heteroos: Os design for heterogeneous memory management in datacenter,” in Proceedings of the 44th International Symposium on Computer Architecture (ISCA), 2017.
  71. F. Kaplan, C. De Vivero, S. Howes, M. Arora, H. Homayoun, W. Burleson, D. Tullsen, and A. K. Coskun, “Modeling and analysis of phase change materials for efficient thermal management,” in Proceedings of the 32nd International Conference on Computer Design (ICCD), 2014.
  72. M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in Proceedings of the 47th International Symposium on Computer Architecture (ISCA), 2020.
  73. H. Kim, J. Sim, P. Gera, R. Hadidi, and H. Kim, “Batch-aware unified memory management in gpus for irregular workloads,” in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020.
  74. T. Kim, H. Choi, M. Kim, J. Yi, D. Kim, S. Cho, H. Lee, C. Hwang, E.-R. Hwang, J. Song, S. Chae, Y. Chun, and J.-K. Kim, “High-performance, cost-effective 2z nm two-deck cross-point memory integrated by self-align scheme for 128 gb scm,” in Proceedings of the 63rd International Electron Devices Meeting (IEDM), 2018.
  75. W. Kim, M. BrightSky, T. Masuda, N. Sosa, S. Kim, R. Bruce, F. Carta, G. Fraczak, H. Y. Cheng, A. Ray, Y. Zhu, H. L. Lung, K. Suu, and C. Lam, “Ald-based confined pcm with a metallic liner toward unlimited endurance,” in 2016 IEEE International Electron Devices Meeting (IEDM), 2016.
  76. W. Kim, R. Bruce, T. Masuda, G. Fraczak, N. Gong, P. Adusumilli, S. Ambrogio, H. Tsai, J. Bruley, J.-P. Han, M. Longstreet, F. Carta, K. Suu, and M. BrightSky, “Confined pcm-based analog synaptic devices offering low resistance-drift and 1000 programmable states for deep learning,” in 2019 Symposium on VLSI Technology, 2019.
  77. Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible dram simulator,” IEEE Computer architecture letters, vol. 15, no. 1, pp. 45–49, 2015.
  78. Y. Kim, H. Kim, and W. J. Song, “Nomad: Enabling non-blocking os-managed dram cache via tag-data decoupling,” in Proceedings of the 29th International Symposium on High Performance Computer Architecture (HPCA), 2023.
  79. Y. Kim, J. Lee, J.-E. Jo, and J. Kim, “Gpudmm: A high-performance and memory-oblivious gpu architecture using dynamic memory management,” in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.
  80. Y. Ko, H. Kim, and H. Han, “Escalating memory accesses to shared memory by profiling reuse,” in Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication (IMCOM), 2016.
  81. D. Kwon, S. Lee, K. Kim, S. Oh, J. Park, G.-M. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, V. Kornijcuk, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, B. An, J. Lee, D. Ko, Y. Jun, I. Kim, C. Song, I. Kim, C. Park, S. Kim, C. Jeong, E. Lim, D. Kim, J. Jang, I. Park, J. Chun, and J. Cho, “A 1ynm 1.25v 8gb 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep learning application,” IEEE Journal of Solid-State Circuits, vol. 58, no. 1, pp. 291–302, 2023.
  82. D. Kwon, H. S. Jeong, J. Choi, W. Kim, J. W. Kim, J. Yoon, J. Choi, S. Lee, H. N. Rie, J.-i. Lee, J. Lee, T. Jang, J. Kim, S. Kang, J. Shin, Y. Loh, C. Y. Lee, J. Woo, H. Yu, C. Bae, R. Oh, Y.-s. Sohn, C. Yoo, and J. Lee, “28.7 a 1.1v 6.4gb/s/pin 24-gb ddr5 sdram with a highly-accurate duty corrector and nbti-tolerant dll,” in 2023 IEEE International Solid- State Circuits Conference (ISSCC), 2023.
  83. B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable dram alternative,” in Proceedings of the 36th international symposium on Computer architecture (ISCA), 2009.
  84. H. Lee, H. Kim, S. Shim, S. Lee, D. Hong, H.-J. Lee, and H. Kim, “Pcmcsim: An accurate phase-change memory controller simulator and its performance analysis,” in Proceedings of the 15th International Symposium on Performance Analysis of Systems and Software (ISPASS).
  85. N. Lee, “Expanding the boundaries of ai revolution: An in-depth study of hbm (presented by sk hynix),” NVIDIA GPU Technology Conference, 2018. [Online]. Available: https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8949/
  86. S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware architecture and software stack for pim based on commercial dram technology : Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021.
  87. S. Lee, K. Lee, M. Sung, M. Alian, C. Kim, W. Cho, R. Oh, S. O, J. H. Ahn, and N. S. Kim, “3d-xpath: High-density managed dram architecture with cost-effective alternative paths for memory transactions,” in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2018.
  88. Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless dram cache,” in Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), 2015.
  89. Y. S. Lee, K. M. Kim, J. H. Lee, J. H. Choi, and S. W. Chung, “A high-performance processing-in-memory accelerator for inline data deduplication,” in Proceedings of the 37th International Conference on Computer Design (ICCD), 2019.
  90. C. Li, R. Ausavarungnirun, C. J. Rossbach, Y. Zhang, O. Mutlu, Y. Guo, and J. Yang, “A framework for memory oversubscription management in graphics processing units,” in Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
  91. Y. Li and M. Gao, “Baryon: Efficient hybrid memory management with compression and sub-blocking,” in Proceedings of the 29th International Symposium on High Performance Computer Architecture (HPCA), 2023.
  92. Y. Li, A. Phanishayee, D. Murray, J. Tarnawski, and N. S. Kim, “Harmony: Overcoming the hurdles of gpu memory capacity to train massive dnn models on commodity servers,” Proc. VLDB Endow., vol. 15, no. 11, p. 2747–2760, jul 2022.
  93. J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An experimental study of data retention behavior in modern dram devices: Implications for retention time profiling mechanisms,” in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), 2013.
  94. L. Liu, S. Yang, L. Peng, and X. Li, “Hierarchical hybrid memory management in os for tiered memory systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 10, pp. 2223–2236, 2019.
  95. G. Loh and M. D. Hill, “Supporting very large dram caches with compound-access scheduling and missmap,” IEEE Micro, vol. 32, no. 3, pp. 70–78, 2012.
  96. G. H. Loh, N. E. Jerger, A. Kannan, and Y. Eckert, “Interconnect-memory challenges for multi-chip, silicon interposer systems,” in Proceedings of the 1st International Symposium on Memory Systems (MEMSYS), 2015.
  97. T. Lu, C. Serafy, Z. Yang, S. K. Samal, S. K. Lim, and A. Srivastava, “Tsv-based 3-d ics: Design methods and tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 10, pp. 1593–1619, 2017.
  98. S. Mach, F. Schuiki, F. Zaruba, and L. Benini, “Fpnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 4, pp. 774–787, 2020.
  99. J. Macri, “Amd’s next generation gpu and high bandwidth memory architecture: Fury,” in 2015 IEEE Hot Chips 27 Symposium (HCS), 2015.
  100. J. Meng, K. Kawakami, and A. K. Coskun, “Optimizing energy efficiency of 3-d multicore systems with stacked dram under power and thermal constraints,” in Proceedings of the 49th Design Automation Conference (DAC), 2012.
  101. M. R. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. H. Loh, “Heterogeneous memory architectures: A hw/sw approach for mixing die-stacked and off-package memories,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015.
  102. J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” IEEE Computer Architecture Letters, vol. 11, no. 2, pp. 61–64, 2012.
  103. P. Micikevicius, “Multi-gpu programming,” NVIDIA GPU Technology Conference, 2012.
  104. T. P. Morgan, “The era of big memory is upon us,” September 2020. [Online]. Available: https://www.nextplatform.com/2020/09/23/the-era-of-big-memory-is-upon-us/
  105. T. P. Morgan, “THE THIRD TIME CHARM OF AMD’S INSTINCT GPU,” June 2023. [Online]. Available: https://www.nextplatform.com/2023/06/14/the-third-time-charm-of-amds-instinct-gpu/
  106. L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, “Graphbig: Understanding graph computing in the context of industrial solutions,” in Proceedings of the 28th International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.
  107. D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
  108. M. O’Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, “Fine-grained dram: Energy-efficient dram for extreme bandwidth systems,” in the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017.
  109. Y. Pan, Y. Wang, Y. Wu, C. Yang, and J. D. Owens, “Multi-gpu graph analytics,” in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
  110. S. Pandey, A. K. Kamath, and A. Basu, “Gpm: Leveraging persistent memory from a gpu,” in Proceedings of the 27th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022.
  111. M. K. Qureshi, M. M. Franceschini, L. A. Lastras-Montaño, and J. P. Karidis, “Morphable memory system: A robust architecture for exploiting multi-level phase change memories,” in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), 2010.
  112. M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Set-dueling-controlled adaptive insertion for high-performance caching,” IEEE Micro, vol. 28, no. 1, pp. 91–98, 2008.
  113. M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in Proceedings of the 45th International Symposium on Microarchitecture (MICRO), 2012.
  114. M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the 36th International Symposium on Computer Architecture (ISCA), 2009.
  115. Z. Qureshi, V. S. Mailthody, S. W. Min, I.-H. Chung, J. Xiong, and W. mei Hwu, “Tearing down the memory wall,” Semiconductor Research Corporation TechCon, 2020.
  116. P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 16th Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  117. M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “Vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
  118. M. Rhu, M. Sullivan, J. Leng, and M. Erez, “A locality-aware memory hierarchy for energy-efficient gpu architectures,” in Proceedings of the 46th International Symposium on Microarchitecture (MICRO), 2013.
  119. J. Roach, “To cool datacenter servers, microsoft turns to boiling liquid,” April 2021. [Online]. Available: https://news.microsoft.com/source/features/innovation/datacenter-liquid-cooling/
  120. N. Sakharnykh, “Everything you need to know about unified memory,” NVIDIA GPU Technology Conference, 2018.
  121. C. Shao, J. Guo, P. Wang, J. Wang, C. Li, and M. Guo, “Oversubscribing gpu unified virtual memory: Implications and suggestions,” in Proceedings of the 12nd International Conference on Performance Engineering (ICPE), 2022.
  122. M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
  123. J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent hardware management of stacked dram as part of memory,” in Proceedings of the 47th International Symposium on Microarchitecture (MICRO), 2014.
  124. J. Sim, G. H. Loh, H. Kim, M. OConnor, and M. Thottethodi, “A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,” in Proceedings of the 45th International Symposium on Microarchitecture (MICRO), 2012.
  125. A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” Ieee micro, vol. 36, no. 2, pp. 34–46, 2016.
  126. S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Spatial memory streaming,” in Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), 2006.
  127. Y. Song, W.-H. Kim, S. K. Monga, C. Min, and Y. I. Eom, “Prism: Optimizing key-value store for modern heterogeneous storage devices,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023.
  128. K. Stern, N. Wainstein, Y. Keller, C. M. Neumann, E. Pop, S. Kvatinsky, and E. Yalon, “Uncovering phase change memory energy limits by sub-nanosecond probing of power dissipation dynamics,” Advanced Electronic Materials, vol. 7, no. 8, p. 2100217, 2021.
  129. J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark suite for scientific and commercial throughput computing,” Center for Reliable and High-Performance Computing, vol. 127, p. 27, 2012.
  130. G. Thomas-Collignon and V. Mehta, “Optimizing cuda applications for nvidia a100 gpu,” NVIDIA GPU Technology Conference, 2020.
  131. D. Ustiugov, A. Daglis, J. Picorel, M. Sutherland, E. Bugnion, B. Falsafi, and D. Pnevmatikatos, “Design guidelines for high-performance scm hierarchies,” in Proceedings of the 4th International Symposium on Memory Systems (MEMSYS), 2018.
  132. Z. Wang, “Microsystems using three-dimensional integration and tsv technologies: Fundamentals and applications,” Microelectronic Engineering, vol. 210, pp. 35–64, 2019.
  133. Z. Wang, X. Liu, J. Yang, T. Michailidis, S. Swanson, and J. Zhao, “Characterizing and modeling non-volatile memory systems,” in Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), 2020.
  134. M. Webb, “Annual update on emerging memories 2020,” Flash Memory Summit, 2020.
  135. J. Wu, Y. Chen, W. S. Khwa, S. M. Yu, T. Y. Wang, J. Tseng, Y. Chih, and C. H. Diaz, “A 40nm low-power logic compatible phase change memory technology,” in Proceedings of the 63rd International Electron Devices Meeting (IEDM), 2018.
  136. L. Xiang, X. Zhao, J. Rao, S. Jiang, and H. Jiang, “Characterizing the performance of Intel Optane persistent memory: A close look at its on-DIMM buffering,” in Proceedings of the 17th European Conference on Computer Systems (EuroSys), 2022.
  137. F. Xiong, E. Yalon, A. Behnam, C. Neumann, K. Grosse, S. Deshmukh, and E. Pop, “Towards ultimate scaling limits of phase-change memory,” in 2016 IEEE International Electron Devices Meeting (IEDM), 2016.
  138. Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Nimble page management for tiered memory systems,” in Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
  139. D. Yang, J. Liu, J. Qi, and J. Lai, “WholeGraph: A fast graph neural network training framework with multi-GPU distributed shared memory architecture,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2022.
  140. L. Yavits, L. Orosa, S. Mahar, J. D. Ferreira, M. Erez, R. Ginosar, and O. Mutlu, “WoLFRaM: Enhancing wear-leveling and fault tolerance in resistive memories using programmable address decoders,” in Proceedings of the 38th International Conference on Computer Design (ICCD), 2020.
  141. J. Yi, M. Kim, J. Seo, N. Park, S. Lee, J. Kim, G. Do, H. Jang, H. Koo, S. Cho, S. Chae, T. Kim, M.-H. Na, and S. Cha, “The chalcogenide-based memory technology continues: Beyond 20nm 4-deck 256Gb cross-point memory,” in 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 2023.
  142. H. Yoon, J. Meza, R. Ausavarungnirun, R. A. Harding, and O. Mutlu, “Row buffer locality aware caching policies for hybrid memories,” in Proceedings of the 30th International Conference on Computer Design (ICCD), 2012.
  143. V. Young, Z. A. Chishti, and M. K. Qureshi, “TicToc: Enabling bandwidth-efficient DRAM caching for both hits and misses in hybrid memory systems,” in Proceedings of the 37th International Conference on Computer Design (ICCD), 2019.
  144. V. Young, C. Chou, A. Jaleel, and M. Qureshi, “ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction,” in Proceedings of the 45th International Symposium on Computer Architecture (ISCA), 2018.
  145. V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, “Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems,” in Proceedings of the 51st International Symposium on Microarchitecture (MICRO), 2018.
  146. V. Young, P. J. Nair, and M. K. Qureshi, “DICE: Compressing DRAM caches for bandwidth and capacity,” in Proceedings of the 44th International Symposium on Computer Architecture (ISCA), 2017.
  147. V. Young and M. K. Qureshi, “To update or not to update?: Bandwidth-efficient intelligent replacement policies for DRAM caches,” in Proceedings of the 37th International Conference on Computer Design (ICCD), 2019.
  148. X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient DRAM caching via software/hardware cooperation,” in Proceedings of the 50th International Symposium on Microarchitecture (MICRO), 2017.
  149. D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “TOP-PIM: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014.
  150. J. Zhang and M. Jung, “ZnG: Architecting GPU multi-processors with new flash for scalable data analysis,” in Proceedings of the 47th International Symposium on Computer Architecture (ISCA), 2020.
  151. J. Zhang and M. Jung, “Ohm-GPU: Integrating new optical network and heterogeneous memory into GPU multi-processors,” in Proceedings of the 54th International Symposium on Microarchitecture (MICRO), 2021.
  152. J. Zhang, M. Kwon, H. Kim, H. Kim, and M. Jung, “FlashGPU: Placing new flash next to GPU cores,” in Proceedings of the 56th Design Automation Conference (DAC), 2019.
  153. R. Zhang, M. R. Stan, and K. Skadron, “HotSpot 6.0: Validation, acceleration and extension,” University of Virginia, Tech. Rep., 2015.
  154. W. Zhang and T. Li, “Exploring phase change memory and 3D die-stacking for power/thermal friendly, fast and durable memory architectures,” in Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2009.
  155. J. Zhao and Y. Xie, “Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive data migration,” in Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2012.
  156. T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, “Towards high performance paged memory for GPUs,” in Proceedings of the 22nd International Symposium on High Performance Computer Architecture (HPCA), 2016.
Authors (6)
  1. Jeongmin Hong
  2. Sungjun Cho
  3. Geonwoo Park
  4. Wonhyuk Yang
  5. Young-Ho Gong
  6. Gwangsun Kim