
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access (2404.11044v1)

Published 17 Apr 2024 in cs.AR

Abstract: The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than those of local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. While modern out-of-order processors can exploit a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time; the longer latencies of far memory exacerbate this limitation. This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting functional unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory-request issuing from response handling to reduce resource occupation. Additionally, the AMU architecture supports up to several hundred outstanding asynchronous memory requests by repurposing a portion of the L2 cache as scratchpad memory (SPM) to provide sufficient temporary storage. Together with a coroutine-based programming framework, this scheme achieves significantly higher MLP for hiding far-memory latencies. Evaluation with cycle-accurate simulation shows that AMI achieves a 2.42x average speedup for memory-bound benchmarks with 1 µs of additional far-memory latency. Over 130 outstanding requests are supported, yielding a 26.86x speedup for GUPS (random access) at 5 µs latency. These results demonstrate how the proposed techniques mitigate far-memory performance impacts through explicit MLP expression and latency adaptation.
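The core idea — splitting a blocking load into an issue phase and a response-handling phase so many requests overlap — can be illustrated with a toy latency model. This is a minimal sketch, not the paper's implementation: the function names, the fixed single-latency assumption, and the batch-based pipelining model are all illustrative simplifications.

```python
import math

# Toy model of why split-phase (asynchronous) access raises
# memory-level parallelism (MLP). Assumption: every far-memory
# access takes a fixed `latency` in cycles.

def cycles_synchronous(n_requests, latency):
    # A blocking load occupies the core until its response returns,
    # so requests are fully serialized: total time is n * latency.
    return n_requests * latency

def cycles_asynchronous(n_requests, latency, max_outstanding):
    # Split-phase access: up to `max_outstanding` requests are in
    # flight simultaneously, so their latencies overlap. In this
    # idealized model, each "batch" of outstanding requests
    # completes in one latency period.
    batches = math.ceil(n_requests / max_outstanding)
    return batches * latency

if __name__ == "__main__":
    n, lat = 1000, 1000  # 1000 loads, 1000-cycle far-memory latency
    sync = cycles_synchronous(n, lat)
    for w in (1, 8, 128):
        asyn = cycles_asynchronous(n, lat, w)
        print(f"MLP={w:3d}: speedup {sync / asyn:.1f}x")
```

Under this idealized model, speedup scales with the number of outstanding requests (e.g., 128 in flight gives roughly two orders of magnitude over fully serialized access), which is why the abstract emphasizes supporting over 130 outstanding requests rather than a faster individual access path.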


