Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access (2404.11044v1)
Abstract: The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than those of local DRAM. To achieve acceptable performance on far memory, applications need a high degree of memory-level parallelism (MLP) to tolerate the long access latency. While modern out-of-order processors can exploit a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time; the longer latencies of far memory exacerbate this limitation. This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting functional unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, the AMU architecture supports up to several hundred asynchronous memory requests by repurposing a portion of the L2 cache as scratchpad memory (SPM) to provide sufficient temporary storage. Together with a coroutine-based programming framework, this scheme can achieve significantly higher MLP for hiding far memory latencies. Evaluation with cycle-accurate simulation shows that AMI achieves a 2.42x speedup on average for memory-bound benchmarks with 1 µs of additional far memory latency. Over 130 outstanding requests are supported, with a 26.86x speedup for GUPS (random access) at 5 µs latency. These results demonstrate how the techniques mitigate the performance impact of far memory through explicit MLP expression and latency adaptation.
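The split between issuing a request and handling its response, combined with coroutines, can be illustrated with a toy software model. The sketch below is not the paper's AMI instruction set or AMU design: the names `ToyAMU`, `issue`, `poll`, `lookup`, and `run`, the fixed latency, and the tick-based clock are all assumptions made for illustration. It only shows why allowing many outstanding requests hides a long access latency.

```python
from collections import deque


class ToyAMU:
    """Hypothetical stand-in for an asynchronous memory unit (not the paper's AMU)."""

    def __init__(self, memory, latency, max_outstanding):
        self.memory = memory
        self.latency = latency                  # assumed constant, in simulated ticks
        self.max_outstanding = max_outstanding  # capacity for in-flight requests
        self.now = 0
        self.inflight = deque()                 # (ready_time, request_id, address)
        self.next_id = 0
        self.peak = 0                           # highest number of in-flight requests seen

    def can_issue(self):
        return len(self.inflight) < self.max_outstanding

    def issue(self, addr):
        """Start a load and return immediately with a request id."""
        rid = self.next_id
        self.next_id += 1
        self.inflight.append((self.now + self.latency, rid, addr))
        self.peak = max(self.peak, len(self.inflight))
        return rid

    def poll(self):
        """Return (request_id, value) for one finished request, or None."""
        if self.inflight and self.inflight[0][0] <= self.now:
            _, rid, addr = self.inflight.popleft()
            return rid, self.memory[addr]
        return None

    def tick(self):
        self.now += 1


def lookup(addr):
    """Coroutine performing exactly one load: yield the address, resume with the value."""
    value = yield addr
    return value


def run(amu, addrs):
    """Scheduler: issue as many loads as the unit allows, resume coroutines on completion."""
    pending = deque(addrs)
    waiting = {}                                # request_id -> suspended coroutine
    results = []
    while pending or waiting:
        while pending and amu.can_issue():      # issue phase
            co = lookup(pending.popleft())
            waiting[amu.issue(next(co))] = co
        done = amu.poll()                       # response phase
        while done is not None:
            rid, value = done
            try:
                waiting.pop(rid).send(value)
            except StopIteration as stop:
                results.append(stop.value)
            done = amu.poll()
        amu.tick()
    return results, amu.now


mem = [i * i for i in range(64)]
addrs = list(range(64))
serial_amu = ToyAMU(mem, latency=100, max_outstanding=1)
serial_results, serial_time = run(serial_amu, addrs)
par_amu = ToyAMU(mem, latency=100, max_outstanding=64)
par_results, par_time = run(par_amu, addrs)
print(serial_time, par_time, par_amu.peak)
```

With a single outstanding request, every load pays the full latency in sequence. With 64 outstanding requests, the latencies overlap and the same work finishes in far fewer simulated ticks. The real hardware and instruction semantics are described in the paper itself, not by this model.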