A Mess of Memory System Benchmarking, Simulation and Application Profiling (2405.10170v4)

Published 16 May 2024 in cs.AR

Abstract: The Memory stress (Mess) framework provides a unified view of memory system benchmarking, simulation, and application profiling. The Mess benchmark delivers a holistic and detailed memory system characterization based on hundreds of measurements, represented as a family of bandwidth–latency curves. The benchmark exceeds the coverage of all previous tools and leads to new findings about the behavior of actual and simulated memory systems. We deploy the Mess benchmark to characterize Intel, AMD, IBM, Fujitsu, Amazon, and NVIDIA servers with DDR4, DDR5, HBM2, and HBM2E memory. The Mess memory simulator uses the bandwidth–latency concept for memory performance simulation. We integrate Mess with widely used CPU simulators, enabling modeling of all high-end memory technologies. The Mess simulator is fast, easy to integrate, and closely matches actual system performance. By design, it enables quick adoption of new memory technologies in hardware simulators. Finally, Mess application profiling positions an application in the bandwidth–latency space of the target memory system. This information can be correlated with other application runtime activities and with the source code, leading to a better overall understanding of the application's behavior. The current Mess benchmark release covers all major CPU and GPU ISAs: x86, ARM, Power, RISC-V, and NVIDIA's PTX. We also release as open source the ZSim, gem5, and OpenPiton Metro-MPI simulators integrated with the Mess simulator for DDR4, DDR5, Optane, HBM2, HBM2E, and CXL memory expanders. Mess application profiling is already integrated into a suite of production HPC performance analysis tools.
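The core idea the abstract describes is modeling memory performance as a bandwidth–latency curve: for a given traffic mix, latency stays near its unloaded value at low bandwidth and rises sharply as traffic approaches saturation. The sketch below is not the Mess implementation; it only illustrates, with made-up sample points, how a measured curve could be interpolated to estimate latency at an application's operating point. The function name and the sample data are hypothetical.

```python
# Illustrative sketch (assumed, not the Mess code): interpolate a measured
# bandwidth -> latency curve to estimate latency at an operating point.
from bisect import bisect_left

def interp_latency(curve, bw):
    """Linearly interpolate latency (ns) at bandwidth bw (GB/s) from a
    curve given as a list of (bandwidth, latency) pairs sorted by bandwidth."""
    xs = [b for b, _ in curve]
    if bw <= xs[0]:           # below the first measurement: clamp
        return curve[0][1]
    if bw >= xs[-1]:          # beyond saturation: clamp to the last point
        return curve[-1][1]
    i = bisect_left(xs, bw)   # first measured point at or above bw
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    t = (bw - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Hypothetical curve for one read/write ratio: latency rises sharply
# as traffic approaches the saturation bandwidth (~120 GB/s here).
curve_100pct_reads = [(10, 90.0), (50, 100.0), (100, 130.0), (120, 300.0)]

print(interp_latency(curve_100pct_reads, 75))  # -> 115.0 ns
```

In the actual framework a family of such curves is measured per read/write ratio, so a simulator or profiler would first select (or interpolate between) curves for the observed traffic mix before looking up latency.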
