
Performance bottlenecks detection through microarchitectural sensitivity (2402.15773v1)

Published 24 Feb 2024 in cs.PF

Abstract: Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to make the most of hardware resources. We provide an in-depth overview of performance bottlenecks in recent OoO microarchitectures and describe the difficulties of detecting them. Techniques that measure resource utilization can offer a good understanding of a program's execution, but, due to the constraints inherent to the Performance Monitoring Units (PMUs) of CPUs, do not provide the relevant metrics for each use case. Another approach is to rely on a performance model to simulate the CPU behavior. Such a model makes it possible to implement any new microarchitecture-related metric. Within this framework, we advocate for implementing modeled resources as parameters that can be varied at will to reveal performance bottlenecks. This allows a generalization of bottleneck analysis that we call sensitivity analysis. We present Gus, a novel performance analysis tool that combines the advantages of sensitivity analysis and dynamic binary instrumentation within a resource-centric CPU model. We evaluate the impact of sensitivity on bottleneck analysis over a set of high-performance computing kernels.
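
The idea of varying modeled resources to reveal bottlenecks can be illustrated with a minimal sketch. This is not Gus or its CPU model: the resource names, the load and capacity numbers, and the max-of-ratios timing rule are all assumptions chosen for illustration. Each resource's capacity is scaled in turn, and the resulting predicted speedup shows which resource the toy kernel is sensitive to.

```python
from typing import Dict


def predicted_cycles(load: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Toy resource-centric model (assumption): the most saturated
    resource bounds execution time, so cycles = max(load / capacity)."""
    return max(load[r] / capacity[r] for r in load)


def sensitivity(load: Dict[str, float],
                capacity: Dict[str, float],
                factor: float = 2.0) -> Dict[str, float]:
    """Scale one resource capacity at a time and report the speedup
    predicted by the toy model for each variation."""
    base = predicted_cycles(load, capacity)
    speedups = {}
    for resource in capacity:
        varied = dict(capacity)                 # copy, leave the baseline intact
        varied[resource] = capacity[resource] * factor
        speedups[resource] = base / predicted_cycles(load, varied)
    return speedups


# Hypothetical kernel: work demanded of each resource, and how much
# of that work each resource can serve per cycle.
load = {"frontend": 400.0, "alu_ports": 600.0, "load_ports": 900.0}
capacity = {"frontend": 4.0, "alu_ports": 3.0, "load_ports": 2.0}

speedups = sensitivity(load, capacity)
bottleneck = max(speedups, key=speedups.get)
print(speedups)
print("most sensitive resource:", bottleneck)
```

In this sketch a speedup of 1.0 means the kernel is insensitive to that resource, while a speedup above 1.0 marks a bottleneck. A real tool would replace `predicted_cycles` with a far more detailed simulation driven by an instrumented execution.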

