Automated Programmatic Performance Analysis of Parallel Programs (2401.13150v1)
Abstract: Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.
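To make one of the listed tasks concrete, the following is a minimal sketch of a load imbalance calculation, assuming one common definition: the maximum of a metric across processes divided by its mean, computed per call-tree node. It is not Chopper's API. The function name `load_imbalance`, the record layout `(node, rank, time)`, and the sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def load_imbalance(records):
    """Return {node: max/mean of the metric across ranks}.

    `records` is an iterable of (node, rank, value) tuples. This layout is
    an assumption made for the sketch, not a format defined by the paper.
    A ratio of 1.0 means perfectly balanced; larger values mean that the
    slowest rank spends more time in the node than the average rank.
    """
    per_node = defaultdict(list)
    for node, _rank, value in records:
        per_node[node].append(value)
    return {node: max(vals) / mean(vals) for node, vals in per_node.items()}


# Hypothetical per-rank timings (seconds) for two call-tree nodes on 4 ranks.
profile = [
    ("solve", 0, 4.0), ("solve", 1, 4.0), ("solve", 2, 4.0), ("solve", 3, 8.0),
    ("setup", 0, 2.0), ("setup", 1, 2.0), ("setup", 2, 2.0), ("setup", 3, 2.0),
]

imbalance = load_imbalance(profile)

# Rank nodes from most to least imbalanced.
ranked = sorted(imbalance.items(), key=lambda item: item[1], reverse=True)
for node, ratio in ranked:
    print(f"{node}: {ratio:.2f}")
```

In this sample, rank 3 spends twice as long in `solve` as the other ranks, so `solve` ranks above the perfectly balanced `setup` node.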