Comprehensive Resource Measurement and Analysis for HPC Systems with TACC_Stats (1302.4085v1)
Abstract: High-performance computing (HPC) systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. Anomalous behavior has to be diagnosed and remedied manually, using incomplete and sparse data. It has also been effort-intensive for users to assess how effectively they are using the available resources. The data available for system-level analyses come from multiple sources and in disparate formats (from Linux "sysstat" and accounting to scheduler/kernel logs). Sysstat does not resolve its measurements by job, so job-oriented analyses require individual measurements. There are many user-oriented performance instrumentation and profiling tools, but they require extensive system knowledge, code changes, and recompilation, and thus are not widely used. To address these issues, we develop TACC_Stats, a job-oriented and logically structured version of the conventional Linux "sysstat/sar" system-wide performance monitor. We use data collected by TACC_Stats on the supercomputer "Ranger" to demonstrate its effectiveness in two case studies.