Comprehensive Resource Measurement and Analysis for HPC Systems with TACC_Stats

Published 17 Feb 2013 in cs.DC and cs.PF | (1302.4085v1)

Abstract: High-performance computing (HPC) systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. Anomalous behavior has to be manually diagnosed and remedied with incomplete and sparse data. It has also been effort-intensive for users to assess how effectively they are using the available resources. The data available for system-level analyses come from multiple sources and in disparate formats (from Linux "sysstat" and accounting to scheduler/kernel logs). Sysstat does not resolve its measurements by job, so job-oriented analyses require individual measurements. There are many user-oriented performance instrumentation and profiling tools, but they require extensive system knowledge, code changes, and recompilation, and thus are not widely used. To address this issue, we develop TACC_Stats, a job-oriented and logically structured version of the conventional Linux "sysstat/sar" system-wide performance monitor. We use data collected by TACC_Stats on the supercomputer "Ranger" to demonstrate its effectiveness in two case studies.
