A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems

Published 2 Apr 2026 in cs.DC, cs.LG, and cs.PF | (2604.02158v1)

Abstract: Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter at NERSC, an HPE Cray EX system based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: 1) using only the Slurm accounting logs as training data and 2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU-compute and memory activity metrics exhibit good correlations with average power utilization, and our runtime power usage prediction experiments result in up to 92% prediction accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.

Authors (5)

Summary

The paper introduces a two-stage ML framework for predicting GPU resource and power usage, reaching up to 97% symmetric accuracy for maximum GPU utilization.
It employs LightGBM on static Slurm logs pre-runtime and fine-grained DCGM telemetry during execution for robust predictions.
Results show broad applicability across HPC applications, reducing resource overprovisioning and enabling dynamic energy management.

Two-Stage Machine Learning Framework for GPU Resource and Power Prediction in Heterogeneous HPC Environments

Background and Motivation

Modern high-performance computing (HPC) facilities face acute challenges in utilizing and provisioning GPU resources efficiently, driven by growing application complexity and increasingly stringent power budgets. Classical job submission systems such as Slurm, while providing static resource specifications, do not sufficiently capture the heterogeneous and dynamic patterns of GPU and memory utilization, nor do they provide capabilities for power-aware job scheduling. Applications like VASP, which dominate scientific workloads on systems such as NERSC's Perlmutter, exhibit broad variability in both computational and power demand, leading to resource overprovisioning, underutilization, and unpredictable energy costs. This context necessitates data-driven frameworks capable of accurate pre-runtime and runtime prediction of application resource and power needs using both static job descriptors and low-overhead, high-frequency telemetry.

Large-Scale Empirical Analysis of VASP GPU Jobs

The authors analyze one month (March 2025) of VASP job data on Perlmutter, 32,322 jobs in total, combining Slurm historical logs with NVIDIA DCGM metrics. The dataset shows that GPU utilization for VASP jobs is consistently high, while GPU memory is substantially underutilized: only 5% of jobs exceed 34% memory usage. Power consumption follows a heavy-tailed distribution, with most jobs averaging 107–220 W and a significant minority well above that range.

Figure 1: Distributions of maximum GPU utilization, maximum memory utilization, and average power for VASP GPU jobs on Perlmutter (March 2025).

These findings underscore the need for generalizable prediction methods that map job descriptors and low-level metrics to concrete resource selection and power management decisions.

Proposed Two-Stage Prediction Framework

The framework comprises two prediction stages, designed to operate both before job initiation and during job execution. Stage 1 exploits only Slurm submission features (job name, user, account, scientific category, requested CPUs/GPUs/memory/time limit); Stage 2 augments these with DCGM time-series telemetry, enabling fine-grained modeling of application runtime dynamics.
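As a rough illustration of how Stage 1 inputs might be prepared, the sketch below turns a Slurm submission record into a numeric feature row, integer-coding the categorical fields so a tree-based model could consume them. The field names, the example values, and the `OrdinalEncoder` helper are all hypothetical; the summary does not describe the authors' preprocessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlurmSubmission:
    # Illustrative field names; these are not Slurm's own accounting column names.
    job_name: str
    user: str
    account: str
    category: str
    req_cpus: int
    req_gpus: int
    req_mem_gb: float
    time_limit_min: int


CATEGORICAL = ("job_name", "user", "account", "category")
NUMERIC = ("req_cpus", "req_gpus", "req_mem_gb", "time_limit_min")


class OrdinalEncoder:
    """Give each new category the next integer; unseen values later map to -1."""

    def __init__(self):
        self.codes = {name: {} for name in CATEGORICAL}

    def fit(self, jobs):
        for job in jobs:
            for name in CATEGORICAL:
                table = self.codes[name]
                table.setdefault(getattr(job, name), len(table))
        return self

    def transform(self, job):
        cats = [self.codes[n].get(getattr(job, n), -1) for n in CATEGORICAL]
        nums = [float(getattr(job, n)) for n in NUMERIC]
        return cats + nums


# Made-up submissions, for illustration only.
jobs = [
    SlurmSubmission("vasp_relax", "alice", "m1234", "materials", 32, 4, 128.0, 120),
    SlurmSubmission("vasp_md", "bob", "m1234", "materials", 64, 4, 256.0, 480),
]
encoder = OrdinalEncoder().fit(jobs)
row = encoder.transform(jobs[1])
new_user_row = encoder.transform(
    SlurmSubmission("vasp_md", "carol", "m1234", "materials", 64, 4, 256.0, 480)
)
```

Mapping unseen users to a sentinel value is one simple way to keep predictions defined for first-time submitters; other encodings would work as well.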

Figure 2: Overview of the proposed two-stage framework for GPU resource and power prediction.

Both stages employ LightGBM as the primary ML model, chosen for its scalability and favorable trade-off between accuracy and computational cost. Regression models are used for pre-runtime, continuous-valued prediction; runtime power modeling is recast as a classification task, with power usage discretized for compatibility with hardware-level power capping strategies.
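The discretization of power into classes can be sketched as follows. The bin edges here are hypothetical placeholders, since the summary does not state the class boundaries used in the paper; only the general idea of mapping average power to an ordinal class is taken from the text.

```python
from bisect import bisect_right

# Hypothetical class boundaries in watts; the actual edges are not given
# in the summary.
POWER_BIN_EDGES_W = [100.0, 200.0, 300.0]


def power_class(avg_power_w, edges=POWER_BIN_EDGES_W):
    """Map an average power reading (watts) to an ordinal class index.

    Class 0 is strictly below the first edge; class len(edges) is at or
    above the last edge.
    """
    return bisect_right(edges, avg_power_w)


readings = [85.0, 150.0, 200.0, 310.5]
classes = [power_class(p) for p in readings]
```

Keeping the classes ordinal means a prediction that is off by one class corresponds to a power cap that is off by one step, which is the property the runtime evaluation relies on.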

Prediction Performance Before Job Execution

Pre-runtime predictions achieve up to 97% symmetric accuracy for maximum GPU utilization, 94% for average power, and 88% for maximum memory utilization; $R^2$ values indicate strong fit for average power and memory utilization. Compared with user-based KNN baselines (UoPC), the LightGBM-based framework generalizes to all users, including infrequent submitters, which is a key requirement for system-wide deployment.
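The summary does not define "symmetric accuracy". One plausible reading, assumed here and not confirmed by the source, is one minus a bounded symmetric mean absolute percentage error (sMAPE). A minimal sketch under that assumption, with made-up sample values:

```python
def symmetric_accuracy(y_true, y_pred):
    """Return 1 - sMAPE, using the sMAPE variant bounded in [0, 1].

    Per-sample error is |t - p| / (|t| + |p|); a pair of zeros counts as
    zero error. This definition is an assumption, not taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        total += abs(t - p) / denom if denom else 0.0
    return 1.0 - total / len(y_true)


# Made-up average power values in watts: one exact hit, one 20 W miss.
acc = symmetric_accuracy([100.0, 200.0], [100.0, 180.0])
```

Under this definition, over- and under-prediction of the same absolute size against the same pair of values are penalized equally, which is what makes the metric symmetric.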

Feature importances reveal that wall-time limit, job name, and user dominate predictive power, indicating that users encode relevant application behavior in batch script metadata.

Figure 3: Before job execution prediction: Normalized feature importance scores from the LightGBM regression models for each target variable.

These results support automatic, low-overhead integration of the first-stage predictor into batch systems, enabling informed scheduling and provisioning without privileged user intervention.

Runtime Power Prediction Leveraging Fine-Grained DCGM Metrics

The runtime prediction component adopts a sliding window over DCGM metrics, using the previous 30 seconds of telemetry to forecast the average power consumption class at the next time interval. This approach is explicitly designed to inform system-level power capping mechanisms, and is validated against simple historical baselines (mean and max of prior windows).
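The sliding-window setup and the two naive baselines can be sketched as below. The sampling interval is an assumption (10 seconds, so that three samples span 30 seconds), the trace is synthetic, and the function names are illustrative; in the described framework the predicted value would then be mapped to a power class.

```python
from statistics import mean


def sliding_windows(power_w, window):
    """Yield (history, next_value): `window` past samples and the one after."""
    for i in range(window, len(power_w)):
        yield power_w[i - window:i], power_w[i]


def baseline_predictions(power_w, window):
    """Return (mean_preds, max_preds, targets) for the two naive baselines."""
    mean_preds, max_preds, targets = [], [], []
    for history, nxt in sliding_windows(power_w, window):
        mean_preds.append(mean(history))
        max_preds.append(max(history))
        targets.append(nxt)
    return mean_preds, max_preds, targets


# Synthetic trace in watts, assumed to be sampled every 10 s so that
# window=3 covers the previous 30 s.
trace = [120.0, 130.0, 125.0, 240.0, 250.0, 245.0]
mean_preds, max_preds, targets = baseline_predictions(trace, window=3)
```

On this trace the mean baseline predicts 125 W just before the jump to 240 W, a small example of how history-only baselines trail behind workload transitions.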

The ML-based runtime predictor achieves 82% overall accuracy and a macro-averaged F1 of 0.80, well above both naive alternatives (63% accuracy, 0.62 F1). Crucially, misclassifications nearly always fall between adjacent power classes, so an incorrect prediction is typically off by only one power level, which limits its impact on dynamic energy management.
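Both quantities discussed here, macro-averaged F1 and the share of errors that fall in an adjacent class, can be computed directly from a confusion matrix. The sketch below uses a synthetic three-class matrix; the numbers are illustrative and are not the paper's results.

```python
def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix (rows true, cols predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def adjacent_error_share(cm):
    """Fraction of misclassifications exactly one class away from the truth.

    Returns 1.0 when there are no misclassifications at all.
    """
    n = len(cm)
    errors = sum(cm[r][c] for r in range(n) for c in range(n) if r != c)
    adjacent = sum(
        cm[r][c] for r in range(n) for c in range(n) if abs(r - c) == 1
    )
    return adjacent / errors if errors else 1.0


# Synthetic counts for three ordinal power classes (low, medium, high).
cm = [
    [8, 1, 1],
    [1, 7, 2],
    [0, 1, 9],
]
f1 = macro_f1(cm)
share = adjacent_error_share(cm)
```

Macro averaging weights every power class equally, so a rarely occurring high-power class affects the score as much as a common one.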

Figure 4: During job execution power prediction: Normalized confusion matrices comparing baseline methods and the proposed framework.

Examination of representative job windows demonstrates that the ML predictor closely tracks ground-truth power class evolution, whereas naive methods systematically lag or over/under-predict, particularly across transitions in workload behavior.

Figure 5: Test samples from a 10-minute VASP job execution window, showing runtime predictions of average power consumption class.

DCGM metrics, specifically power, memory, and SM utilization, are the most important runtime features. Static Slurm attributes contribute little at this stage, which highlights the need for real-time telemetry in power-aware system operation.

Figure 6: During job execution prediction: Normalized feature importance scores from the LightGBM classifier.

Generalizability and Broader Applicability

Although initial development and validation center on VASP workloads, the framework also generalizes to other HPC applications, including LAMMPS, Espresso, Atlas, and E3SM, with similar performance trends: average power prediction accuracy ranges from 0.73 to 0.92, and maximum GPU utilization prediction accuracy from 0.69 to 0.80. Memory usage remains more variable across workloads but shows consistent improvement over static baseline methods.

This underscores the value of the proposed approach: it is agnostic to scientific domain, incorporates only widely-available input features (Slurm, DCGM), and exhibits minimal operational overhead, making it directly applicable to a range of GPU-accelerated scientific applications in large-scale heterogeneous HPC environments.

Practical and Theoretical Implications

The work closes the gap between static, user-driven resource requests and the inherently dynamic behavior of real-world scientific codes. Accurate pre-runtime predictions directly reduce overprovisioning and underutilization by guiding resource allocation decisions, while runtime power predictions enable automated, fine-grained power capping and energy-aware scheduling.

From a theoretical perspective, this framework demonstrates that tree-based ML models trained on modest, well-chosen features can effectively learn the mapping from application descriptors and high-frequency telemetry to resource and power requirements, across diverse workload families. This signals that more complex, deep learning-based approaches may not be universally required for deployment-scale power and resource prediction tasks in operational HPC.

Conclusion

This paper delivers a practical, end-to-end ML framework for two-stage GPU resource and power prediction in heterogeneous HPC settings, validated at scale on Perlmutter's production workloads (2604.02158). The high prediction accuracy, broad generalizability, and low deployment cost suggest that such methodologies will increasingly become foundational for sustainable, high-throughput, and energy-aware supercomputing. Future directions include real-time system integration, closed-loop adaptive scheduling, and the extension of the framework to cover multi-tenant resources and more irregular job types.
