ML Workloads on HPC Systems
- ML Workloads on HPC Systems are dynamic, data-intensive tasks that integrate ML into scientific simulations and deep-learning model training.
- They leverage containerization, elastic resource management, and specialized accelerators to overcome traditional HPC limitations.
- They require advanced orchestration, robust monitoring, and adaptive scheduling to ensure optimal performance and operational efficiency.
Machine learning (ML) workloads have become an integral part of high-performance computing (HPC) environments, fundamentally altering system design, resource allocation, software infrastructure, data management strategies, and operational practices. The rise of ML within HPC has led to the development and adoption of new frameworks, benchmarks, and workflow orchestration systems, prompted changes in computing and storage architectures, and necessitated improvements in observability, robustness, and user accessibility.
1. Evolution and Definition of ML Workloads on HPC Systems
ML workloads on HPC systems encompass a spectrum ranging from training deep learning models on massive scientific datasets to integrating ML modules within simulation workflows for tasks like surrogate modeling, adaptive sampling, and automated data analysis. Unlike traditional simulation codes—which typically execute in batch, rely on large sequential file I/O, and demand rigid job scheduling—ML workloads require fast iterative prototyping, dynamic resource usage, higher rates of short and interactive jobs, and support for diverse containerized and service-oriented applications (Samsi et al., 2021, Schuppli et al., 2 Jul 2025). This evolution compels classical HPC centers, long optimized for tightly coupled, long-running applications, to adapt to the flexible, dynamic, and data-intensive nature of ML.
The heterogeneity of ML workloads is further reflected in the need for specialized accelerators (GPUs, IPUs, CS-2, etc.), multi-modal data ingestion and storage pathways, and the integration of cloud-native orchestration solutions (e.g., Kubernetes), even within traditional supercomputing environments (Rae et al., 16 Apr 2024, Schuppli et al., 2 Jul 2025).
2. Infrastructure, Resource Management, and Containerization
The deployment of ML workloads on large-scale HPC facilities faces several technical challenges relating to resource management, environment consistency, and performance isolation. Solutions include:
- Container Environments and User Abstraction: Container-based user environments enable ML practitioners to operate within familiar ecosystems while abstracting away HPC-specific details (Schuppli et al., 2 Jul 2025). For example, a Container Engine using an Environment Definition File (EDF) allows specification of images, volumes, working directories, and SSH settings in a portable, declarative syntax.
- Elastic Resource Allocation and Process Malleability: Static allocation models typical of HPC hinder efficient ML utilization. Methodologies such as those implemented in Kub permit elastic scaling of resources by orchestrating checkpoints and coordinated process restarts across containerized workloads in Kubernetes clusters (Medeiros et al., 14 Oct 2024). Scheduling frameworks incorporating MPI process malleability (through DMRlib and the Malleability Module of Proteo) dynamically adjust resource allocations, reducing workload completion time by up to 40% and increasing utilization by over 20% (Iserte et al., 17 Jun 2025).
- Security and Robustness: ML workflows often require frequent software updates and use external dependencies, presenting security concerns. Enhanced infrastructure includes container image scanning, firewall monitoring, and user education to ensure compliance with both traditional HPC and modern ML security needs (Schuppli et al., 2 Jul 2025).
- Node Vetting and Early Abort: Pre-execution validation tools diagnose node health—testing GPU temperatures, memory utilization, and network/NCCL bandwidth—to avoid expensive failures or wasted computational resources (Schuppli et al., 2 Jul 2025); a minimal vetting sketch follows this list.
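As a minimal illustration of such pre-execution vetting (not the production tooling cited above: the thresholds, the nvidia-smi query, and the exit-code convention are illustrative assumptions, and the NCCL bandwidth test is omitted), a node health check might look like:

```python
import subprocess
import sys

# Illustrative thresholds; real vetting policies are site-specific.
MAX_GPU_TEMP_C = 80
MAX_USED_MEM_MIB = 1024  # the node should be idle before the job starts

def vet_node() -> bool:
    """Query every GPU on the node and flag unhealthy ones before launch."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    healthy = True
    for line in out.strip().splitlines():
        idx, temp, mem_used = (int(x) for x in line.split(", "))
        if temp > MAX_GPU_TEMP_C or mem_used > MAX_USED_MEM_MIB:
            print(f"GPU {idx}: temp={temp}C mem_used={mem_used}MiB -> unhealthy")
            healthy = False
    return healthy

if __name__ == "__main__":
    # Early abort: return a non-zero exit code so a launcher script or
    # scheduler prolog can drain the node instead of starting the job.
    sys.exit(0 if vet_node() else 1)
```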
3. I/O Patterns, Storage, and Data Management
ML workloads introduce I/O challenges distinct from those of classical HPC applications (Lewis et al., 16 Apr 2024, Schuppli et al., 2 Jul 2025):
- Random, Small File Access and Metadata Bottlenecks: During training, frameworks often perform many small, random reads (e.g., loading mini-batches from thousands of files), leading to metadata server overload and underutilization of sequential storage bandwidth.
- Specialized Storage Tiering: A multi-tiered approach combines NVMe and NVMe-oF for high IOPS workloads, SSD/HDD-backed Lustre for bulk sequential I/O, and object storage (Ceph) for large checkpoints and unstructured data (Schuppli et al., 2 Jul 2025).
- I/O Optimizations: Techniques such as node-local caching (burst buffers), asynchronous data loading (e.g., TensorFlow's `tf.data.prefetch` or PyTorch's `DataLoader`), partial or hash-based shuffling, and file formats supporting chunking and decompression (TFRecord, NPZ, HDF5) are widely used or actively researched (Lewis et al., 16 Apr 2024). Some research middleware (NoPFS, iCache) exploits knowledge of ML data access patterns to prefetch data more effectively; a minimal asynchronous-loading sketch follows this list.
- Profiling and Benchmarking: Tools including DLIO, Darshan, and tf-Darshan—alongside ML benchmarks like MLPerf HPC, CosmoFlow, and DeepCAM—provide detailed insight into I/O throughput, latency, and system bottlenecks (Lewis et al., 16 Apr 2024, Farrell et al., 2021, Rae et al., 16 Apr 2024).
4. Workflow Orchestration, Service Planes, and Observability
The complexity and dynamism of ML-driven scientific campaigns necessitate robust workflow engines, service infrastructures, and observability tools that go beyond traditional batch schedulers (Peterson et al., 2019, Merzky et al., 17 Mar 2025, Schuppli et al., 2 Jul 2025):
- Hybrid and Multi-Component Workflow Systems: Orchestration frameworks like Merlin (Peterson et al., 2019) and scalable runtime extensions to RADICAL-Pilot (Merzky et al., 17 Mar 2025) enable composite workflows combining distributed simulations, ML model training and inference, and post-processing. Modern systems can launch service endpoints (ML inference servers) alongside batch simulations, expose service APIs via REST/ZeroMQ, and handle asynchronous communication for efficient concurrent execution; a minimal client-side sketch of this pattern appears after this list.
- Service Plane Infrastructure: The integration of Kubernetes (RKE2) as a service plane, orchestrated with tools such as ArgoCD and NVIDIA GPU Operators, facilitates the deployment of support and inference services, lightweight control-plane components, and experiment tracking systems (e.g., MLflow), alongside the main data plane for GPU-intensive workloads (Schuppli et al., 2 Jul 2025).
- Monitoring, Observability, and Diagnostics: Stacks such as EMOI aggregate per-job and global telemetry (GPU, network, I/O) and visualize them via dashboards/web UIs, enabling anomaly detection, performance bottleneck identification, and straggler analysis (Schuppli et al., 2 Jul 2025). GPU saturation scorers use weighted aggregations of device metrics to present real-time, actionable performance summaries, e.g. $S = \sum_i w_i m_i$, where the $m_i$ denote individual GPU performance metrics (SM occupancy, memory bandwidth, etc.) and the $w_i$ are weights (Schuppli et al., 2 Jul 2025); a toy scorer in this form appears directly after this list.
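A toy scorer in that weighted-sum form (the metric names and weights below are illustrative assumptions, not the EMOI configuration) could be:

```python
# Weighted GPU saturation score S = sum_i(w_i * m_i), with metrics
# normalized to [0, 1]. Metric names and weights are illustrative only.
WEIGHTS = {
    "sm_occupancy": 0.4,
    "memory_bandwidth_util": 0.3,
    "nvlink_util": 0.2,
    "pcie_util": 0.1,
}

def saturation_score(metrics: dict[str, float]) -> float:
    """Aggregate normalized device metrics into a single [0, 1] score."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

# Example: a GPU busy on compute but with mostly idle interconnects.
print(saturation_score({
    "sm_occupancy": 0.92,
    "memory_bandwidth_util": 0.55,
    "nvlink_util": 0.05,
    "pcie_util": 0.10,
}))  # -> 0.553
```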
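And a minimal sketch of the service-endpoint pattern from the workflow bullet above, in which a simulation step sends surrogate-model requests to a co-scheduled inference service over ZeroMQ (the endpoint address, the simple REQ/REP socket choice, and the raw-float message format are assumptions, not any particular framework's API):

```python
import zmq
import numpy as np

ENDPOINT = "tcp://ml-service-node:5555"  # placeholder address of the inference service

def request_inference(features: np.ndarray) -> np.ndarray:
    """Send one surrogate-model request from a simulation step and wait for the reply."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)   # simple request/reply; real systems often use
    sock.connect(ENDPOINT)       # DEALER/ROUTER sockets for asynchronous, concurrent traffic
    sock.send(features.astype(np.float32).tobytes())
    reply = sock.recv()
    sock.close()
    return np.frombuffer(reply, dtype=np.float32)

# Inside the simulation loop, an expensive kernel could be replaced by the surrogate:
# prediction = request_inference(np.array([temperature, pressure, density]))
```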
5. Resource Utilization, Scheduling, and Operational Impacts
Research based on long-term operational data yields several critical insights into the impact of ML workloads (Chu et al., 13 Sep 2024, Iserte et al., 17 Jun 2025):
- Resource and Energy Consumption: ML jobs, although comprising only ~9% of submissions, can account for ~39% of cluster energy consumption due to their much longer runtimes and high GPU usage. Median runtimes are an order of magnitude longer than those of generic jobs.
- Failure Modes and Node Health: ML jobs exhibit higher failure (17% vs. 14%) and cancellation (13% vs. 4%) rates. Thermal challenges are prominent: GPUs often reach thermal limits (17.4% above 90% utilization), with observed variance in temperature response by GPU index, suggesting scheduling could be optimized to exploit cooler hardware positions (Chu et al., 13 Sep 2024).
- Service Robustness and Early Detection: Node vetting and early abort mechanisms reduce wasted time and energy by halting jobs on problematic nodes before full-scale deployment (Schuppli et al., 2 Jul 2025).
- Scheduling Approaches for Heterogeneous, Dynamic Loads: Techniques include process malleability (DMRlib, MaM/Proteo), hybrid workload scheduling that supports malleable and on-demand jobs, and predictive queue wait-time estimation via machine learning models, which dramatically improves start-time accuracy relative to Slurm's default estimators (Fan et al., 2021, Iserte et al., 17 Jun 2025, Brown et al., 2022); a toy wait-time predictor follows this list.
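A toy illustration of ML-based queue wait-time prediction (synthetic features, a generic gradient-boosting regressor, and made-up coefficients; not the models from the cited studies):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic job records: nodes requested, walltime requested (hours), queue
# depth at submission, and GPUs per node. Real predictors would be trained on
# the scheduler's historical accounting logs.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.integers(1, 128, n),      # nodes requested
    rng.uniform(0.1, 24.0, n),    # walltime requested (hours)
    rng.integers(0, 500, n),      # jobs ahead in the queue
    rng.integers(0, 5, n),        # GPUs per node requested
])
# Toy ground truth: wait grows with queue depth and resource ask, plus noise.
y = 0.02 * X[:, 0] + 0.05 * X[:, 1] + 0.01 * X[:, 2] + rng.exponential(0.5, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"Predicted wait (h) for a 64-node, 12 h, 200-deep-queue GPU job: "
      f"{model.predict([[64, 12.0, 200, 4]])[0]:.2f}")
```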
6. Performance Characterization, Benchmarking, and Communication Optimization
Performance evaluation and modeling for ML training on large HPC systems are essential for optimization and capacity planning (Lin et al., 19 Apr 2024, Farrell et al., 2021, Rae et al., 16 Apr 2024, Namashivayam, 31 Mar 2025):
- End-to-End Benchmarking: Suites like MLPerf HPC provide standardized, science-oriented benchmarks measuring not only raw compute throughput but also data staging, algorithmic convergence, and I/O (Farrell et al., 2021, Rae et al., 16 Apr 2024).
- Performance Modeling: Universal models account for communication collectives (allreduce, all-to-all), data-distribution-aware operations (e.g., embedding lookups), and the impact of hardware topology (NVLink, PCIe). A representative collective communication cost model expresses the time to transfer an $n$-byte message as a piecewise, sigmoid-fitted function $T(n) = \alpha + n/\beta(n)$, where $\alpha$ is the startup latency and $\beta$ the saturating (large-message) bandwidth that $\beta(n)$ approaches (Lin et al., 19 Apr 2024); a fitting sketch follows this list.
- GPU-Centric Communication Schemes: Recent work explores offloading communication control from CPUs to GPUs, using "Stream Triggered," "Kernel Triggered," and "Kernel Initiated" schemes to reduce latency and improve overlap between compute and communication phases in distributed ML training (Namashivayam, 31 Mar 2025).
- Benchmarking Across Architectures: Tools such as ReFrame, extended to Kubernetes-managed clouds, enable unified benchmarking of ML workloads across diverse accelerators (NVIDIA GPUs, Graphcore Bow Pod64, Cerebras CS-2) and management planes (Rae et al., 16 Apr 2024).
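A sketch of fitting the latency/bandwidth cost model from the performance-modeling bullet to measured allreduce times; the specific sigmoid parameterization of the effective bandwidth and the synthetic measurements are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def comm_time(n_bytes, alpha, beta, k, n0):
    """T(n) = alpha + n / beta_eff(n), where beta_eff is a sigmoid in log(n)
    that saturates at the peak bandwidth beta (bytes/s) for large messages."""
    beta_eff = beta / (1.0 + np.exp(-k * (np.log(n_bytes) - np.log(n0))))
    return alpha + n_bytes / beta_eff

# Measured (message size, allreduce time) pairs would come from a
# microbenchmark sweep; here we synthesize them so the sketch is runnable.
sizes = np.logspace(10, 30, 40, base=2)
true_times = comm_time(sizes, alpha=2e-5, beta=40e9, k=1.2, n0=2**18)
times = true_times * (1 + 0.05 * np.random.default_rng(0).standard_normal(sizes.size))

params, _ = curve_fit(comm_time, sizes, times,
                      p0=[1e-5, 10e9, 1.0, 1e5], maxfev=20000)
alpha, beta, k, n0 = params
print(f"startup latency ~ {alpha * 1e6:.1f} us, "
      f"saturating bandwidth ~ {beta / 1e9:.1f} GB/s")
```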
7. Future Directions and Open Challenges
Several key challenges and potential research avenues persist across the literature:
- Standardization of I/O Stacks: There is no universal, robust I/O middleware for ML on HPC; the resulting fragmented, often suboptimal solutions hinder performance (Lewis et al., 16 Apr 2024).
- Integration Across Heterogeneous, Elastic, and Service-Oriented Platforms: Harmonizing the interaction between dynamic, containerized job management (Kubernetes), traditional batch schedulers (Slurm), and service-based workflows remains an ongoing effort (Medeiros et al., 14 Oct 2024, Merzky et al., 17 Mar 2025, Schuppli et al., 2 Jul 2025).
- Energy Efficiency and Adaptive Scheduling: As data center energy budgets tighten, deeper integration of energy-aware scheduling, predictive job management, and early failure detection will be increasingly critical (Chu et al., 13 Sep 2024).
- Transparent, Automated Optimization: There is a recognized gap in the development of user-transparent, automated optimization layers that dynamically adjust workflow and I/O strategies to match workload requirements and system performance (Lewis et al., 16 Apr 2024, Schuppli et al., 2 Jul 2025).
- Observability, Monitoring, and Usability: As workloads and infrastructure grow in complexity, comprehensive, real-time observability stacks, along with structured diagnostic and node vetting protocols, will be essential for ensuring both scientific productivity and operational robustness.
The convergence of ML and HPC is reshaping both research and operational paradigms. Through architectural evolution, workflow and scheduling innovation, nuanced performance modeling, and operational best practices, modern HPC centers are increasingly able to serve the dynamic, heterogeneous demands of ML workloads while preserving reliability, usability, and efficiency.