ML Workloads on HPC Systems
- ML Workloads on HPC Systems are dynamic, data-intensive tasks that integrate ML into scientific simulations and deep-learning model training.
- They leverage containerization, elastic resource management, and specialized accelerators to overcome traditional HPC limitations.
- They require advanced orchestration, robust monitoring, and adaptive scheduling to ensure optimal performance and operational efficiency.
Machine learning (ML) workloads have become an integral part of high-performance computing (HPC) environments, fundamentally altering system design, resource allocation, software infrastructure, data management strategies, and operational practices. The rise of ML within HPC has led to the development and adoption of new frameworks, benchmarks, and workflow orchestration systems, prompted changes in computing and storage architectures, and necessitated improvements in observability, robustness, and user accessibility.
1. Evolution and Definition of ML Workloads on HPC Systems
ML workloads on HPC systems encompass a spectrum ranging from training deep learning models on massive scientific datasets to integrating ML modules within simulation workflows for tasks like surrogate modeling, adaptive sampling, and automated data analysis. Unlike traditional simulation codes, which typically execute in batch, rely on large sequential file I/O, and demand rigid job scheduling, ML workloads require fast iterative prototyping, dynamic resource usage, higher rates of short and interactive jobs, and diverse support for containerized and service-oriented applications (2108.02037, 2507.01880). This evolution compels classical HPC centers, which were optimized for tightly coupled, long-running applications, to adapt to the flexible, dynamic, and data-intensive nature of ML.
The heterogeneity of ML workloads is further reflected in the need for specialized accelerators (GPUs, IPUs, CS-2, etc.), multi-modal data ingestion and storage pathways, and the integration of cloud-native orchestration solutions (e.g., Kubernetes), even within traditional supercomputing environments (2404.10536, 2507.01880).
2. Infrastructure, Resource Management, and Containerization
The deployment of ML workloads on large-scale HPC facilities faces several technical challenges relating to resource management, environment consistency, and performance isolation. Solutions include:
- Container Environments and User Abstraction: Container-based user environments enable ML practitioners to operate within familiar ecosystems while abstracting away HPC-specific details (2507.01880). For example, a Container Engine using an Environment Definition File (EDF) allows specification of images, volumes, working directories, and SSH settings in a portable, declarative syntax.
- Elastic Resource Allocation and Process Malleability: Static allocation models typical of HPC hinder efficient ML utilization. Methodologies such as those implemented in Kub permit elastic scaling of resources by orchestrating checkpoints and coordinated process restarts across containerized workloads in Kubernetes clusters (2410.10655). Scheduling frameworks incorporating MPI process malleability (through DMRlib and the Malleability Module of Proteo) dynamically adjust resource allocations, reducing workload completion time by up to 40% and increasing utilization by over 20% (2506.14743).
- Security and Robustness: ML workflows often require frequent software updates and rely on external dependencies, raising security concerns. Enhanced infrastructure includes container image scanning, firewall monitoring, and user education to ensure compliance with both traditional HPC and modern ML security needs (2507.01880).
- Node Vetting and Early Abort: Pre-execution validation tools diagnose node health (testing GPU temperatures, memory utilization, and network/NCCL bandwidth) to avoid expensive failures or wasted computational resources (2507.01880).
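The node-vetting idea can be approximated with a short pre-flight script. The sketch below is a hypothetical illustration (not the tooling described in 2507.01880): it queries nvidia-smi for temperature and residual memory usage and exits non-zero so a batch prologue could abort the job early; the thresholds are assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical pre-flight node check: abort early if GPUs look unhealthy."""
import subprocess
import sys

# Assumed thresholds; real sites would tune these per hardware generation.
MAX_TEMP_C = 80           # refuse nodes whose GPUs are already hot
MAX_MEM_USED_FRAC = 0.05  # refuse nodes with leftover memory allocations

def gpu_health_ok() -> bool:
    """Query nvidia-smi and check temperature and residual memory usage."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    healthy = True
    for line in out.strip().splitlines():
        idx, temp, used, total = (v.strip() for v in line.split(","))
        if int(temp) > MAX_TEMP_C:
            print(f"GPU {idx}: temperature {temp} C exceeds {MAX_TEMP_C} C")
            healthy = False
        if int(used) / int(total) > MAX_MEM_USED_FRAC:
            print(f"GPU {idx}: {used} MiB already in use")
            healthy = False
    return healthy

if __name__ == "__main__":
    # A full vetting step would also run an NCCL bandwidth test
    # (e.g., nccl-tests' all_reduce_perf) before declaring the node fit.
    sys.exit(0 if gpu_health_ok() else 1)
```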
3. I/O Patterns, Storage, and Data Management
ML workloads introduce I/O challenges distinct from those of classical HPC applications (2404.10386, 2507.01880):
- Random, Small File Access and Metadata Bottlenecks: During training, frameworks often perform many small, random reads (e.g., loading mini-batches from thousands of files), leading to metadata server overload and underutilization of sequential storage bandwidth.
- Specialized Storage Tiering: A multi-tiered approach combines NVMe and NVMe-oF for high-IOPS workloads, SSD/HDD-backed Lustre for bulk sequential I/O, and object storage (Ceph) for large checkpoints and unstructured data (2507.01880).
- I/O Optimizations: Techniques such as node-local caching (burst buffers), asynchronous data loading (e.g., TensorFlow's tf.data.prefetch or PyTorch's DataLoader), partial or hash-based shuffling, and file formats supporting chunking and decompression (TFRecord, NPZ, HDF5) are widely used or actively researched (2404.10386); see the loader sketch after this list. Some research middleware (NoPFS, iCache) exploits knowledge of ML data access patterns to prefetch data more effectively.
- Profiling and Benchmarking: Tools including DLIO, Darshan, and tf-Darshan, alongside ML benchmarks like MLPerf HPC, CosmoFlow, and DeepCAM, provide detailed insight into I/O throughput, latency, and system bottlenecks (2404.10386, 2110.11466, 2404.10536).
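As a concrete instance of the asynchronous loading pattern above, the sketch below configures a PyTorch DataLoader with background workers and prefetching over a directory of many small files. The dataset path, file layout, and loader parameters are illustrative assumptions, not a recommendation from the cited work.

```python
# Minimal sketch of asynchronous, prefetched data loading with PyTorch.
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SmallFileDataset(Dataset):
    """One sample per .npy file: the many-small-reads pattern that stresses metadata servers."""
    def __init__(self, root: str):
        self.files = sorted(Path(root).glob("*.npy"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.from_numpy(np.load(self.files[idx]))

loader = DataLoader(
    SmallFileDataset("/scratch/train_shards"),  # hypothetical node-local cache path
    batch_size=64,
    shuffle=True,            # full shuffle; partial/hash-based shuffling would replace this
    num_workers=8,           # background processes overlap I/O with compute
    prefetch_factor=4,       # each worker keeps several batches in flight
    pin_memory=True,         # faster host-to-GPU copies
    persistent_workers=True,
)

for batch in loader:
    pass  # the training step would consume `batch` here
```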
4. Workflow Orchestration, Service Planes, and Observability
The complexity and dynamism of ML-driven scientific campaigns necessitate robust workflow engines, service infrastructures, and observability tools that go beyond traditional batch schedulers (1912.02892, 2503.13343, 2507.01880):
- Hybrid and Multi-Component Workflow Systems: Orchestration frameworks like Merlin (1912.02892) and scalable runtime extensions to RADICAL-Pilot (2503.13343) enable composite workflows combining distributed simulations, ML model training and inference, and post-processing. Modern systems can launch service endpoints (ML inference servers) alongside batch simulations, expose service APIs via REST/ZeroMQ, and handle asynchronous communication for efficient concurrent execution.
- Service Plane Infrastructure: The integration of Kubernetes (RKE2) as a service plane, orchestrated with tools such as ArgoCD and NVIDIA GPU Operators, facilitates the deployment of support and inference services, lightweight control-plane components, and experiment tracking systems (e.g., MLFlow), alongside the main data plane for GPU-intensive workloads (2507.01880).
- Monitoring, Observability, and Diagnostics: Stacks such as EMOI aggregate per-job and global telemetry (GPU, network, I/O) and visualize it via dashboards/web UIs, enabling anomaly detection, performance bottleneck identification, and straggler analysis (2507.01880). GPU saturation scorers use weighted aggregations of device metrics to present real-time, actionable performance summaries of the form $S = \sum_i w_i m_i$, where $m_i$ denotes individual GPU performance metrics (SM occupancy, memory bandwidth, etc.) and $w_i$ are the corresponding weights (2507.01880).
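A minimal sketch of such a saturation score is shown below; the metric names, weights, and [0, 1] normalization are assumptions for illustration, since the exact metrics and weights of the cited scorer are not reproduced here.

```python
# Sketch of a weighted GPU saturation score S = sum_i w_i * m_i.
from typing import Dict

def saturation_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine normalized device metrics (each in [0, 1]) into a single score."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights) / total_weight

# Example telemetry sample for one GPU (fractions of peak); values are made up.
sample = {
    "sm_occupancy": 0.82,        # fraction of SMs busy
    "memory_bandwidth": 0.64,    # fraction of peak HBM bandwidth
    "nvlink_utilization": 0.31,  # fraction of peak interconnect bandwidth
}
weights = {"sm_occupancy": 0.5, "memory_bandwidth": 0.3, "nvlink_utilization": 0.2}

print(f"GPU saturation score: {saturation_score(sample, weights):.2f}")
```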
5. Resource Utilization, Scheduling, and Operational Impacts
Research based on long-term operational data yields several critical insights into the impact of ML workloads (2409.08949, 2506.14743):
- Resource and Energy Consumption: ML jobs, although comprising only ~9% of submissions, can account for ~39% of cluster energy consumption due to their much longer runtimes and high GPU usage. Median run times are an order of magnitude greater than those of generic jobs.
- Failure Modes and Node Health: ML jobs exhibit higher failure (17% vs. 14%) and cancellation (13% vs. 4%) rates. Thermal challenges are prominent: GPUs often reach thermal limits (17.4% above 90% utilization), with observed variance in temperature response by GPU index, suggesting scheduling could be optimized to exploit cooler hardware positions (2409.08949).
- Service Robustness and Early Detection: Node vetting and early-abort mechanisms reduce wasted time and energy by halting jobs on problematic nodes before full-scale deployment (2507.01880).
- Scheduling Approaches for Heterogeneous, Dynamic Loads: Techniques include process malleability (DMRlib, MaM/Proteo), hybrid workload scheduling that supports malleable and on-demand jobs, and predictive queue wait time estimation via machine learning models, which dramatically improve start-time accuracy relative to default estimators such as Slurm's (2109.05412, 2506.14743, 2204.13543); a wait-time prediction sketch follows this list.
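To illustrate the queue wait-time prediction idea, the sketch below trains a gradient-boosting regressor on hypothetical job-submission features (requested nodes, requested walltime, queue depth). The feature set, synthetic data, and model choice are assumptions and do not reproduce the approach of any specific cited paper.

```python
# Sketch: predicting queue wait times from job-submission features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_jobs = 5000

# Synthetic historical job log: [requested_nodes, requested_walltime_h, queue_depth].
X = np.column_stack([
    rng.integers(1, 128, n_jobs),    # requested node count
    rng.uniform(0.5, 24.0, n_jobs),  # requested walltime (hours)
    rng.integers(0, 200, n_jobs),    # jobs ahead in the queue at submission
])
# Synthetic wait times loosely correlated with demand (minutes).
y = 2.0 * X[:, 0] + 5.0 * X[:, 2] + rng.normal(0, 30, n_jobs).clip(min=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

pred = model.predict(X_test)
print(f"MAE on held-out jobs: {mean_absolute_error(y_test, pred):.1f} minutes")
```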
6. Performance Characterization, Benchmarking, and Communication Optimization
Performance evaluation and modeling for ML training on large HPC systems are essential for optimization and capacity planning (2404.12674, 2110.11466, 2404.10536, 2503.24230):
- End-to-End Benchmarking: Suites like MLPerf HPC provide standardized, science-oriented benchmarks measuring not only raw compute throughput but also data staging, algorithmic convergence, and I/O (2110.11466, 2404.10536).
- Performance Modeling: Universal models account for communication collectives (allreduce, all-to-all), data-distribution-aware operations (e.g., embedding lookups), and the impact of hardware topology (NVLink, PCIe). A representative collective communication cost model uses a piecewise, sigmoid-fitted function of the message size $m$, of the asymptotic form $T(m) \approx \alpha + m/\beta$, where $\alpha$ is the startup latency and $\beta$ the saturating bandwidth (2404.12674); a fitting sketch follows this list.
- GPU-Centric Communication Schemes: Recent work explores offloading communication control from CPUs to GPUs, using "Stream Triggered," "Kernel Triggered," and "Kernel Initiated" schemes to reduce latency and improve overlap between compute and communication phases in distributed ML training (2503.24230).
- Benchmarking Across Architectures: Tools such as ReFrame, extended to Kubernetes-managed clouds, enable unified benchmarking of ML workloads across diverse accelerators (Nvidia GPUs, Graphcore Bow Pod64, Cerebras CS-2) and management planes (2404.10536).
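To make the latency-bandwidth model above concrete, the sketch below fits $\alpha$ and $\beta$ to synthetic allreduce timings with a least-squares fit. The data and the simple two-parameter form are assumptions standing in for the richer piecewise, sigmoid-fitted model described in 2404.12674.

```python
# Sketch: fitting a simple alpha-beta cost model T(m) = alpha + m / beta
# to measured collective times. Data here is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def cost_model(m, alpha, beta):
    """Predicted collective time: startup latency plus message size over bandwidth."""
    return alpha + m / beta

# Synthetic allreduce measurements: message sizes (bytes) and times (seconds).
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7, 1e8])
times = 20e-6 + sizes / 12e9 + np.random.default_rng(0).normal(0, 1e-6, sizes.size)

(alpha, beta), _ = curve_fit(cost_model, sizes, times, p0=[1e-5, 1e9])
print(f"startup latency alpha ~ {alpha * 1e6:.1f} us, "
      f"saturating bandwidth beta ~ {beta / 1e9:.1f} GB/s")

# Predict the time for a 256 MiB allreduce payload with the fitted model.
print(f"T(256 MiB) ~ {cost_model(256 * 2**20, alpha, beta) * 1e3:.2f} ms")
```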
7. Future Directions and Open Challenges
Several key challenges and potential research avenues persist across the literature:
- Standardization of I/O Stacks: There is no universal, robust I/O middleware for ML on HPC, which leads to fragmented and often suboptimal solutions that hinder performance (2404.10386).
- Integration Across Heterogeneous, Elastic, and Service-Oriented Platforms: Harmonizing the interaction between dynamic, containerized job management (Kubernetes), traditional batch schedulers (Slurm), and service-based workflows remains an ongoing effort (2410.10655, 2503.13343, 2507.01880).
- Energy Efficiency and Adaptive Scheduling: As data center energy budgets tighten, deeper integration of energy-aware scheduling, predictive job management, and early failure detection will be increasingly critical (2409.08949).
- Transparent, Automated Optimization: There is a recognized gap in user-transparent, automated optimization layers that dynamically adjust workflow and I/O strategies to match workload requirements and system performance (2404.10386, 2507.01880).
- Observability, Monitoring, and Usability: As workloads and infrastructure grow in complexity, comprehensive real-time observability stacks, along with structured diagnostic and node vetting protocols, will be essential for ensuring both scientific productivity and operational robustness.
The convergence of ML and HPC is reshaping both research and operational paradigms. Through architectural evolution, workflow and scheduling innovation, nuanced performance modeling, and operational best practices, modern HPC centers are increasingly able to serve the dynamic, heterogeneous demands of ML workloads while preserving reliability, usability, and efficiency.