
GPU Worker: Architecture & Scheduling

Updated 17 April 2026
  • A GPU Worker is a compute entity that executes GPGPU workloads such as scientific computing, machine learning inference, and high-performance grid tasks across varied architectures.
  • Depending on the system, it is scheduled explicitly (grid middleware) or bound late and dynamically (serverless environments) to maximize hardware utilization and reduce latency.
  • Advanced designs use interference-aware schedulers, two-tier memory management, and fine-grained load balancing, the last exemplified by Stream-K.

A GPU Worker is a logical or physical compute entity responsible for executing general-purpose GPU (GPGPU) workloads, such as scientific computing, machine learning inference, and high-performance grid tasks. GPU Workers can be realized in varied system architectures, ranging from high-throughput grid middleware with explicit scheduling and resource discovery to serverless execution environments with dynamic swapping, late binding, and fine-grained load balancing. The GPU Worker concept underpins contemporary strategies for maximizing hardware utilization, lowering latency, and providing flexible access for heterogeneous and large-scale computational workloads.

1. GPU Worker Architectures and Resource Discovery

In grid and cluster environments, a GPU Worker must be discoverable and schedulable through middleware that exposes physical GPU devices to higher-level orchestration tools. For instance, ARC’s Compute Element (CE) exposes resources via a multi-stage information provider pipeline: CEinfo.pl probes the local resource management system (LRMS)—such as SLURM—collects raw GPU resource information (using commands like sinfo -a -h -o "gresinfo=%G"), and propagates it into a unified GLUE2 data model. This discovery mechanism encodes GPU devices, types (e.g., K80, V100), and relevant static flags (e.g., Multi-Process Service, exclusive allocation) into a standard XML representation for downstream consumers (Isacson et al., 2019).
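
As a rough illustration of the first stage of this pipeline, the sketch below parses the output of the sinfo command quoted above into per-type GPU counts. It is written in Python rather than ARC's Perl, and the function names, the assumed GRES layout (name[:type]:count with an optional parenthesised suffix), and the sample strings are assumptions made for illustration, not ARC's actual provider code.

```python
import re
from collections import Counter


def parse_gres(gres_field: str) -> Counter:
    """Count GPUs per type from one SLURM-style GRES string.

    Assumed layout: comma-separated entries "name[:type]:count", each with an
    optional parenthesised suffix, e.g. "gpu:k80:2,gpu:v100:4(S:0-1)".
    """
    gpus = Counter()
    # Remove parenthesised suffixes such as "(S:0-1)" and the "(null)" placeholder.
    cleaned = re.sub(r"\(.*?\)", "", gres_field)
    for entry in cleaned.split(","):
        parts = entry.strip().split(":")
        if parts[0] != "gpu":
            continue  # empty entry or a non-GPU generic resource
        fields = parts[1:]
        count = 1
        if fields and fields[-1].isdigit():
            count = int(fields.pop())
        gpu_type = fields[0] if fields else "unknown"
        gpus[gpu_type] += count
    return gpus


def gpus_from_sinfo(output: str, prefix: str = "gresinfo=") -> Counter:
    """Sum GRES values over all sinfo output lines that carry the prefix.

    Note: this adds up the per-line values as printed; it does not multiply
    by the number of nodes behind each line.
    """
    total = Counter()
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            total += parse_gres(line[len(prefix):])
    return total
```

A provider built along these lines would then map each (type, count) pair into the data model it publishes; that mapping is outside the scope of this sketch.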

In serverless environments, the GPU Worker concept is abstracted to nodes comprising GPU pools, host-resident model repositories, intra-node routers, and per-request function containers. Here, hardware details are managed implicitly by a software stack that dynamically binds models, intercepts CUDA API calls, and orchestrates GPU runtime sharing (Yu et al., 2023).

2. GPU Worker Scheduling and Job Binding

Job binding mechanisms depend on system design and intended workload. ARC-based clusters leverage explicit job submission flows, where the user request is tagged to a GPU-specific RuntimeEnvironment (RTE), resulting in job scripts with directives like #SBATCH --gres=gpu:k80:1 for SLURM-managed environments. This enables deterministic binding of jobs to desired GPU types or partitions, with resources surfaced via the GeneralResources section in GLUE2 XML (Isacson et al., 2019).
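
The explicit binding flow can be sketched as a small template step that turns a requested runtime environment into a SLURM directive. The RTE names in the table and both function names are hypothetical; only the `#SBATCH --gres=gpu:<type>:<count>` directive form comes from the text above.

```python
# Hypothetical mapping from RTE names to SLURM GPU types (illustrative only).
RTE_TO_GPU_TYPE = {
    "ENV/GPU/K80": "k80",
    "ENV/GPU/V100": "v100",
}


def render_gpu_job(command: str, gpu_type: str, gpu_count: int = 1,
                   job_name: str = "gpu-job") -> str:
    """Return a minimal SLURM batch script requesting GPUs of one type."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpu_type}:{gpu_count}",
        "",
        command,
        "",
    ]
    return "\n".join(lines)


def render_for_rte(command: str, rte: str, gpu_count: int = 1) -> str:
    """Resolve the (hypothetical) RTE name to a GPU type, then render the script."""
    try:
        gpu_type = RTE_TO_GPU_TYPE[rte]
    except KeyError:
        raise ValueError(f"no GPU type registered for RTE {rte!r}") from None
    return render_gpu_job(command, gpu_type, gpu_count)
```

Rejecting unknown RTE names up front keeps the binding deterministic: a job either targets a declared GPU type or is not submitted at all.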

In serverless systems such as FaaSwap, the scheduling policy is dynamically computed: when a request arrives, the controller assigns the request to an available GPU Worker based on workload, model residency (host or device), and swap requirements. Late binding ensures models are only loaded onto a GPU when required, and requests are scheduled to maximize Service Level Objective (SLO) compliance while minimizing swap-induced interference (Yu et al., 2023).
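
A minimal sketch of such a residency-aware decision is shown below. It is not FaaSwap's algorithm: the latency model (queueing time plus an optional swap cost plus execution time), the `Worker` fields, and the function names are simplifying assumptions chosen to show how model residency and swap cost can enter the choice of worker.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_ms: float                            # estimated wait until the GPU is free
    resident: set = field(default_factory=set)  # models already in device memory


def estimate_latency(worker: Worker, model: str,
                     exec_ms: float, swap_ms: float) -> float:
    """Estimated completion time; a swap is charged only if the model is not resident."""
    swap = 0.0 if model in worker.resident else swap_ms
    return worker.queued_ms + swap + exec_ms


def pick_worker(workers, model: str, exec_ms: float,
                swap_ms: float, slo_ms: float):
    """Greedily pick the worker with the lowest estimate.

    Returns (worker, meets_slo) so the caller can decide what to do with
    requests that are predicted to miss their SLO.
    """
    best = min(workers, key=lambda w: estimate_latency(w, model, exec_ms, swap_ms))
    latency = estimate_latency(best, model, exec_ms, swap_ms)
    return best, latency <= slo_ms
```

Under this model a busier worker that already holds the model can still win over an idle one when the swap cost dominates, which is the trade-off the paragraph above describes.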

3. GPU Resource and Memory Management

Resource metrics in grid middleware are generally static, capturing device type, quantity, allocation flags, and associated memory (e.g., high-bandwidth memory, hbm:16G). There is no native support for dynamic utilization reporting. Additional instrumentation, such as probes for real-time GPU usage (via nvidia-smi or NVML), must be integrated with care to keep monitoring overhead low (Isacson et al., 2019).

Modern serverless GPU Workers require sophisticated memory management. All models reside in main memory and are swapped to GPU device memory on demand. Memory allocation policies employ pre-allocation (cudaMalloc for the full device size), fixed-size slot management for typical tensors, and buddy allocators for irregular large chunks. A per-model mapping from host to device pointer addresses is maintained for address translation. Eviction policy often employs a two-tier LRU mechanism, distinguishing between “heavy” (latency-impacting) and “light” models to optimize for both memory locality and PCIe/NVLink bandwidth usage (Yu et al., 2023).
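
The two-tier eviction idea can be sketched as two LRU queues over device-resident models. The source does not spell out which tier is evicted first; the version below assumes "light" models go first because they are cheaper to swap back in, and the class name, method name, and megabyte accounting are likewise illustrative assumptions.

```python
from collections import OrderedDict


class TwoTierLRU:
    """Tracks device-resident models in separate LRU queues for light and heavy models."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        # Each OrderedDict maps model name -> size, oldest entry first.
        self.tiers = {"light": OrderedDict(), "heavy": OrderedDict()}

    def touch(self, model: str, size_mb: int, heavy: bool) -> list:
        """Record a use of `model`, loading it if needed; return evicted model names."""
        tier = self.tiers["heavy" if heavy else "light"]
        if model in tier:
            tier.move_to_end(model)  # already resident: refresh recency only
            return []
        if size_mb > self.capacity_mb:
            raise ValueError("model does not fit in device memory")
        evicted = []
        while self.used_mb + size_mb > self.capacity_mb:
            # Assumption: evict least-recently-used light models before heavy ones.
            victims = self.tiers["light"] or self.tiers["heavy"]
            name, freed_mb = victims.popitem(last=False)
            self.used_mb -= freed_mb
            evicted.append(name)
        tier[model] = size_mb
        self.used_mb += size_mb
        return evicted
```

A real worker would also need the slot and buddy allocators and the host-to-device address map described above; this sketch covers only the choice of eviction victim.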

4. Work Scheduling and Load Balancing Abstractions

Fine-grained load balancing is a critical function of the GPU Worker, especially for irregular-parallel workloads and dynamic computational graphs. Abstracting work as “atoms” (the smallest units) and “tiles” (collections of atoms) enables static (thread-, group-, or work-oriented) or dynamic (work-stealing, queue-based) scheduling. Programmable interfaces expose work partitioning as a first-class concern, decoupling scheduling from the core computation (Osama, 2022).

Static scheduling assigns work before execution begins; it suits regular workloads but leaves processors idle on skewed or irregular problems. Dynamic schemes use centralized or distributed work queues, atomics, and work-stealing or work-donation to adapt at runtime, balancing load across thread blocks and entire devices. A well-chosen schedule enables near-peak utilization across both regular and irregular workloads.
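
The difference between the two families can be seen in a small host-side simulation. The sketch below is an illustration in Python, not GPU code: it compares the makespan (finish time of the slowest worker) of an up-front contiguous split against a shared queue from which idle workers pull the next atom. The function names and the cost lists are made up for the example.

```python
import heapq


def static_makespan(costs, workers: int) -> int:
    """Assign contiguous, equal-count chunks of atoms to workers up front."""
    if not costs:
        return 0
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, workers: int) -> int:
    """Simulate a shared queue: whichever worker becomes idle first takes the next atom."""
    finish = [0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)
```

With one expensive atom followed by eight cheap ones on three workers, the static split finishes at time 10 while the queue finishes at time 8; on a uniform workload both give the same result, and the static split avoids the queue's synchronization cost, which this simulation does not model.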

A notable instantiation is Stream-K, a work-centric parallelization of matrix multiplication that divides the aggregate loop iterations evenly among GPU worker threads or cooperative thread arrays (CTAs). This approach achieves higher and more consistent utilization across tens of thousands of problem shapes than traditional tile-based decompositions, as evidenced by geometric-mean and peak speedups over industry-standard libraries (CUTLASS, cuBLAS) (Osama, 2022).
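
The even division of iterations can be sketched on the host as follows. This is a simplified illustration of the partitioning idea only, not the Stream-K kernel: the function name and the (tile, k_begin, k_end) segment representation are assumptions, and the reduction needed to combine partial results for tiles split between CTAs is left out.

```python
def stream_k_partition(num_tiles: int, iters_per_tile: int, num_ctas: int):
    """Split num_tiles * iters_per_tile loop iterations evenly across CTAs.

    Returns one list per CTA of (tile, k_begin, k_end) segments, where
    [k_begin, k_end) is the range of inner-loop iterations of that tile
    handled by the CTA.
    """
    total = num_tiles * iters_per_tile
    plan = []
    for cta in range(num_ctas):
        begin = cta * total // num_ctas
        end = (cta + 1) * total // num_ctas
        segments = []
        it = begin
        while it < end:
            tile = it // iters_per_tile
            k_begin = it % iters_per_tile
            # Stop at the end of this tile or at the end of the CTA's range.
            k_end = min(iters_per_tile, k_begin + (end - it))
            segments.append((tile, k_begin, k_end))
            it += k_end - k_begin
        plan.append(segments)
    return plan
```

For 3 tiles of 4 iterations on 2 CTAs, each CTA receives 6 iterations and the middle tile is shared between them. A one-tile-per-CTA decomposition of the same problem would need two waves, with one CTA idle during the second.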

5. Runtime Sharing and Interference-Aware Schedulers

GPU Worker runtimes in advanced serverless systems employ shared CUDA contexts, cuDNN, and cuBLAS handles, eliminating per-request cold-start and enabling true multi-tenancy. Asynchronous API redirection batches non-blocking runtime library calls, further reducing communication and scheduling overheads. The scheduling algorithms can account for swap status, peer-to-peer copy capabilities (NVLink), and memory constraints, making greedy, interference-aware decisions to maximize request SLO adherence. Schedulers prioritize requests dynamically based on observed compliance rates and remaining required “on-time” executions, adjusting queue priorities in a TCP-like fashion (Yu et al., 2023).

6. Performance Characteristics and Evaluation

Performance analysis of GPU Worker systems documents the importance of low information-provider latency (tens of milliseconds for resource discovery) and the ability to avoid “all-or-nothing” queueing in grid systems. Schedulers that expose GPU resources enable predictable job turnaround, avoiding non-determinism of generic queues (Isacson et al., 2019).

Serverless GPU Workers that use pipelined PCIe/NVLink swaps and late model binding scale to 1,000+ concurrent function instances while maintaining high SLO compliance (≈100%) and throughput up to 10× greater than native early binding. Even with host-to-GPU swaps, pipelined implementations achieve execution within 20% of single-model bound performance, and asynchronous runtime remoting can surpass native performance by fusing configuration calls onto host CPUs (Yu et al., 2023).

For irregular and compute-bound matrix problems, work-centric scheduling in Stream-K yields geometric-mean speedups of approximately 1.6× (mixed precision), with peaks up to 14.7×, over the best tile-based approaches. Cost models and hierarchical scheduling abstractions further improve adaptability to dynamic workloads (Osama, 2022).

7. Design Implications and Recommendations

Best practices for GPU Worker deployment include:

  • Treating GPUs as general resources in scheduler models, without conflating them with CPU partitioning.
  • Defining explicit RTEs for every supported GPU type to ensure users can target the needed accelerators.
  • Avoiding integration of dynamic utilization probes into latency-sensitive discovery hot-paths; instead, offload such monitoring to side-channel pollers or periodic cache updates (Isacson et al., 2019).
  • Employing modular, hierarchical scheduling layers with clear APIs to support static, dynamic, and hybrid work partitioning that can be configured or auto-tuned using analytic cost models (Osama, 2022).
  • Adopting late binding of models, interference-aware scheduling, and memory-efficient slotting, which are foundational for ultra-dense, low-latency serverless GPU inference and for high-throughput batch GPGPU workloads (Yu et al., 2023).

Several directions are highlighted for future GPU Worker designs: multi-tiered memory hierarchies (host–disk–GPU), support for concurrent kernel execution via spatial multiplexing (MIG, MPS), and first-class, device-side load-balancing APIs (Osama, 2022; Yu et al., 2023).
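
The recommendation above to keep dynamic utilization probes off the discovery hot path can be sketched as a cached reading that only a background poller refreshes. The class name, the time-to-live parameter, and the injectable clock are assumptions made for illustration; the probe callable stands in for whatever nvidia-smi or NVML query a site actually uses.

```python
import time


class CachedProbe:
    """Holds the last probe result so hot-path readers never run the probe themselves."""

    def __init__(self, probe, ttl_s: float, clock=time.monotonic):
        self._probe = probe    # expensive callable, e.g. a wrapper around a GPU query
        self._ttl_s = ttl_s    # age after which a reading is reported as stale
        self._clock = clock    # injectable for testing
        self._value = None
        self._stamp = None

    def refresh(self) -> None:
        """Run the expensive probe; intended to be called from a side-channel poller."""
        self._value = self._probe()
        self._stamp = self._clock()

    def read(self):
        """Hot-path read: returns (value, is_fresh) without ever calling the probe."""
        fresh = (self._stamp is not None
                 and self._clock() - self._stamp <= self._ttl_s)
        return self._value, fresh
```

Returning a freshness flag instead of blocking lets the discovery path publish static information immediately and mark utilization as stale when the poller has fallen behind.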

The architecture, scheduling strategies, and memory management policies of GPU Workers determine scalability, utilization, and service predictability for both traditional HPC and emerging serverless inference contexts.

References (3)
