GPU-Affinity-Aware Scheduling

Updated 8 December 2025
  • GPU-affinity-aware scheduling is a strategy that assigns tasks to GPUs based on data locality, minimizing cold-start delays and redundant data transfers.
  • It employs algorithmic techniques such as greedy selection, dual approximation, and compiler-guided methods to optimize resource allocation and throughput.
  • Integration with memory management and cluster topology yields measurable performance gains by lowering communication overhead and enhancing overall system efficiency.

GPU-affinity-aware scheduling denotes a class of algorithmic and systems methods that explicitly incorporate GPU–workload affinity metrics when assigning computation to GPU devices in heterogeneous environments. “Affinity” here refers to the extent to which a workload or model (such as an LLM invocation, a deep learning training task, or a compute kernel) can benefit from a specific GPU’s local data (parameters, memory state) or hardware topology. By leveraging affinity, these schedulers minimize data movement, reduce cold-start latency, maximize throughput, and deliver predictable performance in systems ranging from single-node multi-GPU machines to distributed clusters and serverless platforms.

1. Principles and Formulation of GPU Affinity

Affinity-based scheduling exploits spatial and temporal locality to minimize high-latency data transfers when assigning work to GPUs. The core principle is: select the GPU where the task’s requirements (parameters, intermediate states, memory blocks) most closely coincide with what is already present or most quickly accessible on that device.

In serverless LLM platforms, affinity is quantified by the fraction $A(m,g) = R_{m,g} / S_m$, where $S_m$ is the total size of model $m$ and $R_{m,g}$ is the size of $m$'s parameters already resident on GPU $g$. Scheduling aims to minimize the expected cold-start loading time

$$t_{\mathrm{load}}(m,g) = \alpha_m \cdot S_m \cdot \bigl(1 - A(m,g)\bigr) / B_s,$$

where $B_s$ is the PCIe bandwidth and $\alpha_m$ is a model-specific latency-sensitivity parameter. The scheduler solves $\min_g t_{\mathrm{load}}(m,g)$ for each request, biasing decisions toward GPUs with maximal parameter reuse (Zhu et al., 1 Dec 2025).
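
A minimal sketch of this selection rule is shown below. The residency bookkeeping, the 32 GB/s PCIe bandwidth, and the default $\alpha_m = 1$ are illustrative assumptions, not Tangram's actual implementation.

```python
# Greedy, affinity-based GPU choice implied by the formulas above (illustrative sketch).

def affinity(resident_bytes: float, model_bytes: float) -> float:
    """A(m, g) = R_{m,g} / S_m, clamped to [0, 1]."""
    return min(resident_bytes / model_bytes, 1.0)

def load_time(model_bytes: float, aff: float, pcie_bw: float, alpha: float = 1.0) -> float:
    """t_load(m, g) = alpha_m * S_m * (1 - A(m, g)) / B_s, in seconds."""
    return alpha * model_bytes * (1.0 - aff) / pcie_bw

def pick_gpu(model: str, model_bytes: float,
             resident: dict, gpus: list, pcie_bw: float = 32e9) -> int:
    """Return the GPU minimizing expected cold-start loading time for `model`."""
    def cost(g):
        aff = affinity(resident.get((model, g), 0.0), model_bytes)
        return load_time(model_bytes, aff, pcie_bw)
    return min(gpus, key=cost)

# Example: a 14 GB model with half of its parameters already resident on GPU 1.
resident = {("llama-13b", 1): 7e9}
print(pick_gpu("llama-13b", 14e9, resident, gpus=[0, 1, 2, 3]))  # -> 1
```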

Variations in cluster scheduling introduce alternative affinity measures, such as network-tier locality in DDL clusters, where affinity is quantified by placement on the same NVSwitch node, rack, or inter-rack link, thereby minimizing communication time and synchronization overhead (Sharma et al., 29 Jan 2024).

2. Algorithmic Approaches and Heuristics

Affinity-aware scheduling is addressed through several algorithmic frameworks:

  • Greedy per-request selection: Directly chooses, for each incoming workload, the GPU with the highest immediate affinity (maximal $R_{m,g}$ or minimal transfer time). This method is efficient for batch or streaming settings where lookup cost is dominated by request scale (Zhu et al., 1 Dec 2025).
  • Dual approximation and affinity grouping: Constructs schedules via a two-phase dual approximation, first greedily packing tasks by affinity up to a fraction $\alpha\lambda$ of a guessed makespan, then assigning the remainder via a balancing phase. This delivers performance guarantees (makespan $\leq (2+\alpha)\,\mathrm{OPT}$), with affinity grouping minimizing data movement and improving scalability in dense, multi-GPU systems (Bleuse et al., 2014).
  • Queueing-theoretic optimal placement: Formulates the general throughput-optimal policy as a nonlinear integer optimization. Task–resource affinity is encoded in the service-rate matrix $\mu_{ij}$; maximizing system throughput $X_{\mathrm{sys}}$ requires persistent scheduling in the affinity-maximizing state $S_{\mathrm{max}}$. Efficient heuristics (MAP for throughput, MIS for priority constraints) reassign tasks based on the sensitivity metric $D_{t,j}$, achieving near-optimal placement with polynomial complexity (Chen et al., 2017).
  • Compiler-guided task construction: Statically or dynamically extracts resource-requirement vectors $(M_t, TB_t, W_t)$ for each GPU task. This enables resource-aware assignment (memory safety, SM/core packing) and affinity-aware packing across GPUs, using either exact SM simulation or fast warp-based heuristics for task placement (Chen et al., 2021).
  • Delay scheduling with auto-tuning: For cluster workloads, the scheduler delays acceptance of suboptimal placement offers (e.g., on remote racks), allowing jobs to wait preferentially for high-affinity (proximal) resources. Delay timers are auto-tuned from observed wait histories, adapting to temporal and demand variation (Sharma et al., 29 Jan 2024); a simplified sketch follows this list.
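
The sketch below illustrates delay scheduling with an auto-tuned wait budget, in the spirit of the consolidation mechanism above. The class, the Offer structure, and the median-based tuning rule are assumptions made for illustration, not the exact algorithm of (Sharma et al., 29 Jan 2024).

```python
from collections import deque
from dataclasses import dataclass
from statistics import median

@dataclass
class Offer:
    locality: int  # 0 = same NVSwitch node, 1 = same rack, 2 = inter-rack

class DelayScheduler:
    def __init__(self, history_len: int = 64, init_delay_s: float = 30.0):
        self.wait_history = deque(maxlen=history_len)  # waits observed before local offers arrived
        self.max_delay_s = init_delay_s                # current auto-tuned delay budget

    def on_offer(self, offer: Offer, waited_s: float) -> bool:
        """Accept (True) or defer (False) a placement offer for a waiting job."""
        if offer.locality == 0:
            # High-affinity offer: accept immediately and record how long we waited for it.
            self.wait_history.append(waited_s)
            self._retune()
            return True
        # Low-affinity offer: keep waiting unless the job has exhausted its delay budget.
        return waited_s >= self.max_delay_s

    def _retune(self):
        # Auto-tuning: derive the delay budget from recent waits for high-affinity offers,
        # so the threshold tracks cluster contention instead of a fixed administrator knob.
        self.max_delay_s = 2.0 * median(self.wait_history)

sched = DelayScheduler()
print(sched.on_offer(Offer(locality=2), waited_s=5.0))   # False: defer, keep waiting for a local slot
print(sched.on_offer(Offer(locality=0), waited_s=12.0))  # True: accept and update the delay budget
```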

3. Integration with Memory Reuse and Cluster Topology

Affinity-aware scheduling is tightly coupled with memory management subsystems. In Tangram, the unified GPU memory pool enables tensor-level parameter sharing across models, while the on-demand KV cache allocator enforces dynamic memory safety. The scheduler consults real-time memory-residency maps to select the optimal device. ElasticKV supports cost-aware evictions and merges, ensuring that per-GPU assignments respect both affinity and resource constraints (Zhu et al., 1 Dec 2025).
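
The following toy sketch illustrates the cost-aware eviction idea: among models resident on a GPU, evict the one whose expected reload penalty (reload time weighted by its recent request rate) is smallest. Field names, request rates, and the PCIe bandwidth figure are illustrative assumptions, not ElasticKV's actual policy.

```python
def eviction_penalty(resident_bytes: float, req_per_s: float, pcie_bw: float = 32e9) -> float:
    """Expected cost of evicting a model: time to re-load its resident bytes, scaled by demand."""
    return req_per_s * (resident_bytes / pcie_bw)

def pick_victim(residents: dict) -> str:
    """Choose the resident model whose eviction is expected to hurt affinity least."""
    return min(residents,
               key=lambda m: eviction_penalty(residents[m]["resident_bytes"],
                                              residents[m]["req_per_s"]))

gpu0_residents = {
    "llama-13b":  {"resident_bytes": 14e9, "req_per_s": 0.50},
    "mistral-7b": {"resident_bytes": 7e9,  "req_per_s": 0.05},
}
print(pick_victim(gpu0_residents))  # -> "mistral-7b": rarely requested and cheap to re-load
```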

In distributed clusters, affinity metrics reflect not only memory residency but also physical topology. Dally’s scheduler computes offer locality levels based on network-tier distance, prioritizing placements that minimize communication overhead (NVSwitch → rack → inter-rack). Consolidation (delayed acceptance) and network-sensitive job preemption further skew placements toward affinity, optimizing both throughput and tail latency under variable network congestion (Sharma et al., 29 Jan 2024).
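
A minimal sketch of computing such a locality level for a candidate multi-GPU placement follows, assuming each worker GPU is labeled with node and rack coordinates; the data layout is an assumption for illustration, not Dally's interface. A lower level corresponds to higher placement affinity.

```python
from typing import NamedTuple

class GpuLoc(NamedTuple):
    node: str  # NVSwitch-connected host
    rack: str

def locality_level(workers: list) -> int:
    """0 = all GPUs share an NVSwitch node, 1 = same rack, 2 = spans racks."""
    if len({w.node for w in workers}) == 1:
        return 0
    if len({w.rack for w in workers}) == 1:
        return 1
    return 2

placement = [GpuLoc("node-a", "rack-1"), GpuLoc("node-b", "rack-1")]
print(locality_level(placement))  # -> 1: same rack, but crosses the NVSwitch boundary
```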

4. Performance Evaluation and Comparative Results

GPU-affinity-aware scheduling demonstrates substantial performance improvements across several domains:

| System / Framework | Performance Metric | Affinity-Aware Gains |
|---|---|---|
| Tangram (Zhu et al., 1 Dec 2025) | Time-to-first-token (LLM cold start), load time | 23–55% TTFT reduction, up to 6.2× faster load |
| DADA (Bleuse et al., 2014) | GFLOPS, host–GPU transfer volume | Up to 2.3% higher throughput, 50–75% less data movement |
| Dally (Sharma et al., 29 Jan 2024) | Cluster makespan, JCT, communication overhead | 69% faster makespan, 83% lower JCT, up to 98% lower communication overhead |
| Compiler-guided (Chen et al., 2021) | Throughput, turnaround, kernel slowdown | Up to 2.7× throughput, 4.9× faster turnaround, <3% slowdown per kernel |
| MAP/MIS (Chen et al., 2017) | System throughput ($X_{\mathrm{sys}}$), energy-delay | Within 0.3% of optimum, 46% priority-error reduction |

Ablation experiments in Tangram show that affinity scheduling reduces 99th-percentile tail latencies by 8–54% relative to random assignment and remains robust under high traffic, confirming the centrality of affinity in modern multi-GPU resource management (Zhu et al., 1 Dec 2025). DADA’s affinity grouping is crucial for minimizing data movement, enabling scaling beyond two GPUs by concentrating tasks near their data (Bleuse et al., 2014). In cluster environments, Dally’s adaptive scheduling achieves dramatic reductions in queuing delay and communication bottlenecks (Sharma et al., 29 Jan 2024).

5. Limitations, Trade-Offs, and Contextual Adaptation

GPU-affinity-aware scheduling faces several practical and theoretical constraints:

  • Lookup and control-plane overhead: Per-request affinity queries may incur nontrivial RPC cost (approximately 16 ms in Tangram), impacting ultra-low-latency applications (Zhu et al., 1 Dec 2025).
  • Temporal locality: The method’s effectiveness depends on workload repeat patterns; low locality or large batch diversity forces cold loads, negating affinity advantages.
  • Fairness versus hot-spotting: Greedy affinity focus may starve cold models, causing load imbalances or memory pressure. Adaptive or look-ahead schedulers can mitigate this but at higher complexity (Zhu et al., 1 Dec 2025).
  • Topology heterogeneity: Cluster hardware with nonuniform bandwidth (PCIe, NVSwitch, Infiniband, Ethernet) requires affinity metrics that incorporate communication profiles and accurate bandwidth estimation (Sharma et al., 29 Jan 2024).
  • Scheduling complexity: Integer-program-based optimal policies scale polynomially in the number of resources and task types, but fine-grained priority enforcement or decentralized decision-making may introduce practical overheads (Chen et al., 2017).
  • Static analysis limitations: Compiler-guided frameworks must conservatively estimate dynamically allocated resources, leading to possible under-utilization (Chen et al., 2021).

Operational adaptation is critical. Auto-tuned delays in cluster scheduling remove brittle, administrator-set thresholds, learning appropriate high-affinity waiting tolerances in real time. Compiler guidance at the task level exposes multi-resource demands that are otherwise unavailable to runtime-only heuristics.

6. Broader Context and Future Directions

GPU-affinity-aware scheduling intersects with the broader literature on multi-resource scheduling, memory management, and distributed system placement. It is complementary to priority-aware scheduling, energy-delay optimization, and multi-tenant isolation techniques.

Affinity is a unifying metric in both single-node and cluster-wide scheduling. Dual-approximation methods, sensitivity-based placement, and locality-aware delay mechanisms collectively extend GPU-affinity concepts across heterogeneous systems of varying scale (Bleuse et al., 2014, Chen et al., 2017, Sharma et al., 29 Jan 2024). Compiler-guided frameworks hint at the utility of hybrid static/dynamic resource-aware scheduling, while unified memory pools suggest direct synergy between data orchestrators and placement engines.

Future directions include generalizing affinity metrics to encompass not only data residency but also compute migratability, flexibility in partitioning (e.g., NVIDIA MIG), and integration with cloud-scale orchestration layers, as well as building affinity-aware preemption and QoS for multi-priority workloads.

7. Significance in Modern Compute Systems

GPU-affinity-aware scheduling is central to current and emerging multi-GPU architectures, including serverless LLM platforms, distributed deep learning clusters, and cloud resource orchestration. Its demonstrated efficacy in reducing cold-start and communication bottlenecks, together with its formal near-optimality in task–device assignment under a variety of constraints, marks it as a cornerstone for scalable AI and HPC deployments (Zhu et al., 1 Dec 2025, Sharma et al., 29 Jan 2024, Chen et al., 2017, Bleuse et al., 2014, Chen et al., 2021).

The continuous evolution of heterogeneous compute platforms and workload diversity will further emphasize the importance of affinity-aware approaches for future scheduling and resource management research.
