
CUDA Scheduler Kernel

Updated 20 April 2026
  • A CUDA scheduler kernel is a software or firmware component that manages the assignment, ordering, and resource sharing of CUDA kernel launches on NVIDIA GPUs.
  • Scheduling techniques such as SRTF and kernel slicing improve resource utilization and throughput over the default FIFO policy while addressing fairness and latency issues.
  • Innovative runtime prediction, adaptive scheduling, and synchronization methods demonstrate substantial performance gains in multiprogrammed and diverse workloads.

A CUDA scheduler kernel is a scheduling component (implemented in hardware or firmware, embedded in the driver, or provided by a runtime system) that manages the assignment, ordering, and resource sharing of CUDA kernel launches or thread block dispatches on modern NVIDIA GPUs. The architecture of CUDA GPUs presents unique challenges for scheduling concurrent work, using device resources efficiently, ensuring fairness, and supporting kernel interoperability. Over the last decade, a rich body of research has investigated the design, modeling, and mechanisms of scheduler kernels for CUDA in the context of throughput computing, multiprogramming, deep learning, and sparse workloads.

1. Default CUDA Scheduler Architecture and FIFO Policy

The baseline scheduling policy used by the CUDA driver on Fermi and Kepler architectures is a first-in, first-out (FIFO) thread block scheduler (TBS). In the FIFO design, kernel launches issued by host code are queued in launch order. The scheduler dispatches all thread blocks of the earliest-launched grid to available Streaming Multiprocessors (SMs) before dispatching any blocks of subsequent kernels. There is no preemption, backfilling, or block-level interleaving: dispatch priority is determined solely by launch order.

This approach has several critical limitations:

  • Throughput randomness: System throughput varies widely with the launch sequence, because shorter kernels can be forced to wait for long-running kernels even when ample resources would permit overlapping execution.
  • No kernel-level preemption: High-priority or latency-sensitive kernels can be delayed indefinitely behind computationally intensive ones.
  • Potential unfairness: FIFO can result in thread block starvation and priority inversion, with no policy to guarantee turnaround time or proportional sharing.

This non-preemptive, launch-ordered scheduling introduces substantial inefficiencies for shared devices and multiprogrammed workloads (Pai et al., 2014).
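
The sensitivity to launch order can be illustrated with a minimal sketch. It assumes, as a deliberate simplification, that each kernel occupies the whole device so kernels run strictly back to back; the function names and runtimes are hypothetical and not taken from the cited work.

```python
def fifo_turnaround(runtimes):
    """Per-kernel turnaround times when kernels run strictly in launch order.

    Simplifying assumption: every kernel fills the device, so there is
    no overlap between kernels and all are launched at time 0.
    """
    finish = 0.0
    turnaround = []
    for t in runtimes:
        finish += t              # a kernel starts only after all earlier ones finish
        turnaround.append(finish)
    return turnaround


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical runtimes: one long kernel (100 units) and one short kernel (1 unit).
long_first = fifo_turnaround([100.0, 1.0])
short_first = fifo_turnaround([1.0, 100.0])

print(mean(long_first))   # 100.5
print(mean(short_first))  # 51.0
```

The same two kernels yield roughly half the mean turnaround time when the short one happens to be launched first, which is the effect that shortest-job-style policies try to obtain regardless of launch order.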

2. Runtime Prediction and Scheduling via the Staircase Model

To address deficiencies in FIFO scheduling, advanced scheduler kernels incorporate runtime prediction models and adaptive scheduling policies. The Structural Runtime Predictor leverages properties of the CUDA grid and SM execution resources:

  • Given a grid with $B$ blocks and $N_{\rm SM}$ SMs, each SM receives $N = \lceil B/N_{\rm SM} \rceil$ blocks.
  • The maximum per-SM thread block residency is $R$ (from CUDA occupancy constraints).
  • Blocks execute in $\lceil N/R \rceil$ waves per SM; each wave executes $R$ blocks, and the per-block execution time is $t$.

The total predicted runtime for a kernel on one SM:

$$T = \left\lceil \frac{N}{R} \right\rceil t$$

By executing a single thread block and measuring $t$, the scheduler can estimate the full kernel's runtime very early in its execution; on a broad set of workloads, the predictions fall within $[0.6, 1.2]$ times the actual kernel runtime.

This online profiling is run per-kernel, per-SM, and is continuously updated as blocks are dispatched and completed. Such predictive modeling is the cornerstone for informed, work-aware, and fair scheduling (Pai et al., 2014).
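
The runtime formula of this section can be written as a short function. This is a minimal sketch of the arithmetic only; the function name, parameter names, and example numbers are assumptions for illustration, and a real predictor would refresh the per-block time as blocks complete.

```python
import math


def staircase_runtime(num_blocks, num_sms, residency, block_time):
    """Predicted kernel runtime on one SM: T = ceil(N / R) * t.

    num_blocks -- B, total thread blocks in the grid
    num_sms    -- N_SM, number of streaming multiprocessors
    residency  -- R, max resident blocks per SM (from occupancy limits)
    block_time -- t, measured execution time of a single block
    """
    blocks_per_sm = math.ceil(num_blocks / num_sms)   # N = ceil(B / N_SM)
    waves = math.ceil(blocks_per_sm / residency)      # ceil(N / R)
    return waves * block_time


# Hypothetical kernel: 1000 blocks, 14 SMs, 8 resident blocks per SM, t = 2.0
print(staircase_runtime(1000, 14, 8, 2.0))  # 18.0
```

With these numbers each SM receives 72 blocks, which execute in 9 waves of at most 8 blocks, giving a prediction of 18 time units.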

3. Preemptive and Adaptive Scheduling Policies: SRTF and SRTF/Adaptive

Replacing FIFO, the Shortest Remaining Time First (SRTF) scheduler is enabled by online predictions:

  • At each scheduling point, thread blocks of the kernel with the least predicted remaining runtime are dispatched, potentially overtaking running kernels with long estimated durations.
  • This per-block SRTF is preemptive only at block-issue boundaries: the scheduler can switch kernels between block dispatches, but blocks that are already running are not interrupted.
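
The selection rule can be sketched as a small simulation. It assumes a single dispatch slot that runs one block at a time and a per-block time that is already known for each kernel; both are simplifications, and the function name and workload are hypothetical rather than the scheduler described by Pai et al.

```python
def srtf_dispatch(kernels):
    """Return the order in which blocks are dispatched under per-block SRTF.

    kernels -- list of (name, arrival_time, num_blocks, per_block_time).
    Assumption: one block runs at a time, and a running block is never
    interrupted; the choice is re-evaluated only between blocks.
    """
    remaining = {name: blocks for name, _, blocks, _ in kernels}
    info = {name: (arrival, t) for name, arrival, _, t in kernels}
    clock = 0.0
    order = []
    while any(remaining.values()):
        ready = [n for n in remaining if remaining[n] > 0 and info[n][0] <= clock]
        if not ready:
            # Idle until the next kernel arrives.
            clock = min(info[n][0] for n in remaining if remaining[n] > 0)
            continue
        # Predicted remaining runtime = blocks left * per-block time.
        chosen = min(ready, key=lambda n: remaining[n] * info[n][1])
        remaining[chosen] -= 1
        clock += info[chosen][1]
        order.append(chosen)
    return order


# A long kernel arrives first; a short kernel arrives at time 2.
order = srtf_dispatch([("long", 0.0, 6, 1.0), ("short", 2.0, 2, 1.0)])
print(order)
# ['long', 'long', 'short', 'short', 'long', 'long', 'long', 'long']
```

Once the short kernel arrives, its blocks are issued ahead of the remaining blocks of the long kernel, which is the block-boundary preemption described above.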

For fairness, the SRTF/Adaptive policy constrains the resource (SM occupancy) allocation of concurrently running kernels to maximize fairness, not just throughput, by throttling dominant kernels and boosting laggards until service levels converge.

Empirical evaluation on representative benchmarks demonstrates:

  • STP (system throughput) improvements of 1.18× (SRTF) and 1.12× (SRTF/Adaptive) over FIFO,
  • ANTT (average normalized turnaround time) improvements up to 2.25×,
  • Fairness improvements up to 2.95× with adaptive allocation,
  • SRTF reaches throughput within 12.64% of the Oracle SJF policy, bridging half of the FIFO-SJF gap (Pai et al., 2014).

4. Slicing, Co-Scheduling, and Markov Modeling: The Kernelet Approach

Beyond per-block policies, Kernelet introduces kernel slicing and co-scheduling:

  • Each kernel is dynamically partitioned into "slices," contiguous block ranges forming small sub-kernels.
  • At run-time, the scheduler identifies slices from different kernels to co-schedule, tuning slice sizes per kernel such that the aggregate occupancy approaches the hardware's peak.
  • Slice launches incur launch overhead, which is empirically bounded by limiting slice sizes to maintain less than 2% overhead per device.

Index rectification in the generated PTX/SASS enables slices to execute with correct block indices. The scheduling decision is guided by a Markov chain performance model, characterizing instruction and memory resource usage (PUR and MUR), and a greedy search for slice size ratios that balance execution times of co-scheduled slices (Zhong et al., 2013).
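
The arithmetic behind slicing and index rectification can be sketched as follows. Kernelet performs the rectification in generated PTX/SASS; this sketch only illustrates the index mapping, and the function names and sizes are hypothetical.

```python
def make_slices(grid_size, slice_size):
    """Split block indices [0, grid_size) into contiguous (offset, count) slices."""
    return [
        (offset, min(slice_size, grid_size - offset))
        for offset in range(0, grid_size, slice_size)
    ]


def rectified_block_index(slice_offset, local_block_idx):
    """Map a block index inside a slice back to its index in the original grid."""
    return slice_offset + local_block_idx


slices = make_slices(grid_size=10, slice_size=4)
print(slices)  # [(0, 4), (4, 4), (8, 2)]

# Every original block index is covered exactly once across all slices.
covered = [
    rectified_block_index(offset, i)
    for offset, count in slices
    for i in range(count)
]
print(covered == list(range(10)))  # True
```

Each slice can then be launched as its own small sub-kernel, with the offset added to the block index so the computation matches what the unsliced kernel would have produced.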

Kernelet yields significant throughput improvements, up to 31.1% on the Tesla C2050 and 23.4% on the GTX680, demonstrating the efficacy of flexible resource slicing and stochastic modeling for kernel scheduling.

5. Fine-Grained Synchronization and Tile-Level Scheduling: cuSync

For chain or DAG-style kernel dependencies, tile-level scheduler kernels are exemplified by cuSync:

  • Kernels are divided into tiles (thread blocks), with dependencies between tiles of producer and consumer kernels expressed as a relation in a DAG.
  • Fine-grained synchronization relies on semaphore arrays in device global memory, with flexible policies: per-tile (TileSync), per-row (RowSync), and strided (StridedSync).
  • Producer tiles atomically increment semaphores on completion; consumer tiles busy-wait (at block granularity) on their specific dependencies.
  • A compiler front-end (cusyncGen) automates code generation for dependency, synchronization policy, and tile launch order.

By merging compute waves (producer and consumer) at the tile level, rather than kernel or stream boundaries, utilization is maximized and effective GPU occupancy approaches theoretical limits, with realized speedups up to 15–22% in large-scale models such as MegatronLM GPT-3, LLaMA, ResNet-38, and VGG-19 (Jangda et al., 2023). The essential trade-off is between synchronization overhead (atomic operations, SM cycle burns in spin-waits) and overlap benefit: fine-grained (TileSync) gives maximum overlap with maximum synchronization cost, while coarser (RowSync) reduces overhead at the cost of underutilization.
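
The per-tile handoff can be mimicked on the host with threads. This is a sketch of the signalling pattern only, not cuSync's API: a Python list stands in for the semaphore array in global memory, a lock stands in for atomic increments, and the tile computations and all names are hypothetical.

```python
import threading
import time


def run_tilesync(num_tiles):
    """Simulate per-tile producer/consumer synchronization (TileSync-like)."""
    semaphores = [0] * num_tiles      # stand-in for the device semaphore array
    lock = threading.Lock()           # stand-in for atomic operations
    produced = [None] * num_tiles
    consumed = [None] * num_tiles

    def producer_tile(i):
        produced[i] = i * i           # hypothetical producer computation
        with lock:
            semaphores[i] += 1        # signal that tile i is complete

    def consumer_tile(i):
        while True:                   # busy-wait on this tile's one dependency
            with lock:
                if semaphores[i] >= 1:
                    break
            time.sleep(0)             # yield so other threads can progress
        consumed[i] = produced[i] + 1 # hypothetical consumer computation

    # Consumers are started first to show that they wait for their producers.
    threads = [threading.Thread(target=consumer_tile, args=(i,)) for i in range(num_tiles)]
    threads += [threading.Thread(target=producer_tile, args=(i,)) for i in range(num_tiles)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return consumed


print(run_tilesync(4))  # [1, 2, 5, 10]
```

A coarser policy in the style of RowSync would share one counter among all tiles of a row, reducing the number of semaphore updates while making each consumer wait for the whole row.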

6. Dynamic Scheduling for Sparse and Iterative Workloads

Domain-specific scheduler kernels address irregular workloads:

  • AutoSAGE dynamically selects the best SpMM/SDDMM kernel variants for input-specific sparsity patterns and feature widths in sparse GNN workloads (Stankovic, 17 Nov 2025). It utilizes a roofline-style cost model, augmented by on-device micro-probes on random graph subsets to refine candidate selection, and includes guardrails for no-regression fallback and persistent cache for deterministic replay.
  • Kernel Batching with CUDA Graphs (Ekelund et al., 16 Jan 2025) applies batching and static CUDA graph unrolling for iterative applications, deriving an optimal batch size that balances graph creation overhead against intra-graph launch efficiency. Empirical results show up to 1.5× speedups, with optimal batch sizes of 50–150 iterations unaffected by kernel granularity.
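
The batch-size trade-off can be illustrated with a toy cost model. The model below is an assumption made for illustration and is not the model derived by Ekelund et al.: it supposes that graph creation cost grows linearly with the number of unrolled iterations and that each graph launch has a fixed cost. The constants are arbitrary.

```python
import math


def batching_overhead(batch_size, iterations, create_cost, launch_cost):
    """Toy overhead model (an assumption, not taken from the cited paper).

    One graph of `batch_size` unrolled iterations is created once, then
    launched ceil(iterations / batch_size) times.
    """
    launches = math.ceil(iterations / batch_size)
    return create_cost * batch_size + launch_cost * launches


def best_batch_size(iterations, create_cost, launch_cost):
    """Brute-force search for the batch size with the lowest modeled overhead."""
    return min(
        range(1, iterations + 1),
        key=lambda n: batching_overhead(n, iterations, create_cost, launch_cost),
    )


# Arbitrary illustrative constants: 1000 iterations, creation cost 1.0 per
# unrolled iteration, launch cost 10.0 per graph launch.
print(best_batch_size(1000, 1.0, 10.0))  # 100
```

Under this toy model, very small batches pay for many graph launches and very large batches pay for graph construction, so the minimum lies in between.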

7. Scheduler Kernel Trade-offs, Overheads, and Implementation Considerations

The selection and configuration of CUDA scheduler kernels involve nuanced and hardware-dependent trade-offs:

  • Prediction overhead: Online profiling (SRTF) introduces overhead only in the first block but yields large systematic gains.
  • Launch cost: Kernel slicing and batching (Kernelet, graph batching) must balance per-slice or per-batch launch overhead against the benefits of finer control and co-scheduling.
  • Asynchrony and resource sharing: The DAG-based runtime scheduler (GrCUDA) exposes asynchronous, resource-partitioned scheduling without user intervention, achieving up to 44% speedup versus synchronous baseline and nearly matching manual CUDA Graphs performance (Parravicini et al., 2020).
  • Synchronization granularity: Finer-grained tile-based policies (cuSync) can maximize overlap, but synchronization and atomic operation costs can be substantial and must be tuned to workload structure.
  • Hardware integration: Most proposed schedulers operate entirely at the software/kernel launch level; block-to-SM mapping remains in hardware control, and block-level preemption/priority is not hardware-exposed in current generations.

Performance modeling, resource utilization, and observed overhead factors are summarized and compared across approaches in Table 1.

Table 1. Comparison of scheduler approaches.

| Scheduler | Policy Type | Predictive Modeling | Overhead | Reported Speedup |
|---|---|---|---|---|
| FIFO (CUDA default) | FIFO, non-preempt. | None | None | Baseline |
| SRTF / SRTF/Adaptive | Preemptive, dyn. | Online per-block, grid | First-block timing | 1.12–2.25× ANTT, fairness (Pai et al., 2014) |
| Kernelet | Slicing + Markov | PUR/MUR, Markov chain | Slice launch, code rectification | 23–31% throughput (Zhong et al., 2013) |
| cuSync | Tile sync, DAG | DAG, semaphore handoff | Atomics, spin-wait | 14–22% throughput, LLMs (Jangda et al., 2023) |
| Kernel Batching (Graph) | Batch+static graph | Empirical throughput | Batch creation | up to 1.5× (Ekelund et al., 16 Jan 2025) |

The field continues to evolve with increasing device concurrency, hardware support for multi-tenancy, and rising demands from machine learning and cloud applications.


References:

  • (Pai et al., 2014) Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels
  • (Zhong et al., 2013) Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
  • (Jangda et al., 2023) A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
  • (Ekelund et al., 16 Jan 2025) Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs
  • (Stankovic, 17 Nov 2025) AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention
  • (Parravicini et al., 2020) DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime
