Persistent Kernel Solutions

Updated 23 April 2026

A persistent kernel solution is a hardware–software approach that uses a single, long-lived GPU kernel to manage multiple computational tasks with minimal overhead.
It enhances performance by enabling fine-grained scheduling, on-device synchronization, and efficient reuse of on-chip resources like registers and shared memory.
Persistent kernel strategies are important in domains such as deep learning inference and iterative solvers, as demonstrated by systems such as MPK, PERKS, and GPUOS.

A persistent kernel solution refers to a class of hardware–software methodologies that utilize a single, long-lived ("persistent") kernel to execute many computational tasks, as opposed to repeatedly launching short-lived kernels for each operation or iteration. The persistent kernel paradigm eliminates or radically minimizes host-to-device launch overhead, enables fine-grained scheduling and resource management, and exposes opportunities for software pipelining, synchronization, and memory locality that are not accessible to per-kernel models. Persistent kernel strategies have emerged primarily in the context of modern GPU and heterogeneous accelerator systems, where kernel launch latencies can dominate or substantially degrade the performance of small or tightly coupled tasks. This encyclopedic entry surveys the theoretical underpinnings, architectural instantiations, algorithmic frameworks, optimization techniques, and empirical results associated with persistent kernel solutions across representative domains.

1. Origins and Motivation

Persistent kernel solutions arose in response to inefficiencies observed in GPU programming models in several compute-intensive domains, particularly deep learning inference, scientific stencils, and iterative solvers. In traditional CUDA-style programming, the host repeatedly launches many short-lived kernels, each handling only one subtask of a larger computational workflow. This execution model incurs launch overheads on the order of several microseconds per kernel and flushes on-chip state (registers, shared memory) between launches, which degrades data locality and impedes the overlap of compute and communication (Zhang et al., 2022; Cheng et al., 22 Dec 2025; Yang et al., 20 Apr 2026).

The persistent kernel paradigm executes an entire computational pipeline within a single GPU kernel invocation. By keeping the executing kernel alive, persistent kernel solutions amortize or eliminate launch overheads, enable explicit on-device synchronization, allow the reuse of register and shared-memory buffers across steps, and provide a substrate for advanced operator fusion and dynamic scheduling. Architectures such as MPK (Mirage Persistent Kernel), GPUOS, and PERKS embody these principles at different levels of abstraction and for diverse workloads (Cheng et al., 22 Dec 2025; Yang et al., 20 Apr 2026; Zhang et al., 2022).
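The amortization argument can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only: the function names and the numbers (a 5 µs launch, 2 µs of compute per step, a 1 µs on-device barrier) are assumptions chosen for the example, not measurements from the cited systems, and the model ignores locality and overlap effects.

```python
def host_driven_time_us(n_steps: int, compute_us: float, launch_us: float) -> float:
    """One kernel launch per step: every step pays the launch overhead."""
    return n_steps * (launch_us + compute_us)


def persistent_time_us(n_steps: int, compute_us: float, launch_us: float,
                       sync_us: float) -> float:
    """One launch in total; consecutive steps are separated by an on-device barrier."""
    return launch_us + n_steps * compute_us + (n_steps - 1) * sync_us


# Hypothetical parameters, chosen only to illustrate the trend.
N_STEPS, COMPUTE_US, LAUNCH_US, SYNC_US = 1000, 2.0, 5.0, 1.0

host_total = host_driven_time_us(N_STEPS, COMPUTE_US, LAUNCH_US)
persistent_total = persistent_time_us(N_STEPS, COMPUTE_US, LAUNCH_US, SYNC_US)
print(f"host-driven: {host_total} us, persistent: {persistent_total} us")
```

Under this simple model the persistent variant wins whenever the on-device barrier is cheaper than a launch, and the gap widens as the per-step compute shrinks relative to the launch cost.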

2. Core Architectural Models

There are multiple realized architectures for persistent kernel solutions, each tailored to distinct computational motifs and workload requirements.

SM-Level Mega-Kernel/MPK

Mirage Persistent Kernel (MPK) (Cheng et al., 22 Dec 2025) introduces a compiler and runtime that transform a high-level dataflow graph into a "mega-kernel" with explicit task decomposition at the level of streaming multiprocessors (SMs). The kernel itself is persistent: it orchestrates the entire execution as fine-grained tiled tasks assigned to SM-local FIFOs, which are synchronized via event queues and scheduled in a decentralized manner. The kernel remains active and handles all computation and synchronization without returning control to the host until completion.
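The scheduling idea can be sketched on the CPU as a toy simulation. This is not MPK's implementation: the function name, the round-robin task assignment, and the use of a Python set as a stand-in for signalled events are all assumptions made for illustration, and the workers take turns in a single thread rather than running concurrently.

```python
from collections import deque


def run_mega_kernel(tasks, deps, num_workers):
    """Toy model of per-worker FIFOs with event-style dependency checks.

    tasks: task names in topological order.
    deps:  dict mapping a task name to the names that must finish first.
    Returns the execution trace as a list of (worker_id, task_name).
    """
    # Static round-robin assignment to per-worker FIFOs
    # (stand-ins for SM-local task queues).
    fifos = [deque() for _ in range(num_workers)]
    for i, name in enumerate(tasks):
        fifos[i % num_workers].append(name)

    done = set()   # stand-in for events that have been signalled
    trace = []
    while len(done) < len(tasks):
        progressed = False
        for wid, fifo in enumerate(fifos):
            # A worker only runs the head of its own queue, and only once
            # every event that task waits on has been signalled.
            if fifo and all(d in done for d in deps.get(fifo[0], [])):
                name = fifo.popleft()
                trace.append((wid, name))
                done.add(name)
                progressed = True
        if not progressed:
            raise RuntimeError("no runnable task: input is not in topological order")
    return trace


# Hypothetical four-task graph: one load feeds two matmuls, which feed a reduce.
example_tasks = ["load", "matmul_0", "matmul_1", "reduce"]
example_deps = {
    "matmul_0": ["load"],
    "matmul_1": ["load"],
    "reduce": ["matmul_0", "matmul_1"],
}
example_trace = run_mega_kernel(example_tasks, example_deps, num_workers=2)
print(example_trace)
```

Because each FIFO preserves topological order, the earliest unfinished task is always at the head of some queue with its dependencies satisfied, so the loop makes progress without any central scheduler.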

Iterative Loop-in-Kernel/PERKS

PERKS (Zhang et al., 2022) focuses on iterative solvers (e.g., PDE stencils, Krylov subspace methods), transforming the conventional host-driven time-stepping loop into an in-kernel loop in which device-wide synchronization (grid.sync() from CUDA cooperative groups) serves as an intra-kernel barrier. This approach exploits the fact that state reused between iterations can remain on-chip, which reduces redundant read/write traffic to device memory.
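The loop-in-kernel structure can be illustrated with a CPU-side analog. In the sketch below, Python threads stand in for thread blocks and threading.Barrier stands in for grid.sync(); the 1D Jacobi stencil, the function names, and the double-buffering scheme are assumptions chosen for the example. It shows only the control structure and does not model the on-chip caching of state.

```python
import threading


def jacobi_host_driven(initial, n_steps):
    """Reference: the time loop lives outside, one full sweep per 'launch'."""
    cur = list(initial)
    for _ in range(n_steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1])
        cur = nxt
    return cur


def jacobi_persistent(initial, n_steps, num_workers):
    """Workers are started once; the time loop runs inside each worker."""
    n = len(initial)
    bufs = [list(initial), list(initial)]      # double buffer, fixed boundaries
    barrier = threading.Barrier(num_workers)   # CPU stand-in for grid.sync()
    chunk = (max(n - 2, 0) + num_workers - 1) // num_workers

    def worker(wid):
        lo = 1 + wid * chunk
        hi = min(lo + chunk, n - 1)
        src, dst = 0, 1
        for _ in range(n_steps):
            cur, nxt = bufs[src], bufs[dst]
            for i in range(lo, hi):
                nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1])
            # Every worker must finish step t before anyone starts step t + 1.
            barrier.wait()
            src, dst = dst, src

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs[n_steps % 2]   # buffer written by the final step


# Hypothetical input: a unit step that diffuses over 25 iterations.
grid = [0.0] * 8 + [1.0] * 8
assert jacobi_persistent(grid, 25, num_workers=3) == jacobi_host_driven(grid, 25)
```

A single barrier per step is enough here because step t reads one buffer and writes the other, so no worker can overwrite values that a slower worker still needs.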

Persistent Worker Pool/GPUOS

GPUOS (Yang et al., 20 Apr 2026) maintains a fleet of persistent worker threads inside a single kernel that polls host-managed work queues. Operators are dispatched and executed "in place," with new operator