Mirage Persistent Kernel (MPK) for GPU Inference
- Mirage Persistent Kernel (MPK) is an end-to-end compiler and runtime that transforms tensor programs into a single, high-performance mega-kernel executing across GPU SMs.
- Its SM-level graph abstraction and event-driven scheduling enable efficient cross-operator pipelining, overlapping computation and communication to reduce latency.
- The system’s compiler pipeline, task superoptimization, and decentralized asynchronous runtime facilitate fine-grained parallelism, approaching hardware memory-bandwidth limits.
Mirage Persistent Kernel (MPK) is an end-to-end compiler and runtime system that transforms a tensor program, such as an LLM inference graph, into a single, high-performance "mega-kernel" that executes across all GPU streaming multiprocessors (SMs) without intermediate kernel launches. MPK introduces a streaming-multiprocessor-level (SM-level) graph abstraction that captures computation and communication dependencies at tile granularity, enabling GPU optimizations that were previously impractical, including cross-operator software pipelining and fine-grained kernel overlap. The system comprises three parts: a compiler pipeline that lowers tensor graphs to SM-level task graphs with event-driven synchronization, a task superoptimizer that produces a CUDA device function for each tile, and a fully asynchronous in-kernel parallel runtime with decentralized scheduling across SMs. Empirical evaluation shows that MPK reduces end-to-end inference latency by up to 1.7× compared to established kernel-per-operator systems, approaching the lower bound set by hardware memory bandwidth. MPK is available at https://github.com/mirage-project/mirage (Cheng et al., 22 Dec 2025).
1. SM-Level Graph Representation
MPK models an entire tensor computation, covering both arithmetic and communication, as a directed acyclic SM-level graph. Each node denotes a task: one tile of a tensor operator (e.g., a matrix multiplication block, an AllReduce tile, or an attention slice) that executes on a single SM. Each edge represents a fine-grained data dependency, meaning that the output of the source task is consumed by the destination task.
Synchronization across tasks is expressed by augmenting this graph into a bipartite graph of tasks and events. Each task has one dependent event, which must fire before the task executes, and one trigger event, which is notified when the task completes. Each event is parameterized by an activation counter, the number of prerequisite tasks that must notify it, and an index interval giving the contiguous range of tasks that depend on the event in a linearized ordering.
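The task and event records described above can be sketched in a few lines of Python. This is an illustrative model, not MPK's actual data layout: the class names, field names, the half-open interval convention, and the `complete` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    activation_count: int  # number of prerequisite tasks that must notify
    first_task: int        # start of dependent-task range (inclusive, assumed)
    last_task: int         # end of dependent-task range (exclusive, assumed)
    arrived: int = 0       # notifications received so far

    def notify(self) -> bool:
        """Record one finished prerequisite; return True when the event fires."""
        self.arrived += 1
        return self.arrived == self.activation_count


@dataclass
class Task:
    name: str
    dependent_event: Optional[int]  # event that must fire first (None: ready)
    trigger_event: Optional[int]    # event notified on completion (None: sink)


# Hypothetical graph: two MatMul tiles feed two AllReduce tiles via one event.
tasks = [
    Task("matmul_tile_0", None, 0),
    Task("matmul_tile_1", None, 0),
    Task("allreduce_tile_0", 0, None),
    Task("allreduce_tile_1", 0, None),
]
events = [Event(activation_count=2, first_task=2, last_task=4)]


def complete(task_index):
    """Finish a task and return the names of tasks its trigger event releases."""
    event_index = tasks[task_index].trigger_event
    if event_index is None:
        return []
    event = events[event_index]
    if event.notify():
        return [tasks[i].name for i in range(event.first_task, event.last_task)]
    return []


released_first = complete(0)   # event has 1 of 2 notifications, nothing released
released_second = complete(1)  # event fires and releases tasks 2 and 3
print(released_first, released_second)
```

Because the dependents of an event occupy a contiguous index range, the event needs only two integers to name them, whatever their number.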
Cross-operator pipelining is enabled by partitioning each task into preload and compute phases, managed via shared-memory token buffers per SM. This abstraction supports both intra-operator pipelining (e.g., tile-level parallelism within a matrix multiplication) and inter-operator pipelining (e.g., initiating AllReduce tiles as soon as upstream MatMul tiles complete), effectively overlapping data transfers and computation subject to shared-memory budget constraints.
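The benefit of splitting tasks into preload and compute phases can be illustrated with a small timing model. The sketch below assumes a single prefetch slot per SM, so the preload of a task may begin as soon as the previous task starts computing; the function names and the time units are hypothetical and are not taken from MPK.

```python
def sequential_time(preload, compute):
    """Each task finishes its preload and compute before the next one starts."""
    return sum(preload) + sum(compute)


def pipelined_time(preload, compute):
    """Two-stage software pipeline with one prefetch slot (an assumption).

    The preload of task i starts when task i-1 begins computing; the compute
    of task i starts once its preload and the previous compute are both done.
    """
    compute_start = preload[0]
    compute_end = compute_start + compute[0]
    for p, c in zip(preload[1:], compute[1:]):
        preload_end = compute_start + p  # runs alongside the previous compute
        compute_start = max(compute_end, preload_end)
        compute_end = compute_start + c
    return compute_end


preload = [2, 2, 2]  # hypothetical preload cost per task
compute = [3, 3, 3]  # hypothetical compute cost per task
print(sequential_time(preload, compute))  # 15
print(pipelined_time(preload, compute))   # 11
```

In this model every preload after the first is hidden behind the preceding compute phase. A larger shared-memory budget would allow deeper prefetching, which this one-slot sketch does not model.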
2. MPK Compiler Pipeline
The MPK compiler pipeline proceeds through six key stages:
| Stage | Purpose | Mechanism/Operation |
|---|---|---|
| 1. Operator Decomposition | Partition operators into SM-level tasks | Tiles per operator; balance ≈ number of SMs |
| 2. Dependency Analysis | Identify inter-task data dependencies | Insert event nodes and dependency edges |
| 3. Event Fusion | Reduce synchronization overhead | Merge events with matching predecessor/successor sets |
| 4. Graph Normalization | Ensure consistent event-task mapping | Dummy nodes for single dependent/trigger event |
| 5. Graph Linearization | Produce memory-efficient, cacheable execution order | BFS topological scan; contiguous consumer task indices |
| 6. Task Implementation Gen. | Superoptimize CUDA kernels for each SM task | Mirage search; output device functions with pipelining |
Operator decomposition divides each high-level tensor operator (e.g., matrix multiply) into SM-level tasks, each computing one output tile, with the number of tasks balanced against the number of SMs. Dependency analysis exhaustively enumerates task pairs, checks whether the tensor regions they touch overlap, and inserts events to enforce the resulting dependencies. Event fusion merges events with matching predecessor or successor sets to minimize synchronization overhead. Normalization ensures that each task has exactly one dependent event and one trigger event, inserting dummy nodes where necessary. Graph linearization places tasks in an array in which each event’s consumer tasks occupy a contiguous range, enabling efficient scheduling and compact storage. Finally, each task is superoptimized into a CUDA device function that orchestrates its prefetch and compute stages with register reuse and efficient memory-access patterns.
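The dependency-analysis, event-fusion, and linearization stages can be sketched on a toy example. This is a simplified model under stated assumptions, not the MPK compiler: regions are one-dimensional half-open row ranges, the graph has only two operator levels, every consumer ends up in exactly one event (so no normalization is needed), and all function and task names are hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """True when half-open row ranges a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def analyze_dependencies(producers, consumers):
    """Enumerate every (producer, consumer) pair whose regions overlap."""
    return [(p, c)
            for p, written in producers.items()
            for c, read in consumers.items()
            if overlaps(written, read)]


def fuse_events(edges):
    """Start from one event per edge, then merge events by matching sets."""
    # Pass 1: events with the same consumer set share one producer set.
    by_consumers = defaultdict(set)
    for p, c in edges:
        by_consumers[frozenset([c])].add(p)
    # Pass 2: events with the same producer set share one consumer set.
    by_producers = defaultdict(set)
    for cons, prods in by_consumers.items():
        by_producers[frozenset(prods)] |= cons
    return sorted((sorted(p), sorted(c)) for p, c in by_producers.items())


def linearize(producers, events):
    """Lay out tasks so each event's consumers form a contiguous index range."""
    order = list(producers)
    intervals = []
    for _, cons in events:
        start = len(order)
        order.extend(cons)
        intervals.append((start, len(order)))
    return order, intervals


# Hypothetical tiles: two MatMul tiles write rows, four tasks read rows.
producers = {"matmul_0": (0, 4), "matmul_1": (4, 8)}
consumers = {"add_0": (0, 2), "add_1": (2, 4), "add_2": (4, 8), "norm": (0, 8)}

edges = analyze_dependencies(producers, consumers)
events = fuse_events(edges)
order, intervals = linearize(producers, events)
print(len(edges), events)
print(order, intervals)
```

Here five per-edge events collapse to three fused events, and the activation count of each fused event is the size of its producer list (2 for the event guarding `norm`).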
3. In-Kernel Parallel Runtime and Decentralized Scheduling
Once the mega-kernel is launched, all SMs remain active until inference completes. Execution is divided between two roles: workers (one per SM) and schedulers (four scheduler warps per SM). Each worker maintains local Just-In-Time (JIT) and Ahead-Of-Time (AOT) task queues; schedulers service a global event queue.
- Just-In-Time (JIT): Tasks are enqueued as their dependent events fire, accommodating control-flow variability (e.g., per-sequence attention length in LLMs).
- Ahead-Of-Time (AOT): Tasks are enqueued in advance; workers spin-wait on event counters, reducing enqueue synchronization.
Operators are labeled JIT or AOT based on their data-dependence characteristics, with data-dependent layers (e.g., attention) marked as JIT up to a barrier, then proceeding AOT downstream. Fine-grained pipelining, such as overlapping AllReduce communication as soon as a corresponding MatMul tile is ready, allows for compute-communication overlap and additional speedup.
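The difference between the two dispatch modes can be seen in a single-threaded simulation of one worker queue. This is only a sketch under simplifying assumptions: the real runtime runs many workers and schedulers concurrently on the GPU, whereas here a plain integer increment stands in for the atomic counter update, re-queuing a task stands in for spin-waiting, and the `run` function and task names are hypothetical. It also assumes an acyclic graph in which every event eventually fires.

```python
from collections import deque


def run(tasks, activation, mode):
    """Simulate one worker queue.

    tasks: list of (name, dependent_event, trigger_event); None means no event.
    activation: dict mapping each event to its activation count.
    mode: "jit" enqueues a task only when its dependent event fires;
          "aot" enqueues every task up front and polls the event counter.
    """
    counters = {event: 0 for event in activation}
    waiting = {event: [] for event in activation}
    queue = deque()
    for index, (_, dep, _) in enumerate(tasks):
        if mode == "aot" or dep is None:
            queue.append(index)
        else:
            waiting[dep].append(index)

    executed, runtime_enqueues = [], 0
    while queue:
        index = queue.popleft()
        name, dep, trig = tasks[index]
        if dep is not None and counters[dep] < activation[dep]:
            queue.append(index)  # AOT only: counter not ready, poll again later
            continue
        executed.append(name)
        if trig is not None:
            counters[trig] += 1  # stands in for atomicAdd on the event counter
            if mode == "jit" and counters[trig] == activation[trig]:
                queue.extend(waiting[trig])
                runtime_enqueues += len(waiting[trig])
    return executed, runtime_enqueues


tasks = [
    ("matmul_0", None, "e0"),
    ("matmul_1", None, "e0"),
    ("allreduce_0", "e0", None),
]
activation = {"e0": 2}
print(run(tasks, activation, "jit"))
print(run(tasks, activation, "aot"))
```

Both modes execute the tasks in a dependency-respecting order. The JIT run performs one enqueue at run time, while the AOT run performs none and pays instead by polling the counter whenever a task reaches the head of the queue too early.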
Synchronization primitives are limited to atomicAdd on event counters, circular buffer-based queues, and thread block-local lightweight barriers—removing the need for global locks or roundtrip synchronization with the host (Cheng et al., 22 Dec 2025).
4. Performance Modeling and Key Results
Kernel-per-operator systems exhibit a cumulative latency of $\sum_i T_i$ plus kernel launch barriers, where each $T_i$ is the runtime of one operator. MPK’s tile-level pipelining overlaps part of each operator with its neighbors, so the overall latency falls below that sum by the overlapped portion, producing up to $1.7\times$ end-to-end speedup.
Selected results (Cheng et al., 22 Dec 2025):
- On single-GPU LLM inference (0.6B–30B parameters), MPK outperforms vLLM and SGLang on A100, H100, and B200 GPUs.
- Example: Qwen3-8B on A100: per-token latency drops from 14.5 ms (vLLM) to 12.5 ms (MPK), approaching the memory-bandwidth bound of ~10 ms.
- On 8×H100 tensor-parallel runs, MPK achieves higher throughput than optimized PyTorch with CUDA Graphs.
- For Mixture-of-Experts (MoE) models, MPK’s fused gather–GEMM and workload balancer accelerate MoE layers relative to standard library implementations.
By fusing all operators and eliminating kernel launches, MPK narrows the gap to the hardware memory-bandwidth bound, leaving little room for further software-only optimization without hardware changes.
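The Qwen3-8B figures quoted above can be turned into a quick worked calculation. The three latencies come from the text (the bound is stated only approximately); the derived quantities and variable names are illustrative and are not reported metrics.

```python
vllm_ms = 14.5   # per-token latency with vLLM (from the text)
mpk_ms = 12.5    # per-token latency with MPK (from the text)
bound_ms = 10.0  # approximate memory-bandwidth bound (from the text)

# Speedup of MPK over vLLM on this configuration.
speedup = vllm_ms / mpk_ms

# How far MPK still sits above the bandwidth bound, as a fraction of the bound.
overhead_vs_bound = mpk_ms / bound_ms - 1

# Share of the vLLM-to-bound gap that MPK removes.
gap_closed = (vllm_ms - mpk_ms) / (vllm_ms - bound_ms)

print(f"speedup: {speedup:.2f}x")
print(f"above bound: {overhead_vs_bound:.0%}")
print(f"gap closed: {gap_closed:.0%}")
```

For this data point the speedup is about 1.16×, which is consistent with the headline "up to 1.7×" figure being a maximum across models and GPUs rather than a typical value.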
5. Distinctive Features and Impact on GPU Programming Models
MPK achieves end-to-end kernel fusion with minimal developer effort while maintaining the flexibility of traditional tensor programming models. Its SM-level abstraction permits cross-operator software pipelining and fine-grained overlap, exposing both intra- and inter-operator parallelism. The use of event-driven decentralized scheduling avoids performance bottlenecks typical of global locks or host-device synchronization. The persistent kernel model is especially amenable to workloads exhibiting pipeline and communication overlap, such as LLM inference, MoE, and tensor-parallel operations.
A key distinction from prior GPU runtime approaches is MPK’s unification of the entire inference graph into one persistent mega-kernel, nullifying kernel launch overhead and introducing a single point of dynamic control via in-kernel scheduling logic. This design paradigm brings end-to-end GPU inference throughput and latency close to the theoretical limits imposed by hardware (Cheng et al., 22 Dec 2025).
6. Availability and Prospects
MPK is publicly available and open source, enabling empirical replication and extension of its compiler and runtime design. The system advances the state of the art in GPU-based model serving, compiler design for deep learning, and persistent-kernel runtimes. By shrinking the idle gaps between computation and communication through hardware-efficient scheduling, MPK provides a blueprint for large-scale model inference that operates near hardware efficiency limits (Cheng et al., 22 Dec 2025).