
Mirage Persistent Kernel (MPK) for GPU Inference

Updated 2 July 2026
  • Mirage Persistent Kernel (MPK) is an end-to-end compiler and runtime that transforms tensor programs into a single, high-performance mega-kernel executing across GPU SMs.
  • Its SM-level graph abstraction and event-driven scheduling enable efficient cross-operator pipelining, overlapping computation and communication to reduce latency.
  • The system’s compiler pipeline, task superoptimization, and decentralized asynchronous runtime facilitate fine-grained parallelism, approaching hardware memory-bandwidth limits.

Mirage Persistent Kernel (MPK) is an end-to-end compiler and runtime system designed to transform a tensor program, such as an LLM inference graph, into a single, high-performance "mega-kernel" that executes over all GPU streaming multiprocessors (SMs) without intermediate kernel launches. MPK introduces an SM-level graph abstraction that captures computation and communication dependencies at tile granularity, enabling previously impractical GPU optimizations, including cross-operator software pipelining and fine-grained kernel overlap. The MPK system comprises a compiler pipeline that lowers tensor graphs to SM-level task graphs with event-driven synchronization, a task superoptimizer that produces CUDA device functions for each tile, and a fully asynchronous in-kernel parallel runtime with decentralized scheduling across SMs. Empirical evaluation demonstrates that MPK reduces end-to-end inference latency by up to 1.7× compared to established kernel-per-operator systems, approaching the lower bound set by hardware memory bandwidth. MPK is available at https://github.com/mirage-project/mirage (Cheng et al., 22 Dec 2025).

1. SM-Level Graph Representation

MPK models entire tensor computations, encompassing both arithmetic and communication, as a directed acyclic SM-level graph $G = (V, E)$, where each node $t \in V$ denotes a task corresponding to a tile of a tensor operator (e.g., a matrix multiplication block, AllReduce tile, or attention slice) that executes on a single SM. Edges $(t_i, t_j) \in E$ represent fine-grained data dependencies: the output of task $t_i$ is consumed by $t_j$.

Synchronization across tasks is expressed by augmenting $G$ to a bipartite graph $\mathcal{G} = (T, \mathcal{E})$, with tasks $T$ and events $\mathcal{E}$. Each task $t$ has a dependent event (which must fire before the task executes) and a trigger event (which is notified when the task completes). Each event $e \in \mathcal{E}$ is parameterized by an activation counter, representing the number of prerequisite tasks, and an index interval indicating the contiguous range of tasks that depend on $e$ in a linearized ordering.
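The task and event structure described above can be sketched in a few lines of Python. This is only an illustrative model, not MPK's actual data layout: the names `Task`, `Event`, `run`, `first_task`, and `last_task` are hypothetical, and using `None` for "no event" is a simplification of the normalized form in which every task has exactly one dependent and one trigger event.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    required: int    # activation counter: number of prerequisite tasks
    first_task: int  # first consumer index in the linearized task array
    last_task: int   # one past the last consumer index
    count: int = 0   # notifications received so far

    def notify(self) -> bool:
        """Record one finished prerequisite; return True when the event fires."""
        self.count += 1
        return self.count == self.required


@dataclass
class Task:
    name: str
    dependent_event: Optional[int]  # None (assumption): runnable at start
    trigger_event: Optional[int]    # None (assumption): nothing waits on it


def run(tasks, events):
    """Execute tasks in dependency order and return the names in run order."""
    ready = deque(i for i, t in enumerate(tasks) if t.dependent_event is None)
    order = []
    while ready:
        task = tasks[ready.popleft()]
        order.append(task.name)
        if task.trigger_event is not None:
            event = events[task.trigger_event]
            if event.notify():
                # The contiguous index interval releases all consumers at once.
                ready.extend(range(event.first_task, event.last_task))
    return order


# Two MatMul tiles must both finish before one AllReduce tile may start.
tasks = [
    Task("matmul_tile_0", None, 0),
    Task("matmul_tile_1", None, 0),
    Task("allreduce_tile_0", 0, None),
]
events = [Event(required=2, first_task=2, last_task=3)]
order = run(tasks, events)
```

Storing consumers as an index interval rather than a list is what makes the later linearization step matter: an event only needs two integers to name every task it releases.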

Cross-operator pipelining is enabled by partitioning each task into preload and compute phases, managed via shared-memory token buffers per SM. This abstraction supports both intra-operator pipelining (e.g., tile-level parallelism within a matrix multiplication) and inter-operator pipelining (e.g., initiating AllReduce tiles as soon as upstream MatMul tiles complete), effectively overlapping data transfers and computation subject to shared-memory budget constraints.
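To make the benefit of splitting tasks into preload and compute phases concrete, the following sketch compares a fully sequential schedule with a simple pipelined one on a single SM. It assumes a toy model that is not taken from the source: each task is a `(preload, compute)` pair of time units, and at most one preload may be in flight while the previous task computes (a stand-in for the shared-memory budget). The function names are hypothetical.

```python
def sequential_latency(tasks):
    """Every task preloads, then computes, with no overlap."""
    return sum(preload + compute for preload, compute in tasks)


def pipelined_latency(tasks):
    """Overlap each task's compute with the next task's preload.

    Assumes only one preload buffer is available, so task i+1 can
    preload during task i's compute but no further ahead.
    """
    if not tasks:
        return 0
    total = tasks[0][0]  # the first preload cannot be hidden
    for (_, compute), (next_preload, _) in zip(tasks, tasks[1:]):
        total += max(compute, next_preload)
    return total + tasks[-1][1]  # the last compute has nothing to overlap


# Three tiles, each needing 2 units of preload and 3 units of compute.
tiles = [(2, 3), (2, 3), (2, 3)]
seq = sequential_latency(tiles)
pipe = pipelined_latency(tiles)
```

In this toy setting the pipelined schedule hides every preload except the first. The same reasoning applies across operator boundaries when the next tile belongs to a different operator.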

2. MPK Compiler Pipeline

The MPK compiler pipeline proceeds through six key stages:

| Stage | Purpose | Mechanism/Operation |
|---|---|---|
| 1. Operator Decomposition | Partition operators into SM-level tasks | Tiles per operator; balance ≈ number of SMs |
| 2. Dependency Analysis | Identify inter-task data dependencies | Insert event nodes and dependency edges |
| 3. Event Fusion | Reduce synchronization overhead | Merge events with matching predecessor/successor sets |
| 4. Graph Normalization | Ensure consistent event-task mapping | Dummy nodes for single dependent/trigger event |
| 5. Graph Linearization | Produce memory-efficient, cacheable execution order | BFS topological scan; contiguous consumer task indices |
| 6. Task Implementation Generation | Superoptimize CUDA kernels for each SM task | Mirage search; output device functions with pipelining |

Operator decomposition divides each high-level tensor operator (e.g., matrix multiply) into SM-level tasks, each computing an output tile, with the tile count chosen to balance work against the number of SMs. Dependency analysis exhaustively checks task pairs for overlapping tensor regions and inserts events to manage the resulting dependencies. Event fusion merges events where possible to minimize synchronization overhead. Normalization ensures each task participates in exactly one dependent and one trigger event by inserting dummy nodes as necessary. Graph linearization places tasks in an array where each event’s consumer tasks occupy a contiguous range, enabling efficient scheduling and compact storage. Finally, each task is superoptimized, producing a CUDA device function that orchestrates its prefetch and compute stages with register reuse and efficient memory access pattern transformations.
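The first three stages can be illustrated with a deliberately small sketch. It assumes a one-dimensional tensor, half-open tile ranges, and a fusion rule that merges only events with identical predecessor sets; the real compiler works on multi-dimensional regions and also considers successor sets. All function names here are hypothetical.

```python
from collections import defaultdict


def tile_ranges(extent, tile):
    """Operator decomposition: split [0, extent) into half-open tiles."""
    return [(lo, min(lo + tile, extent)) for lo in range(0, extent, tile)]


def overlaps(a, b):
    """True if half-open ranges a and b share at least one element."""
    return a[0] < b[1] and b[0] < a[1]


def analyze(producer_tiles, consumer_tiles):
    """Dependency analysis: exhaustively test every producer/consumer pair."""
    return {
        j: frozenset(
            i for i, p in enumerate(producer_tiles) if overlaps(p, c)
        )
        for j, c in enumerate(consumer_tiles)
    }


def fuse_events(deps):
    """Event fusion: one event per distinct predecessor set."""
    groups = defaultdict(list)
    for consumer, producers in deps.items():
        groups[producers].append(consumer)
    return [(sorted(p), sorted(c)) for p, c in groups.items()]


# A producer writes 8 elements in tiles of 4; a consumer reads them in tiles of 2.
producers = tile_ranges(8, 4)
consumers = tile_ranges(8, 2)
deps = analyze(producers, consumers)
fused = fuse_events(deps)
```

Here four per-consumer events collapse into two, and each fused event's consumers already form a contiguous index range, which is the property the linearization stage is meant to guarantee in general.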

3. In-Kernel Parallel Runtime and Decentralized Scheduling

Once the mega-kernel is launched, all SMs remain active until inference completes. SMs are assigned one of two roles: worker (one worker per SM) or scheduler (four scheduler warps per SM). Each worker maintains local Just-In-Time (JIT) and Ahead-Of-Time (AOT) task queues, while schedulers handle a global event queue.

  • Just-In-Time (JIT): Tasks are enqueued as their dependent events fire, accommodating control-flow variability (e.g., per-sequence attention length in LLMs).
  • Ahead-Of-Time (AOT): Tasks are enqueued in advance; workers spin-wait on event counters, reducing enqueue synchronization.

Operators are labeled JIT or AOT based on their data-dependence characteristics, with data-dependent layers (e.g., attention) marked as JIT up to a barrier, then proceeding AOT downstream. Fine-grained pipelining, such as starting an AllReduce tile as soon as the corresponding MatMul tile is ready, overlaps computation with communication and yields additional speedup.
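The difference between the two dispatch modes can be sketched with a single simulated worker. This is an assumption-heavy illustration: the `Worker` class, its `step` method, the task and event names, and in particular the policy of polling the JIT queue before the AOT queue are all hypothetical and are not taken from the source.

```python
from collections import deque


class Worker:
    """Toy worker holding (task_name, dependent_event) pairs in two queues."""

    def __init__(self, aot_tasks):
        self.aot = deque(aot_tasks)  # filled before launch
        self.jit = deque()           # filled by a scheduler at run time

    def step(self, fired):
        """Run at most one task; return its name, or None if the worker waits."""
        if self.jit:
            # JIT tasks are enqueued only after their event fired: run at once.
            return self.jit.popleft()[0]
        if self.aot and self.aot[0][1] in fired:
            # AOT tasks were enqueued early; the worker checks the event itself.
            return self.aot.popleft()[0]
        return None  # models one iteration of spin-waiting


worker = Worker(aot_tasks=[("mlp_tile_0", "attn_done")])
fired = set()
trace = [worker.step(fired)]                    # AOT head not ready: wait
worker.jit.append(("attn_tile_0", "kv_ready"))  # scheduler enqueues a JIT task
trace.append(worker.step(fired))                # JIT task runs immediately
fired.add("attn_done")
trace.append(worker.step(fired))                # AOT head's event has fired
```

The trade-off visible here is that AOT dispatch needs no run-time enqueue but may spin, while JIT dispatch never spins on a task but pays for a scheduler enqueue.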

Synchronization primitives are limited to atomicAdd on event counters, circular buffer-based queues, and thread block-local lightweight barriers—removing the need for global locks or roundtrip synchronization with the host (Cheng et al., 22 Dec 2025).
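The combination of an atomic event counter and a circular-buffer queue can be emulated on the CPU as follows. This is a sketch under stated assumptions rather than MPK's implementation: Python threads stand in for SMs, a `threading.Lock` stands in for the hardware atomicity of `atomicAdd`, and the `EventCounter`, `RingQueue`, and `finish_task` names are hypothetical.

```python
import threading


class EventCounter:
    def __init__(self, required):
        self.required = required
        self._value = 0
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def atomic_add(self, delta=1):
        """Emulate atomicAdd: add delta and return the previous value."""
        with self._lock:
            old = self._value
            self._value += delta
            return old


class RingQueue:
    """Fixed-capacity circular buffer with monotonically growing indices."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._head = 0
        self._tail = 0
        self._lock = threading.Lock()

    def push(self, item):
        with self._lock:
            if self._tail - self._head == len(self._buf):
                raise OverflowError("queue full")
            self._buf[self._tail % len(self._buf)] = item
            self._tail += 1

    def pop(self):
        with self._lock:
            if self._head == self._tail:
                return None
            item = self._buf[self._head % len(self._buf)]
            self._head += 1
            return item


def finish_task(event, queue, consumers):
    """Called when a task completes; the last arrival releases the consumers."""
    if event.atomic_add(1) == event.required - 1:
        for task in consumers:
            queue.push(task)


event = EventCounter(required=4)
queue = RingQueue(capacity=8)
threads = [
    threading.Thread(target=finish_task, args=(event, queue, ["allreduce_tile_0"]))
    for _ in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()

released = []
while (item := queue.pop()) is not None:
    released.append(item)
```

Because `atomic_add` returns the previous value, exactly one finishing task observes `required - 1` and performs the enqueue, so no global lock or host round trip is needed to decide who releases the consumers.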

4. Performance Modeling and Key Results

Traditionally, kernel-per-operator systems exhibit a cumulative latency equal to the sum of the individual operator runtimes plus kernel launch barriers. MPK’s tile-level pipelining overlaps part of that work across operators, so its overall latency falls below this sum by an amount that depends on the achievable overlap, yielding the up-to-1.7× end-to-end latency reduction noted in the overview.

Selected results (Cheng et al., 22 Dec 2025):

  • On single-GPU LLM inference (0.6B–30B parameters), MPK outperforms vLLM and SGLang on A100, H100, and B200 GPUs.
    • Example: Qwen3-8B on A100: per-token latency drops from 14.5 ms (vLLM) to 12.5 ms (MPK), approaching the memory-bandwidth bound of ~10 ms.
  • On 8×H100 tensor-parallel runs, MPK achieves higher throughput than optimized PyTorch+CUDA-Graphs.
  • For Mixture-of-Experts (MoE) models, MPK’s fused gather–GEMM and workload balancer accelerate MoE layers relative to standard library implementations.
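The Qwen3-8B example above can be turned into a quick back-of-the-envelope check. The three latencies are the figures quoted in the list; the memory-bandwidth bound is given only approximately (~10 ms), so the derived percentages are rough, and the variable names are illustrative.

```python
vllm_ms = 14.5   # per-token latency reported for vLLM
mpk_ms = 12.5    # per-token latency reported for MPK
bound_ms = 10.0  # approximate memory-bandwidth lower bound

speedup = vllm_ms / mpk_ms                 # how much faster MPK is
vllm_overhead = vllm_ms / bound_ms - 1.0   # vLLM's excess over the bound
mpk_overhead = mpk_ms / bound_ms - 1.0     # MPK's excess over the bound
gap_closed = (vllm_ms - mpk_ms) / (vllm_ms - bound_ms)

print(f"speedup over vLLM: {speedup:.2f}x")
print(f"overhead above bound: vLLM {vllm_overhead:.0%}, MPK {mpk_overhead:.0%}")
print(f"fraction of the gap closed: {gap_closed:.0%}")
```

For this one model and GPU, MPK is about 1.16× faster than vLLM and sits roughly 25% above the approximate bound, compared with roughly 45% for vLLM.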

By fusing all operators and eliminating kernel launches, MPK narrows the gap to the hardware memory-bandwidth bound (12.5 ms against a bound of roughly 10 ms in the Qwen3-8B example), leaving little room for further software-only optimization without hardware changes.

5. Distinctive Features and Impact on GPU Programming Models

MPK achieves end-to-end kernel fusion with minimal developer effort while maintaining the flexibility of traditional tensor programming models. Its SM-level abstraction permits cross-operator software pipelining and fine-grained overlap, exposing both intra- and inter-operator parallelism. The use of event-driven decentralized scheduling avoids performance bottlenecks typical of global locks or host-device synchronization. The persistent kernel model is especially amenable to workloads exhibiting pipeline and communication overlap, such as LLM inference, MoE, and tensor-parallel operations.

A key distinction from prior GPU runtime approaches is MPK’s unification of the entire inference graph into one persistent mega-kernel, nullifying kernel launch overhead and introducing a single point of dynamic control via in-kernel scheduling logic. This design paradigm brings end-to-end GPU inference throughput and latency close to the theoretical limits imposed by hardware (Cheng et al., 22 Dec 2025).

6. Availability and Prospects

MPK is publicly available and open source, enabling empirical replication and extension of its compiler and runtime design. The system advances the state-of-the-art in GPU-based model serving, compiler design for deep learning, and persistent kernel-based runtimes. By reducing temporal gaps between computation and communication with hardware-efficient scheduling, MPK provides a blueprint for forthcoming innovations in large-scale model inference at hardware efficiency limits (Cheng et al., 22 Dec 2025).

References (1)
