Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Published 12 May 2026 in cs.CL | (2605.11581v1)

Abstract: When LLMs serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.


Authors (10)

Summary

The paper presents an adaptive MegaKernel pipeline that reduces kernel launch overhead for LLM inference using an automated DAG-based search to achieve millisecond-scale latency.
It employs MLIR-based IR decomposition, adaptive shared memory management, and quantization-aware strategies to optimize throughput on resource-constrained NVIDIA Ada GPUs.
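Experimental results on an NVIDIA L20 show throughput gains of up to 23.6% over vanilla TensorRT-LLM, with reported speedups of over 70% relative to SGLang and vLLM in small-batch, short-sequence cases, positioning Ada-MK as a practical option for latency-critical, production-scale LLM inference.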

Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference: An Expert Analysis

Motivation

Deploying LLMs in latency-critical production environments, such as commercial online advertising, imposes stringent requirements on inference efficiency. In such systems, every token generated during the decode phase triggers thousands of GPU kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel fusion, which combines multiple operators into a single persistent kernel, removes launch overhead and inter-operator HBM round-trips. Prior MegaKernel implementations, however, have been limited either by tight coupling to specific architectures and poor extensibility (hand-tuned solutions) or by runtime dynamic scheduling whose branch penalties are prohibitive in low-latency regimes (auto-compiled approaches).
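As a rough sanity check on the 14.6% figure, an Amdahl-style bound gives the most that removing launch overhead alone could buy. The sketch below is an illustration, not a calculation from the paper. It assumes the 14.6% fraction applies to the whole end-to-end run and that all other costs stay unchanged; the function name is hypothetical.

```python
def max_speedup_from_removed_fraction(fraction: float) -> float:
    """Amdahl-style bound: speedup if `fraction` of the runtime disappears entirely."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


# Launch-overhead share of end-to-end time quoted in the text.
launch_fraction = 0.146
bound = max_speedup_from_removed_fraction(launch_fraction)
print(f"upper bound from launch elimination alone: {bound:.3f}x")
```

Under that assumption, launch elimination alone is capped at roughly a 17% speedup, so any gain beyond that would have to come from other effects of fusion, such as the removed inter-operator memory round-trips.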

The Ada-MK framework addresses these limitations with a compile-time optimized, production-ready MegaKernel pipeline whose search procedure is not tied to a single architecture and is tuned here for the resource-constrained NVIDIA Ada architecture (L20). The goal is to achieve deterministic, millisecond-scale latency with robust improvements in throughput, without sacrificing system extensibility or requiring high-cost engineering rewrites.

Ada-MK Architecture

Ada-MK is architected around two tightly integrated phases:

1. Offline MegaKernel Synthesis:

The system parses the Transformer Decoder and LM Head into a fine-grained DAG. Hardware constraints, such as shared memory limits and register file size, are explicitly encoded and used to prune infeasible candidates. An MLIR-based lowering pipeline then decomposes the computation graphs into PTX-level primitives. Offline profiling selects the optimal execution path, capturing the best configuration of warp roles, tiling, and shared memory usage for the target hardware. The selected trace is statically serialized for runtime use.

2. Online Warp-Specialized Execution:

At inference time, Ada-MK operates as a TensorRT-LLM plugin, enabling hybrid execution: prefill computation is delegated to high-throughput TensorRT-LLM fused kernels, while decode is performed via the optimized MegaKernel. Within each streaming multiprocessor, warps are partitioned by role (Loader, Consumer, Storer, Controller, Launcher), and collaborate using an explicit shared memory page management scheme to maximize data locality, overlap I/O with compute, and eliminate launch and operator-switch overhead.
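The role partitioning can be pictured as a static table from warp index to role, fixed before the kernel runs. The sketch below models that idea in Python rather than CUDA. The role names come from the text, but the per-role warp counts and the helper names are hypothetical; in Ada-MK the actual allocation is chosen by the offline search.

```python
from enum import Enum


class Role(Enum):
    LOADER = "loader"
    CONSUMER = "consumer"
    STORER = "storer"
    CONTROLLER = "controller"
    LAUNCHER = "launcher"


# Hypothetical split of 20 warps per thread block (assumption, not from the paper).
ROLE_COUNTS = {
    Role.CONSUMER: 16,
    Role.LOADER: 1,
    Role.STORER: 1,
    Role.CONTROLLER: 1,
    Role.LAUNCHER: 1,
}


def build_role_table(counts):
    """Flatten per-role warp counts into a tuple indexed by warp id."""
    table = []
    for role, n in counts.items():
        table.extend([role] * n)
    return tuple(table)


ROLE_TABLE = build_role_table(ROLE_COUNTS)


def role_of(thread_idx, warp_size=32):
    """Return the role of the warp that owns a given thread index."""
    return ROLE_TABLE[thread_idx // warp_size]


print(role_of(0), role_of(16 * 32), role_of(19 * 32 + 31))
```

Because the table is a constant, a lookup like this involves no scheduling decision at run time, which is the property the compile-time design relies on.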

Key Optimizations

Adaptive Shared Memory Management

Ada-MK models shared memory allocation as a function of three factors: hardware resource limits, model architectural specifics (e.g., parameter sizes, quantization), and dynamic workload factors. K-dimension fine-grained splitting loads only the weight sub-tiles needed in each computation iteration, which reduces peak shared memory usage by 50%. Cross-operator shared memory page reuse is further refined so that pipelines can be recomposed on resource-constrained architectures: memory held for a completed activation load or output writeback is dynamically reclaimed and repurposed for subsequent pipeline stages.
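A back-of-the-envelope footprint model shows why splitting along K halves the peak. This is a minimal sketch with made-up tile sizes, not the paper's constraint model. It assumes a 99 KiB per-block shared memory budget (the per-block maximum usually cited for compute capability 8.9; verify for the target GPU), 4-bit weights, and 16-bit activations. All names are hypothetical.

```python
SMEM_BUDGET_BYTES = 99 * 1024  # assumed per-block budget; check the target GPU


def peak_smem_bytes(tile_n, tile_k, batch, weight_bits=4, act_bits=16, k_splits=1):
    """Bytes resident at once when only one K-chunk of the tile is loaded."""
    if tile_k % k_splits != 0:
        raise ValueError("tile_k must be divisible by k_splits")
    k_chunk = tile_k // k_splits
    weight_bytes = tile_n * k_chunk * weight_bits // 8  # weight sub-tile
    act_bytes = batch * k_chunk * act_bits // 8         # matching activation slice
    return weight_bytes + act_bytes


# Hypothetical tile: 256 output columns, K = 1024, batch size 1.
full = peak_smem_bytes(256, 1024, batch=1)
split = peak_smem_bytes(256, 1024, batch=1, k_splits=2)

print(full, full <= SMEM_BUDGET_BYTES)
print(split, split <= SMEM_BUDGET_BYTES)
```

With these illustrative numbers the unsplit tile exceeds the assumed budget while the two-way K split fits, and the split footprint is exactly half of the unsplit one because both the weight sub-tile and the activation slice scale with the K-chunk.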

Fine-Grained DAG-based Offline Search

A central innovation in Ada-MK is the replacement of costly runtime dynamic scheduling with exhaustive, compile-time trace selection. The core steps are:

MLIR-driven decomposition of operator-level graphs into PTX-level node dependencies.
Systematic construction of fine-grained data-movement and computation dependency DAGs, with explicit encoding of RAW and WAR constraints.
Automated alias analysis to identify precise points for asynchronous data transfer (e.g., advancing RMS Norm or KV-cache loads ahead of dependent computation) and to break pseudo-dependencies that block streaming reduction.
Simulated scheduling and role-to-pipeline mapping, including heuristic load balancing and tiling parameter sweeps.
Empirical offline profiling to lock the optimal pipeline configuration (warp allocation, shared-memory scheduling, duty-cycle maximization), with the optimal execution trace compiled directly into the persistent kernel code, thus removing all runtime branching overhead.
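The steps above can be sketched on a toy scale: build a dependency DAG, enumerate every valid order, score each with a simulated two-unit pipeline, and keep the best as a static schedule. This is a minimal illustration under stated assumptions, not the paper's search. The operator names, costs, and the single WAR edge are invented, and a simple cost model stands in for the empirical profiling the text describes.

```python
from itertools import permutations

# Hypothetical fine-grained ops: name -> (execution unit, cost in arbitrary cycles).
OPS = {
    "load_x": ("loader", 2),
    "load_w1": ("loader", 4),
    "rmsnorm": ("consumer", 1),
    "gemv1": ("consumer", 3),
    "load_w2": ("loader", 4),
    "gemv2": ("consumer", 3),
}

# True data dependencies (read-after-write).
RAW = {
    ("load_x", "rmsnorm"),
    ("rmsnorm", "gemv1"),
    ("load_w1", "gemv1"),
    ("gemv1", "gemv2"),
    ("load_w2", "gemv2"),
}

# Write-after-read edge that exists only if load_w2 reuses the page gemv1 reads.
WAR_SAME_PAGE = {("gemv1", "load_w2")}


def is_topological(order, deps):
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)


def makespan(order, deps):
    """Simulate in-order issue per unit; an op starts when its unit and inputs are ready."""
    unit_free, finish = {}, {}
    for op in order:
        unit, cost = OPS[op]
        ready = max((finish[a] for a, b in deps if b == op), default=0)
        start = max(ready, unit_free.get(unit, 0))
        finish[op] = start + cost
        unit_free[unit] = finish[op]
    return max(finish.values())


def search(deps):
    """Exhaustively pick the valid order with the smallest simulated makespan."""
    valid = (o for o in permutations(OPS) if is_topological(o, deps))
    best = min(valid, key=lambda o: makespan(o, deps))
    return makespan(best, deps), best


print(search(RAW | WAR_SAME_PAGE))
print(search(RAW))
```

In this toy, dropping the WAR edge (for example, because analysis shows the second weight load can use a different page) lets the load overlap with earlier compute and shortens the best schedule. The winning order is a plain tuple, so it can be emitted as a fixed sequence with no decisions left for run time.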

Quantization-Aware Optimization

Ada-MK integrates advanced quantization strategies, leveraging QServe-inspired tensor-core aware weight reordering for rapid quantized data access, K-dimension multi-level pipeline computation to further reduce memory footprint, and small-batch padding optimizations that avoid redundant HBM-register transfers. These quantization-aware techniques are vital for maximizing throughput and minimizing latency under Ada’s strict memory and bandwidth constraints.
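For readers unfamiliar with W4A16, the sketch below shows generic 4-bit weight packing and dequantization with a scale and zero point. It is not the QServe-inspired tensor-core layout the text refers to, whose exact ordering is not given here; the nibble order, function names, and sample values are all assumptions chosen for illustration.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def unpack_int4(packed):
    """Inverse of pack_int4."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out


def dequantize(packed, scale, zero_point):
    """Recover approximate real-valued weights: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]


q = [0, 15, 8, 7]
packed = pack_int4(q)
print(len(packed), unpack_int4(packed), dequantize(packed, scale=0.5, zero_point=8))
```

Four weights occupy two bytes here instead of the eight bytes they would take at 16-bit precision, which is the footprint reduction that makes weight tiles easier to fit in shared memory.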

Heterogeneous Hybrid Inference Engine

The Ada-MK deployment model is pragmatic: MegaKernel is used selectively where it offers maximal benefit (decode phase, IO-bound paths), while retaining TensorRT-LLM’s native prefill pipeline for compute-bound, high-batch workloads. All business logic (e.g., prefix-tree decoding, discriminative filtering) remains compatible, ensuring zero-cost migration for production workloads.
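The selective routing can be sketched as a small dispatch rule. This is an illustration only: the backend labels, the `Step` type, and the batch-size threshold are hypothetical and do not correspond to TensorRT-LLM plugin APIs or to a cutoff stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Step:
    phase: str       # "prefill" or "decode"
    batch_size: int


def select_backend(step, megakernel_max_batch=8):
    """Route decode steps at small batch to the MegaKernel, everything else to TensorRT-LLM.

    megakernel_max_batch is an assumed tuning knob, not a value from the paper.
    """
    if step.phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {step.phase}")
    if step.phase == "decode" and step.batch_size <= megakernel_max_batch:
        return "megakernel"
    return "trtllm"


for step in (Step("prefill", 1), Step("decode", 1), Step("decode", 64)):
    print(step, "->", select_backend(step))
```

Keeping the rule outside the kernels means the surrounding serving logic sees a single engine, which is consistent with the claim that business logic does not need to change.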

Experimental Results

Ada-MK achieves marked end-to-end throughput improvements across diverse commercial and open-source LLM inference workloads:

In Qwen-series model benchmarks (e.g., Qwen3-1.7B and Qwen2.5-1.5B, GPTQ-W4A16 quantized), throughput gains over vanilla TensorRT-LLM range from 10% to 23.6% for short-sequence, small-batch scenarios, and up to 50.2% over vLLM.
The framework shows consistent improvements (4%-19.5%) over TensorRT-LLM on real-dataset tasks (CSL, HumanEval), with the largest gains at small and mid-size batches.
Ada-MK’s advantage is most pronounced in small-batch, short-sequence inference (BS=1/2), where kernel launch overhead dominates. Here, speedups can exceed 70% relative to SGLang and vLLM.
As batch size and sequence length increase, Ada-MK maintains a lead over TensorRT-LLM, but system-level improvements in vLLM and SGLang (such as request scheduling and KV cache management) gradually close the performance gap, and these systems sometimes surpass Ada-MK at larger batch sizes (e.g., BS=16 on CSL).
The design supports cross-model generalization: performance advantages are stable across different Qwen versions.

Implications and Future Directions

Ada-MK demonstrates that portable, compile-time optimized MegaKernel pipelines are deployable at industrial scale even under severe resource constraints. The paradigm of DAG-based offline search with statically solidified execution traces aims to deliver both portability and high hardware efficiency, avoiding the trade-offs of hand-tuned kernels and of auto-compiled approaches that rely on runtime dynamic scheduling, such as Mirage MPK.

In practical terms, the framework’s hybrid plugin model means that enterprises can rapidly integrate Ada-MK into existing TensorRT-LLM infrastructure without business logic modifications, achieving immediate gains in latency and throughput for latency-critical applications such as online advertising, recommendation, and interactive LLM services.

Looking ahead, Ada-MK opens the possibility of scaling this approach to even more constrained or diverse future GPU architectures, including migration to forthcoming platforms such as Blackwell, and of integrating larger models. The methodology is broadly applicable to other domains requiring deep global fusion and deterministic low-latency inference, and offers a template for combining MLIR-based IR optimization, hardware-aware scheduling, and persistent-kernel fusion strategies.

Conclusion

Ada-MK presents a comprehensive solution for deploying persistent-kernel MegaKernel LLM inference on NVIDIA Ada GPUs, balancing portability, efficiency, and production system integration. Through adaptive shared memory management, MLIR-based fine-grained DAG search, and zero-overhead hybrid engine embedding, Ada-MK systematically reduces decode-phase latency and pushes throughput beyond existing baselines, representing the first industrial-scale deployment of MegaKernel in commercial advertising. The approach’s compile-time determinism and hardware extensibility make it a foundational advance in LLM deployment under constrained environments, with broad implications for future AI system design and optimization (2605.11581).
