Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

Published 4 May 2026 in cs.SE | (2605.03208v2)

Abstract: Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application -- but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for slow in-place edits. We present Kerncap, an automated kernel extraction tool that intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. Kerncap performs an address-space closure of all device memory -- a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing -- locates kernel sources, and emits self-contained reproducer projects. HIP reproducers use a Clang VFS overlay for source-level recompilation without modifying the original build system; Triton reproducers are tuning-pinned, binding the captured autotuner configuration into the artifact to preserve the JIT kernel's numerical contract. Across six real-world HIP and Triton workloads spanning traditional HPC and ML domains on three AMD GPU architectures (CDNA2, CDNA3, RDNA3), Kerncap extracts and validates kernels from snapshots ranging from 152 MB to 30 GB -- including a VA-faithful capture of vLLM's Mixture-of-Experts weight pool reached through pointer indirection. On our llama-cpp case study, Kerncap's edit-recompile-validate loop achieves a 13.6x speedup over the traditional workflow, reducing kernel isolation from a multi-hour process to a single command. The resulting reproducers also serve as a substrate for autotuning agents and LLM-driven kernel generators that need rapid, isolated evaluation of candidates.

Authors (2)

Summary

The paper presents Kerncap, an automated pipeline that isolates AMD GPU kernels and achieves up to a 13.6× reduction in iteration time.
The methodology employs unified HSA interception, Python shims, and VA-faithful memory closure to accurately capture runtime state in both HIP and Triton workloads.
Experimental results demonstrate robustness, correctness, and low overhead across diverse workloads and AMD GPU architectures.

Detailed Analysis of "Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs" (2605.03208)

Motivation and Problem Statement

The paper addresses a critical bottleneck in GPU kernel optimization workflows: the laborious and error-prone process of isolating GPU kernels from complex applications for iterative tuning and analysis. On modern AMD hardware, a typical optimization loop involves editing and recompiling the entire host application to reflect even minor kernel changes, introducing significant time overhead. Manual isolation requires developers to reconstruct build flags, dispatch configurations, and precisely capture runtime inputs—a process so prohibitively complex that most practitioners default to full-application iteration. This inefficiency is exacerbated in large codebases and for applications with dynamic workloads, such as those in HPC and deep learning, where kernel extraction is needed for rapid hypothesis testing, autotuning, and domain-specific code generation.

System Design and Methodology

Kerncap implements an automated, end-to-end pipeline for isolating GPU kernels and generating self-contained reproducer projects on AMD architectures. Its workflow consists of five main stages: profiling, runtime state capture, automated source discovery, reproducer generation, and validation. Critical technical components include:

Unified HSA-level interception: Leveraging the ROCProfiler-SDK’s LD_PRELOAD mechanism, Kerncap injects itself at the HSA (Heterogeneous System Architecture) API layer to hook kernel dispatches for both HIP (C++-based) and Triton (Python JIT) kernels.
Python compile-hook shim for Triton: To bridge the gap between JIT-compiled Python kernels (which have critical metadata only at runtime) and HSA-level capture, a lightweight Python shim records kernel signatures, arguments, and autotuner state, cross-referenced by binary fingerprints (HSACO SHA-256).
VA-faithful address-space closure: Kerncap captures all device memory at its original virtual addresses, so every pointer remains valid at replay regardless of depth or indirection. This removes the need for DWARF metadata or explicit pointer chasing.
Automated source and dependency discovery: For HIP, the system combines DWARF-based translation unit resolution with fallback heuristics (compile_commands.json, symbol lookup, grep). For Triton, Python AST parsing and import tracing reconstruct the source structure.
Reproducer generation: For HIP, a Clang Virtual File System overlay allows the reproducer to recompile the kernel source with precise original flags and includes. For Triton, a Jinja2-generated script pins tile configuration and imports captured tensors, guaranteeing deterministic replay—even for autotuned kernels.
Validation: The tool supports smoke-testing, byte-exact verification (HIP), and tolerance-based floating-point checks (Triton).
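The VA-faithful closure idea from the list above can be illustrated with a toy model. This is a minimal sketch, not Kerncap's implementation: `ToyDevice`, `snapshot`, `restore`, and `toy_kernel` are hypothetical names, and the "device" is a plain Python dictionary mapping base addresses to bytes. It shows why restoring every allocation at its original address keeps embedded pointers valid, while relocating the same bytes breaks them.

```python
import struct


class ToyDevice:
    """Toy address space (hypothetical): base address -> allocation bytes."""

    def __init__(self):
        self.allocs = {}

    def map(self, base, data):
        self.allocs[base] = bytearray(data)

    def read(self, addr, size):
        # Find the allocation that fully contains [addr, addr + size).
        for base, buf in self.allocs.items():
            if base <= addr and addr + size <= base + len(buf):
                off = addr - base
                return bytes(buf[off:off + size])
        raise MemoryError(f"unmapped address {addr:#x}")


def snapshot(dev):
    """Record every allocation together with its base address."""
    return [(base, bytes(buf)) for base, buf in sorted(dev.allocs.items())]


def restore(snap, relocate=0):
    """Rebuild the address space; relocate=0 keeps the original addresses."""
    dev = ToyDevice()
    for base, data in snap:
        dev.map(base + relocate, data)
    return dev


def toy_kernel(dev, table_addr, n):
    """Follow n embedded 64-bit pointers and sum the float32 each points to."""
    total = 0.0
    for i in range(n):
        (ptr,) = struct.unpack("<Q", dev.read(table_addr + 8 * i, 8))
        (val,) = struct.unpack("<f", dev.read(ptr, 4))
        total += val
    return total


# A pointer table at 0x1000 refers to two separate "weight" buffers.
live = ToyDevice()
live.map(0x2000, struct.pack("<f", 1.5))
live.map(0x3000, struct.pack("<f", 2.5))
live.map(0x1000, struct.pack("<QQ", 0x2000, 0x3000))

snap = snapshot(live)
result_live = toy_kernel(live, 0x1000, 2)
result_faithful = toy_kernel(restore(snap), 0x1000, 2)

# Same bytes at shifted addresses: the table still holds the old pointers.
try:
    toy_kernel(restore(snap, relocate=0x10000), 0x11000, 2)
    relocated_ok = True
except MemoryError:
    relocated_ok = False
```

In this sketch the faithful restore reproduces the live result without the replay code knowing that the table contains pointers at all, whereas the relocated restore fails on the first dereference. A relocating approach would have to find and patch every embedded pointer, which is the work the address-space closure avoids.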

Experimental Results

The evaluation covers six workloads: LLM inference (llama.cpp), framework-generated scientific kernels (LAMMPS, rocBLAS GEMM), and both hand-authored and codegen-based Triton workloads (including FlashAttention2 and vLLM Mixture-of-Experts). Kerncap demonstrates the following:

Breadth and robustness: Kerncap extracts and validates kernels from all evaluated workloads across three AMD GPU architectures (CDNA2, CDNA3, RDNA3), with device memory snapshots up to 30 GB, confirming architecture-agnostic address space capture.
Iteration speedup: In the llama.cpp case study, the isolated edit-recompile-replay loop reduces wall time by up to 13.6× relative to the traditional workflow, with a single iteration taking ~2.7 minutes. Kernel isolation itself drops from a multi-hour manual process to a single command.
Correctness: Replay is deterministic for both HIP and Triton kernels, validated by byte-exact output matching for HIP and tolerance-based floating-point checks for Triton, including kernels that require multi-level pointer indirection (e.g., vLLM’s MoE expert weights).
Low runtime overhead: The fixed cost for device memory snapshots (~1.5–1.7 GB/s throughput) and negligible per-dispatch interception cost make the tool practical even for large or latency-sensitive workloads. The interception overhead is consistently below 1.2× for realistic applications, and substantially lower than standard profiling tools such as rocprofv3 in some scenarios.
Generality: The tool gracefully degrades when source code is unavailable (e.g., kernels generated by code generators), still providing correct replay for hardware counter profiling and input perturbation studies.
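To put the snapshot cost in concrete terms, the reported throughput can be turned into a rough time estimate. This is a back-of-the-envelope sketch under stated assumptions: throughput is constant over the whole copy, sizes use decimal units, and `snapshot_seconds` is a hypothetical helper rather than anything the tool provides. The sizes are the 152 MB and 30 GB extremes from the abstract.

```python
def snapshot_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to copy a snapshot at constant throughput (assumed, simplified model)."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


# Smallest and largest snapshot sizes reported, in GB (decimal units assumed).
SIZES_GB = {"smallest (152 MB)": 0.152, "largest (30 GB)": 30.0}
# Lower and upper bounds of the reported ~1.5-1.7 GB/s throughput.
THROUGHPUTS = (1.5, 1.7)

# For each size: (time at the slower bound, time at the faster bound).
estimates = {
    label: tuple(snapshot_seconds(size, t) for t in THROUGHPUTS)
    for label, size in SIZES_GB.items()
}

for label, (slow, fast) in estimates.items():
    print(f"{label}: {fast:.2f} to {slow:.2f} s")
```

Under this model the largest snapshot takes roughly 18 to 20 seconds and the smallest about a tenth of a second. Either way it is a one-time cost per capture rather than a per-iteration cost.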

Comparative and Theoretical Implications

Kerncap differentiates itself from NVIDIA’s Nsight Compute and CUPTI Checkpoint API in key aspects: it is open-source, supports AMD hardware (both HIP and Triton/HSA paths), generates editable and rebuildable artifacts (i.e., the kernel definition, state, and environment required for standalone replay), and performs address-space closure rather than per-allocation state management. Interactive debuggers (CUDA-GDB, rocgdb), benchmarking frameworks, and compiler-emitted reproducibility tools (e.g., torch.compile artifacts) provide pieces of this workflow but lack comprehensive automation for kernel-level isolation and replay.

From a theoretical perspective, the VA-faithful snapshot methodology is significant—it demonstrates that the full kernel-execution state (including arbitrarily deep device pointer graphs) can be recovered and restored purely at the memory layout level, sidestepping the limitations of symbol-based or metadata-based approaches. For Triton, pinning the autotuner configuration directly addresses the hidden numerical contract imposed by floating-point accumulation order changes, ensuring reproducer fidelity.
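The accumulation-order point can be shown with ordinary floating-point arithmetic. The sketch below is an illustration in plain Python with float64, not Triton code and not the paper's validation logic: `tiled_sum`, `byte_exact`, and `within_tolerance` are hypothetical helpers, and the tile size stands in for an autotuner-selected configuration. It shows that changing how a reduction is grouped can change the result bit for bit, which is why a reproducer that lets the configuration drift no longer computes the same function.

```python
import math
import struct


def tiled_sum(values, tile):
    """Sum `values` tile by tile, then add the per-tile partial sums in order."""
    partials = []
    for start in range(0, len(values), tile):
        acc = 0.0
        for v in values[start:start + tile]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


def byte_exact(a, b):
    """Compare the raw 8-byte encodings of two float64 values."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def within_tolerance(a, b, rel_tol=1e-9):
    """Relative-tolerance comparison; the threshold is an arbitrary choice."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Same three inputs, two groupings: the results differ in the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(byte_exact(left, right), within_tolerance(left, right))

# A more extreme case: the tile size decides whether the small terms survive.
vals = [1e16, 1.0, -1e16, 1.0]
sum_tile1 = tiled_sum(vals, 1)
sum_tile2 = tiled_sum(vals, 2)
print(sum_tile1, sum_tile2)
```

The first pair fails a byte-exact check but passes a tolerance check. The second pair differs outright (1.0 with a tile of one element, 0.0 with a tile of two), so no reasonable tolerance reconciles them. Pinning the captured configuration removes this source of divergence between the original kernel and its reproducer.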

Limitations and Future Directions

Limitations identified include restriction to single-kernel, single-GPU capture, absence of host-side state capture (host memory, IO handles), potential races when device memory is rapidly deallocated after kernel execution, source-discovery fallbacks for projects lacking comprehensive build metadata, and the hardware-dependency of autotuner-captured configurations. Portability across distinct GPU architectures may require further autotuning or manual tuning-state reconciliation.

Future research and engineering directions emphasized in the paper are: multi-kernel and sequence-capture (tracking of inter-kernel dependencies), multi-GPU coordination, comprehensive host/device state integration, persistent kernel databases for regression testing, and support for NVIDIA hardware via analogous driver hooks. These extensions would solidify Kerncap’s role as standard infrastructure for both manual performance engineering and automated agent-based GPU kernel synthesis workflows.

Conclusion

Kerncap systematically removes the manual bottlenecks in AMD GPU kernel isolation by automating end-to-end extraction, source recovery, device state capture, and replay validation, unifying the workflow for both C++ HIP and Python-based Triton workloads. The tool’s design, especially its use of VA-faithful memory closure and autotuner-pinned reproducer generation, enables high-throughput, correct, and reproducible kernel-level experimentation. The demonstrated speedups, validated correctness, and robustness across diverse workloads position Kerncap as a foundational asset for expert GPU practitioners and for emerging research on agent-driven kernel synthesis and autotuning. The self-contained reproducer artifact produced by Kerncap directly addresses the need for rapid, isolated kernel evaluation in both human and AI-driven optimization cycles.
