
SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization (2508.20258v1)

Published 27 Aug 2025 in cs.DC and cs.AI

Abstract: LLMs have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Existing approaches lack a key characteristic that human performance engineers rely on for near-optimal utilization -- hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness. For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and Science kernels, SwizzlePerf can generate swizzling patterns for 9 of the kernels that achieve up to a 2.06x speedup and 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents.

Summary

  • The paper introduces a hardware-aware LLM framework that automatically synthesizes swizzling patterns to optimize GPU kernel performance.
  • It integrates detailed architectural context and historical profiling feedback to improve L2 cache hit rates by up to 70% on certain kernels.
  • Experimental results demonstrate speedups ranging from 1.29× to 2.06×, reducing optimization cycles from weeks to minutes.

SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization

Introduction and Motivation

SwizzlePerf introduces a hardware-aware, LLM-driven methodology for optimizing GPU kernel performance, specifically targeting spatial remapping (swizzling) to maximize cache locality on disaggregated architectures. The core insight is that existing LLM-based autotuning frameworks lack explicit hardware-awareness, which is essential for achieving near-optimal utilization on modern GPUs with complex memory hierarchies and chiplet-based designs. By integrating detailed architectural context, filtered profiling logs, and historical performance feedback into the LLM prompt, SwizzlePerf enables the automatic synthesis of swizzling patterns that align with the underlying hardware topology and scheduling policies.

Methodology: Hardware-Aware Optimization Loop

SwizzlePerf operates as a closed-loop, bottleneck-driven optimization system. The workflow is as follows:

  1. CodeGen LLM Call: The LLM receives a prompt containing the original kernel code, a summary of memory access patterns, a history of previous optimization attempts, and explicit hardware details (e.g., number of XCDs, cache sizes, block scheduling policy).
  2. Parsed Context: Structured context is extracted from profilers (e.g., rocprofv3), device attributes, and architecture documentation, exposing both the bottleneck metric (L2 hit rate) and spatial constraints.
  3. CodeGen Output: Using DSPy, the LLM outputs a reasoning trace, critiques prior attempts, and proposes a new swizzling formula. The generated code is compiled, validated, and profiled.
  4. Bottleneck History Buffer: Each iteration appends the code diff and bottleneck report to a persistent buffer, enabling the LLM to reflect on failures and diversify remapping strategies. Candidates are ranked by L2 hit rate, and the best validated kernel is retained (Figure 1).

Figure 1: SwizzlePerf methodology and example swizzling outcome. (a) The optimization loop leverages parsed context and historical bottleneck feedback to guide LLM-driven code generation. (b) For GEMM, SwizzlePerf generates a swizzling pattern that co-locates tiles on the same XCD, improving L2 locality.

This approach mimics the workflow of expert performance engineers, but automates the process and accelerates convergence from weeks to minutes.
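
To make the loop concrete, the sketch below outlines this closed-loop structure in Python. The Attempt record and the four callables (build_prompt, generate_candidate, validate, profile_l2_hit_rate) are hypothetical placeholders for the DSPy-driven LLM call, correctness check, and rocprofv3 profiling; this illustrates the loop shape under those assumptions rather than SwizzlePerf's actual implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Attempt:
    source: str         # candidate kernel source
    l2_hit_rate: float  # measured bottleneck metric (0.0 if invalid)
    reasoning: str      # LLM reasoning trace / critique of prior attempts

def optimize_kernel(
    kernel_src: str,
    hw_context: dict,
    build_prompt: Callable[[str, dict, list], str],
    generate_candidate: Callable[[str], tuple],
    validate: Callable[[str], bool],
    profile_l2_hit_rate: Callable[[str], float],
    num_iters: int = 10,
) -> str:
    """Closed-loop, bottleneck-driven search for a swizzling pattern."""
    history: list[Attempt] = []
    for _ in range(num_iters):
        # 1. Prompt = kernel code + memory access summary + hardware details
        #    (XCD count, cache sizes, scheduling policy) + attempt history.
        prompt = build_prompt(kernel_src, hw_context, history)
        # 2. The LLM proposes a new swizzling formula with a reasoning trace.
        candidate_src, reasoning = generate_candidate(prompt)
        # 3. Compile and validate the candidate, then profile the bottleneck metric.
        hit_rate = profile_l2_hit_rate(candidate_src) if validate(candidate_src) else 0.0
        # 4. Append the result so later iterations can reflect on failures.
        history.append(Attempt(candidate_src, hit_rate, reasoning))
    # Rank candidates by L2 hit rate and keep the best validated kernel.
    best = max(history, key=lambda a: a.l2_hit_rate, default=None)
    return best.source if best and best.l2_hit_rate > 0.0 else kernel_src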

Swizzling Patterns: Implementation and Examples

Swizzling refers to the remapping of workgroup program IDs (PIDs) to enhance spatial/temporal locality and align computation with hardware topology. SwizzlePerf generates kernel-specific swizzling formulas that maximize intra-XCD L2 reuse. For example, in a tiled GEMM kernel, the swizzling pattern ensures that tiles sharing rows in matrix A are mapped to the same XCD, thus maximizing L2 cache hits.

A representative SwizzlePerf-generated swizzling pattern for GEMM in Triton is:

pid = tl.program_id(0)
num_xcds   = 8                  # XCDs (chiplets) on the target GPU, e.g. 8 on MI300X
num_blocks = NUM_SMS            # total number of workgroups in the launch
b_per_xcd = (num_blocks + num_xcds - 1) // num_xcds   # blocks per XCD, rounded up
# The scheduler assigns original PIDs to XCDs round-robin (pid % num_xcds).
# After the remap, contiguous swizzled indices are served by the same XCD,
# so neighboring tiles share that XCD's L2.
pid = (pid % num_xcds) * b_per_xcd + (pid // num_xcds)

This formula assigns contiguous blocks to the same XCD, with ceiling division handling block counts that are not an exact multiple of the XCD count, a refinement over the expert-designed pattern (Figure 2).

Figure 2: GEMM swizzling pattern generated by SwizzlePerf, demonstrating hardware-specific remapping for improved cache locality.
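
To see what the remapping does, the short script below (an illustrative check, not code from the paper) applies the same formula to a toy launch of 16 workgroups on 8 XCDs, assuming the hardware's default round-robin assignment of original PIDs to XCDs, and reports which XCD serves each swizzled block index.

num_xcds = 8      # XCDs on the device
num_blocks = 16   # toy launch size (a multiple of num_xcds for clarity)
b_per_xcd = (num_blocks + num_xcds - 1) // num_xcds  # = 2

xcd_of_swizzled = {}
for pid in range(num_blocks):
    xcd = pid % num_xcds  # XCD the scheduler assigns this workgroup to
    new_pid = (pid % num_xcds) * b_per_xcd + (pid // num_xcds)
    xcd_of_swizzled[new_pid] = xcd

for new_pid in sorted(xcd_of_swizzled):
    print(f"swizzled block {new_pid:2d} runs on XCD {xcd_of_swizzled[new_pid]}")

# Swizzled blocks 0-1 land on XCD 0, 2-3 on XCD 1, ..., 14-15 on XCD 7, so
# contiguous logical tiles share an XCD and its L2. For this evenly divisible
# launch the remap is also a permutation, so every tile is computed exactly once.
assert sorted(xcd_of_swizzled) == list(range(num_blocks))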

SwizzlePerf generalizes to a variety of kernels, including LayerNorm, Softmax, FDTD, Stencil 2D, and Transpose, each requiring distinct remapping strategies to align with their memory access patterns and the hardware's cache topology.
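
As one purely hypothetical illustration (not a pattern reported in the paper), the same contiguity idea can be specialized to a 2D tile grid such as a stencil: after swizzling the linear PID, the tile coordinates are derived from the new index so that workgroups covering horizontally adjacent tiles land on the same XCD and can reuse shared halo columns resident in its L2.

# Hypothetical sketch for a grid of grid_m x grid_n tiles; the identifiers are
# illustrative assumptions, not the paper's actual stencil swizzle.
pid = tl.program_id(0)
num_xcds = 8
num_blocks = grid_m * grid_n
b_per_xcd = (num_blocks + num_xcds - 1) // num_xcds
pid = (pid % num_xcds) * b_per_xcd + (pid // num_xcds)  # same contiguity trick as GEMM
tile_m = pid // grid_n   # tile row this workgroup now computes
tile_n = pid % grid_n    # tile column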

Experimental Results

SwizzlePerf is evaluated on 10 diverse GPU kernels (6 ML, 4 scientific), benchmarked on AMD MI300X with medium problem sizes. The primary metric is L2 hit rate, serving as a low-noise proxy for spatial locality, with end-to-end speedup as a secondary metric.

Key results:

  • Speedup: SwizzlePerf achieves an average speedup of 1.29×, with up to 2.06× on the transpose kernel.
  • L2 Hit Rate: Average improvement of 23.9%, with up to 70% on certain kernels. Four kernels approach 100% L2 hit rate.
  • Generality: SwizzlePerf-generated patterns generalize across problem sizes, maintaining or increasing the L2 hit rate advantage as input size grows (Figure 3).

Figure 3: L2 hit rate improvements and speedups from SwizzlePerf-generated swizzling patterns across 10 kernels. SwizzlePerf consistently outperforms hardware-unaware and hardware-overload baselines.

Figure 4: Progression plots for GEMM, Stencil 2D, SpMV, Softmax, and LayerNorm kernels. SwizzlePerf rapidly converges to performant swizzling patterns, while baselines stagnate or degrade.

Notably, SwizzlePerf finds the optimal swizzling pattern for GEMM in under 5 minutes, matching a solution that required two weeks of expert engineering. For memory-bound kernels (e.g., Stencil 2D, Transpose), L2 hit rate improvements translate to substantial speedups, while for compute-bound kernels (e.g., GEMM), the impact is more modest.

Ablation Studies and Model Analysis

SwizzlePerf's effectiveness is robust to kernel type, problem size, and LLM choice:

  • Problem Size Generalization: Swizzling patterns maintain their L2 hit rate advantage as tensor sizes increase, indicating that the generated formulas are not overfit to specific configurations.
  • LLM Variants: Different LLMs (OpenAI GPT-4o, GPT-4.1-mini, and o3-mini) exhibit varying performance across kernels. All converge on optimal patterns for GEMM, but only the largest models consistently solve the more complex kernels (Figure 5).

    Figure 5: L2 hit rate of SwizzlePerf-generated patterns with three different LLMs, highlighting model-dependent performance on hardware-aware optimization tasks.

  • Baselines: Hardware-unaware and hardware-overload approaches fail to achieve significant L2 hit rate improvements or speedups, and hardware-overload patterns often lack generality and correctness.

Implementation Considerations

SwizzlePerf is built atop the open-source IntelliPerf framework, with key modifications to inject hardware-aware context and structure the LLM prompt/output. The system leverages public profilers (rocprofv3), device attributes (HIP), and architecture documentation to construct the context. The optimization loop is implemented using DSPy, which enforces a structured output signature for reasoning and code generation.
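
The structured signature enforced by DSPy can be pictured roughly as follows. This is a hedged sketch using DSPy's public class-based signature API; the field names and descriptions are assumptions for illustration, not SwizzlePerf's actual interface.

import dspy

class GenerateSwizzle(dspy.Signature):
    """Propose a PID swizzling pattern for the given kernel and hardware."""

    kernel_source = dspy.InputField(desc="current Triton kernel source")
    hardware_context = dspy.InputField(desc="XCD count, cache sizes, block scheduling policy")
    attempt_history = dspy.InputField(desc="previous code diffs and bottleneck reports")

    critique = dspy.OutputField(desc="critique of prior attempts")
    reasoning = dspy.OutputField(desc="why the proposed remapping should raise the L2 hit rate")
    swizzled_kernel = dspy.OutputField(desc="full kernel source with the new swizzling formula")

generate = dspy.Predict(GenerateSwizzle)
# prediction = generate(kernel_source=..., hardware_context=..., attempt_history=...)
# prediction.swizzled_kernel holds the candidate code to compile, validate, and profile.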

Resource requirements are modest: each optimization loop iteration involves a single LLM call, code compilation, and profiling run. The approach is scalable to large kernels and problem sizes, as the swizzling formulas are synthesized symbolically and validated empirically.

Potential limitations include dependence on the quality of profiler data and the LLM's ability to reason about complex mappings. For highly irregular or data-dependent access patterns, swizzling may yield diminishing returns.

Implications and Future Directions

SwizzlePerf demonstrates that explicit hardware-awareness is essential for unlocking the full potential of LLM-driven performance engineering. The methodology closes the hardware–software feedback loop, enabling LLMs to replicate and, in some cases, surpass expert reasoning in spatial optimization tasks.

Practical implications include:

  • Accelerated Kernel Optimization: Drastically reduces the time and expertise required to achieve near-optimal cache locality on modern GPUs.
  • Generalizability: The approach is extensible to new architectures, kernels, and bottleneck metrics (e.g., power efficiency).
  • Integration with Compiler Toolchains: SwizzlePerf's structured context and output can be integrated into autotuning and code generation pipelines for ML and HPC workloads.

Theoretically, the work raises questions about the modalities of hardware-awareness most conducive to LLM reasoning. Future research directions include:

  • Expanding the context to include non-textual modalities (e.g., visualizations of memory access patterns).
  • Exploring swizzling for power efficiency and DVFS-aware optimization.
  • Systematic benchmarking of LLMs across vendors and architectures for hardware-aware code synthesis.

Conclusion

SwizzlePerf establishes a principled, hardware-aware LLM workflow for GPU kernel performance optimization, with empirical evidence of substantial gains in cache locality and runtime efficiency across a range of real-world kernels. By structuring the optimization loop around explicit hardware context and bottleneck metrics, SwizzlePerf bridges the gap between human expert reasoning and autonomous code generation, setting a new standard for LLM-driven performance engineering on modern accelerators.
