SwizzlePerf: Hardware-Aware GPU Optimization
- SwizzlePerf is a hardware-aware, LLM-driven framework that automates GPU kernel performance optimizations through hardware-specific swizzling.
- It integrates detailed profiling metrics and architectural context to generate code transformations that replicate expert-level strategies in minutes.
- The framework achieves up to a 2.06× speedup by enhancing L2 cache hit rates, thereby significantly narrowing the performance gap with manual codesign.
SwizzlePerf is a hardware-aware, LLM-driven framework for GPU kernel performance optimization that automates the spatial mapping of workgroups to improve cache locality and overall execution efficiency on disaggregated architectures. The distinguishing innovation of SwizzlePerf is the explicit infusion of hardware context into LLMs, enabling software-level optimizations—specifically generation of hardware-specific swizzling patterns—that traditionally require multi-week manual codesign by expert engineers. By leveraging workload memory access patterns, platform architecture specifications, detailed profiler feedback, and historical performance context, SwizzlePerf enables the systematic, rapid synthesis of code transformations that narrow the gap between automated code generation and the nuanced, hardware-conscious strategies adopted by domain experts (Tschand et al., 27 Aug 2025).
1. Motivation and Problem Context
Traditional approaches to GPU kernel performance engineering primarily employ search-based optimization around runtime, typically lacking in-depth hardware-awareness. Such methodologies treat execution time as a noisy, indirect proxy for underlying bottlenecks, disregarding details of cache topology, core scheduling, and other microarchitectural subtleties. In practice, human engineers achieve near-optimal utilization on modern accelerators by tuning workgroup-to-hardware mappings (“swizzling patterns”) with explicit knowledge of memory hierarchies and scheduler behaviors—a manual process that is time-consuming and nontrivial to generalize across hardware generations.
Swizzling refers to the reordering or remapping of how workgroups are assigned to hardware execution/storage resources to maximize spatial and temporal locality. Effective swizzling can colocate related computation tiles within shared caches (for example, mapping mutually dependent regions to a common XCD, or accelerator die), thereby boosting intra-cache reuse and minimizing expensive off-chip memory accesses.
SwizzlePerf aims to formalize and automate this hardware–software codesign loop by providing LLMs with architecture-aware context and explicit metric targets such as cache hit rate.
2. Core Principles of Hardware-Aware Optimization
SwizzlePerf diverges from previous runtime-only methods by embedding hardware details and profiling signals in its optimization process. Key features include:
- Explicit Hardware Context: Device attributes relevant to spatial scheduling are programmatically extracted (e.g., from HIP or low-level driver APIs). Examples include the number of XCDs, the cache hierarchy (size, sharing structure), and compute unit counts.
- Profiling Metrics Integration: Performance-critical signals such as L2 hit rates are collected using profiling tools (e.g., rocprofv3) and used as direct optimization targets, rather than proxying by aggregate runtime.
- Scheduling Policy Modeling: Understanding of the default workgroup scheduling policy (typically round-robin across XCDs) is embedded in the prompt, so LLMs can reason about deviations required for optimal swizzling.
- Workload-Specific Memory Access Patterns: The LLM is provided with summaries of the kernel’s memory-locality constraints and the patterns of data reuse optimal for the architecture.
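As a concrete illustration of the first two points, the kind of hardware-context record injected into the prompt can be sketched as a plain dictionary. The field names and values below are illustrative stand-ins (example figures for an 8-XCD accelerator), not output of an actual HIP or driver query:

```python
# Illustrative hardware-context record of the kind SwizzlePerf supplies to
# the LLM. Values are example figures for an 8-XCD part, not a real query.
hw_context = {
    "num_xcds": 8,                      # accelerator dies, each with its own L2
    "l2_cache_bytes": 4 * 1024 * 1024,  # per-XCD L2 size (example value)
    "compute_units": 304,               # total CUs across all XCDs (example value)
    "scheduling_policy": "round-robin across XCDs",
}

def context_prompt(ctx):
    """Render the context as plain prompt text for the LLM."""
    return "\n".join(f"{k}: {v}" for k, v in ctx.items())

print(context_prompt(hw_context))
```

In a real deployment these fields would be populated programmatically from the device APIs mentioned above rather than written by hand.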
This hardware-rich context enables LLMs to generate code transformations that reflect the practical decision-making process of performance engineers. For example, SwizzlePerf employs the formula

    pid_new = (pid % num_xcds) * b_per_xcd + (pid // num_xcds),

where b_per_xcd = ceil(num_blocks / num_xcds) is the number of workgroup blocks mapped per XCD, ensuring contiguous tiles are co-located up to hardware capacity.
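The swizzling remap can be sketched in plain Python (a minimal illustration with assumed block and XCD counts, not the framework's generated kernel code):

```python
import math

def swizzle_pid(pid: int, num_blocks: int, num_xcds: int) -> int:
    """Remap a round-robin workgroup id so each XCD processes a
    contiguous range of logical tiles (better per-die L2 reuse)."""
    b_per_xcd = math.ceil(num_blocks / num_xcds)  # blocks mapped per XCD
    return (pid % num_xcds) * b_per_xcd + (pid // num_xcds)

# With 16 blocks over 4 XCDs, hardware pids 0, 4, 8, 12 all land on
# XCD 0 under round-robin scheduling; after the remap they process the
# contiguous tiles 0..3 instead of the scattered tiles 0, 4, 8, 12.
print([swizzle_pid(p, 16, 4) for p in (0, 4, 8, 12)])  # [0, 1, 2, 3]
```

The remap is a bijection over the grid when the block count divides evenly, so every logical tile is still computed exactly once.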
3. Optimization Methodology
SwizzlePerf orchestrates an iterative, fully automated optimization loop with the following structure:
- Prompted LLM Code Generation: The LLM is prompted with original kernel code, a summary of memory-locality constraints, a history of prior transformations, and detailed hardware context. This facilitates model understanding of both the architectural bottleneck and the goal condition for optimization.
- Metric-Guided Generation: Parsed profiling data—most notably L2 cache hit rate and, by extension, data reuse efficiency—is provided to the LLM. The prompt also includes architecture-specific scheduling and memory details.
- Model Output: For each iteration, the LLM outputs:
- A reasoning trace articulating failure/success of prior strategies and justifying changes.
- New, candidate code with an explicit swizzling transformation. An example is:
pid = (pid % num_xcds) * b_per_xcd + (pid // num_xcds)
- Validation and Profiling: The transformed code is compiled and runs correctness tests. Key performance metrics (e.g., updated L2 hit rate) are collected post-execution.
- Feedback and Iteration: Results and code variants are stored, informing the prompt for the next LLM cycle until a satisfactory solution is reached or further changes provide no substantive gain.
The optimization is thus grounded in measurable hardware-centric feedback rather than time-based or heuristic scoring.
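The loop above can be sketched as ordinary Python, with the LLM, test harness, and profiler passed in as callables. Every name here is an illustrative stand-in, not SwizzlePerf's actual API:

```python
def optimize_kernel(kernel_src, build_prompt, query_llm, is_correct, profile,
                    max_iters=10, target_hit_rate=0.95):
    """Metric-guided refinement: keep the best candidate by L2 hit rate,
    feeding every attempt (code, reasoning, metrics) into the next prompt."""
    history = []                                     # prior attempts for the prompt
    best_code, best_metrics = kernel_src, profile(kernel_src)
    for _ in range(max_iters):
        reasoning, candidate = query_llm(build_prompt(kernel_src, history))
        if not is_correct(candidate):                # validation gate: reject broken code
            history.append((candidate, reasoning, None))
            continue
        metrics = profile(candidate)                 # e.g. L2 hit rate from rocprofv3
        history.append((candidate, reasoning, metrics))
        if metrics["l2_hit_rate"] > best_metrics["l2_hit_rate"]:
            best_code, best_metrics = candidate, metrics
        if best_metrics["l2_hit_rate"] >= target_hit_rate:
            break                                    # satisfactory; stop iterating
    return best_code, best_metrics
```

Note that the stopping condition is a hardware metric (L2 hit rate), not wall-clock time, mirroring the metric-guided design described above.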
4. Empirical Performance and Results
SwizzlePerf’s hardware-centric approach yields substantial efficiency improvements across a diverse suite of 10 ML and scientific GPU kernels, including general matrix multiplication (GEMM), layer normalization, softmax, and stencil computations.
Key results include:
- SwizzlePerf generated optimal or near-optimal swizzling patterns in under 5 minutes for most kernels, including GEMM, versus the roughly two-week effort required of expert engineers for the equivalent manual codesign.
- Across tested benchmarks, 9 out of 10 kernels realized significant performance gains, with up to a 2.06× speedup observed (notably in transpose and other data-movement heavy kernels).
- L2 cache hit rate improvements of up to 70%, with some workloads approaching 100% L2 hit rate after applying the optimized mapping.
These improvements are visualized in the paper’s plots, which overlay L2 hit rate bars (original and post-optimization) with red lines indicating end-to-end kernel speedup. The data demonstrate direct correspondence between hardware-aware locality optimizations and observed runtime enhancements (Tschand et al., 27 Aug 2025).
5. Case Study: GEMM Kernel Optimization
A detailed case study examines SwizzlePerf’s application to a tiled GEMM kernel. In standard tiled GEMM, contiguous output tiles—despite having high data locality—tend to be scheduled to disparate XCDs due to round-robin workgroup assignment, leading to suboptimal cache reuse.
SwizzlePerf’s LLM deduced a remapping formula:

    pid = (pid % num_xcds) * b_per_xcd + (pid // num_xcds),  with  b_per_xcd = ceil(num_blocks / num_xcds).
This mapping co-locates adjacent output tiles onto the same XCD, optimizing for on-die data reuse. The LLM-produced code also integrated ceiling division to handle edge cases—a nuance that was missed by human experts in earlier manual solutions. Notably, computational parity with human performance engineers was achieved in under 5 minutes, illustrating both the speed and effectiveness of the approach.
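The value of ceiling division can be checked directly in a small sketch (illustrative Python, not the generated kernel): when the grid does not divide evenly across XCDs, floor division makes distinct workgroups collide on the same logical tile, whereas ceiling division keeps the remapping one-to-one (remapped ids that land past the grid end then only need a bounds guard in the kernel; the guard itself is an illustration here, not necessarily the paper's exact handling):

```python
import math

def remap(pid, num_xcds, b_per_xcd):
    return (pid % num_xcds) * b_per_xcd + (pid // num_xcds)

num_blocks, num_xcds = 10, 4  # grid size not a multiple of the XCD count

# Floor division: distinct hardware pids collide on the same logical tile,
# so some tiles are computed twice and others never.
floor_ids = [remap(p, num_xcds, num_blocks // num_xcds) for p in range(num_blocks)]
print(len(set(floor_ids)))    # 8 distinct ids for 10 workgroups: collisions

# Ceiling division keeps the remapping injective; ids past the grid end
# (>= num_blocks) can be skipped with a bounds check inside the kernel.
ceil_ids = [remap(p, num_xcds, math.ceil(num_blocks / num_xcds)) for p in range(num_blocks)]
print(len(set(ceil_ids)))     # 10 distinct ids: every tile owned exactly once
```

This is precisely the edge case the text notes was missed in earlier manual solutions.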
6. Implications and Future Directions
SwizzlePerf exemplifies a new paradigm in hardware–software co-design, moving automated kernel optimization closer to the practices and capabilities of human domain experts. By employing LLMs as performance engineering agents supplied with architecture, scheduling, profiling, and workload-specific context, code generation can be tuned toward hardware bottlenecks far more effectively than runtime-centric search methods.
Potential broad impacts include:
- Systematic power efficiency improvements: Enhanced cache-locality not only yields faster execution but also curtails off-chip memory access, reducing overall energy consumption.
- Generalization across platforms: The architecture-parameterized prompting and feedback loop provides a template for adapting to future GPU architectures with differing cache and scheduling topologies.
- Multimodal integration: The introduction of supplementary modalities (e.g., mapping pattern visualizations) may further close the gap with expert reasoning, allowing LLMs to synthesize and debug performance-critical code in more humanlike ways.
A plausible implication is that intelligent LLM agents, augmented with structured hardware feedback, will become a central tool in performance and power optimization of increasingly heterogeneous and complex hardware platforms.
7. Summary
SwizzlePerf delivers a hardware-aware, feedback-driven workflow for LLM-based GPU kernel optimization, achieving significant speedups and cache efficiency gains by exposing software optimizations to explicit architecture-aware constraints and direct profiling signals. Its methodology sets a precedent for future performance engineering agents capable of rapidly synthesizing high-quality, hardware-tailored code, and it provides a foundation for further research in automated, holistic hardware–software optimization strategies (Tschand et al., 27 Aug 2025).