- The paper presents Astra, a multi-agent LLM system that iteratively refines CUDA kernels through specialized agents, achieving an average 1.32× speedup over baseline implementations.
- Astra applies advanced optimization techniques, including loop transformations, vectorized memory access, and CUDA intrinsics, to improve kernel performance.
- Optimized kernels are validated through testing and profiling, and the observed speedups generalize across tensor shapes representative of real-world LLM serving workloads.
Astra: Multi-Agent LLM System for GPU Kernel Optimization
Introduction and Motivation
The paper presents Astra, a multi-agent system that leverages LLMs to optimize GPU kernels, specifically existing CUDA implementations extracted from SGLang, a production-grade LLM serving framework. Astra is motivated by the persistent challenge of writing high-performance GPU kernels, which are critical for efficient LLM training and inference: manual kernel tuning scales poorly and carries heavy engineering overhead, while compiler-based approaches struggle to adapt to new kernels and hardware. Prior LLM-driven efforts have focused on generating kernels from high-level specifications, whereas Astra directly optimizes real-world, deployed CUDA kernels.
Astra decomposes the kernel optimization workflow into specialized agent roles—code generation, testing, profiling, and planning—enabling systematic, iterative refinement of kernels. This multi-agent paradigm is designed to overcome the limitations of single-agent approaches, which struggle to balance correctness, performance, and the complexity of the optimization space.
Figure 1: Overview of Astra's multi-agent workflow for GPU kernel optimization, illustrating agent collaboration across code generation, testing, profiling, and planning.
Methodology
Task Definition
Astra formalizes CUDA kernel optimization as the search for an optimized kernel S′ that achieves lower runtime than a baseline S while maintaining functional correctness. Correctness is defined as output equivalence (within a floating-point tolerance) across a diverse test suite, and performance is measured by geometric mean speedup over representative input shapes.
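Stated compactly, with notation of our own beyond the paper's S and S′ (x_1, …, x_n are the representative input shapes, T denotes measured runtime, and ε the floating-point tolerance):

$$
\max_{i}\,\bigl\lVert S'(x_i) - S(x_i) \bigr\rVert_\infty \le \epsilon,
\qquad
\mathrm{Speedup}(S') \;=\; \left( \prod_{i=1}^{n} \frac{T_{S}(x_i)}{T_{S'}(x_i)} \right)^{1/n}.
$$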
Multi-Agent System Architecture
Astra's architecture consists of four agents:
- Testing Agent: Generates test cases and validates candidate kernels.
- Profiling Agent: Measures execution time and provides performance feedback.
- Planning Agent: Synthesizes correctness and performance signals to propose targeted code modifications.
- Coding Agent: Implements suggested changes to generate new kernel variants.
Agents interact in an iterative loop, with each round producing a candidate kernel, validating correctness, profiling performance, and logging results. The system is implemented using the OpenAI Agents SDK and powered by the o4-mini LLM, with experiments conducted on NVIDIA H100 hardware.
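As a rough sketch of this control flow, the loop might look like the following host-side pseudocode. All function names here are hypothetical stand-ins; in the actual system these roles are LLM agents driven through the OpenAI Agents SDK, not local C++ functions:

```cuda
#include <string>

// Candidate kernel plus its measured runtime (sentinel = not yet profiled).
struct Candidate {
    std::string cuda_source;
    double runtime_ms = 1e30;
};

// Hypothetical agent stubs; the real agents call an LLM and the CUDA toolchain.
std::string planning_agent(const Candidate& best) { return "hoist invariants"; }
Candidate   coding_agent(const std::string& plan) { return {"/* new kernel */"}; }
bool        testing_agent(const Candidate& c)     { return true; }  // do all tests pass?
double      profiling_agent(const Candidate& c)   { return 1.0; }   // runtime in ms

// One optimization round per iteration: plan -> code -> test -> profile.
Candidate optimize(Candidate best, int rounds) {
    for (int r = 0; r < rounds; ++r) {
        Candidate cand = coding_agent(planning_agent(best));
        if (!testing_agent(cand)) continue;       // incorrect variant: discard
        cand.runtime_ms = profiling_agent(cand);  // performance feedback
        if (cand.runtime_ms < best.runtime_ms) best = cand;  // keep fastest correct kernel
    }
    return best;
}
```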
Pre- and Post-Processing
Due to the complexity of SGLang kernels, manual pre-processing is performed to extract and simplify kernels for Astra's input. Post-processing integrates optimized kernels back into SGLang and validates them against the full framework, ensuring drop-in compatibility and accurate performance measurement.
Experimental Results
Astra is evaluated on three representative SGLang kernels: merge, fused, and silu. Across five optimization rounds, Astra achieves an average speedup of 1.32× over the baseline, with all optimized kernels passing correctness validation against manually constructed test suites. Notably, the multi-agent system outperforms a single-agent baseline, which achieves only 1.08× speedup on average. The performance gap widens with kernel complexity, highlighting the efficacy of agent specialization.
Optimization Strategies
Detailed case studies reveal that Astra's agents autonomously apply several advanced optimization techniques:
- Loop Transformations: Hoisting loop-invariant computations out of inner loops to reduce instruction count and improve parallelism (illustrated in the first sketch below).
- Memory Access Optimization: Employing vectorized loads (e.g., `__half2`) to increase effective memory bandwidth and reduce transaction overhead (also in the first sketch).
- CUDA Intrinsics: Utilizing warp-level primitives (`__shfl_down_sync`) for efficient reductions and register-resident computation (see the second sketch below).
- Fast Math Operations: Replacing standard library calls with device intrinsics (`__expf`, `__frcp_rn`, `__fmul_rn`) to lower arithmetic latency (also in the first sketch).
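Three of these techniques, hoisted invariants, `__half2` vectorization, and fast-math intrinsics, appear together in the minimal sketch below. This is our own illustration with hypothetical kernel and parameter names, not code from the paper:

```cuda
#include <cuda_fp16.h>

// Hypothetical elementwise kernel: out[i] = (x[i] / denom) * sigmoid(x[i]).
// Inputs are packed as __half2, so n2 is the element count divided by two.
__global__ void silu_scaled_half2(const __half2* __restrict__ x,
                                  __half2* __restrict__ out,
                                  float denom, int n2) {
    // Loop-invariant hoisting: the reciprocal is computed once per thread,
    // not once per element inside the loop.
    const float inv_denom = __frcp_rn(denom);
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < n2; i += gridDim.x * blockDim.x) {   // grid-stride loop
        // Vectorized load: a single 32-bit transaction covers two fp16 values.
        float2 v = __half22float2(x[i]);
        // Fast-math intrinsics replace expf() and division.
        float s0 = __frcp_rn(1.0f + __expf(-v.x));  // sigmoid(v.x)
        float s1 = __frcp_rn(1.0f + __expf(-v.y));  // sigmoid(v.y)
        float r0 = __fmul_rn(v.x * inv_denom, s0);
        float r1 = __fmul_rn(v.y * inv_denom, s1);
        out[i] = __floats2half2_rn(r0, r1);         // vectorized store
    }
}
```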
These strategies collectively contribute to the observed speedups and demonstrate the system's ability to discover non-trivial, hardware-aware optimizations.
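The warp-level reduction referenced in the CUDA Intrinsics item follows the standard `__shfl_down_sync` pattern, sketched here generically rather than as the paper's exact code:

```cuda
// Warp-wide sum without shared memory: values stay in registers and are
// exchanged directly between lanes. Each step halves the number of lanes
// contributing partial sums (16, 8, 4, 2, 1), so lane 0 holds the total
// after five steps.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // valid in lane 0
}
```

Keeping the reduction register-resident in this way avoids shared-memory traffic and synchronization barriers, which is where the latency savings come from.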
Discussion
Impact of Tensor Shapes
Astra's optimizations generalize across diverse tensor shapes, with speedups consistently observed for input dimensions representative of modern LLM workloads. Unlike compiler-based autotuning systems, Astra does not require shape-specific tuning, instead optimizing for general tensor computations.
Limitations and Future Directions
The current evaluation is limited to three kernels and requires manual pre- and post-processing. Automating these steps and extending support to additional frameworks (e.g., vLLM, PyTorch, TorchTitan) are important future directions. Further, integrating training-based methods (e.g., RL fine-tuning) could enhance agent capabilities beyond zero-shot prompting.
Implications and Future Prospects
Astra demonstrates that multi-agent LLM systems can autonomously optimize complex GPU kernels, achieving substantial performance gains through zero-shot prompting alone, without any model fine-tuning. This paradigm has significant implications for automating high-performance code optimization, potentially reducing engineering overhead and accelerating the deployment of efficient ML systems. The approach extends naturally to other domains that require specialized code optimization, such as tensor compilers, DSLs, and hardware-specific libraries.
Theoretical implications include the validation of agent specialization for complex software engineering tasks and the feasibility of LLM-driven, iterative optimization workflows. Practically, Astra's integration with production frameworks like SGLang suggests immediate applicability and impact in large-scale ML serving environments.
Conclusion
The paper introduces Astra, a multi-agent LLM system for GPU kernel optimization that directly targets existing CUDA implementations in production LLM serving frameworks. By decomposing the optimization process into specialized agent roles and leveraging iterative refinement, Astra achieves an average speedup of 1.32× while maintaining correctness. The system autonomously applies advanced optimization strategies, outperforms single-agent baselines, and generalizes across diverse tensor shapes. Astra establishes multi-agent LLM systems as a promising paradigm for automated, high-performance code optimization, with broad implications for both research and practical deployment in AI infrastructure.