Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

Published 26 Apr 2026 in cs.PF and cs.SE | (2604.23892v1)

Abstract: LLMs show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but stop short of generating actionable code changes. Consequently, performance optimization continues to be a time-intensive and manual endeavor, typically undertaken only by experts with detailed architectural understanding. To bridge this gap, we introduce Optimas, a modular, fully automated, end-to-end generative AI framework built on a multi-agent workflow. Optimas uses LLMs to map performance diagnostics from multiple reports to established, literature-backed code transformations, while unifying insight extraction, code generation, execution, and validation within a single pipeline. Across 3,410 real-world experiments on 10 benchmarks and two HPC mini-applications, Optimas generates 100% correct code and improves performance in over 98.82% of those experiments, achieving average gains of 8.02%-79.09% on NVIDIA GPUs.


Authors (3)

Summary

The paper introduces an automated, evidence-based framework using multi-agent LLMs to transform and optimize GPU code.
It integrates profiling, diagnostics, and structured prompt engineering to guide precise, performance-driven code edits.
Empirical results across 3,410 experiments show robust optimizations, with runtime improvements of up to 79% and 100% correct code generation.

Optimas: Analytics-Informed Generative Code Optimization with LLMs

Motivation and Problem Context

Performance optimization for high-performance computing (HPC) applications, particularly GPU-accelerated codes, demands deep architecture-specific knowledge. While state-of-the-art LLMs can automatically generate compilable code, their effectiveness for optimization is severely limited without access to detailed runtime performance signals. Traditional profiling tools detect critical bottlenecks but provide no mechanism for transforming these analytics into actionable code changes. This work addresses the gap by treating code optimization as a closed-loop, evidence-driven reasoning and generation process, encapsulated in the Optimas framework.

Optimas Framework Architecture

Optimas implements a modular, multi-agent system that orchestrates performance data extraction, evidence mining, prompt generation, LLM code transformation, and post-hoc verification, all in a fully automated pipeline.

Figure 1: High-level overview of the Optimas system architecture and multi-agent workflow.

The workflow consists of four stages:

Profiling Agent: Runs vendor tools (e.g., Nsight Compute) to collect extensive raw telemetry, including per-kernel runtime, memory and compute throughput, PC-level stall statistics, and hardware counter values.
Analysis Agent: Applies aggressive data reduction and saliency criteria to extract only highly impactful performance signals. Notably, it uses ensemble variants of Orthogonal Matching Pursuit (eOMP) for sparse hardware counter selection and derives kernel-level characteristics via automated Roofline and stall-type analyses.
Prompt Construction Agent: Assembles structured LLM input incorporating (1) salient code with line annotations, (2) structured diagnostics (roofline, stalls, counter summaries), and (3) strong guardrails restricting code edits to evidence-backed regions.
Evaluation Agent: Executes model-generated code, runs iterative compilation and functional testing, and uses refined EAR (Evidence-Aligned Reasoning) metrics to assess both correctness and fidelity to profiling evidence.
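The Roofline step of the Analysis Agent can be illustrated with the standard Roofline model, in which attainable throughput is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below is a minimal illustration under that textbook definition, not Optimas's implementation; `classify_kernel`, `RooflineResult`, and the peak figures in the usage are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RooflineResult:
    arithmetic_intensity: float  # FLOP per byte moved
    attainable_gflops: float     # roofline ceiling at this intensity
    bound: str                   # "memory" or "compute"


def classify_kernel(flops: float, bytes_moved: float,
                    peak_gflops: float, peak_gbps: float) -> RooflineResult:
    """Place one kernel on a standard Roofline (hypothetical helper)."""
    if bytes_moved <= 0 or peak_gbps <= 0:
        raise ValueError("bytes_moved and peak_gbps must be positive")
    intensity = flops / bytes_moved
    # Ridge point: the intensity at which the bandwidth roof meets the compute roof.
    ridge = peak_gflops / peak_gbps
    attainable = min(peak_gflops, intensity * peak_gbps)
    bound = "memory" if intensity < ridge else "compute"
    return RooflineResult(intensity, attainable, bound)


# Illustrative numbers only, not measurements from the paper.
triad_like = classify_kernel(flops=2e9, bytes_moved=24e9,
                             peak_gflops=1000.0, peak_gbps=500.0)
print(triad_like.bound)  # memory
```

A kernel classified as memory-bound would steer the prompt toward data-movement transformations rather than arithmetic ones, which is the kind of kernel-level characteristic the Analysis Agent forwards.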

A dashboard interface and YAML-driven CLI enable reproducible, extensible usage. New profilers and model backends are supported via simple configuration, ensuring broad applicability.

Figure 2: Optimas dashboard displaying end-to-end workflow management and code/diagnostic inspection.

Performance Evidence Extraction and Prompt Engineering

Kernel and Insight Selection

Profiler output is far too large to pass to an LLM directly, so Optimas applies thresholded filtering. Only kernels contributing a pre-specified fraction of execution time (default $\alpha=0.8$) are selected for further analysis. Within these, only stall types and counters responsible for at least $30\%$ of stalled cycles or overall runtime are forwarded to the LLM, which keeps prompt length under control and the prompt interpretable.
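The two thresholds can be sketched as follows. This is one plausible reading, assuming that $\alpha$ is a cumulative share of total runtime covered by the longest-running kernels; the summary does not spell out the exact rule. The function names, kernel names, and stall labels are hypothetical.

```python
def select_hot_kernels(kernel_times: dict[str, float], alpha: float = 0.8) -> list[str]:
    """Pick the longest-running kernels until their combined runtime share reaches alpha."""
    total = sum(kernel_times.values())
    if total <= 0:
        return []
    selected, covered = [], 0.0
    ranked = sorted(kernel_times.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in ranked:
        if covered >= alpha:
            break
        selected.append(name)
        covered += seconds / total
    return selected


def salient_stalls(stall_cycles: dict[str, float], min_share: float = 0.30) -> dict[str, float]:
    """Keep only stall types whose share of stalled cycles is at least min_share."""
    total = sum(stall_cycles.values())
    if total <= 0:
        return {}
    shares = {name: cycles / total for name, cycles in stall_cycles.items()}
    return {name: share for name, share in shares.items() if share >= min_share}


# Illustrative inputs, not data from the paper.
times = {"triad": 60.0, "init": 25.0, "reduce": 10.0, "copy": 5.0}
stalls = {"long_scoreboard": 55.0, "barrier": 35.0, "math_pipe_throttle": 10.0}
print(select_hot_kernels(times))
print(sorted(salient_stalls(stalls)))
```

With these inputs, the first two kernels already cover 85% of runtime, so the remaining two are dropped, and the 10% stall type is filtered out before prompt construction.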

Structured Prompting

The prompt includes four explicit sections: (1) code with line numbers, (2) stall analysis mapping source regions to stall types and impact, (3) interpretive summaries of selected hardware counters, and (4) Roofline diagnostics. The model is constrained (via instructions and examples) to confine edits to bottlenecked regions and to cite the triggering performance evidence explicitly in every transformation, which reduces hallucination and unsafe code rewriting.
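A prompt with these four sections plus guardrails might be assembled along the following lines. This is a hedged sketch: the section headings, the wording of the rules, and the `build_prompt` and `number_lines` helpers are assumptions for illustration, not the template Optimas uses.

```python
def number_lines(code: str) -> str:
    """Prefix each source line with its 1-based line number."""
    return "\n".join(f"{i:>4}: {line}"
                     for i, line in enumerate(code.splitlines(), start=1))


def build_prompt(code: str, stalls: list[dict], counters: dict[str, str],
                 roofline: str) -> str:
    """Assemble the four evidence sections followed by edit guardrails."""
    stall_lines = "\n".join(
        f"- lines {s['lines']}: {s['type']} ({s['share']:.0%} of stalled cycles)"
        for s in stalls)
    counter_lines = "\n".join(f"- {name}: {note}" for name, note in counters.items())
    sections = [
        "## Code (with line numbers)\n" + number_lines(code),
        "## Stall analysis\n" + stall_lines,
        "## Hardware counters\n" + counter_lines,
        "## Roofline\n" + roofline,
        "## Rules\n"
        "- Edit only the lines listed under 'Stall analysis'.\n"
        "- For every change, cite the diagnostic that motivates it.",
    ]
    return "\n\n".join(sections)


# Illustrative kernel and diagnostics, not taken from the paper.
kernel = (
    "__global__ void triad(float* a, const float* b, const float* c, float s, int n) {\n"
    "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "  if (i < n) a[i] = b[i] + s * c[i];\n"
    "}"
)
prompt = build_prompt(
    kernel,
    stalls=[{"lines": "3", "type": "long_scoreboard", "share": 0.62}],
    counters={"dram_throughput": "close to peak; kernel is bandwidth limited"},
    roofline="Memory-bound: arithmetic intensity is below the ridge point.",
)
print(prompt)
```

Keeping the rules in their own final section makes it easy to check afterwards whether each model edit stayed inside the listed lines and cited a diagnostic.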

Evaluation and Results

Large-scale Validation

The evaluation spans 3,410 experiments across 10 real-world GPU benchmarks and two DOE/HPC proxy apps, using three LLMs (GPT-5, Gemini-2.5-Pro, and Llama-3.1-70B) on NVIDIA H100 GPUs. Each experiment comprises profiling, annotation, LLM transformation, correctness validation, and repeated timing. Optimas generates correct code in 100% of runs and improves performance in more than $98.82\%$ of them, with verified runtime improvements ranging from $8.02\%$ to $79.09\%$ for microbenchmarks and up to $14.56\%$ for complex HPC proxies.
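For readers who want to reproduce this kind of figure, a runtime improvement from repeated timings can be computed as below. The summary does not state how the paper aggregates repeated runs, so the use of the median and the relative-reduction formula are assumptions, and `runtime_improvement_pct` is a hypothetical helper.

```python
from statistics import median


def runtime_improvement_pct(baseline_runs: list[float],
                            optimized_runs: list[float]) -> float:
    """Percent reduction in median runtime (assumed aggregation, see lead-in)."""
    base = median(baseline_runs)
    opt = median(optimized_runs)
    if base <= 0:
        raise ValueError("baseline runtime must be positive")
    return (base - opt) / base * 100.0


# Illustrative timings in milliseconds, not results from the paper.
gain = runtime_improvement_pct([10.0, 10.4, 10.0], [8.0, 8.1, 7.9])
print(f"{gain:.2f}% faster")
```

A negative result would flag a regression, which corresponds to the small share of experiments that did not improve.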

Figure 3: Runtime improvements for each application/kernel across diverse input configurations; blue rectangles highlight default configuration gains, green circles represent maxima, red diamonds minima.

Optimas demonstrates that its LLM-generated transformations are robust across input regimes: of 68 tested configurations over 8 applications, only a single configuration (Triad kernel, large array, few repetitions) fails to yield speedup, attributable to a fundamental shift in the application’s bottleneck regime.

Ablation Studies

Systematic ablation quantifies the value of each diagnostic modality. A single source (roofline, stalls, or counters) achieves substantial speedup in isolation ($3.81\%$ to $79.09\%$ for GPT-5), but combining sources yields the largest gains for complex or multi-kernel applications. Gains do not grow monotonically as sources are added, because prompts get longer and signals from different sources can conflict; even so, adding sources consistently makes the coverage of optimization suggestions more robust.

Evidence-Driven Optimization

Evidence-Aligned Reasoning metrics quantify edit quality in terms of (1) diagnostic reference coverage, (2) localization to bottleneck lines, and (3) observed post-fix directional alignment: e.g., optimizations aiming at memory stalls must empirically reduce associated counter values post-edit.
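The three components can be made concrete with a small scoring sketch. The definitions below are assumptions chosen for illustration (the summary does not give the paper's formulas): coverage is the share of edits citing a diagnostic that was in the prompt, localization is the share of edited lines inside bottleneck lines, and alignment is the share of targeted counters that dropped after the edit. `Edit` and `ear_scores` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    lines: set[int]                                  # source lines the edit touched
    cited: set[str]                                  # diagnostic IDs the model cited
    targets: set[str] = field(default_factory=set)   # counters the edit should lower


def ear_scores(edits: list[Edit], prompt_diagnostics: set[str],
               bottleneck_lines: set[int],
               before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Score a set of edits on coverage, localization, and directional alignment."""
    if not edits:
        return {"coverage": 0.0, "localization": 0.0, "alignment": 0.0}
    cited_ok = sum(1 for e in edits if e.cited & prompt_diagnostics)
    edited = set().union(*(e.lines for e in edits))
    targets = set().union(*(e.targets for e in edits))
    improved = sum(1 for c in targets if after[c] < before[c])
    return {
        "coverage": cited_ok / len(edits),
        "localization": len(edited & bottleneck_lines) / len(edited) if edited else 0.0,
        "alignment": improved / len(targets) if targets else 0.0,
    }


# Illustrative inputs, not data from the paper.
scores = ear_scores(
    edits=[Edit({3}, {"stall:long_scoreboard"}, {"long_scoreboard"}),
           Edit({7, 8}, set())],
    prompt_diagnostics={"stall:long_scoreboard", "roofline:memory_bound"},
    bottleneck_lines={3, 4},
    before={"long_scoreboard": 0.62},
    after={"long_scoreboard": 0.31},
)
print(scores)
```

In this example the second edit cites no evidence and touches lines outside the bottleneck, so it lowers both coverage and localization even though the targeted stall counter improved.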

Figure 4: Dominant stall occurrences for the Accuracy kernel before and after optimization, with substantial reductions where __restrict__ qualifiers were applied, validating direct bottleneck mitigation.

Notable claim: The default prompt containing only code and no diagnostics produced zero valid optimizations, which challenges the premise of conventional code-completion-based approaches that source code alone is sufficient context.

Comparative Analysis of LLMs

GPT-5 yields the highest evidence-aligned coverage ($83.1\%$) and speedup, implementing the largest set of precise, evidence-cited transformations; Gemini-2.5-Pro closely follows, albeit with marginally lower consistency and slightly higher variability. Llama-3.1-70B, while less aggressive and covering fewer cases, produces no hallucinations and maintains strict adherence to constraints. All models benefit from evidence-structured prompts.

Practical Implications and Theoretical Contributions

Optimas fundamentally alters the role of LLMs in scientific software optimization: it elevates code generation from syntax-focused or shallow learned transformations to a fully evidence-driven, causal, and closed-loop process. This approach decouples optimization from expert knowledge bottlenecks and provides reproducible workflows and artifacts for research and education.

Profiling time remains the dominant cost because the diagnostics are exhaustive, but optimization turnaround drops from hours or days of expert labor to minutes (a fraction of the profiling time in the case studies). The rigorous pipeline established by Optimas offers a reusable methodology for retrieval-augmented LLM-based code manipulation across other domains with similarly structured analytics-to-edit workflows.

Limitations and Prospects

The multi-modal, evidence-centric approach demonstrates that generative AI can automate expert-level optimization workflows, but it currently relies on diagnostic completeness and prompt engineering discipline. Extending the closed feedback loop, so that the system autonomously re-invokes profiling and iterates on its optimizations, remains an open challenge. The framework’s hardware/software-agnostic design and public optimization corpus suggest it will serve as a strong baseline for future research in automated systems programming, compiler optimization, and LLM-guided performance engineering.

Conclusion

Optimas establishes a scalable, analytics-informed, LLM-agnostic system for full-cycle, automated, and verifiable performance optimization, closing the empirical-to-actionable loop. Across thousands of empirical runs, the approach yields high rates of correct, robust, and transferable optimizations, demonstrating that aligning generative models with multi-modal performance analytics effectively democratizes expert-level performance engineering for GPU computing and broader HPC domains. (2604.23892)
