Finding Missed Code Size Optimizations in Compilers using LLMs
This paper explores a novel application of LLMs: identifying missed code size optimizations in compilers. Compiler testing has traditionally focused on functional correctness rather than optimization quality, so inefficient code generation, particularly with respect to code size, often goes undetected. The authors introduce an LLM-driven, mutation-based testing methodology that effectively surfaces such missed optimizations.
Traditional compiler testing and fuzzing tools, such as CSmith, rely on complex, language-specific random program generators that are resource-intensive to build and maintain, and they typically produce large, hard-to-interpret test cases. In contrast, the authors start from a trivial seed program and use an LLM to mutate it incrementally, sidestepping the need for an elaborate generator and keeping test cases small and readable.
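A minimal sketch of that mutation loop in Python might look like the following; the `ask_llm` helper, the prompt wording, and the use of the `cc` and binutils `size` command-line tools are illustrative assumptions, not the authors' actual implementation.

```python
import os
import subprocess
import tempfile

SEED = "int main(void) { return 0; }\n"

def compiled_size(source, flags=("-Os",), compiler="cc"):
    """Compile a C source string and return the size of its text section (bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        src, obj = os.path.join(tmp, "test.c"), os.path.join(tmp, "test.o")
        with open(src, "w") as f:
            f.write(source)
        subprocess.run([compiler, *flags, "-c", src, "-o", obj], check=True)
        out = subprocess.run(["size", obj], capture_output=True, text=True, check=True)
        # binutils `size` prints a header line, then: text data bss dec hex filename
        return int(out.stdout.splitlines()[1].split()[0])

def ask_llm(prompt):
    """Placeholder for an LLM API call; the real system would send `prompt`
    to a model and return the mutated program text."""
    raise NotImplementedError

def mutate(source):
    """Ask the model for a small incremental mutation of the current program."""
    return ask_llm("Extend this C program with a small amount of new code:\n" + source)

program = SEED
for _ in range(10):                   # a short mutation chain per seed
    program = mutate(program)
    size = compiled_size(program)     # each mutant then feeds the differential oracles
```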
The paper articulates four differential testing strategies for identifying missed optimizations (sketched in code after the list):
- Dead Code Differential Testing: Checks whether mutations consisting of provably dead code alter the size of the compiled output. If adding dead code changes the emitted size, the compiler failed to eliminate it, and a missed optimization is flagged.
- Optimization Pipeline Differential Testing: Compares output sizes across optimization levels (e.g., -O3 vs. -Oz). If the size-oriented pipeline emits larger code than the speed-oriented one, a potential inefficiency is flagged.
- Single-Compiler Differential Testing: Compares different versions of the same compiler to identify code size regressions introduced between releases.
- Multi-Compiler Differential Testing: Compiles the same input with different compilers. A significant discrepancy in code size hints at a missed optimization opportunity in the compiler producing the larger output.
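Under the same assumptions, the four oracles can be sketched as simple size comparisons built on the `compiled_size` helper from the earlier snippet; the compiler names, flags, and the 1.5x threshold below are placeholders rather than the paper's exact settings.

```python
def dead_code_oracle(original, mutant_with_dead_code):
    """Dead code differential: inserting provably dead code should not grow the binary."""
    return compiled_size(mutant_with_dead_code) > compiled_size(original)

def pipeline_oracle(source):
    """Optimization pipeline differential: -Oz should never emit more code than -O3."""
    return compiled_size(source, ("-Oz",)) > compiled_size(source, ("-O3",))

def single_compiler_oracle(source, old="gcc-13", new="gcc-14"):
    """Single-compiler differential: a newer release emitting larger code is a regression."""
    return compiled_size(source, compiler=new) > compiled_size(source, compiler=old)

def multi_compiler_oracle(source, compilers=("gcc", "clang"), ratio=1.5):
    """Multi-compiler differential: a large size gap between compilers on the same input."""
    sizes = [compiled_size(source, compiler=c) for c in compilers]
    return max(sizes) > ratio * min(sizes)
```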
The methodology identified 24 bugs in production C/C++, Swift, and Rust compilers, demonstrating its efficacy. For example, it found and confirmed a GCC bug in which value range analysis failed to eliminate dead control structures. It also surfaced optimizations that were applied inconsistently across compiler versions or that diverged from the results of comparable optimization pipelines.
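To give a flavor of the value-range class of bug (a hypothetical example in the spirit of the reported issue, not the authors' actual reproducer), a branch that range analysis can prove unreachable should cost nothing in the emitted code; the check below reuses the helpers sketched earlier.

```python
WITH_DEAD_BRANCH = """
int f(int x) {
    if (x > 10) {
        if (x < 5)      /* unreachable: range analysis knows x > 10 here */
            return 100;
        return 1;
    }
    return 0;
}
"""

SIMPLIFIED = """
int f(int x) {
    if (x > 10)
        return 1;
    return 0;
}
"""

# If the provably dead branch survives into the object file, the two sizes
# differ and the dead-code oracle above flags a missed optimization.
if compiled_size(WITH_DEAD_BRANCH) > compiled_size(SIMPLIFIED):
    print("potential missed code size optimization")
```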
A significant contribution of this work is its extensibility across programming languages. Because the LLM handles program mutation, adapting the framework to a new language largely amounts to adjusting the test generation scripts, which the authors demonstrate for Rust and Swift.
While this paper focuses on code size optimizations, it opens the door to testing runtime performance and other optimization objectives. Improved heuristics, advances in LLM capability, and tighter coordination between the model and the testing harness all look promising for this research direction, and future work could explore richer prompt engineering and broader classes of compiler optimizations.
This research presents a significant step forward in compiler testing methodology by applying AI-driven techniques to optimization inefficiencies, an area that has historically received little structured examination. The approach's small implementation footprint and demonstrated ability to find real bugs point to practical applications and a new direction at the growing intersection of AI and compiler technology.