- The paper introduces Naturalized Execution Tuning (NExT), which integrates program execution traces with chain-of-thought reasoning to enhance program repair.
- It employs weakly-supervised self-training that iteratively generates rationales and code corrections, filtering them by whether the fixes pass unit tests.
- Experiments on MBPP- and HumanEval-based repair datasets show absolute improvements of 26.1% and 14.3% in code fix rate over the base model.
Enhancing LLMs' Reasoning with Execution Information for Program Repair
Introduction to Naturalized Execution Tuning (NExT)
The paper details a novel approach to enhancing the capability of LLMs on complex software engineering tasks, specifically program repair guided by execution information. The introduced framework, Naturalized Execution Tuning (NExT), teaches LLMs to reason about code execution by incorporating program execution traces into chain-of-thought (CoT) reasoning, so the model produces natural language rationales grounded in runtime behavior.
Key Concepts and Implementations
Task and Challenges Addressed:
- NExT addresses the challenge of enabling LLMs to reason about program execution when solving programming tasks such as program repair.
- By providing the model with detailed execution traces (line-by-line variable values and program states), NExT aims to improve its ability to detect and rectify errors in code; a minimal example of a repair instance follows this list.
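To make the repair task concrete, below is a minimal, hypothetical instance of the kind NExT operates on: a buggy Python function paired with the failing unit test that exposes the defect. The function and test are illustrative only and do not come from the paper's benchmarks.

```python
# Hypothetical program repair instance: buggy code plus a failing unit test.
# The repair task is to produce a natural language rationale for the failure
# and a corrected program that makes the test pass.

def remove_duplicates(items):
    """Return the list with duplicates removed, preserving order."""
    seen = set()
    result = []
    for item in items:
        if item in seen:          # bug: only *repeated* items get appended
            result.append(item)
        seen.add(item)
    return result


def test_remove_duplicates():
    # Fails on the buggy version, which returns [2] instead of [1, 2, 3].
    assert remove_duplicates([1, 2, 2, 3]) == [1, 2, 3]
```

A correct fix would append `item` when it is *not* yet in `seen`; the rationale the model is asked to produce should explain why the current condition is inverted.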
Methodology Overview:
- The proposed method finetunes LLMs with weakly-supervised self-training.
- Each iteration generates and selects natural language (NL) rationales together with the resulting code corrections, which are verified against unit tests for correctness.
- The approach repeats this cycle of sampling, test-based filtering, and finetuning over several rounds, progressively improving the LLM's repair capability; a sketch of one such round follows this list.
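A minimal sketch of one self-training round, under the assumption that `sample_fn`, `run_tests_fn`, and `finetune_fn` are hypothetical stand-ins for the model-sampling, test-execution, and training machinery, and that `task.unit_tests` is an assumed attribute; none of these names come from the paper.

```python
# Sketch of one weakly-supervised self-training iteration (names are
# illustrative; the paper's actual pipeline is not reproduced here).

def self_training_iteration(model, repair_tasks, sample_fn, run_tests_fn,
                            finetune_fn, num_samples=8):
    """One round of sampling, test-based filtering, and finetuning."""
    finetuning_examples = []
    for task in repair_tasks:
        for _ in range(num_samples):
            # The model proposes an NL rationale and a candidate code fix,
            # conditioned on the buggy program (and, during training, its
            # execution trace).
            rationale, fixed_code = sample_fn(model, task)

            # Weak supervision: keep only samples whose fix passes the tests.
            if run_tests_fn(fixed_code, task.unit_tests):
                finetuning_examples.append((task, rationale, fixed_code))

    # Finetune on the test-validated (rationale, fix) pairs; the next round
    # samples from this improved model.
    return finetune_fn(model, finetuning_examples)
```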
Execution Traces Representation:
- The model consumes a compact inline representation of execution traces rendered as code comments, a simple yet efficient way to provide execution context (illustrated below).
- This lets models reason about complex execution behavior through ordinary text comprehension, without requiring specialized architectures.
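As an illustration (the exact trace syntax below is a guess rather than the paper's concrete format), the buggy function from the earlier example might be shown to the model with per-line variable states for the failing call interleaved as comments:

```python
# Hypothetical inline trace for the failing call remove_duplicates([1, 2, 2, 3]).
# Variable values observed during execution are rendered as comments on the
# lines that produced them.

def remove_duplicates(items):        # items = [1, 2, 2, 3]
    seen = set()                     # seen = set()
    result = []                      # result = []
    for item in items:               # item = 1 -> 2 -> 2 -> 3
        if item in seen:             # False -> False -> True -> False
            result.append(item)      # result = [2]  (runs only on the 3rd pass)
        seen.add(item)               # seen = {1} -> {1, 2} -> {1, 2} -> {1, 2, 3}
    return result                    # returns [2]; the test expected [1, 2, 3]
```

From such a trace, a model can read off that `result.append` fires only when a duplicate is seen, which points directly at the inverted condition.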
Experimental Validation
Datasets and Models:
- The paper uses two primary datasets, a Python program repair benchmark derived from MBPP and a HumanEval-based fix dataset (HE), to train and evaluate the LLMs enhanced by NExT.
- PaLM 2-L serves as the base LLM that NExT finetunes.
Results and Observations:
- Models enhanced with NExT show a substantial improvement in program fix rate, with absolute gains of 26.1% and 14.3% on the MBPP and HE datasets, respectively.
- Even when execution traces are unavailable at test time, the trained models outperform the base model, indicating that the learned execution-reasoning ability transfers and generalizes.
Comparative Analysis:
- NExT generally matches or exceeds the performance of several strong LLM baselines.
- Proxy-based evaluations show that the generated rationales help not only the main model but also smaller LLMs achieve higher success rates on code fixes; a sketch of such an evaluation follows below.
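A hedged sketch of what such a proxy-based evaluation could look like, assuming hypothetical helpers `sample_fix_fn` and `run_tests_fn` and an assumed `task.unit_tests` attribute; the paper's exact protocol is not reproduced here.

```python
# Sketch: measure rationale quality by whether it helps a smaller proxy
# model produce a fix that passes the unit tests.

def proxy_fix_rate(proxy_model, repair_tasks, rationales,
                   sample_fix_fn, run_tests_fn):
    """Fraction of tasks the proxy model fixes when given a rationale."""
    fixed = 0
    for task, rationale in zip(repair_tasks, rationales):
        # Condition the proxy model on the buggy program *and* the rationale,
        # then check the proposed fix against the task's unit tests.
        candidate = sample_fix_fn(proxy_model, task, rationale)
        if run_tests_fn(candidate, task.unit_tests):
            fixed += 1
    return fixed / len(repair_tasks)
```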
Conclusion and Future Directions
NExT presents a promising approach to substantially improving the capabilities of LLMs in software development applications, particularly automated debugging and program repair. By expressing execution semantics in natural language, the approach integrates runtime behavior into the model's reasoning, improving both the interpretability and the functional correctness of its outputs. Future work may extend NExT to a wider range of programming languages and more diverse coding tasks, incorporate more dynamic aspects of program execution, and explore how such models scale to larger and more complex datasets.