RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (2410.02089v2)

Published 2 Oct 2024 in cs.CL and cs.AI

Abstract: LLMs deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.

Summary

The paper introduces RLEF, a reinforcement learning method that teaches LLMs to use execution feedback to improve their code over multiple turns.
It models code synthesis as a partially observable MDP and uses PPO to optimize a binary reward based on test case outcomes.
Benchmarks on CodeContests show state-of-the-art results for both 8B and 70B models with a significantly smaller sampling budget.

Overview of RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

The paper introduces an end-to-end reinforcement learning (RL) approach that improves the performance of LLMs on code synthesis tasks by leveraging execution feedback. The authors focus on teaching LLMs to use feedback iteratively, so that they can refine their solutions to competitive programming problems. The proposed method, Reinforcement Learning with Execution Feedback (RLEF), yields substantial improvements over existing models on code generation tasks.

Methodology

The primary innovation in this research is framing code generation as an iterative task. The LLM is prompted repeatedly to create a solution, have it evaluated, and refine it based on the resulting execution feedback. This feedback grounds the model's outputs and enables more accurate and efficient problem-solving. The policy is optimized with Proximal Policy Optimization (PPO), an algorithm widely used for aligning LLMs.
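The create, evaluate, refine loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_tests`, `refine`, `toy_policy`, the `solve` entry point, the three-turn limit, and the feedback message format are all hypothetical names and choices introduced here, and the bare `exec` call stands in for the sandboxed execution a real system would need.

```python
from typing import Callable, List, Tuple

Test = Tuple[tuple, object]            # (arguments, expected result)
Policy = Callable[[List[str]], str]    # conversation history -> candidate code


def run_tests(code: str, tests: List[Test], entry: str = "solve") -> List[str]:
    """Execute candidate code and return one message per failing test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy stand-in; real systems isolate execution
        func = namespace[entry]
    except Exception as exc:
        return [f"could not load solution: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{entry}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{entry}{args} returned {got!r}, expected {expected!r}")
    return failures


def refine(policy: Policy, problem: str, tests: List[Test],
           max_turns: int = 3) -> Tuple[str, int]:
    """Query the policy until the tests pass or the turn limit is reached."""
    history = [problem]
    code = ""
    for turn in range(1, max_turns + 1):
        code = policy(history)
        failures = run_tests(code, tests)
        if not failures:
            return code, turn
        # Append the attempt and its execution feedback for the next turn.
        history.append(code)
        history.append("Feedback: " + "; ".join(failures))
    return code, max_turns


def toy_policy(history: List[str]) -> str:
    """Hypothetical policy: buggy at first, correct once it has seen feedback."""
    if any(message.startswith("Feedback:") for message in history):
        return "def solve(a, b):\n    return a + b\n"
    return "def solve(a, b):\n    return a - b\n"


code, turns = refine(toy_policy, "Add two integers.", [((1, 2), 3), ((0, 0), 0)])
print(turns)  # 2: the first attempt fails a test, the second uses the feedback
```

In this toy setup the policy is a hand-written function; in the setting the paper describes, that role is played by the LLM, and the loop's history is the conversation it is conditioned on.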

The code generation task is modeled as a partially observable Markov decision process (POMDP), with the LLM acting as the policy. Execution feedback from each attempt informs the next generation, and training optimizes a binary reward that indicates whether the final solution passes held-out test cases.
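As a rough sketch of the reward structure described above, assume an episode is a sequence of turns in which the policy observes only the prompt and execution feedback, never the held-out tests (one plausible reading of the partial observability), and receives a single binary reward at the end. The `Turn` type, the `episode_rewards` helper, and the concrete reward values 1.0 and 0.0 are assumptions made for illustration; this summary does not state the paper's actual reward values or any additional penalty terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    observation: str  # prompt or execution feedback visible to the policy
    action: str       # code emitted by the policy in this turn


def episode_rewards(turns: List[Turn], final_passes_held_out: bool,
                    pass_reward: float = 1.0,
                    fail_reward: float = 0.0) -> List[float]:
    """Return per-turn rewards: zero everywhere except the final turn."""
    if not turns:
        return []
    rewards = [0.0] * len(turns)
    # Only the final solution is scored against the held-out tests.
    rewards[-1] = pass_reward if final_passes_held_out else fail_reward
    return rewards


episode = [
    Turn("Add two integers.", "def solve(a, b): return a - b"),
    Turn("Feedback: solve(1, 2) returned -1, expected 3",
         "def solve(a, b): return a + b"),
]
print(episode_rewards(episode, final_passes_held_out=True))  # [0.0, 1.0]
```

Because intermediate turns earn nothing on their own, the only way for the policy to raise its return in this sketch is to end the episode with a solution that passes.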

Experimentation and Results

The authors conducted extensive benchmarking on the CodeContests dataset, a challenging set of competitive programming problems. They achieved new state-of-the-art results by enhancing both small (8B parameters) and large (70B) models, notably reducing the number of samples required for successful code generation by an order of magnitude.
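To make the notion of a sampling budget concrete, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them were correct. This estimator is brought in here only as a familiar way to relate solve rate to the number of samples; the metric actually reported in the paper may differ, and the function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 2 correct samples out of 4, a budget of 2 succeeds 5 times out of 6.
print(round(pass_at_k(n=4, c=2, k=2), 4))  # 0.8333
```

Under this view, a model that reaches the same solve rate with a tenth of the samples is one whose per-sample success probability is much higher, which is the kind of gain the results above describe.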

After RLEF training, the 70B model significantly outperformed previously reported results on CodeContests while using a smaller sampling budget than prior methods such as AlphaCodium and MapCoder. The 8B model improved as well, surpassing the AlphaCode 9B model with a drastically smaller number of samples.

Inference-time Behavior

The paper also compares model behavior in iterative, multi-turn settings with static, single-turn evaluation. RLEF-trained models were better at correcting errors and produced more diverse solutions across successive generations. The authors attribute this mainly to effective use of execution feedback, as shown by ablations in which the feedback given to the model is random.
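One simple way to set up a random-feedback ablation of the kind mentioned above is to give each episode the feedback that belongs to a different episode, then compare solve rates against runs with the true feedback. The sketch below is an assumption about how such a control could be built, not a description of the paper's procedure; `randomize_feedback` is a hypothetical helper.

```python
import random
from typing import List, Sequence


def randomize_feedback(feedback: Sequence[str], rng: random.Random) -> List[str]:
    """Reassign feedback so that no episode keeps its own message.

    A rotation by a random non-zero offset guarantees that every position
    receives the message from a different position.
    """
    n = len(feedback)
    if n < 2:
        raise ValueError("need at least two episodes to swap feedback")
    offset = rng.randrange(1, n)
    return [feedback[(i + offset) % n] for i in range(n)]


true_feedback = [
    "problem A: wrong answer on test 2",
    "problem B: time limit exceeded",
    "problem C: runtime error on test 1",
]
mismatched = randomize_feedback(true_feedback, random.Random(0))
```

If a model's solve rate barely changes when it receives `mismatched` instead of `true_feedback`, it is not really using the feedback; a clear drop suggests that it is.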

Implications and Future Directions

The findings indicate that incorporating execution feedback into the training process via RLEF can significantly enhance LLMs' performance in iterative and multi-turn environments. This ability to iteratively improve code is crucial for applications requiring autonomous or semi-autonomous operation, such as software development and quality control in programming tasks.

In future work, similar training methods could be applied to more complex environments, potentially extending beyond code synthesis to other domains that require iterative improvement and integration of feedback. The work also highlights the potential of reinforcement learning to align LLMs with task-specific requirements.

Conclusion

This research presents a well-defined and effective method to leverage execution feedback in LLM environments, advancing the state of the art in code synthesis tasks. The implications are significant, suggesting a viable path forward in using RL to enhance model capabilities in settings where iterative refinement and grounding in feedback are essential.

Authors (7)
