Introduction
Progress in LLMs has significantly influenced code generation, particularly through the integration of reinforcement learning (RL) with compiler feedback. Despite these advances, RL struggles when LLMs must generate long code sequences in response to complex human requirements. In addition, optimizing the LLM on unexecuted code snippets is ineffective, since those snippets never contribute to the unit-test reward. A novel RL framework, StepCoder, is presented to tackle these challenges with two main components: a Curriculum of Code Completion Subtasks (CCCS) and Fine-Grained Optimization (FGO). The paper also introduces APPS+, a dataset built for RL training in code generation whose unit tests are verified for correctness, which facilitates more reliable model training.
Reinforcement Learning Challenges
Reinforcement learning for code generation must cope with long action sequences and sparse rewards arising from the intricacy of human requirements. Conventional approaches, such as PPO and other actor-critic methods, optimize the policy using unit-test feedback; however, the compiler can only report on code that actually executes, so this feedback has limited coverage and makes RL exploration arduous. StepCoder addresses these barriers with two complementary techniques: CCCS, which simplifies exploration by incrementally increasing task difficulty, and FGO, which refines optimization by computing the loss only over executed code snippets.
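To make the sparse-reward problem concrete, the sketch below scores a generated program purely by running it against unit tests, which is the only signal a compiler-feedback RL loop receives. The helper name `unit_test_reward` and the specific reward values are illustrative assumptions, not StepCoder's exact shaping; they merely reflect the coarse pass/fail granularity described above.

```python
import os
import subprocess
import tempfile

def unit_test_reward(code, tests, timeout=5.0):
    """Execute `code` against (stdin, expected stdout) pairs and return one
    scalar reward. Values are illustrative: passing every test is the only
    positive outcome, so most exploration receives a flat negative signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        for stdin_data, expected in tests:
            try:
                result = subprocess.run(
                    ["python", path], input=stdin_data,
                    capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return -0.6  # hung or too slow
            if result.returncode != 0:
                return -1.0  # crashed: compile or runtime error
            if result.stdout.strip() != expected.strip():
                return -0.3  # ran, but produced a wrong answer
        return 1.0  # passed every test
    finally:
        os.remove(path)

# Example: a correct echo program earns the full reward.
print(unit_test_reward("print(input())", [("hello", "hello")]))  # 1.0
```

Because almost every sampled program lands in one of the negative buckets, the reward surface is nearly flat over long sequences, which is exactly the exploration difficulty CCCS and FGO target.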
StepCoder Framework
StepCoder's CCCS component breaks a long code generation task into smaller, manageable code-completion subtasks, forming a curriculum that eases exploration: RL begins from sequences where most of the code is already provided and the model completes progressively longer portions as training advances. FGO, in turn, sharpens optimization by masking tokens from unexecuted code sections when computing the loss, so the model is updated only on code the unit tests actually exercised. Together, these two components let StepCoder refine the LLM's output space more effectively and surpass state-of-the-art methods on the corresponding benchmarks; minimal sketches of both ideas follow.
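First, a minimal sketch of the CCCS idea, assuming access to a canonical solution for each problem: the prompt conditions the model on a prefix of that solution, and the prefix shrinks as the curriculum level rises. The function name `cccs_prompt` and the naive line-based split are assumptions made for brevity; the paper segments code at more meaningful structural boundaries.

```python
def cccs_prompt(requirement: str, canonical_solution: str,
                level: int, num_levels: int) -> str:
    """Build a code-completion subtask: at level 0 nearly the whole reference
    solution is given; at the final level the model writes everything itself."""
    lines = canonical_solution.splitlines()
    keep = len(lines) * (num_levels - level) // num_levels
    prefix = "\n".join(lines[:keep])
    return f"{requirement}\n{prefix}"
```

Second, a sketch of FGO-style masking, assuming a `line_of_token` mapping from each generated token to its source line (bookkeeping not detailed here): lines never reached during unit-test execution are zeroed out of the policy-gradient loss, so unexecuted code cannot distort the update.

```python
import sys
import torch

def executed_lines(code: str) -> set[int]:
    """Trace which lines of the generated program actually run; only these
    lines carry information about the unit-test reward."""
    hit: set[int] = set()

    def tracer(frame, event, arg):
        if frame.f_code.co_filename == "<generated>" and event == "line":
            hit.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        exec(compile(code, "<generated>", "exec"), {})
    except Exception:
        pass  # a failing program still leaves a partial trace
    finally:
        sys.settrace(None)
    return hit

def fgo_masked_loss(logprobs: torch.Tensor, advantages: torch.Tensor,
                    line_of_token: list[int], executed: set[int]) -> torch.Tensor:
    """Vanilla policy-gradient loss restricted to executed tokens."""
    mask = torch.tensor([float(ln in executed) for ln in line_of_token])
    return -(mask * logprobs * advantages).sum() / mask.sum().clamp(min=1.0)
```

The design choice in both sketches is the same: shrink the effective problem, either by shortening what must be generated (CCCS) or by shortening what is penalized (FGO).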
Empirical Results and Dataset Significance
StepCoder was evaluated on the curated APPS+ dataset, where it improves over existing methods in both exploration efficiency and the quality of the generated code. APPS+ itself provides a rigorous evaluation platform and a valuable baseline for integrating RL into LLMs. On the widely used MBPP and HumanEval benchmarks, StepCoder likewise outperforms other RL-based methods, confirming its effectiveness. This success is primarily attributable to improved exploration during reinforcement learning, positioning StepCoder as a potent framework for enhancing the code generation capabilities of LLMs.