StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback (2402.01391v2)

Published 2 Feb 2024 in cs.SE and cs.CL

Abstract: The advancement of LLMs has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. Our dataset APPS+ and StepCoder are available online.

Introduction

The progress in LLMs has significantly influenced the field of code generation, particularly through the integration of reinforcement learning (RL) with compiler feedback. Despite these advances, RL exploration remains difficult because LLMs generate lengthy code in response to complex human requirements. Additionally, optimizing LLMs on unexecuted code snippets is ineffective, since those snippets never contribute to the unit-test reward. A novel RL framework, StepCoder, is presented to tackle these challenges with two main components: a Curriculum of Code Completion Subtasks (CCCS) and Fine-Grained Optimization (FGO). The paper also introduces APPS+, a dataset designed for RL training in code generation whose unit tests are manually verified for correctness, enabling more reliable model training.

Reinforcement Learning Challenges

Reinforcement learning (RL) for code generation must cope with long output sequences and sparse rewards arising from the intricacies of human requirements. Conventional approaches such as PPO and other actor-critic methods optimize the policy using unit-test feedback; however, the reward signal is sparse over long generations, and the unit tests may not execute every generated snippet, so the optimization signal derived from compiler feedback is incomplete and exploration remains arduous. StepCoder addresses these barriers with two complementary mechanisms: CCCS, which simplifies exploration by incrementally increasing task difficulty, and FGO, which refines model optimization by focusing solely on executed code snippets.
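
To make the curriculum idea concrete, here is a minimal sketch, not the authors' implementation: a code-completion subtask is formed by conditioning the policy on the problem prompt plus a prefix of a canonical solution, and that prefix shrinks as training advances. The helper names (`build_subtask`, `curriculum`, `difficulty`, `stages`) are illustrative assumptions.

```python
def build_subtask(prompt: str, canonical_solution: str, difficulty: float) -> str:
    """Build the conditioning text for one curriculum stage.

    difficulty in [0, 1]: 0 means the model only completes the final lines,
    1 means the model must write the whole program from the prompt alone.
    """
    lines = canonical_solution.splitlines(keepends=True)
    keep = int(round(len(lines) * (1.0 - difficulty)))  # prefix handed to the model
    prefix = "".join(lines[:keep])
    return prompt + "\n" + prefix


def curriculum(prompt: str, canonical_solution: str, stages: int = 4):
    """Yield progressively harder code-completion subtasks for one problem."""
    for stage in range(1, stages + 1):
        yield build_subtask(prompt, canonical_solution, difficulty=stage / stages)
```

In the paper's framing, advancement to a harder stage is gated by the model's success on the current one; the sketch above only illustrates how the provided prefix shrinks so that exploration always starts close to a rewarded state.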

StepCoder Framework

StepCoder's CCCS component breaks lengthy code generation tasks into smaller, manageable subtasks, forming a curriculum that eases exploration. The process is dynamic: RL exploration starts from nearly complete code sequences and the difficulty increases progressively. FGO, in turn, sharpens model optimization by masking unexecuted code segments when the loss is computed from unit-test feedback. This dual-component strategy allows StepCoder not only to explore the LLM's output space more effectively but also to surpass state-of-the-art methods on the corresponding benchmarks.
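
The fine-grained optimization step can be illustrated with a short sketch under stated assumptions: per-token losses come from the RL objective, and the test runner reports which source lines were actually executed; tokens on unexecuted lines are simply masked out of the loss. The names `fgo_loss`, `token_line_ids`, and `executed_lines` are hypothetical, not the paper's API.

```python
import torch


def fgo_loss(per_token_loss: torch.Tensor,
             token_line_ids: torch.Tensor,
             executed_lines: set) -> torch.Tensor:
    """Average the RL loss only over tokens whose source line was executed.

    per_token_loss: (seq_len,) losses from the policy-gradient objective
    token_line_ids: (seq_len,) source-line index of each generated token
    executed_lines: line indices reported as covered while running unit tests
    """
    mask = torch.tensor(
        [float(int(line) in executed_lines) for line in token_line_ids.tolist()],
        dtype=per_token_loss.dtype,
    )
    denom = mask.sum().clamp(min=1.0)  # guard against fully unexecuted samples
    return (per_token_loss * mask).sum() / denom
```

The design intuition matches the paper's motivation: code that never runs cannot affect the unit-test reward, so gradients computed on it only add noise to the policy update.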

Empirical Results and Dataset Significance

StepCoder was evaluated on the curated APPS+ dataset, showing gains over existing methods in both exploration efficiency and the quality of generated code. APPS+ itself provides a rigorous evaluation platform and a useful baseline for integrating RL with LLMs. On widely used benchmarks such as MBPP and HumanEval, StepCoder outperforms other RL-based methods, confirming its effectiveness. The authors attribute this success primarily to improved exploration during reinforcement learning, positioning StepCoder as a strong framework for enhancing LLM code generation.

Authors (17)
  1. Shihan Dou (46 papers)
  2. Yan Liu (419 papers)
  3. Haoxiang Jia (7 papers)
  4. Limao Xiong (9 papers)
  5. Enyu Zhou (12 papers)
  6. Junjie Shan (12 papers)
  7. Caishuang Huang (13 papers)
  8. Wei Shen (181 papers)
  9. Xiaoran Fan (23 papers)
  10. Zhiheng Xi (37 papers)
  11. Yuhao Zhou (78 papers)
  12. Tao Ji (28 papers)
  13. Rui Zheng (78 papers)
  14. Qi Zhang (784 papers)
  15. Xuanjing Huang (287 papers)
  16. Tao Gui (127 papers)
  17. Xiao Wang (507 papers)
Citations (17)