Emergent Mind


The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback to explore the output space of LLMs and enhance code generation quality. However, the lengthy code that LLMs generate in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover all of the complicated code, optimizing LLMs on the unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO provides Fine-Grained Optimization by masking the unexecuted code segments so that the model is optimized only on executed code. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. Our dataset APPS+ and StepCoder are available online.


  • StepCoder introduces a novel reinforcement learning framework for code generation, overcoming challenges with long code sequences and sparse rewards.

  • The two main components of the framework, Curriculum of Code Completion Subtasks and Fine-Grained Optimization, aid exploratory learning and optimization precision.

  • The paper presents APPS+, a new dataset for reinforcement learning whose unit tests are manually verified for correctness to support better model training.

  • Together, these components enable StepCoder to outperform state-of-the-art methods by refining large language model outputs more effectively.

  • Empirical results show that StepCoder excels in exploration efficiency and code generation on benchmarks like MBPP and HumanEval.


Progress in LLMs has significantly influenced the field of code generation, particularly through the integration of reinforcement learning (RL) with compiler feedback. Despite these advances, RL exploration remains difficult when LLMs generate lengthy code in response to complex human requirements. In addition, optimizing an LLM on unexecuted code snippets is ineffective, because code that the unit tests never run has no bearing on the reward. StepCoder, a novel RL framework, tackles these challenges with two main components: Curriculum of Code Completion Subtasks (CCCS) and Fine-Grained Optimization (FGO). The paper also introduces APPS+, a dataset designed for RL training in code generation whose unit tests are manually verified for correctness, which supports more reliable model training.

Reinforcement Learning Challenges

Reinforcement learning (RL) for code generation must cope with long output sequences and sparse rewards, both of which become harder to manage as human requirements grow more complex. Conventional approaches, such as PPO and other actor-critic methods, optimize the model using unit test feedback. That feedback, however, only reflects the code the tests actually execute, and the length of the generated code makes RL exploration arduous. StepCoder addresses these barriers with two components: CCCS, which eases exploration by incrementally increasing task difficulty, and FGO, which refines model optimization by focusing solely on executed code snippets.
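The idea of incrementally increasing difficulty can be illustrated with a minimal sketch. It assumes, for illustration only, that each subtask is built by showing the model a prefix of a reference solution and asking it to complete the rest, with the prefix shrinking from stage to stage. The function name `build_curriculum`, the linear schedule, and the line-level split are hypothetical choices, not details taken from the summary.

```python
from typing import List


def build_curriculum(prompt: str, reference_solution: str, num_stages: int) -> List[str]:
    """Return model inputs ordered from easiest to hardest (hypothetical helper).

    Early stages reveal most of the reference solution, so only a short
    completion is needed; the final stage reveals nothing, which corresponds
    to generating the whole program from the prompt alone.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    lines = reference_solution.splitlines()
    stages = []
    for stage in range(num_stages):
        # Assumed linear schedule: the number of revealed lines shrinks
        # at every stage and reaches 0 at the last stage.
        keep = len(lines) * (num_stages - 1 - stage) // num_stages
        prefix = "\n".join(lines[:keep])
        stages.append(prompt + "\n" + prefix if prefix else prompt)
    return stages


if __name__ == "__main__":
    reference = (
        "def add(a, b):\n"
        "    total = a + b\n"
        "    print(total)\n"
        "    return total"
    )
    for i, model_input in enumerate(build_curriculum("# Add two numbers", reference, 4)):
        print(f"--- stage {i} ---")
        print(model_input)
```

In an RL loop, training would move to the next stage once the model handles the current one well enough; the criterion for advancing is left out of this sketch.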

StepCoder Framework

StepCoder's CCCS component breaks lengthy code generation tasks into smaller, manageable code completion subtasks, creating a curriculum that eases exploration. The process is dynamic: RL exploration starts from simple code sequences and progressively moves to more complex ones. FGO, in turn, improves the precision of model optimization by masking unexecuted code segments so that only code the unit tests actually run contributes to the update. This dual-component strategy allows StepCoder not only to explore the LLM's output space more effectively but also to surpass state-of-the-art methods on corresponding benchmarks.
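The masking idea behind FGO can be sketched as a masked average over per-token losses. This is a simplified illustration under stated assumptions: the per-token losses are taken as given, each token is mapped to the source line it belongs to, and the set of executed lines is assumed to come from some tracing step that is not shown. The helper names `executed_token_mask` and `masked_mean_loss` are hypothetical.

```python
from typing import List, Sequence, Set


def executed_token_mask(token_lines: Sequence[int], executed_lines: Set[int]) -> List[int]:
    """Mark each token with 1 if its source line was executed, else 0."""
    return [1 if line in executed_lines else 0 for line in token_lines]


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average the loss over unmasked tokens only.

    Tokens with mask 0 (assumed here to be unexecuted code) contribute
    nothing to the result, so they would not influence the update.
    """
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing was executed, so there is nothing to optimize on
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


if __name__ == "__main__":
    token_lines = [1, 1, 2, 2, 3]           # source line of each generated token
    executed = {1, 3}                        # lines the unit tests actually ran
    losses = [0.5, 1.5, 9.0, 9.0, 1.0]       # made-up per-token losses
    mask = executed_token_mask(token_lines, executed)
    print(mask)
    print(masked_mean_loss(losses, mask))
```

Dividing by the number of kept tokens rather than the full sequence length keeps the loss scale comparable between samples with different amounts of executed code; that normalization is a design choice of this sketch.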

Empirical Results and Dataset Significance

StepCoder was evaluated on the specially curated APPS+ dataset, where it improved on existing methods in both exploration efficiency and the quality of generated code. APPS+ also provides a rigorous evaluation platform and a useful baseline for applying RL to LLMs. On the widely used MBPP and HumanEval benchmarks, StepCoder outperforms other RL-based methods, which supports its effectiveness. This success could primarily be attributed to improved exploration within reinforcement learning, positioning StepCoder as a potent framework for enhancing the capabilities of LLMs in code generation scenarios.
