Introduction
Progress in LLMs has significantly influenced code generation, particularly through the integration of reinforcement learning (RL) with compiler feedback. Despite these advances, RL struggles when LLMs must generate long code sequences in response to complex human requirements. In addition, optimizing the LLM on unexecuted code snippets is ineffective, since those snippets never contribute to the unit-test reward. A novel RL framework, StepCoder, is presented to tackle these challenges with two main components: a Curriculum of Code Completion Subtasks (CCCS) and Fine-Grained Optimization (FGO). The paper also introduces APPS+, a dataset built for RL training in code generation whose unit tests are verified for correctness, which facilitates more reliable model training.
Reinforcement Learning Challenges
Reinforcement learning for code generation must cope with long action sequences and sparse rewards arising from the intricacy of human requirements. Conventional approaches, such as PPO and other actor-critic methods, optimize the policy using unit-test feedback; however, the compiler can only report on code that actually executes, so this feedback has limited coverage and makes RL exploration arduous. StepCoder addresses these barriers with two complementary techniques: CCCS, which simplifies exploration by incrementally increasing task difficulty, and FGO, which refines optimization by computing the loss only over executed code snippets.
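To make the sparse-reward problem concrete, the sketch below scores a generated program purely by running it against unit tests, which is the only signal a compiler-feedback RL loop receives. The helper name `unit_test_reward` and the specific reward values are illustrative assumptions, not StepCoder's exact shaping; they merely reflect the coarse pass/fail granularity described above.

```python
import os
import subprocess
import tempfile

def unit_test_reward(code, tests, timeout=5.0):
    """Execute `code` against (stdin, expected stdout) pairs and return one
    scalar reward. Values are illustrative: passing every test is the only
    positive outcome, so most exploration receives a flat negative signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        for stdin_data, expected in tests:
            try:
                result = subprocess.run(
                    ["python", path], input=stdin_data,
                    capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return -0.6  # hung or too slow
            if result.returncode != 0:
                return -1.0  # crashed: compile or runtime error
            if result.stdout.strip() != expected.strip():
                return -0.3  # ran, but produced a wrong answer
        return 1.0  # passed every test
    finally:
        os.remove(path)

# Example: a correct echo program earns the full reward.
print(unit_test_reward("print(input())", [("hello", "hello")]))  # 1.0
```

Because almost every sampled program lands in one of the negative buckets, the reward surface is nearly flat over long sequences, which is exactly the exploration difficulty CCCS and FGO target.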
StepCoder Framework
StepCoder's CCCS component breaks a long code generation task into smaller, manageable code-completion subtasks, forming a curriculum that eases exploration: RL begins from sequences where most of the code is already provided and the model completes progressively longer portions as training advances. FGO, in turn, sharpens optimization by masking tokens from unexecuted code sections when computing the loss, so the model is updated only on code the unit tests actually exercised. Together, these two components let StepCoder refine the LLM's output space more effectively and surpass state-of-the-art methods on the corresponding benchmarks; minimal sketches of both ideas follow.
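First, a minimal sketch of the CCCS idea, assuming access to a canonical solution for each problem: the prompt conditions the model on a prefix of that solution, and the prefix shrinks as the curriculum level rises. The function name `cccs_prompt` and the naive line-based split are assumptions made for brevity; the paper segments code at more meaningful structural boundaries.

```python
def cccs_prompt(requirement: str, canonical_solution: str,
                level: int, num_levels: int) -> str:
    """Build a code-completion subtask: at level 0 nearly the whole reference
    solution is given; at the final level the model writes everything itself."""
    lines = canonical_solution.splitlines()
    keep = len(lines) * (num_levels - level) // num_levels
    prefix = "\n".join(lines[:keep])
    return f"{requirement}\n{prefix}"
```

Second, a sketch of FGO-style masking, assuming a `line_of_token` mapping from each generated token to its source line (bookkeeping not detailed here): lines never reached during unit-test execution are zeroed out of the policy-gradient loss, so unexecuted code cannot distort the update.

```python
import sys
import torch

def executed_lines(code: str) -> set[int]:
    """Trace which lines of the generated program actually run; only these
    lines carry information about the unit-test reward."""
    hit: set[int] = set()

    def tracer(frame, event, arg):
        if frame.f_code.co_filename == "<generated>" and event == "line":
            hit.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        exec(compile(code, "<generated>", "exec"), {})
    except Exception:
        pass  # a failing program still leaves a partial trace
    finally:
        sys.settrace(None)
    return hit

def fgo_masked_loss(logprobs: torch.Tensor, advantages: torch.Tensor,
                    line_of_token: list[int], executed: set[int]) -> torch.Tensor:
    """Vanilla policy-gradient loss restricted to executed tokens."""
    mask = torch.tensor([float(ln in executed) for ln in line_of_token])
    return -(mask * logprobs * advantages).sum() / mask.sum().clamp(min=1.0)
```

The design choice in both sketches is the same: shrink the effective problem, either by shortening what must be generated (CCCS) or by shortening what is penalized (FGO).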
Empirical Results and Dataset Significance
StepCoder was evaluated on the curated APPS+ dataset, where it improves over existing methods in both exploration efficiency and the quality of the generated code. APPS+ itself provides a rigorous evaluation platform and a valuable baseline for integrating RL into LLMs. On the widely used MBPP and HumanEval benchmarks, StepCoder likewise outperforms other RL-based methods, confirming its effectiveness. This success is primarily attributable to improved exploration during reinforcement learning, positioning StepCoder as a potent framework for enhancing the code generation capabilities of LLMs.