- The paper introduces a Process Reward Model (PRM) that provides dense, line-level feedback to overcome sparse rewards in reinforcement learning for code generation.
- The methodology integrates PRM for both intermediate reward signals and value initialization, improving benchmark pass rates on LiveCodeBench and InHouseBench.
- Extensive experiments demonstrate that strategic data labeling and reward normalization mitigate reward hacking by the policy while enhancing long-horizon code generation.
Process Supervision-Guided Policy Optimization for Code Generation
Introduction to the Research Problem
The paper "Process Supervision-Guided Policy Optimization for Code Generation" addresses the critical challenge of sparsity in reward signals for reinforcement learning (RL) applied to code generation. Conventional RL approaches in this domain rely heavily on unit test feedback, providing rewards only at the completion of code evaluation. This delayed, sparse feedback mechanism significantly hampers the learning efficiency and incremental improvements that a model can achieve. The research introduces a Process Reward Model (PRM) designed to deliver dense, line-level feedback during the code generation process, mimicking the iterative refinement approach of human programmers.
Methodology and Approach
The proposed method integrates dense reward signals via a PRM into the RL framework, enhancing exploration and credit assignment during training. The approach consists of two main components:
- Binary Search-Based Data Labeling: PRM training data is labeled automatically with a binary search that locates the first incorrect line in a generated solution. This labels every code prefix as correct or incorrect without testing each line individually (see the sketch after this list).
Figure 1: Overview of the method, illustrating the process of data labeling and integration of PRM into RL training.
- PRM Integration into RL Training: The PRM serves both as a source of dense rewards and as an initializer for the value function in the RL algorithm. This dual use provides continuous intermediate feedback and stabilizes value estimates early in training (a reward-shaping sketch follows the labeling example below).
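To make the labeling procedure concrete, here is a minimal sketch. The oracle `prefix_is_recoverable` is a hypothetical stand-in for the paper's check (resampling completions of a prefix with the policy and running them against the unit tests); binary search is valid under the assumption that recoverability is monotone in prefix length.

```python
from typing import Callable, List

def find_first_error_line(
    lines: List[str],
    prefix_is_recoverable: Callable[[str], bool],
) -> int:
    """Locate the first incorrect line of a failing solution via binary search.

    Assumes the empty prefix is recoverable (the model can solve the task
    from scratch), the full solution is not (it fails the unit tests), and
    recoverability is monotone in prefix length.
    Returns the 1-indexed number of the first incorrect line.
    """
    lo, hi = 0, len(lines)  # invariant: prefix of lo lines recoverable, of hi lines not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_recoverable("\n".join(lines[:mid])):
            lo = mid  # error lies after line `mid`
        else:
            hi = mid  # error lies at or before line `mid`
    return hi

def prefix_labels(num_lines: int, first_error: int) -> List[int]:
    """Label each line: 1 (correct) before the first error, 0 from it onward."""
    return [1 if line_no < first_error else 0 for line_no in range(1, num_lines + 1)]
```

For a solution of n lines this needs only O(log n) oracle calls rather than n, which matters because each call involves sampling completions and executing unit tests.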
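And here is a minimal sketch of how the two integrations might look in a PPO-style loop. The placement of rewards at line-final tokens, the `beta` weight, and the weight-copying for value initialization are illustrative assumptions, not the paper's exact formulation.

```python
from typing import List
import torch

def shaped_token_rewards(
    seq_len: int,
    line_end_positions: List[int],   # token index where each code line ends
    prm_line_scores: torch.Tensor,   # PRM score per line, shape [num_lines]
    unit_test_reward: float,         # sparse terminal reward (e.g. 1.0 pass / 0.0 fail)
    beta: float = 0.1,               # weight on the dense signal (assumed value)
) -> torch.Tensor:
    """Per-token rewards: dense PRM scores at line-final tokens, plus the
    sparse unit-test outcome at the final token."""
    rewards = torch.zeros(seq_len)
    for pos, score in zip(line_end_positions, prm_line_scores.tolist()):
        rewards[pos] += beta * score
    rewards[-1] += unit_test_reward
    return rewards

# Value initialization (sketch): warm-start the critic from the PRM so that
# early value estimates already reflect process-level correctness, e.g.
#   critic.load_state_dict(prm.state_dict(), strict=False)
```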
Experimental Framework and Results
The research runs extensive experiments on in-house datasets and evaluates on LiveCodeBench and an internal benchmark, InHouseBench, to validate the effectiveness of PRM integration into RL training. The experimental setup fine-tunes an in-house model, Doubao-Lite, first through Supervised Fine-Tuning (SFT) and subsequently through RLHF (Reinforcement Learning from Human Feedback) optimization.
Figure 2: RL training curve comparing various configurations. The DenseReward + ValueInit setting demonstrates superior performance.
Empirical results indicate that using the PRM simultaneously as a source of dense rewards and for value initialization yields significant pass-rate improvements: from 28.2% to 29.8% on LiveCodeBench and from 31.8% to 35.8% on InHouseBench. Notably, PRM integration is especially beneficial in long-horizon scenarios, where granular feedback refines intermediate decision-making steps.
Key Considerations and Practical Implications
The paper elaborates on crucial implementation aspects necessary for effective PRM integration:
- Data Selection Strategy: The quality of PRM training data critically influences performance. Curated subsets focusing on informative, diverse examples are emphasized over sheer volume.
- Mitigation of Reward Model Hacking: Techniques such as reward length normalization and assigning neutral labels to comment lines prevent the policy from exploiting the PRM and ensure that measured gains reflect genuine improvements (a sketch follows this list).
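A minimal sketch of the two mitigations, assuming a naive detector for Python-style comments; the paper's detector and normalization scheme may differ.

```python
from typing import List

def normalized_line_reward(line_texts: List[str], line_scores: List[float]) -> float:
    """Aggregate PRM line scores with two anti-hacking measures:
    comment and blank lines get a neutral (excluded) label, and the sum is
    normalized by the number of scored lines, so padding a solution with
    extra lines or comments cannot inflate the dense reward."""
    def is_neutral(text: str) -> bool:
        stripped = text.strip()
        return stripped.startswith("#") or not stripped  # naive: comment/blank
    scored = [s for t, s in zip(line_texts, line_scores) if not is_neutral(t)]
    if not scored:
        return 0.0
    return sum(scored) / len(scored)
```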
Overall, the findings underscore the importance of dense, process-level feedback in RL environments to facilitate more robust code generation models.
Conclusion and Future Directions
The integration of PRM for enhanced process supervision effectively addresses the inherent limitations of sparse reward signals in conventional RL frameworks for code generation. The methodology not only improves performance in generating syntactically and functionally correct code but also offers insights into better handling complex, long-horizon coding tasks. Future work can explore scaling the PRM approach to broader domains beyond code generation, further enhancing AI’s capabilities in structured reasoning tasks.