$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis (2310.03173v2)

Published 4 Oct 2023 in cs.CL

Abstract: Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with LLMs, significantly enhancing code generation capabilities. The application of RL focuses on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers and also historical samples, coupled with the straightforward verification of generated programs through automated unit testing, meaning rewards are easy to obtain. Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
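The abstract's central mechanism is a value-based (Q-learning-style) update with a conservative Bellman backup over token-level actions, trained on off-policy program data rewarded by unit tests. The sketch below illustrates one common way to make such a backup conservative (a clipped, double-estimate bootstrap); it is not the paper's exact operator, and the network names (`q_net`, `target_q_net`), tensor shapes, and binary unit-test reward are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative setup: a Q-value head over the token vocabulary, sitting on top
# of an LM's hidden state. Sizes are toy values, not the paper's configuration.
vocab_size, hidden = 100, 32
q_net = torch.nn.Linear(hidden, vocab_size)          # online Q(s, .) head
target_q_net = torch.nn.Linear(hidden, vocab_size)   # slowly updated target copy
target_q_net.load_state_dict(q_net.state_dict())

def conservative_td_loss(states, actions, rewards, next_states, done, gamma=1.0):
    """TD loss with a conservative bootstrap: the target uses the element-wise
    minimum of the online and target estimates, damping over-estimation on
    off-policy program data. This is a generic conservative backup, offered
    only as an illustration of the idea named in the abstract."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_online = q_net(next_states).max(dim=1).values
        next_target = target_q_net(next_states).max(dim=1).values
        bootstrap = torch.minimum(next_online, next_target)
        target = rewards + gamma * (1.0 - done) * bootstrap
    return F.mse_loss(q_sa, target)

# Toy batch of 4 transitions: "states" stand in for LM hidden representations,
# and the terminal reward is 1.0 only if the finished program passes its unit tests.
B = 4
states, next_states = torch.randn(B, hidden), torch.randn(B, hidden)
actions = torch.randint(0, vocab_size, (B,))
rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])
done = torch.tensor([0.0, 0.0, 0.0, 1.0])
loss = conservative_td_loss(states, actions, rewards, next_states, done)
loss.backward()
```

The same learned Q-values can, as the abstract notes, double as a post-processing signal, e.g. scoring or re-ranking sampled candidate programs before execution; the exact dual-strategy procedure is described in the paper itself.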

Authors (5)
  1. Zishun Yu
  2. Yunzhe Tao
  3. Liyu Chen
  4. Tao Sun
  5. Hongxia Yang