DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs (2411.13611v3)

Published 20 Nov 2024 in cs.SE and cs.AI

Abstract: Direct preference learning offers a promising and computation-efficient approach beyond supervised fine-tuning (SFT) for improving code generation in coding LLMs (LMs). However, the scarcity of reliable preference data is a bottleneck for the performance of direct preference learning to improve the coding accuracy of code LMs. In this paper, we introduce Direct Preference Learning with Only Self-Generated Tests and Code (DSTC), a framework that leverages only self-generated code snippets and tests to construct reliable preference pairs such that direct preference learning can improve LM coding accuracy without external annotations. DSTC combines a minimax selection process and test-code concatenation to improve preference pair quality, reducing the influence of incorrect self-generated tests and enhancing model performance without the need for costly reward models. When applied with direct preference learning methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), DSTC yields stable improvements in coding accuracy (pass@1 score) across diverse coding benchmarks, including HumanEval, MBPP, and BigCodeBench, demonstrating both its effectiveness and scalability for models of various sizes. This approach autonomously enhances code generation accuracy across LLMs of varying sizes, reducing reliance on expensive annotated coding datasets.

Summary

  • The paper introduces DSTC, a novel framework that improves code LLM accuracy through direct preference learning using only self-generated code and tests, eliminating the need for external high-quality preference data.
  • DSTC employs a minimax selection process and code-test concatenation to construct reliable preference pairs from generated data, effectively mitigating issues with incorrect self-generated tests.
  • Experimental results demonstrate that DSTC, combined with DPO or KTO, consistently enhances coding accuracy on benchmarks like HumanEval and MBPP for models including Starcoder2-15b and Deepseekcoder-33b.

This paper introduces DSTC (Direct Preference Learning with Only Self-Generated Tests and Code), a novel framework designed to improve the coding accuracy of code LLMs using direct preference learning methods. The key idea is to construct reliable preference pairs from self-generated code snippets and tests, removing the reliance on external annotations or high-quality datasets. DSTC combines a minimax selection process with test-code concatenation to enhance the quality of preference pairs, effectively reducing the impact of incorrect self-generated tests.

The authors identify that Supervised Fine-Tuning (SFT) of code LLMs (LMs) has limitations in generalizing to unseen instructions. While reinforcement learning can address this, a significant challenge lies in the scarcity of high-quality preference datasets tailored to code generation. Unlike natural language tasks, code generation requires preference datasets to emphasize correctness and functional alignment.

DSTC leverages direct preference learning methods like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), which offer computational efficiency and simplicity compared to Proximal Policy Optimization (PPO). The core question addressed is how to generate reliable preference pairs from self-generated code and tests to improve coding accuracy through direct preference learning.

The DSTC framework comprises two main components:

  • A minimax selection mechanism to improve the quality of code-test pairs.
  • Code-test concatenation to create more reliable preference pairs, using binary execution feedback.

The authors evaluated DSTC by implementing it with both Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) on the Starcoder2-15b-instruct model (15 billion parameters), assessing performance on HumanEval Base, HumanEval Plus, Mostly Basic Python Problems (MBPP) Base, MBPP Plus, and BigCodeBench (BCB). The results indicate that DSTC effectively increases coding accuracy across multiple benchmarks. Ablation studies validate the importance of each component within DSTC, and results show that DSTC can also improve the performance of the Deepseekcoder-33b model (33 billion parameters).

The contributions of the paper are:

  • The DSTC mechanism for generating preference pairs from self-generated code and tests, suitable for direct preference learning algorithms like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO). DSTC includes a minimax selection procedure and code-test concatenation.
  • Experimental results demonstrating that DSTC, when combined with Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), improves coding accuracy (pass@1 score) in code LLMs (LMs) across multiple benchmarks and model sizes.
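
For reference, the pass@1 score reported throughout is the standard execution-based metric: the fraction of problems for which a sampled program passes the benchmark's tests. Below is a minimal sketch of the widely used unbiased pass@k estimator, computed from n samples per problem of which c pass; the helper name is ours, not the paper's.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for a problem, 3 of them pass -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```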

The paper reviews related work on Reinforcement Learning from Human Feedback (RLHF), highlighting that RLHF has become essential for training state-of-the-art LLMs and that direct preference learning algorithms have emerged as a promising direction for overcoming the limitations of Proximal Policy Optimization (PPO)-based methods. The paper also discusses advances in LLMs for code generation, noting that pre-training on code datasets has yielded impressive performance and that fine-tuning strategies have been developed to further improve coding performance.

The paper formulates the code generation problem within the Reinforcement Learning from Human Feedback (RLHF) framework, where a state $x \in \mathcal{X}$ refers to a natural language instruction and an action $a \in \mathcal{A}$ refers to the response provided by the code LLM (LM) $\pi_\theta(\cdot \mid x)$, containing a code snippet $y \in \mathcal{Y}$. The correctness of a code snippet is checked by feeding the concatenation of the code snippet $y$ and a test $z$ to a compiler, which outputs binary execution feedback. The objective of code generation is formalized as:

$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid x)} \bigl[ r(z^\star_x, \mathrm{Ex}(a)) - \beta \cdot \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x)\bigr) \bigr]$,

where:

  • $x$ is a natural language instruction
  • $a$ is a response provided by the code LLM (LM)
  • $r(z^\star_x, \mathrm{Ex}(a))$ is the binary execution reward for the code snippet $\mathrm{Ex}(a)$ extracted from $a$ and the ground-truth test $z^\star_x$
  • $\beta$ is a hyperparameter
  • $\pi_\theta$ is the policy
  • $\pi_\text{ref}$ is the reference policy
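
This KL-regularized objective is what motivates the implicit reward used by the direct preference learning losses below: for an objective of this form, the optimal policy has the standard closed form

$\pi^\star(a \mid x) \propto \pi_\text{ref}(a \mid x)\, \exp\bigl(r(x, a) / \beta\bigr)$,

so the reward equals $\beta \log \frac{\pi^\star(a \mid x)}{\pi_\text{ref}(a \mid x)}$ plus a term that depends only on $x$, which is why the DPO and KTO losses below are expressed through the log-probability ratio $r_\theta(x, a)$.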

The Direct Preference Optimization (DPO) loss is defined as:

$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, a^+, a^-) \sim \mathcal{D}_\text{DPO}}\bigl[\log\sigma\bigl(r_\theta(x, a^+) - r_\theta(x, a^-)\bigr)\bigr]$,

where

$r_\theta(x, a) = \log \frac{\pi_\theta(a \mid x)}{\pi_\text{ref}(a \mid x)}$

  • $\theta$ is the model parameter
  • $x$ is the prompt
  • $a^+$ is the chosen response
  • $a^-$ is the rejected response
  • $\mathcal{D}_\text{DPO}$ is the preference dataset
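
A minimal PyTorch sketch of this loss (not the paper's code), assuming the per-response token log-probabilities have already been summed under the policy and the reference model; the tensor names and the optional temperature argument are ours, and the formula above corresponds to beta = 1.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 1.0) -> torch.Tensor:
    """DPO loss over a batch of (x, a+, a-) preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities
    log pi(a | x) for the chosen (a+) or rejected (a-) responses.
    """
    # Implicit rewards r_theta(x, a) = log pi_theta(a|x) - log pi_ref(a|x)
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # -E[log sigma(r(x, a+) - r(x, a-))]; standard DPO scales the margin by beta
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```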

The Kahneman-Tversky Optimization (KTO) loss is defined as:

$\mathcal{L}_{\mathrm{KTO}}(\theta) = \mathbb{E}_{(x, a, b) \sim \mathcal{D}_\text{KTO}}\bigl[\lambda_y - v(x, a)\bigr]$,

where

$v(x, a) = \begin{cases} \lambda_D\, \sigma\bigl(\beta\,(r_\theta(x, a) - z_0)\bigr), & \text{if } b = 1 \\ \lambda_U\, \sigma\bigl(\beta\,(z_0 - r_\theta(x, a))\bigr), & \text{otherwise} \end{cases}$

$r_\theta(x, a) = \log \frac{\pi_\theta(a \mid x)}{\pi_\text{ref}(a \mid x)}$

$z_0 = \mathbb{E}_{(x^\prime, a^\prime, b^\prime) \sim \mathcal{D}_\text{KTO}}\bigl[\operatorname{KL}\bigl(\pi_\theta(a^\prime \mid x^\prime) \,\|\, \pi_\text{ref}(a^\prime \mid x^\prime)\bigr)\bigr]$

  • $\theta$ is the model parameter
  • $x$ is the prompt
  • $a$ is the response
  • $b$ is a binary variable that indicates whether $a$ is desired
  • $\mathcal{D}_\text{KTO}$ is the preference dataset
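
A matching PyTorch sketch of this loss under the same assumptions as the DPO snippet above; here z0 is passed in as a precomputed, gradient-detached estimate of the KL term defined above, and the default hyperparameter values are ours.

```python
import torch

def kto_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             desirable: torch.Tensor,  # b in {0, 1} as a float tensor
             z0: torch.Tensor,         # detached scalar reference point
             beta: float = 1.0,
             lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> torch.Tensor:
    """KTO loss over a batch of (x, a, b) examples.

    policy_logps / ref_logps hold summed token log-probabilities log pi(a | x).
    """
    # Implicit reward r_theta(x, a) = log pi_theta(a|x) - log pi_ref(a|x)
    rewards = policy_logps - ref_logps
    # v(x, a): gain for desired responses (b = 1), mirrored penalty otherwise
    v_desired = lambda_d * torch.sigmoid(beta * (rewards - z0))
    v_undesired = lambda_u * torch.sigmoid(beta * (z0 - rewards))
    v = desirable * v_desired + (1.0 - desirable) * v_undesired
    # lambda_y is lambda_D or lambda_U depending on the label b
    lam = desirable * lambda_d + (1.0 - desirable) * lambda_u
    return (lam - v).mean()
```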

The DSTC method involves the following steps:

  1. Generating $J$ code snippets $\{y_i^j\}_{j=1}^J$ and $J$ tests $\{z_i^j\}_{j=1}^J$ for each instruction $x_i$.
  2. Concatenating each self-generated code snippet $y_i^j$ with each test $z_i^k$ and recording the binary execution feedback $r_{ijk} \in \{0, 1\}$ from a compiler (see the sketch below).
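
A minimal sketch of step 2, assuming the snippets and tests are plain Python source strings and that the compiler feedback amounts to running the concatenated program (with its assertions) and checking the exit code; the helper, the timeout, and the toy example are ours, and a real pipeline would sandbox this execution.

```python
import subprocess
import sys

def execution_feedback(code: str, test: str, timeout: float = 10.0) -> int:
    """Return r_ijk in {0, 1}: 1 if code + test runs without error, else 0."""
    program = code + "\n\n" + test
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
        return int(result.returncode == 0)
    except subprocess.TimeoutExpired:
        return 0

# Toy example: r[j][k] is the feedback for code snippet j paired with test k.
codes = ["def add(a, b):\n    return a + b",   # correct
         "def add(a, b):\n    return a - b"]   # buggy
tests = ["assert add(2, 3) == 5"]
r = [[execution_feedback(y, z) for z in tests] for y in codes]
print(r)  # [[1], [0]]
```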

The minimax selection process then determines which code snippets and tests are chosen or rejected based on the following criteria (an illustrative implementation is sketched after the list):

  • Chosen code snippet $y_i^{j^\prime}$: $j^\prime = \arg\max_j \sum_{k=1}^J r_{ijk}$.
  • Chosen test $z_i^{k^\prime}$: $k^\prime = \arg\max_k \sum_{j=1}^J r_{ijk}$ subject to $r_{i j^\prime k} = 1$.
  • Rejected test $z_i^{k^\dagger}$: $k^\dagger = \arg\max_k \sum_{j=1}^J r_{ijk}$ subject to $\sum_{j=1}^J r_{ijk} < J$.
  • Rejected code snippet $y_i^{j^\dagger}$: $j^\dagger = \arg\max_j \sum_{k=1}^J r_{ijk}$ subject to $r_{i j k^\dagger} = 0$.
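
An illustrative NumPy implementation of these four rules for a single instruction, operating on the $J \times J$ feedback matrix; the function and its tie-breaking behavior (argmax picks the first maximizer) reflect our own reading of the criteria, and it returns None when the constraints cannot be met (for example, when every code snippet passes every test).

```python
import numpy as np

def minimax_select(r: np.ndarray):
    """Pick chosen/rejected code and test indices from a J x J matrix r,
    where r[j, k] = 1 if code snippet j passes self-generated test k."""
    J = r.shape[0]
    code_scores = r.sum(axis=1)  # tests passed by each code snippet
    test_scores = r.sum(axis=0)  # code snippets passing each test

    # Chosen code snippet: passes the most tests.
    j_chosen = int(code_scores.argmax())

    # Chosen test: passed by the most codes, among tests the chosen code passes.
    ok_tests = np.flatnonzero(r[j_chosen] == 1)
    if ok_tests.size == 0:
        return None
    k_chosen = int(ok_tests[test_scores[ok_tests].argmax()])

    # Rejected test: passed by the most codes, among tests failed by some code.
    cand_tests = np.flatnonzero(test_scores < J)
    if cand_tests.size == 0:
        return None
    k_rejected = int(cand_tests[test_scores[cand_tests].argmax()])

    # Rejected code: passes the most tests, among codes failing the rejected test.
    cand_codes = np.flatnonzero(r[:, k_rejected] == 0)
    j_rejected = int(cand_codes[code_scores[cand_codes].argmax()])

    return j_chosen, k_chosen, j_rejected, k_rejected
```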

The chosen and rejected code snippets and tests are then concatenated using a predefined prompt template, forming preference pairs for training the code LLMs (LMs) with DPO or KTO (a sketch of this construction follows).
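
A sketch of this final step; the template below is a placeholder rather than the paper's actual prompt format, and the record layout mirrors the (x, a+, a-) triples expected by the DPO loss above.

```python
# Placeholder template; the paper uses its own predefined prompt format.
PAIR_TEMPLATE = "{code}\n\n# Self-generated tests\n{test}\n"

def build_preference_pair(instruction: str,
                          chosen_code: str, chosen_test: str,
                          rejected_code: str, rejected_test: str) -> dict:
    """Form one (x, a+, a-) record whose responses are code-test concatenations."""
    return {
        "prompt": instruction,
        "chosen": PAIR_TEMPLATE.format(code=chosen_code, test=chosen_test),
        "rejected": PAIR_TEMPLATE.format(code=rejected_code, test=rejected_test),
    }
```

For KTO, the same chosen and rejected concatenations can instead be emitted as separate (x, a, b) records labeled desirable (b = 1) and undesirable (b = 0), matching the loss defined earlier.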

The experimental results demonstrate that DSTC improves code quality and increases the quality gap between chosen and rejected code, making preference pairs more reliable.

The evaluation results demonstrate that DSTC consistently enhances the performance of code LLMs (LMs) across all benchmarks when combined with either DPO or KTO. The ablation study indicates that both the minimax selection and code-test concatenation components are necessary for achieving optimal performance with DSTC.
