- The paper introduces DSTC, a novel framework that improves code LLM accuracy through direct preference learning using only self-generated code and tests, eliminating the need for external high-quality preference data.
- DSTC employs a minimax selection process and code-test concatenation to construct reliable preference pairs from generated data, effectively mitigating issues with incorrect self-generated tests.
- Experimental results demonstrate that DSTC, combined with DPO or KTO, consistently enhances coding accuracy on benchmarks like HumanEval and MBPP for models including Starcoder2-15b and Deepseekcoder-33b.
This paper introduces DSTC (Direct Preference Learning with Only Self-Generated Tests and Code), a novel framework designed to improve the coding accuracy of code LLMs using direct preference learning methods. The key idea is to construct reliable preference pairs from self-generated code snippets and tests, removing the reliance on external annotations or high-quality preference datasets. DSTC combines a minimax selection process with code-test concatenation to enhance the quality of preference pairs, effectively reducing the impact of incorrect self-generated tests.
The authors identify that Supervised Fine-Tuning (SFT) of code LLMs has limitations in generalizing to unseen instructions. While reinforcement learning can address this, a significant challenge lies in the scarcity of high-quality preference datasets tailored to code generation. Unlike natural language tasks, code generation requires preference datasets that emphasize correctness and functional alignment.
DSTC leverages direct preference learning methods like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), which offer computational efficiency and simplicity compared to Proximal Policy Optimization (PPO). The core question addressed is how to generate reliable preference pairs from self-generated code and tests to improve coding accuracy through direct preference learning.
The DSTC framework comprises two main components:
- A minimax selection mechanism to improve the quality of code-test pairs.
- Code-test concatenation to create more reliable preference pairs, using binary execution feedback.
The authors evaluated DSTC by implementing it with both Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) on the Starcoder2-15b-instruct model (15 billion parameters), assessing performance on benchmarks like HumanEval Base, HumanEval Plus, Mostly Basic Python Problems (MBPP) Base, Mostly Basic Python Problems (MBPP) Plus, and BigCodeBench (BCB). The results indicate that DSTC effectively increases coding accuracy across multiple benchmarks. Ablation studies validate the importance of each component within DSTC, and results show that DSTC can also improve the performance of the Deepseekcoder-33b model (33 billion parameters).
The contributions of the paper are:
- The DSTC mechanism for generating preference pairs from self-generated code and tests, suitable for direct preference learning algorithms like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO). DSTC includes a minimax selection procedure and code-test concatenation.
- Experimental results demonstrating that DSTC, when combined with Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), improves coding accuracy (pass@1 score) in code LLMs across multiple benchmarks and model sizes.
The paper reviews related work on Reinforcement Learning from Human Feedback (RLHF), highlighting that RLHF has become essential for training state-of-the-art LLMs and that direct preference learning algorithms have emerged as a promising direction to overcome the limitations of Proximal Policy Optimization (PPO)-based methods. It also discusses advances in LLMs for code generation, noting that pre-training on code datasets has yielded impressive performance and that fine-tuning strategies have been developed to further improve coding performance.
The paper formulates the code generation problem within the Reinforcement Learning from Human Feedback (RLHF) framework, where the state x∈X is a natural language instruction and the action a∈A is the response produced by the code LLM πθ(⋅∣x), containing a code snippet y∈Y. The correctness of a code snippet is checked by feeding the concatenation of the code snippet y and a test z to a compiler, which returns binary execution feedback. The objective of code generation is formalized as:
$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid x)} \bigl[ r(z^\star_x, \mathrm{Ex}(a)) - \beta \cdot \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\bigr) \bigr]$,
where:
- x is a natural language instruction
- a is a response provided by the code LLM
- $r(z^\star_x, \mathrm{Ex}(a))$ is the binary reward obtained by executing the code snippet $\mathrm{Ex}(a)$ extracted from the response $a$ against the ground-truth test $z^\star_x$ for instruction $x$
- β is a hyperparameter controlling the strength of the KL regularization
- πθ is the policy
- πref is the reference policy
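As a concrete illustration of the binary execution feedback in this objective, the following is a minimal sketch of how a code snippet and a test could be concatenated and executed; the helper name `execution_feedback` and the use of a temporary Python file are assumptions for illustration, not the paper's implementation.

```python
import os
import subprocess
import tempfile

def execution_feedback(code: str, test: str, timeout: float = 10.0) -> int:
    """Concatenate a code snippet and a test, run them, and return 1 on success, 0 otherwise."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return int(proc.returncode == 0)  # binary reward: 1 if the concatenated program runs cleanly
    except subprocess.TimeoutExpired:
        return 0  # treat timeouts as failures
    finally:
        os.remove(path)
```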
The Direct Preference Optimization (DPO) loss is defined as:
$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, a^+, a^-) \sim \mathcal{D}_{\mathrm{DPO}}}\bigl[\log \sigma\bigl(r_\theta(x, a^+) - r_\theta(x, a^-)\bigr)\bigr]$,
where
$r_\theta(x, a) = \log \frac{\pi_\theta(a \mid x)}{\pi_{\mathrm{ref}}(a \mid x)}$
- θ denotes the model parameters
- x is the prompt
- a+ is the chosen response
- a− is the rejected response
- DDPO is the preference dataset
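A minimal sketch of this loss, assuming the sequence-level log-probabilities of the chosen and rejected responses under the policy and the reference model have already been computed (the tensor names below are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected):
    # r_theta(x, a) = log pi_theta(a | x) - log pi_ref(a | x)
    r_chosen = policy_logp_chosen - ref_logp_chosen
    r_rejected = policy_logp_rejected - ref_logp_rejected
    # L_DPO(theta) = -E[ log sigma(r_theta(x, a+) - r_theta(x, a-)) ]
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```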
The Kahneman-Tversky Optimization (KTO) loss is defined as:
$\mathcal{L}_{\mathrm{KTO}}(\theta) = \mathbb{E}_{(x, a, b) \sim \mathcal{D}_{\mathrm{KTO}}}\bigl[\lambda_y - v(x, a)\bigr]$,
where
$v(x, a) = \begin{cases} \lambda_D \, \sigma\bigl(\beta \bigl(r_\theta(x, a) - z_0\bigr)\bigr) & \text{if } b = 1 \\ \lambda_U \, \sigma\bigl(\beta \bigl(z_0 - r_\theta(x, a)\bigr)\bigr) & \text{otherwise} \end{cases}$
$r_\theta(x, a) = \log \frac{\pi_\theta(a \mid x)}{\pi_{\mathrm{ref}}(a \mid x)}$
$z_0 = \mathbb{E}_{(x', a', b') \sim \mathcal{D}_{\mathrm{KTO}}}\bigl[\operatorname{KL}\bigl(\pi_\theta(a' \mid x') \,\|\, \pi_{\mathrm{ref}}(a' \mid x')\bigr)\bigr]$
- θ denotes the model parameters
- x is the prompt
- a is the response
- b is a binary variable that indicates whether a is desired
- DKTO is the preference dataset
- $\lambda_D$ and $\lambda_U$ are hyperparameters weighting desirable and undesirable responses, and $\lambda_y$ equals $\lambda_D$ when b = 1 and $\lambda_U$ otherwise
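A minimal sketch of this loss under the same assumption of precomputed sequence-level log-probabilities; here `desirable` is a boolean tensor for b, `z0` is taken as a precomputed estimate of the KL term above, and all names are illustrative:

```python
import torch

def kto_loss(policy_logp, ref_logp, desirable, beta, lambda_d, lambda_u, z0):
    # r_theta(x, a) = log pi_theta(a | x) - log pi_ref(a | x)
    r = policy_logp - ref_logp
    # v(x, a): lambda_D * sigma(beta (r - z0)) if desirable, else lambda_U * sigma(beta (z0 - r))
    v_desirable = lambda_d * torch.sigmoid(beta * (r - z0))
    v_undesirable = lambda_u * torch.sigmoid(beta * (z0 - r))
    v = torch.where(desirable, v_desirable, v_undesirable)
    # lambda_y is lambda_D for desirable responses and lambda_U otherwise
    lam = torch.where(desirable, torch.full_like(r, lambda_d), torch.full_like(r, lambda_u))
    return (lam - v).mean()
```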
The DSTC method involves the following steps:
- Generating $J$ code snippets $\{y_{ij}\}_{j=1}^{J}$ and $J$ tests $\{z_{ik}\}_{k=1}^{J}$ for each instruction $x_i$.
- Concatenating each self-generated code snippet $y_{ij}$ with each test $z_{ik}$ and recording the binary execution feedback $r_{ijk} \in \{0, 1\}$ from a compiler (see the sketch below).
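For one instruction, the $J \times J$ feedback matrix could be built as in the following sketch, reusing the illustrative `execution_feedback` helper from earlier:

```python
def feedback_matrix(codes: list[str], tests: list[str]) -> list[list[int]]:
    """r[j][k] = 1 iff code snippet y_ij passes test z_ik."""
    return [[execution_feedback(code, test) for test in tests] for code in codes]
```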
The minimax selection process then determines which code snippets and tests are chosen or rejected based on the following criteria (sketched in code after the list):
- Chosen code snippet $y_{ij'}$: $j' = \arg\max_{j} \sum_{k=1}^{J} r_{ijk}$.
- Chosen test $z_{ik'}$: $k' = \arg\max_{k} \sum_{j=1}^{J} r_{ijk}$ s.t. $r_{ij'k} = 1$.
- Rejected test $z_{ik^\dagger}$: $k^\dagger = \arg\max_{k} \sum_{j=1}^{J} r_{ijk}$ s.t. $\sum_{j=1}^{J} r_{ijk} < J$.
- Rejected code snippet $y_{ij^\dagger}$: $j^\dagger = \arg\max_{j} \sum_{k=1}^{J} r_{ijk}$ s.t. $r_{ijk^\dagger} = 0$.
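The criteria above can be read off a feedback matrix directly; the following is an illustrative sketch (function and variable names are assumptions), returning `None` when no valid chosen/rejected selection exists for an instruction:

```python
def minimax_select(r: list[list[int]]):
    """Select (j_chosen, k_chosen, j_rejected, k_rejected) from r[j][k] = 1 iff code j passes test k."""
    J = len(r)
    row_sums = [sum(r[j][k] for k in range(J)) for j in range(J)]  # tests passed by each code
    col_sums = [sum(r[j][k] for j in range(J)) for k in range(J)]  # codes passing each test

    # Chosen code snippet: passes the most self-generated tests.
    j_chosen = max(range(J), key=lambda j: row_sums[j])
    # Chosen test: passed by the most code snippets, among tests the chosen code passes.
    chosen_tests = [k for k in range(J) if r[j_chosen][k] == 1]
    # Rejected test: passed by the most code snippets, among tests not passed by every code.
    rejected_tests = [k for k in range(J) if col_sums[k] < J]
    if not chosen_tests or not rejected_tests:
        return None  # no reliable preference pair for this instruction
    k_chosen = max(chosen_tests, key=lambda k: col_sums[k])
    k_rejected = max(rejected_tests, key=lambda k: col_sums[k])
    # Rejected code snippet: passes the most tests, among codes failing the rejected test.
    rejected_codes = [j for j in range(J) if r[j][k_rejected] == 0]
    j_rejected = max(rejected_codes, key=lambda j: row_sums[j])
    return j_chosen, k_chosen, j_rejected, k_rejected
```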
The chosen and rejected code snippets and tests are then concatenated using a predefined prompt template, forming preference pairs for training the code LLM with Direct Preference Optimization (DPO) or Kahneman-Tversky Optimization (KTO).
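Putting the pieces together, a preference record in DPO format could be assembled as in the sketch below; the concatenation template shown is a hypothetical stand-in for the paper's predefined prompt template:

```python
# Hypothetical concatenation template; the paper uses its own predefined prompt template.
TEMPLATE = "{code}\n\n# Self-generated test\n{test}"

def make_preference_pair(instruction, codes, tests, selection):
    j_chosen, k_chosen, j_rejected, k_rejected = selection
    return {
        "prompt": instruction,
        "chosen": TEMPLATE.format(code=codes[j_chosen], test=tests[k_chosen]),
        "rejected": TEMPLATE.format(code=codes[j_rejected], test=tests[k_rejected]),
    }
```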
The experimental results demonstrate that DSTC improves code quality and increases the quality gap between chosen and rejected code, making preference pairs more reliable.
The evaluation results demonstrate that DSTC consistently enhances the performance of code LLMs across all benchmarks when combined with either Direct Preference Optimization (DPO) or Kahneman-Tversky Optimization (KTO). The ablation study indicates that both the minimax selection and code-test concatenation components are necessary for achieving optimal performance with DSTC.