Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
The paper introduces Coffee-Gym, a reinforcement learning (RL) environment for training and evaluating feedback models that improve code editing. Coffee-Gym has two main components: Coffee, a dataset of human code-edit traces paired with machine-generated feedback, and CoffeeEval, a reward function that scores feedback by how well the revised code performs on unit tests.
Key Contributions
- Dataset - Coffee:
- Coffee comprises human-annotated code edit traces spanning a wide range of difficulty, covering errors that current LLMs fix easily as well as ones they find challenging.
- It includes pairs of correct and incorrect feedback for each erroneous submission, which are essential for preference-based training and evaluation in RL frameworks.
- It also enriches existing collections with unit tests that rigorously check whether a code revision is functionally correct (see the hypothetical record sketch after this list).
- Reward Function - CoffeeEval:
- CoffeeEval addresses the need for a reliable reward signal: it estimates the helpfulness of feedback by simulating the resulting code edit and executing unit tests against the revised code (a minimal reward sketch follows this list).
- In the paper's experiments, this yields a more faithful reward than prior LLM-as-judge approaches such as G-Eval with GPT-4, offering a credible alternative for feedback evaluation.
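To make the dataset's shape concrete, here is a hypothetical sketch of what a single Coffee-style record might contain. The field names (`wrong_code`, `feedback_chosen`, `test_cases`, etc.) are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical Coffee-style record (field names are assumptions, not the paper's schema).
record = {
    "problem": "Read an integer n and print the sum 1 + 2 + ... + n.",
    "wrong_code": "n = int(input())\nprint(sum(range(n)))",        # off-by-one: excludes n
    "correct_code": "n = int(input())\nprint(sum(range(n + 1)))",  # human-edited fix
    # Paired correct/incorrect feedback for preference-based training:
    "feedback_chosen": "range(n) stops at n - 1, so the sum excludes n; use range(n + 1).",
    "feedback_rejected": "The print statement is wrong; use sys.stdout.write instead.",
    # Unit tests as stdin/stdout pairs for execution-based evaluation:
    "test_cases": [
        {"input": "3\n", "output": "6\n"},
        {"input": "10\n", "output": "55\n"},
    ],
}
```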
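The reward computation itself reduces to "run the revised code against the unit tests and score the pass rate." The snippet below is a minimal sketch of that idea, assuming stdin/stdout test cases as in the record above; it is not the paper's implementation, which additionally uses an editor model to produce the revised code from the feedback before executing it.

```python
import subprocess

def coffeeeval_style_reward(revised_code: str, test_cases: list[dict],
                            timeout: float = 2.0) -> float:
    """Return the fraction of unit tests the revised code passes
    (a proxy for how helpful the feedback that produced it was)."""
    passed = 0
    for case in test_cases:
        try:
            result = subprocess.run(
                ["python", "-c", revised_code],
                input=case["input"],
                capture_output=True, text=True, timeout=timeout,
            )
            if result.stdout == case["output"]:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # treat hangs (e.g., infinite loops) as failures
    return passed / len(test_cases)

# Usage: the human-edited fix from the hypothetical record should score 1.0.
tests = [{"input": "3\n", "output": "6\n"}, {"input": "10\n", "output": "55\n"}]
fixed_code = "n = int(input())\nprint(sum(range(n + 1)))"
print(coffeeeval_style_reward(fixed_code, tests))  # -> 1.0
```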
Methodology
The approach uses reinforcement learning to align feedback generation with code-editing efficacy. With CoffeeEval as the reward signal, the feedback model is iteratively optimized so that its feedback leads to code edits that pass more unit tests. This significantly enhances the feedback model, allowing it to produce feedback comparable in quality to that of proprietary models such as GPT-3.5 and even GPT-4.
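The overall loop can be pictured as follows. This is a schematic sketch, not the paper's training code: the policy, editor, and update step are illustrative placeholders for the actual feedback model, the frozen code-editing model, and the RL algorithm (e.g., a PPO-style policy update), and `reward_fn` stands in for CoffeeEval, as in the toy reward above.

```python
# Schematic Coffee-Gym-style training loop (all components are illustrative stubs).

def generate_feedback(policy, problem: str, wrong_code: str) -> str:
    # Stand-in for sampling natural language feedback from the policy (feedback) model.
    return policy(problem, wrong_code)

def edit_code(editor, wrong_code: str, feedback: str) -> str:
    # Stand-in for the frozen editor model that revises the code given the feedback.
    return editor(wrong_code, feedback)

def train_step(policy, editor, batch, reward_fn, update_fn) -> float:
    """One RL step: feedback -> edit -> execute tests -> reward -> policy update."""
    rewards = []
    for sample in batch:
        feedback = generate_feedback(policy, sample["problem"], sample["wrong_code"])
        revised = edit_code(editor, sample["wrong_code"], feedback)
        # Reward is the unit-test pass rate of the revised code (CoffeeEval's idea).
        rewards.append(reward_fn(revised, sample["test_cases"]))
    update_fn(policy, batch, rewards)  # placeholder for the actual RL update
    return sum(rewards) / len(rewards)
```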
Results and Implications
- Experiments show that feedback models trained with Coffee-Gym produce feedback that raises the pass rate of edited code, making open-source LLMs competitive with closed-source counterparts.
- The paper shows that CoffeeEval is a trustworthy evaluation mechanism, surpassing prior approaches in how accurately it reflects feedback quality.
- The performance gains suggest strong potential for wider adoption and further development of open-source feedback models, reducing dependence on expensive proprietary models.
Future Developments
The research paves the way for extending reinforcement learning to other code-related tasks. Future work could apply the feedback models to real-world software engineering settings or adapt them to multilingual programming environments.
Conclusion
Coffee-Gym sets a new benchmark for evaluating and generating natural language feedback on erroneous code, offering tools and methodology that substantially lower the barrier to building effective open-source coding assistants. The work is a valuable contribution to AI-assisted code editing, promoting transparency and accessibility.