Introduction to Self-Play and New Developments
The concept of self-improvement in AI is not novel and has seen remarkable success in strategic games like Go. Applying a similar technique to programming tasks, however, opens a new frontier. Researchers have shown that large language models (LLMs) can autonomously improve their programming problem-solving abilities: the models generate their own programming puzzles together with candidate solutions, and a Python interpreter verifies whether each solution is correct. This methodology lets the model keep improving without additional human-authored problems, pointing toward potential breakthroughs in automated software development.
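To make the verification step concrete, here is a minimal sketch of how a self-generated puzzle and its proposed solution can be checked with the Python interpreter. The puzzle-as-predicate format and the names `sat`, `sol`, and `verify` are illustrative assumptions for this sketch, not necessarily the paper's exact interface.

```python
# Minimal sketch of interpreter-based verification, assuming puzzle and solution
# arrive as Python source strings (names and format are illustrative).

def verify(puzzle_src: str, solution_src: str) -> bool:
    """Return True if the proposed solution satisfies the puzzle's check function."""
    namespace: dict = {}
    try:
        # The puzzle defines a predicate sat(x) that returns True for a valid answer;
        # the solution defines sol() that produces a candidate answer.
        exec(puzzle_src, namespace)
        exec(solution_src, namespace)
        return namespace["sat"](namespace["sol"]()) is True
    except Exception:
        return False  # any runtime error counts as an incorrect solution

puzzle = "def sat(x: int): return x * (x + 1) == 132"
solution = "def sol(): return 11"
print(verify(puzzle, solution))  # True, since 11 * 12 == 132
```

Because the check is just a function call, correctness can be established automatically and at scale, with no human grader in the loop.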
Experimentation with LLMs
In this research, current LLMs were evaluated on a series of programming puzzles ranging from simple to complex and designed to assess code-generation ability. Unlike earlier benchmarks that rely on ambiguous English problem descriptions and require extensive human verification, these puzzles are machine-verifiable and cover a wide variety of computational problems, so grading is automatic (see the scoring sketch below). Since success on such puzzles correlates with programming experience, they offer a direct way to measure, and then improve, an LLM's coding ability.
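As an illustration of why machine-verifiable puzzles remove the need for human grading, the sketch below scores a batch of candidate answers purely by calling each puzzle's check function. The toy puzzles, the stand-in candidate answers, and the `solve_rate` helper are hypothetical, chosen only to show the grading pattern.

```python
# Hedged sketch of automatic scoring: each puzzle ships its own check function,
# so grading a model's answer is a single function call (no human review needed).

from typing import Callable, List

def solve_rate(puzzles: List[Callable], candidates: List[object]) -> float:
    """Fraction of puzzles whose candidate answer passes the puzzle's own check."""
    passed = 0
    for sat, answer in zip(puzzles, candidates):
        try:
            if sat(answer) is True:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply scores zero on that puzzle
    return passed / len(puzzles)

# Two toy puzzles spanning easy to harder computational flavors:
puzzles = [
    lambda s: isinstance(s, str) and s == s[::-1] and len(s) == 5,                        # 5-char palindrome
    lambda x: isinstance(x, int) and x > 100 and all(x % d for d in range(2, x)),          # prime greater than 100
]
candidates = ["level", 101]  # stand-ins for model outputs
print(solve_rate(puzzles, candidates))  # 1.0
```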
Results and Implications
When the models were fine-tuned on verified synthetic problems they had generated themselves, their accuracy on held-out puzzles improved significantly, more than doubling in some cases. The research also indicates that smaller, older models learn more effectively when fine-tuned on data generated by more capable models, suggesting that weaker models can inherit some of the "knowledge" of stronger ones. Beyond the immediate results, this points to broader applications and a possible way to address data scarcity in AI training.
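The self-improvement loop implied here can be outlined as: sample puzzle/solution pairs from the model, keep only the pairs the interpreter verifies, and fine-tune on the survivors. The sketch below assumes placeholder model hooks (`generate_pairs`, `fine_tune` in the usage comment); only the interpreter-based filtering step is spelled out, and the `sat`/`sol` naming follows the earlier illustrative format.

```python
# Hedged outline of the self-training loop: generate, verify, then fine-tune.
# Only the filtering step is concrete; the model hooks are placeholders.

from typing import Callable, List, Tuple

def build_synthetic_dataset(
    generate_pairs: Callable[[int], List[Tuple[str, str]]],
    n_samples: int,
) -> List[Tuple[str, str]]:
    """Keep only (puzzle, solution) source pairs that the interpreter confirms."""
    verified = []
    for puzzle_src, solution_src in generate_pairs(n_samples):
        env: dict = {}
        try:
            exec(puzzle_src, env)      # defines sat(x)
            exec(solution_src, env)    # defines sol()
            if env["sat"](env["sol"]()) is True:
                verified.append((puzzle_src, solution_src))
        except Exception:
            continue  # discard pairs that error out or fail their own check
    return verified

# Usage sketch (hypothetical model hooks):
# pairs = build_synthetic_dataset(model.sample_puzzle_solution_pairs, 100_000)
# fine_tune(model, pairs)   # the filtered pairs become the fine-tuning corpus
```

Filtering before fine-tuning matters: only pairs the interpreter certifies enter the training corpus, so the model is not reinforced on its own mistakes.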
Contributions and Future Directions
This work makes three main contributions. Firstly, it introduces an effective way to generate diverse programming puzzles whose solutions are both correct and efficient. Secondly, it provides open access to a dataset of 1 million synthetic puzzles with their solutions. Lastly, the observed gains suggest that the synthetic puzzles are genuinely instructive: fine-tuning on them improves the model's performance on unseen problems.
Future research could evaluate whether AI can ultimately surpass human code generation by solving open algorithmic or mathematical challenges. The idea of generating puzzles could also be applied to natural language understanding or to other AI fields such as theorem proving. As self-improvement shows promise, further exploration of other synthesis and verification techniques is warranted, potentially expanding the landscape of code generation and AI development.