Introduction
The rapid ascent of LLMs in the domain of program synthesis has introduced transformative capabilities and driven steady gains on performance benchmarks. Notably, many models have shown that scaling up the size of neural networks and the datasets they are trained on yields predictable performance improvements, as described by scaling laws. However, much of the recent literature points to diminishing marginal returns: each further improvement demands ever-larger datasets and compute budgets. In this context, the work presented here departs from conventional scaling and emphasizes the pivotal role of data quality over quantity.
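To make the diminishing-returns point concrete, the sketch below plots a toy power-law scaling curve of the kind described by published scaling-law work. The functional form is standard, but the constants and token counts are placeholders chosen for illustration only, not values taken from this paper.

```python
# Illustrative only: a toy power-law scaling curve in the spirit of published
# scaling laws. The constants are placeholders, not fitted values from any paper.
def toy_loss(n_params: float, n_tokens: float,
             n_c: float = 8.8e13, d_c: float = 5.4e13,
             alpha_n: float = 0.076, alpha_d: float = 0.095) -> float:
    """Loss falls as a power law in model size and dataset size."""
    return (n_c / n_params) ** alpha_n + (d_c / n_tokens) ** alpha_d

# Doubling model size and data yields ever-smaller absolute loss reductions,
# which is the "diminishing marginal returns" referred to above.
for scale in (1, 2, 4, 8):
    print(scale, round(toy_loss(1.3e9 * scale, 7e9 * scale), 4))
```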
High-Quality Data & Model Performance
The core hypothesis of this research is that high data quality can enable a model to surpass state-of-the-art performance with significantly less computational overhead. The proposed phi-1 model, a Transformer-based LLM for code, is relatively compact at 1.3 billion parameters, yet delivers compelling results on benchmarks such as HumanEval and MBPP (roughly 50.6% and 55.5% pass@1, respectively). phi-1 was pretrained on a meticulously curated dataset termed "CodeTextbook," containing high-quality, instructive code snippets. After pretraining, finetuning on the "CodeExercises" dataset, concentrated on Python exercises, led to notable emergent properties and further enhanced its coding proficiency. Crucially, this data-oriented approach not only achieved high pass@1 accuracy on benchmarks but also points toward more environmentally sustainable training by curbing computational cost.
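For context, the pass@1 figures refer to functional-correctness evaluation: a generated sample counts only if it passes the benchmark's unit tests. Below is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks; the sample counts in the example are hypothetical.

```python
# Minimal sketch of the standard unbiased pass@k estimator used for
# functional-correctness benchmarks such as HumanEval and MBPP.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 101 passing -> pass@1 ≈ 0.505
print(pass_at_k(200, 101, 1))
```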
Emergent Capabilities Post-Finetuning
Another section of the paper explores the emergent capabilities of the phi-1 model after finetuning. Qualitative analysis makes clear that finetuning on the CodeExercises dataset enables the model to handle complex tasks effectively, including tasks not explicitly present in the finetuning data. This suggests that finetuning refines and consolidates the model's internal representations, imparting versatility and robustness in problem-solving.
Evaluation Metrics and Benchmarks
The robustness of the phi-1 model is further scrutinized with unconventional evaluation problems designed to be distinct from anything seen during training. To ensure the authenticity of the model's capabilities, a dedicated team authored these problems independently, ruling out memorization of, or overfitting to, benchmark datasets. Remarkably, phi-1 maintained its strong performance, reinforcing confidence in its ability to generalize.
Data Contamination Analysis
A thorough analysis of potential data contamination underscored the integrity of phi-1's results. Combining embedding distances with syntax-based similarity checks, the team pruned the CodeExercises dataset to remove entries that resembled HumanEval tasks. Even after substantial pruning and retraining, phi-1 outperformed larger models, affirming the validity of its achievements and demonstrating the potency of high-quality data in elevating LLM performance.
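As a rough sketch of what such pruning can look like (not the authors' actual pipeline), the snippet below combines an embedding-space similarity with a surface-level syntactic similarity and flags training exercises that sit too close to any benchmark task. The `embed` callable, the thresholds, and the dataset variables in the commented usage are assumptions introduced for illustration.

```python
# Illustrative contamination check: embedding similarity plus a surface-level
# syntactic similarity. The embed() callable and thresholds are placeholders,
# not the authors' actual pipeline.
from difflib import SequenceMatcher

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def is_contaminated(exercise: str, benchmark: str, embed,
                    emb_thresh: float = 0.95, syn_thresh: float = 0.6) -> bool:
    """Flag a training exercise that resembles a benchmark task either in
    embedding space or in surface/syntactic form."""
    emb_sim = cosine(embed(exercise), embed(benchmark))
    syn_sim = SequenceMatcher(None, exercise, benchmark).ratio()
    return emb_sim > emb_thresh or syn_sim > syn_thresh

# Hypothetical usage, assuming code_exercises and humaneval_tasks are lists of strings:
# pruned = [ex for ex in code_exercises
#           if not any(is_contaminated(ex, h, embed) for h in humaneval_tasks)]
```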
Conclusion
In sum, phi-1's strong performance stands as compelling evidence that dataset quality is paramount and can substantially reduce the need for voluminous data and immense computational resources. This work offers an intriguing perspective on efficient and sustainable LLM training, flagging data quality as an avenue ripe for exploration on the path to the next level of LLM proficiency in code generation.