TinyGSM: achieving >80% on GSM8k with small language models (2312.09241v1)

Published 14 Dec 2023 in cs.LG and cs.CL

Abstract: Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80\% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small LLMs to acquire mathematical reasoning. We introduce \texttt{TinyGSM}, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on \texttt{TinyGSM}, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5\% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 ``teacher'' model (77.4\%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset \texttt{TinyGSM}, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.

Introduction

In the field of AI, particularly within the domain of language models (LMs), there continues to be a debate regarding the necessity of large model sizes for complex problem-solving. An especially intriguing area of application is the ability of these models to solve grade school math problems, which require a blend of mathematical reasoning and language understanding. The canonical benchmark for assessing this capability is the GSM8K dataset, which is challenging even for large language models.

TinyGSM and Verifier Model

The research paper presents the TinyGSM dataset, consisting of 12.3 million high-quality synthetic grade school math problems paired with Python solutions, all generated by GPT-3.5. When used to fine-tune a pair of modestly sized 1.3-billion-parameter models (a generation model and an independent verifier model), the dataset yields an accuracy of 81.5% on the GSM8K benchmark. This level of performance surpasses that of much larger models and is significant because it demonstrates that smaller models, given the right training data and strategies, can display advanced problem-solving capabilities comparable to their far larger counterparts.
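To make the dataset format concrete, the sketch below shows an illustrative problem paired with a Python solution in the style the paper describes. The problem and the variable names are hypothetical examples, not taken from TinyGSM itself.

```python
# Illustrative example (not from TinyGSM): a grade-school math problem
# paired with an executable Python solution, as in the dataset's format.
problem = (
    "Alice has 3 boxes of pencils. Each box holds 12 pencils. "
    "She gives 8 pencils to Bob. How many pencils does she have left?"
)

def solution():
    boxes = 3
    pencils_per_box = 12
    given_away = 8
    # Total pencils minus the ones given to Bob.
    remaining = boxes * pencils_per_box - given_away
    return remaining

print(solution())  # 28
```

Expressing solutions as runnable code rather than free-form text lets correctness be checked by simply executing the program.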

Training and Performance

The researchers show that the smaller models, when fine-tuned on the TinyGSM dataset, perform remarkably well, with even a 125M-parameter model attaining 63.1% accuracy on the GSM8K test set. The paper highlights two elements behind this performance: first, the high-quality dataset, and second, the use of a verifier model that selects the most promising solution from multiple candidate generations. Interestingly, the diversity of the verifier's training data appears to matter more than merely scaling up the generation model, pointing to more efficient parameter usage in the verifier.
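The generate-then-verify strategy can be sketched as follows. This is a minimal illustration, assuming hypothetical `generate` and `verifier_score` callables as stand-ins for the finetuned 1.3B generator and verifier; it is not the paper's implementation.

```python
from itertools import cycle

def solve_with_verifier(question, generate, verifier_score, num_candidates=8):
    """Sample several candidate solutions and return the one the
    verifier scores highest (the selection strategy the paper describes)."""
    candidates = [generate(question) for _ in range(num_candidates)]
    return max(candidates, key=lambda sol: verifier_score(question, sol))

# Toy deterministic stand-ins for the two finetuned models.
_samples = cycle(["answer = 30", "answer = 28", "answer = 27"])

def toy_generate(question):
    return next(_samples)

def toy_verifier_score(question, solution):
    # A real verifier would output a learned correctness score.
    return 1.0 if solution == "answer = 28" else 0.0

print(solve_with_verifier("toy question", toy_generate, toy_verifier_score))
# answer = 28
```

The key design point is that the verifier only needs to rank candidates, which is an easier task than generating a correct solution from scratch.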

Potential and Contributions

This work challenges the prevailing notion that language models must be large to be effective problem solvers, especially in mathematical reasoning. Not only does it open new avenues for using smaller, more computationally friendly models in various applications, but it also contributes a synthetic dataset that could prove invaluable for future research. Additionally, the paper offers insights into the importance of verifier models and diverse training data. Future research could explore different solution formats and further investigate the relationship between the sizes of generation models and verifiers.

Authors (8)
  1. Bingbin Liu (11 papers)
  2. Ronen Eldan (60 papers)
  3. Janardhan Kulkarni (52 papers)
  4. Yuanzhi Li (119 papers)
  5. Anh Nguyen (157 papers)
  6. Rachel Ward (80 papers)
  7. Yi Zhang (994 papers)
  8. Sebastien Bubeck (13 papers)
Citations (44)