Introduction
In the field of AI, particularly in the domain of large language models (LLMs), there is an ongoing debate about whether large model sizes are necessary for complex problem-solving. An especially intriguing application is solving grade school math problems, which requires a blend of mathematical reasoning and language understanding. The canonical benchmark for assessing this capability is the GSM8K dataset, which remains challenging even for large models.
TinyGSM and Verifier Model
The research paper presents the TinyGSM dataset, consisting of 12.3 million high-quality synthetic grade school math problems paired with Python solutions, all generated by GPT-3.5. A pair of modestly sized 1.3-billion-parameter models (a generation model and an independent verifier model) fine-tuned on this data achieves 81.5% accuracy on the GSM8K benchmark. This performance surpasses that of much larger models and is significant because it demonstrates that smaller models, given the right training data and strategies, can display problem-solving capabilities comparable to their much larger counterparts.
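To make the dataset format concrete, here is a hypothetical illustration of what a TinyGSM-style record might look like: a grade school word problem paired with a Python function that computes the answer. The problem text, variable names, and layout are assumptions for illustration, not the paper's exact schema.

```python
# Hypothetical TinyGSM-style record: a word problem paired with a Python solution.
# The specific problem and field layout are illustrative assumptions.
problem = (
    "A bakery sells muffins for $3 each. If Mia buys 4 muffins and pays "
    "with a $20 bill, how much change does she receive?"
)

def solution():
    muffin_price = 3              # dollars per muffin
    muffins_bought = 4
    total_cost = muffin_price * muffins_bought
    payment = 20
    change = payment - total_cost
    return change

print(solution())  # 8
```

Expressing solutions as executable code rather than free-form text makes correctness easy to check automatically, which is one reason code-style solutions pair well with synthetic data generation.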
Training and Performance
The researchers show that smaller models fine-tuned on the TinyGSM dataset perform remarkably well, with even a 125M-parameter model attaining 63.1% accuracy on the GSM8K test set. The paper highlights two elements behind this performance: first, the high-quality dataset, and second, the use of a verifier model that selects the best solution from multiple candidate generations. Interestingly, the diversity of the verifier's training data appears to matter more than simply scaling up the generation model, suggesting that parameters allocated to the verifier are used more efficiently.
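The selection step follows a familiar best-of-N pattern: sample several candidate solutions from the generation model, score each with the verifier, and keep the highest-scoring one. Below is a minimal sketch of that pattern; the `generate` and `verifier_score` callables and the default of 48 samples are placeholders, not the paper's actual interfaces or settings.

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate: Callable[[str], str],               # samples one candidate solution (placeholder)
    verifier_score: Callable[[str, str], float],  # scores a (problem, candidate) pair (placeholder)
    n: int = 48,                                  # number of candidates to sample (assumed value)
) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(problem, c))
```

The key design point is that the generator only needs to produce a correct solution somewhere among its samples; the verifier's job is to recognize it, which is why the quality and diversity of the verifier's training data carry so much weight.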
Potential and Contributions
This work challenges the prevailing notion that language models must be large to be effective problem solvers, especially in mathematical reasoning. Not only does it open up new avenues for using smaller, more computationally efficient models in various applications, it also contributes a synthetic dataset that could prove invaluable for future research. The paper additionally offers insights into the importance of verifier models and diverse training data. Future research could explore different solution formats and further investigate the relationship between the sizes of generation models and verifiers.