Emergent Abilities in Reduced-Scale Generative LLMs
Introduction
The ability of LLMs to perform in-context learning (ICL) without fine-tuning has attracted significant interest. Because this ability is observed predominantly in billion-parameter models, a natural question arises: can emergent abilities be unlocked in smaller models by simplifying the pre-training data? This paper explores that question by pre-training 36 causal language models, ranging from 1 million to 165 million parameters, on a simplified English dataset. The results indicate that smaller models trained on simplified data exhibit zero-shot capabilities on par with models six times their size trained on standard, unsimplified data.
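To make the distinction between zero-shot and few-shot (in-context) prompting concrete, the sketch below shows the two prompt styles for a hypothetical classification task; the task, labels, and wording are illustrative and not taken from the paper's evaluation suite.

```python
# Illustrative prompts only; the task, labels, and wording are hypothetical.

# Zero-shot: the model sees only an instruction and the query.
zero_shot_prompt = (
    "Is the following sentence happy or sad?\n"
    "Sentence: The little dog found his ball.\n"
    "Answer:"
)

# Few-shot (in-context learning): solved examples are placed in the prompt,
# and the model continues the pattern without any fine-tuning.
few_shot_prompt = (
    "Sentence: The boy lost his toy.\nAnswer: sad\n"
    "Sentence: Mom made a big cake.\nAnswer: happy\n"
    "Sentence: The little dog found his ball.\nAnswer:"
)
```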
Simplifying Pre-training Data
The core approach filters an existing pre-training corpus down to a simplified vocabulary derived from child-directed speech, yielding a dataset dominated by simple linguistic structures. SlimPajama served as the source corpus: it was filtered to exclude text containing tokens outside the defined vocabulary, keeping the out-of-vocabulary rate minimal.
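A minimal sketch of this kind of vocabulary-based filtering is shown below. The vocabulary file name, the regex word tokenization, and the per-document out-of-vocabulary threshold are assumptions for illustration; the paper's exact filtering pipeline may differ.

```python
# Sketch of vocabulary-based corpus filtering. Assumed details: the
# vocabulary file, the word regex, and the 1% per-document OOV threshold.
import re


def load_vocabulary(path="simple_vocab.txt"):
    """Load the simplified (child-directed) vocabulary, one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def oov_rate(text, vocab):
    """Fraction of word tokens in `text` that fall outside `vocab`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


def filter_corpus(documents, vocab, max_oov=0.01):
    """Keep only documents whose OOV rate stays below `max_oov`."""
    return [doc for doc in documents if oov_rate(doc, vocab) <= max_oov]


# Usage (hypothetical):
# vocab = load_vocabulary()
# simple_docs = filter_corpus(slimpajama_docs, vocab)
```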
Pre-training and Evaluation
Models spanning 1M to 165M parameters were trained with a tokenizer built on the simplified dataset, using an effective batch size adjusted to the token count. Evaluation covered a broad range of tasks, distinguished zero-shot from few-shot performance, and used both a standard and a simplified variant of each task for a more complete comparison.
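As one concrete step, a subword tokenizer can be trained directly on the filtered corpus. The sketch below uses the Hugging Face `tokenizers` library; the BPE model, vocabulary size, special tokens, and file names are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch: build a subword tokenizer on the simplified corpus.
# Assumed details: BPE model, 8k vocabulary, special tokens, file names.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=8_000,  # placeholder value, not the paper's setting
    special_tokens=["<unk>", "<pad>", "<eos>"],
)
tokenizer.train(files=["simple_corpus.txt"], trainer=trainer)
tokenizer.save("simple_tokenizer.json")
```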
Findings
- Zero-Shot Learning Capabilities: Models trained on simplified data showed enhanced zero-shot performance across a range of tasks posed in simplified language, suggesting that tailoring the complexity of the language allows smaller models to exhibit emergent abilities.
- Model Scaling and Performance: Evaluation loss followed a power-law relationship with compute, dataset size, and model size, consistent with findings from larger models, indicating that performance improves predictably with scale even in a simplified-language setting (see the fitting sketch after this list).
- Comparative Performance: The simplified models, particularly the Simple 165M model, achieved zero-shot performance on the simplified evaluation sets that was comparable to, or better than, that of larger counterparts trained on comprehensive datasets.
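To illustrate what such a power-law trend looks like in practice, the sketch below fits evaluation loss against model size in log-log space. The model sizes and loss values are placeholders, not measurements reported in the paper.

```python
# Sketch: fit a power law L(N) = a * N**(-alpha) between loss and model size.
# The data points below are placeholders, not the paper's results.
import numpy as np

model_sizes = np.array([1e6, 5e6, 2e7, 6e7, 1.65e8])  # parameters (illustrative)
eval_losses = np.array([4.1, 3.6, 3.2, 2.9, 2.7])     # placeholder losses

# A power law is a straight line in log-log space:
# log L = log a - alpha * log N, so ordinary least squares recovers alpha.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(eval_losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted exponent alpha ~= {alpha:.3f}, prefactor a ~= {a:.2f}")
```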
Implications and Future Directions
The paper highlights simplified pre-training data as a viable strategy for eliciting emergent abilities in smaller models. Beyond reducing computational cost, this opens new avenues for studying the mechanisms behind in-context learning and the limits of model scaling. Future work could explore further data simplification, combinations with model distillation, and the effectiveness of simplified models in specific application domains.
Conclusions
This paper argues that emergent abilities typically associated with large LLMs can be elicited in smaller models through the strategic simplification of pre-training data. The implications are twofold: first, smaller models are more capable of capturing complex language phenomena than their size suggests; second, simplifying data offers a cost-effective alternative to the prevailing trend of scaling up model size to achieve advanced linguistic capabilities. As such, this work contributes useful insight into the ongoing discussion of effective and efficient ways to improve generative LLMs.