
Emergent Abilities in Reduced-Scale Generative Language Models (2404.02204v1)

Published 2 Apr 2024 in cs.CL and cs.LG

Abstract: LLMs can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in LLMs with billions of parameters. This study investigates whether such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters ranging from 1 million to 165 million. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.

Emergent Abilities in Reduced-Scale Generative Language Models

Introduction

The ability of LLMs to perform in-context learning (ICL) without task-specific fine-tuning has spurred significant interest. This behavior, predominantly observed in billion-parameter models, raises the question: can emergent abilities be unlocked in smaller models through simplified pre-training data? This paper explores that question by pre-training 36 causal language models, with parameter counts ranging from 1 million to 165 million, on a simplified English dataset. The results indicate that smaller models, when trained on simplified data, exhibit zero-shot capabilities on par with models six times their size trained on comprehensive datasets.

Simplifying Pre-training Data

The core approach was to filter existing pre-training corpora against a simplified vocabulary based on child-directed speech, yielding a dataset consisting predominantly of simple linguistic structures. The SlimPajama dataset served as the basis; it was filtered to remove text containing tokens outside the defined vocabulary, keeping the out-of-vocabulary rate minimal.
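The paper does not reproduce its filtering pipeline here, but a minimal sketch of this kind of vocabulary-based filtering could look like the following. The vocabulary file name, the word-splitting regex, and the 5% out-of-vocabulary cutoff are illustrative assumptions, not the authors' exact settings.

```python
import re

VOCAB_PATH = "simple_vocab.txt"   # hypothetical file: one simplified-vocabulary word per line
MAX_OOV_RATE = 0.05               # illustrative cutoff; the paper's exact threshold may differ


def load_vocab(path: str) -> set:
    """Read the simplified (child-directed) vocabulary into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def oov_rate(text: str, vocab: set) -> float:
    """Fraction of words in the text that fall outside the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 1.0
    return sum(w not in vocab for w in words) / len(words)


def filter_corpus(documents, vocab, max_oov=MAX_OOV_RATE):
    """Yield only documents whose out-of-vocabulary rate stays below the cutoff."""
    for doc in documents:
        if oov_rate(doc, vocab) <= max_oov:
            yield doc
```

Applied to a corpus like SlimPajama, this keeps only text that can be expressed almost entirely in the restricted vocabulary, which is the property the simplified pre-training set is built around.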

Pre-training and Evaluation

Models were trained at a range of sizes, using a tokenizer built on the simplified dataset. Training used an effective batch size adjusted to the token count, with model sizes spanning 1M to 165M parameters. Evaluation covered a broad spectrum of tasks, distinguishing zero-shot from few-shot settings, and used both a standard and a simplified variant of each task for a comprehensive analysis.
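Because the tokenizer is trained on the filtered corpus rather than reused from an existing model, a sketch of that step with the Hugging Face `tokenizers` library is shown below; the vocabulary size, special tokens, and file names are assumptions for illustration, not the paper's exact configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding tokenizer trained only on the simplified corpus,
# so its subword inventory reflects the restricted vocabulary.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=16_000,                          # assumed size; the paper's setting may differ
    special_tokens=["[UNK]", "[PAD]", "[EOS]"],
)
tokenizer.train(files=["simplified_corpus.txt"], trainer=trainer)
tokenizer.save("simple_bpe_tokenizer.json")
```

Because the embedding matrix grows with vocabulary size, a compact subword vocabulary matters most at the 1M-parameter end of the model range.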

Findings

  • Zero-Shot Learning Capabilities: Simplified models demonstrated enhanced zero-shot learning capabilities across various tasks in simplified language, suggesting that by tailoring the complexity of the language, smaller models can indeed exhibit emergent abilities.
  • Model Scaling and Performance: The evaluation loss followed a power law in each of the scaling factors (compute, dataset size, and model size), consistent with findings from larger models and indicating predictable performance gains with increasing scale, even in a simplified language setting; a short fitting sketch follows this list.
  • Comparative Performance: Simplified models, particularly the Simple 165M model, achieved zero-shot performance on simplified datasets that was comparable to or better than that of their larger counterparts trained on comprehensive datasets.
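The power-law relationship in the second finding can be illustrated with a short log-log regression. The loss values below are placeholders rather than numbers from the paper, and the parameterization L(N) = (N_c / N)^alpha follows the usual scaling-law form (reference 22) rather than the authors' exact fit.

```python
import numpy as np

# Placeholder (not the paper's) measurements: model size N in parameters vs. evaluation loss L.
model_sizes = np.array([1e6, 5e6, 2e7, 6.5e7, 1.65e8])
eval_losses = np.array([4.1, 3.6, 3.2, 2.9, 2.7])

# Fit L(N) = (N_c / N) ** alpha, which is a straight line in log-log space:
# log L = alpha * log N_c - alpha * log N.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(eval_losses), deg=1)
alpha = -slope                    # power-law exponent
N_c = np.exp(intercept / alpha)   # scale constant

print(f"fitted exponent alpha = {alpha:.3f}, scale constant N_c = {N_c:.2e}")
```

The same fit applies to compute and dataset size by swapping in the corresponding variable on the x-axis.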

Implications and Future Directions

The paper sheds light on the potential of simplifying pre-training data as a viable strategy to elicit emergent abilities in smaller models. This has ramifications for reducing computational costs and opens new avenues for research into the mechanisms behind in-context learning and the limits of model scaling. Future work could explore further data simplification, integration with model distillation techniques, and the efficacy of simplified models in specific application domains.

Conclusions

This paper posits that the emergent abilities typically reserved for LLMs can be accessed by smaller models through the strategic simplification of pre-training data. The implications of this are twofold: firstly, it highlights the adaptability and potential of smaller models in capturing complex language phenomena; secondly, it proposes a cost-effective alternative to the prevailing trend of scaling up model size for achieving advanced linguistic capabilities. As such, this research contributes valuable insights into the ongoing dialogue on effective and efficient ways to enhance the performance of generative LLMs.

References (45)
  1. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
  2. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  4. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  5. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  6. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  7. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  8. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  9. Honey, I shrunk the language: Language model behavior at reduced scale. arXiv preprint arXiv:2305.17266.
  10. Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third international workshop on paraphrasing (IWP2005).
  11. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796.
  12. Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759.
  13. A framework for few-shot language model evaluation.
  14. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
  15. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717.
  16. Textbooks are all you need. arXiv preprint arXiv:2306.11644.
  17. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  18. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  19. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436, Online. Association for Computational Linguistics.
  20. Babyberta: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th conference on computational natural language learning, pages 624–646.
  21. Philip A Huebner and Jon A Willits. 2021. Using lexical context to discover the noun category: Younger children have it easier. In Psychology of learning and motivation, volume 75, pages 279–331. Elsevier.
  22. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  23. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209.
  24. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, Toronto, Canada. Association for Computational Linguistics.
  25. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.
  26. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.
  27. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  28. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  29. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  30. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  31. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004.
  32. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  33. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
  34. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  35. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
  36. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  37. Inar Timiryasov and Jean-Loup Tastet. 2023. Baby llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. arXiv preprint arXiv:2308.02019.
  38. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  39. Attention is all you need. Advances in neural information processing systems, 30.
  40. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8:377–392.
  41. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
  42. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  43. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  44. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116.
  45. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Authors (4)
  1. Sherin Muckatira (5 papers)
  2. Vijeta Deshpande (6 papers)
  3. Vladislav Lialin (14 papers)
  4. Anna Rumshisky (42 papers)
Citations (2)