
Emergent Abilities in Reduced-Scale Generative Language Models (2404.02204v1)

Published 2 Apr 2024 in cs.CL and cs.LG

Abstract: LLMs can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in LLMs with billions of parameters. This study investigates whether such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters ranging from 1 million to 165 million. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.

Emergent Abilities in Reduced-Scale Generative Language Models

Introduction

The ability of LLMs to perform in-context learning (ICL) without task-specific fine-tuning has spurred significant interest. This behavior, predominantly observed in billion-parameter models, raises the question: can emergent abilities be unlocked in smaller models through simplified pre-training data? This paper explores that question by pre-training 36 causal language models, with parameter counts ranging from 1 million to 165 million, on a simplified English dataset. The results indicate that smaller models, when trained on simplified data, exhibit zero-shot capabilities on par with models six times their size trained on comprehensive datasets.

Simplifying Pre-training Data

The core approach was to filter existing pre-training corpora against a simplified vocabulary based on child-directed speech, yielding a dataset consisting predominantly of simple linguistic structures. The SlimPajama dataset served as the basis; it was filtered to remove text containing tokens outside the defined vocabulary, keeping the out-of-vocabulary rate minimal.
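The paper does not reproduce its filtering pipeline here, but a minimal sketch of this kind of vocabulary-based filtering could look like the following. The vocabulary file name, the word-splitting regex, and the 5% out-of-vocabulary cutoff are illustrative assumptions, not the authors' exact settings.

```python
import re

VOCAB_PATH = "simple_vocab.txt"   # hypothetical file: one simplified-vocabulary word per line
MAX_OOV_RATE = 0.05               # illustrative cutoff; the paper's exact threshold may differ


def load_vocab(path: str) -> set:
    """Read the simplified (child-directed) vocabulary into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def oov_rate(text: str, vocab: set) -> float:
    """Fraction of words in the text that fall outside the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 1.0
    return sum(w not in vocab for w in words) / len(words)


def filter_corpus(documents, vocab, max_oov=MAX_OOV_RATE):
    """Yield only documents whose out-of-vocabulary rate stays below the cutoff."""
    for doc in documents:
        if oov_rate(doc, vocab) <= max_oov:
            yield doc
```

Applied to a corpus like SlimPajama, this keeps only text that can be expressed almost entirely in the restricted vocabulary, which is the property the simplified pre-training set is built around.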

Pre-training and Evaluation

Models were trained at a range of sizes, using a tokenizer built on the simplified dataset. Training used an effective batch size adjusted to the token count, with model sizes spanning 1M to 165M parameters. Evaluation covered a broad spectrum of tasks, distinguishing zero-shot from few-shot settings, and used both a standard and a simplified variant of each task for a comprehensive analysis.
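Because the tokenizer is trained on the filtered corpus rather than reused from an existing model, a sketch of that step with the Hugging Face `tokenizers` library is shown below; the vocabulary size, special tokens, and file names are assumptions for illustration, not the paper's exact configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding tokenizer trained only on the simplified corpus,
# so its subword inventory reflects the restricted vocabulary.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=16_000,                          # assumed size; the paper's setting may differ
    special_tokens=["[UNK]", "[PAD]", "[EOS]"],
)
tokenizer.train(files=["simplified_corpus.txt"], trainer=trainer)
tokenizer.save("simple_bpe_tokenizer.json")
```

Because the embedding matrix grows with vocabulary size, a compact subword vocabulary matters most at the 1M-parameter end of the model range.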

Findings

  • Zero-Shot Learning Capabilities: Simplified models demonstrated enhanced zero-shot learning capabilities across various tasks in simplified language, suggesting that by tailoring the complexity of the language, smaller models can indeed exhibit emergent abilities.
  • Model Scaling and Performance: The evaluation loss followed a power law in each of the scaling factors (compute, dataset size, and model size), consistent with findings from larger models and indicating predictable performance gains with increasing scale, even in a simplified language setting; a short fitting sketch follows this list.
  • Comparative Performance: Simplified models, particularly the Simple 165M model, achieved zero-shot performance on simplified datasets that was comparable to or better than that of their larger counterparts trained on comprehensive datasets.
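The power-law relationship in the second finding can be illustrated with a short log-log regression. The loss values below are placeholders rather than numbers from the paper, and the parameterization L(N) = (N_c / N)^alpha follows the usual scaling-law form (reference 22) rather than the authors' exact fit.

```python
import numpy as np

# Placeholder (not the paper's) measurements: model size N in parameters vs. evaluation loss L.
model_sizes = np.array([1e6, 5e6, 2e7, 6.5e7, 1.65e8])
eval_losses = np.array([4.1, 3.6, 3.2, 2.9, 2.7])

# Fit L(N) = (N_c / N) ** alpha, which is a straight line in log-log space:
# log L = alpha * log N_c - alpha * log N.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(eval_losses), deg=1)
alpha = -slope                    # power-law exponent
N_c = np.exp(intercept / alpha)   # scale constant

print(f"fitted exponent alpha = {alpha:.3f}, scale constant N_c = {N_c:.2e}")
```

The same fit applies to compute and dataset size by swapping in the corresponding variable on the x-axis.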

Implications and Future Directions

The paper sheds light on the potential of simplifying pre-training data as a viable strategy to elicit emergent abilities in smaller models. This has ramifications for reducing computational costs and opens new avenues for research into the mechanisms behind in-context learning and the limits of model scaling. Future work could explore further data simplification, integration with model distillation techniques, and the efficacy of simplified models in specific application domains.

Conclusions

This paper posits that the emergent abilities typically reserved for LLMs can be accessed by smaller models through the strategic simplification of pre-training data. The implications of this are twofold: firstly, it highlights the adaptability and potential of smaller models in capturing complex language phenomena; secondly, it proposes a cost-effective alternative to the prevailing trend of scaling up model size for achieving advanced linguistic capabilities. As such, this research contributes valuable insights into the ongoing dialogue on effective and efficient ways to enhance the performance of generative LLMs.

References (45)
  1. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
  2. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  4. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  5. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  6. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  7. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  8. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  9. Honey, I shrunk the language: Language model behavior at reduced scale. arXiv preprint arXiv:2305.17266.
  10. Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third international workshop on paraphrasing (IWP2005).
  11. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796.
  12. Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759.
  13. A framework for few-shot language model evaluation.
  14. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
  15. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717.
  16. Textbooks are all you need. arXiv preprint arXiv:2306.11644.
  17. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  18. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  19. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436, Online. Association for Computational Linguistics.
  20. Babyberta: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th conference on computational natural language learning, pages 624–646.
  21. Philip A Huebner and Jon A Willits. 2021. Using lexical context to discover the noun category: Younger children have it easier. In Psychology of learning and motivation, volume 75, pages 279–331. Elsevier.
  22. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  23. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209.
  24. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, Toronto, Canada. Association for Computational Linguistics.
  25. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.
  26. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.
  27. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  28. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  29. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  30. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  31. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004.
  32. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  33. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
  34. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  35. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
  36. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  37. Inar Timiryasov and Jean-Loup Tastet. 2023. Baby llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. arXiv preprint arXiv:2308.02019.
  38. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  39. Attention is all you need. Advances in neural information processing systems, 30.
  40. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8:377–392.
  41. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
  42. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  43. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  44. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116.
  45. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Authors (4)
  1. Sherin Muckatira (5 papers)
  2. Vijeta Deshpande (6 papers)
  3. Vladislav Lialin (14 papers)
  4. Anna Rumshisky (42 papers)
Citations (2)