Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale (2305.17266v2)

Published 26 May 2023 in cs.CL

Abstract: In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling laws studies focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost (FLOPs) below $2.2 \times 10^{15}$ FLOPs. We also find that adding layers does not always benefit downstream performance.
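
For readers unfamiliar with the scaling-law framing, compute-optimal results of this kind are typically summarized by a power law relating pre-training loss to training compute. A generic form (illustrative only; $a$, $\alpha$, and the irreducible-loss term $L_\infty$ are placeholders, not the paper's fitted values) is

$$L(C) \approx L_\infty + a \, C^{-\alpha},$$

where $C$ is compute in FLOPs and $a, \alpha > 0$ are fitted constants. The paper's observation is that a single fit of this form stops describing the MLM loss of compute-optimal models once compute drops below roughly $2.2 \times 10^{15}$ FLOPs.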

Authors (5)
  1. Vijeta Deshpande (6 papers)
  2. Dan Pechi (2 papers)
  3. Shree Thatte (1 paper)
  4. Vladislav Lialin (14 papers)
  5. Anna Rumshisky (42 papers)
Citations (7)