
Training Trajectories of Language Models Across Scales (2212.09803v3)

Published 19 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Scaling up LLMs has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do LLMs of different sizes learn during pre-training? Why do larger LLMs demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
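
The paper's analyses key off perplexity measured on intermediate checkpoints rather than on model size or training compute. As a rough illustration of that quantity (not the authors' released evaluation code), the sketch below scores a short text with a causal LM via Hugging Face transformers; the checkpoint name and the toy text are placeholders, standing in for an intermediate OPT checkpoint in the same format.

```python
# Minimal sketch: perplexity of a causal LM checkpoint on a piece of text.
# "facebook/opt-125m" is the final released checkpoint, used here as a
# stand-in for an intermediate training checkpoint.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-125m"  # placeholder for an intermediate checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

text = "Scaling up language models has led to unprecedented performance gains."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean next-token
    # cross-entropy loss; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```

Repeating this over a sequence of checkpoints and a fixed validation set gives the perplexity trajectory the paper correlates with downstream and in-context learning behavior.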

Authors (8)
  1. Mengzhou Xia (34 papers)
  2. Mikel Artetxe (52 papers)
  3. Chunting Zhou (36 papers)
  4. Xi Victoria Lin (39 papers)
  5. Ramakanth Pasunuru (32 papers)
  6. Danqi Chen (84 papers)
  7. Luke Zettlemoyer (225 papers)
  8. Ves Stoyanov (15 papers)
Citations (45)