How predictable is language model benchmark performance? (2401.04757v1)

Published 9 Jan 2024 in cs.LG and cs.AI

Abstract: We investigate LLM performance across five orders of magnitude of compute scaling in eleven recent model architectures. We show that average benchmark performance, aggregating over many individual tasks and evaluations as in the commonly-used BIG-Bench dataset, is decently predictable as a function of training compute scale. Specifically, when extrapolating BIG-Bench Hard performance across one order of magnitude in compute, we observe average absolute errors of 6 percentage points (pp). By contrast, extrapolation for individual BIG-Bench tasks across an order of magnitude in compute yields higher average errors of 18pp. Nonetheless, individual task performance remains significantly more predictable than chance. Overall, our work suggests compute scaling provides a promising basis to forecast AI capabilities in diverse benchmarks, though predicting performance in specific tasks poses challenges.
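The extrapolation exercise the abstract describes can be pictured concretely: fit a saturating curve to aggregate benchmark accuracy as a function of log training compute, hold out the largest-compute models, and measure how far the extrapolated predictions land from the observed scores, in percentage points. The sketch below illustrates that workflow with SciPy; the synthetic data points and the three-parameter logistic form are assumptions for illustration, not the paper's exact data or fitting procedure.

```python
# Minimal sketch of compute-based benchmark extrapolation: fit a logistic
# curve in log10(training compute) to aggregate accuracy, predict one order
# of magnitude beyond the fitting range, and report the mean absolute error
# in percentage points (pp). Data and functional form are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_compute, midpoint, slope, ceiling):
    # Saturating curve in log10(compute): rises from ~0 toward `ceiling`.
    return ceiling / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Synthetic (log10 training FLOP, benchmark accuracy) pairs standing in for
# per-model aggregate scores.
log_flop = np.array([21.0, 21.5, 22.0, 22.5, 23.0, 23.5, 24.0])
accuracy = np.array([0.12, 0.16, 0.22, 0.31, 0.43, 0.55, 0.63])

# Fit on the smaller-compute models, then extrapolate one order of magnitude.
fit_mask = log_flop <= 23.0
params, _ = curve_fit(
    logistic,
    log_flop[fit_mask],
    accuracy[fit_mask],
    p0=[23.0, 1.5, 0.9],  # initial guesses: midpoint, slope, ceiling
    maxfev=10_000,
)

held_out = ~fit_mask
pred = logistic(log_flop[held_out], *params)
mae_pp = 100.0 * np.mean(np.abs(pred - accuracy[held_out]))
print(f"mean absolute extrapolation error: {mae_pp:.1f} pp")
```

Applied to the paper's actual data, this style of one-order-of-magnitude extrapolation yields average absolute errors of roughly 6 pp for aggregate BIG-Bench Hard performance, versus about 18 pp for individual tasks.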

Authors (1)
  1. David Owen (10 papers)
Citations (11)