How predictable is language model benchmark performance? (2401.04757v1)
Abstract: We investigate LLM performance across five orders of magnitude of compute scaling in eleven recent model architectures. We show that average benchmark performance, aggregating over many individual tasks and evaluations as in the commonly-used BIG-Bench dataset, is decently predictable as a function of training compute scale. Specifically, when extrapolating BIG-Bench Hard performance across one order of magnitude in compute, we observe average absolute errors of 6 percentage points (pp). By contrast, extrapolation for individual BIG-Bench tasks across an order of magnitude in compute yields higher average errors of 18pp. Nonetheless, individual task performance remains significantly more predictable than chance. Overall, our work suggests compute scaling provides a promising basis to forecast AI capabilities in diverse benchmarks, though predicting performance in specific tasks poses challenges.
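The extrapolation procedure the abstract refers to can be sketched as a curve fit of benchmark accuracy against log training compute, holding out the highest-compute observation and scoring the prediction in percentage points. The snippet below is a minimal illustration using SciPy; the two-parameter logistic in log10(compute) and the data points are assumptions for demonstration, not the paper's exact functional form or data.

```python
# Minimal sketch of the kind of extrapolation described in the abstract:
# fit accuracy as a logistic function of log10(training compute), hold out
# the highest-compute observation, and report the absolute prediction error
# in percentage points (pp). Data and functional form are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_compute, midpoint, slope):
    """Accuracy in [0, 1] as a logistic function of log10 training compute (FLOP)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Illustrative observations for one model family (compute in FLOP).
compute = np.array([1e21, 3e21, 1e22, 3e22, 3e23])
accuracy = np.array([0.28, 0.33, 0.41, 0.52, 0.65])
log_c = np.log10(compute)

# Fit on all but the largest-compute point, then extrapolate roughly one
# order of magnitude beyond the training range.
params, _ = curve_fit(logistic, log_c[:-1], accuracy[:-1], p0=[22.0, 1.0])
predicted = logistic(log_c[-1], *params)

error_pp = abs(predicted - accuracy[-1]) * 100.0
print(f"predicted {predicted:.3f}, observed {accuracy[-1]:.3f}, error {error_pp:.1f} pp")
```

In the paper, errors of this kind are averaged over many models and tasks: aggregate BIG-Bench Hard scores extrapolate with roughly 6pp average absolute error across one order of magnitude of compute, while individual tasks average roughly 18pp.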
Author: David Owen