
Inverse Scaling: When Bigger Isn't Better (2306.09479v2)

Published 15 Jun 2023 in cs.CL, cs.AI, and cs.CY

Abstract: Work on scaling laws has found that language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training LMs.

Insights into Inverse Scaling in LLMs

The paper "Inverse Scaling: When Bigger Isn't Better" introduces a compelling observation in the field of LLM (LM) performance, specifically the inverse scaling phenomenon. Traditionally, larger LMs, characterized by increased parameters, more extensive training data, and higher compute power, exhibit improved performance across various tasks. However, this research challenges the conventional wisdom, presenting data that for certain tasks, performance declines as model scale increases. The research leverages empirical data curated from the Inverse Scaling Prize contest and provides insightful analysis into potential causes of inverse scaling, marking an important contribution to understanding LM behaviors beyond mere performance metrics.

Summary of Findings

The researchers focus on 11 datasets showcasing the inverse scaling phenomenon. They identify four primary causes of inverse scaling (a scoring sketch follows the list):

  1. Strong Prior: Larger models may prefer repeating memorized sequences over following in-context instructions. Resisting Correction exhibits this: asked to repeat ungrammatical or atypical sentences verbatim, larger LMs instead "correct" them, revealing a strong pull toward high-probability memorized sequences.
  2. Unwanted Imitation: LMs imitate undesirable patterns in the training data. The Modus Tollens task exemplifies this: models mirror the common human failure to apply the valid inference rule modus tollens (from "if P then Q" and "not Q", conclude "not P").
  3. Distractor Task: LMs latch onto an easier distractor task rather than the harder intended task. In Pattern Match Suppression, LMs fail to break a simple repeating pattern even when explicitly instructed to do so.
  4. Spurious Few-Shot: Correct but misleading few-shot examples lead LMs to infer a spurious pattern rather than the intended task logic, as in the Hindsight Neglect task, where models judge bets by their outcomes rather than their expected value.
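
To make this concrete, here is a minimal sketch of how such a two-way classification item can be scored: the model's answer is taken to be the class whose tokens receive the higher total log-probability after the prompt. The prompt wording is hypothetical (paraphrasing the Pattern Match Suppression setup), and GPT-2 via Hugging Face transformers stands in for the far larger models the paper evaluates.

```python
# Sketch: scoring one two-way classification item. The item wording is
# hypothetical; GPT-2 is a stand-in for the models studied in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def class_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Each completion token is predicted by the position just before it.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

prompt = ("Repeat A and B in alternation, but end the sequence "
          "by breaking the pattern:\nA, B, A, B, A,")
classes = [" A", " B"]   # " A" breaks the pattern; " B" continues it
answer_index = 0         # the instructed answer breaks the pattern

scores = [class_logprob(prompt, c) for c in classes]
predicted = max(range(len(classes)), key=scores.__getitem__)
print("correct" if predicted == answer_index else "incorrect", scores)
```

An inverse-scaling curve emerges when this accuracy, aggregated over a task's items, falls rather than rises as the same procedure is run across progressively larger models.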

The authors release these datasets to encourage further investigation, providing a significant resource for the community to examine the nuanced scaling behaviors of LMs.
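
For anyone picking up the released data, a minimal loading sketch follows. It assumes the classification tasks ship as CSV files with prompt, classes, and answer_index columns, as in the contest's submission format; the file name is hypothetical, so check https://inversescaling.com/data for the actual schema.

```python
# Sketch: iterating over one task file. Column names are an assumption
# based on the contest's classification format; verify against the data.
import ast
import csv

def load_task(path):
    """Yield one dict per row of a prize-style classification task file."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "prompt": row["prompt"],
                # `classes` is a stringified list, e.g. "[' A', ' B']"
                "classes": ast.literal_eval(row["classes"]),
                "answer_index": int(row["answer_index"]),
            }

# Hypothetical usage once a task file has been downloaded:
# for item in load_task("pattern_match_suppression.csv"):
#     print(item["prompt"], item["classes"][item["answer_index"]])
```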

Implications and Theoretical Considerations

The implications of this research are profound, both practically and theoretically. Practically, inverse scaling presents a challenge to reliance on larger LMs for improved performance, especially in critical applications requiring accurate and context-sensitive responses. This necessitates more thoughtful model training strategies that go beyond increasing scale.

Theoretically, the findings compel a reconsideration of scaling laws and their reliability for predicting task performance. The emergence of U-shaped and inverted-U scaling trends, in which an initial trend reverses at larger scales, undermines the assumption that performance changes monotonically with scale and suggests a more complex interaction between model capacity and task demands.
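
As a toy illustration of these shapes, the helper below labels a sequence of accuracies, ordered by increasing model scale, as positive, inverse, U-shaped, or inverted-U scaling. The rule (a single sign change in successive differences) is an illustrative heuristic, not the paper's methodology.

```python
# Sketch: labeling a scaling trend from accuracies ordered by model scale.
# The single-sign-change rule is an illustrative heuristic only.
def trend_shape(accuracies):
    """Classify a scaling trend given per-scale accuracies, smallest model first."""
    deltas = [b - a for a, b in zip(accuracies, accuracies[1:])]
    if all(d >= 0 for d in deltas):
        return "positive scaling"
    if all(d <= 0 for d in deltas):
        return "inverse scaling"
    # One sign change in the nonzero deltas indicates a U or inverted-U shape.
    signs = [1 if d > 0 else -1 for d in deltas if d != 0]
    changes = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    if changes == 1:
        return "U-shaped" if signs[0] < 0 else "inverted-U"
    return "mixed / noisy"

# Example: accuracy first degrades, then recovers at the largest scales.
print(trend_shape([0.70, 0.55, 0.40, 0.62, 0.81]))  # -> "U-shaped"
```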

Moreover, the phenomenon of inverse scaling underscores the importance of designing LMs that genuinely understand and analyze their inputs rather than merely pattern-match or recite memorized data. Training objectives that align closely with intended tasks, and that mitigate undesirable behaviors, therefore become crucial.

Future Developments in AI

Looking ahead, the research points to several avenues for advancing AI technology and theory. Mitigation strategies such as targeted fine-tuning of pretrained models, reinforcement learning from human feedback (RLHF), or fundamentally revisiting pretraining objectives could ameliorate inverse scaling effects. Such strategies could enable the development of LMs that are both scalable and reliable across a wider array of tasks, including those that defy traditional scaling trends.

Additionally, understanding inverse scaling can contribute to AI safety and alignment by helping identify scenarios in which models deviate unexpectedly from desired behaviors. This understanding could support the design of LMs that balance scale with nuanced task comprehension, reducing susceptibility to failures born of purely statistical or memorized patterns.

In conclusion, this research opens new discourse around the capabilities, limitations, and potential risks associated with large LMs, urging the community to rethink established scaling paradigms and encouraging a more holistic approach to model development and deployment. The datasets and insights provided serve as a valuable foundation for future explorations in this critical area of AI research.

Authors (27)
  1. Ian R. McKenzie
  2. Alexander Lyzhov
  3. Michael Pieler
  4. Alicia Parrish
  5. Aaron Mueller
  6. Ameya Prabhu
  7. Euan McLean
  8. Aaron Kirtland
  9. Alexis Ross
  10. Alisa Liu
  11. Andrew Gritsevskiy
  12. Daniel Wurgaft
  13. Derik Kauffman
  14. Gabriel Recchia
  15. Jiacheng Liu
  16. Joe Cavanagh
  17. Max Weiss
  18. Sicong Huang
  19. The Floating Droid
  20. Tom Tseng
Citations (104)