Teaching Arithmetic to Small Transformers (2307.03381v1)

Published 7 Jul 2023 in cs.LG

Abstract: LLMs like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.

Teaching Arithmetic to Small Transformers

The paper "Teaching Arithmetic to Small Transformers" presents a thorough exploration of how small transformer models can learn arithmetic operations effectively using varying data formatting and sampling strategies. Given the context of LLMs, such as GPT-4, exhibiting emergent arithmetic abilities, the paper investigates whether similar capabilities can be achieved in models with fewer parameters. This discourse intends to synthesize the findings and implications of the research, presenting them clearly for computational linguists and AI researchers.

The core inquiry of the paper is whether small transformer architectures, trained from random initialization, can efficiently learn arithmetic tasks such as addition, subtraction, and multiplication, along with elementary functions like sine and square root. The research is premised on the hypothesis that careful formatting of the training data yields notable gains in accuracy and sample efficiency, two crucial considerations for transformer-based models.
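
To make the training setup concrete, the following is a minimal sketch, not the authors' code, of how a character-level next-token-prediction dataset over arithmetic strings can be constructed, in the spirit of the nanoGPT-style setup the paper builds on; the sample format and helper names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact pipeline): build a
# character-level corpus of addition problems and form next-token targets.
import random

def make_addition_sample(n_digits=3):
    # Hypothetical helper: one plain-format sample, e.g. "128+367=495\n".
    a = random.randint(0, 10**n_digits - 1)
    b = random.randint(0, 10**n_digits - 1)
    return f"{a}+{b}={a + b}\n"

corpus = "".join(make_addition_sample() for _ in range(1000))
vocab = sorted(set(corpus))                    # digits plus '+', '=', '\n'
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> integer id

ids = [stoi[ch] for ch in corpus]
# Next-token prediction: at position t the model reads ids[t] (and the
# preceding context) and is trained to predict ids[t + 1].
inputs, targets = ids[:-1], ids[1:]
print(corpus[:30])
print(inputs[:10], targets[:10])
```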

Key Findings

  1. Data Formatting Impact: The paper underscores that conventional training data for arithmetic is not optimal. Reversing the order of the output digits in operations like addition (the "reverse" format) markedly improves accuracy and sample efficiency compared to the plain format. The reversed output follows the least-significant-digit-first order in which carries are computed, so each emitted digit is a simple function of the corresponding operand digits and the carry from the step just produced, and the sharp phase transitions in accuracy as a function of training-set size can, in some cases, be explained through connections to low-rank matrix completion (LRMC). A short sketch of these formats follows the list.
  2. Chain-of-Thought (CoT) Data: Building on prior work, the experiments indicate that CoT-style training data, which spells out intermediate step-by-step results, further improves accuracy, sample complexity, and convergence speed. Notably, this holds even in the complete absence of language pretraining, suggesting that CoT data decomposes compositional operations into simpler single-step subtasks that the model can learn efficiently.
  3. Role of Pretraining: Fine-tuning pretrained models such as GPT-2 and GPT-3 on arithmetic shows that, despite their large parameter counts, these models initially perform poorly on straightforward arithmetic tasks. However, they reach reasonable competence with relatively few additional training samples, and detailed CoT data leads to the best performance.
  4. Generalization Challenges: The authors document notable limitations in length generalization: models falter when asked to handle operand lengths beyond those seen during training. Despite fine-tuning and CoT data, extending to longer inputs without retraining proved difficult, suggesting that the models lean toward memorizing mappings rather than acquiring a general algorithm.
  5. Training on Text and Arithmetic Mixtures: To simulate the training conditions of LLMs, experiments mixing arithmetic examples with text (e.g., from Shakespeare's works) examine how the two data sources interact. The models learn both skills from the mixed corpus, with the effect of the mixture most pronounced in text-heavy settings, where consistent formatting of the arithmetic examples proved crucial; a sketch of such a mixture also follows the list.
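
The data formats referenced in items 1 and 2 can be illustrated with a small sketch. This is an assumption-laden rendering, not the paper's exact encoding: the delimiters and the wording of the scratchpad steps are invented for illustration.

```python
# Illustrative sketch of three data formats: plain, reversed output, and a
# simplified chain-of-thought/scratchpad. Delimiters are assumptions.
def plain(a, b):
    return f"{a}+{b}={a + b}"

def reverse(a, b):
    # Output digits written least-significant first, the order in which
    # carries are actually produced.
    return f"${a}+{b}={str(a + b)[::-1]}$"

def scratchpad(a, b):
    # Spell out digit-by-digit addition with explicit carries.
    da, db = str(a)[::-1], str(b)[::-1]
    carry, steps, out = 0, [], []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        cin = carry
        carry, digit = divmod(x + y + cin, 10)
        out.append(str(digit))
        steps.append(f"{x}+{y}+{cin} -> digit {digit}, carry {carry}")
    if carry:
        out.append(str(carry))
    return f"{a}+{b}\n" + "\n".join(steps) + f"\nanswer: {''.join(reversed(out))}"

print(plain(128, 367))       # 128+367=495
print(reverse(128, 367))     # $128+367=594$
print(scratchpad(128, 367))  # per-digit steps, then: answer: 495
```

The point of the reversed and scratchpad variants is that each emitted token depends only on a small amount of locally available information, which is what makes them friendlier to the next-token objective than the plain format.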

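For item 5, a minimal sketch of how text and arithmetic might be blended into one training stream is shown below; the mixing strategy and corpus here are illustrative assumptions rather than the paper's exact recipe.

```python
# Illustrative sketch: interleave plain-text lines with consistently
# formatted addition samples to form one character-level training corpus.
import random

def mixed_corpus(text_lines, n_arith, seed=0):
    rng = random.Random(seed)
    arith = [f"{a}+{b}={a + b}"
             for a, b in ((rng.randint(0, 999), rng.randint(0, 999))
                          for _ in range(n_arith))]
    lines = list(text_lines) + arith
    rng.shuffle(lines)             # mix the two sources
    return "\n".join(lines)

shakespeare = ["To be, or not to be, that is the question:",
               "Whether 'tis nobler in the mind to suffer"]
print(mixed_corpus(shakespeare, n_arith=3))
```
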
Implications and Future Directions

The implications of this research bear both on understanding how arithmetic capabilities emerge in large LLMs and on optimizing lower-parameter transformer models for arithmetic tasks. The emphasis on data quality, formatting, and sample efficiency demonstrated throughout provides actionable guidance for pretraining and fine-tuning strategies, making comparable performance attainable with smaller models and less data.

Future research could focus on improving length generalization, possibly by embedding algorithmic structure directly into training or through architectural changes that better support recursive computation. Further borrowing from human problem-solving procedures, as CoT-style data already does, also remains a promising direction.

The research enriches the discourse on data-centric artificial intelligence, where precision in training data design directly informs model competence, and points to concrete ways to optimize models in both resource-constrained and resource-rich settings. Overall, the paper contributes valuable methodologies and insights to the AI community and may steer incremental advances in arithmetic problem-solving across transformer models of various scales.

Authors (5)
  1. Nayoung Lee
  2. Kartik Sreenivasan
  3. Jason D. Lee
  4. Kangwook Lee
  5. Dimitris Papailiopoulos