Teaching Arithmetic to Small Transformers
The paper "Teaching Arithmetic to Small Transformers" presents a thorough exploration of how small transformer models can learn arithmetic operations effectively using varying data formatting and sampling strategies. Given the context of LLMs, such as GPT-4, exhibiting emergent arithmetic abilities, the paper investigates whether similar capabilities can be achieved in models with fewer parameters. This discourse intends to synthesize the findings and implications of the research, presenting them clearly for computational linguists and AI researchers.
The core inquiry of the paper is whether small transformer architectures, trained from random initialization, can efficiently learn arithmetic tasks such as addition, subtraction, and multiplication, as well as elementary functions like sine and square root. The research is premised on the hypothesis that careful formatting of the training data yields notable gains in accuracy and sample efficiency, two crucial considerations for transformer-based models.
Key Findings
- Data Formatting Impact: The paper underscores that the conventional way of writing arithmetic is not the best training format. Reversing the order of the output digits in operations like addition (the "reverse" format), so that the least significant digit is produced first, markedly improves accuracy and sample efficiency compared to the plain format. The authors justify this both through phase transitions characteristic of low-rank matrix completion (LRMC) and through attention patterns consistent with the carry-based, human-like procedure for addition; a formatting sketch follows this list.
- Chain-of-Thought (CoT) Data: Experiments further show that CoT-style training data, which spells out each intermediate step of a computation, improves accuracy and sample efficiency beyond what the reversed format alone achieves. Notably, this holds even without any language pretraining, suggesting that CoT data works by decomposing a compositional operation into simpler sub-tasks the model can learn individually; a scratchpad sketch follows this list.
- Role of Pretraining: Fine-tuning pretrained models such as GPT-2 and GPT-3 on arithmetic shows that, despite their large parameter counts, these models initially perform poorly on straightforward arithmetic tests. They reach reasonable competence with relatively few additional training samples, and CoT-formatted data again yields the strongest results; a minimal fine-tuning sketch follows this list.
- Generalization Challenges: The authors highlight a notable limitation in length generalization: models falter on operand lengths beyond those seen during training. Neither fine-tuning nor CoT data closes this gap, indicating that the models lean toward memorizing a mapping over the training distribution rather than acquiring a general algorithm; a length-generalization probe follows this list.
- Training on Text and Arithmetic Mixtures: To simulate the training conditions of LLMs, experiments mixing arithmetic with text data (e.g., Shakespeare's works) examine how the two interact. Models learn both jointly, but the proportion of arithmetic in the mix matters, and consistent formatting of the arithmetic examples becomes crucial when they make up only a small fraction of a text-heavy corpus; a data-mixing sketch follows this list.
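
To make the formatting comparison concrete, here is a minimal sketch of how plain and reversed addition samples might be generated. The delimiters and templates are illustrative assumptions, not necessarily the exact ones used in the paper.

```python
import random

def plain_sample(a: int, b: int) -> str:
    # Plain format: the sum is written most significant digit first, as usual.
    return f"{a}+{b}={a + b}"

def reverse_sample(a: int, b: int) -> str:
    # Reverse format: the sum's digits are emitted least significant digit first,
    # the same order in which the carries are actually computed.
    return f"{a}+{b}={str(a + b)[::-1]}"

def make_dataset(n_samples: int, max_digits: int = 3, reverse: bool = True) -> list:
    hi = 10 ** max_digits - 1
    fmt = reverse_sample if reverse else plain_sample
    return [fmt(random.randint(0, hi), random.randint(0, hi)) for _ in range(n_samples)]

if __name__ == "__main__":
    random.seed(0)
    print(plain_sample(123, 456))    # -> 123+456=579
    print(reverse_sample(123, 456))  # -> 123+456=975
    print(len(make_dataset(10_000)))
```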
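
The scratchpad sketch below expands one addition into explicit digit-wise steps with carries, illustrating the kind of decomposition CoT-style data provides. The wording of each step is a simplified stand-in for the paper's exact template.

```python
def scratchpad_addition(a: int, b: int) -> str:
    # Expand a+b into explicit digit-wise steps with carries, least significant digit first.
    da, db = str(a)[::-1], str(b)[::-1]
    lines = [f"Input: {a}+{b}"]
    carry, out_digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        digit, new_carry = total % 10, total // 10
        lines.append(f"Step {i}: {x}+{y}+carry {carry} = {total}, write {digit}, carry {new_carry}")
        out_digits.append(str(digit))
        carry = new_carry
    if carry:
        out_digits.append(str(carry))
    lines.append(f"Answer: {''.join(reversed(out_digits))}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(scratchpad_addition(128, 367))  # three steps ending in "Answer: 495"
```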
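
For the pretrained-model experiments, a minimal fine-tuning sketch using Hugging Face's GPT-2 is shown below. The hyperparameters, batching, and the `samples` list of formatted arithmetic strings are assumptions for illustration, not the paper's exact recipe.

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def finetune_on_arithmetic(samples, epochs=1, lr=5e-5, batch_size=16, device="cpu"):
    # `samples` is assumed to be a list of formatted strings such as "123+456=975".
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token           # GPT-2 ships without a pad token
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    loader = DataLoader(samples, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            enc = tokenizer(list(batch), return_tensors="pt", padding=True).to(device)
            labels = enc["input_ids"].clone()
            labels[enc["attention_mask"] == 0] = -100   # do not compute loss on padding
            loss = model(**enc, labels=labels).loss     # standard causal-LM objective
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```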
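
A simple length-generalization probe measures exact-match accuracy separately for each operand length, including lengths beyond the training range. `generate_answer` here is a hypothetical callable standing in for model decoding.

```python
import random

def eval_by_digit_length(generate_answer, digit_lengths=(1, 2, 3, 4, 5), n_per_length=200):
    # Accuracy per operand length; lengths past the training range expose the generalization gap.
    accuracy = {}
    for d in digit_lengths:
        lo, hi = (10 ** (d - 1) if d > 1 else 0), 10 ** d - 1
        correct = 0
        for _ in range(n_per_length):
            a, b = random.randint(lo, hi), random.randint(lo, hi)
            pred = generate_answer(f"{a}+{b}=")          # model completes the prompt
            correct += (pred.strip() == str(a + b))
        accuracy[d] = correct / n_per_length
    return accuracy

if __name__ == "__main__":
    # Oracle baseline just to exercise the harness; a real model would replace this.
    oracle = lambda prompt: str(eval(prompt.rstrip("=")))
    print(eval_by_digit_length(oracle))
```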
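
Finally, a sketch of mixing arithmetic samples into a character-level text corpus such as Shakespeare. The mixing ratio and the newline delimiter are illustrative assumptions.

```python
import random

def mix_corpus(text_chunks, arithmetic_samples, arithmetic_fraction=0.2, seed=0):
    # Interleave arithmetic samples into text chunks so that roughly
    # `arithmetic_fraction` of the resulting lines are arithmetic.
    rng = random.Random(seed)
    n_arith = int(len(text_chunks) * arithmetic_fraction / (1.0 - arithmetic_fraction))
    chosen = rng.sample(arithmetic_samples, min(n_arith, len(arithmetic_samples)))
    mixed = list(text_chunks) + chosen
    rng.shuffle(mixed)
    return "\n".join(mixed)

if __name__ == "__main__":
    text = ["To be, or not to be, that is the question:"] * 8
    arithmetic = [f"{a}+{b}={str(a + b)[::-1]}" for a in range(3) for b in range(3)]
    print(mix_corpus(text, arithmetic))
```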
Implications and Future Directions
The implications of this research are significant both for scaling LLM capabilities and for optimizing lower-parameter transformer models on arithmetic tasks. The emphasis on data quality, format, and sample efficiency demonstrated throughout provides actionable guidance on pretraining and fine-tuning strategies that could bring comparable performance gains to smaller models.
Future research could pivot toward improving length generalization, possibly by embedding algorithmic logic directly into initial training or through architectural adaptations that improve recursive problem solving. Continued cross-pollination with human cognitive processes, in the spirit of CoT reasoning, also remains fertile ground for exploration.
The research enriches the discourse on data-centric artificial intelligence, where precision in training data design directly informs model competence, and it presents distinct pathways for optimizing models in both resource-constrained and resource-rich settings. Overall, the paper contributes valuable methodologies and insights to the AI community, potentially steering incremental advances in arithmetic problem solving across transformer models of various scales.