- The paper demonstrates that transformer models partition the carrying over process across distinct components, akin to digital full adders.
- It uses PCA to show that the first layer separates sums below 10 from those of 10 or more, while deeper layers handle adding the carried one, in both encoder-only and decoder-only architectures.
- The study implies that optimizing factors like weight decay and preserving attention patterns can enhance arithmetic performance in larger language models.
Introduction
The paper under discussion examines how transformer models implement integer addition. The analysis is carried out on models trained specifically on three-digit addition, using both encoder-only and decoder-only architectures. The central question is how the carrying over algorithm is implemented, in a modular fashion, inside these models.
Algorithm Implementation in Encoder-Only Models
The paper establishes that in two-layer encoder-only models, the carrying over algorithm is partitioned across distinct architectural components. Each of the algorithm's four sequential stages (adding the digits at corresponding positions, deciding whether a carry is needed based on whether the sum reaches 10, determining where the carried one must go, and finally adding the carried one) is handled by a dedicated part of the network, as spelled out in the sketch below.
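To make the four stages concrete, here is a minimal sketch of the carrying over algorithm in plain Python, written independently of any model internals; the function name and the digit-list representation are illustrative choices, not the paper's code.

```python
def carrying_over_add(a, b):
    """Add two equal-length digit lists (most significant digit first)
    by following the four stages of the carrying over algorithm."""
    n = len(a)

    # Stage 1: add the digits at corresponding positions.
    sums = [a[i] + b[i] for i in range(n)]

    # Stage 2: decide, per position, whether a carry is generated (sum >= 10).
    generates_carry = [s >= 10 for s in sums]

    # Stages 3 and 4: the carried one belongs to the position immediately to
    # the left; add it there, letting carries cascade (e.g. 199 + 1).
    result = [0] * (n + 1)                 # extra slot for a leading one
    carry = 0
    for i in range(n - 1, -1, -1):
        total = sums[i] + carry            # add any carried one arriving from the right
        result[i + 1] = total % 10
        carry = 1 if total >= 10 else 0    # matches generates_carry[i] unless a cascade occurs
    result[0] = carry
    return result


assert carrying_over_add([1, 6, 7], [2, 5, 8]) == [0, 4, 2, 5]   # 167 + 258 = 425
assert carrying_over_add([1, 9, 9], [0, 0, 1]) == [0, 2, 0, 0]   # 199 +   1 = 200
```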
Further investigation shows that larger models (with three layers) and decoder-only transformer models exhibit similar behavior. There is compelling evidence that a specific subset of neurons within the final multilayer perceptron (MLP) is chiefly responsible for adding the carried one; an ablation-style check of this kind of claim is sketched below.
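As a rough illustration of how such a claim can be probed, the sketch below zero-ablates a chosen set of hidden neurons in an MLP block via a PyTorch forward hook and compares accuracy on carry versus non-carry examples. The model, the module path, `suspect_neurons`, and `carry_mask` are all assumptions standing in for the paper's actual setup.

```python
import torch

def ablate_neurons(act_module, neuron_idx):
    """Zero out selected hidden activations of an MLP block via a forward hook."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0
        return output                      # returned tensor replaces the module output
    return act_module.register_forward_hook(hook)

# Hypothetical usage (model, data, and the suspected "carry neurons" are assumed):
# handle = ablate_neurons(model.layers[-1].mlp.act, suspect_neurons)
# with torch.no_grad():
#     preds = model(inputs).argmax(dim=-1)
# handle.remove()
# acc_carry    = (preds[carry_mask] == labels[carry_mask]).float().mean()
# acc_no_carry = (preds[~carry_mask] == labels[~carry_mask]).float().mean()
# A drop in acc_carry but not in acc_no_carry supports the "carry neurons" reading.
```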
Analysis and Interpretation
By applying principal component analysis (PCA) to each layer's output, the paper shows that the first layer separates sums less than 10 from those of 10 or greater, while subsequent layers organize examples according to whether a carried one still needs to be added. This layer-by-layer dissection gives a detailed picture of how these models operate; a sketch of the analysis follows.
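A minimal sketch of this kind of analysis, assuming per-example activations have already been collected from one layer; the synthetic `hidden` array below is only a placeholder for those activations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder data: replace with real layer activations from the trained model
# and the corresponding "does this position's digit sum reach 10?" labels.
rng = np.random.default_rng(0)
sum_ge_10 = rng.random(500) < 0.5
hidden = rng.normal(size=(500, 128)) + 3.0 * sum_ge_10[:, None]

# Project the layer output onto its first two principal components.
proj = PCA(n_components=2).fit_transform(hidden)

plt.scatter(proj[sum_ge_10, 0], proj[sum_ge_10, 1], s=8, label="sum >= 10")
plt.scatter(proj[~sum_ge_10, 0], proj[~sum_ge_10, 1], s=8, label="sum < 10")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
plt.title("Layer output in its first two principal components")
plt.show()
# If the paper's picture holds for real activations, the two groups separate
# cleanly after the first layer.
```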
Consistency of accuracy across different model initializations and training hyperparameters is shown to depend on factors such as weight decay. Notably, the way the model operates mirrors the full and half adders of digital circuits, an analogy the paper draws on throughout.
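For readers unfamiliar with the analogy, the standard textbook half adder and full adder are shown below; this is ordinary digital logic, not anything extracted from the trained model, and it echoes the paper's decomposition into "add, detect the carry, add the carried one".

```python
def half_adder(a, b):
    """Return (sum_bit, carry_bit) for two input bits."""
    return a ^ b, a & b

def full_adder(a, b, carry_in):
    """Combine two bits and a carry-in into a sum bit and a carry-out,
    built from two half adders: add, detect the carry, add the carried one."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2

assert full_adder(1, 1, 0) == (0, 1)   # 1 + 1 = 10 in binary
assert full_adder(1, 1, 1) == (1, 1)   # 1 + 1 + 1 = 11 in binary
```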
LLMs and Integer Addition
Applications to LLMs present an interesting case study. Although such models are proficient at many tasks, arithmetic often remains a challenge. Interestingly, certain features of the smaller, interpretable models carry over to LLMs: the paper reports similar attention patterns and residual stream structure. These observations invite further work on improving the arithmetic capabilities of LLMs.
Conclusions
The paper concludes that small transformer models implement the carrying over algorithm in a fashion akin to digital full adders. Despite the complexity of these models, the modularity of the operations is evident, pointing to possible optimizations and, potentially, to future improvements in the arithmetic reasoning of LLMs. Across the architectures studied, the underlying implementation emerges as modular, systematic, and reminiscent of digital logic.
Further Research
These findings lay the groundwork for follow-up studies on preserving attention patterns and stabilizing them for out-of-distribution generalization. How transformers can effectively learn and apply algorithmic reasoning to mathematical problems remains a substantial open area, particularly in the context of few-shot learning and fine-tuning approaches.