Transformers Can Do Arithmetic with the Right Embeddings (2405.17399v1)
Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100-digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks, including sorting and multiplication.
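The fix described in the abstract is simple to prototype: give every digit token an extra learned embedding indexed by its offset from the start of the number it belongs to, and add it to the usual token embedding. The sketch below illustrates this idea in PyTorch; the module name, the offset convention (index 0 for non-digit tokens), and the maximum-offset size are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of digit-position embeddings, assuming PyTorch.
# Module name, offset convention, and max_digits are illustrative assumptions.
import torch
import torch.nn as nn


class DigitPositionEmbedding(nn.Module):
    """Adds a learned embedding encoding each digit's offset from the start
    of the number it sits in; non-digit tokens receive offset 0."""

    def __init__(self, d_model: int, max_digits: int = 128):
        super().__init__()
        # Index 0 is reserved for non-digit tokens; 1..max_digits are digit offsets.
        self.offset_emb = nn.Embedding(max_digits + 1, d_model)

    def forward(self, token_emb: torch.Tensor, is_digit: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); is_digit: (batch, seq_len) bool mask.
        # Count consecutive digits, resetting whenever a non-digit token appears.
        offsets = torch.zeros_like(is_digit, dtype=torch.long)
        running = torch.zeros(is_digit.shape[0], dtype=torch.long, device=is_digit.device)
        for t in range(is_digit.shape[1]):
            running = torch.where(is_digit[:, t], running + 1, torch.zeros_like(running))
            offsets[:, t] = running
        # Clamp in case a number is longer than the embedding table supports.
        offsets = offsets.clamp(max=self.offset_emb.num_embeddings - 1)
        return token_emb + self.offset_emb(offsets)
```

In the paper, the starting offset is additionally randomized during training so that embeddings for positions beyond the 20-digit training range are also exercised; the sketch above omits that detail.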
Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein