Transformers Can Do Arithmetic with the Right Embeddings (2405.17399v1)

Published 27 May 2024 in cs.LG and cs.AI

Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.

Enhanced Arithmetic Capabilities in Transformers through Abacus Embeddings and Recurrence

The paper investigates why transformer models struggle with arithmetic tasks and proposes a remedy. The primary contributions are Abacus Embeddings, which give each digit an explicit representation of its position within its number, and recurrent layers that strengthen the transformer's multi-step reasoning.

Core Contributions and Methodologies

The authors identify that transformers struggle with arithmetic due to their difficulty in maintaining exact positional information of digits within sequences. To remedy this, they propose Abacus Embeddings, a novel positional embedding technique that encodes digit positions relative to the start of their respective numbers. This approach diverges from traditional positional embeddings by providing identical embeddings for digits of the same significance, hence preserving the positional hierarchy required for arithmetic operations.
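To make the scheme concrete, here is a minimal sketch of how Abacus-style position IDs could be computed for a character-tokenized arithmetic prompt. The function name, the treatment of non-digit tokens, and the optional random offset (which the paper applies during training so the model sees larger positions than the training numbers require) are illustrative assumptions rather than the authors' exact implementation.

```python
import random

DIGITS = set("0123456789")

def abacus_position_ids(tokens, max_offset=0):
    """Give each digit a 1-based position counted from the start of its number.

    With the paper's digit ordering, digits of the same significance share an
    ID, which is the alignment property Abacus Embeddings rely on. A random
    offset (training only) exposes the model to larger IDs, supporting length
    generalization. Non-digit tokens are assigned 0 here purely for illustration.
    """
    offset = random.randint(0, max_offset) if max_offset > 0 else 0
    ids, pos = [], 0
    for tok in tokens:
        if tok in DIGITS:
            pos += 1
            ids.append(pos + offset)
        else:
            pos = 0  # a non-digit token ends the current number
            ids.append(0)
    return ids

# Example with character-level tokenization:
print(abacus_position_ids(list("123+456=")))  # [1, 2, 3, 0, 1, 2, 3, 0]
```

These IDs would then index a learned embedding table whose vectors are added to the token embeddings, in the same way as standard learned positional embeddings.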

Key Insights and Numerical Results

  1. Abacus Embeddings:
    • These embeddings substantially boost transformer performance on arithmetic. Models trained with Abacus Embeddings generalize to addition problems up to 120 digits long, a 6x extrapolation factor relative to the training distribution and a notable improvement over the previous state of the art of 2.5x.
    • Models utilizing Abacus Embeddings reached up to 99% accuracy on 100-digit addition problems.
  2. Architectural Enhancements:
    • Input Injection: Skip connections that propagate the input features into each transformer layer reduce generalization error by 50% when combined with Abacus Embeddings.
    • Recurrent Layers: Looping transformer layers yields further gains on multi-step reasoning; the looped transformer combined with Abacus Embeddings shows near-perfect generalization on long arithmetic problems.
    • Together, these methods improve out-of-distribution accuracy from 92.9% to 99.1%, an 87% reduction in error relative to standard architectures (a sketch of the looped architecture follows this list).
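As referenced above, the following is a minimal PyTorch sketch (not the authors' code) of recurrence with input injection: a small block of transformer layers is reused for several iterations, and the original input embeddings are re-added before every pass. The layer counts, dimensions, encoder-style layers, and class name are assumptions for illustration; the paper uses a decoder-only (causal) model, which here corresponds to passing a causal attention mask.

```python
import torch
import torch.nn as nn

class LoopedTransformerWithInputInjection(nn.Module):
    """Reuse one small block of layers for several iterations ("recurrence")
    and re-add the input embeddings at every iteration ("input injection")."""

    def __init__(self, d_model=256, n_heads=8, layers_per_loop=2, n_loops=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # The same block (and hence the same weights) is applied n_loops times,
        # giving an effective depth of layers_per_loop * n_loops.
        self.block = nn.TransformerEncoder(layer, num_layers=layers_per_loop)
        self.n_loops = n_loops

    def forward(self, input_embeddings, attn_mask=None):
        state = input_embeddings
        for _ in range(self.n_loops):
            # Input injection: a skip connection from the embeddings into each pass.
            state = self.block(state + input_embeddings, mask=attn_mask)
        return state

# Usage: token plus Abacus position embeddings of shape (batch, seq_len, d_model).
x = torch.randn(2, 32, 256)
causal = nn.Transformer.generate_square_subsequent_mask(32)
out = LoopedTransformerWithInputInjection()(x, attn_mask=causal)
print(out.shape)  # torch.Size([2, 32, 256])
```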

Extended Implications for Algorithmic Reasoning

The success of Abacus Embeddings extends beyond addition to other algorithmic reasoning tasks like multiplication and sorting.

  1. Multiplication:
    • Transformers augmented with Abacus Embeddings achieved near-perfect accuracy when tested on multiplication problems involving operands of up to 15 digits.
    • The performance remains robust even as complexity increases, highlighting the embeddings’ capabilities in handling more intricate arithmetic tasks.
  2. Sorting:
    • The paper also studies sorting, where the model must handle arrays of variable-length numbers. Models equipped with Abacus Embeddings generalize across these diverse settings significantly better than models using other embeddings.
    • Different architectural setups (standard transformer, transformer with input injection, and looped transformer) were tested, showing varied results. Looped transformers excelled at accurately identifying the minimum element in the array during extrapolation tasks.

Future Prospects and Implications

This paper advances the understanding of transformer capabilities in performing arithmetic and algorithmic reasoning tasks. The findings open several avenues for future research:

  1. Integration with General-Purpose Models:
    • Investigating the combination of Abacus Embeddings with embeddings better suited to natural language, such as Rotary Position Embeddings (RoPE) and Functional Interpolation for Relative Position Embeddings (FIRE), shows substantial promise. Such a combination could yield an embedding strategy that maintains high performance on both arithmetic and broader NLP tasks (see the sketch after this list).
  2. Broader Range of Algorithmic Tasks:
    • Extending the current approach to a more diverse set of algorithmic reasoning challenges can help in developing more versatile models and enhance the ability of transformers to generalize in increasingly complex scenarios.
  3. Improved Positional Embedding Strategies:
    • Future research might explore further refinements in positional embeddings, especially those that facilitate better length generalization without significant computational overhead.
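As noted in the first item above, Abacus Embeddings act at the input, so they can in principle coexist with relative schemes such as RoPE or FIRE that act inside attention. The sketch below is a hypothetical illustration of that split, not the paper's implementation; the module name, table size, and the reservation of index 0 for non-digit tokens are assumptions.

```python
import torch
import torch.nn as nn

class AbacusEmbedding(nn.Module):
    """Learned embedding table indexed by Abacus position IDs (a digit's
    position within its own number; 0 reserved for non-digit tokens).
    The result is added to the token embeddings at the input, leaving the
    attention layers free to use a relative scheme such as RoPE or FIRE."""

    def __init__(self, max_digit_position=128, d_model=256):
        super().__init__()
        self.table = nn.Embedding(max_digit_position + 1, d_model)

    def forward(self, token_embeddings, abacus_ids):
        # token_embeddings: (batch, seq_len, d_model); abacus_ids: (batch, seq_len)
        return token_embeddings + self.table(abacus_ids)
```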

In conclusion, the paper presents a noteworthy advance in improving transformer models' performance on arithmetic tasks through the introduction of Abacus Embeddings and recurrent architectures. These techniques not only achieve significant performance gains but also demonstrate promising transferability to other complex algorithmic procedures, paving the way for more practical and theoretically robust applications in AI.

Authors (11)
  1. Sean McLeish
  2. Arpit Bansal
  3. Alex Stein
  4. Neel Jain
  5. John Kirchenbauer
  6. Brian R. Bartoldson
  7. Bhavya Kailkhura
  8. Abhinav Bhatele
  9. Jonas Geiping
  10. Avi Schwarzschild
  11. Tom Goldstein