Length Generalization in Arithmetic Transformers (2306.15400v1)

Published 27 Jun 2023 in cs.LG

Abstract: We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training. We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums. However, this method fails for multiplication, and we propose train set priming: adding a few ($10$ to $50$) long sequences to the training set. We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35\times 3$ examples. We also show that models can be primed for different generalization lengths, and that the priming sample size scales as the logarithm of the training set size. Finally, we discuss potential applications of priming beyond arithmetic.
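
As an illustration of the train set priming recipe described in the abstract, the sketch below shows how a primed training set might be assembled: a large pool of short (5-digit x 3-digit) multiplications plus a handful (10 to 50) of long (35-digit x 3-digit) examples. The function names, dataset sizes, and plain-text encoding are assumptions made for illustration; the paper's exact tokenization and data pipeline may differ.

```python
import random

def multiplication_example(n_digits_a: int, n_digits_b: int) -> str:
    """Sample one a*b=c multiplication, serialized as a plain-text sequence."""
    a = random.randint(10 ** (n_digits_a - 1), 10 ** n_digits_a - 1)
    b = random.randint(10 ** (n_digits_b - 1), 10 ** n_digits_b - 1)
    return f"{a}*{b}={a * b}"

def primed_training_set(train_size: int = 100_000,
                        n_primes: int = 50,
                        short_digits: int = 5,
                        long_digits: int = 35,
                        b_digits: int = 3) -> list[str]:
    """Mostly short (5x3-digit) products, plus a few long (35x3-digit)
    'priming' examples mixed into the same training set (hypothetical sizes)."""
    data = [multiplication_example(short_digits, b_digits) for _ in range(train_size)]
    data += [multiplication_example(long_digits, b_digits) for _ in range(n_primes)]
    random.shuffle(data)
    return data

if __name__ == "__main__":
    train = primed_training_set()
    print(train[:3])
```

Per the abstract, the number of priming examples needed grows only logarithmically with the training set size, so `n_primes` stays small even for large `train_size`.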

Authors (6)
  1. Samy Jelassi (26 papers)
  2. Stéphane d'Ascoli (24 papers)
  3. Carles Domingo-Enrich (25 papers)
  4. Yuhuai Wu (49 papers)
  5. Yuanzhi Li (119 papers)
  6. François Charton (90 papers)
Citations (33)