Transformers as Transducers (2404.02040v3)

Published 2 Apr 2024 in cs.FL and cs.LG

Abstract: We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of transductions. We do so using variants of RASP, a programming language designed to help people "think like transformers," as an intermediate representation. We extend the existing Boolean variant B-RASP to sequence-to-sequence functions and show that it computes exactly the first-order rational functions (such as string rotation). Then, we introduce two new extensions. B-RASP[pos] enables calculations on positions (such as copying the first half of a string) and contains all first-order regular functions. S-RASP adds prefix sum, which enables additional arithmetic operations (such as squaring a string) and contains all first-order polyregular functions. Finally, we show that masked average-hard attention transformers can simulate S-RASP.
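
The following is a minimal sketch, not taken from the paper, of the three example transductions the abstract names (string rotation, copying the first half, and "squaring" a string). It is written as ordinary Python purely to make the intended input-output behaviour concrete; the exact form of rotation (first symbol moved to the end) is an assumption, and the RASP-style programs themselves are not reproduced here.

```python
# Illustrative sketch only: plain-Python versions of the example
# transductions mentioned in the abstract, not the paper's RASP code.

def rotate(w: str) -> str:
    """String rotation (an example of a first-order rational function).
    Here assumed to move the first symbol to the end: 'abc' -> 'bca'."""
    return w[1:] + w[:1] if w else w

def first_half(w: str) -> str:
    """Copy the first half of the input, e.g. 'abcdef' -> 'abc'.
    The abstract gives this as an example enabled by B-RASP[pos],
    which can compute with positions."""
    return w[: len(w) // 2]

def square(w: str) -> str:
    """'Square' a string: output w repeated |w| times, e.g. 'ab' -> 'abab'.
    A polyregular function with quadratic growth; the abstract cites it as
    an example enabled by S-RASP's prefix-sum operation."""
    return w * len(w)

if __name__ == "__main__":
    print(rotate("abc"))        # bca
    print(first_half("abcdef")) # abc
    print(square("ab"))         # abab
```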
