Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages (2310.13897v4)
Abstract: The expressive power of transformers over inputs of unbounded size can be studied through their ability to recognize classes of formal languages. In this paper, we establish exact characterizations of transformers with hard attention (in which all attention is focused on exactly one position) and attention masking (in which each position only attends to positions on one side). With strict masking (each position cannot attend to itself) and without position embeddings, these transformers are expressively equivalent to linear temporal logic (LTL), which defines exactly the star-free languages. A key technique is the use of Boolean RASP as a convenient intermediate language between transformers and LTL. We then take numerous results known for LTL and apply them to transformers, showing how position embeddings, strict masking, and depth all increase expressive power.
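As a concrete illustration of the attention variant the abstract describes, below is a minimal NumPy sketch of one head of unique-hard attention with strict (leftward) masking: every position attends to exactly one position strictly to its left, chosen by argmax score. The function name, the toy dot-product scoring, and the leftmost tie-breaking are illustrative assumptions for this sketch, not the paper's construction.

```python
import numpy as np

def strict_masked_hard_attention(queries, keys, values):
    """Hypothetical sketch of one masked unique-hard attention head.

    For each position i > 0, return values[j*] where
    j* = argmax_{j < i} queries[i] . keys[j]  (ties broken leftmost).
    Strict masking means position i never attends to itself or to the right;
    position 0 has no admissible target and keeps a zero (default) value.
    """
    n, _ = queries.shape
    out = np.zeros_like(values)
    scores = queries @ keys.T                    # (n, n) attention scores
    for i in range(1, n):
        j_star = int(np.argmax(scores[i, :i]))   # exactly one position attended
        out[i] = values[j_star]
    return out

# Toy usage: 5 positions, 4-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
print(strict_masked_hard_attention(q, k, v))
```

Replacing the strict slice `scores[i, :i]` with the non-strict `scores[i, :i + 1]` would let each position attend to itself as well, which is the distinction between strict and non-strict masking discussed in the paper.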