Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages (2309.00857v2)

Published 2 Sep 2023 in cs.CL

Abstract: Although Transformers perform well on NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to consider their implications for modeling natural language, which is hypothesized to be mildly context-sensitive. We test the Transformer's ability to learn mildly context-sensitive languages of varying complexity, and find that it generalizes well to unseen in-distribution data but extrapolates to longer strings less well than LSTMs do. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.
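For concreteness, the sketch below generates strings from two languages commonly used as benchmarks at or beyond context-free power: the counting language a^n b^n c^n and the copy language {ww}, which models cross-serial dependencies. This is an illustrative generator only, not the authors' data pipeline, and the paper's exact language suite may differ; function names and the train/extrapolation split are assumptions for the example.

```python
import random


def anbncn(n: int) -> str:
    """String from the counting language {a^n b^n c^n : n >= 1},
    a classic language that is not context-free."""
    return "a" * n + "b" * n + "c" * n


def copy_language(n: int, alphabet: str = "ab") -> str:
    """String from the copy language {ww}, often used to model
    cross-serial dependencies."""
    w = "".join(random.choice(alphabet) for _ in range(n))
    return w + w


def sample_dataset(num_samples: int, max_n: int, gen) -> list[str]:
    """Draw positive examples with parameter up to max_n; held-out
    longer values of n can then probe length extrapolation."""
    return [gen(random.randint(1, max_n)) for _ in range(num_samples)]


if __name__ == "__main__":
    random.seed(0)
    train = sample_dataset(5, max_n=10, gen=anbncn)   # in-distribution lengths
    extrap = [anbncn(n) for n in range(11, 14)]       # longer, held-out lengths
    print(train[0], extrap[0], copy_language(4), sep="\n")
```

A sequence model trained on such strings (e.g., via next-symbol prediction) can be evaluated separately on in-distribution lengths and on longer held-out lengths, mirroring the contrast between in-distribution generalization and length extrapolation described in the abstract.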

Authors (2)
  1. Shunjie Wang (1 paper)
  2. Shane Steinert-Threlkeld (20 papers)
Citations (4)
