Mission: Impossible Language Models (2401.06416v2)
Abstract: Chomsky and others have directly claimed that large language models (LLMs) are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data with unnatural word orders and grammar rules. These languages lie on an impossibility continuum: at one end are languages that are inherently impossible, such as random and irreversible shuffles of English words, and at the other, languages that may not be intuitively impossible but are often considered so in linguistics, particularly those with rules based on counting word positions. We report on a wide range of evaluations to assess the capacity of GPT-2 small models to learn these uncontroversially impossible languages, and crucially, we perform these assessments at various stages throughout training to compare the learning process for each language. Our core finding is that GPT-2 struggles to learn impossible languages compared with English as a control, challenging the core claim. More importantly, we hope our approach opens up a productive line of inquiry in which different LLM architectures are tested on a variety of impossible languages, in an effort to learn more about how LLMs can be used as tools for these cognitive and typological investigations.
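The two kinds of perturbation named in the abstract (irreversible shuffles of English words and grammar rules based on counting word positions) can be pictured with a short sketch. The Python snippet below is purely illustrative and is not the paper's actual implementation; the function names, the `<M>` marker token, and the fixed offset are assumptions chosen for this example.

```python
import random


def nondeterministic_shuffle(tokens):
    """Irreversible perturbation (assumed form): a fresh random permutation is
    drawn for every sentence, so the original English word order cannot be
    recovered from the perturbed corpus."""
    shuffled = list(tokens)
    random.shuffle(shuffled)
    return shuffled


def deterministic_shuffle(tokens, seed=0):
    """Seeded perturbation (assumed form): every sentence of a given length
    receives the same permutation pattern, so the reordering acts as a
    consistent, learnable 'rule' rather than pure noise."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def count_based_marker(tokens, marker="<M>", offset=4):
    """Toy position-counting rule (assumed form): insert a marker a fixed
    number of word positions into the sentence, ignoring its hierarchical
    structure entirely."""
    cut = min(offset, len(tokens))
    return list(tokens[:cut]) + [marker] + list(tokens[cut:])


if __name__ == "__main__":
    sentence = "the student who read the book passed the exam".split()
    print(nondeterministic_shuffle(sentence))
    print(deterministic_shuffle(sentence, seed=42))
    print(count_based_marker(sentence))
```

Under these assumptions, the nondeterministic shuffle sits at the clearly impossible end of the continuum because no consistent mapping back to English exists, while the seeded shuffle and the position-counting marker stand in for systematic but unnatural rules of the kind the abstract describes.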
- Word order does matter and shuffled language models know it. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6907–6919, Dublin, Ireland. Association for Computational Linguistics.
- Syntactic perturbations reveal representational correlates of hierarchical phrase structure in pretrained language models. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 263–276, Online. Association for Computational Linguistics.
- Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.
- Noam Chomsky. 1957. Syntactic Structures. De Gruyter Mouton, Berlin, Boston.
- Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2(2):137–167.
- Noam Chomsky. 1965. Aspects of the Theory of Syntax. The MIT Press.
- Noam Chomsky. 2002. On Nature and Language. Cambridge University Press.
- Noam Chomsky. 2023. Conversations with Tyler: Noam Chomsky. Conversations with Tyler Podcast.
- Noam Chomsky, Ian Roberts, and Jeffrey Watumull. 2023. Noam Chomsky: The false promise of ChatGPT. The New York Times.
- Bernard Comrie. 1989. Language universals and linguistic typology: Syntax and morphology. University of Chicago Press.
- Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- How can self-attention networks recognize Dyck-n languages? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4301–4306, Online. Association for Computational Linguistics.
- Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
- Nicholas Evans and Stephen C. Levinson. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32(5):429–448.
- Daniel L. Everett. 2012. What does Pirahã grammar have to teach us about human language and the mind? WIREs Cognitive Science, 3(6):555–563.
- Richard Futrell. 2019. Information-theoretic locality properties of natural language. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 2–15, Paris, France. Association for Computational Linguistics.
- Richard Futrell and Michael Hahn. 2022. Information theory as a bridge between language function and language form. Frontiers in Communication, 7.
- Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. Curran Associates, Inc.
- Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163–173, Online. Association for Computational Linguistics.
- Finding alignments between interpretable causal variables and distributed neural representations.
- Joseph Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. Universals of Language, pages 73–113.
- Michael Hahn. 2020. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, 8:156–171.
- Michael Hahn, Dan Jurafsky, and Richard Futrell. 2020. Universals of word order reflect optimization of grammars for efficient communication. Proceedings of the National Academy of Sciences, 117(5):2347–2353.
- Modeling profanity and hate speech in social media with semantic subspaces. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 6–16, Online. Association for Computational Linguistics.
- John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.
- Context-free transductions with neural stacks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 306–315, Brussels, Belgium. Association for Computational Linguistics.
- Marc D. Hauser, Noam Chomsky, and W. Tecumseh Fitch. 2002. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579.
- Jack Hessel and Alexandra Schofield. 2021. How effective is BERT without word ordering? Implications for language understanding and data privacy. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 204–211, Online. Association for Computational Linguistics.
- RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online. Association for Computational Linguistics.
- A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.
- Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2721–2731, Brussels, Belgium. Association for Computational Linguistics.
- Aravind K. Joshi. 1985. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? Studies in Natural Language Processing, pages 206–250. Cambridge University Press.
- Mistral - a journey towards reproducible language model training.
- Fred Karlsson. 2007. Constraints on multiple center-embedding of clauses. Journal of Linguistics, 43(2):365–392.
- The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
- Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
- John Mansfield and Charles Kemp. 2023. The emergence of grammatical structure from inter-predictability.
- Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
- William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for Computational Linguistics.
- A formal hierarchy of RNN architectures. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 443–459, Online. Association for Computational Linguistics.
- Jeff Mitchell and Jeffrey Bowers. 2020. Priorless recurrent networks learn curiously. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5147–5158, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Large languages, impossible languages and human brains. Cortex, 167:82–85.
- Pushdown layers: Encoding recursive structure in transformer language models.
- Broca’s area and the language instinct. Nature Neuroscience, 6(7):774–781.
- When classifying grammatical role, BERT doesn’t care about word order… except when it matters. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 636–643, Dublin, Ireland. Association for Computational Linguistics.
- Isabel Papadimitriou and Dan Jurafsky. 2023. Injecting structural hints: Using language models to study inductive biases in language learning.
- Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1145–1160, Online. Association for Computational Linguistics.
- Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 66–76, Hong Kong, China. Association for Computational Linguistics.
- Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35.
- Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Ms, OpenAI.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Ms, OpenAI.
- Stuart M. Shieber. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8(3):333–343.
- Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2888–2913, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Call for papers – the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus.
- On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics.
- What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.
- Using Computational Models to Test Syntactic Learnability. Linguistic Inquiry, pages 1–44.
- Causal proxy models for concept-based model explanations. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 37313–37334. PMLR.
- Interpretability at scale: Identifying causal mechanisms in Alpaca.
- Causal distillation for language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle, United States. Association for Computational Linguistics.