Mission: Impossible Language Models (2401.06416v2)

Published 12 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Chomsky and others have very directly claimed that LLMs are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data with unnatural word orders and grammar rules. These languages lie on an impossibility continuum: at one end are languages that are inherently impossible, such as random and irreversible shuffles of English words, and on the other, languages that may not be intuitively impossible but are often considered so in linguistics, particularly those with rules based on counting word positions. We report on a wide range of evaluations to assess the capacity of GPT-2 small models to learn these uncontroversially impossible languages, and crucially, we perform these assessments at various stages throughout training to compare the learning process for each language. Our core finding is that GPT-2 struggles to learn impossible languages when compared to English as a control, challenging the core claim. More importantly, we hope our approach opens up a productive line of inquiry in which different LLM architectures are tested on a variety of impossible languages in an effort to learn more about how LLMs can be used as tools for these cognitive and typological investigations.

Summary

  • The paper shows that GPT-2 struggles to learn synthetically generated impossible languages compared to natural English, challenging previous claims.
  • The paper employs perplexity evaluation, surprisal analysis, and causal abstraction analysis to show that the statistical patterns of natural English align better with GPT-2’s representations than those of the impossible languages.
  • The paper finds that GPT-2 learns token-counting verb marker placement (TokenHop) more readily than word-counting placement (WordHop), pointing to information locality as an inductive bias.

This paper, "Mission: Impossible LLMs," investigates the claim that LLMs are equally capable of learning both possible and impossible human languages. The authors challenge this assertion by training GPT-2 small models on a set of synthetically generated "impossible" languages and comparing their performance to that of models trained on English. The core finding is that GPT-2 struggles to learn these impossible languages compared to English, thus questioning the initial claim.

The paper defines a spectrum of impossible languages based on their complexity, ranging from entirely random word orderings to more subtly altered languages with unnatural grammar rules, specifically those dependent on counting word positions. The authors generate these languages by systematically perturbing the BabyLM dataset, an English-language dataset designed to simulate the linguistic input available to a child.

The impossible languages are categorized into three classes (a sketch of representative perturbation functions follows the list):

  1. *Shuffle Languages: These involve different ways of shuffling the tokens of English sentences, including:
    • NoShuffle (English, as a control)
    • NondeterministicShuffle (random shuffling)
    • DeterministicShuffle (shuffling based on sentence length and a random seed)
    • LocalShuffle (shuffling within a fixed-size window)
    • EvenOddShuffle (even-indexed tokens followed by odd-indexed tokens)
  2. *Reverse Languages: These involve reversing all or part of sentences:
    • NoReverse (English with an inserted reversal marker, as a control)
    • PartialReverse (reversal marker followed by reversed tokens)
    • FullReverse (entire sentence reversed after a reversal marker)
  3. *Hop Languages: These languages manipulate verb inflection by placing number/tense markers at different positions relative to the verb:
    • NoHop (English-like with verb markers immediately after the verb, as a control)
    • TokenHop (verb marker placed 4 tokens after the verb)
    • WordHop (verb marker placed 4 words after the verb, skipping punctuation)
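
The paper's released data pipeline is not reproduced here; the following minimal Python sketch only illustrates what such perturbation functions might look like, operating on sentences that are already tokenized into lists of strings. The function names, the reversal marker string, the default window size, and the clipping behavior are illustrative assumptions rather than the authors' exact implementation.

```python
import random

def nondeterministic_shuffle(tokens):
    # Irreversible: a fresh random permutation is drawn for every sentence.
    shuffled = list(tokens)
    random.shuffle(shuffled)
    return shuffled

def deterministic_shuffle(tokens, seed=0):
    # The permutation is a deterministic function of sentence length and a fixed seed,
    # so sentences of the same length are shuffled the same way.
    rng = random.Random(seed + len(tokens))
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

def local_shuffle(tokens, window=3):
    # Shuffle tokens only within consecutive fixed-size windows.
    out = []
    for i in range(0, len(tokens), window):
        chunk = list(tokens[i:i + window])
        random.shuffle(chunk)
        out.extend(chunk)
    return out

def even_odd_shuffle(tokens):
    # Even-indexed tokens first, then odd-indexed tokens.
    return list(tokens[0::2]) + list(tokens[1::2])

def partial_reverse(tokens, marker="<REV>", position=None):
    # Insert a reversal marker and reverse everything after it
    # (the marker position here defaults to the midpoint for illustration).
    position = len(tokens) // 2 if position is None else position
    return list(tokens[:position]) + [marker] + list(tokens[position:])[::-1]

def token_hop(tokens, verb_index, marker, hop=4):
    # Place the verb's number/tense marker `hop` tokens after the verb
    # (clipped to the sentence end in this simplified version).
    out = list(tokens)
    out.insert(min(verb_index + 1 + hop, len(out)), marker)
    return out
```

For example, even_odd_shuffle(["the", "cat", "sat", "down"]) yields ["the", "sat", "cat", "down"], illustrating the kind of position-counting rule the paper treats as impossible.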

The authors conduct three main experiments:

  1. Perplexity Evaluation: GPT-2 models are trained on each language, and their perplexities on held-out test sets are measured throughout training (see the evaluation sketch after this list). Models trained on the possible (control) languages reach lower perplexities more quickly, indicating greater learning efficiency. The NondeterministicShuffle language is the hardest to learn, while the *Hop languages are nearly as easy to learn as their control.
  2. Surprisal Analysis: This experiment focuses on the *Hop languages and measures the surprisal of the singular/plural verb marker tokens. The NoHop model exhibits the largest surprisal difference between expected and unexpected marker positions, suggesting that it has learned the natural grammatical pattern better than the models trained on the impossible languages. The TokenHop model performs better than the WordHop model, indicating that GPT-2 learns the verb-marking rule more easily when the counting units are tokens rather than words.
  3. Causal Abstraction Analysis: This experiment uses interchange interventions to identify representations within the *Hop models that causally mediate subject-verb agreement (a sketch of such an intervention follows below). The analysis reveals that all three *Hop models develop similar modular solutions, tracking agreement through representations at the relevant positions, but the NoHop model reaches higher intervention accuracy earlier in training.
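
On the evaluation side, the paper trains GPT-2 small from scratch on each perturbed corpus and tracks perplexity over the course of training. The snippet below is only a sketch of how per-token surprisal and perplexity can be read off a causal language model's logits, using a pretrained HuggingFace GPT-2 checkpoint as a stand-in for the paper's from-scratch models.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(text):
    # Surprisal (in bits) of each token given its left context.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nll_nats = -log_probs[torch.arange(targets.size(0)), targets]
    return nll_nats / torch.log(torch.tensor(2.0))  # nats -> bits

def perplexity(text):
    # Perplexity is 2 ** (mean per-token surprisal in bits).
    return float(2 ** token_surprisals(text).mean())
```

In the surprisal analysis, the quantity of interest is the difference between such per-token surprisals at expected versus unexpected verb-marker positions.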
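
For the causal abstraction analysis, the basic operation is an interchange intervention. The paper uses its own analysis tooling; the sketch below is a hypothetical illustration using a PyTorch forward hook: copy the hidden state at one layer and token position from a "source" run into a "base" run and inspect how the next-token distribution changes. The layer and position defaults are placeholders, and the position must index a token present in both sentences.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def interchange(base_text, source_text, layer=6, position=3):
    base = tokenizer(base_text, return_tensors="pt").input_ids
    source = tokenizer(source_text, return_tensors="pt").input_ids

    # Cache the source run's hidden state after transformer block `layer`
    # at the chosen token position (hidden_states[0] is the embedding output).
    with torch.no_grad():
        src_hidden = model(source, output_hidden_states=True).hidden_states[layer + 1]
    patch = src_hidden[:, position, :]

    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; output[0] is the hidden-state tensor.
        hidden = output[0]
        hidden[:, position, :] = patch
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(base).logits
    finally:
        handle.remove()

    # Next-token distribution at the last position of the base sentence.
    return torch.softmax(logits[0, -1], dim=-1)
```

Intervention accuracy can then be estimated over many base/source pairs by checking whether the patched model's prediction at the marker position follows the number of the source sentence's subject.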

The paper concludes that GPT-2 models struggle to learn impossible languages compared to natural ones, contradicting claims that LLMs cannot distinguish between possible and impossible languages. The authors suggest that information locality, the tendency for statistical correlations to be short-range, might be an inductive bias in GPT models that matches natural language and explains these results. They propose further exploration of the boundaries between possible and impossible languages, treating LLMs as a comparative system for understanding human language.

The appendix provides details on the dataset filtering, model hyperparameters, additional results for models trained without positional encodings, constituency probing experiments, and detailed results for DeterministicShuffle experiments.
