Emergent Mind


Language models learn rare syntactic phenomena, but it has been argued that they rely on rote memorization, as opposed to grammatical generalization. Starting from a human-scale corpus (100M words), we iteratively trained transformer language models on systematically manipulated versions of that corpus and then evaluated their learning of a particular rare grammatical phenomenon: the English Article+Adjective+Numeral+Noun (AANN) construction ("a beautiful five days"). We first compared how well this construction was learned on the default corpus relative to a counterfactual corpus in which the AANN sentences were removed. Even without those sentences, AANNs were still learned better than systematically perturbed variants of the construction. Using additional counterfactual corpora, we suggest that this learning occurs through generalization from related constructions (e.g., "a few days"). An additional experiment showed that this learning is enhanced when there is more variability in the input. Taken together, our results provide an existence proof that models learn rare grammatical phenomena by generalization from less rare phenomena. Code available at https://github.com/kanishkamisra/aannalysis
Comparison of language models on tests, showing effects of different manipulations on accuracy against chance.


  • The study investigates transformer-based language models' ability to learn the rare AANN (Article+Adjective+Numeral+Noun) construction using a 100 million word corpus.

  • Models could generalize the AANN construction from related, more common constructions, despite reduced performance when AANN instances were removed from training data.

  • Variability in training data and exposure to a broad range of AANN instances enhanced the models’ learning and generalization capabilities.

  • Findings suggest language models’ learning of grammatical constructions is based on statistical learning rather than memorization, with implications for both machine learning and linguistic theory.

Insights from Systematic Manipulation of Training Data

Recent developments in the field of computational linguistics have highlighted the capabilities of language models to learn and generalize from linguistic input. This blog post discusses a study that investigates the ability of transformer-based language models to learn a specific rare grammatical phenomenon, the English Article+Adjective+Numeral+Noun (AANN) construction, through systematic manipulation of the training data.

The Study at a Glance

The core of the study involves training language models on a corpus that approximates a human-scale linguistic input (100 million words), with and without exposure to instances of the AANN construction. The training was followed by evaluating the models' performance on AANN as well as on purposefully perturbed variants of the construction, to assess the generality of the learning. The findings lend credence to the hypothesis that models can abstract grammatical principles from related, more common constructions, thereby demonstrating an ability to generalize beyond direct experience.
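The comparison between the AANN and its perturbed variants can be sketched in a few lines of Python. This is a minimal illustration rather than the study's own code: the four perturbation types, the `perturb` and `accuracy` helpers, and the toy scorer are all assumptions made for the example. In practice, `score` would wrap a language model's score (for instance, a log-probability) for the given string.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical representation: (article, adjective, numeral, noun)
AANN = Tuple[str, str, str, str]


def perturb(item: AANN) -> Dict[str, str]:
    """Build illustrative ill-formed variants of one AANN instance."""
    article, adjective, numeral, noun = item
    return {
        "order_swap": f"{article} {numeral} {adjective} {noun}",
        "no_article": f"{adjective} {numeral} {noun}",
        "no_adjective": f"{article} {numeral} {noun}",
        "no_numeral": f"{article} {adjective} {noun}",
    }


def accuracy(items: Sequence[AANN], score: Callable[[str], float]) -> float:
    """Fraction of items whose well-formed AANN outscores every variant."""
    if not items:
        raise ValueError("need at least one AANN instance")
    correct = 0
    for item in items:
        good = score(" ".join(item))
        if all(good > score(bad) for bad in perturb(item).values()):
            correct += 1
    return correct / len(items)


def toy_score(text: str) -> float:
    """Stand-in scorer (an assumption): rewards one memorized string only."""
    return 1.0 if text == "a beautiful five days" else 0.0
```

Requiring the well-formed string to beat every variant is one possible scoring rule; a per-variant comparison would be an equally reasonable design and would yield separate accuracies for each perturbation type.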

Key Findings

  1. Generalization from Less Rare Phenomena: The study found that models were able to learn the AANN construction even when explicit instances were removed from the training data, albeit with reduced performance. This suggests that learning leveraged generalization from related constructions encountered in training.

  2. Influence of Related Constructions: Further manipulations of the training data, which removed related constructions (e.g., “a few days”), resulted in a diminished ability to learn AANN, reinforcing the idea that models abstract grammatical rules from them.

  3. Variability Enhances Learning: When models were exposed to a variety of AANN instances in training, showcasing a broad range of adjectives, numerals, and nouns, they were more successful at generalizing the construction compared to models trained on more limited samples. This underscores the role of variability in learning linguistic constructions.

  4. Statistical Learning vs. Memorization: Results indicate that the models' learning of the AANN construction is rooted in statistical learning mechanisms rather than rote memorization. This runs counter to the common criticism that language models are merely "stochastic parrots."
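The corpus manipulations behind findings 1 to 3 can be illustrated with a small sketch. Everything here is an assumption made for the example: the tiny hand-written word lists stand in for a real part-of-speech tagger, and the names `AANN_PATTERN`, `FEW_PATTERN`, `ablate`, and `slot_variability` are hypothetical. The idea is simply to drop every sentence that matches a target construction, and to count how many distinct fillers appear in each open slot of the AANN.

```python
import re
from typing import Dict, List, Pattern, Set

# Hypothetical mini-lexicons; a real pipeline would use a POS tagger instead.
ADJECTIVES = ["beautiful", "pleasant", "amazing"]
NUMERALS = ["two", "three", "four", "five"]

# Article + Adjective + Numeral + Noun, e.g. "a beautiful five days".
AANN_PATTERN = re.compile(
    r"\b(?:a|an)\s+(?P<adjective>{adj})\s+(?P<numeral>{num})\s+(?P<noun>\w+)".format(
        adj="|".join(ADJECTIVES), num="|".join(NUMERALS)
    ),
    flags=re.IGNORECASE,
)

# A related, more frequent construction, e.g. "a few days".
FEW_PATTERN = re.compile(r"\ba\s+few\s+\w+", flags=re.IGNORECASE)


def ablate(sentences: List[str], pattern: Pattern[str]) -> List[str]:
    """Return a counterfactual corpus without sentences matching `pattern`."""
    return [s for s in sentences if not pattern.search(s)]


def slot_variability(sentences: List[str]) -> Dict[str, int]:
    """Count distinct adjectives, numerals, and nouns across AANN matches."""
    fillers: Dict[str, Set[str]] = {"adjective": set(), "numeral": set(), "noun": set()}
    for sentence in sentences:
        for match in AANN_PATTERN.finditer(sentence):
            for slot in fillers:
                fillers[slot].add(match.group(slot).lower())
    return {slot: len(values) for slot, values in fillers.items()}
```

Applying `ablate` with `AANN_PATTERN` gives a corpus with no direct AANN evidence, while applying it with `FEW_PATTERN` removes one related construction instead. The slot counts from `slot_variability` offer one rough way to compare a low-variability training set against a high-variability one.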

Implications and Future Directions

The study presents compelling evidence that language models, even when trained on data of a scale comparable to that encountered by human learners, can generalize and learn rare grammatical constructions. This has significant implications:

  • Machine Learning: The findings highlight the potential of current statistical learning mechanisms in language models to capture complex linguistic phenomena, suggesting avenues for further refining these models’ grammatical generalization capabilities.

  • Linguistic Theory: The ability to learn a rare construction by generalizing from more common, related ones bolsters theories that posit the human linguistic capability stems from generalization over input, rather than innate grammatical knowledge.

  • Teaching Machines Language: From a practical standpoint, understanding the conditions under which language models can generalize rare constructions could inform strategies for training more efficient, linguistically nuanced models.

In future work, extending this approach to a broader set of rare constructions could provide deeper insights into the learning capacities of language models and the linguistic principles that underpin language acquisition. Moreover, studies that bridge the gap between grammatical form learning and semantic understanding could offer a more holistic view of language comprehension in machine learning models.

In conclusion, this research contributes to the ongoing exploration of how language models learn and generalize, demonstrating that with sufficiently varied exposure to related constructions, even rare syntactic phenomena are within the grasp of current language technologies.


