Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs (2403.19827v2)
Abstract: LLMs learn rare syntactic phenomena, but the extent to which this is attributable to generalization vs. memorization is a major open question. To that end, we iteratively trained transformer LLMs on systematically manipulated corpora which were human-scale in size, and then evaluated their learning of a rare grammatical phenomenon: the English Article+Adjective+Numeral+Noun (AANN) construction ("a beautiful five days"). We compared how well this construction was learned on the default corpus relative to a counterfactual corpus in which AANN sentences were removed. We found that AANNs were still learned better than systematically perturbed variants of the construction. Using additional counterfactual corpora, we suggest that this learning occurs through generalization from related constructions (e.g., "a few days"). An additional experiment showed that this learning is enhanced when there is more variability in the input. Taken together, our results provide an existence proof that LMs can learn rare grammatical phenomena by generalization from less rare phenomena. Data and code: https://github.com/kanishkamisra/aannalysis.
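The comparison described in the abstract amounts to scoring AANN sentences against systematically perturbed variants (e.g., with the adjective and numeral reordered) under a trained language model. The sketch below illustrates that kind of scoring comparison with Hugging Face transformers and off-the-shelf GPT-2; it is not the authors' evaluation pipeline (their code is in the linked repository), and the model name and example sentences are placeholders chosen for illustration.

```python
# Minimal sketch: compare LM log-probabilities for an AANN sentence
# against a perturbed variant. Illustrative only; not the paper's
# evaluation code (see https://github.com/kanishkamisra/aannalysis).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper trains its own human-scale LMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position t predicts token t+1, so align logits with shifted targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

aann = "The family spent a beautiful five days in Philadelphia."
perturbed = "The family spent a five beautiful days in Philadelphia."  # ADJ/NUM swapped

print(sentence_logprob(aann), sentence_logprob(perturbed))
# If the construction has been learned, the AANN order should score higher
# than the perturbed order.
```

A model trained on a counterfactual corpus with all AANN sentences removed can be probed in the same way; a persistent preference for the well-formed AANN order over its perturbations is the kind of evidence the paper takes as generalization rather than memorization.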