
Black Big Boxes: Do Language Models Hide a Theory of Adjective Order? (2407.02136v1)

Published 2 Jul 2024 in cs.CL

Abstract: In English and other languages, multiple adjectives in a complex noun phrase show intricate ordering patterns that have been a target of much linguistic theory. These patterns offer an opportunity to assess the ability of LMs to learn subtle rules of language involving factors that cross the traditional divisions of syntax, semantics, and pragmatics. We review existing hypotheses designed to explain Adjective Order Preferences (AOPs) in humans and develop a setup to study AOPs in LMs: we present a reusable corpus of adjective pairs and define AOP measures for LMs. With these tools, we study a series of LMs across intermediate checkpoints during training. We find that all models' predictions are much closer to human AOPs than predictions generated by factors identified in theoretical linguistics. At the same time, we demonstrate that the observed AOPs in LMs are strongly correlated with the frequency of the adjective pairs in the training data and report limited generalization to unseen combinations. This highlights the difficulty in establishing the link between LM performance and linguistic theory. We therefore conclude with a road map for future studies our results set the stage for, and a discussion of key questions about the nature of knowledge in LMs and their ability to generalize beyond the training sets.

References (76)

Summary

  • The paper introduces a novel dataset (CAP) and specific metrics (AOP-∆) to measure how language models capture human-like adjective order preferences.
  • The study demonstrates that LMs achieve high accuracy in predicting adjective orders by leveraging both training data frequency and contextual information.
  • Findings reveal that LMs primarily memorize frequency patterns, resulting in limited generalization to unseen adjective pairs.

LLMs and Adjective Order Preferences: An Analysis

The paper "Black Big Boxes: Do Language Models Hide a Theory of Adjective Order?" by Jumelet et al. examines how LMs learn and process adjective order preferences (AOPs) in complex noun phrases. The research evaluates how closely LMs' predictions align with human AOPs and how those predictions are shaped by the models' training data and linguistic context. The authors introduce new methodologies and data resources to probe the mechanisms underlying adjective order in LMs, yielding insights about linguistic generalization and memorization.

Key Contributions and Methodologies

  1. Introduction of the Corpus of Adjective Pairs (CAP):
    • The authors developed a novel dataset, CAP, comprising adjective pairs extracted from a diverse set of English sources. The dataset serves as a reusable benchmark for evaluating AOPs across LMs.
  2. AOP Metrics for LMs:
    • The paper introduces specific metrics to quantify AOPs in LMs. These metrics include AOP-∆, which measures the difference in log probabilities for natural vs. swapped adjective orders, both in isolation and within context.
  3. Experimental Analysis on LMs:
    • The researchers evaluated several pretrained LLMs from the Pythia suite, focusing on their AOP prediction capabilities. Through this, they identified distinct phases of AOP acquisition during training, highlighting that adjective order preferences are learned early and stabilized quickly.
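The core AOP-∆ measure described above, the difference in log probability between the attested and swapped adjective orders, can be sketched in a few lines. This is an illustrative sketch only: the toy scorer and its probability values are invented stand-ins for an actual LM scorer (in the paper, log probabilities come from Pythia checkpoints).

```python
import math

# Toy log-probability table standing in for an LM scorer. In practice this
# would be the summed token log-probs of the phrase under a causal LM;
# the values here are illustrative only.
TOY_LOGPROBS = {
    ("big", "black", "box"): math.log(0.008),
    ("black", "big", "box"): math.log(0.001),
}

def phrase_logprob(adj1, adj2, noun):
    """Return the (toy) log probability of the phrase 'adj1 adj2 noun'."""
    return TOY_LOGPROBS.get((adj1, adj2, noun), math.log(1e-9))

def aop_delta(adj1, adj2, noun):
    """AOP-Delta: log P(adj1 adj2 noun) - log P(adj2 adj1 noun).
    Positive values mean the scorer prefers the attested order."""
    return phrase_logprob(adj1, adj2, noun) - phrase_logprob(adj2, adj1, noun)

print(aop_delta("big", "black", "box") > 0)  # toy scorer prefers "big black box"
```

The same difference can be computed with or without a preceding sentence context, which is how the paper separates its in-isolation and in-context variants of the metric.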

Results and Findings

  1. Close Alignment with Human AOPs:
    • All evaluated models produced predictions closely aligned with human AOPs, outperforming the factors proposed in theoretical linguistics. For instance, the Pythia-12b model achieved an AOP prediction accuracy of up to 94.1%.
  2. Impact of Training Data and Frequency:
    • The paper found a strong correlation between LMs' AOPs and the frequency of adjective pairs within the training corpus. Simple bigram statistics from the training data could independently predict naturally occurring adjective orders with an accuracy of 90.3%.
  3. Role of Context:
    • Contextual information significantly improves LMs' AOP predictions. The presence of context increased AOP accuracy, suggesting that the models leverage more complex linguistic signals beyond mere co-occurrence statistics.
  4. Limited Generalization:
    • While LMs exhibited some capacity to generalize AOPs to unseen adjective combinations, this generalization was relatively limited. The paper shows that LMs primarily rely on memorized frequencies rather than general abstract principles for unseen combinations.
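The bigram baseline mentioned in finding 2 can be illustrated as follows. This is a hedged sketch: the counts are invented for demonstration, whereas in the paper the statistics are drawn from the models' actual training corpus (the Pile), and the baseline simply predicts whichever orientation of an adjective pair is more frequent.

```python
from collections import Counter

# Hypothetical adjective-adjective bigram counts; in the paper these
# statistics come from the Pythia training data.
bigram_counts = Counter({
    ("big", "black"): 120,
    ("black", "big"): 3,
    ("little", "old"): 45,
    ("old", "little"): 2,
})

def predict_order(a, b):
    """Predict the preferred adjective order by comparing raw corpus
    bigram counts; unseen pairs default to the (a, b) orientation."""
    return (a, b) if bigram_counts[(a, b)] >= bigram_counts[(b, a)] else (b, a)

print(predict_order("black", "big"))  # the higher-count orientation wins
```

Because such a baseline has no signal for pairs absent from the corpus, its failure on unseen combinations mirrors the limited generalization the paper reports for the LMs themselves.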

Implications for Future Research

Practical Implications

  • Improvement in NLP Applications:
    • The findings can enhance the performance of NLP applications that rely on nuanced linguistic patterns, such as machine translation and text generation, by fine-tuning LMs to better capture adjective order preferences.

Theoretical Implications

  • Insights into Cognitive Linguistics:
    • The alignment between LM predictions and human AOPs, as well as the models' reliance on frequency and context, provides valuable insights into the cognitive processes involved in language learning and usage.
  • Potential for Linguistic Theory Development:
    • The observed gaps between memorized and generalized behavior underline the potential for advancing linguistic theories that account for the graded and context-sensitive nature of adjective order preferences.

Future Directions

  • Cross-Linguistic Analysis:
    • Extending the analysis to cover multiple languages could unravel universal vs. language-specific aspects of adjective order preferences, beneficial for developing multilingual LMs.
  • Corpus Interventions:
    • Implementing controlled interventions in the training data, such as filtering out specific constructions, could provide deeper insights into the abstraction level at which LMs learn linguistic rules.
  • Contextual Dynamics:
    • Further exploration of how different types of context influence AOPs could refine the understanding of context dependency in LMs, offering broader implications for context-aware language modeling.

Conclusion

This paper provides a comprehensive exploration of how LLMs process and predict adjective order preferences, demonstrating significant alignment with human linguistic behavior. Through detailed experiments and innovative methodological contributions, the paper paves the way for further research into the subtle linguistic capabilities of LMs and their potential applications in both theoretical and practical realms of AI and linguistics.

The findings emphasize the complex interplay between memorization and abstraction in LLMs, suggesting that while current models are adept at leveraging frequency-based patterns, there remains a scope for improving their generalization capabilities in a human-like fashion. As the field continues to evolve, such research is crucial for advancing the understanding and capabilities of NLP systems.
