We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text (2404.07304v2)
Abstract: We present a suite of experiments designed to uncover the underlying challenges of adapting LLMs to nonstandard text. We design interventions that approximate several types of linguistic variation and study their interactions with the existing biases of LLMs. Applying these interventions during LLM adaptation with training data of varying size and nature, we gain important insights into when knowledge transfer can succeed, as well as which aspects of linguistic variation are particularly difficult for LLMs to handle. For instance, on text with character-level variation, performance improves with even a few training examples but quickly plateaus, suggesting that more data is not the solution. In contrast, on text with variation involving new words or meanings, far more data is needed, but it leads to a massive breakthrough in performance. Our findings reveal that existing models lack the necessary infrastructure to handle diverse forms of nonstandard text and linguistic variation, and they guide the development of more resilient language modeling techniques for the future. We make the code for our interventions, which can be applied to any English text data, publicly available.
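To make the idea of a character-level intervention concrete, the sketch below randomly substitutes letters within words at a fixed rate. This is a minimal, hypothetical stand-in for the kind of perturbation the abstract describes, not the paper's released code; the function name, rate, and substitution scheme are all assumptions for illustration.

```python
import random

def character_intervention(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Randomly replace alphabetic characters to simulate character-level
    variation in English text (illustrative only, not the paper's code)."""
    rng = random.Random(seed)
    perturbed_words = []
    for word in text.split():
        chars = list(word)
        for i, c in enumerate(chars):
            if c.isalpha() and rng.random() < rate:
                # Substitute a random lowercase letter as a simple proxy
                # for orthographic variation (e.g., spelling differences).
                chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        perturbed_words.append("".join(chars))
    return " ".join(perturbed_words)

print(character_intervention("language models struggle with nonstandard text"))
```

Because word boundaries and lengths are preserved, such an intervention isolates character-level variation from lexical or syntactic variation, which matches the abstract's framing of studying each variation type separately.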
Authors: Aarohi Srivastava, David Chiang