Principles from Clinical Research for NLP Model Generalization (2311.03663v3)
Abstract: The NLP community typically relies on a model's performance on a held-out test set to assess generalization. Performance drops observed on datasets outside the official test set are generally attributed to "out-of-distribution" effects. Here, we explore the foundations of generalizability and study the factors that affect it, articulating lessons from clinical studies. In clinical research, generalizability is an act of reasoning that depends on (a) the internal validity of experiments, which ensures controlled measurement of cause and effect, and (b) external validity, or the transportability of the results to the wider population. We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can compromise a model's internal validity and in turn adversely impact generalization. We therefore argue for ensuring internal validity when building machine learning models in NLP. Our recommendations also apply to generative large language models (LLMs), which are known to be sensitive to even minor semantics-preserving alterations. Finally, we propose adapting the idea of matching, as used in randomized controlled trials and observational studies, to NLP evaluation as a way to measure causation.
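The spurious-correlation problem the abstract describes can be made concrete with a minimal, hypothetical sketch (not the paper's actual experiment): on a synthetic "relation extraction" dataset where the label happens to correlate with the token distance between the two entities, a degenerate classifier that ignores all lexical content and thresholds on distance alone scores perfectly — exactly the kind of confound that threatens internal validity.

```python
import random

random.seed(0)

def make_example(label):
    # By construction, positive pairs sit close together and negatives far
    # apart -- a confound baked into this toy dataset.
    dist = random.randint(1, 4) if label == 1 else random.randint(6, 12)
    tokens = ["ENT1"] + ["w"] * dist + ["ENT2"]
    return tokens, label

data = [make_example(i % 2) for i in range(200)]

def distance_classifier(tokens, threshold=5):
    # Predict "related" purely from entity distance, ignoring all other content.
    dist = tokens.index("ENT2") - tokens.index("ENT1") - 1
    return 1 if dist <= threshold else 0

accuracy = sum(distance_classifier(t) == y for t, y in data) / len(data)
print(f"distance-only accuracy: {accuracy:.2f}")  # prints 1.00
```

A model trained on such data can look strong on an i.i.d. test split while having learned nothing about the relation itself, which is why its performance collapses on data where the distance–label correlation does not hold.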