Backdoor Adjustment of Confounding by Provenance for Robust Text Classification of Multi-institutional Clinical Notes (2310.02451v1)
Abstract: NLP methods have been broadly applied to clinical tasks, and machine learning and deep learning approaches have been used to improve their performance. However, these approaches require sufficiently large training datasets, and trained models have been shown to transfer poorly across sites. These issues have led to the promotion of data collection and integration across institutions to obtain accurate and portable models. However, pooling data in this way can introduce a form of bias called confounding by provenance. When source-specific data distributions differ at deployment, this bias may harm model performance. To address this issue, we evaluate the utility of backdoor adjustment for text classification on a multi-site dataset of clinical notes annotated for mentions of substance abuse, using an evaluation framework devised to measure robustness to distributional shifts. Our results indicate that backdoor adjustment can effectively mitigate confounding shift.
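The abstract names backdoor adjustment but does not describe how it is implemented. As a rough illustration only, the standard backdoor formula marginalizes the confounder z (here, the note's source site) out of the prediction: P(y | do(x)) = sum over z of P(y | x, z) P(z). The sketch below assumes a classifier that can score a note conditioned on a site label; the function names, the site labels, and the toy scorer are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def site_prior(train_sites: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Estimate P(z) as the relative frequency of each site in the training data."""
    counts = Counter(train_sites)
    total = sum(counts.values())
    return {site: n / total for site, n in counts.items()}


def backdoor_adjusted_proba(
    cond_proba: Callable[[Sequence[float], Hashable], float],
    x: Sequence[float],
    prior: Dict[Hashable, float],
) -> float:
    """Return sum_z P(y=1 | x, z) * P(z), marginalizing over the site z."""
    return sum(cond_proba(x, z) * p_z for z, p_z in prior.items())


def toy_cond_proba(x: Sequence[float], z: Hashable) -> float:
    """Hypothetical stand-in for a trained model's P(y=1 | x, z)."""
    base = 0.8 if x[0] > 0 else 0.2
    site_shift = {"site_a": 0.1, "site_b": -0.1}[z]  # assumed site effect
    return base + site_shift


# Three training notes from site_a and one from site_b (made-up data).
prior = site_prior(["site_a", "site_a", "site_a", "site_b"])
adjusted = backdoor_adjusted_proba(toy_cond_proba, [1.0], prior)
print(f"site prior: {prior}")
print(f"adjusted P(y=1 | do(x)): {adjusted:.3f}")
```

Because the prediction is averaged over the training-time site prior rather than conditioned on the site the note happens to come from, a change in the site mix at deployment has less influence on the score. In practice, `toy_cond_proba` would be replaced by a real model trained with the site indicator as an additional input.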