Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare (2405.13030v1)
Abstract: Large language models (LLMs) have demonstrated immense potential across various domains of artificial intelligence, including healthcare. However, their efficacy is limited by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains such as healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time, and post-data-gathering stages. We evaluated the effect of enhanced data quality by measuring its impact on an LLM (Bio-BERT) used to predict autism-related symptoms. The results show that real-time quality control improves data quality by 19 percent compared to pre-quality control. Fine-tuning Bio-BERT on crowdsourced data generally increased recall relative to the Bio-BERT baseline but lowered precision. Our findings highlight the potential of crowdsourcing and quality control in resource-constrained environments and offer insights into optimizing healthcare LLMs for informed decision-making and improved patient care.