Enhancing Biomedical Text Summarization and Question-Answering: On the Utility of Domain-Specific Pre-Training (2307.04412v1)
Published 10 Jul 2023 in cs.CL
Abstract: Biomedical summarization requires large datasets to train text generation models. We show that while transfer learning offers a viable option for addressing this challenge, in-domain pre-training does not always offer advantages in a BioASQ summarization task. We identify a suitable model architecture and use it to show the benefit of general-domain pre-training followed by task-specific fine-tuning in the context of a BioASQ summarization task, leading to a novel three-step fine-tuning approach that works with only a thousand in-domain examples. Our results indicate that a large language model (LLM) without domain-specific pre-training can have a significant edge in some domain-specific biomedical text generation tasks.
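As a concrete illustration of the general-domain-pre-training-then-fine-tune recipe described in the abstract, the sketch below fine-tunes a general-domain BART checkpoint on a small set of in-domain source/summary pairs with Hugging Face Transformers. The checkpoint name, column names, example data, and hyperparameters are illustrative assumptions, not the paper's exact three-step configuration.

```python
# Minimal sketch (not the authors' released code): fine-tune a general-domain
# BART checkpoint on ~1,000 in-domain summarization pairs, in the spirit of
# "general-domain pre-training + task-specific fine-tuning".
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "facebook/bart-large"  # general-domain checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Placeholder for the in-domain data (e.g. BioASQ question + snippets as the
# source, ideal answer as the target); replace with the real data loading step.
train_data = Dataset.from_dict({
    "source": ["What is the role of TP53 in cancer? <context snippets here>"],
    "target": ["TP53 is a tumour suppressor gene frequently mutated in cancer."],
})

def preprocess(batch):
    # Tokenize source documents and target summaries for seq2seq training.
    model_inputs = tokenizer(batch["source"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_data.map(
    preprocess, batched=True, remove_columns=["source", "target"]
)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-bioasq-finetuned",
    num_train_epochs=3,              # illustrative values, not the paper's settings
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The same loop can be repeated with a biomedical checkpoint (e.g. a BioBART-style model) to compare in-domain against general-domain pre-training under an identical fine-tuning budget.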