Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation (2305.17819v3)
Abstract: The paper introduces a framework for evaluating the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from LLMs trained on a large corpus of scientific literature could mark a step change in biomedical discovery, lowering the barriers to accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using antibiotic discovery as the application context. The framework comprises three evaluation steps that sequentially assess the fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4, and Llama 2, on two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination. Although recent models have improved in fluency, factual accuracy remains low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is therefore questioned, and the need for additional systematic evaluation frameworks is highlighted. While LLMs are currently not fit for purpose as zero-shot biomedical factual knowledge bases, factuality shows signs of emerging as models become domain-specialised, scale up in size, and incorporate higher levels of human feedback.
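The staged design described in the abstract lends itself to a triage pipeline: inexpensive non-expert checks (fluency, then prompt alignment and semantic coherence) gate the responses before the costly expert checks (factual knowledge and specificity). The sketch below illustrates this cascade; the function names, the gating heuristics, and the example prompt are illustrative assumptions for exposition, not the paper's actual annotation protocol or rating scales.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three-stage evaluation cascade from the abstract.
# Stage 1 (non-expert): is the response fluent, well-formed text?
# Stage 2 (non-expert): does it align with the prompt and cohere semantically?
# Stage 3 (expert):     is it factually correct and sufficiently specific?
# The concrete checks here are placeholder heuristics, not the paper's criteria.

@dataclass
class Judgement:
    response: str
    passed_stages: list = field(default_factory=list)
    rejected_at: str | None = None

def stage1_fluency(response: str) -> bool:
    # Placeholder: a non-expert annotator (or automatic metric) would judge
    # fluency; here we only reject empty or degenerate output.
    return len(response.split()) > 3

def stage2_alignment_coherence(prompt: str, response: str) -> bool:
    # Placeholder non-expert check that the response addresses the prompt.
    # Crude heuristic: the queried entity should be mentioned in the answer.
    entity = prompt.split('"')[1] if '"' in prompt else prompt
    return entity.lower() in response.lower()

def stage3_expert_review(response: str) -> bool:
    # Expert-only check of factual knowledge and specificity; not automatable,
    # modelled here as a queue for offline manual review.
    print(f"[EXPERT QUEUE] {response}")
    return True  # decided offline by a domain expert

def evaluate(prompt: str, response: str) -> Judgement:
    j = Judgement(response)
    for name, check in [
        ("fluency", lambda: stage1_fluency(response)),
        ("alignment/coherence", lambda: stage2_alignment_coherence(prompt, response)),
        ("factuality/specificity", lambda: stage3_expert_review(response)),
    ]:
        if not check():
            j.rejected_at = name
            return j
        j.passed_stages.append(name)
    return j

if __name__ == "__main__":
    # Hypothetical prompt in the style of the paper's compound-definition task.
    prompt = 'Give a definition of the chemical compound "penicillin".'
    response = "Penicillin is a beta-lactam antibiotic produced by Penicillium fungi."
    print(evaluate(prompt, response))
```

The key design point is that only responses surviving the first two non-expert stages ever reach the expert queue, which is how the framework reduces the effort demanded of domain experts.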
Authors: Magdalena Wysocka, Oskar Wysocki, Maxime Delmas, Vincent Mutel, Andre Freitas