Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation (2305.17819v3)

Published 28 May 2023 in cs.CL and cs.AI

Abstract: The paper introduces a framework for evaluating the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from LLMs trained on a large corpus of scientific literature could mark a step change in biomedical discovery, lowering the barriers to accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge in the context of antibiotic discovery. The framework comprises three evaluation steps that sequentially assess different aspects of the generated responses: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, on two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination. Although recent models have improved in fluency, factual accuracy remains low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted. While LLMs are currently not fit for purpose as zero-shot biomedical factual knowledge bases, factuality shows promising signs of emerging as models become domain-specialised, scale up in size and incorporate greater levels of human feedback.
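To make the staged evaluation concrete, the Python sketch below illustrates how the split between non-expert and expert annotation could be organised: inexpensive surface checks (fluency, prompt alignment, semantic coherence) filter responses before the costlier expert judgements of factual knowledge and specificity. This is a minimal sketch of the idea, not the authors' implementation; the rating scale, threshold, stub `rate` function and demo compound are illustrative assumptions.

```python
# Minimal sketch of the staged evaluation idea described in the abstract:
# surface-level checks by non-experts gate which generated responses reach
# domain experts for factuality and specificity judgements.
# Rating scale, threshold and the stub `rate` function are assumptions.

from dataclasses import dataclass, field


@dataclass
class Response:
    compound: str                                   # chemical compound the prompt asked about
    text: str                                       # LLM-generated definition
    scores: dict[str, int] = field(default_factory=dict)


def rate(criterion: str, response: Response) -> int:
    # Placeholder for collecting a 1-5 rating from a human annotator;
    # in a real study this would be an actual judgement, not a constant.
    return 4


def non_expert_pass(response: Response, threshold: int = 3) -> bool:
    # Stages handled by non-experts: criteria needing no domain knowledge.
    for criterion in ("fluency", "prompt_alignment", "semantic_coherence"):
        response.scores[criterion] = rate(criterion, response)
    return all(score >= threshold for score in response.scores.values())


def expert_review(response: Response) -> None:
    # Stage handled by domain experts, applied only to surviving responses.
    for criterion in ("factual_knowledge", "specificity"):
        response.scores[criterion] = rate(criterion, response)


def evaluate(responses: list[Response]) -> list[Response]:
    surviving = [r for r in responses if non_expert_pass(r)]
    for r in surviving:
        expert_review(r)
    return surviving


if __name__ == "__main__":
    demo = [Response("amphotericin B",
                     "Amphotericin B is a polyene antifungal compound ...")]
    for r in evaluate(demo):
        print(r.compound, r.scores)
```

The design choice mirrored here is the one highlighted in the abstract: by front-loading the cheap criteria, only a filtered subset of model outputs ever requires scarce expert time.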

Authors (5)
  1. Magdalena Wysocka (8 papers)
  2. Oskar Wysocki (11 papers)
  3. Maxime Delmas (4 papers)
  4. Vincent Mutel (1 paper)
  5. Andre Freitas (52 papers)
Citations (1)