
Can GPT Redefine Medical Understanding? Evaluating GPT on Biomedical Machine Reading Comprehension (2405.18682v2)

Published 29 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown remarkable performance on many tasks across different domains. However, their performance on closed-book biomedical machine reading comprehension (MRC) has not been evaluated in depth. In this work, we evaluate GPT on four closed-book biomedical MRC benchmarks. We experiment with conventional prompting techniques and also introduce our own novel prompting method. To address some of the retrieval problems inherent to LLMs, we propose a prompting strategy named Implicit Retrieval Augmented Generation (RAG) that removes the need for the vector databases used to retrieve relevant chunks in traditional RAG setups. Moreover, we report qualitative assessments of the natural language generation outputs of our approach. The results show that our new prompting technique achieves the best performance on two of the four datasets and ranks second on the remaining two. Experiments show that modern LLMs like GPT, even in a zero-shot setting, can outperform supervised models, leading to new state-of-the-art (SoTA) results on two of the benchmarks.
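The abstract's core idea, having the LLM retrieve the relevant context itself through prompting instead of querying a vector database, can be illustrated with a minimal sketch. The paper's exact prompt wording is not reproduced on this page, so the templates and the generic `llm(prompt) -> str` callable below are hypothetical, meant only to show the two-step shape such an implicit-retrieval pipeline could take.

```python
# Hypothetical sketch of an "Implicit RAG"-style prompting pipeline.
# Step 1 asks the model to quote the passage sections it deems relevant;
# step 2 asks it to answer using only that self-retrieved evidence.
# No vector database or embedding search is involved at any point.

def build_extraction_prompt(context: str, question: str) -> str:
    """Step 1: ask the model to pull out the relevant sentences itself."""
    return (
        "Read the passage below and copy out, verbatim, the sentences most "
        "relevant to answering the question.\n\n"
        f"Passage:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Relevant sentences:"
    )

def build_answer_prompt(evidence: str, question: str) -> str:
    """Step 2: answer using only the self-retrieved evidence."""
    return (
        "Using only the evidence below, answer the question concisely.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

def implicit_rag_answer(llm, context: str, question: str) -> str:
    """Chain the two steps with any callable `llm(prompt) -> str`."""
    evidence = llm(build_extraction_prompt(context, question))
    return llm(build_answer_prompt(evidence, question))
```

In a conventional RAG setup, the first step would instead embed the question, search a vector store, and return the nearest chunks; here both retrieval and answering are done by the same model through prompting alone.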


