
RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain (2403.14578v1)

Published 21 Mar 2024 in cs.LG and cs.AI

Abstract: LLMs increasingly support applications in a wide range of domains, some with potentially high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssessMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring freeform LLM responses that mimic real-world user interactions, and we evaluate LLM performance using semantic similarity with a ground-truth response, as judged by an evaluator LLM.
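The evaluation protocol described in the abstract, scoring a freeform answer by its semantic similarity to a ground-truth response, can be illustrated with a minimal sketch. The snippet below uses a sentence-embedding model as a stand-in for the paper's evaluator LLM; the model name, function name, similarity measure, and example strings are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: score a freeform LLM answer against a ground-truth reference
# via embedding cosine similarity. This is an assumption-laden stand-in for the
# evaluator-LLM judgement described in the paper, not its actual implementation.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(response: str, ground_truth: str) -> float:
    """Cosine similarity between the embeddings of a response and its reference."""
    emb = embedder.encode([response, ground_truth], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Hypothetical biomedical QA example.
score = semantic_similarity(
    "Aspirin irreversibly inhibits cyclooxygenase enzymes.",
    "Aspirin works by irreversibly inhibiting COX-1 and COX-2.",
)
print(f"semantic similarity: {score:.3f}")
```

In practice a similarity threshold (or an evaluator LLM prompted to compare the two texts, as in the paper) would convert this continuous score into a pass/fail judgement per task.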
