
RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain (2403.14578v1)

Published 21 Mar 2024 in cs.LG and cs.AI

Abstract: LLMs increasingly support applications in a wide range of domains, some with potentially high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssessMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring freeform LLM responses that mimic real-world user interactions, and we evaluate LLM performance using semantic similarity with a ground-truth response, as judged by an evaluator LLM.
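The evaluation protocol described in the abstract, scoring a freeform answer by its semantic similarity to a ground-truth response, can be illustrated with a minimal sketch. The snippet below uses a sentence-embedding model as a stand-in for the paper's evaluator LLM; the model name, function name, similarity measure, and example strings are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: score a freeform LLM answer against a ground-truth reference
# via embedding cosine similarity. This is an assumption-laden stand-in for the
# evaluator-LLM judgement described in the paper, not its actual implementation.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(response: str, ground_truth: str) -> float:
    """Cosine similarity between the embeddings of a response and its reference."""
    emb = embedder.encode([response, ground_truth], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Hypothetical biomedical QA example.
score = semantic_similarity(
    "Aspirin irreversibly inhibits cyclooxygenase enzymes.",
    "Aspirin works by irreversibly inhibiting COX-1 and COX-2.",
)
print(f"semantic similarity: {score:.3f}")
```

In practice a similarity threshold (or an evaluator LLM prompted to compare the two texts, as in the paper) would convert this continuous score into a pass/fail judgement per task.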
