Generative LLMs as Autonomous Practitioners in Evidence-Based Medicine
The paper "Generative LLMs are Autonomous Practitioners of Evidence-Based Medicine" investigates the use of generative LLMs as autonomous agents in evidence-based clinical practice. The authors leverage the problem-solving and reasoning abilities of LLMs to manage real-world clinical cases autonomously, combining prompt engineering, diagnostic tools, retrieval-augmented generation (RAG), and established clinical guidelines.
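At a high level, the agentic setup can be pictured as a loop in which the model alternates between requesting an investigation or a guideline lookup and updating its working context before committing to an answer. The sketch below is a minimal illustration of that pattern; the tool names, reply format, and prompts are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an autonomous clinical agent loop; tool names,
# reply format, and prompts are illustrative assumptions.
from typing import Callable

def order_lab(test_name: str) -> str:
    """Placeholder tool: return a result for the requested lab test."""
    return f"Result of {test_name}: within normal limits"

def retrieve_guideline(query: str) -> str:
    """Placeholder RAG tool: return a relevant guideline excerpt."""
    return f"Guideline excerpt relevant to: {query}"

TOOLS: dict[str, Callable[[str], str]] = {
    "order_lab": order_lab,
    "retrieve_guideline": retrieve_guideline,
}

def run_case(llm: Callable[[str], str], case_summary: str, max_steps: int = 5) -> str:
    """Let the LLM request tools iteratively until it commits to an answer."""
    context = (
        "You are a clinician practicing evidence-based medicine.\n"
        f"Case: {case_summary}"
    )
    for _ in range(max_steps):
        # Expected reply formats: "TOOL <tool_name> <argument>" or "ANSWER <text>"
        reply = llm(context)
        if reply.startswith("ANSWER"):
            return reply
        _, tool_name, arg = reply.split(" ", 2)
        observation = TOOLS.get(tool_name, lambda a: "Unknown tool")(arg)
        context += f"\n{reply}\nObservation: {observation}"
    return "No final answer within step budget"
```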
Methods
The authors curated real-world clinical cases across multiple medical specialties and converted them into structured .json files. These files captured clinically relevant information such as patient symptoms, signs, past medical history, and results from laboratory tests and imaging studies, paired with questions about the best next step in patient management.
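The exact schema of these case files is not reproduced here; the snippet below sketches what such a structured case might plausibly look like, with field names and clinical content chosen purely for illustration.

```python
# Illustrative structure of a curated case file; field names and values
# are assumptions, not the paper's actual schema.
import json

example_case = {
    "specialty": "Cardiology",
    "difficulty": "moderate",
    "presentation": "58-year-old with exertional chest pain for two weeks",
    "past_medical_history": ["hypertension", "type 2 diabetes"],
    "vital_signs": {"bp": "150/92", "hr": 88},
    "available_results": {"ecg": "non-specific ST changes", "troponin": "pending"},
    "question": "What is the best next step in patient management?",
    "reference_answer": "gold-standard next step used for grading",
}

with open("case_001.json", "w") as f:
    json.dump(example_case, f, indent=2)
```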
The LLMs evaluated included both proprietary models (GPT-3.5, GPT-4, Gemini Pro) and open-source models (LLaMA v2-70B, Mixtral-8x7B). Each model was scored on four metrics: correctness of the final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations, with performance broken down by specialty and case difficulty.
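How the graded outputs are rolled up into per-specialty scores is straightforward bookkeeping; a minimal sketch is shown below, assuming each case has already been graded (by a human or automated judge, which is outside this snippet).

```python
# Minimal sketch of aggregating graded results by specialty; the grading
# itself is assumed to have been done upstream.
from collections import defaultdict

def correctness_by_specialty(results: list[dict]) -> dict[str, float]:
    """results: [{"specialty": "Cardiology", "correct": True}, ...]"""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["specialty"]] += 1
        hits[r["specialty"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

print(correctness_by_specialty([
    {"specialty": "Cardiology", "correct": True},
    {"specialty": "Cardiology", "correct": False},
]))  # {'Cardiology': 0.5}
```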
Results
Correctness of Final Answer
GPT-4 outperformed the other models, excelling in Cardiology (80% correctness), Genetics (100%), and Critical Care (100%). All models' performance declined on more complex cases, with proprietary models generally outperforming open-source ones.
Judicious Use of Tools
GPT-4 also excelled in the judicious use of diagnostic tools, maintaining logical, directed tool use across most specialties. It was notably precise in selecting relevant investigations, outperforming the other models particularly in Cardiology and Genetics. Identity reshaping via prompt engineering had a notable effect: model behavior changed when the model was assigned the role of a "Clinical Geneticist".
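The exact wording of the identity-reshaping prompt is not given in this summary; the following is only an illustrative sketch of what such a role-conditioning system prompt might look like.

```python
# Illustrative identity-reshaping system prompt; the wording is an
# assumption, not the prompt used in the paper.
def build_system_prompt(role: str = "Clinical Geneticist") -> str:
    return (
        f"You are a {role} practicing evidence-based medicine. "
        "Order only investigations that would change management, justify "
        "each one, and cite the guideline you follow for every recommendation."
    )
```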
Conformity to Guidelines
With RAG enabled, GPT-4 showed a marked improvement in guideline adherence, averaging roughly a 10% performance gain over the other models. RAG substantially improved the model's ability to tailor recommendations to the specific guidelines retrieved, although conformity to complex guidelines remained a challenge.
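A typical RAG setup of this kind retrieves the guideline passages most similar to the case summary and prepends them to the prompt. The sketch below shows one way to do this; the embedding model (sentence-transformers) and prompt layout are assumptions, not the paper's setup.

```python
# Minimal RAG sketch: embed guideline passages, retrieve the most similar
# ones for a case, and build a guideline-grounded prompt. The embedding
# model and prompt format are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_guidelines(case_summary: str, passages: list[str], k: int = 3) -> list[str]:
    doc_vecs = embedder.encode(passages, normalize_embeddings=True)
    query_vec = embedder.encode([case_summary], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

def build_prompt(case_summary: str, passages: list[str]) -> str:
    context = "\n".join(retrieve_guidelines(case_summary, passages))
    return (
        f"Relevant guidelines:\n{context}\n\n"
        f"Case:\n{case_summary}\n\n"
        "What is the best next step in management?"
    )
```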
Resistance to Hallucinations
All models exhibited minimal hallucinations, with GPT-3.5 performing best overall. Errors were predominantly related to incorrectly named laboratory tests. The open-source models hallucinated more often in Emergency Medicine cases, particularly LLaMA-70B, which scored poorly in this regard.
Discussion
This research underscores that LLMs have potential well beyond serving as medical databases: they can reason and autonomously navigate clinical scenarios, much like a clinician practicing evidence-based medicine. Their next-word prediction extends to making informed clinical decisions by iteratively building up patient context through tool use.
The implications of this paper are significant for clinical decision support systems, especially in resource-constrained settings. LLMs could serve as triage specialists or as a first point of patient contact, synthesizing patient history and clinical findings to inform subsequent care. They can also alleviate clinician workload by summarizing patient records, thereby mitigating information overload.
The use of RAG highlights a critical element in medical applications of LLMs: because medical knowledge evolves, ongoing updates and accurate context retrieval are essential. Models equipped with RAG can integrate up-to-date information dynamically, enhancing their clinical relevance.
Future work will focus on integrating larger, more sophisticated multi-modal models capable of handling text, images, and videos. These advancements will further improve the accuracy and reliability of LLMs in real-world clinical settings. Additionally, work on reducing the propensity for hallucinations through refined prompt engineering and systematic updates will be essential.
In conclusion, the paper demonstrates that LLMs, including advanced models like GPT-4, can act as promising autonomous practitioners in evidence-based medicine. By seamlessly integrating with healthcare infrastructures through tools and enhanced capabilities such as RAG, these models provide a transformative approach to clinical practice, ultimately benefiting both clinicians and patients.