Generative LLMs as Autonomous Practitioners in Evidence-Based Medicine
The paper "Generative LLMs are Autonomous Practitioners of Evidence-Based Medicine" investigates the use of generative LLMs as autonomous agents in evidence-based clinical practice. The authors leverage the problem-solving and reasoning abilities of LLMs to manage real-world clinical cases autonomously, combining prompt engineering, diagnostic tools, retrieval-augmented generation (RAG), and established clinical guidelines.
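At a high level, the agentic setup can be pictured as a loop in which the model alternates between requesting an investigation or a guideline lookup and updating its working context before committing to an answer. The sketch below is a minimal illustration of that pattern; the tool names, reply format, and prompts are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an autonomous clinical agent loop; tool names,
# reply format, and prompts are illustrative assumptions.
from typing import Callable

def order_lab(test_name: str) -> str:
    """Placeholder tool: return a result for the requested lab test."""
    return f"Result of {test_name}: within normal limits"

def retrieve_guideline(query: str) -> str:
    """Placeholder RAG tool: return a relevant guideline excerpt."""
    return f"Guideline excerpt relevant to: {query}"

TOOLS: dict[str, Callable[[str], str]] = {
    "order_lab": order_lab,
    "retrieve_guideline": retrieve_guideline,
}

def run_case(llm: Callable[[str], str], case_summary: str, max_steps: int = 5) -> str:
    """Let the LLM request tools iteratively until it commits to an answer."""
    context = (
        "You are a clinician practicing evidence-based medicine.\n"
        f"Case: {case_summary}"
    )
    for _ in range(max_steps):
        # Expected reply formats: "TOOL <tool_name> <argument>" or "ANSWER <text>"
        reply = llm(context)
        if reply.startswith("ANSWER"):
            return reply
        _, tool_name, arg = reply.split(" ", 2)
        observation = TOOLS.get(tool_name, lambda a: "Unknown tool")(arg)
        context += f"\n{reply}\nObservation: {observation}"
    return "No final answer within step budget"
```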
Methods
The authors curated real-world clinical cases across multiple medical specialties and converted them into structured .json files. These files captured clinically relevant information such as patient symptoms, signs, past medical history, and results from laboratory tests and imaging studies, paired with questions about the best next step in patient management.
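The exact schema of these case files is not reproduced here; the snippet below sketches what such a structured case might plausibly look like, with field names and clinical content chosen purely for illustration.

```python
# Illustrative structure of a curated case file; field names and values
# are assumptions, not the paper's actual schema.
import json

example_case = {
    "specialty": "Cardiology",
    "difficulty": "moderate",
    "presentation": "58-year-old with exertional chest pain for two weeks",
    "past_medical_history": ["hypertension", "type 2 diabetes"],
    "vital_signs": {"bp": "150/92", "hr": 88},
    "available_results": {"ecg": "non-specific ST changes", "troponin": "pending"},
    "question": "What is the best next step in patient management?",
    "reference_answer": "gold-standard next step used for grading",
}

with open("case_001.json", "w") as f:
    json.dump(example_case, f, indent=2)
```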
The LLMs evaluated included both proprietary models (GPT-3.5, GPT-4, Gemini Pro) and open-source models (LLaMA v2-70B, Mixtral-8x7B). Each model was scored on four metrics: correctness of the final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations, with performance broken down by specialty and case difficulty.
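How the graded outputs are rolled up into per-specialty scores is straightforward bookkeeping; a minimal sketch is shown below, assuming each case has already been graded (by a human or automated judge, which is outside this snippet).

```python
# Minimal sketch of aggregating graded results by specialty; the grading
# itself is assumed to have been done upstream.
from collections import defaultdict

def correctness_by_specialty(results: list[dict]) -> dict[str, float]:
    """results: [{"specialty": "Cardiology", "correct": True}, ...]"""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["specialty"]] += 1
        hits[r["specialty"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

print(correctness_by_specialty([
    {"specialty": "Cardiology", "correct": True},
    {"specialty": "Cardiology", "correct": False},
]))  # {'Cardiology': 0.5}
```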
Results
Correctness of Final Answer
GPT-4 outperformed the other models, excelling in Cardiology (80% correctness), Genetics (100%), and Critical Care (100%). All models' performance declined on more complex cases, with proprietary models generally outperforming open-source ones.
Judicious Use of Tools
GPT-4 also excelled in the judicious use of diagnostic tools, maintaining logical, directed tool use across most specialties. It was notably precise in selecting relevant investigations, outperforming the other models particularly in Cardiology and Genetics. Identity reshaping via prompt engineering had a notable effect: model behavior changed when the model was assigned the role of a "Clinical Geneticist".
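The exact wording of the identity-reshaping prompt is not given in this summary; the following is only an illustrative sketch of what such a role-conditioning system prompt might look like.

```python
# Illustrative identity-reshaping system prompt; the wording is an
# assumption, not the prompt used in the paper.
def build_system_prompt(role: str = "Clinical Geneticist") -> str:
    return (
        f"You are a {role} practicing evidence-based medicine. "
        "Order only investigations that would change management, justify "
        "each one, and cite the guideline you follow for every recommendation."
    )
```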
Conformity to Guidelines
With RAG enabled, GPT-4 showed a marked improvement in guideline adherence, averaging roughly a 10% performance gain over the other models. RAG substantially improved the model's ability to tailor recommendations to the specific guidelines retrieved, although conformity to complex guidelines remained a challenge.
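A typical RAG setup of this kind retrieves the guideline passages most similar to the case summary and prepends them to the prompt. The sketch below shows one way to do this; the embedding model (sentence-transformers) and prompt layout are assumptions, not the paper's setup.

```python
# Minimal RAG sketch: embed guideline passages, retrieve the most similar
# ones for a case, and build a guideline-grounded prompt. The embedding
# model and prompt format are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_guidelines(case_summary: str, passages: list[str], k: int = 3) -> list[str]:
    doc_vecs = embedder.encode(passages, normalize_embeddings=True)
    query_vec = embedder.encode([case_summary], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

def build_prompt(case_summary: str, passages: list[str]) -> str:
    context = "\n".join(retrieve_guidelines(case_summary, passages))
    return (
        f"Relevant guidelines:\n{context}\n\n"
        f"Case:\n{case_summary}\n\n"
        "What is the best next step in management?"
    )
```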
Resistance to Hallucinations
All models exhibited minimal hallucinations, with GPT-3.5 performing best overall. Errors were predominantly related to incorrectly named laboratory tests. The open-source models hallucinated more often in Emergency Medicine cases, particularly LLaMA-70B, which scored poorly in this regard.
Discussion
This research underscores that LLMs have potential well beyond serving as medical databases: they can reason and autonomously navigate clinical scenarios, much like a clinician practicing evidence-based medicine. Their next-word prediction extends to making informed clinical decisions by iteratively building up patient context through tool use.
The implications of this paper are significant for clinical decision support systems, especially in resource-constrained settings. LLMs could serve as triage specialists or as a first point of patient contact, synthesizing patient history and clinical findings to inform subsequent care. They can also alleviate clinician workload by summarizing patient records, thereby mitigating information overload.
The use of RAG highlights a critical element in medical applications of LLMs: because medical knowledge evolves, ongoing updates and accurate context retrieval are essential. Models equipped with RAG can integrate up-to-date information dynamically, enhancing their clinical relevance.
Future work will focus on integrating larger, more sophisticated multi-modal models capable of handling text, images, and videos. These advancements will further improve the accuracy and reliability of LLMs in real-world clinical settings. Additionally, work on reducing the propensity for hallucinations through refined prompt engineering and systematic updates will be essential.
In conclusion, the paper demonstrates that LLMs, including advanced models like GPT-4, can act as promising autonomous practitioners in evidence-based medicine. By seamlessly integrating with healthcare infrastructures through tools and enhanced capabilities such as RAG, these models provide a transformative approach to clinical practice, ultimately benefiting both clinicians and patients.