
Towards Conversational Diagnostic AI (2401.05654v1)

Published 11 Jan 2024 in cs.AI, cs.CL, and cs.LG

Abstract: At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. AI systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a LLM based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.

Introduction to AMIE

In medicine, the dialogue between physician and patient is a cornerstone of healthcare delivery. The success of treatment is often rooted in the quality of this interaction, which lays the foundation for diagnosis and patient care. With advances in AI, intelligent systems capable of conducting such crucial conversations have made notable progress. AMIE, short for Articulate Medical Intelligence Explorer, is one such AI system: it uses LLMs to carry out diagnostic dialogues.

Training and Methodology Behind AMIE

The ingenuity behind AMIE lies not just in its ability to converse but in its refined learning environment, which simulates varied medical scenarios. By engaging in "self-play" within this simulated environment, AMIE scales its learning across diverse diseases, specialties, and contexts. Its training also incorporated real-world datasets comprising electronic health records, medical question answering, and transcribed medical conversations. In addition, AMIE employs a "chain-of-reasoning" strategy, systematically refining its responses to ensure accurate and empathetic communication with the patient.
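The self-play idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the doctor, patient, and critic below are trivial stand-ins for LLM calls, and the refinement loop simply regenerates a reply whenever the critic's score falls below a hypothetical threshold.

```python
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    scenario: str
    turns: list = field(default_factory=list)  # (role, text) pairs

def patient_agent(dialogue):
    # Stand-in for an LLM role-playing a patient for the given scenario.
    return f"patient reply about {dialogue.scenario} (turn {len(dialogue.turns)})"

def doctor_agent(dialogue, feedback=None):
    # Stand-in for the diagnostic LLM; critic feedback steers a regenerated reply.
    style = "refined" if feedback else "initial"
    return f"{style} question about {dialogue.scenario}"

def critic(dialogue, reply):
    # Stand-in for automated feedback scoring a draft reply.
    return {"score": 0.9 if reply.startswith("refined") else 0.5,
            "feedback": "ask an open-ended question"}

def self_play_episode(scenario, max_turns=3, threshold=0.8):
    dialogue = Dialogue(scenario)
    for _ in range(max_turns):
        dialogue.turns.append(("patient", patient_agent(dialogue)))
        reply = doctor_agent(dialogue)
        review = critic(dialogue, reply)
        if review["score"] < threshold:
            # Inner refinement loop: regenerate using the critic's feedback.
            reply = doctor_agent(dialogue, feedback=review["feedback"])
        dialogue.turns.append(("doctor", reply))
    return dialogue

episode = self_play_episode("chronic cough")
```

In the real system the critic's feedback would shape the next generation rather than just toggle a label, and episodes across many simulated conditions would feed back into fine-tuning.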

Evaluating AMIE’s Capabilities

To evaluate AMIE against the benchmark of primary care physicians, researchers conducted a rigorous randomized, double-blind crossover study. Both AMIE and physicians interacted with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE), a common method in medical education for assessing clinical competence. AMIE's performance across a wide range of diagnostic cases was judged by specialist physicians and by the patient actors themselves. Garnering superior ratings on most evaluation axes, AMIE demonstrated remarkable diagnostic accuracy, outperforming primary care physicians in multiple areas.
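The "superior on N of M axes" framing can be sketched as a paired, per-axis comparison. The axis names and scores below are illustrative placeholders, not data from the study; the real analysis used many more axes and formal statistics (e.g. signed-rank tests with multiple-testing correction).

```python
from statistics import median

# Hypothetical paired ratings: the same raters score both systems on each axis.
ratings = {
    "diagnostic_accuracy": {"amie": [4, 5, 4], "pcp": [3, 4, 3]},
    "empathy":             {"amie": [5, 4, 5], "pcp": [4, 4, 3]},
    "communication":       {"amie": [4, 4, 3], "pcp": [4, 4, 4]},
}

def axes_favoring(ratings, system="amie", baseline="pcp"):
    """Return the axes where the system's median rating exceeds the baseline's."""
    return [axis for axis, scores in ratings.items()
            if median(scores[system]) > median(scores[baseline])]

favored = axes_favoring(ratings)
```

With these toy numbers the comparison favors the AI on two of three axes; a real analysis would also report effect sizes and confidence intervals rather than a bare count.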

Implications and Future Directions

While the results are promising, it is crucial to understand that AMIE, despite its sophistication, is not yet ready to replace human clinicians. The AI was evaluated in a controlled study environment using synchronous text chat, which differs significantly from everyday clinical interactions. Deploying AMIE in real-world healthcare settings will require careful further research, particularly into its safety, reliability, and fairness, especially when dealing with diverse populations and multilingual settings.

The potential of AMIE, and AI like it, could alter the landscape of healthcare, especially where access to quality medical advice is limited. It could support doctors by providing diagnostic suggestions, allowing healthcare providers to focus their skills where they are most needed. However, the path forward must be trodden with cautious optimism, ensuring that any implementation is underpinned by rigorous testing and an ethical framework that maximizes patient care without sacrificing human touch and professional insight.

Authors (25)
  1. Tao Tu (45 papers)
  2. Anil Palepu (12 papers)
  3. Mike Schaekermann (20 papers)
  4. Khaled Saab (15 papers)
  5. Jan Freyberg (14 papers)
  6. Ryutaro Tanno (36 papers)
  7. Amy Wang (6 papers)
  8. Brenna Li (2 papers)
  9. Mohamed Amin (4 papers)
  10. Nenad Tomasev (30 papers)
  11. Shekoofeh Azizi (23 papers)
  12. Karan Singhal (26 papers)
  13. Yong Cheng (58 papers)
  14. Le Hou (36 papers)
  15. Albert Webson (19 papers)
  16. Kavita Kulkarni (7 papers)
  17. S Sara Mahdavi (45 papers)
  18. Christopher Semturs (12 papers)
  19. Juraj Gottweis (10 papers)
  20. Joelle Barral (16 papers)