
Polaris: A Safety-focused LLM Constellation Architecture for Healthcare (2403.13313v1)

Published 20 Mar 2024 in cs.AI and cs.CL

Abstract: We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM works in healthcare focusing on tasks like question answering, our work specifically focuses on long multi-turn voice conversations. Our one-trillion parameter constellation system is composed of several multibillion parameter LLMs as co-operative agents: a stateful primary agent that focuses on driving an engaging conversation and several specialist support agents focused on healthcare tasks performed by nurses to increase safety and reduce hallucinations. We develop a sophisticated training protocol for iterative co-training of the agents that optimize for diverse objectives. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents. We align our models to speak like medical professionals, using organic healthcare conversations and simulated ones between patient actors and experienced nurses. This allows our system to express unique capabilities such as rapport building, trust building, empathy and bedside manner. Finally, we present the first comprehensive clinician evaluation of an LLM system for healthcare. We recruited over 1100 U.S. licensed nurses and over 130 U.S. licensed physicians to perform end-to-end conversational evaluations of our system by posing as patients and rating the system on several measures. We demonstrate Polaris performs on par with human nurses on aggregate across dimensions such as medical safety, clinical readiness, conversational quality, and bedside manner. Additionally, we conduct a challenging task-based evaluation of the individual specialist support agents, where we demonstrate our LLM agents significantly outperform a much larger general-purpose LLM (GPT-4) as well as a model from their own medium-size class (LLaMA-2 70B).

Polaris: Advancing Healthcare Conversations with a Safety-focused LLM Constellation

Overview of Polaris

Polaris applies LLMs to real-time patient-AI healthcare conversations through a constellation architecture. Unlike prior healthcare LLM work, which focused on tasks such as question answering, Polaris targets long multi-turn voice conversations, aiming to combine engaging, patient-friendly dialogue with medically accurate and safety-conscious interactions.

The system is a one-trillion-parameter constellation of several multibillion-parameter LLMs acting as cooperative agents: a stateful primary agent and multiple specialist support agents. The primary agent drives the conversation, while the specialists handle healthcare-specific tasks normally performed by nurses, such as medication adherence and lab result interpretation, to increase safety and reduce hallucinations.
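The split between a stateful primary agent and specialist support agents can be pictured with a minimal sketch. This is not the paper's implementation: the class, function names, and keyword triggers below are all hypothetical, and simple string checks stand in for what would be separate LLMs in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical type for illustration; none of these names come from the paper.
Specialist = Callable[[str], Optional[str]]


def medication_specialist(utterance: str) -> Optional[str]:
    """Toy stand-in: fires only when the patient mentions a dose or pill."""
    text = utterance.lower()
    if "dose" in text or "pill" in text:
        return "verify dosage against the prescribed regimen"
    return None


def labs_specialist(utterance: str) -> Optional[str]:
    """Toy stand-in: fires only when the patient mentions an A1C value."""
    if "a1c" in utterance.lower():
        return "compare the reported value with the patient's prior labs"
    return None


@dataclass
class PrimaryAgent:
    specialists: dict[str, Specialist]
    history: list[str] = field(default_factory=list)  # state kept across turns

    def respond(self, utterance: str) -> str:
        self.history.append(utterance)
        # Ask every specialist; keep only the ones that have something to say.
        notes = []
        for name, specialist in self.specialists.items():
            note = specialist(utterance)
            if note is not None:
                notes.append(f"{name}: {note}")
        if not notes:
            return "primary agent replies without specialist input"
        return "primary agent replies using [" + "; ".join(notes) + "]"


agent = PrimaryAgent({"medication": medication_specialist, "labs": labs_specialist})
print(agent.respond("I skipped my evening dose yesterday"))
```

In this sketch the primary agent owns the conversation history and merges whatever guidance the specialists return, which mirrors the described division of roles only at a very coarse level.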

The agents are trained with an iterative co-training protocol that optimizes for diverse objectives. Training data includes proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents. The models are then aligned to speak like medical professionals using organic healthcare conversations and simulated conversations between patient actors and experienced nurses.

Specialist Support Agents

Critical to Polaris's success are its specialist support agents, each designed for specific healthcare functions:

  • Privacy & Compliance Specialist: Ensures identity verification before discussing any Protected Health Information (PHI), addressing privacy and compliance concerns.
  • Checklist Specialist: Manages and navigates through complex care protocols to ensure all necessary topics are covered during a conversation.
  • Medication Specialist: Offers detailed support on medication adherence, contraindications, and dosage verification, all of which are crucial for patient safety.
  • Labs & Vitals Specialist: Interprets lab results and vital signs within the context of the patient's health record, providing insight into changes and trends.
  • Nutrition Specialist: Gives tailored dietary advice based on the patient's health status and nutritional needs, particularly relevant for conditions like congestive heart failure (CHF) and chronic kidney disease (CKD).
  • Policy Specialist: Answers queries related to hospital and payor policies, utilizing a Retrieval-Augmented Generation (RAG) approach for up-to-date information.

This division of labor allows Polaris to allocate computational resources efficiently, reduce the primary agent's load, and ensure specialist tasks are performed with greater accuracy and safety.
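To make the Policy Specialist's Retrieval-Augmented Generation step concrete, here is a toy sketch of the retrieve-then-prompt pattern. It is an illustration under stated assumptions, not the paper's pipeline: the policy snippets are invented, the function names are hypothetical, and plain word overlap stands in for the embedding-based similarity a real RAG system would typically use.

```python
import re

# Hypothetical policy snippets, invented for illustration only.
POLICY_DOCS = [
    "Visiting hours are 9am to 8pm on weekdays.",
    "Prior authorization is required for elective imaging.",
    "Patients may request medical records through the patient portal.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank docs by word overlap with the query (a stand-in for embedding similarity)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend the retrieved context to the question before it goes to a model."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("What are the visiting hours?", POLICY_DOCS))
```

Because the answer is grounded in whatever documents are retrieved at query time, updating the policy store changes the specialist's answers without retraining, which is the "up-to-date information" benefit the list above attributes to RAG.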

Evaluation

Polaris was evaluated both end to end against human nurses and, at the level of individual agents, against general-purpose LLMs such as GPT-4. Over 1100 U.S. licensed nurses and more than 130 U.S. licensed physicians posed as patients in end-to-end conversational assessments and rated the system on several measures. On aggregate, Polaris performed on par with human nurses across dimensions including medical safety, clinical readiness, patient education quality, conversational quality, and bedside manner.

In a separate task-based evaluation, the individual specialist support agents significantly outperformed both the much larger GPT-4 and LLaMA-2 70B, a model from their own medium-size class, on tasks such as medication adherence, lab result interpretation, and dietary recommendation. This supports the case for Polaris's focused architecture and training approach in healthcare conversations.

Future Directions

The development team plans to add multi-call relationships for personalized care, improve support agent activation and communication, and integrate multimodal modeling to enrich conversation dynamics. The team also plans to explore asynchronous operation of the support agents to further reduce response latency while maintaining conversational fluency and safety.

Conclusion

Polaris introduces a constellation architecture for LLMs in healthcare that focuses on safety, accuracy, and the patient experience in real-time conversations. By assigning distinct responsibilities to specialized agents, it aims to deliver clinical improvements and operational efficiency. The clinician evaluation suggests the approach can match human nurses on aggregate, and continued development and testing are intended to address the remaining challenges of healthcare delivery.

Authors (26)
  1. Subhabrata Mukherjee (59 papers)
  2. Paul Gamble (7 papers)
  3. Markel Sanz Ausin (4 papers)
  4. Neel Kant (9 papers)
  5. Kriti Aggarwal (9 papers)
  6. Neha Manjunath (1 paper)
  7. Debajyoti Datta (12 papers)
  8. Zhengliang Liu (91 papers)
  9. Jiayuan Ding (14 papers)
  10. Sophia Busacca (1 paper)
  11. Cezanne Bianco (1 paper)
  12. Swapnil Sharma (3 papers)
  13. Rae Lasko (1 paper)
  14. Michelle Voisard (1 paper)
  15. Sanchay Harneja (1 paper)
  16. Darya Filippova (2 papers)
  17. Gerry Meixiong (1 paper)
  18. Kevin Cha (1 paper)
  19. Amir Youssefi (11 papers)
  20. Meyhaa Buvanesh (1 paper)