Polaris: A Safety-focused LLM Constellation Architecture for Healthcare (2403.13313v1)
Abstract: We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM work in healthcare, which focuses on tasks such as question answering, our work specifically targets long, multi-turn voice conversations. Our one-trillion-parameter constellation system is composed of several multibillion-parameter LLMs acting as cooperative agents: a stateful primary agent that drives an engaging conversation and several specialist support agents focused on healthcare tasks performed by nurses, which increase safety and reduce hallucinations. We develop a sophisticated training protocol for iterative co-training of the agents, which optimize for diverse objectives. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents. We align our models to speak like medical professionals, using organic healthcare conversations and simulated conversations between patient actors and experienced nurses. This allows our system to express unique capabilities such as rapport building, trust building, empathy, and bedside manner. Finally, we present the first comprehensive clinician evaluation of an LLM system for healthcare. We recruited over 1,100 U.S.-licensed nurses and over 130 U.S.-licensed physicians to perform end-to-end conversational evaluations of our system by posing as patients and rating the system on several measures. We demonstrate that Polaris performs on par with human nurses on aggregate across dimensions such as medical safety, clinical readiness, conversational quality, and bedside manner. Additionally, we conduct a challenging task-based evaluation of the individual specialist support agents, in which our LLM agents significantly outperform a much larger general-purpose LLM (GPT-4) as well as a leading model from their own medium-size class (LLaMA-2 70B).
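The paper does not publish an implementation, but the constellation idea can be illustrated with a minimal per-turn orchestration sketch: a stateful primary agent drafts the reply, and specialist support agents vet the draft for task-specific safety issues before it is spoken to the patient. All class and function names below are hypothetical placeholders, not the authors' API; in the real system each agent is a fine-tuned multibillion-parameter LLM rather than a stub.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str  # "patient" or "agent"
    text: str

@dataclass
class ConversationState:
    """Dialogue history maintained by the stateful primary agent."""
    turns: list = field(default_factory=list)

class PrimaryAgent:
    """Hypothetical stand-in for the primary agent that drives the conversation."""
    def draft_reply(self, state: ConversationState, patient_utterance: str) -> str:
        # In the real system this would be an LLM call conditioned on the full history.
        return f"(draft reply to: {patient_utterance})"

class SpecialistAgent:
    """Hypothetical specialist support agent (e.g., medication or labs checker)."""
    def __init__(self, name: str):
        self.name = name

    def review(self, state: ConversationState, draft: str) -> tuple[bool, str]:
        # Returns (is_safe, possibly revised draft); a real specialist would be
        # another LLM focused on one nursing task to catch hallucinations.
        return True, draft

def constellation_turn(primary: PrimaryAgent,
                       specialists: list,
                       state: ConversationState,
                       patient_utterance: str) -> str:
    """One conversational turn: the primary drafts, specialists vet, reply is returned."""
    state.turns.append(Turn("patient", patient_utterance))
    draft = primary.draft_reply(state, patient_utterance)
    for specialist in specialists:
        ok, draft = specialist.review(state, draft)
        if not ok:
            # Fail safe: defer rather than risk an unsafe utterance.
            draft = "Let me double-check that with your care team before answering."
            break
    state.turns.append(Turn("agent", draft))
    return draft

if __name__ == "__main__":
    state = ConversationState()
    primary = PrimaryAgent()
    specialists = [SpecialistAgent("medication"), SpecialistAgent("labs")]
    print(constellation_turn(primary, specialists, state,
                             "Can I take my lisinopril with ibuprofen?"))
```

The key design point this sketch captures is the division of labor: conversational quality lives in the primary agent, while safety checks are delegated to narrower specialists that review every turn.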
Authors:
- Subhabrata Mukherjee
- Paul Gamble
- Markel Sanz Ausin
- Neel Kant
- Kriti Aggarwal
- Neha Manjunath
- Debajyoti Datta
- Zhengliang Liu
- Jiayuan Ding
- Sophia Busacca
- Cezanne Bianco
- Swapnil Sharma
- Rae Lasko
- Michelle Voisard
- Sanchay Harneja
- Darya Filippova
- Gerry Meixiong
- Kevin Cha
- Amir Youssefi
- Meyhaa Buvanesh