Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation (2402.13408v1)
Abstract: The copilot framework, which aims to enhance and tailor LLMs for specific complex tasks without requiring fine-tuning, is gaining increasing attention from the community. In this paper, we introduce the construction of a Healthcare Copilot designed for medical consultation. The proposed Healthcare Copilot comprises three main components: 1) the Dialogue component, responsible for effective and safe patient interactions; 2) the Memory component, storing both current conversation data and historical patient information; and 3) the Processing component, summarizing the entire dialogue and generating reports. To evaluate the proposed Healthcare Copilot, we implement an auto-evaluation scheme using ChatGPT for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue. Extensive results demonstrate that the proposed Healthcare Copilot significantly enhances the capabilities of general LLMs for medical consultations in terms of inquiry capability, conversational fluency, response accuracy, and safety. Furthermore, we conduct ablation studies to highlight the contribution of each individual module in the Healthcare Copilot. Code will be made publicly available on GitHub.
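The abstract describes an auto-evaluation scheme in which ChatGPT plays two roles: a virtual patient that converses with the copilot, and an evaluator that scores the resulting dialogue. Below is a minimal sketch of that loop, assuming a generic chat-completion callable; the prompts, function names, turn structure, and scoring format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the two-role auto-evaluation loop: one LLM instance
# role-plays the patient, the copilot responds, and a second LLM instance
# scores the finished consultation on the four dimensions named in the abstract.
from typing import Callable, Dict, List

Message = Dict[str, str]               # {"role": ..., "content": ...}
LLM = Callable[[List[Message]], str]   # any chat-completion backend (assumed interface)

COPILOT_SYSTEM = (
    "You are a medical consultation copilot. Ask focused follow-up questions, "
    "then give cautious advice and recommend seeing a clinician when appropriate."
)
PATIENT_SYSTEM = (
    "You are a virtual patient with the following case. Answer the doctor's "
    "questions truthfully and briefly; do not volunteer unasked-for details.\nCase: "
)
EVALUATOR_SYSTEM = (
    "You are a medical dialogue evaluator. Score the doctor's side of this "
    "consultation from 1 to 10 on inquiry capability, conversational fluency, "
    "response accuracy, and safety. Return one line per dimension with a reason."
)


def run_consultation(copilot: LLM, patient: LLM, case: str, turns: int = 6) -> List[Message]:
    """Simulate a multi-turn consultation between the copilot and a virtual patient."""
    copilot_ctx: List[Message] = [{"role": "system", "content": COPILOT_SYSTEM}]
    patient_ctx: List[Message] = [{"role": "system", "content": PATIENT_SYSTEM + case}]
    transcript: List[Message] = []
    for _ in range(turns):
        doctor_turn = copilot(copilot_ctx)       # copilot asks a question or gives advice
        patient_ctx.append({"role": "user", "content": doctor_turn})
        patient_turn = patient(patient_ctx)      # virtual patient replies from its case
        patient_ctx.append({"role": "assistant", "content": patient_turn})
        copilot_ctx.append({"role": "assistant", "content": doctor_turn})
        copilot_ctx.append({"role": "user", "content": patient_turn})
        transcript += [{"role": "doctor", "content": doctor_turn},
                       {"role": "patient", "content": patient_turn}]
    return transcript


def evaluate_dialogue(evaluator: LLM, transcript: List[Message]) -> str:
    """Ask the evaluator LLM to rate the completed consultation."""
    text = "\n".join(f'{m["role"]}: {m["content"]}' for m in transcript)
    return evaluator([
        {"role": "system", "content": EVALUATOR_SYSTEM},
        {"role": "user", "content": text},
    ])
```

Any chat backend can be plugged in for the three roles by wrapping it to match the `LLM` signature, e.g. a ChatGPT client for both the virtual patient and the evaluator, and the copilot under test for the doctor side.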
Authors: Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Dacheng Tao