ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences (2311.06025v3)

Published 10 Nov 2023 in cs.CL

Abstract: Recently, the increasing demand for superior medical services has highlighted discrepancies in medical infrastructure. With big data, especially texts, forming the foundation of medical services, there is an exigent need for effective NLP solutions tailored to the healthcare domain. Conventional approaches leveraging pre-trained models have shown promising results in this domain, and current LLMs offer an advanced foundation for medical text processing. However, most medical LLMs are trained only with supervised fine-tuning (SFT), which, although it efficiently empowers LLMs to understand and respond to medical instructions, is ineffective at learning domain knowledge and aligning with human preferences. In this work, we propose ChiMed-GPT, a new benchmark LLM designed explicitly for the Chinese medical domain, which undergoes a comprehensive training regime with pre-training, SFT, and RLHF. Evaluations on tasks including information extraction, question answering, and dialogue generation demonstrate ChiMed-GPT's superior performance over general-domain LLMs. Furthermore, we analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients, so as to contribute to the further responsible development of LLMs in the medical domain. The code and model are released at https://github.com/synlp/ChiMed-GPT.

ChiMed-GPT: A Transformer for Chinese Medical Text Processing

The paper presents ChiMed-GPT, an LLM designed specifically for the Chinese medical domain, trained with a comprehensive regime to improve its performance and alignment with human preferences. It highlights the need to address existing limitations of LLMs in medical text processing, particularly context length and domain-knowledge acquisition, through a training protocol that combines pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).

At the core of ChiMed-GPT is Ziya-13B-v2, a Transformer-based model that processes contexts of up to 4,096 tokens, a significant improvement over the 2,048-token limit found in many LLMs. The paper makes the critical observation that restricted context length is a barrier to effective NLP in the medical domain, where long, detailed texts are integral.
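To make the context-length point concrete, the sketch below shows the workaround a shorter window forces: long clinical notes must be split into overlapping chunks, which risks severing entities and dialogue turns at chunk boundaries, whereas a 4,096-token window halves how often this is necessary. This is an illustrative sketch only; the token ids stand in for the output of the model's tokenizer, and the overlap size is an assumed parameter rather than anything specified in the paper.

```python
# Illustrative only: split a long medical record into windows that fit a fixed
# context length, with overlap so that entities straddling a boundary are not
# lost. `token_ids` stands in for the output of the model's tokenizer.
from typing import List

SHORT_CONTEXT = 2048   # the window many LLMs are limited to
LONG_CONTEXT = 4096    # ChiMed-GPT's reported window
OVERLAP = 256          # assumed overlap; not a value from the paper

def chunk_token_ids(token_ids: List[int], max_len: int,
                    overlap: int = OVERLAP) -> List[List[int]]:
    """Split token ids into overlapping windows no longer than max_len."""
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks, start, step = [], 0, max_len - overlap
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        start += step
    return chunks

# A 10,000-token discharge summary: 6 chunks at 2,048 tokens vs. 3 at 4,096.
record = list(range(10_000))
print(len(chunk_token_ids(record, SHORT_CONTEXT)))  # 6
print(len(chunk_token_ids(record, LONG_CONTEXT)))   # 3
```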

The training regime begins with continued pre-training on the Chinese Medical Dataset (CMD), ensuring thorough exposure to domain-specific knowledge. SFT is then carried out on a variety of curated datasets comprising medical dialogues and question-answer pairs. Notably, the SFT stage incorporates safety prompts to mitigate potentially harmful content generation.
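As a rough illustration of what an SFT example with a safety prompt might look like, the snippet below prepends a safety instruction to a medical question before pairing it with the supervised response. The template text and field names are assumptions made for exposition; the paper's actual prompts and data format are not reproduced here.

```python
# Rough sketch of building one SFT example with a prepended safety prompt.
# The template text and field names are assumptions for exposition; the
# paper's actual prompts and data format may differ.
SAFETY_PREFIX = (
    "You are a careful medical assistant. Refuse to provide harmful or "
    "unverified medical advice and recommend consulting a physician when "
    "appropriate."
)

def build_sft_example(instruction: str, response: str) -> dict:
    """Pair a safety-prefixed prompt with its supervised target response."""
    prompt = f"{SAFETY_PREFIX}\n\nInstruction: {instruction}\nResponse:"
    return {"prompt": prompt, "completion": " " + response}

example = build_sft_example(
    instruction="What are common side effects of metformin?",
    response="Common side effects include nausea, diarrhea, and abdominal "
             "discomfort; severe reactions warrant medical attention.",
)
print(example["prompt"])
```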

ChiMed-GPT is further aligned with human preferences through RLHF, implemented via rejection-sampling fine-tuning. This approach enhances the model's ability to generate contextually appropriate and helpful responses in medical interactions.
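The sketch below gives the textbook form of rejection-sampling fine-tuning: sample several candidate responses per prompt, score them with a reward model, and keep only the best one as a new supervised target. The `generate` and `reward_score` callables are placeholders, not the paper's actual models or pipeline.

```python
# Textbook form of rejection-sampling fine-tuning: draw k candidate responses
# per prompt, score them with a reward model, and keep only the best one as a
# new supervised target. `generate` and `reward_score` are placeholders for
# the actual policy and reward models.
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     generate: Callable[[str], str],
                     reward_score: Callable[[str, str], float],
                     k: int = 8) -> List[Tuple[str, str]]:
    """Return (prompt, best_response) pairs for a further round of SFT."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda r: reward_score(prompt, r))
        selected.append((prompt, best))
    return selected

# The selected pairs are then used like ordinary SFT data: the policy is
# fine-tuned on them with the standard next-token cross-entropy loss.
```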

Evaluation and Results

The model’s efficacy is systematically evaluated across critical tasks:

  • Information Extraction: Tested on named entity recognition, ChiMed-GPT achieves superior F1 scores, outperforming general and medical domain baselines.
  • Question Answering: On open-ended and multi-choice QA datasets, including ChiMed, C-Eval, CMMLU, and MedQA, ChiMed-GPT demonstrates heightened accuracy and response quality, particularly in real-world medical scenarios.
  • Dialogue Generation: Its performance in multi-turn dialogue generation, assessed with metrics such as BLEU and ROUGE, is notably strong, suggesting practical applicability in patient-doctor interactions (a brief sketch of the evaluation metrics follows this list).
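For reference, the snippet below spells out the two headline metric families in their standard textbook form: entity-level F1 for information extraction and a unigram (ROUGE-1 style) overlap score for generated dialogue. These are generic formulations, not the paper's exact evaluation scripts.

```python
# Standard formulations of the headline metrics: entity-level F1 for the NER
# task and a unigram (ROUGE-1 style) overlap F-score for generated dialogue.
# These are generic definitions, not the paper's exact evaluation scripts.
from collections import Counter

def entity_f1(predicted: set, gold: set) -> float:
    """F1 over predicted vs. gold entity spans, e.g. (label, start, end) tuples."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def rouge1_f(candidate: list, reference: list) -> float:
    """Unigram-overlap F-score between tokenized candidate and reference."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate), overlap / len(reference)
    return 2 * p * r / (p + r)

print(entity_f1({("drug", 3, 5)}, {("drug", 3, 5), ("symptom", 10, 12)}))  # ~0.667
print(rouge1_f("患者 需要 休息".split(), "该 患者 需要 多 休息".split()))   # 0.75
```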

Bias Analysis

The paper also examines the model's potential biases using attitude scales such as CAMI (Community Attitudes toward the Mentally Ill) and MICA (Mental Illness: Clinicians' Attitudes). ChiMed-GPT's relatively low bias scores underscore its adherence to responsible content generation, an essential property for technology deployed in sensitive domains like healthcare.
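One way such scales can be administered to an LLM is to present each item as a forced-choice Likert prompt and map the answer back to a numeric score. The sketch below is hypothetical: the item wording, response scale, and `ask_model` callable are placeholders, not the instruments' actual items or the paper's pipeline.

```python
# Hypothetical sketch of administering a Likert-style attitude item to a model
# and mapping the answer to a numeric score, in the spirit of the CAMI/MICA
# evaluations. The item wording, response scale, and `ask_model` callable are
# placeholders, not the instruments' actual items or the paper's pipeline.
from typing import Callable

LIKERT = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

def score_item(statement: str, ask_model: Callable[[str], str]) -> int:
    """Ask the model to rate one scale item and return its Likert score."""
    prompt = (
        f"Statement: {statement}\n"
        "Respond with exactly one of: strongly disagree, disagree, neutral, "
        "agree, strongly agree."
    )
    answer = ask_model(prompt).strip().lower()
    return LIKERT.get(answer, 3)  # fall back to neutral if unparsable

# Averaging item scores (with reverse-coded items flipped) yields the scale
# score that is then compared against baseline LLMs.
```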

Implications and Future Directions

Practically, ChiMed-GPT represents a significant advance for automated medical support systems, with the potential to improve healthcare accessibility and efficiency. Theoretically, by making the case for carefully trained domain-specific LLMs, the paper points toward further work on longer context processing and finer-grained alignment, offering direction for future LLM development.

In conclusion, the paper offers a sound exploration of how to accommodate domain-specific needs with LLMs while mitigating engineering barriers such as context constraints and misalignment with end-user expectations. ChiMed-GPT serves as a benchmark for navigating the complexities of NLP applications in healthcare, pairing large-scale data processing with the intricacies of human interaction in Chinese medical contexts.

References (51)
  1. Language Models are Few-shot Learners. Advances in neural information processing systems, 33:1877–1901.
  2. On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations. arXiv preprint arXiv:2203.13928.
  3. Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1504–1532, Toronto, Canada.
  4. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  5. ChatLaw: Open-source Legal Large Language Model with Integrated External Knowledge Bases. arXiv preprint arXiv:2306.16092.
  6. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  7. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  8. ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4729–4740.
  9. From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11737–11762, Toronto, Canada.
  10. Mental Illness: Clinicians’ Attitudes (MICA) Scale—Psychometric Properties of a Version for Healthcare Students and Professionals. Psychiatry Research, 206(1):81–87.
  11. Ziya2: Data-centric Learning is All LLMs Need. arXiv preprint arXiv:2311.03301.
  12. OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs. arXiv preprint arXiv:2309.03876.
  13. MedAlpaca–An Open-Source Collection of Medical Conversational AI Models and Training Data. arXiv preprint arXiv:2304.08247.
  14. Overview of the CCKS 2019 Knowledge Graph Evaluation Track: Entity, Relation, Event and QA. arXiv preprint arXiv:2003.03875.
  15. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
  16. C-Eval: A Multi-level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv preprint arXiv:2305.08322.
  17. What Disease does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421.
  18. BART: Denoising Sequence-to-sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461.
  19. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online.
  20. CMMLU: Measuring Massive Multitask Language Understanding in Chinese. arXiv preprint arXiv:2306.09212.
  21. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus, 15(6).
  22. Ilya Loshchilov and Frank Hutter. 2017. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
  23. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Briefings in Bioinformatics, 23(6):bbac409.
  24. James Manyika. 2023. An Overview of BARD: an Early Experiment with Generative AI. Technical report, Technical report, Google AI.
  25. Mixed Precision Training. arXiv preprint arXiv:1710.03740.
  26. OpenAI. 2023. GPT-4 Technical Report. ArXiv, abs/2303.08774.
  27. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  28. Language Models are Unsupervised Multitask Learners. OpenAI blog, 1(8):9.
  29. Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  30. ZeRO: Memory Optimizations toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  31. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.
  32. Megatron-LM: Training Multi-billion Parameter Language Models using Model Parallelism. arXiv preprint arXiv:1909.08053.
  33. Towards Expert-level Medical Question Answering with Large Language Models. arXiv preprint arXiv:2305.09617.
  34. Summarizing Medical Conversations via Identifying Important Utterances. In Proceedings of the 28th International Conference on Computational Linguistics, pages 717–729.
  35. ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders. arXiv preprint arXiv:2105.01279.
  36. Safety Assessment of Chinese Large Language Models. arXiv preprint arXiv:2304.10436.
  37. Stanford Alpaca: An Instruction-following LLaMA Model. GitHub repository.
  38. S Martin Taylor and Michael J Dear. 1981. Scaling Community Attitudes toward the Mentally Ill. Schizophrenia bulletin, 7(2):225–240.
  39. ChiMed: A Chinese Medical Corpus for Question Answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 250–260, Florence, Italy.
  40. ChiMST: A Chinese Medical Corpus for Word Segmentation and Medical Term Recognition. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5654–5664, Marseille, France.
  41. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  42. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
  43. Attention is All You Need. Advances in neural information processing systems, 30.
  44. BioMedLM: a Domain-specific Large Language Model for Biomedical Text. MosaicML. Accessed: Dec, 23(3):2.
  45. HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. arXiv preprint arXiv:2304.06975.
  46. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564.
  47. Baize: An Open-source Chat Model with Parameter-efficient Tuning on Self-chat Data. arXiv preprint arXiv:2304.01196.
  48. CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. arXiv preprint arXiv:2307.09705.
  49. Ming Xu. 2023. MedicalGPT: Training Medical GPT Model.
  50. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763.
  51. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence. CoRR, abs/2209.02970.
Authors (5)
  1. Yuanhe Tian (15 papers)
  2. Ruyi Gan (14 papers)
  3. Yan Song (91 papers)
  4. Jiaxing Zhang (39 papers)
  5. Yongdong Zhang (119 papers)
Citations (21)