
HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge (2304.06975v1)

Published 14 Apr 2023 in cs.CL

Abstract: LLMs, such as the LLaMA model, have demonstrated their effectiveness in various general-domain NLP tasks. Nevertheless, LLMs have not yet performed optimally in biomedical domain tasks due to the need for medical expertise in the responses. In response to this challenge, we propose HuaTuo, a LLaMA-based model that has been supervised-fine-tuned with generated QA (Question-Answer) instances. The experimental results demonstrate that HuaTuo generates responses that possess more reliable medical knowledge. Our proposed HuaTuo model is accessible at https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese.

This paper introduces HuaTuo, a LLM specifically fine-tuned for the biomedical domain using Chinese medical knowledge. The core motivation is that general-domain LLMs like LLaMA, while powerful, lack the necessary expertise for specialized fields like medicine and often perform poorly in languages other than English. Accurate, domain-specific knowledge is critical in medicine, and errors can be dangerous.

HuaTuo is built upon the LLaMA-7B base model. To inject medical knowledge, the authors utilize the Chinese Medical Knowledge Graph (CMeKG), which contains both structured information (like disease-drug relationships) and unstructured text (like medical guidelines).

The key implementation step is supervised fine-tuning using a curated dataset of knowledge-based instruction data. Instead of focusing on diverse instructions for various general tasks, the authors prioritize factual correctness for medical questions. They generated over 8,000 question-answer instances by sampling knowledge from CMeKG and using the OpenAI API to formulate questions and potentially refine answers based on that knowledge. Unlike instruction-following models that use explicit instructions, HuaTuo's training instances are simply question-answer pairs, designed for a dialogue context where the input is a question and the output is the answer.
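A minimal sketch of this knowledge-to-QA generation step is below; the triple format, prompt wording, model choice, and output parsing are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: turn one structured CMeKG-style fact into a QA training pair
# via the OpenAI API. Prompt and parsing are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def triple_to_qa(head: str, relation: str, tail: str) -> dict:
    """Generate a question-answer pair grounded in a single medical fact."""
    prompt = (
        f"Medical fact: {head} {relation} {tail}.\n"
        "Write one natural patient question in Chinese that this fact answers, "
        "then an answer grounded only in the fact.\n"
        "Format: Q: ... A: ..."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    question, _, answer = resp.choices[0].message.content.partition("A:")
    return {"instruction": question.replace("Q:", "").strip(),
            "output": answer.strip()}

# Example fact sampled from a knowledge graph (liver cancer -- common drug -- sorafenib):
pair = triple_to_qa("肝癌", "常用药物", "索拉非尼")
```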

For evaluation, the authors propose a new metric called SUS (Safety, Usability, Smoothness), tailored to medical question answering. SUS assesses responses along three dimensions:

  • Safety: Does the response contain potentially misleading or dangerous information (e.g., incorrect medication advice)? Scored 1 (not acceptable) to 3 (good).
  • Usability: Does the response reflect accurate and relevant medical expertise? Scored 1 to 3.
  • Smoothness: Is the response grammatically correct and fluent as an LLM output? Scored 1 to 3.
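Since SUS is a human rating scale rather than an automatic metric, reporting it reduces to averaging annotator scores per dimension. A minimal sketch, assuming each annotator gives one 1-3 score per dimension per response:

```python
# Hedged sketch of SUS aggregation; the rating layout (one dict of 1-3 scores
# per annotator per response) is an assumption, not the paper's exact protocol.
from statistics import mean

ratings = [
    {"safety": 3, "usability": 2, "smoothness": 3},
    {"safety": 2, "usability": 3, "smoothness": 3},
]

sus = {dim: mean(r[dim] for r in ratings)
       for dim in ("safety", "usability", "smoothness")}
print(sus)  # e.g. {'safety': 2.5, 'usability': 2.5, 'smoothness': 3.0}
```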

HuaTuo was compared against the original LLaMA, Alpaca (an instruction-tuned LLaMA), and ChatGLM (a Chinese-optimized LLM). Evaluated by annotators with medical backgrounds, HuaTuo scored higher on Usability and Smoothness than the other tuned baselines (Alpaca and ChatGLM), indicating improved medical expertise and Chinese generation quality. While the original LLaMA had the highest Safety score, its responses were often uninformative or merely rephrased the question, yielding very low Usability. HuaTuo kept its Safety score close to the base LLaMA's while substantially raising Usability over the baselines.

Practical Implementation & Applications:

Implementing HuaTuo or similar specialized medical LLMs involves several steps:

  1. Model Selection: Choose a suitable base LLM (LLaMA-7B in this case, or potentially larger/different models depending on computational resources and desired performance).
  2. Knowledge Acquisition: Obtain a relevant, reliable medical knowledge source. This could be a knowledge graph, medical guidelines, textbooks, or curated datasets. The quality and breadth of this knowledge are paramount for the fine-tuned model's performance.
  3. Data Generation: Create high-quality instruction-following data (question-answer pairs for dialogue) based on the acquired knowledge. This is a crucial step and can be done through manual annotation, semi-automatic generation with an LLM API (as in the sketch above), or programmatic generation from structured data such as knowledge graphs. Ensuring factual accuracy is critical and may require expert review.
  4. Fine-tuning: Perform supervised fine-tuning of the base LLM using the generated dataset. Techniques like LoRA (Low-Rank Adaptation) can make fine-tuning more computationally efficient, allowing adaptation with far fewer resources than full fine-tuning (see the sketch after this list).
  5. Evaluation: Rigorously evaluate the fine-tuned model using domain-specific metrics like SUS, potentially involving medical professionals to assess safety and accuracy.
  6. Deployment: Deploy the fine-tuned model. This requires infrastructure capable of running the LLM for inference, which can still be substantial even for a 7B parameter model, though methods like quantization can reduce memory requirements.
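As referenced in step 4, the following is a hedged sketch of LoRA fine-tuning a LLaMA-7B checkpoint on the generated QA pairs using Hugging Face transformers and peft; the checkpoint path, data path, prompt template, and hyperparameters are placeholders, not values reported in the paper.

```python
# Hedged sketch of step 4: LoRA fine-tuning on generated question-answer pairs.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "path/to/llama-7b"  # any LLaMA-7B checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)  # 8-/4-bit loading can cut memory further

# Freeze the base weights and train only low-rank adapters on the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def format_example(ex):
    # One dialogue turn: the question is the input, the answer is the target.
    text = f"问：{ex['instruction']}\n答：{ex['output']}"  # "Q: ... A: ..."
    return tokenizer(text, truncation=True, max_length=512)

train_data = (load_dataset("json", data_files="medical_qa.json")["train"]
              .map(format_example, remove_columns=["instruction", "output"]))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="huatuo-lora", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=3e-4, logging_steps=10),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Only the adapter weights are updated here, which is what makes adaptation of a 7B model feasible on a single GPU.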

HuaTuo can be practically applied in:

  • Medical Question Answering Systems: Powering chatbots or virtual assistants that answer user queries about diseases, symptoms, treatments, etc., in Chinese.
  • Clinical Decision Support (Auxiliary): Providing information retrieval or summarizing medical knowledge for healthcare professionals (though careful validation and human oversight are essential).
  • Patient Education: Explaining medical conditions or treatment plans in an accessible way.

Implementation Considerations:

  • Data Quality: The performance and safety of the fine-tuned model heavily rely on the quality and correctness of the instruction data. Errors in the training data will likely propagate to the model's responses.
  • Computational Resources: While LLaMA-7B is relatively small compared to models like GPT-4, fine-tuning and serving it still require significant GPU resources. Techniques like LoRA and quantization can help mitigate this (see the loading sketch after this list).
  • Safety Criticality: Medical advice is highly sensitive. The model must not be presented as a substitute for professional medical consultation. User interfaces should include prominent disclaimers. Continuous monitoring and updating of the model based on new medical knowledge and user feedback are necessary.
  • Scalability: Deploying such a model for a large number of users requires a robust inference infrastructure.
  • Generalization: The model is fine-tuned on specific knowledge. Its performance on topics outside the training distribution or on complex, multi-step reasoning tasks may be limited.
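As noted under Computational Resources, quantization can substantially shrink the memory footprint at inference time. A minimal sketch of loading a hypothetical fine-tuned checkpoint in 4-bit via bitsandbytes through transformers; the checkpoint path and prompt are placeholders:

```python
# Hedged sketch of 4-bit quantized inference for a fine-tuned medical model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

checkpoint = "path/to/huatuo-merged"  # hypothetical fine-tuned, adapter-merged model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.float16),
)

prompt = "问：感冒发烧可以吃哪些药？\n答："  # "Q: What medicines can I take for a cold with fever? A:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```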

The paper emphasizes that HuaTuo is a research initiative and not intended for medical advice, highlighting the ethical considerations and risks associated with using LLMs in healthcare without professional validation. The open-source nature of HuaTuo provides a practical starting point for developers and researchers looking to build or improve Chinese medical LLMs.

Authors (7)
  1. Haochun Wang (17 papers)
  2. Chi Liu (65 papers)
  3. Nuwa Xi (11 papers)
  4. Zewen Qiang (7 papers)
  5. Sendong Zhao (31 papers)
  6. Bing Qin (186 papers)
  7. Ting Liu (329 papers)
Citations (159)