
CMB: A Comprehensive Medical Benchmark in Chinese (2308.08833v2)

Published 17 Aug 2023 in cs.CL and cs.AI

Abstract: LLMs provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in "contextual incongruities" to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience with existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

Overview of "CMB: A Comprehensive Medical Benchmark in Chinese"

The research paper "CMB: A Comprehensive Medical Benchmark in Chinese" presents the development and evaluation of a localized medical benchmark for LLMs in China. Unlike generalized benchmarks that primarily stem from English-based medical standards, this research emphasizes the establishment of a culturally and linguistically relevant dataset, considering the distinctive features present in the Chinese medical ecosystem, such as Traditional Chinese Medicine (TCM). This localized approach aims to provide a reliable framework against which the performance of LLMs, such as ChatGPT and GPT-4, as well as Chinese-specific models, can be assessed.

Background and Motivation

The paper outlines the challenges of leveraging English-based medical evaluation datasets within non-English speaking regions due to linguistic and cultural divergences. Translating such datasets often fails to encompass regional medical nuances, thus impeding accurate assessments of LLM capabilities in specialized domains like medicine. By focusing on the creation of a Chinese-centric benchmark, the CMB initiative seeks to mitigate these issues, ensuring that evaluation accounts for both Western and traditional local medical practices.

Dataset Composition and Structure

The CMB dataset is designed to cover a broad spectrum of medical knowledge and comprises two main components: CMB-Exam and CMB-Clin. CMB-Exam contains 280,839 multiple-choice questions drawn from medical qualification exams, organized into 28 subcategories and 176 subjects and spanning qualification tracks for physicians, nurses, technicians, and pharmacists. Because every question has a single verifiable answer, this component supports objective knowledge evaluation. CMB-Clin, by contrast, consists of 74 complex clinical diagnostic cases that require multi-turn dialogue interactions, simulating how practitioners handle real-world cases.
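To illustrate how multiple-choice items of this kind might be consumed, the sketch below formats a CMB-Exam-style item into a zero-shot prompt. The field names (`question`, `option`, `answer`) and the sample item are assumptions made for illustration, not taken from the released data files.

```python
# Sketch of rendering a CMB-Exam-style multiple-choice item as a prompt.
# The item schema below is an assumption, not the paper's official format.

def format_mcq_prompt(item: dict) -> str:
    """Render a multiple-choice item as a zero-shot prompt string."""
    lines = [item["question"]]
    # Options are stored as a letter -> text mapping; emit them in order.
    for label in sorted(item["option"]):
        lines.append(f"{label}. {item['option'][label]}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

# Hypothetical sample item (English for readability; CMB itself is Chinese).
sample = {
    "question": "Which organ is primarily responsible for filtering blood?",
    "option": {"A": "Liver", "B": "Kidney", "C": "Lung", "D": "Spleen"},
    "answer": "B",
}
prompt = format_mcq_prompt(sample)
```

A harness would send `prompt` to the model under test and compare the returned letter against the item's reference answer.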

Evaluation Protocol

Using CMB, the paper benchmarks several leading LLMs. General-purpose models such as GPT-4, along with various open-source variants, achieved accuracy above 60%, clearing the threshold typically required for medical licensing, although performance varied considerably across specific medical domains. The evaluation also reveals a notable gap between these models' proficiency in traditional Chinese medicine and in Western medicine. Finally, the paper cautions that certain prompting strategies, such as Chain-of-Thought (CoT), can degrade accuracy on knowledge-intensive tasks.
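One way accuracy figures like those above could be computed is sketched below: pull the option letter out of a model's free-form reply and aggregate correctness per subject. The letter-extraction regex and the record format are assumptions for illustration, not the paper's official scoring script.

```python
import re
from collections import defaultdict

def extract_choice(model_output: str):
    """Return the first standalone option letter (A-E) in a reply, or None."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None

def accuracy_by_subject(records):
    """records: iterable of (subject, gold_letter, model_output) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, output in records:
        total[subject] += 1
        if extract_choice(output) == gold:
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy example: one subject answered half right, another fully right.
scores = accuracy_by_subject([
    ("TCM", "A", "The answer is A."),
    ("TCM", "B", "C is correct."),
    ("Physician", "D", "D"),
])
```

Robust answer extraction matters in practice: CoT-style replies bury the final letter inside a long rationale, which is one reason free-form prompting can hurt measured accuracy.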

Results and Implications

The results from benchmarking reveal insights into the performance and limitations of current LLMs within the Chinese medical domain. GPT-4 and other advanced models showed promising results but also highlighted the need for enhancements, particularly in handling culturally specific medical knowledge. These findings imply that while LLMs have significant potential in medicine, careful adaptation and training on region-specific data are imperative.

Future Directions

The research suggests that further advancements should focus on optimizing LLMs for localized contexts, enhancing their diagnostic capabilities, and expanding datasets to include multi-modal elements such as imaging data. Establishing a robust framework for the ethical deployment of these models in real medical settings remains a critical area for future inquiry.

In conclusion, the "CMB: A Comprehensive Medical Benchmark in Chinese" paper provides a significant step forward in the localized assessment of LLMs for medicine, advocating for culturally relevant benchmarks as essential tools for advancing AI applications in global healthcare contexts.

Authors (11)
  1. Xidong Wang (30 papers)
  2. Guiming Hardy Chen (8 papers)
  3. Dingjie Song (17 papers)
  4. Zhiyi Zhang (31 papers)
  5. Zhihong Chen (63 papers)
  6. Qingying Xiao (5 papers)
  7. Feng Jiang (97 papers)
  8. Jianquan Li (18 papers)
  9. Xiang Wan (93 papers)
  10. Benyou Wang (109 papers)
  11. Haizhou Li (285 papers)
Citations (55)