
Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

Published 23 Apr 2025 in cs.CL and cs.AI | (2504.16601v1)

Abstract: This study evaluates how well LLMs and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.

Summary

  • The paper compares LLMs (GPT-4o, LLAMA-3.1, GEMMA-2) and traditional MT (Google, Bing, DeepL) for translating simple and complex medical summaries into Arabic, Chinese, and Vietnamese, evaluated by metrics like BLEU and METEOR.
  • Traditional MT tools generally perform better on surface metrics for complex summaries, while LLMs show potential in capturing semantic meaning for simple summaries, highlighting varied performance across languages and text complexities.
  • The study suggests that future domain-specific fine-tuning, new safety metrics, and human oversight are needed to improve LLM accuracy and reliability for critical medical translations.

Translation of Medical Consultation Summaries Using LLMs and Traditional MT Tools

The paper is a pilot study comparing LLMs and traditional Machine Translation (MT) tools on medical consultation summaries, translating from English into Arabic, Chinese (simplified written form), and Vietnamese, which are among the most common languages spoken in Australia other than English. The study evaluates the effectiveness of both translation approaches for healthcare-related content, considering the particular challenges posed by medical terminology and differing text complexities.

Methodology

The authors develop two types of simulated consultation summaries: a straightforward summary meant to be understood by patients in lay language and a more complex, clinician-focused letter that incorporates medical jargon commonly used among healthcare professionals. Three LLMs—GPT-4o, LLAMA-3.1, and GEMMA-2—and three MT systems—Google Translate, Microsoft Bing Translator, and DeepL—are tasked with translating these summaries. The translations are then assessed using BLEU, CHR-F, and METEOR metrics based on reference translations by professional interpreters.
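The paper reports scores from standard automated metrics rather than a bespoke implementation. As a rough illustration of how one of these metrics works, the sketch below implements a minimal chrF-style character n-gram F-score in pure Python; this is an assumption-laden simplification for intuition, not the authors' evaluation pipeline (production work would use a library such as sacreBLEU).

```python
from collections import Counter

def chrf_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score (chrF-style), averaged over n = 1..max_n.

    beta > 1 weights recall more heavily than precision, as in the
    standard chrF formulation.
    """
    def ngrams(text: str, n: int) -> Counter:
        text = "".join(text.split())  # chrF ignores whitespace
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))

    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Identical strings score 1.0; entirely disjoint strings score 0.0.
print(chrf_score("the patient should rest", "the patient should rest"))  # 1.0
```

Because chrF operates on character n-grams rather than whole tokens, it is comparatively robust for morphologically rich languages such as Arabic, which is one plausible reason the study pairs it with word-level metrics like BLEU and METEOR.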

Results

The study reveals several notable findings:

  1. Language-Specific Performance: Translation effectiveness varies significantly across languages. Vietnamese and Chinese translations of the simple summary score better than Arabic, whereas Arabic scores improve on the complex summary, a result the authors attribute to its rich morphology.
  2. Translation Tool Comparison: Traditional MT tools generally outperform LLMs on surface-level metrics, particularly for complex summaries where fidelity to reference translations is more critical. However, LLMs show improved METEOR scores for Vietnamese and Chinese, suggesting capacity in capturing semantic equivalence beyond literal translation.
  3. Summary Complexity: Simple summaries result in higher translation quality across all languages, indicating that both LLMs and conventional MT tools struggle with complex, technical content.

Implications and Future Developments

The results highlight that, despite their potential, LLMs remain inconsistent across contexts, with variable handling of semantic nuance and morphological intricacies. Traditional MT tools align more closely with reference translations but lack the contextual adaptability that LLMs demonstrate.

The authors suggest that future improvements could involve fine-tuning LLMs with domain-specific medical corpora to enhance their accuracy and reliability in specialized translations. Additionally, novel, safety-aware evaluation metrics might need to be developed for better assessment in healthcare applications. The paper also emphasizes the importance of integrating human-in-the-loop mechanisms to ensure responsible AI usage in medical settings. This approach would help mitigate risks associated with inaccuracies in critical healthcare translations, emphasizing the role of human oversight in clinical contexts.

Conclusion

The study comprehensively evaluates the current capabilities and limitations of LLMs versus traditional MT tools in translating medical consultation summaries. It underscores the importance of language-specific characteristics and highlights the shortcomings of existing evaluation metrics in accurately gauging clinical translation quality. The practical implications point towards continued research and development to harness the benefits of AI translation tools responsibly, complemented by human expertise to safeguard patient safety and ensure equitable healthcare access.
