
Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

Published 23 Apr 2025 in cs.CL and cs.AI | (2504.16601v1)

Abstract: This study evaluates how well LLMs and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.

Summary

  • The paper compares LLMs (GPT-4o, LLAMA-3.1, GEMMA-2) and traditional MT (Google, Bing, DeepL) for translating simple and complex medical summaries into Arabic, Chinese, and Vietnamese, evaluated by metrics like BLEU and METEOR.
  • Traditional MT tools generally perform better on surface metrics for complex summaries, while LLMs show potential in capturing semantic meaning for simple summaries, highlighting varied performance across languages and text complexities.
  • The study suggests that future domain-specific fine-tuning, new safety metrics, and human oversight are needed to improve LLM accuracy and reliability for critical medical translations.

Translation of Medical Consultation Summaries Using LLMs and Traditional MT Tools

The paper is a pilot study comparing LLMs and traditional Machine Translation (MT) tools on medical consultation summaries, translating from English into Arabic, Chinese (simplified written form), and Vietnamese, which are among the most common languages spoken in Australia other than English. The study evaluates the effectiveness of both translation approaches for healthcare-related content, considering the particular challenges posed by medical terminology and differing text complexities.

Methodology

The authors develop two types of simulated consultation summaries: a straightforward summary meant to be understood by patients in lay language and a more complex, clinician-focused letter that incorporates medical jargon commonly used among healthcare professionals. Three LLMs—GPT-4o, LLAMA-3.1, and GEMMA-2—and three MT systems—Google Translate, Microsoft Bing Translator, and DeepL—are tasked with translating these summaries. The translations are then assessed using BLEU, CHR-F, and METEOR metrics based on reference translations by professional interpreters.
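The paper reports scores from standard automated metrics rather than a bespoke implementation. As a rough illustration of how one of these metrics works, the sketch below implements a minimal chrF-style character n-gram F-score in pure Python; this is an assumption-laden simplification for intuition, not the authors' evaluation pipeline (production work would use a library such as sacreBLEU).

```python
from collections import Counter

def chrf_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score (chrF-style), averaged over n = 1..max_n.

    beta > 1 weights recall more heavily than precision, as in the
    standard chrF formulation.
    """
    def ngrams(text: str, n: int) -> Counter:
        text = "".join(text.split())  # chrF ignores whitespace
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))

    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Identical strings score 1.0; entirely disjoint strings score 0.0.
print(chrf_score("the patient should rest", "the patient should rest"))  # 1.0
```

Because chrF operates on character n-grams rather than whole tokens, it is comparatively robust for morphologically rich languages such as Arabic, which is one plausible reason the study pairs it with word-level metrics like BLEU and METEOR.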

Results

The study reveals several notable findings:

  1. Language-Specific Performance: Translation effectiveness varies significantly across languages. Vietnamese and Chinese translations of the simple summary score better than Arabic, whereas Arabic scores improve on the complex summary, a result the authors attribute to its rich morphology.
  2. Translation Tool Comparison: Traditional MT tools generally outperform LLMs on surface-level metrics, particularly for complex summaries where fidelity to reference translations is more critical. However, LLMs show improved METEOR scores for Vietnamese and Chinese, suggesting capacity in capturing semantic equivalence beyond literal translation.
  3. Summary Complexity: Simple summaries result in higher translation quality across all languages, indicating that both LLMs and conventional MT tools struggle with complex, technical content.

Implications and Future Developments

The results highlight that, despite their potential, LLMs remain inconsistent across contexts, with variable handling of semantic nuance and morphological intricacies. Traditional MT tools align more closely with reference translations but lack the contextual adaptability that LLMs demonstrate.

The authors suggest that future improvements could involve fine-tuning LLMs with domain-specific medical corpora to enhance their accuracy and reliability in specialized translations. Additionally, novel, safety-aware evaluation metrics might need to be developed for better assessment in healthcare applications. The paper also emphasizes the importance of integrating human-in-the-loop mechanisms to ensure responsible AI usage in medical settings. This approach would help mitigate risks associated with inaccuracies in critical healthcare translations, emphasizing the role of human oversight in clinical contexts.

Conclusion

The study comprehensively evaluates the current capabilities and limitations of LLMs versus traditional MT tools in translating medical consultation summaries. It underscores the importance of language-specific characteristics and highlights the shortcomings of existing evaluation metrics in accurately gauging clinical translation quality. The practical implications point towards continued research and development to harness the benefits of AI translation tools responsibly, complemented by human expertise to safeguard patient safety and ensure equitable healthcare access.
