Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries (2310.13132v2)

Published 19 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stakes domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XLingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XLingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.


Cross-Lingual Evaluation of LLMs for Healthcare: A Focused Analysis

The paper "Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries" investigates the cross-lingual performance of LLMs in healthcare, a domain where misinformation can have serious consequences. The researchers develop a framework named XLingEval to systematically evaluate LLMs as multilingual dialogue systems in the healthcare context, and introduce XLingHealth as a cross-lingual benchmark. The analysis spans four major languages (English, Spanish, Chinese, and Hindi) and aims to expose disparities between languages and identify areas for improvement.

Key Contributions

Framework and Metrics: The paper proposes the XLingEval framework, which assesses responses on three criteria critical for the safe deployment of LLMs in healthcare: correctness, consistency, and verifiability. This structured approach highlights where current LLMs fall short on health-related inquiries, particularly in non-English languages.
Cross-Lingual Benchmark: XLingHealth is a new multilingual benchmark composed of three datasets: HealthQA, LiveQA, and MedicationQA. It serves as a resource for evaluating the cross-lingual capabilities of LLMs in healthcare, and the evaluation built on it combines algorithmic and human-evaluation strategies.
Evaluation Results: Experimental results show a pronounced disparity in LLM response quality across languages. English generally yielded more comprehensive and consistent responses than the other languages. Notably, Hindi and Chinese exhibited the greatest performance deficits on both correctness and consistency metrics, which points to underlying language biases in the models.
Language Disparity Insights: The study offers critical insights into language disparity, namely the disproportionately high quality of English responses compared with those in non-English languages. This disparity underscores the need to train and evaluate LLMs on substantial, balanced multilingual data.
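
As a rough illustration of how a consistency criterion and a per-language gap could be computed, here is a minimal Python sketch. It assumes consistency is approximated by the mean pairwise token overlap (Jaccard) among repeated answers to the same question; this is a simple stand-in chosen for illustration, not the metric suite used in the paper. The function names (jaccard, consistency, gap_to_english), the language code "en", and the sample strings are all hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two strings.

    Assumption: whitespace tokenization. This is not adequate for
    languages written without spaces (e.g. Chinese), which would need
    a proper tokenizer.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def consistency(answers: list[str]) -> float:
    """Mean pairwise Jaccard overlap across repeated answers to one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def gap_to_english(scores: dict[str, float]) -> dict[str, float]:
    """Score of each language minus the English score (negative = worse)."""
    base = scores["en"]
    return {lang: s - base for lang, s in scores.items() if lang != "en"}


if __name__ == "__main__":
    # Made-up placeholder answers, not taken from the paper or its datasets.
    repeated = ["drink water and rest", "rest and drink water", "see a doctor"]
    print(consistency(repeated))
```

In practice, per-question consistency scores would be averaged per language and then passed to gap_to_english to see how far each language trails English under whichever metric is chosen.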

Implications of Research Findings

The research has implications for both the theory and the practice of AI deployment in healthcare. Theoretically, the paper exposes gaps in the current understanding of LLM behavior across languages, which may stimulate further research into equitable LLM development. Practically, it suggests ways to improve LLM deployment protocols so that non-English-speaking users receive reliable healthcare information, emphasizing the need for more inclusive training datasets and comprehensive multilingual evaluations.

Speculations on Future Developments in AI

Given the pressing need highlighted by this paper for equitable information access in healthcare, future developments in AI might increasingly focus on reducing language disparities. We might see more sophisticated cross-lingual training methodologies or the integration of real-time translation features, potentially bridging current gaps. Additionally, frameworks established in this paper, such as XLingEval, could inspire analogous frameworks in other high-stakes fields, such as finance or law, that face similar challenges.

Through its systematic evaluation framework, comprehensive multilingual benchmark, and empirical findings, this paper equips researchers with foundational insights and tools for advancing the multilingual capabilities of LLMs, ultimately paving the way for more inclusive and equitable AI technologies.