Multilingual Evaluation of LLMs in Healthcare Contexts
This paper undertakes a comprehensive evaluation of 24 LLMs in a multilingual, real-world healthcare scenario. It focuses on responses to questions posed by Indian patients interacting with a medical chatbot, HealthBot, in Indian English and four Indic languages. The research uses a Retrieval Augmented Generation (RAG) framework to generate responses, which are then evaluated by automated techniques and by human assessors on four metrics.
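To make the setup concrete, here is a minimal sketch of what such a RAG pipeline looks like: retrieve the passages most relevant to a patient query, then prompt a model with that context. The toy knowledge base, TF-IDF retriever, and prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal RAG sketch: retrieve relevant passages, then prompt an LLM with them.
# The knowledge base, retriever choice, and prompt wording are illustrative
# assumptions, not the paper's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

KNOWLEDGE_BASE = [
    "Avoid rubbing your eye for at least two weeks after cataract surgery.",
    "Use the prescribed antibiotic drops four times a day for one week.",
    "Mild blurring of vision in the first few days after surgery is normal.",
]

vectorizer = TfidfVectorizer().fit(KNOWLEDGE_BASE)
doc_vectors = vectorizer.transform(KNOWLEDGE_BASE)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base passages most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [KNOWLEDGE_BASE[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt: retrieved context followed by the patient query."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nPatient question: {query}"

# In the paper's setting, this prompt would be sent to each of the 24 models.
print(build_prompt("Can I rub my eye after the operation?"))
```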
Evaluation Framework and Findings
The assessment covers both proprietary and open-weight models, all run through an identical RAG framework so that comparisons are consistent. The core metrics are Factual Correctness, Semantic Similarity, Coherence, and Conciseness. The authors pay close attention to how well models handle multilingual queries, code-mixed inputs, and contextually relevant linguistic nuances that standard translated benchmarks tend to overlook.
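One plausible way to operationalize these four metrics is an LLM-as-judge rubric, as sketched below; the prompt template and the 1-to-5 scale are assumptions for illustration, not the paper's exact evaluation instructions.

```python
# Hypothetical LLM-as-judge rubric for the four reported metrics; the prompt
# template and 1-5 scale are assumptions, not the paper's exact protocol.
METRICS = ["Factual Correctness", "Semantic Similarity", "Coherence", "Conciseness"]

JUDGE_TEMPLATE = """You are evaluating a medical chatbot answer.
Metric: {metric}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer on {metric} from 1 (worst) to 5 (best). Reply with one integer."""

def judge_prompts(reference: str, candidate: str) -> dict[str, str]:
    """Build one judging prompt per metric for a single answer pair."""
    return {m: JUDGE_TEMPLATE.format(metric=m, reference=reference, candidate=candidate)
            for m in METRICS}
```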
Intriguingly, the paper reveals significant performance disparities across models, including between larger multilingual models and smaller, instruction-tuned Indic models. Notably, factual correctness for non-English queries generally lags behind that for English queries, underscoring a gap that must be closed for reliable multilingual performance. This finding emphasizes the need for benchmarks that reflect real-world multilingual settings, especially in healthcare.
Methodological Robustness
The dataset comprises 749 questions curated from HealthBot conversations with Indian patients before and after cataract surgery, providing a rich, contextually grounded testbed. It preserves real-world utterances with misspellings, code-mixing, and regional specifics, making for a more realistic assessment environment than conventional datasets. The curation process also ensures that the dataset is not contaminated by the training data of the evaluated models, preserving the validity of the results.
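A record in such a dataset might look like the following sketch; the field names and the example utterance are hypothetical, chosen only to illustrate how misspellings and code-mixing can be preserved rather than normalized away.

```python
# Illustrative record schema for a code-mixed patient-query dataset.
# Field names and the example utterance are hypothetical.
from dataclasses import dataclass

@dataclass
class PatientQuery:
    query_id: int
    text: str         # raw utterance, misspellings and code-mixing preserved
    language: str     # primary language tag, e.g. "hi", "ta", "en-IN"
    phase: str        # "pre-op" or "post-op" relative to cataract surgery
    code_mixed: bool  # True if the utterance mixes languages or scripts

example = PatientQuery(
    query_id=42,
    text="operation ke baad eye me pani ja sakta hai kya",  # Hindi-English mix
    language="hi",
    phase="post-op",
    code_mixed=True,
)
```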
The paper also employs a dual approach: LLM-based evaluators supplemented by human evaluations, with the human judgments used to validate the LLM assessments and to reinforce the reliability of metrics such as Semantic Similarity.
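A standard way to check such agreement, assuming both raters score on a shared ordinal scale, is rank correlation between LLM-judge and human ratings; the scores below are toy values for illustration, and the paper may well use a different agreement statistic.

```python
# Validating an LLM judge against human raters via Spearman rank correlation.
# The score lists are toy placeholders, not data from the paper.
from scipy.stats import spearmanr

llm_scores   = [5, 4, 2, 5, 3, 1, 4]  # LLM-judge ratings for seven responses
human_scores = [5, 4, 3, 4, 3, 1, 5]  # human ratings for the same responses

rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # high rho => judge tracks humans
```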
Practical Implications and Future Directions
The insights drawn from this research have both practical and theoretical implications. Practically, the evaluation reveals which models are most adept at generating accurate and coherent responses in multilingual medical contexts, offering guidance for developers deploying LLMs in sensitive fields like healthcare. Theoretically, the observed limitations and performance variance highlight areas for development in model training, particularly the need for models that can accurately comprehend and respond to culturally nuanced queries.
For future work, the paper recommends developing specialized benchmarks that robustly test multilingual capabilities across varied linguistic contexts. Such advancements could significantly improve the deployment and reliability of LLMs in multicultural settings worldwide.
Conclusion
This research offers a thorough evaluation of LLMs in multilingual and culturally sensitive healthcare contexts, highlighting significant performance disparities across models and outlining key areas for improvement in multilingual evaluation benchmarks. The work is a foundational step toward understanding and improving LLM performance in real-world, multicultural applications.