HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings (2410.13671v1)

Published 17 Oct 2024 in cs.CL

Abstract: Assessing the capabilities and limitations of LLMs has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare. Multilingual evaluation often relies on translated benchmarks, which typically do not capture linguistic and cultural nuances present in the source language. This study provides an extensive assessment of 24 LLMs on real world data collected from Indian patients interacting with a medical chatbot in Indian English and 4 other Indic languages. We employ a uniform Retrieval Augmented Generation framework to generate responses, which are evaluated using both automated techniques and human evaluators on four specific metrics relevant to our application. We find that models vary significantly in their performance and that instruction tuned Indic models do not always perform well on Indic language queries. Further, we empirically show that factual correctness is generally lower for responses to Indic queries compared to English queries. Finally, our qualitative work shows that code-mixed and culturally relevant queries in our dataset pose challenges to evaluated models.

Multilingual Evaluation of LLMs in Healthcare Contexts

This paper undertakes a comprehensive evaluation of 24 LLMs within a multilingual, real-world healthcare scenario. It focuses on responses to questions posed by Indian patients interacting with a medical chatbot, HealthBot, in Indian English and four Indic languages. The study uses a Retrieval Augmented Generation (RAG) framework to generate responses, which are subsequently evaluated by automated techniques and human assessors on four pertinent metrics.
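To make the setup concrete, the following is a minimal sketch of such a RAG pipeline, assuming a sentence-transformers embedder and a tiny illustrative knowledge base. The corpus, prompt format, and the `call_llm` placeholder are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal RAG sketch: retrieve top-k passages from a small knowledge base
# and prepend them to the patient query before calling an LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Tiny illustrative knowledge base (not the paper's actual documents).
knowledge_base = [
    "Avoid rubbing your eye for two weeks after cataract surgery.",
    "Use the prescribed antibiotic drops four times a day.",
    "Mild itching and watering are common in the first few days.",
]
kb_embeddings = embedder.encode(knowledge_base, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = kb_embeddings @ q
    return [knowledge_base[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    """Prepend retrieved context to the patient query."""
    context = "\n".join(retrieve(query))
    return (
        "Answer the patient's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# `call_llm` stands in for any of the 24 evaluated models, e.g. on a
# hypothetical code-mixed query:
# answer = call_llm(build_prompt("Meri aankh mein khujli ho rahi hai, kya karun?"))
```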

Evaluation Framework and Findings

The assessment covers both proprietary and open-weight models, all run within an identical RAG framework so that comparisons are consistent. The core metrics are Factual Correctness, Semantic Similarity, Coherence, and Conciseness. The authors focus closely on how well models handle multilingual queries, code-mixed inputs, and contextually relevant linguistic nuances that standard translated benchmarks tend to overlook.
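As an illustration of how scoring on such metrics can be automated, here is a hypothetical LLM-as-judge rubric covering the four metrics. The prompt wording, the 1-to-5 scale, and the `judge_llm` callable are assumptions for illustration, not the paper's actual evaluation prompts.

```python
# Hypothetical LLM-as-judge rubric for the four metrics named in the paper.
import json

METRICS = ["Factual Correctness", "Semantic Similarity", "Coherence", "Conciseness"]

def build_judge_prompt(query: str, reference: str, response: str) -> str:
    """Assemble a single rubric prompt asking for scores on all four metrics."""
    return (
        "You are evaluating a medical chatbot answer.\n"
        f"Patient query: {query}\n"
        f"Reference (ground-truth) answer: {reference}\n"
        f"Model response: {response}\n\n"
        "Rate the response on each metric from 1 (poor) to 5 (excellent) and "
        "return JSON with keys: " + ", ".join(METRICS)
    )

def score_response(query, reference, response, judge_llm):
    """judge_llm is any callable mapping a prompt string to the judge's text output."""
    raw = judge_llm(build_judge_prompt(query, reference, response))
    scores = json.loads(raw)  # expects e.g. {"Factual Correctness": 4, ...}
    return {m: int(scores[m]) for m in METRICS}
```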

Intriguingly, the paper reveals significant performance disparities across models, including between larger multilingual models and smaller, instruction-tuned Indic models. Notably, factual correctness in responses to non-English queries generally lags behind that of English queries, underscoring a gap that must be closed for reliable multilingual performance. This finding emphasizes the need for more robust benchmarks that reflect real-world multilingual settings, especially in healthcare.

Methodological Robustness

The dataset comprises 749 questions curated from the HealthBot deployment used by Indian patients before and after cataract surgery, providing a rich, contextually grounded testbed for evaluation. It contains real-world utterances with misspellings, code-mixing, and regional specifics, presenting a more realistic assessment environment than conventional datasets. The methodology also ensures that the dataset is not contaminated by the training data of the evaluated models, preserving the validity of the results.

The paper also pairs LLM-based evaluators with human evaluations, using the human judgments to validate the LLM assessments and to reinforce the reliability of metrics such as Semantic Similarity.
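One common way to perform this kind of validation is to measure agreement between LLM-judge scores and human ratings. The sketch below uses Spearman rank correlation on hypothetical scores; the choice of statistic and the numbers are illustrative assumptions, not necessarily what the authors report.

```python
# Sketch of validating LLM-judge scores against human ratings via rank correlation.
from scipy.stats import spearmanr

# Hypothetical per-response scores (1-5) for one metric, e.g. Semantic Similarity.
human_scores = [5, 3, 4, 2, 5, 1, 4]
llm_scores   = [4, 3, 4, 2, 5, 2, 4]

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A high, significant correlation supports using the LLM judge at scale.
```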

Practical Implications and Future Directions

The insights drawn from this research have both practical and theoretical implications. Practically, the evaluation reveals which models are most adept at generating accurate and coherent responses in multilingual medical contexts, offering guidance for developers deploying LLMs in sensitive fields like healthcare. Theoretically, the observed limitations and performance variance highlight areas for improvement in model training, particularly the need for models that can accurately comprehend and respond to culturally nuanced queries.

The paper suggests future directions for research to enhance LLM performance, recommending the development of specialized benchmarks that can robustly test multilingual capabilities in varying linguistic contexts. Such advancements could significantly impact the deployment and reliability of LLMs in multicultural settings worldwide.

Conclusion

This research offers a comprehensive evaluation of LLMs across multilingual and culturally sensitive healthcare contexts, highlighting significant findings on model performance disparities and outlining key areas for improvement in multilingual evaluation benchmarks. It serves as a foundational step toward better understanding and improving LLM performance in real-world, multicultural applications.

Authors (4)
  1. Varun Gumma (14 papers)
  2. Anandhita Raghunath (1 paper)
  3. Mohit Jain (27 papers)
  4. Sunayana Sitaram (54 papers)