Evaluation of LLMs for Bangla Consumer Health Query Summarization
This paper presents an empirical study of the zero-shot capabilities of large language models (LLMs) in summarizing consumer health queries (CHQs) written in Bangla, a low-resource language. With the increasing prevalence of online platforms for medical inquiries, CHQs often contain extraneous and irrelevant details that can impede efficient responses by healthcare professionals. Summarizing CHQs can therefore speed the extraction of critical information, improving both the accuracy and the efficiency of responses in medical contexts.
The authors benchmarked nine leading LLMs (GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B) on the BanglaCHQ-Summ dataset, which consists of 2,350 annotated query-summary pairs. The Bangla T5 model, fine-tuned specifically for Bangla CHQ summarization, served as the comparative baseline. The study's aim was to assess whether these LLMs can produce high-quality summaries without any Bangla-specific fine-tuning, using ROUGE metrics for evaluation.
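The paper's exact prompt wording and decoding settings are not given in this summary, but the zero-shot setup can be sketched as a single instruction plus the raw query, with no in-context examples or fine-tuning. The function name, prompt text, and the commented client call below are illustrative assumptions, not the authors' actual code:

```python
# Hypothetical sketch of a zero-shot CHQ summarization setup.
# The prompt wording and model choice here are assumptions for illustration.
def build_prompt(query: str) -> str:
    # Instruction in English; the query itself would be Bangla text.
    return (
        "Summarize the following Bangla consumer health query in one "
        "short Bangla sentence, keeping only the core medical question:\n\n"
        + query
    )

# With an OpenAI-compatible client (assumed configured with an API key),
# the call might look like this:
#
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_prompt(chq_text)}],
# )
# summary = resp.choices[0].message.content
```

The key property of the zero-shot setting is that the model sees only this instruction and the query: no BanglaCHQ-Summ training pairs are ever shown to it, in contrast to the fine-tuned Bangla T5 baseline.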
The results indicated substantial variability in summarization performance across the LLMs. Mixtral-8x22b-Instruct and GPT-4 scored highest on the ROUGE-1 and ROUGE-L metrics. Notably, Mixtral-8x22b-Instruct achieved a ROUGE-1 score of 51.36 and a ROUGE-L score of 49.17, surpassing the Bangla T5 model on both. Bangla T5 retained the lead on ROUGE-2 with a bi-gram overlap score of 29.11, suggesting that task-specific fine-tuning still confers an advantage in phrase-level coherence. Conversely, Athene-70B produced the least effective summaries, scoring near the bottom on all ROUGE metrics and substantially underperforming its peers.
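Since ROUGE-1, ROUGE-2, and ROUGE-L carry the whole comparison, it is worth recalling what they measure. The following is a minimal, self-contained sketch of the F1 variants of these metrics using naive whitespace tokenization; the paper's actual scoring pipeline is not specified here and would likely use a Bangla-aware tokenizer:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1: n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())        # clipped n-gram matches
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r) if p + r else 0.0

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    a, b = candidate.split(), reference.split()
    if not a or not b:
        return 0.0
    # Standard dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Because all three metrics match surface tokens, two summaries that convey the same meaning with different word choices or spellings score as disagreement, which matters for the spelling-variation issue discussed below.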
The paper corroborates the potential of zero-shot LLMs to compete with fine-tuned models in low-resource contexts, suggesting that task-specific training may become less necessary as model architectures and learning paradigms continue to advance. Challenges remain, however. Pronunciation-driven spelling variability in Bangla depresses ROUGE scores, since ROUGE matches surface tokens and treats alternative spellings of the same word as mismatches. Some models, such as Qwen2-72b-Instruct, struggled to maintain fluency and coherence. Summary length was a further point of analysis: GPT-4, for instance, tended to generate longer summaries than the gold-standard references.
These findings have significant practical implications for healthcare systems, where automated medical query summarization can ease workload pressures on professionals and shorten query response times. On the theoretical side, they underscore the need for further research into fine-grained improvements in how LLMs handle the linguistic peculiarities of low-resource languages like Bangla.
Future advances in LLMs should concentrate on improving content fidelity without extensive domain-specific tuning, promoting broader applicability across domains and geographic regions. Future research could focus on better handling of pronunciation-driven spelling variability and on further narrowing the gap between zero-shot and fine-tuned models, improving the reliability of LLM outputs in specialized tasks such as medical CHQ summarization.