Evaluate LLM summarization in lower-resource languages beyond English and Chinese

Determine how effectively large language models summarize text in lower-resource languages beyond English and Chinese, assessing performance across multiple domains with fine-grained evaluation criteria such as faithfulness, completeness, and conciseness, in order to establish their multilingual robustness.

Background

MSumBench provides a multi-aspect benchmark for summarization evaluation across English and Chinese, introducing domain-specific key-fact categories and fine-grained metrics to assess faithfulness, completeness, and conciseness. Despite this bilingual coverage, many languages remain unaddressed.
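To make the three criteria concrete, the following is a minimal sketch of how fine-grained, key-fact-based scores are often computed, in the spirit of key-fact alignment. It is not the paper's exact protocol: the per-fact and per-sentence judgments are assumed to come from an LLM judge or human annotators, and all names (`EvalInput`, `fact_covered`, and so on) are illustrative.

```python
from dataclasses import dataclass

# Hypothetical structures for key-fact-based summarization evaluation.
# The boolean judgments would in practice be produced by an LLM judge
# or human annotators; here they are taken as given inputs.

@dataclass
class EvalInput:
    key_facts: list[str]           # domain-specific key facts from the source
    summary_sentences: list[str]   # sentences of the candidate summary
    fact_covered: list[bool]       # per key fact: is it covered by the summary?
    sentence_faithful: list[bool]  # per sentence: is it consistent with the source?
    sentence_aligned: list[bool]   # per sentence: does it convey >= 1 key fact?

def faithfulness(x: EvalInput) -> float:
    """Fraction of summary sentences consistent with the source document."""
    return sum(x.sentence_faithful) / max(len(x.summary_sentences), 1)

def completeness(x: EvalInput) -> float:
    """Fraction of key facts that the summary covers."""
    return sum(x.fact_covered) / max(len(x.key_facts), 1)

def conciseness(x: EvalInput) -> float:
    """Fraction of summary sentences that convey at least one key fact."""
    return sum(x.sentence_aligned) / max(len(x.summary_sentences), 1)
```

Under this scheme, porting the evaluation to a new language amounts to collecting key facts and judgments in that language; the open question is whether those judgments remain reliable outside high-resource settings.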

Throughout the paper, the authors observe systematic performance differences across languages and domains and note that multilingual evaluation remains challenging. They therefore identify extending rigorous, domain-aware evaluation to lower-resource languages as an open question, one whose answer would show how well LLMs handle summarization beyond high-resource settings.

References

Evaluating how effectively LLMs handle other lower-resource languages remains an open question.

Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages (arXiv:2506.00549, Min et al., 31 May 2025), in Limitations (unnumbered section, third paragraph)