Multi-FAct: Assessing Factuality of Multilingual LLMs using FActScore (2402.18045v3)

Published 28 Feb 2024 in cs.CL

Abstract: Evaluating the factuality of long-form LLM-generated text is an important challenge. Recently there has been a surge of interest in factuality evaluation for English, but little is known about the factuality evaluation of multilingual LLMs, especially when it comes to long-form generation. This paper systematically evaluates multilingual LLMs' factual accuracy across languages and geographic regions. We introduce a simple pipeline for multilingual factuality evaluation, by applying FActScore (Min et al., 2023) for diverse languages. In addition to evaluating multilingual factual generation, we evaluate the factual accuracy of long-form text generation in topics that reflect regional diversity. We also examine the feasibility of running the FActScore pipeline using non-English Wikipedia and provide comprehensive guidelines on multilingual factual evaluation for regionally diverse topics.

Assessing Factual Accuracy and Geographical Bias in Multilingual LLMs through the Multi-FAct Framework

Introduction to Multi-FAct

LLMs, while impressive in their capabilities, have raised concerns about their accuracy, particularly in factual content generation. This paper addresses the often overlooked problem of multilingual factuality assessment in LLMs, shedding light on model performance across languages and geographical contexts. It introduces Multi-FAct, an adaptation of the FActScore metric for multilingual use, and applies it to systematically evaluate the quality and bias of factual output from models such as GPT-3.5 and GPT-4 when generating biographies across nine languages. The fundamental takeaway is a notable variance in factual accuracy and the presence of geographical biases, with a pronounced skew towards Western-centric content.
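For reference, FActScore scores a generation by the fraction of its atomic facts that the knowledge source supports, averaged over prompts. A compact statement of the metric (paraphrased from Min et al., 2023, not quoted from this paper) is:

```latex
% FActScore of a model M over prompts X with knowledge source C,
% where A_y is the set of atomic facts extracted from the generation y = M(x).
\mathrm{FActScore}(\mathcal{M})
  = \mathbb{E}_{x \sim \mathcal{X}}\left[
      \frac{1}{|\mathcal{A}_y|}
      \sum_{a \in \mathcal{A}_y} \mathbb{1}\left[a \text{ is supported by } \mathcal{C}\right]
    \right],
  \qquad y = \mathcal{M}(x).
```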

Methodological Framework

The researchers selected biography generation as their evaluation task, focusing on national leaders as universally identifiable subjects that span cultures and languages. The paper chose leaders from the year 2015 across twenty populous nations from each continent, ensuring broad representation. The Multi-FAct pipeline itself has three stages: generating content in each target language, translating the outputs into English for uniform evaluation, and applying FActScore to assess factuality. Both GPT-3.5 and GPT-4 were evaluated, with prompts translated into each target language and the generated content verified against English Wikipedia.
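A minimal sketch of how the three stages fit together is given below. The helper functions (generate_biography, translate_to_english, score_with_factscore) are hypothetical placeholders for the LLM, machine-translation, and FActScore steps respectively; they are not the authors' code or any specific library's API.

```python
# Minimal sketch of the three-stage Multi-FAct pipeline described above.
# The helpers below are stand-ins: in practice each would call an LLM,
# an MT system, and a FActScore-style verifier.
from dataclasses import dataclass


@dataclass
class FactualityResult:
    language: str
    leader: str
    num_facts: int     # number of atomic facts extracted from the biography
    factscore: float   # fraction of atomic facts supported by the reference


def generate_biography(leader: str, language: str) -> str:
    """Stage 1 (placeholder): ask the LLM for a biography in the target language."""
    return f"<biography of {leader} in {language}>"


def translate_to_english(text: str) -> str:
    """Stage 2 (placeholder): translate the generation into English for uniform scoring."""
    return text  # a real pipeline would call a machine-translation model here


def score_with_factscore(text: str, reference: str = "English Wikipedia") -> tuple[int, float]:
    """Stage 3 (placeholder): split into atomic facts and verify each against the reference."""
    return 0, 0.0  # (number of atomic facts, fraction supported)


def multi_fact(leaders: list[str], languages: list[str]) -> list[FactualityResult]:
    """Run the generate -> translate -> score loop for every (language, leader) pair."""
    results = []
    for language in languages:
        for leader in leaders:
            bio = generate_biography(leader, language)
            bio_en = translate_to_english(bio)
            n_facts, fs = score_with_factscore(bio_en)
            results.append(FactualityResult(language, leader, n_facts, fs))
    return results
```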

Empirical Insights

The analysis revealed two core insights:

  • Language-Dependent Factual Accuracy: English emerged as consistently superior in both the accuracy and quantity of generated facts, outstripping other languages. This points to a significant performance disparity that favors high-resource languages.
  • Geographical Bias: A clear pattern of bias was observed, favoring factual information pertaining to Western regions regardless of the input language. This geographical bias underscores an inherent Western-centric skew in the knowledge bases of multilingual LLMs.

Through a meticulous assessment, the paper presents detailed statistical comparisons across languages and continents, showcasing the variability in factual performance and highlighting the confluence of language resource availability and geographical bias.
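As an illustration of how such per-language and per-continent comparisons can be tabulated from per-biography scores, the following sketch uses pandas on toy data; the column names and values are assumptions, not the paper's results.

```python
# Illustrative aggregation of Multi-FAct-style outputs (toy data, not the paper's).
import pandas as pd

df = pd.DataFrame({
    "language":  ["en", "en", "es", "ko"],
    "continent": ["Europe", "Asia", "Americas", "Asia"],
    "factscore": [0.91, 0.84, 0.78, 0.80],
    "num_facts": [42, 35, 30, 28],
})

# Mean precision and fact count per generation language.
by_language = df.groupby("language")[["factscore", "num_facts"]].mean()

# Mean precision broken down by the continent of the biography subject.
by_continent = df.groupby(["language", "continent"])["factscore"].mean().unstack()

print(by_language)
print(by_continent)
```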

Analysis of Geographical Biases

A sub-regional analysis added further granularity, showing that certain languages were markedly more accurate for specific global regions. For instance, Chinese and Korean displayed higher factual precision for Eastern Asia, while English maintained relatively uniform accuracy across regions, albeit with a slight preferential tilt towards North America.

Correlational Examination of FActScore

A correlational matrix was constructed to visualize intersecting performances across languages, showing a high correlation in FActScore among Western languages (English, Spanish, French, and German). This hints at a possible shared underpinning in how these languages represent facts, in contrast to lesser-resourced languages, which did not exhibit a similar correlation pattern.
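A sketch of how such a cross-language correlation matrix can be computed from a leaders-by-languages table of FActScores follows; the table here is random stand-in data, not the paper's results.

```python
# Cross-language correlation of FActScores: one row per leader, one column per language.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
languages = ["en", "es", "fr", "de", "ko", "sw"]  # assumed language set for illustration
scores = pd.DataFrame(
    rng.uniform(0.5, 1.0, size=(20, len(languages))),  # 20 leaders x 6 languages
    columns=languages,
)

corr = scores.corr(method="pearson")  # language-by-language Pearson correlations
print(corr.round(2))
```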

Discussing Limitations and Future Directions

The paper acknowledges its limitations, including a constrained sample size and the potential effects of automated factuality scoring tools such as FActScore, which may not fully capture differences in the informational value of individual facts. Suggested future directions include enhancing the fidelity of factuality assessment by differentiating between facts of varying quality and expanding the scope to more diverse, non-political figures.

Reflections and Implications

This paper opens up critical discussions around the need for improved accuracy in multilingual LLMs and calls attention to the geographical biases inherent in these models. It underscores the importance of developing more nuanced methodologies for evaluating LLM outputs, especially in non-English and low-resource languages, to ensure a fair and equitably distributed representation of global knowledge.

The implications of this research are vast, encompassing both the enhancement of LLM capabilities and a reconsideration of how bias and fairness are addressed in the development of AI technologies. As the field progresses, it will be crucial to incorporate these findings into the refinement of LLMs, ensuring they not only achieve high levels of linguistic proficiency but also embody a balanced and accurate representation of the diverse world we inhabit.

References (30)
  1. mFACE: Multilingual summarization with factual consistency evaluation. arXiv preprint arXiv:2212.10622.
  2. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
  3. Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying.
  4. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–718, Nusa Dua, Bali. Association for Computational Linguistics.
  5. Multilingual large language models leak human stereotypes across language boundaries.
  6. FacTool: Factuality detection in generative AI, a tool-augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.
  7. Chain-of-verification reduces hallucination in large language models.
  8. OLMo: Accelerating the science of language models.
  9. A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10:178–206.
  10. Ashim Gupta and Vivek Srikumar. 2021. X-Fact: A new benchmark dataset for multilingual fact checking. arXiv preprint arXiv:2106.09248.
  11. Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6997–7013, Dublin, Ireland. Association for Computational Linguistics.
  12. Dirk Hovy and Diyi Yang. 2021. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 588–602, Online. Association for Computational Linguistics.
  13. Jing Huang and Diyi Yang. 2023. Culturally aware natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7591–7609, Singapore. Association for Computational Linguistics.
  14. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
  15. Mistral 7B. arXiv preprint arXiv:2310.06825.
  16. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
  17. Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries.
  18. Comparing hallucination detection metrics for multilingual generation. arXiv preprint arXiv:2402.10496.
  19. ProoFVer: Natural logic theorem proving for fact verification. Transactions of the Association for Computational Linguistics, 10:1013–1030.
  20. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.
  21. Large language models are geographically biased. arXiv preprint arXiv:2402.02680.
  22. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation.
  23. Nonparametric masked language modeling. arXiv preprint arXiv:2212.01349.
  24. Global-Liar: Factuality of LLMs over time and geographic regions. arXiv preprint arXiv:2401.17839.
  25. Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855.
  26. Dolma: An open corpus of three trillion tokens for language model pretraining research.
  27. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  28. Wikipedia contributors. 2024. Wikipedia: Multilingual statistics. Wikipedia, the free encyclopedia. [Online; accessed 24 February 2024].
  29. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
  30. Reasoning over semantic-level graph for fact checking. arXiv preprint arXiv:1909.03745.
Authors (4)
  1. Sheikh Shafayat (7 papers)
  2. Eunsu Kim (14 papers)
  3. Juhyun Oh (9 papers)
  4. Alice Oh (81 papers)