Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation (2410.21970v1)

Published 29 Oct 2024 in cs.CL

Abstract: RALMs (Retrieval-Augmented LLMs) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge requires RALMs to handle diverse languages, a topic that has received limited research attention. In this work, we propose Futurepedia, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval-Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
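The abstract's final piece of advice, repositioning English documents in the retrieved context to counter the model's English selection bias, can be illustrated with a small sketch. This is not the paper's code; the function name, the `(language_code, text)` tuple format, and the choice to move English documents to the end of the context are all illustrative assumptions about one way such a mitigation could be applied before prompting a RALM.

```python
# Hedged sketch (not from the paper): reposition English documents toward
# the end of the retrieved context so non-English evidence appears first,
# one possible way to mitigate the English selection bias described above.

def reposition_english(docs):
    """Reorder retrieved documents, keeping each group's internal order.

    `docs` is a list of (language_code, text) tuples; the format is an
    illustrative assumption, not the paper's interface.
    """
    non_english = [d for d in docs if d[0] != "en"]
    english = [d for d in docs if d[0] == "en"]
    # Non-English evidence first, English evidence last.
    return non_english + english

retrieved = [("en", "doc A"), ("zh", "doc B"), ("ar", "doc C"), ("en", "doc D")]
reordered = reposition_english(retrieved)
print(reordered)
```

Because the reordering is stable within each group, relative retrieval ranking among non-English documents (and among English ones) is preserved; only the cross-group position changes.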
