Retrieval-augmented Generation in Multilingual Settings
The paper "Retrieval-augmented generation in multilingual settings" addresses a significant gap in the application of retrieval-augmented generation (RAG) to multilingual environments. RAG research has focused primarily on English, leaving other languages underexplored. By investigating RAG in a multilingual context (mRAG), the authors examine how the components of an mRAG pipeline must be adapted to handle user queries and datastores in thirteen different languages.
Summary of Contributions
- Publicly Available Baseline mRAG Pipeline: The authors present a comprehensive mRAG pipeline, which has been released publicly. This pipeline serves as a robust baseline for future research in multilingual RAG.
- Empirical Study across Diverse Languages: The paper spans 13 languages and focuses on open-domain question answering, providing a broad empirical evaluation of mRAG components.
- Identification of Essential Adjustments: Despite the availability of high-quality multilingual retrievers and generators, specific modifications such as task-specific prompt engineering and adjusted evaluation metrics are required to facilitate effective generation in user languages.
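To make the pipeline structure concrete, the following is a minimal sketch of the retrieve-then-prompt flow described above. The retriever here is a toy word-overlap ranker standing in for the real models the paper evaluates (e.g., BGE-m3); the prompt template with an explicit "respond in the user language" instruction reflects one of the adjustments the authors identify, but its exact wording is an assumption, not the paper's prompt.

```python
def retrieve(query, datastore, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query.
    A stand-in for a multilingual dense retriever such as BGE-m3."""
    query_words = set(query.lower().split())
    scored = sorted(
        datastore,
        key=lambda doc: -len(query_words & set(doc.lower().split())),
    )
    return scored[:k]


def build_prompt(query, docs, user_language):
    """Assemble a generation prompt that explicitly instructs the LLM to
    answer in the user's language (hypothetical wording)."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer the question using the context below.\n"
        f"Respond in {user_language}.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    datastore = [
        "Victor Hugo wrote Les Misérables in 1862.",
        "Paris is the capital of France.",
    ]
    docs = retrieve("Qui a écrit Les Misérables ?", datastore, k=1)
    print(build_prompt("Qui a écrit Les Misérables ?", docs, "French"))
```

The prompt string would then be passed to a multilingual LLM; the paper's findings suggest that both the instruction language and the explicit target-language directive affect answer quality.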
Core Findings
- Retrieval Effectiveness: Off-the-shelf multilingual retrievers and rerankers, specifically the BGE-m3 model, show reasonable performance in both monolingual and cross-lingual retrieval scenarios (Table \ref{tab:1}), demonstrating the retriever's efficacy in handling queries and documents across different languages.
- Generation Techniques: Achieving high performance in multilingual settings requires an LLM that is both multilingually pretrained and instruction-tuned. The Command-R-35B model in particular performed best across languages when paired with suitable prompting strategies: explicit instructions to respond in the user language, and system prompts translated into the user language, maximized performance (Tables \ref{tab:prompts_demo}, \ref{tab:prompts}, and \ref{tab:models}).
- Evaluation Adjustments: Metrics such as character 3-gram recall were introduced to better account for variations in the spelling and transliteration of named entities across languages (Table \ref{tab:charrecall}). This adjustment yields more accurate evaluations of multilingual generations.
- Effect of Retrieval Language: Retrieval from multilingual datastores, covering both English and user languages, often yielded higher performance than monolingual retrieval settings, as the empirical results illustrate (Table \ref{tab:1}).
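The character 3-gram recall metric mentioned above can be sketched as follows. This is a plausible implementation based on the standard definition of character n-gram recall (count-clipped overlap of the reference's trigrams found in the prediction); the paper's exact normalization choices may differ.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Multiset of character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_recall(reference, prediction, n=3):
    """Fraction of the reference's character n-grams found in the
    prediction (with count clipping). Tolerant of small spelling and
    transliteration differences, unlike exact match."""
    ref = char_ngrams(reference, n)
    pred = char_ngrams(prediction, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    return overlap / total
```

For example, a prediction that spells a named entity slightly differently from the reference still shares most character trigrams with it, so it scores high on this metric while exact-match accuracy would score it zero.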
Implications and Future Directions
The implications of this research are multifaceted, impacting both theoretical considerations and practical applications in AI and multilingual NLP:
- Enhanced Information Access: The development and refinement of mRAG pipelines can significantly enhance information access across diverse linguistic and cultural contexts, bridging the gap for non-English speakers.
- Prompt Engineering and Evaluation: The findings underscore the importance of sophisticated prompt engineering and comprehensive evaluation metrics. Future work might explore dynamic prompt adjustments based on query languages to further enhance multilingual generation fidelity without the need for pre-translation.
- Robust Multilingual LLMs: The paper's insights into the performance of different LLMs suggest that more research is needed to address the "curse of multilinguality". Future models should aim to balance multilingual capabilities without compromising performance in any individual language.
- Extension to Other Domains: While this paper focuses on open-domain question answering with Wikipedia as the datastore, extending the research to other domains and applications is essential. Domain-specific datasets and contexts could help evaluate and improve the mRAG pipeline's adaptability and robustness in real-world scenarios.
Conclusion
The presented research makes important strides in bringing the benefits of RAG to a multilingual audience. The publicly available mRAG pipeline and comprehensive analysis across multiple languages provide a strong foundation for future research. The paper uncovers critical modifications needed to optimize components within an mRAG framework and highlights essential directions for enhancing multilingual LLMs and their evaluation. These contributions pave the way for more inclusive and effective NLP systems, offering a promising outlook for multilingual AI advancements.