MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
The paper introduces Multilingual Reverse Instructions (MURI), a methodology for generating high-quality instruction tuning datasets for low-resource languages. Instruction tuning has become increasingly important for adapting LLMs to diverse tasks and user specifications, but conventional dataset creation relies on human annotation or existing multilingual models, both of which are often unavailable or ineffective for low-resource languages.
Methodology
MURI leverages the concept of reverse instructions coupled with machine translation to create instruction-output pairs from human-written texts. This process is divided into several critical steps:
- Data Selection: Texts are randomly sampled from high-quality multilingual corpora, primarily CulturaX and Wikipedia.
- Document Translation: Selected documents in the target languages are translated into English, which serves as a pivot representation for instruction generation.
- Reverse Instructions: For each translated document, an English instruction is generated by prompting an LLM to produce an instruction to which the document would be a plausible response.
- Instruction Translation: The generated English instruction is translated into the language of the original document, so that the final instruction-output pair stays faithful to its linguistic and cultural context.
- Content Screening: The generated pairs are filtered for quality, removing inappropriate content and translation artifacts.
This methodology allows MURI to produce culturally relevant and idiomatic instruction-output pairs without extensive manual annotation or reliance on high-resource languages; the sketch below illustrates the pipeline end to end.
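A minimal sketch of the pipeline in Python, assuming hypothetical `translate` and `generate` wrappers around an MT system and an instruction-capable LLM (the paper does not prescribe specific APIs, and the prompt wording here is illustrative):

```python
import random

# Illustrative reverse-instructions prompt; the paper's actual prompt may differ.
REVERSE_INSTRUCTION_PROMPT = (
    "Below is a text. Write an instruction to which this text "
    "would be a good response.\n\nText:\n{document}\n\nInstruction:"
)

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical wrapper around a machine translation system."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical wrapper around an instruction-following LLM."""
    raise NotImplementedError

def is_clean(instruction: str, output: str) -> bool:
    """Hypothetical quality filter: drop inappropriate content and
    obvious translation artifacts. Placeholder check only."""
    return bool(instruction.strip()) and bool(output.strip())

def reverse_instruction_pair(document: str, lang: str) -> dict | None:
    # 1. Translate the target-language document into English (pivot).
    doc_en = translate(document, src=lang, tgt="en")
    # 2. Reverse instructions: ask an LLM for an instruction whose
    #    answer would be this document.
    instr_en = generate(REVERSE_INSTRUCTION_PROMPT.format(document=doc_en))
    # 3. Translate the instruction back into the document's language.
    instr = translate(instr_en, src="en", tgt=lang)
    # 4. Pair the translated instruction with the ORIGINAL document,
    #    so the output remains fully human-written and idiomatic.
    if not is_clean(instr, document):
        return None
    return {"instruction": instr, "output": document, "lang": lang}

def build_dataset(corpus: list[tuple[str, str]], n: int) -> list[dict]:
    """Sample n (text, lang) documents and keep pairs that pass screening."""
    pairs = (reverse_instruction_pair(t, l) for t, l in random.sample(corpus, n))
    return [p for p in pairs if p is not None]
```

The key design choice visible in the sketch: only the instruction passes through translation, while the output remains the original human-written document, which is what preserves idiomaticity and cultural relevance.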
MURI-IT Dataset
The resulting dataset, MURI-IT, comprises over 2 million instruction-output pairs across 200 languages, making it one of the most comprehensive resources available for multilingual instruction tuning. It draws on diverse sources, including Wikipedia and WikiHow, ensuring wide variety in style, domain, and linguistic complexity. Code and data are publicly available, promoting open research and accessibility.
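If the released data follows standard Hugging Face Hub conventions, loading and inspecting a language subset might look like the sketch below; the dataset identifier, split, and field names are assumptions for illustration, not details confirmed from the paper.

```python
from datasets import load_dataset

# Hypothetical dataset identifier and schema; consult the paper's
# repository for the actual name, configs, and field names.
muri_it = load_dataset("akoksal/muri-it", split="train")

# Filter to a single language and inspect one instruction-output pair.
yoruba = muri_it.filter(lambda ex: ex["language"] == "yor")
print(yoruba[0]["instruction"])
print(yoruba[0]["output"])
```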
Evaluation and Results
The paper reports extensive evaluations of the dataset's quality and effectiveness. Native speakers of 13 languages assessed the dataset, focusing on alignment between instruction and output, grammaticality, and coherence. The evaluations revealed that while pairs in high-resource languages consistently scored well, some low-resource languages exhibited issues such as orthographic inconsistency and non-standard dialects.
The effectiveness of models fine-tuned on MURI-IT, specifically MURI-101, was assessed on multilingual NLU and NLG tasks, where it demonstrated significant improvements:
- NLU Performance: MURI-101 outperformed existing models in multilingual MMLU by over 14%.
- NLG Performance: MURI-101 outperformed baseline models such as mT0, achieving a 59% win rate in open-ended generation tasks with Command R+ as the judge (a sketch of this judging protocol follows the list).
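A pairwise win rate of this kind can be computed with an LLM-as-judge loop along the following lines; the `judge` wrapper is hypothetical, and counting ties as half a win is an assumption here, not a detail taken from the paper.

```python
def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical call to a judge LLM (e.g. Command R+) returning
    'A', 'B', or 'tie' for a pairwise comparison."""
    raise NotImplementedError

def win_rate(prompts, model_answers, baseline_answers) -> float:
    """Fraction of prompts where the model's answer is preferred over
    the baseline's, with ties counted as half a win (an assumption)."""
    wins = 0.0
    for p, a, b in zip(prompts, model_answers, baseline_answers):
        # Randomizing answer order per prompt would control for
        # position bias; omitted here for brevity.
        verdict = judge(p, a, b)
        if verdict == "A":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(prompts)
```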
Additionally, evaluations on low-resource languages using the Taxi1500 dataset showed that MURI-IT effectively complements other datasets, such as Aya, resulting in improved performance.
Implications and Future Directions
The implications of this research extend both practically and theoretically:
- Practical Implications: MURI-IT provides a valuable resource for developing more robust and culturally aware LLMs for low-resource languages. By prioritizing cultural relevance and screening out translation artifacts, it supports the creation of models that better serve diverse linguistic communities.
- Theoretical Implications: The novel approach of reverse instructions combined with machine translation offers a scalable and cost-effective method for dataset generation. This methodology can be adapted and extended to other domains and languages, opening new avenues for research in multilingual and low-resource NLP.
Future Developments
Future research could explore refining data quality through advanced clustering and cleaning techniques, addressing orthographic and dialectal variations more effectively. As multilingual models continue to evolve and support a broader range of languages, datasets like MURI-IT will be crucial in pushing the frontiers of what these models can achieve.
By providing a comprehensive and high-quality dataset and a scalable methodology, MURI represents a significant advancement in the field of multilingual NLP, particularly for underserved languages. This work sets a foundation for future research and development in creating more inclusive and culturally competent language technologies.