MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions (2409.12958v1)

Published 19 Sep 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction tuning enhances LLMs by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.

MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

The paper introduces Multilingual Reverse Instructions (MURI), a methodology for generating high-quality instruction tuning datasets for low-resource languages. Its significance lies in the growing importance of instruction tuning for aligning LLMs with diverse tasks and user specifications: traditional methods for creating such datasets rely on data annotation and existing multilingual models, which are often unavailable or ineffective for low-resource languages.

Methodology

MURI leverages the concept of reverse instructions coupled with machine translation to create instruction-output pairs from human-written texts. This process is divided into several critical steps:

  1. Data Selection: Texts are randomly sampled from high-quality multilingual corpora, primarily CulturaX and Wikipedia.
  2. Document Translation: Selected documents in target languages are translated into English to generate intermediate representations.
  3. Reverse Instructions: For each translated document, an English instruction is generated using an LLM. This involves prompting the LLM with an example document to produce a relevant instruction.
  4. Instruction Translation: The generated English instruction is translated back into the source language, ensuring fidelity to linguistic and cultural contexts.
  5. Content Screening: The generated pairs are filtered for quality, removing inappropriate content and translation artifacts.

This methodology allows MURI to produce culturally relevant and idiomatic instruction-output pairs without extensive manual annotation or reliance on high-resource languages.
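
To make the pipeline concrete, below is a minimal sketch of steps 2–4 (document translation, reverse instruction generation, and instruction back-translation), assuming an off-the-shelf NLLB translation model and a small instruction-following LLM; the model names, prompt wording, and helper functions are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of MURI-style reverse instructions (illustrative, not the
# authors' exact code). Assumptions: NLLB for X<->en translation and any
# small instruction-following LLM for reverse instruction generation.
from transformers import pipeline

# Translation pipelines; model choice and language codes are assumptions
# (here: Turkish <-> English).
to_en = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                 src_lang="tur_Latn", tgt_lang="eng_Latn")
from_en = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                   src_lang="eng_Latn", tgt_lang="tur_Latn")

# Any instruct-tuned LLM would do here; this specific model is an assumption.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

REVERSE_INSTRUCTION_PROMPT = (
    "You are given a text that should be the OUTPUT of an instruction.\n"
    "Write a plausible instruction that would make a model produce it.\n\n"
    "Text:\n{document}\n\nInstruction:"
)

def generate_with_llm(prompt: str) -> str:
    """Generate a single reverse instruction from the prompt."""
    out = llm(prompt, max_new_tokens=64, return_full_text=False)
    return out[0]["generated_text"].strip()

def reverse_instruction_pair(document: str) -> dict:
    """Turn one human-written document into an (instruction, output) pair."""
    # Step 2: translate the source-language document into English.
    doc_en = to_en(document, max_length=1024)[0]["translation_text"]
    # Step 3: prompt an LLM to write an English instruction for this document.
    instruction_en = generate_with_llm(
        REVERSE_INSTRUCTION_PROMPT.format(document=doc_en)
    )
    # Step 4: translate the instruction back into the source language; the
    # original document is kept untouched as the output.
    instruction = from_en(instruction_en, max_length=256)[0]["translation_text"]
    return {"instruction": instruction, "output": document}
```

The sketch reflects the key design choice of the method: only the instruction passes through machine translation, while the output remains the original human-written text, which is what keeps the resulting pairs idiomatic and culturally grounded.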

MURI-IT Dataset

The resulting dataset, MURI-IT, includes over 2 million instruction-output pairs across 200 languages, making it one of the most comprehensive resources available for multilingual instruction tuning. It draws on diverse sources, including Wikipedia and WikiHow, ensuring wide variety in style, domain, and linguistic complexity. Code and data are publicly available, promoting open research and accessibility.
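
For readers who want to inspect the released data, a minimal loading sketch with the Hugging Face `datasets` library follows; the dataset identifier and field names below are assumptions, and the canonical download instructions are in the GitHub repository linked above.

```python
# Hedged sketch: inspect MURI-IT with the `datasets` library.
# The dataset id and field names are assumptions; see
# https://github.com/akoksal/muri for the canonical identifiers.
from collections import Counter
from datasets import load_dataset

muri_it = load_dataset("akoksal/muri-it", split="train")  # hypothetical id

print(muri_it[0])  # one instruction-output pair

# Count examples per language, assuming a `language` field exists.
per_lang = Counter(ex.get("language", "unknown") for ex in muri_it)
print(per_lang.most_common(10))  # largest language subsets
```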

Evaluation and Results

The paper reports extensive evaluations of the dataset's quality and effectiveness. Native speakers of 13 languages assessed the dataset, focusing on alignment, grammaticality, and coherence. These evaluations revealed that while high-resource languages consistently performed well, some low-resource languages exhibited issues such as orthographic inconsistency and non-standard dialects.

The effectiveness of models fine-tuned on MURI-IT, specifically MURI-101, was assessed on multilingual NLU and NLG tasks, demonstrating significant improvements:

  • NLU Performance: MURI-101 outperformed existing models in multilingual MMLU by over 14%.
  • NLG Performance: MURI-101 showed superior performance over baseline models such as mT0, with a 59% win rate in open-ended generation tasks using Command R+ as a judge (see the sketch of this pairwise win-rate computation below).
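
To unpack the win-rate metric: for each prompt, the judge model compares the MURI-101 response with the baseline response and declares a winner or a tie, and the win rate is the fraction of comparisons won. A minimal sketch follows, assuming the judge's verdicts have already been collected as strings; the label names and the tie-handling convention are assumptions, not necessarily the paper's protocol.

```python
# Minimal sketch of an LLM-as-judge win-rate computation (illustrative).
# Assumes each record holds the judge's verdict for one prompt:
# "model_a" (MURI-101 wins), "model_b" (baseline wins), or "tie".
from typing import Iterable

def win_rate(verdicts: Iterable[str]) -> float:
    """Fraction of pairwise comparisons won by model A; ties count as half."""
    verdicts = list(verdicts)
    wins = sum(v == "model_a" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["model_a", "model_b", "model_a", "tie"]))  # 0.625
```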

Additionally, evaluations on low-resource languages using the Taxi1500 dataset showed that MURI-IT effectively complements other datasets, such as Aya, resulting in improved performance.

Implications and Future Directions

The implications of this research extend both practically and theoretically:

  • Practical Implications: MURI-IT provides a valuable resource for developing more robust and culturally aware LLMs for low-resource languages. By ensuring high cultural relevance and eliminating translation artifacts, it supports the creation of models that better serve diverse linguistic communities.
  • Theoretical Implications: The novel approach of reverse instructions combined with machine translation offers a scalable and cost-effective method for dataset generation. This methodology can be adapted and extended to other domains and languages, opening new avenues for research in multilingual and low-resource NLP.

Future Developments

Future research could explore refining data quality through advanced clustering and cleaning techniques, addressing orthographic and dialectal variations more effectively. As multilingual models continue to evolve and support a broader range of languages, datasets like MURI-IT will be crucial in pushing the frontiers of what these models can achieve.

By providing a comprehensive and high-quality dataset and a scalable methodology, MURI represents a significant advancement in the field of multilingual NLP, particularly for underserved languages. This work sets a foundation for future research and development in creating more inclusive and culturally competent language technologies.

Authors (6)
  1. Abdullatif Köksal (22 papers)
  2. Marion Thaler (2 papers)
  3. Ayyoob Imani (16 papers)
  4. Ahmet Üstün (38 papers)
  5. Anna Korhonen (90 papers)
  6. Hinrich Schütze (250 papers)