Exploring MAmmoTH2: An Effective Paradigm for LLM Reasoning Enhancement Using Naturally Harvested Web Data
Introduction to Instruction Tuning
Instruction tuning is a key technique for sharpening the reasoning capabilities of large language models (LLMs), and the quality and source of the tuning data largely determine how much a model improves. Traditional approaches rely on human-annotated datasets or synthetic generation, both of which can be expensive and limited in diversity and scale. The paper discussed here departs from both: it mines high-quality, diverse instruction data directly from the web, substantially improving reasoning performance without those constraints.
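To make the setup concrete, instruction tuning fine-tunes a model on (instruction, response) pairs rendered into a prompt template. A minimal sketch using a generic Alpaca-style template (an illustrative convention, not the paper's exact format):

```python
# Render an (instruction, response) pair into a supervised fine-tuning prompt.
# The "### Instruction / ### Response" template is a common generic convention,
# not necessarily the one MAmmoTH2 uses.
def format_example(instruction: str, response: str) -> str:
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

pair = ("What is the derivative of x^2?", "By the power rule, d/dx x^2 = 2x.")
print(format_example(*pair))
```

During training, the loss is typically computed only on the response tokens, so the model learns to answer instructions rather than to reproduce them.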
The MAmmoTH2 Approach
The crux of the MAmmoTH2 approach lies in efficiently extracting instruction-response pairs from existing internet resources. This process can be summarized in three major steps:
- Recalling Relevant Web Documents: The initial step involves creating a seed dataset to train models that can sift through vast web corpora like Common Crawl to find potentially useful documents. These documents primarily originate from educational and data-rich websites.
- Extracting Instruction-Response Pairs: This stage deploys open-source LLMs to pull out question-answer pairs from the filtered documents. Though the raw data contains significant noise, strategic extraction helps isolate valuable instructional content.
- Refining the Pairs: The final step leverages LLMs again to polish the extracted pairs, editing for clarity and formal correctness and, importantly, adding missing explanations to raise the quality of the instructional content.
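The three steps above can be sketched end to end. Everything below is a toy stand-in: the real pipeline trains recall models on a seed dataset and uses open-source LLMs for extraction and refinement, while these heuristics only illustrate the data flow.

```python
# Toy sketch of the recall -> extract -> refine pipeline.
# score_document, extract_pairs, and refine_pair are illustrative
# placeholders for the trained classifier and LLM-backed steps.

def score_document(doc: str) -> float:
    """Stand-in for the recall model: favor documents dense in questions."""
    return doc.count("?") / max(len(doc.split()), 1)

def extract_pairs(doc: str) -> list[tuple[str, str]]:
    """Stand-in for LLM extraction: pair each question with the text after it."""
    pairs = []
    chunks = doc.split("?")
    for q, a in zip(chunks, chunks[1:]):
        question = q.strip().split(".")[-1].strip() + "?"
        answer = a.strip().split(".")[0].strip() + "."
        pairs.append((question, answer))
    return pairs

def refine_pair(question: str, answer: str) -> tuple[str, str]:
    """Stand-in for LLM refinement: here, just normalize whitespace."""
    return " ".join(question.split()), " ".join(answer.split())

def build_dataset(corpus, threshold=0.01):
    dataset = []
    for doc in corpus:
        if score_document(doc) < threshold:    # stage 1: recall
            continue
        for q, a in extract_pairs(doc):        # stage 2: extract
            dataset.append(refine_pair(q, a))  # stage 3: refine
    return dataset

corpus = [
    "A lecture note. What is 2 + 2? The answer is 4. More text.",
    "Plain prose with no questions at all.",
]
print(build_dataset(corpus))  # keeps only the document with a Q-A pair
```

The design point is that cheap filtering (stage 1) runs over the full corpus, while the expensive LLM calls (stages 2 and 3) touch only the small recalled fraction.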
The final dataset, named WebInstruct, comprises roughly 10 million such pairs, collected without the direct cost of human annotation or the biases of synthetic generation.
Benchmarking the Model Performance
The paper proceeds to benchmark the efficacy of MAmmoTH2 by fine-tuning base LLMs with the gathered instruction dataset and testing them against several reasoning benchmarks. The models trained with WebInstruct showed remarkable improvements:
- General Enhancement: The 7B variant, for example, jumped from 11% to 34% accuracy on MATH and from 36% to 67% on GSM8K.
- Further Tuning with Public Datasets: MAmmoTH2-Plus, produced by additional fine-tuning on public instruction datasets, set new performance standards on several other reasoning and general-purpose benchmarks.
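Scoring on benchmarks like GSM8K is typically exact match on the final numeric answer. A minimal sketch of that convention (the `extract_final_answer` heuristic and the toy predictions are illustrative, not the paper's evaluation harness; GSM8K references mark the gold answer after "#### "):

```python
import re

def extract_final_answer(text: str) -> str:
    """Take the last number in the text, a common GSM8K scoring heuristic."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else ""

def accuracy(predictions, references):
    correct = sum(
        extract_final_answer(p) == extract_final_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["... so the answer is 42.", "The total is 17 apples."]
refs = ["#### 42", "#### 18"]
print(accuracy(preds, refs))  # prints 0.5
```

Accuracy numbers like the 36% to 67% GSM8K jump reported above come from exactly this kind of per-example exact-match scoring, averaged over the benchmark's test set.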
These results demonstrate not only the effectiveness of WebInstruct itself but also the broader value of harvesting naturally occurring instructional data over narrower, costlier generation methods.
Future Directions and Speculations
The advancements showcased in MAmmoTH2 open several promising avenues for future work. The methodology could extend to other domains of AI where data scarcity or lack of diversity is a limiting factor, and continued refinement of the extraction and processing pipeline may yield even more robust models capable of handling increasingly complex tasks.
Concluding Thoughts
In summary, the MAmmoTH2 framework presents a compelling step forward in the instruction tuning of LLMs. By tapping the vast, largely untapped reservoir of instructional content already available online, it sidesteps the significant costs and limitations of traditional data collection methods. The benchmark results attest to the viability of this approach and suggest broad applicability across the many fields that rely on deep reasoning capabilities in LLMs.