
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (2401.16380v1)

Published 29 Jan 2024 in cs.CL

Abstract: LLMs are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training ($\textbf{WRAP}$) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by $\sim3x$. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.


Summary

  • The paper introduces WRAP, a technique that uses an instruction-tuned model to rephrase web documents into synthetic data for joint pre-training alongside the original content.
  • The paper demonstrates a roughly threefold pre-training speedup on the C4 dataset and an average perplexity improvement of more than 10% across subsets of the Pile at the same compute budget.
  • The paper shows that combining diverse rephrasing styles enhances model generalization and boosts zero-shot question-answering accuracy by over 2%.

Introduction to Web Rephrase Augmented Pre-training (WRAP)

LLMs stand at the forefront of AI research, pushing the boundaries of NLP capabilities. Because of their scale, these models rely on expansive datasets scraped from the web, which are characteristically unstructured and noisy, driving up the compute and data required for pre-training. To address this inefficiency, the paper introduces Web Rephrase Augmented Pre-training (WRAP), which prompts an existing instruction-tuned LLM to paraphrase web documents into specific styles. Jointly pre-training on the original and rephrased data makes learning substantially more compute- and data-efficient.
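
The rephrasing step is straightforward to picture. Below is a minimal sketch in Python, assuming a Hugging Face transformers text-generation pipeline and an off-the-shelf instruction-tuned model; the model name and prompt wording are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the WRAP rephrasing step: an off-the-shelf instruction-tuned
# model is prompted to paraphrase one web document in a target style.
# The model name and prompt wording are illustrative assumptions.
from transformers import pipeline

STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a concise, high-quality style like Wikipedia:",
    "qa": "Convert the following text into a question-answer format:",
}

# Any instruction-tuned chat/instruct model can serve as the rephraser.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def rephrase(document: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    prompt = f"{STYLE_PROMPTS[style]}\n\n{document}\n\nRephrased text:"
    out = rephraser(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```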

Efficacy of WRAP

The paper presents compelling evidence for the effectiveness of WRAP. On the naturally noisy C4 dataset, WRAP accelerates pre-training by roughly 3x. Within the same compute budget, it improves perplexity by more than 10% on average across the Pile's different subsets and raises zero-shot question-answering accuracy by more than 2% across 13 tasks. These gains stem from style diversity that mirrors the evaluation styles used downstream and from the higher quality of rephrased text compared to unfiltered web content.
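
As a rough illustration of the perplexity metric behind these comparisons, here is a hedged sketch that computes held-out token-level perplexity for a causal LM; the model name, tokenizer, and naive per-document batching are assumptions for illustration, not the paper's evaluation harness.

```python
# Hedged sketch of measuring held-out perplexity for a causal LM, the metric
# behind the Pile comparisons above. The model name is a placeholder.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def held_out_perplexity(model_name: str, texts, max_length: int = 1024) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
            if enc["input_ids"].size(1) < 2:
                continue  # nothing to predict for single-token inputs
            # The returned loss is the mean cross-entropy over predicted positions.
            loss = model(**enc, labels=enc["input_ids"]).loss
            n_pred = enc["input_ids"].size(1) - 1
            total_nll += loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```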

Impact of Rephrasing Style and Data Combination

The research also examines how different rephrasing styles affect LLM performance in out-of-distribution (OOD) settings. Rephrasing web text into diverse styles, such as simplified, Wikipedia-like, or Q&A formats, and pre-training on these alongside real data yields better OOD generalization. The paper further explores how real web text and synthetic rephrases should be combined, advocating a balanced mixture that keeps the model robust to noisy inputs without sacrificing the quality gains from rephrasing.
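
A minimal sketch of how such a mixture could be assembled is shown below; the 50/50 sampling ratio and the style names are illustrative assumptions, since the paper itself studies several style and mixing combinations.

```python
# Minimal sketch of interleaving real web text with style-diverse rephrases
# into a single pre-training stream. The synthetic fraction and style names
# are illustrative assumptions.
import random
from typing import Callable, Iterable, Iterator, Tuple

def mixed_stream(
    real_docs: Iterable[str],
    rephrase_fn: Callable[[str, str], str],
    styles: Tuple[str, ...] = ("simplified", "wikipedia", "qa"),
    synthetic_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[str]:
    rng = random.Random(seed)
    for doc in real_docs:
        if rng.random() < synthetic_fraction:
            # Synthetic rephrase: adds style diversity and cleaner text.
            yield rephrase_fn(doc, rng.choice(styles))
        else:
            # Raw web text: keeps the model robust to noisy inputs.
            yield doc
```

Here `rephrase_fn` could be the `rephrase` helper sketched earlier; in practice the synthetic fraction and style mix are the knobs the paper studies.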

Comparative Analysis and Future Directions

Situating its findings within the existing literature, the paper positions WRAP as a technique that mitigates the challenges of data curation, data scarcity, and pre-training compute. Comparisons with other pre-training setups, including those that use larger datasets and more compute, show WRAP's stronger performance on a range of benchmarks. Looking forward, the paper paves the way for more nuanced pre-training strategies, especially when data is scarce or when the goal is to extract more utility from the data that is available.

The paper emphasizes the advantages of synthetic data for pre-training while also noting its limitations, such as the cost of generating rephrases and the difficulty of ensuring content diversity. Nonetheless, WRAP illustrates the evolving landscape of LLM training and the interplay between data quality, model efficiency, and computational resources.
