
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (2401.16380v1)

Published 29 Jan 2024 in cs.CL

Abstract: LLMs are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training ($\textbf{WRAP}$) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by $\sim3x$. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.


Summary

  • The paper introduces WRAP, a technique that uses an instruction-tuned model to rephrase web documents into synthetic data for joint pre-training alongside the original content.
  • The paper demonstrates a roughly threefold pre-training speedup on the C4 dataset and an average perplexity improvement of more than 10% across subsets of the Pile at the same compute budget.
  • The paper shows that combining diverse rephrasing styles enhances model generalization and boosts zero-shot question-answering accuracy by over 2%.

Introduction to Web Rephrase Augmented Pre-training (WRAP)

LLMs stand at the forefront of AI research, pushing the boundaries of NLP capabilities. Because of their scale, these models rely on expansive datasets scraped from the web, which are characteristically unstructured and noisy, driving up the compute and data required for pre-training. To address this inefficiency, the paper introduces Web Rephrase Augmented Pre-training (WRAP), which prompts an existing instruction-tuned LLM to paraphrase web documents into specific styles. Jointly pre-training on the original and rephrased data makes learning substantially more compute- and data-efficient.
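
The rephrasing step is straightforward to picture. Below is a minimal sketch in Python, assuming a Hugging Face transformers text-generation pipeline and an off-the-shelf instruction-tuned model; the model name and prompt wording are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the WRAP rephrasing step: an off-the-shelf instruction-tuned
# model is prompted to paraphrase one web document in a target style.
# The model name and prompt wording are illustrative assumptions.
from transformers import pipeline

STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a concise, high-quality style like Wikipedia:",
    "qa": "Convert the following text into a question-answer format:",
}

# Any instruction-tuned chat/instruct model can serve as the rephraser.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def rephrase(document: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    prompt = f"{STYLE_PROMPTS[style]}\n\n{document}\n\nRephrased text:"
    out = rephraser(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```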

Efficacy of WRAP

The paper presents compelling evidence for the effectiveness of WRAP. On the naturally noisy C4 dataset, WRAP accelerates pre-training by roughly 3x. Within the same compute budget, it improves perplexity by more than 10% on average across the Pile's different subsets and raises zero-shot question-answering accuracy by more than 2% across 13 tasks. These gains stem from style diversity that mirrors the evaluation styles used downstream and from the higher quality of rephrased text compared to unfiltered web content.
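
As a rough illustration of the perplexity metric behind these comparisons, here is a hedged sketch that computes held-out token-level perplexity for a causal LM; the model name, tokenizer, and naive per-document batching are assumptions for illustration, not the paper's evaluation harness.

```python
# Hedged sketch of measuring held-out perplexity for a causal LM, the metric
# behind the Pile comparisons above. The model name is a placeholder.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def held_out_perplexity(model_name: str, texts, max_length: int = 1024) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
            if enc["input_ids"].size(1) < 2:
                continue  # nothing to predict for single-token inputs
            # The returned loss is the mean cross-entropy over predicted positions.
            loss = model(**enc, labels=enc["input_ids"]).loss
            n_pred = enc["input_ids"].size(1) - 1
            total_nll += loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```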

Impact of Rephrasing Style and Data Combination

The research also examines how different rephrasing styles affect LLM performance in out-of-distribution (OOD) settings. Rephrasing web text into diverse styles, such as simplified, Wikipedia-like, or Q&A formats, and pre-training on these alongside real data yields better OOD generalization. The paper further explores how real web text and synthetic rephrases should be combined, advocating a balanced mixture that keeps the model robust to noisy inputs without sacrificing the quality gains from rephrasing.
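
A minimal sketch of how such a mixture could be assembled is shown below; the 50/50 sampling ratio and the style names are illustrative assumptions, since the paper itself studies several style and mixing combinations.

```python
# Minimal sketch of interleaving real web text with style-diverse rephrases
# into a single pre-training stream. The synthetic fraction and style names
# are illustrative assumptions.
import random
from typing import Callable, Iterable, Iterator, Tuple

def mixed_stream(
    real_docs: Iterable[str],
    rephrase_fn: Callable[[str, str], str],
    styles: Tuple[str, ...] = ("simplified", "wikipedia", "qa"),
    synthetic_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[str]:
    rng = random.Random(seed)
    for doc in real_docs:
        if rng.random() < synthetic_fraction:
            # Synthetic rephrase: adds style diversity and cleaner text.
            yield rephrase_fn(doc, rng.choice(styles))
        else:
            # Raw web text: keeps the model robust to noisy inputs.
            yield doc
```

Here `rephrase_fn` could be the `rephrase` helper sketched earlier; in practice the synthetic fraction and style mix are the knobs the paper studies.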

Comparative Analysis and Future Directions

Situating its findings within the existing literature, the paper positions WRAP as a technique that mitigates the challenges of data curation, data scarcity, and pre-training compute. Comparisons with other pre-training setups, including those that use larger datasets and more compute, show WRAP's stronger performance on a range of benchmarks. Looking forward, the paper paves the way for more nuanced pre-training strategies, especially when data is scarce or when the goal is to extract more utility from the data that is available.

The paper emphasizes the advantages of synthetic data for pre-training while also noting its limitations, such as the cost of generating rephrases and the difficulty of ensuring content diversity. Nonetheless, WRAP illustrates the evolving landscape of LLM training and the interplay between data quality, model efficiency, and computational resources.
