Synthetic Continued Pretraining
The research paper titled "Synthetic Continued Pretraining" addresses the challenge of data inefficiency in language model (LM) pretraining, particularly for small, domain-specific corpora. The authors propose a method termed "synthetic continued pretraining" to overcome the data inefficiency of simply continuing pretraining on such small corpora. This essay provides an expert-level overview of the paper, highlighting its methods, results, and implications.
Introduction
Modern LMs have demonstrated remarkable capabilities in acquiring knowledge from large-scale, unstructured text corpora. Despite these successes, the knowledge acquisition process is strikingly data-inefficient: models appear to need exposure to hundreds or thousands of different representations of a fact before learning it reliably. This inefficiency is particularly problematic when adapting LMs to niche domains with limited textual data, where each fact may appear only a handful of times.
Methodology
The paper introduces a novel approach called synthetic continued pretraining: first synthesize a large, diverse corpus from a small domain-specific corpus, then continue pretraining the LM on that synthetic corpus. The authors instantiate this approach with EntiGraph, a synthetic data augmentation algorithm that constructs new text by analyzing relationships among entities extracted from the small corpus. EntiGraph proceeds in three steps (a minimal code sketch follows the list):
- Entity Extraction: The algorithm begins by extracting salient entities from the source document.
- Entity Description: It then generates detailed descriptions for each extracted entity.
- Relation Analysis: Finally, it generates synthetic text describing the relationships between various pairs and subsets of entities.
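To make the procedure concrete, here is a minimal sketch of an EntiGraph-style prompting loop, assuming an OpenAI-style chat-completion client. The prompt wording, the pair-sampling cap, and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an EntiGraph-style augmentation loop.
# Assumes an OpenAI-style chat client; prompts and helper names are
# illustrative, not the authors' exact implementation.
from itertools import combinations
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-turbo"  # the prompted model used in the paper


def ask(prompt: str) -> str:
    """Single chat-completion call returning the text of the reply."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def entigraph(document: str, max_pair_samples: int = 50) -> list[str]:
    synthetic_texts = []

    # Step 1: entity extraction — list the salient entities in the document.
    entity_list = ask(
        "List the salient entities (people, places, objects, concepts) "
        f"in the following document, one per line:\n\n{document}"
    )
    entities = [e.strip() for e in entity_list.splitlines() if e.strip()]

    # Step 2: entity descriptions — one grounded description per entity.
    for entity in entities:
        synthetic_texts.append(ask(
            f"Using only the document below, describe '{entity}' in detail."
            f"\n\n{document}"
        ))

    # Step 3: relation analysis — discuss how sampled pairs of entities relate.
    for a, b in list(combinations(entities, 2))[:max_pair_samples]:
        synthetic_texts.append(ask(
            "Using only the document below, analyze the relationship "
            f"between '{a}' and '{b}'.\n\n{document}"
        ))

    return synthetic_texts
```

The key design point is diversity: rather than paraphrasing the source, the loop forces the prompted model to write about many different entity combinations, each grounded in the original document.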
Experimental Setup
The authors evaluate their method on a corpus drawn from the QuALITY dataset, composed of 265 documents totaling 1.3M tokens. They use GPT-4-turbo to generate a 600M-token synthetic corpus with EntiGraph and then continue pretraining Llama 3 8B on it, as sketched below.
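For concreteness, a minimal continued-pretraining setup in the HuggingFace Trainer style might look like the following. The data file name, sequence length, and hyperparameters are placeholder assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of continued pretraining on the synthetic corpus.
# Model name is the public Llama 3 8B checkpoint; the data file,
# sequence length, and hyperparameters here are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Synthetic EntiGraph documents, one per line (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "entigraph_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="entigraph-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```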
Results
The experimental results are compelling. On closed-book question answering (QA) over the QuALITY test set, the Llama 3 8B model continually pretrained on the EntiGraph corpus reaches 56.42% accuracy, a substantial improvement over the base model's 39.49%. Moreover, accuracy scales roughly log-linearly with the number of synthetic tokens up to 600M. By contrast, continued pretraining on the raw corpus, or on simple paraphrases of it, yields much smaller gains.
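Read literally, the log-linear claim says that accuracy grows roughly linearly in the logarithm of the synthetic token count over the measured range; schematically, with illustrative constants α and β:

```latex
% Illustrative rendering of "log-linear scaling": over the measured range,
% accuracy grows roughly linearly in log n, where n is the number of
% synthetic tokens used for continued pretraining.
\[
  \mathrm{Acc}(n) \;\approx\; \alpha + \beta \log n .
\]
```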
Theoretical Analysis
To better understand why the method helps, the authors develop a mathematical model that treats knowledge acquisition as a stochastic process on a graph of entities, arguing that synthetic augmentation via EntiGraph "rearranges" the corpus's knowledge into a layout that is more amenable to learning. The model predicts a mixture-of-exponential scaling trend for accuracy, which closely matches the empirical curve.
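Schematically, and as an illustrative rendering rather than the paper's exact statement, a mixture-of-exponential curve has the following shape: accuracy approaches a ceiling as a weighted sum of exponentially decaying error terms vanishes, producing fast initial growth, an extended log-linear-looking middle regime, and an eventual plateau.

```latex
% Illustrative mixture-of-exponential form: accuracy as a function of the
% number of synthetic tokens n approaches a ceiling C as weighted
% exponential error terms decay at different rates.
\[
  \mathrm{Acc}(n) \;\approx\; C \;-\; \sum_{k} a_k \, e^{-\lambda_k n},
  \qquad a_k \ge 0,\quad \lambda_k > 0 .
\]
```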
Open-book Experiments
In open-book settings, where the source documents are accessible at test time, the authors demonstrate that synthetic continued pretraining is complementary to retrieval-augmented generation (RAG): the EntiGraph-pretrained model outperforms its base counterpart when both are used within the same state-of-the-art RAG pipeline.
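To illustrate how the two approaches compose, the sketch below wraps a generic retriever around the continually pretrained model: chunks of the source documents are retrieved for each query and prepended to the prompt. The TF-IDF retriever, prompt format, and checkpoint path are placeholder assumptions, not the paper's specific RAG configuration.

```python
# Hedged sketch of an open-book (RAG) pipeline around an EntiGraph-
# pretrained checkpoint. The TF-IDF retriever and prompt format are
# generic placeholders, not the paper's specific RAG configuration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModelForCausalLM, AutoTokenizer


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k corpus chunks most similar to the query (TF-IDF cosine)."""
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    doc_vecs = vectorizer.transform(chunks)
    query_vec = vectorizer.transform([query])
    scores = (doc_vecs @ query_vec.T).toarray().ravel()
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]


def answer(query: str, chunks: list[str],
           model_dir: str = "entigraph-cpt") -> str:
    """Prepend retrieved context to the prompt and generate an answer."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    context = "\n\n".join(retrieve(query, chunks))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```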
Instruction Following and Practical Implications
The authors show that the benefits of their pretraining method generalize beyond QA to instruction following. After instruction tuning, models continually pretrained on EntiGraph data can answer open-ended queries about the source documents and produce summaries of them without having the documents in context. This implies that synthetic training data can effectively impart detailed domain-specific knowledge to LMs in a form that transfers across task formats.
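As a usage-level illustration only, querying such a model closed-book could look like the following; the checkpoint name and the prompt are hypothetical placeholders, and the article title is left generic rather than taken from the paper.

```python
# Hypothetical closed-book query to an instruction-tuned, EntiGraph-
# pretrained checkpoint; no source document is placed in the context.
from transformers import pipeline

generator = pipeline("text-generation", model="entigraph-cpt-instruct")
prompt = ("Without being shown the text, summarize the QuALITY article "
          "titled '<article title>' in a few paragraphs.")
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```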
Conclusion and Future Directions
This paper presents synthetic continued pretraining as a viable remedy for the inefficiency of LMs learning from small corpora. By generating diverse synthetic representations of the underlying knowledge, the EntiGraph algorithm allows LMs to achieve higher accuracy and better generalization on domain-specific tasks. Future research could explore scaling the approach further, developing even more diverse synthetic data augmentation methods, and applying it to other domain-specific corpora.
Implications and Speculation
On a broader scale, synthetic continued pretraining represents a strategic method for maximizing the utility of increasingly scarce high-quality text data. As LMs continue to scale in size and capability, ensuring they can effectively learn from small but valuable corpora will be critical. If successfully adapted to general pretraining regimes, this approach could help maintain the pace of progress in LM capabilities even as traditional data sources become exhausted.
In summary, the method of synthetic continued pretraining proposed by the authors provides a critical advancement in efficiently leveraging small, domain-specific corpora for continued LM pretraining, demonstrating significant improvements in performance and data efficiency. This work paves the way for future explorations into synthetic data generation, offering promising avenues for enhancing the effectiveness of LMs in various specialized domains.