Scaling Laws of Synthetic Data for Language Models (2503.19551v2)

Published 25 Mar 2025 in cs.CL and cs.AI

Abstract: LLMs achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

The investigation into scaling laws for synthetic data in LLMs addresses the critical challenge of diminishing high-quality organic web data for pre-training. The paper "Scaling Laws of Synthetic Data for Language Models" (Qin et al., 25 Mar 2025) systematically explores whether synthetic datasets exhibit predictable scalability akin to traditional pre-training corpora, introducing the SynthLLM framework for generating such data at scale and validating its scaling properties, particularly in the domain of mathematical reasoning.

The SynthLLM Framework for Synthetic Data Generation

SynthLLM is presented as a scalable methodology designed to convert existing pre-training corpora into diverse, high-quality synthetic datasets without extensive reliance on human-annotated seeds. The framework operates through a sequence of stages:

  1. Reference Document Filtering: This initial stage curates a high-quality, domain-specific subset of documents from a larger corpus (e.g., Fineweb-Edu) using an iterative classification approach. First, a "cold-start" classifier, trained on synthetically generated positive examples (derived from syllabi) and random negative examples, identifies an initial relevant set $\mathcal{D}^0$. A "fine-grained" classifier is then refined iteratively: in each iteration $t$, documents from $\mathcal{D}^{t-1}$ and newly sampled documents $\mathcal{D}^{t-1}_{R}$ are rated by a powerful LLM (GPT-4o) on relevance, clarity, and quality. A classifier (e.g., a random forest on document embeddings) is trained on these ratings and applied to the full corpus to produce $\mathcal{D}^{t}$, retaining documents that exceed a quality threshold (e.g., 6.5). Two iterations are typically sufficient; a minimal sketch of this filtering loop appears after this list.
  2. Document-Grounded Question Generation: Leveraging the filtered document set $\mathcal{D}$, questions (prompts) are generated using an LLM designated as $\mathbf{M}^Q$ (e.g., Mistral-Large-Instruct). Three distinct generation levels are proposed to progressively increase diversity and scalability:
    • Level 1: Extracts existing questions or generates new ones based solely on the content of a single document. This approach is similar to prior work but is limited by the availability of documents inherently containing questions.
    • Level 2: Introduces concept decomposition and recombination within a single document. Key topics/concepts $\mathbf{K}$ are extracted from a document $d$. A random subset $\mathbf{k}_s \subset \mathbf{K}$ is sampled and combined to guide the generation of a new question, which remains grounded in the original document $d$. This enhances diversity beyond simple extraction.
    • Level 3: Targets maximum diversity and scalability through multi-document concept recombination. A global concept graph $G$ is constructed from topic/concept co-occurrence across all documents in $\mathcal{D}$. Concept combinations $\mathbf{K}^g$ are sampled via random walks on $G$. The framework then retrieves the top-$k$ (e.g., $k=2$) most relevant documents from $\mathcal{D}$ (using Jaccard similarity over concepts) to serve as grounding references for generating questions based on $\mathbf{K}^g$. This allows synthesis of information spanning multiple source documents; a sketch of this graph-based sampling also follows the list.
  3. Answer Generation: Corresponding answers (responses) for the generated questions are produced using a separate, capable LLM designated as $\mathbf{M}^{A}$ (e.g., Qwen2.5-Math-72B in the paper's experiments). While the presented work did not incorporate sophisticated answer verification steps such as voting or self-critique, such methods could potentially be integrated to further enhance data quality.
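
The iterative filtering loop in stage 1 can be sketched compactly. This is a minimal illustration, assuming hypothetical helpers `rate_with_llm` (an LLM quality rater returning a 1-10 score) and `embed` (a document embedding function); the paper's actual prompts, embedding model, and classifier configuration may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_filter(corpus, seed_ids, rate_with_llm, embed,
                     iterations=2, threshold=6.5, sample_size=1000):
    """corpus: dict of doc_id -> text; seed_ids: documents kept by the cold-start classifier."""
    selected = set(seed_ids)
    rng = np.random.default_rng(0)
    all_ids = list(corpus)
    for _ in range(iterations):
        # Rate a mix of currently selected and freshly sampled documents (1-10 quality score).
        fresh = rng.choice(all_ids, size=min(sample_size, len(all_ids)), replace=False)
        rated = list(selected)[:sample_size] + list(fresh)
        scores = [rate_with_llm(corpus[i]) for i in rated]

        # Train a regressor on document embeddings to imitate the LLM's quality ratings.
        X = np.stack([embed(corpus[i]) for i in rated])
        model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, scores)

        # Apply the regressor to the full corpus and keep documents above the quality threshold.
        preds = model.predict(np.stack([embed(corpus[i]) for i in all_ids]))
        selected = {i for i, p in zip(all_ids, preds) if p >= threshold}
    return selected
```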
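
The Level-3 concept recombination can likewise be illustrated. The sketch below assumes concepts have already been extracted per document; the graph construction details, walk length, and function names are illustrative choices rather than the paper's exact implementation.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_concept_graph(doc_concepts):
    """doc_concepts: dict of doc_id -> set of concept strings; edges are co-occurrences."""
    graph = defaultdict(set)
    for concepts in doc_concepts.values():
        for a, b in combinations(sorted(concepts), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_concept_combination(graph, walk_length=3):
    """Sample a concept combination K^g via a random walk on the co-occurrence graph."""
    walk = [random.choice(list(graph))]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]] - set(walk)
        if not neighbors:
            break
        walk.append(random.choice(sorted(neighbors)))
    return set(walk)

def retrieve_references(concept_set, doc_concepts, k=2):
    """Pick the top-k grounding documents by Jaccard similarity over concept sets."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    ranked = sorted(doc_concepts, key=lambda d: jaccard(concept_set, doc_concepts[d]), reverse=True)
    return ranked[:k]

# Usage: the sampled concepts plus the retrieved reference documents would then be
# passed to the question-generation LLM (M^Q) as grounding context.
```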

Empirical Validation of Scaling Laws

Extensive experiments were performed, primarily fine-tuning Llama-3.2 (1B, 3B) and Llama-3.1 (8B) models on synthetic mathematical reasoning data generated by SynthLLM. The key finding is that SynthLLM-generated data reliably adheres to the Rectified Scaling Law across different model sizes. This law, formulated as $L(D) = \frac{B}{D_l + D^\beta} + E$, models the fine-tuning loss $L(D)$ as a function of the fine-tuning dataset size $D$. It includes a parameter $D_l$ representing the effective data size contributed by pre-training knowledge, $B$ related to the initial reducible loss, $\beta$ the data scaling exponent, and $E$ the irreducible loss.

Crucially, the simpler power law often used for pre-training scaling, $L(D) = \frac{B}{D^\beta} + E$, failed to provide an accurate fit to the empirical loss observed during fine-tuning on synthetic data. This highlights the importance of the $D_l$ term in the rectified law, which accounts for the substantial knowledge already embedded in the models through pre-training. The rectified scaling law demonstrated strong predictive power, accurately forecasting model performance on held-out data points corresponding to the larger dataset sizes reserved for validation.
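
As a concrete illustration of the fitting procedure, the following minimal sketch estimates the rectified-law parameters with `scipy.optimize.curve_fit`. The (dataset size, loss) observations and the initial guess are synthetic placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_law(D, B, D_l, beta, E):
    # Rectified scaling law: L(D) = B / (D_l + D^beta) + E
    return B / (D_l + D**beta) + E

# Hypothetical (tokens, loss) observations used only to demonstrate the fit.
D_obs = np.array([1e9, 4e9, 1.6e10, 6.4e10, 2.56e11])
L_obs = np.array([0.723, 0.635, 0.580, 0.547, 0.527])

# Fit the four parameters, starting from a plausible initial guess.
params, _ = curve_fit(rectified_law, D_obs, L_obs,
                      p0=[1000.0, 1000.0, 0.4, 0.5], maxfev=20000)
B, D_l, beta, E = params
print(f"beta = {beta:.3f}, irreducible loss E = {E:.3f}")

# The fitted curve can then forecast performance at larger, held-out budgets.
print("predicted L(1e12) =", rectified_law(1e12, *params))
```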

Key Scaling Parameters and Predictions

Analysis of the fitted rectified scaling law curves yielded several important quantitative insights:

  • Performance Plateau: Extrapolation based on the scaling law suggests that performance gains significantly diminish as the synthetic dataset size approaches approximately 300 billion tokens. This indicates a potential saturation point for the utility of adding more SynthLLM-generated data in this specific fine-tuning context (mathematical reasoning).
  • Data Efficiency and Model Size: Larger models exhibit greater data efficiency in leveraging synthetic data. The 8B parameter model was projected to reach its peak performance (minimum loss) with approximately 1 Trillion synthetic tokens, whereas the 3B parameter model required substantially more, around 4 Trillion tokens, to reach its optimum.
  • Scaling Exponent ($\beta$): The rate at which error decreases with increasing data size varied with model scale. The 1B model showed a smaller $\beta$ (0.34) than the 3B and 8B models (which shared a similar $\beta$). This suggests that smaller models derive comparatively less benefit from larger amounts of synthetic fine-tuning data.
  • Initial Performance ($B/D_l$): Larger models displayed better initial performance before synthetic fine-tuning, reflected in a smaller $\frac{B}{D_l}$ ratio. This aligns with the expectation that larger pre-trained models possess more relevant knowledge (i.e., an effectively larger $D_l$). An illustrative extrapolation sketch follows this list.
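
To show how such extrapolations are read off a fitted curve, the snippet below scans token budgets until the predicted loss comes within 1% of its irreducible floor, a crude stand-in for the plateau estimate. The coefficients are assumed placeholders, not the paper's fitted values, so the printed budget is purely illustrative.

```python
import numpy as np

def rectified_law(D, B, D_l, beta, E):
    return B / (D_l + D**beta) + E

# Assumed example coefficients; not the values fitted in the paper.
B, D_l, beta, E = 2000.0, 1000.0, 0.5, 0.5

# Scan budgets from 1e9 to 1e13 tokens and report where the predicted loss is
# within 1% of the irreducible loss E, i.e. where further data adds little.
for D in np.logspace(9, 13, 400):
    if rectified_law(D, B, D_l, beta, E) - E < 0.01 * E:
        print(f"predicted loss within 1% of its floor near {D:.2e} tokens")
        break
```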

Comparative Performance and Ablation Studies

SynthLLM was benchmarked against existing synthetic math datasets and augmentation techniques:

  • Comparison with Other Synthetic Datasets: Models fine-tuned on SynthLLM data (7.4M samples generated) consistently outperformed models trained on other prominent synthetic math datasets, including MAmmoTH2, JiuZhang3.0 (conceptually similar to SynthLLM Level-1), and the larger OpenMathInstruct-2 (14M seed-based samples). SynthLLM-trained models demonstrated superior performance across various benchmarks like MATH, GSM8K, Minerva, OlympiadBench, College Math, and Gaokao-Math. Notably, an 8B model fine-tuned on 3.2M SynthLLM samples achieved comparable results to one trained on 14M OpenMathInstruct-2 samples on in-distribution benchmarks (MATH, GSM8K) and significantly better results on out-of-distribution (OOD) benchmarks, indicating improved generalization.
  • Comparison with Data Augmentation: Ablation studies were conducted using a limited reference set (2,000 documents). SynthLLM's Level-2 (intra-document concept recombination) and Level-3 (inter-document concept recombination via graph) methods demonstrated superior scalability compared to applying standard augmentation techniques (like rephrasing or persona augmentation) to Level-1 extracted questions. While augmentation methods plateaued quickly in performance as more questions were generated per document, SynthLLM's methods continued to yield improvements up to 150 questions per document, highlighting their efficiency in extracting value from limited seed data.

These results collectively suggest that the concept-recombination strategies employed by SynthLLM (Levels 2 and 3) are more effective for generating diverse and high-quality synthetic data at scale compared to simpler extraction or standard augmentation methods.

Conclusion

The paper provides compelling evidence that synthetic data, generated via a structured and scalable framework like SynthLLM, adheres to predictable scaling laws, specifically the rectified scaling law when used for fine-tuning pre-trained models. SynthLLM's methodology, particularly its use of concept extraction, recombination, and graph-based multi-document synthesis, proves effective in generating high-quality, diverse synthetic data that leads to strong performance, surpassing other synthetic data generation and augmentation techniques in the mathematical domain. These findings establish synthetic data as a viable and scalable resource for continued LLM development, offering a pathway to mitigate reliance on finite organic data sources while enabling predictable performance improvements through principled data scaling.

Authors (13)
  1. Zeyu Qin (16 papers)
  2. Qingxiu Dong (39 papers)
  3. Xingxing Zhang (65 papers)
  4. Li Dong (154 papers)
  5. Xiaolong Huang (29 papers)
  6. Ziyi Yang (77 papers)
  7. Mahmoud Khademi (17 papers)
  8. Dongdong Zhang (79 papers)
  9. Hany Hassan Awadalla (24 papers)
  10. Yi R. Fung (31 papers)
  11. Weizhu Chen (128 papers)
  12. Minhao Cheng (43 papers)
  13. Furu Wei (291 papers)