
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining (2508.10975v1)

Published 14 Aug 2025 in cs.LG and cs.CL

Abstract: Recent advances in LLM pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.


Summary

  • The paper demonstrates that carefully curated and diverse synthetic data can outperform natural data, achieving up to 7.7× faster convergence in LLM training.
  • It employs both generator-driven and source rephrasing paradigms, with rigorous ablations highlighting the role of information density and stylistic diversity.
  • Efficient use of small-scale models and scalable infrastructure enables BeyondWeb to breach the data wall and democratize high-quality LLM pretraining.

BeyondWeb: Systematic Advances in Synthetic Data for Trillion-Scale LLM Pretraining

Introduction

The "BeyondWeb" paper presents a comprehensive empirical and methodological paper of synthetic data for LLM pretraining, culminating in the introduction of the BeyondWeb framework. The work addresses the data wall encountered in web-scale pretraining, where further scaling of natural data yields diminishing returns, and systematically investigates the mechanisms, design choices, and scaling properties of synthetic data generation. The authors provide strong evidence that carefully curated, diverse, and information-dense synthetic data can not only match but surpass the performance of models trained on natural data, and that naive synthetic data strategies are insufficient. The paper is notable for its rigorous ablations, multi-scale experiments, and clear articulation of the trade-offs and requirements for effective synthetic data curation.

Synthetic Data Paradigms: Generator-Driven vs. Source Rephrasing

The paper distinguishes two principal paradigms for synthetic data generation:

  • Generator-Driven: Large LLMs generate data de novo, encapsulating their parametric knowledge. This approach, exemplified by Cosmopedia and the Phi family, can distill knowledge into compact forms but is limited by generator cost, coverage, and risk of model collapse.
  • Source Rephrasing: Smaller LLMs rephrase existing web data into higher-quality, task-aligned formats (e.g., Q&A, instructional passages). This paradigm, as in WRAP and Nemotron-CC, is more scalable and cost-effective, and has become the dominant industry practice.

The authors argue, and empirically demonstrate, that source rephrasing can match or exceed generator-driven methods in downstream performance, especially when diversity and information density are prioritized.
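The source-rephrasing paradigm described above is simple to sketch in code. The snippet below is a minimal illustration, assuming a small open instruct model served locally through vLLM; the prompt wording, model choice, and sampling settings are assumptions for illustration and are not the actual BeyondWeb recipe.

```python
# Minimal source-rephrasing sketch (illustrative; not the BeyondWeb prompts or pipeline).
# Assumes a GPU machine with vLLM installed and access to a small instruct model.
from vllm import LLM, SamplingParams

REPHRASE_PROMPT = (
    "Rewrite the following web passage as a clear question-and-answer pair, "
    "keeping all factual content:\n\n{passage}\n\nQ&A:"
)

def rephrase(passages, model_name="meta-llama/Llama-3.2-3B-Instruct"):
    """Rephrase raw web passages into a Q&A-style format with a small LLM."""
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.7, max_tokens=512)
    prompts = [REPHRASE_PROMPT.format(passage=p) for p in passages]
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text.strip() for out in outputs]
```

At scale, the same prompt-and-generate loop simply runs over batched shards of the seed corpus, which is where the infrastructure described next comes in.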

The BeyondWeb Framework

BeyondWeb is a synthetic data generation pipeline that leverages targeted document rephrasing, format transformation, style modification, and content restructuring to produce diverse, information-dense synthetic corpora. The framework is built on a scalable, production-grade infrastructure (Ray + vLLM on Kubernetes), enabling efficient generation at trillion-token scale.

Key features include:

  • Diverse Generation Strategies: Multiple rephrasing and transformation prompts to avoid stylistic and topical homogeneity.
  • Quality-First Curation: Selection of high-quality web data as seed material, with rigorous filtering and deduplication.
  • Scalable Infrastructure: Distributed, parallelizable generation and experiment tracking, supporting heterogeneous compute environments.
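The paper names Ray and vLLM on Kubernetes as the generation stack but does not include pipeline code; the sketch below shows one plausible way to fan rephrasing work out across GPU workers with Ray actors. Worker count, model name, and batching are assumptions, not details from the paper.

```python
# Plausible distributed-rephrasing sketch with Ray + vLLM (not the actual BeyondWeb code).
import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)
class RephraseWorker:
    """One GPU-backed actor holding a small rephraser model."""
    def __init__(self, model_name="meta-llama/Llama-3.2-3B-Instruct"):
        self.llm = LLM(model=model_name)
        self.params = SamplingParams(temperature=0.7, max_tokens=512)

    def run(self, prompts):
        outputs = self.llm.generate(prompts, self.params)
        return [out.outputs[0].text for out in outputs]

def rephrase_corpus(prompt_batches, num_workers=8):
    """Round-robin prompt batches over a pool of GPU workers and gather results."""
    ray.init(ignore_reinit_error=True)
    workers = [RephraseWorker.remote() for _ in range(num_workers)]
    futures = [
        workers[i % num_workers].run.remote(batch)
        for i, batch in enumerate(prompt_batches)
    ]
    return [text for batch in ray.get(futures) for text in batch]
```

On Kubernetes, an actor pool like this would typically be scheduled via KubeRay or a similar operator, with experiment tracking handled separately.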

Empirical Results and Scaling Properties

Pareto Frontier and Training Efficiency

BeyondWeb establishes a new Pareto frontier for synthetic pretraining data. Notably, a 3B model trained on BeyondWeb data for 180B tokens outperforms all but one 8B baseline trained on the same token budget, and the 8B BeyondWeb model achieves up to 7.7× faster convergence than RedPajama and 2.7× faster than Nemotron-Synth (Figure 1).

Figure 1: BeyondWeb establishes a new Pareto frontier for synthetic pretraining data, with superior accuracy and training efficiency compared to state-of-the-art baselines.
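A note on how "7.7× faster training" should be read: such figures are usually tokens-to-target comparisons, i.e. the ratio of training tokens each dataset needs to reach the same average benchmark accuracy. The arithmetic below uses made-up token counts purely to illustrate the calculation; it is not a reproduction of the paper's measurement.

```python
# Tokens-to-target speedup illustration (hypothetical numbers, not the paper's data).
def training_speedup(baseline_tokens_to_target: float, candidate_tokens_to_target: float) -> float:
    """Speedup = tokens the baseline needs / tokens the candidate needs
    to reach the same target accuracy."""
    return baseline_tokens_to_target / candidate_tokens_to_target

# If an open-web baseline needed ~770B tokens to reach an accuracy that a
# synthetic dataset reaches after ~100B tokens, the reported speedup would be 7.7x.
print(training_speedup(770e9, 100e9))  # -> 7.7
```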

Information Density and Distillation

Ablation studies reveal that simple summarization-based rephrasing can match the performance of sophisticated generator-driven approaches (e.g., Cosmopedia), indicating that increased per-token information density is a primary driver of synthetic data benefits. However, BeyondWeb's more intentional, diverse strategies yield further substantial gains, demonstrating that synthetic data is not merely knowledge distillation (Figure 2).

Figure 2: Summarization-based rephrasing achieves similar gains as generator-driven Cosmopedia, but BeyondWeb's approach yields higher accuracy, highlighting the importance of intentional synthetic data design.
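The paper attributes much of the gain to per-token information density, though the summary above does not pin down a metric. As a crude, purely illustrative proxy (not from the paper), one can compare compressed bytes per token for raw versus summarized text, on the intuition that redundant text compresses away while dense text does not.

```python
# Crude information-density proxy (illustration only; not a metric from the paper):
# gzip-compressed bytes divided by whitespace token count.
import gzip

def bytes_per_token(text: str) -> float:
    tokens = text.split()
    compressed = gzip.compress(text.encode("utf-8"))
    return len(compressed) / max(len(tokens), 1)

raw = ("The cat sat. The cat sat on the mat. The cat, which sat on the mat, "
       "was a cat that sat on the mat.")
summary = "A cat sat on a mat."
print(bytes_per_token(raw), bytes_per_token(summary))  # the summary typically scores higher per token
```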

Surpassing the Data Wall

Controlled experiments show that naive synthetic data generation (e.g., simple continuations) provides only marginal improvements over data repetition and cannot breach the data wall. In contrast, BeyondWeb's strategic synthetic data generation significantly exceeds the performance ceiling of natural data, with a +4.2pp improvement over the full-data upper bound (Figures 3 and 4).

Figure 3: Experimental design for data splitting and corpus construction, isolating the effects of repetition versus synthetic augmentation.

Figure 4: BeyondWeb's synthetic data (dark blue) significantly outperforms both naive continuation and full natural data, demonstrating the necessity of thoughtful synthetic data design.
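The controlled comparison behind Figures 3 and 4 rests on building token-matched training corpora from a limited pool of natural data: one corpus reaches the token budget by repetition, the other by topping the pool up with synthetic rephrasings. The sketch below shows only that bookkeeping, with whitespace splitting standing in for a real tokenizer; it is not the paper's exact construction.

```python
# Token-matched corpus construction sketch (illustrative; not the paper's exact setup).
from itertools import cycle

def count_tokens(docs):
    # Whitespace tokenization as a stand-in for a real tokenizer.
    return sum(len(d.split()) for d in docs)

def repeat_to_budget(natural_docs, token_budget):
    """Baseline corpus: cycle through the limited natural pool until the budget is met."""
    corpus, total = [], 0
    for doc in cycle(natural_docs):
        if total >= token_budget:
            break
        corpus.append(doc)
        total += len(doc.split())
    return corpus

def augment_to_budget(natural_docs, synthetic_docs, token_budget):
    """Treatment corpus: keep the natural data once, fill the rest with synthetic rephrasings."""
    corpus = list(natural_docs)
    total = count_tokens(corpus)
    for doc in synthetic_docs:
        if total >= token_budget:
            break
        corpus.append(doc)
        total += len(doc.split())
    return corpus
```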

Quality, Style, and Diversity

  • Seed Data Quality: Rephrasing high-quality web data yields larger gains than using low-quality sources, but high-quality input alone is insufficient for optimal synthetic data.
  • Style Matching: Upsampling conversational data improves downstream performance, but gains saturate beyond 20% conversational content, indicating that style alignment is necessary but not sufficient (a ratio-targeting sketch follows the figures below).
  • Diversity: Diverse generation strategies are critical for sustained improvements, especially at large training budgets. Single-strategy approaches plateau, while BeyondWeb maintains positive learning curves across scales.

Figure 5: Performance comparison across different quality combinations in training data, showing the importance of high-quality seed data.

Figure 6: Effect of conversational data ratio on final accuracy, with gains saturating beyond 20% conversational content.

Figure 7: Training dynamics across model scales, with BeyondWeb maintaining consistent improvements and avoiding overfitting in overtrained regimes.
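As flagged in the Style Matching bullet above, hitting a target conversational share such as the ~20% in Figure 6 amounts to upsampling conversational documents until they account for the desired fraction of total tokens. The sketch below illustrates that ratio targeting under simple assumptions (sampling with replacement, whitespace token counts); it is not the paper's mixing code.

```python
# Ratio-targeted data mixing sketch (illustrative; not the paper's recipe).
import random

def mix_to_ratio(web_docs, conv_docs, target_conv_ratio=0.2, seed=0):
    """Upsample conversational docs until they make up ~target_conv_ratio of all tokens."""
    rng = random.Random(seed)
    web_tokens = sum(len(d.split()) for d in web_docs)
    # Solve conv / (conv + web) == r  =>  conv == r * web / (1 - r).
    needed = target_conv_ratio * web_tokens / (1.0 - target_conv_ratio)

    mixed, conv_tokens = list(web_docs), 0
    while conv_docs and conv_tokens < needed:
        doc = rng.choice(conv_docs)  # sample with replacement, i.e. upsampling
        mixed.append(doc)
        conv_tokens += len(doc.split())
    rng.shuffle(mixed)
    return mixed
```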

Generator Family and Size

  • Model Family Robustness: Synthetic data benefits are consistent across generator model families; the general language-modeling ability of the rephraser does not predict synthetic data quality.
  • Size Saturation: Increasing rephraser size from 1B to 3B yields substantial gains, but improvements plateau beyond 3B, making small models a cost-effective choice for synthetic data generation.

Figure 8: Synthetic data benefits are consistent across generator model families, and generator model quality does not predict synthetic data quality.

Figure 9: Effect of generator model size on synthetic data quality, with diminishing returns beyond 3B parameters.

Implementation Considerations

  • Infrastructure: Distributed, containerized generation pipelines (Ray + vLLM on Kubernetes) are essential for scaling synthetic data generation to trillions of tokens.
  • Data Curation: Rigorous filtering, deduplication, and quality control of both seed and synthetic data are necessary to avoid redundancy and maximize information density (a minimal deduplication sketch follows this list).
  • Prompt Engineering: Diverse, targeted prompts for rephrasing and transformation are required to avoid stylistic collapse and ensure broad coverage.
  • Resource Requirements: Effective synthetic data generation does not require massive generator models; 3B-parameter LLMs suffice for most rephrasing tasks, reducing compute costs.
  • Evaluation: Multi-task, multi-shot evaluation suites are necessary to robustly assess generalization and avoid overfitting to specific benchmarks.
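The Data Curation point above mentions filtering and deduplication; the paper does not publish its curation stack, so the sketch below shows only the simplest building block, exact deduplication on normalized text. Fuzzy approaches (e.g. MinHash-based near-deduplication) would layer on top of this and are not shown.

```python
# Exact-deduplication sketch (illustrative; the paper's full filtering pipeline is
# more involved and not reproduced here).
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before hashing.
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```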

Implications and Future Directions

The findings have several practical and theoretical implications:

  • Synthetic Data as a Primary Modality: High-quality, diverse synthetic data can replace or augment natural data in pretraining, enabling efficient scaling beyond the data wall.
  • Democratization: The ability to use small, open-source models for rephrasing lowers the barrier to entry for high-quality LLM pretraining.
  • Alignment and Control: Synthetic data generation offers greater control over style, content, and alignment, with potential for explicit value alignment during pretraining.
  • Scaling Laws: The intrinsic repetition and diversity properties of synthetic data require new theoretical frameworks for scaling analysis.
  • Domain Generalization: The source rephrasing paradigm is not limited to web data or text, and can be extended to domain-specific or multimodal corpora.

Conclusion

BeyondWeb provides a rigorous, scalable, and empirically validated framework for synthetic data generation in LLM pretraining. The work demonstrates that there is no single "silver bullet" for synthetic data; rather, strong outcomes require joint optimization of data quality, style alignment, diversity, and infrastructure. The results challenge the conventional reliance on ever-larger models and natural data, and establish synthetic data curation as a central pillar of future LLM development (Figures 10 and 11).

Figure 10: Learning curves across model scales, with BeyondWeb maintaining superior performance throughout training.

Figure 11: BeyondWeb's significant advantages over state-of-the-art baselines across three model scales, establishing a new Pareto frontier for synthetic pretraining data.
