
Nemotron-Synth: Synthetic Data for LLM Pretraining

Updated 18 August 2025
  • Nemotron-Synth is a high-quality synthetic dataset curated for LLM pretraining, integrating rigorous extraction and classifier-guided quality assessment.
  • Its multi-stage pipeline combines enhanced token yield extraction, fuzzy deduplication, and targeted synthetic rephrasing to maximize data diversity and clarity.
  • The dataset serves as a benchmark for scalable synthetic curation, demonstrating improved model performance and efficiency over previous frameworks.

Nemotron-CC’s high-quality synthetic subset, commonly referred to as Nemotron-Synth, represents an advanced, rigorously curated synthetic data asset for LLM pretraining. Nemotron-Synth is the outcome of a multi-stage pipeline that optimizes both token diversity and content quality, integrating classifier-based document selection with targeted synthetic rephrasing, all derived from the expansive Nemotron-CC Common Crawl corpus. Across relevant literature, Nemotron-Synth serves both as the canonical demonstration of scalable synthetic dataset curation (Su et al., 3 Dec 2024) and as a reference baseline for subsequent frameworks such as BeyondWeb (Maini et al., 14 Aug 2025). It is instrumental for model alignment, instruction tuning, and long-horizon pretraining regimes.

1. Pipeline Foundations and Data Extraction

Nemotron-Synth is generated atop Nemotron-CC, a dataset built from raw Common Crawl HTML pages. Extraction employs jusText as the preferred HTML-to-text converter; empirical findings indicate that jusText yields approximately 28–57% more high-quality tokens than alternatives such as Trafilatura (Su et al., 3 Dec 2024). The extraction phase is designed to maximize raw document yield, applying only minimal filtering upfront to retain otherwise removable tokens in high-quality segments. Fuzzy deduplication and exact substring deduplication are then applied to promote unique token diversity across the dataset.
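
Fuzzy deduplication at this scale is typically implemented with MinHash signatures over document shingles. The following is a minimal, illustrative sketch; the actual Nemotron-CC implementation is not specified here, and the shingle size and signature length are arbitrary choices:

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over all shingles of the document."""
    sig = []
    for seed in range(num_hashes):
        min_val = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        )
        sig.append(min_val)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "an entirely unrelated sentence about synthetic data pipelines"

s1, s2, s3 = (minhash_signature(d) for d in (doc1, doc2, doc3))
print(estimated_jaccard(s1, s2) > estimated_jaccard(s1, s3))  # near-duplicates score higher
```

Documents whose estimated similarity exceeds a chosen threshold would be collapsed to a single representative, which is what preserves unique-token diversity downstream.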

The initial focus on raw token yield reflects the finding that previous aggressive filtering regimes, such as those used in DCLM or FineWeb-Edu, discard up to 90% of total data, ultimately limiting attainable token horizons for long-run pretraining (Su et al., 3 Dec 2024). By contrast, Nemotron-CC’s pipeline retains a greater proportion of unique documents and tokens, creating a more fertile substrate for downstream quality selection and synthetic transformation.

2. Classifier-Based Quality Labeling and Ensembling

Document quality assessment within Nemotron-CC is accomplished using an ensemble of model-based classifiers, each trained on a distinct annotated dataset emphasizing a different facet of quality (educational value, informativeness, etc.) (Su et al., 3 Dec 2024). Each classifier outputs a real-valued score for every document, which is rounded to an integer from 0 to 19. The ensemble procedure defines the aggregate document score as

S_{\mathrm{document}} = \max\{s_1, s_2, s_3\}

where s_1, s_2, and s_3 are the classifier-derived scores. The resultant integer scores stratify documents into 20 fine-grained buckets, which are further reduced to 5 global quality tiers via an annealing optimization reflecting empirical downstream model performance during pretraining. These granular labels are essential for calibrating subsequent rephrasing strategies and distributional synthetic augmentation.
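
The max-ensemble and bucket-to-tier mapping can be sketched as follows. Note that the bucket-to-tier cut points below are hypothetical: the paper derives its 5 tiers via an annealing optimization, not fixed boundaries.

```python
def ensemble_score(s1, s2, s3):
    """Aggregate classifier scores by taking the maximum, per
    S_document = max{s1, s2, s3}; inputs are integers in [0, 19]."""
    return max(s1, s2, s3)

# Hypothetical bucket -> tier boundaries, for illustration only.
TIER_BOUNDS = [4, 8, 12, 16]  # buckets 0-3 -> tier 0, ..., 16-19 -> tier 4

def quality_tier(bucket):
    """Map one of the 20 fine-grained buckets to one of 5 coarse tiers."""
    return sum(bucket >= b for b in TIER_BOUNDS)

for scores in [(3, 11, 7), (18, 2, 5), (0, 1, 2)]:
    b = ensemble_score(*scores)
    print(f"scores={scores} bucket={b} tier={quality_tier(b)}")
```

Taking the maximum rather than the mean means a document is kept at high quality if any one classifier's notion of quality rates it highly, which favors recall of useful documents over strict consensus.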

3. Synthetic Data Rephrasing and Subset Construction

Nemotron-Synth is distinguished by a targeted synthetic rephrasing stage performed on documents selected by the classifier ensemble (Su et al., 3 Dec 2024). For documents labeled as low-quality, rephrasing is employed with prompts such as “Wikipedia style,” seeking to remove noise, redundancy, and formatting defects while retaining salient information. This produces an improved parallel corpus, demonstrably reducing model perplexity and enhancing downstream accuracy.

High-quality documents undergo a suite of more sophisticated synthetic transformations, using four core prompt modes:

  • Diverse Question–Answer Pairs: Transforming text into QA pairs spanning factual, yes/no, multiple-choice, and open-ended questions.
  • Distill: Compressing the document into a shorter, clearer passage.
  • Extract Knowledge: Selecting essential details while discarding non-informative and verbose material.
  • Knowledge List: Reformatting key points into a structured, compact list.

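These modes could be organized as a small prompt registry, as in the sketch below. The prompt wording is illustrative only and does not reproduce the actual Nemotron-CC prompts:

```python
# Illustrative prompt templates for the rephrasing modes; the exact
# wording used in the Nemotron-CC pipeline is not reproduced here.
REPHRASE_PROMPTS = {
    "wikipedia": (  # applied to low-quality documents
        "Rewrite the following text in a clear, encyclopedic Wikipedia "
        "style, removing noise and redundancy:\n\n{doc}"
    ),
    "diverse_qa": (
        "Generate diverse question-answer pairs (factual, yes/no, "
        "multiple-choice, open-ended) covering the text:\n\n{doc}"
    ),
    "distill": "Compress the following document into a shorter, clearer passage:\n\n{doc}",
    "extract_knowledge": (
        "Extract the essential facts, discarding non-informative or verbose material:\n\n{doc}"
    ),
    "knowledge_list": "Reformat the key points of the text as a structured, compact list:\n\n{doc}",
}

def build_prompt(mode, document):
    """Fill the template for the requested rephrasing mode."""
    return REPHRASE_PROMPTS[mode].format(doc=document)

print(build_prompt("distill", "Common Crawl is a public web archive."))
```

The quality tier of a document then determines which subset of modes is applied: "wikipedia" for low-quality documents, the other four for high-quality ones.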
These rephrasing operations are executed by an instruct-tuned model (e.g., Mistral NeMo 12B). Importantly, long documents are split into token-bounded segments to avoid oversimplification during generation. The resulting subset contains approximately 1.8 trillion synthetic tokens in total: roughly 336B from the rephrased low-quality corpus and roughly 1.5T from high-quality diversified outputs. The name Nemotron-Synth refers to this collection of synthetic outputs derived from the classifier-selected portion of Nemotron-CC, merged alongside the retained original tokens.
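
The token-bounded segmentation step can be sketched as follows. A real pipeline would count tokens with the instruct model's own tokenizer; this illustration uses whitespace-separated words as a stand-in, and the 512 budget is an arbitrary example:

```python
def segment_document(text, max_tokens=512):
    """Split a document into segments of at most max_tokens whitespace
    tokens, so each segment fits within the rephrasing model's budget
    and long inputs are not over-compressed in a single generation."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

doc = "token " * 1200  # a 1200-word toy document
segments = segment_document(doc, max_tokens=512)
print(len(segments), [len(s.split()) for s in segments])
```

Each segment is rephrased independently and the outputs are concatenated, which keeps the synthetic version's information content proportional to the source rather than collapsing a long document into one short summary.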

4. Alignment Algorithms and Loss Functions

The Nemotron-4 framework, tightly integrated with Nemotron-Synth, adopts an iterative “weak-to-strong” alignment paradigm (NVIDIA et al., 17 Jun 2024). In this protocol, a weaker aligned generator produces synthetic data via multi-dimensional prompt generation (open Q&A, coding, writing tasks, etc.), subsequently used for fine-tuning stronger base models. The Nemotron-4-340B-Instruct model generates synthetic content, while Nemotron-4-340B-Reward applies a linear “reward head” to deliver evaluation vectors for attributes such as Helpfulness, Correctness, Coherence, Complexity, and Verbosity. Quality scores are aggregated to rank or filter outputs.
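
The linear reward head can be illustrated as a projection from the model's final hidden state to the five attribute scores, followed by a scalar aggregation for ranking. The dimensions, weights, and uniform aggregation below are toy assumptions, not the actual Nemotron-4-340B-Reward parameters:

```python
import random

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
HIDDEN = 8  # toy hidden size; the real model's is far larger

random.seed(0)
# A linear "reward head": one weight vector (plus bias) per attribute.
W = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in ATTRIBUTES]
b = [0.0] * len(ATTRIBUTES)

def reward_vector(hidden_state):
    """Project the final hidden state to per-attribute scores: r = W h + b."""
    return {
        attr: sum(w_i * h_i for w_i, h_i in zip(w, hidden_state)) + bias
        for attr, w, bias in zip(ATTRIBUTES, W, b)
    }

def aggregate(scores, weights=None):
    """Combine attribute scores into one scalar for ranking or filtering;
    uniform weighting here, though the real aggregation may differ."""
    weights = weights or {a: 1.0 for a in ATTRIBUTES}
    return sum(weights[a] * s for a, s in scores.items())

h = [random.gauss(0, 1) for _ in range(HIDDEN)]
scores = reward_vector(h)
print({a: round(s, 3) for a, s in scores.items()}, round(aggregate(scores), 3))
```

Synthetic candidates whose aggregate falls below a threshold, or which rank low among samples for the same prompt, would be filtered before fine-tuning.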

A pivotal algorithm is Reward-aware Preference Optimization (RPO), with the following loss function:

\mathcal{L}_{\mathrm{rpo}}(x, y_c, y_l) = \mathcal{D}\left[ \beta \log \frac{\pi(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \;\middle\|\; \eta \left( r^*(x, y_c) - r^*(x, y_l) \right) \right]

where x is the prompt, y_c and y_l are the chosen and rejected responses, \pi and \pi_{\mathrm{ref}} are the policy and reference models, r^* denotes the reward scores, and \mathcal{D} is a distance/divergence function. This loss construction distinguishes small quality gaps from large ones, mitigating overfitting and increasing label robustness.
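
A minimal sketch of this loss, instantiating the generic divergence \mathcal{D} as a squared error (the formulation above leaves \mathcal{D} abstract), might look like:

```python
def rpo_loss(logp_c, logp_l, logp_ref_c, logp_ref_l, r_c, r_l, beta=1.0, eta=1.0):
    """Reward-aware Preference Optimization loss for one (x, y_c, y_l) pair.

    The policy's implicit reward gap, beta * (chosen log-ratio minus
    rejected log-ratio), is pulled toward the scaled true reward gap
    eta * (r*(x, y_c) - r*(x, y_l)). Squared error stands in for the
    generic divergence D here, which is an illustrative choice.
    """
    implicit_gap = beta * ((logp_c - logp_ref_c) - (logp_l - logp_ref_l))
    target_gap = eta * (r_c - r_l)
    return (implicit_gap - target_gap) ** 2

# If the policy's log-ratio gap already matches the reward gap, the loss is zero.
print(rpo_loss(logp_c=-1.0, logp_l=-2.0, logp_ref_c=-1.5, logp_ref_l=-1.5,
               r_c=2.0, r_l=1.0))  # → 0.0
```

Because the target is the magnitude of the reward gap rather than a binary preference, a pair with nearly equal rewards exerts only a small pull on the policy, which is what mitigates overfitting to noisy preference labels.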

5. Benchmark Performance and Comparative Assessment

In comparative studies with previous web-scale and synthetic datasets, Nemotron-Synth demonstrates clear efficacy. Training 8B parameter models on Nemotron-CC (which incorporates Nemotron-Synth) for 1T tokens yields a 5.6-point MMLU improvement over DCLM (Su et al., 3 Dec 2024). The full dataset, including Nemotron-Synth, contains four times more unique tokens than DCLM and supports long-horizon training (up to 15T tokens). Models trained on this regime outperform Llama 3.1 8B on multiple downstream tasks, achieving gains of +5 on MMLU, +3.1 on ARC-Challenge, and modest aggregate improvements across ten diverse benchmarks.

Recent developments contextualize Nemotron-Synth as a highly competitive but not definitive baseline. The BeyondWeb synthetic framework reports an average performance improvement of up to +2.6pp versus Nemotron-Synth at the 8B scale and up to 2.7× faster training (Maini et al., 14 Aug 2025). BeyondWeb's optimization focuses on maximizing per-token information density, strategic style matching, and intentional distributional diversity in the rephrased corpus.

6. Broader Impacts, Open Sourcing, and Research Implications

Nemotron-Synth emerges from an open-sourced pipeline, which encompasses generation prompts, response synthesis methods for both single- and multi-turn examples, stringent post-processing (such as filtering generic dialogue turns), verification protocols (including Nemotron-4-340B-Reward and LLM-as-Judge), and the iterative alignment regime (NVIDIA et al., 17 Jun 2024). This open approach advances transparency, reproducibility, and community-driven enhancement in prompt engineering, data generation, and reward modeling.

The practical impacts of Nemotron-Synth are visible in alignment tunings where over 98% of supervised and preference fine-tuning data is synthetic. High-quality synthetic data lowers annotation costs, widens the scope of attainable domains (natural language, code, math, creative writing), and facilitates pretraining for smaller and multilingual models. This suggests that synthetic subsets such as Nemotron-Synth will remain critical for scaling LLMs efficiently, maintaining robust task generalization, and supporting low-resource language adaptation through translation and rephrasing (Joshi et al., 18 Oct 2024).

7. Limitations and Prospective Directions

Nemotron-Synth’s quality and efficiency are contingent upon classifier accuracy, instruct model competence, and prompt engineering diversity. The BeyondWeb framework demonstrates that complementary strategies—more thorough style matching, input curation, and diversity—can push the performance and efficiency frontier beyond those achieved by Nemotron-Synth (Maini et al., 14 Aug 2025). A plausible implication is that future synthetic corpus generation will require jointly optimizing classifier selection, document segmentation, style and information density rephrasing, and careful mixture strategies between natural and synthetic tokens.

A common misconception, that increased synthetic token count automatically improves quality, is contradicted by data showing that the main benefits derive from transforming web content into denser, more task-aligned, and more diverse formats rather than from simple augmentation.

In summary, Nemotron-Synth exemplifies scalable, classifier-guided, synthetic data curation for LLM pretraining. Its pipeline and downstream results have established new standards for data quality and diversity, with continued research indicating that further optimizations in style, information density, and diversity remain leading contributors to frontier model performance.