DeMix Corpora: Robust LLM Data Mixtures
- DeMix Corpora are large-scale, openly distributed pre-training datasets formed by validated mixtures from over 30 diverse sources.
- The methodology decouples mixture search from training via model merging, combined with advanced filtering, to optimize data ratios.
- Empirical evaluations show enhanced LLM performance with staged curriculum designs and precise mixture validation protocols.
DeMix Corpora refers both to a family of large-scale, openly distributed pre-training datasets that employ validated data mixtures across multiple domains and, more generally, to the methodology and practice of constructing, curating, and evaluating data mixtures ("demixing") for robust machine learning, particularly for LLMs. DeMix Corpora leverage systematic mixture discovery, advanced filtering, and explicit mixture-ratio validation, in contrast to ad-hoc or static corpus construction. The term is exemplified by the DeMix Corpora dataset and methodology introduced by Li et al. (31 Jan 2026), as well as by advances in demixing methodologies for domain separation, language mixing, and data selection (Wang et al., 3 Feb 2026; Selmo et al., 2020). DeMix Corpora intersect with topics including code-mixed language corpora, semantic drift corpora, and multi-domain speech mixture datasets.
1. Corpus Composition and Scale
DeMix Corpora, as released in (Li et al., 31 Jan 2026), aggregate approximately 15 trillion raw tokens from over 30 heterogeneous sources, resulting in a final mixed corpus of 22 trillion tokens. The corpus is used in three curriculum stages for LLM pre-training:
| Stage | Total Tokens | General | Math | Multilingual | Code |
|---|---|---|---|---|---|
| 1 | 14 T | 77.0% | 6.8% | 8.6% | 7.4% |
| 2 | 6 T | 60.5% | 16.8% | 15.3% | 7.4% |
| 3 | 2 T | 47.8% | 31.7% | 19.0% | 11.5% |
| Final | 22 T (tot.) | 68.5%* | 13.8%* | 13.1%* | 8.8%* |
(*Weighted stage averages.)
Major source types include general web data (FineWeb-Edu, DCLM-Baseline, DOLMA), multilingual corpora (web-crawl-zh, Nemo-CC-Multilingual), math datasets (Nemo-Math-Mind, SwallowMath), and code corpora (OpenCoder, Source-Code, StarCoder). Each domain undergoes strict selection, filtering, and regularization to maintain consistent coverage and quality.
2. Mixture Design and Validation Protocol
The central innovation in DeMix Corpora is a “decouple searching from training” approach that uses model merging to predict optimal data-mixture ratios (Li et al., 31 Jan 2026). Rather than training a separate proxy model for every candidate mixture configuration, the process is as follows:
- A single base model $\theta_{\text{base}}$ is pretrained on 50B tokens of general data.
- For each candidate domain corpus $D_i$, $\theta_{\text{base}}$ is continually pretrained on $D_i$ for 50B tokens to yield a domain-specific component model $\theta_i$.
- Candidate mixtures are represented as convex combinations $w = (w_1, \dots, w_n)$ with $\sum_i w_i = 1$ and $w_i \ge 0$, and the corresponding proxy models are constructed via weighted parameter merging, $\theta_w = \sum_i w_i \theta_i$, without further training.
- Each proxy model is directly evaluated on a suite of LLM benchmarks (ARC-E, HellaSwag, Winogrande, PIQA, SIQA, GSM8K, MATH, HumanEval, MBPP).
- A LightGBM regressor maps mixture weights to benchmark performance and guides the search by resampling top-performing regions of the weight space over three iterations.
- The final mixture ratio is the average over the top candidates predicted by the regressor.
This paradigm yields validated, performance-optimized mixture schedules for each curriculum stage.
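A minimal sketch of this merge-then-evaluate loop follows, assuming PyTorch-style state dicts mapping parameter names to tensors; the `evaluate_on_benchmarks` helper, checkpoint handling, and sampling sizes are hypothetical, and the released implementation may differ.

```python
import numpy as np
import lightgbm as lgb

def merge_models(component_states, weights):
    """Weighted parameter merge: theta_w = sum_i w_i * theta_i (no training)."""
    merged = {}
    for name in component_states[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, component_states))
    return merged

def sample_simplex(n_domains, rng):
    """Draw a random mixture ratio from the probability simplex."""
    return rng.dirichlet(np.ones(n_domains))

def search_mixture(component_states, evaluate_on_benchmarks,
                   n_init=64, n_iters=3, top_k=8, seed=0):
    """Iterative mixture search guided by a LightGBM regressor (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    n = len(component_states)
    X, y = [], []
    candidates = [sample_simplex(n, rng) for _ in range(n_init)]
    for _ in range(n_iters):
        # Evaluate each merged proxy model directly on the benchmark suite.
        for w in candidates:
            score = evaluate_on_benchmarks(merge_models(component_states, w))
            X.append(w)
            y.append(score)
        # Fit a regressor from mixture weights to benchmark performance.
        reg = lgb.LGBMRegressor(n_estimators=200).fit(np.array(X), np.array(y))
        # Resample the top-predicted regions of weight space for the next iteration.
        pool = [sample_simplex(n, rng) for _ in range(2048)]
        preds = reg.predict(np.array(pool))
        candidates = [pool[i] for i in np.argsort(preds)[-n_init:]]
    # Final ratio: average over the top candidates predicted by the regressor.
    top = np.argsort(reg.predict(np.array(X)))[-top_k:]
    return np.mean([X[i] for i in top], axis=0)
```

Because merging replaces continued pre-training for each candidate, the cost per evaluated mixture is a parameter average plus inference, which is what makes a broad search over the simplex tractable.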
3. Filtering, Deduplication, and Quality Control
All candidate subcorpora in DeMix Corpora are subjected to aggressive filtering and normalization to eliminate data contamination and low-quality samples.
Key procedures include:
- Exact deduplication (global 24-gram hash-based).
- Fuzzy deduplication via MinHash (threshold 90% similarity, 260 hashes, 20 bands).
- Perplexity filtering with Qwen3-0.6B as the scorer (the removal threshold is dataset-specific; roughly 2% is removed per corpus).
- FastText quality classification (trained on 1M positive vs. 1M negative examples; roughly 3% of English data is removed).
- Chinese data is filtered via a small classifier trained on 1M pseudo-labels bootstrapped from 5K human labels.
- Domain annotation and regularization: Each subcorpus is constrained to blend 1:1 with general data at the instance level prior to mixture search.
In total, deduplication removes about 8% of raw web data, perplexity filtering a further 2%, and in-domain quality classifiers another 3–5%.
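As an illustration, the fuzzy-deduplication step can be sketched with the `datasketch` library; the 260 permutations, 20 bands (of 13 rows each), and 0.9 threshold follow the parameters above, while the word 5-gram shingling scheme and document iterator are assumptions.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 260         # 260 hash permutations
BANDS, ROWS = 20, 13   # 20 bands x 13 rows = 260

def minhash_of(text, num_perm=NUM_PERM):
    """Build a MinHash signature over word 5-gram shingles (shingling scheme assumed)."""
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(1, len(words) - 4)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def fuzzy_dedup(docs):
    """Keep the first occurrence of each near-duplicate cluster (>= 90% similarity)."""
    lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM, params=(BANDS, ROWS))
    kept = []
    for doc_id, text in docs:
        sig = minhash_of(text)
        if not lsh.query(sig):   # no near-duplicate already indexed
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```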
4. Mixture Curriculum and Fine-tuning Practices
DeMix Corpora recommend a staged curriculum for pre-training, with a general-heavy mixture in early stages (Stage 1), followed by increased domain and task emphasis (math, code, multilingual) in Stages 2 and 3. Domain-specific data never exceeds 50% in any candidate subcorpus, due to quality and coverage regularization constraints.
After pre-training on DeMix Corpora, further domain adaptation is facilitated by training small component models on new data and re-merging with base models, preserving general capability while allowing domain-specific enhancements.
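For concreteness, the staged schedule from Section 1 can be expressed as a simple configuration that a training pipeline might consume; the structure below is a hypothetical sketch, with ratios taken from the table above.

```python
import numpy as np

# Staged mixture curriculum from Section 1 (token budgets in trillions; ratios as published).
CURRICULUM = [
    {"stage": 1, "tokens_T": 14, "ratios": {"general": 0.770, "math": 0.068, "multilingual": 0.086, "code": 0.074}},
    {"stage": 2, "tokens_T": 6,  "ratios": {"general": 0.605, "math": 0.168, "multilingual": 0.153, "code": 0.074}},
    {"stage": 3, "tokens_T": 2,  "ratios": {"general": 0.478, "math": 0.317, "multilingual": 0.190, "code": 0.115}},
]

def sample_domain(stage_cfg, rng):
    """Draw the source domain for the next document, proportional to the stage's ratios."""
    domains, probs = zip(*stage_cfg["ratios"].items())
    probs = np.asarray(probs) / np.sum(probs)  # renormalize (published ratios are rounded)
    return rng.choice(domains, p=probs)

rng = np.random.default_rng(0)
print(sample_domain(CURRICULUM[0], rng))  # e.g., "general" with ~77% probability
```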
5. Empirical Validation and Quality Measures
Mixture proxy validation is conducted via rank correlation and capability recovery:
- Proxy-model scores correlate with those of fully-trained, real-mixture models at macro-Spearman's ρ = 0.845 overall; restricted to the top 25% of mixtures, ρ = 0.59.
- Capability Recovery Rate (proxy mean/reference mean) is approximately 0.85.
- The best discovered mixture ratio yields an average rank of 24.00 (out of 96) on the benchmark suite for a 1.7B model, outperforming all baseline mixtures (best baseline: 29).
These results demonstrate that model merging for mixture search achieves efficient, high-fidelity mixture validation without repeated full-scale pre-training.
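The two validation statistics above can be computed as sketched below, assuming per-benchmark score arrays for the proxy (merged) and reference (fully-trained) models, with mixtures in the same order; the data layout here is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def macro_spearman(proxy, reference):
    """Mean Spearman rank correlation across benchmarks.

    proxy, reference: dict mapping benchmark name -> per-mixture score array.
    """
    rhos = [spearmanr(proxy[b], reference[b]).correlation for b in proxy]
    return float(np.mean(rhos))

def capability_recovery_rate(proxy, reference):
    """Mean proxy score divided by mean reference score (reported ~0.85)."""
    p = np.mean([np.mean(v) for v in proxy.values()])
    r = np.mean([np.mean(v) for v in reference.values()])
    return p / r
```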
6. Comparison to Prior Demixing Methodologies
DeMix Corpora’s data mixture methodology contrasts with:
- Static or heuristic mixing (ad-hoc ratios, e.g., fixed general:domain splits).
- Proxy-based selection (small-scale proxy models, e.g., DoReMi, CLIMB, Meta-rater), which are limited by scale or flatten the macro/micro structure (Wang et al., 3 Feb 2026).
- Intrinsic geometric approaches (e.g., UniGeM (Wang et al., 3 Feb 2026)), which treat curation as an unsupervised manifold-approximation problem solved by a hierarchical two-stage process: Macro-Exploration (semantic-region coverage via stability-driven clustering) and Micro-Mining (high-quality instance selection via local geometry and structure).
By decoupling mixture search from training, DeMix achieves both greater search coverage and greater efficiency. However, limitations include restriction to the set of domains with pretrained component models $\theta_i$, dependence on the chosen curriculum schedule, and the need to train new component models in order to explore mixture ratios outside the validated support.
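As a heavily simplified schematic of the geometric two-stage idea behind such intrinsic approaches (an illustrative stand-in using k-means and k-NN density, not UniGeM's actual algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def two_stage_select(embeddings, n_clusters=100, per_cluster=50, k=10):
    """Macro step: cover semantic regions via clustering.
    Micro step: keep the locally densest (best-supported) points per region."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) <= 1:
            selected.extend(idx)
            continue
        pts = embeddings[idx]
        # Local density proxy: inverse mean distance to k nearest neighbors.
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(pts)
        dists, _ = nn.kneighbors(pts)
        density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-9)
        selected.extend(idx[np.argsort(density)[-per_cluster:]])
    return np.array(selected)
```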
7. Distribution, Access, and Licensing
DeMix Corpora, including both the data and the code for mixture discovery and model merging, are released for open research at https://github.com/Lucius-lsr/DeMix (Li et al., 31 Jan 2026). Access is subject to the source datasets' licenses. The corpora are intended for benchmarking, extending, and reproducing state-of-the-art LLM training, and for studying robust data curation under explicit mixture hypotheses.
8. Relationship to Broader Demixing and Source-Separation Corpora
The DeMix framework is orthogonal but complementary to other lines of "demixing" research: e.g., differential embedding corpora for tracking semantic drift and regional variation (Selmo et al., 2020), code-mixed language corpora for token-level language identification (Mellado et al., 2022, Churina et al., 31 May 2025), and multi-domain synthetic speech mixtures for source separation (Maciejewski et al., 2018). Each of these domains applies explicit source/mode separation but focuses on different artifacts (word embeddings, language tags, speech waveforms) and evaluation metrics. The underlying principles—multiplicity of sources, validated mixture construction, and rigorous benchmarking—are shared across the DeMix paradigm.