ATLAS: Tracing RLVR Data Lineage
- The paper introduces ATLAS, which reconstructs data provenance in RLVR datasets through iterative canonicalization and semantic matching.
- The framework identifies 20 atomic sources accounting for over 99.7% of 1.45M instances, mitigating provenance collapse and contamination.
- ATLAS enables source-level counterfactual attribution and dataset scoring, paving the way for decontaminated training sets like DAPO++.
Atomic-source Tracing via Lineage-Aware Search (ATLAS) is a framework for tracing Reinforcement Learning from Verifiable Rewards (RLVR) datasets back to their atomic sources in order to address provenance collapse in a rapidly expanding dataset ecosystem. In the formulation introduced in "RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data" (Huang et al., 26 May 2026), ATLAS combines canonicalization, temporal index matching, semantic similarity matching, and iterative source recovery to reconstruct lineage at scale. Applied to RLVR corpora, it attributes over 99.7% of 1.45M instances to 20 atomic sources, identifies contamination pathways into downstream datasets, and provides the lineage substrate for source-level utility analysis, dataset scoring, and the construction of a decontaminated training set, DAPO++ (Huang et al., 26 May 2026).
1. Provenance collapse in RLVR data
The ATLAS framework is motivated by the paper’s diagnosis of provenance collapse in RLVR datasets. RLVR datasets are described as collections used to train models with reward signals that can be automatically checked, such as math problems with verifiable answers. The paper argues that many such datasets are built from earlier datasets, heavily filtered, rewritten, recomposed, or aggregated from multiple sources, while their lineage remains unclear. It defines provenance collapse as the loss of information about where the data originally came from, who modified it, and whether it can be trusted. In this setting, some datasets are characterized as “openly-closed”: publicly released, but with opaque internal provenance (Huang et al., 26 May 2026).
The consequences identified are concrete. Hidden overlap and reuse can make many ostensibly new datasets variants of the same upstream material. Contamination risk arises when evaluation benchmarks leak into training data. Dataset comparison becomes misleading if competing methods train on substantially overlapping corpora. Dataset curation also becomes harder because the absence of lineage obscures which parts of a training set carry genuinely new or useful signal.
Within this problem setting, ATLAS is not restricted to identifying immediate predecessors. Its stated goal is to recover the original atomic sources underlying RLVR datasets. This shifts the unit of analysis from named downstream releases to the smaller set of upstream sources that recur across the ecosystem. A plausible implication is that benchmarking and curation should be performed at the source level rather than only at the dataset-release level.
2. Core definitions and lineage representation
The paper distinguishes among several related terms. A Data Source is a collection not originally intended for RLVR training. An atomic source is a source with a singular origin or consistent construction standard. A Dataset is any collection of data, often identified by a Hugging Face dataset ID. An RLVR dataset is a dataset specifically curated for RLVR training, usually composed of several sources (Huang et al., 26 May 2026).
ATLAS operationalizes lineage through canonicalization. Because datasets differ in schema, formatting, prompt style, and question-answer structure, the authors manually inspect samples and canonicalize each instance into a standardized form,
$$x = (h, p, q, a, s, d, t),$$
where $h$ is the SHA-1 hash of the prompt, $p$ the prompt, $q$ the question, $a$ the answer, $s$ the source label, $d$ the dataset ID, and $t$ the timestamp or release time. Canonicalization is treated as essential because many datasets differ only in surface formatting.
This representation makes lineage a property of normalized instances rather than raw serialized records. In practical terms, it allows ATLAS to reason jointly over exact reuse, reformatted duplicates, and later semantic variants while preserving the temporal order needed for provenance reconstruction.
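A minimal sketch of such a canonical record follows. The field names, the `normalize` helper (whitespace collapsing and lowercasing), and the dataset IDs are hypothetical; the paper describes manual canonicalization and does not specify a normalization routine.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def normalize(text: str) -> str:
    """Hypothetical normalization: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def sha1_hex(prompt: str) -> str:
    """SHA-1 digest of the normalized prompt, as a 40-character hex string."""
    return hashlib.sha1(normalize(prompt).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CanonicalInstance:
    prompt_hash: str   # SHA-1 hash of the prompt
    prompt: str
    question: str
    answer: str
    source: str        # source label
    dataset_id: str    # e.g. a Hugging Face dataset ID
    released: date     # timestamp or release time


def canonicalize(prompt, question, answer, source, dataset_id, released):
    return CanonicalInstance(
        sha1_hex(prompt), prompt, question, answer, source, dataset_id, released
    )


# Two records that differ only in surface formatting (IDs are made up).
a = canonicalize("What is 2 + 2?", "What is 2 + 2?", "4",
                 "gsm8k", "org/dataset-a", date(2024, 1, 1))
b = canonicalize("what is  2 + 2?\n", "What is 2 + 2?", "4",
                 "unknown", "org/dataset-b", date(2025, 1, 1))
print(a.prompt_hash == b.prompt_hash)  # True
```

Under this assumed normalization, the two records share a hash even though their raw prompts differ, which is the property the canonical form is meant to provide.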
3. ATLAS framework and iterative search procedure
ATLAS is presented as an iterative framework with four practical stages: data collection and canonicalization, temporal index matching, semantic similarity matching, and iterative source recovery (Huang et al., 26 May 2026).
In the temporal index stage, each prompt $p$ is hashed as
$$h = \mathrm{SHA1}(p).$$
The dataset pool is then traversed in chronological order while maintaining a global lineage dictionary $\mathcal{L}$ and, for each hash $h$, an occurrence list $\mathcal{L}[h]$. When the same hash appears in a later dataset, its metadata are appended to the occurrence list. The high-level update rule is given as
$$\mathcal{L}[h] \leftarrow \mathcal{L}[h] \cup \{(s, d, t)\}.$$
The paper’s rationale is that exact or near-exact prompt reuse is common in RLVR data, so hashing avoids expensive string matching.
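The chronological traversal can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: datasets are given as hypothetical `(dataset_id, release_time, prompts)` tuples, release times are ISO date strings, and prompts are hashed without any normalization.

```python
import hashlib
from collections import defaultdict


def sha1_hex(prompt: str) -> str:
    return hashlib.sha1(prompt.encode("utf-8")).hexdigest()


def build_lineage(datasets):
    """Map each prompt hash to its occurrences, ordered by release time.

    datasets: iterable of (dataset_id, release_time, prompts) tuples.
    """
    lineage = defaultdict(list)
    # Visit datasets from oldest to newest so the first entry is the earliest.
    for dataset_id, released, prompts in sorted(datasets, key=lambda d: d[1]):
        for prompt in prompts:
            lineage[sha1_hex(prompt)].append((dataset_id, released))
    return lineage


def earliest_origin(lineage, prompt):
    """Return the dataset in which the prompt first appeared, or None."""
    occurrences = lineage.get(sha1_hex(prompt))
    return occurrences[0][0] if occurrences else None


# Hypothetical pool: a derived mix released after its upstream source.
pool = [
    ("derived-mix", "2025-03-01", ["Solve x + 1 = 3.", "Compute 7 * 8."]),
    ("upstream-a", "2024-06-01", ["Solve x + 1 = 3."]),
]
lineage = build_lineage(pool)
print(earliest_origin(lineage, "Solve x + 1 = 3."))  # upstream-a
print(earliest_origin(lineage, "Compute 7 * 8."))    # derived-mix
```

Each lookup is a dictionary access on a fixed-length digest, which is why hashing scales to millions of instances where pairwise string comparison would not.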
Exact matching alone is insufficient because unresolved instances may have been rewritten, reformatted, lightly paraphrased, or converted from multiple-choice to open-ended forms. For the unresolved set $U$, ATLAS applies semantic matching using Sentence-BERT embeddings, cosine similarity retrieval, and human auditing. In the appendix pseudocode, a similarity threshold $\tau$ determines whether two instances should be merged, summarized as
$$\cos(e_i, e_j) \ge \tau,$$
where $e_i$ and $e_j$ are Sentence-BERT embeddings. The paper emphasizes that this stage is not fully automatic: manual inspection verifies candidate matches case by case.
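The thresholded retrieval step can be illustrated with a small pure-Python sketch. The three-dimensional vectors below are toy stand-ins for Sentence-BERT embeddings, the threshold value 0.9 is an assumption rather than the paper's setting, and the output is only a list of candidates for human audit.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_matches(unresolved, indexed, tau=0.9):
    """For each unresolved item, keep its nearest indexed item if similarity >= tau."""
    candidates = []
    for uid, u in unresolved.items():
        best_id, best_sim = None, -1.0
        for iid, v in indexed.items():
            sim = cosine(u, v)
            if sim > best_sim:
                best_id, best_sim = iid, sim
        if best_id is not None and best_sim >= tau:
            candidates.append((uid, best_id, best_sim))
    return candidates


# Toy embeddings (assumed; real ones would come from a sentence encoder).
indexed = {"src-1": [1.0, 0.0, 0.0], "src-2": [0.0, 1.0, 0.0]}
unresolved = {"u-1": [0.9, 0.1, 0.0], "u-2": [0.5, 0.5, 0.7]}

matches = candidate_matches(unresolved, indexed)
for uid, iid, sim in matches:
    print(uid, "->", iid, round(sim, 3))
```

Here `u-1` clears the threshold against `src-1`, while `u-2` stays unresolved and would be carried into the next source-recovery cycle.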
If too many instances remain unmatched, ATLAS enters another source-recovery cycle. The remaining unmatched prompts are inspected, a likely missing source family is inferred, candidate datasets are searched, the dataset pool is augmented, and the earlier stages are rerun. Iteration stops when the unmatched fraction falls below a small tolerance $\epsilon$,
$$\frac{|U|}{|D|} < \epsilon,$$
where $U$ is the unresolved set and $D$ is the full set of instances.
Any residual instances are labeled unknown. This design makes ATLAS a lineage-aware search procedure rather than a one-pass deduplication system.
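The control flow of this loop can be sketched as below. The `match` and `recover` callbacks, the tolerance of 0.01, and the round limit are all hypothetical; in the paper, source recovery involves manual inspection rather than an automated function.

```python
def trace_sources(instances, pool, match, recover_sources,
                  epsilon=0.01, max_rounds=10):
    """Rerun matching and source recovery until few instances remain unmatched."""
    if not instances:
        return {}
    labels = {}
    for _ in range(max_rounds):
        labels = match(instances, pool)
        unmatched = [x for x in instances if x not in labels]
        if len(unmatched) / len(instances) < epsilon:
            break
        new_datasets = recover_sources(unmatched)
        if not new_datasets:  # nothing more to add: stop searching
            break
        pool = pool + new_datasets
    # Whatever is still unresolved is labeled "unknown".
    return {x: labels.get(x, "unknown") for x in instances}


def match(instances, pool):
    """Toy matcher: label an instance with the first source that contains it."""
    labels = {}
    for x in instances:
        for name, prompts in pool:
            if x in prompts:
                labels[x] = name
                break
    return labels


# Toy stand-in for the manual search for missing source families.
catalog = [("olympiads", {"p3", "p4"})]


def recover(unmatched):
    found = [c for c in catalog if c[1] & set(unmatched)]
    for c in found:
        catalog.remove(c)
    return found


result = trace_sources(["p1", "p2", "p3", "p4", "p5"],
                       [("gsm8k", {"p1", "p2"})], match, recover)
print(result)
```

In this toy run the first pass resolves two instances, the recovery step adds a second source that resolves two more, and the last instance ends up labeled `unknown` once no further sources can be found.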
4. Provenance concentration and contamination analysis
Using ATLAS, the authors trace 1,450,827 RLVR instances back to 20 atomic sources, with fewer than 1% left unknown, and report attribution of over 99.7% of all instances to those 20 sources (Huang et al., 26 May 2026). The source labels listed in the appendix include cn_k12, olympiads, aops_forum, stack_exchange, big_math, numina_math1.5, still, gsm8k, areal_boba, amc_aime, math, dapo, num_glue, lila_crawl, and basic_arithmetic, alongside synthetic families such as synthetic_math, orca_math, lila_synthetic, synthetic_amc, and meta_math, plus a special test_leak category.
The paper further notes that a few broad aggregated collections, especially the NuminaMath-CoT and NuminaMath-1.5 series, serve as major reuse hubs. This supports the paper’s conclusion that many RLVR datasets are derivative variants of a small set of shared upstream sources, with few introducing genuinely new data. This suggests that apparent corpus diversity at the release level can substantially overstate true source diversity.
ATLAS is also used for million-scale pairwise similarity matching between RLVR datasets and 14 math evaluation benchmarks. The reported outcome is 36,148 leaked instances across datasets, with leakage often detectable only through semantic matching and many cases consisting of superficial formatting variants of benchmark test items. The paper explicitly includes exact duplicates, near-duplicates, and lightly rewritten or retyped benchmark problems within this leakage category. It argues that some apparent training improvements may be partly explained by contamination rather than genuine reasoning gains (Huang et al., 26 May 2026).
A common misconception is that contamination concerns only verbatim benchmark duplication. ATLAS is presented as evidence against that view, because the paper reports that many leaks become visible only under semantic matching rather than exact hash reuse. Another misconception is that benchmark leakage is a local property of a single dataset release. The lineage analysis indicates instead that leakage can be inherited through chains of reuse, so a leak in one source may propagate into many downstream datasets.
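Inherited leakage can be illustrated as reachability in a reuse graph. In the sketch below the edge list and dataset names are invented, and the traversal is deliberately an over-approximation: it flags every dataset that derives from a leaked source as potentially affected, without checking whether the leaked items survived filtering.

```python
from collections import deque


def downstream_of(reuse_edges, leaked_source):
    """Return every dataset reachable from leaked_source via (parent, child) edges."""
    children = {}
    for parent, child in reuse_edges:
        children.setdefault(parent, []).append(child)

    seen, queue = set(), deque([leaked_source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical reuse chain: source-a feeds mix-1, which feeds mix-2, and so on.
edges = [
    ("source-a", "mix-1"),
    ("mix-1", "mix-2"),
    ("source-b", "mix-2"),
    ("mix-2", "mix-3"),
]
print(sorted(downstream_of(edges, "source-a")))  # ['mix-1', 'mix-2', 'mix-3']
print(sorted(downstream_of(edges, "source-b")))  # ['mix-2', 'mix-3']
```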
5. Source-level Counterfactual Attribution and dataset scoring
ATLAS establishes lineage; the paper then asks which atomic sources are actually useful for RLVR training. To answer that question, it introduces Source-level Counterfactual Attribution (SCA), which treats each atomic source $s$ as an intervention unit. For each source, an RL checkpoint $\theta_s$ is trained from a shared base model $\theta_0$, forming the counterfactual pair $(\theta_0, \theta_s)$ (Huang et al., 26 May 2026).
For each instance, correctness under the base model and the source-trained model yields four behavioral categories, written as (base, trained) correctness pairs: $(0,0)$ (wrong before, wrong after), labeled unsolvable; $(0,1)$ (wrong before, right after), labeled genuinely learnable; $(1,0)$ (right before, wrong after), labeled degrade case; and $(1,1)$ (right before, right after), labeled too easy / already mastered. The paper defines a learnability score from the proportions of these four categories, weighted by per-category coefficients. Its interpretation is explicit: a good source should have many genuinely learnable examples, not too many unsolvable examples, not too many already-mastered examples, and ideally few degrade cases.
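The four-way categorization can be computed directly from paired correctness flags. In the sketch below the category keys, the linear form of the score, and the weight values are assumptions chosen only to reflect the stated interpretation (reward learnable cases, penalize the rest); the paper's actual coefficients are not reproduced here.

```python
from collections import Counter

# (correct under base model, correct under source-trained model) -> category
LABELS = {
    (False, False): "unsolvable",
    (False, True): "learnable",
    (True, False): "degrade",
    (True, True): "too_easy",
}


def category_proportions(base_correct, trained_correct):
    """Fraction of instances falling into each behavioral category."""
    counts = Counter(LABELS[(bool(b), bool(t))]
                     for b, t in zip(base_correct, trained_correct))
    n = len(base_correct)
    return {name: counts[name] / n for name in LABELS.values()}


def learnability(props, weights):
    """Assumed linear score: weighted sum of category proportions."""
    return sum(weights[name] * props[name] for name in props)


# Hypothetical per-instance outcomes for one source.
base = [False, False, True, True, False, False, True, False]
trained = [True, False, True, False, True, True, True, False]

# Hypothetical weights, not the paper's coefficients.
weights = {"learnable": 1.0, "unsolvable": -0.25,
           "too_easy": -0.25, "degrade": -1.0}

props = category_proportions(base, trained)
print(props)
print(learnability(props, weights))
```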
Building on ATLAS and SCA, the paper defines a composite dataset quality score. The first axis covers static data quality through verifiability, learnability, and contamination robustness. Contamination robustness is computed from the number of leaked samples and the total number of samples, and the overall static score combines the three components.
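As a rough illustration of a contamination robustness term, the sketch below assumes the simplest form consistent with the two quantities named above, one minus the leaked fraction. This functional form is an assumption and may differ from the paper's definition; the sample counts are invented.

```python
def contamination_robustness(n_leaked: int, n_total: int) -> float:
    """Assumed form: 1 - (leaked samples / total samples)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_leaked <= n_total:
        raise ValueError("n_leaked must lie between 0 and n_total")
    return 1.0 - n_leaked / n_total


# Hypothetical dataset with 250 leaked samples out of 1000.
print(contamination_robustness(250, 1000))  # 0.75
print(contamination_robustness(0, 1000))    # 1.0
```

Under this assumed form, a fully clean dataset scores 1.0 and the score falls linearly as the leaked fraction grows.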
The second and third axes capture sampling efficiency gain and capability boundary expansion, respectively, each measured through performance improvements normalized on Math500 and interpolated using a scale-dependent weight.
The final score combines the three axes, with weights that vary smoothly with model scale.
The paper reports strong Pearson and Spearman correlations between the composite score and downstream performance on both Qwen3-1.7B and Qwen3-8B (Huang et al., 26 May 2026). In the paper’s interpretation, this occurs because the score incorporates clean verifiable supervision, learnable examples, low benchmark contamination, source diversity, and empirical gains from source-level RL checkpoints.
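For readers unfamiliar with the two statistics, the sketch below shows how Pearson and Spearman correlations between a dataset score and downstream accuracy would be computed. The four data points are invented for illustration and are not the paper's results; the rank function assumes no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical quality scores and downstream accuracies for four datasets.
quality = [0.2, 0.5, 0.6, 0.9]
accuracy = [10.0, 14.0, 13.0, 20.0]
print(round(pearson(quality, accuracy), 3))
print(round(spearman(quality, accuracy), 3))
```

Pearson measures linear agreement between the raw values, while Spearman depends only on their ordering, so a score that ranks datasets correctly can have a high Spearman correlation even when the relationship is nonlinear.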
6. DAPO++, evaluation protocol, and lineage-aware benchmarking
The empirical findings motivate the curation of DAPO++, described as a decontaminated version of DAPO-Math-17k. According to the paper, leaked or test-set-related instances are removed from DAPO and replaced with non-MCQ, SCA-annotated, learnable samples from the stack_exchange atomic source identified by ATLAS (Huang et al., 26 May 2026). This design is intended to preserve useful training signal while removing contamination.
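The remove-and-replace procedure can be sketched as a simple filter. The record fields (`id`, `source`, `is_mcq`, `sca_label`) are hypothetical, and the one-for-one replacement policy is an assumption; the paper does not state here how many replacement samples are drawn.

```python
def build_decontaminated(base, leaked_ids, candidates):
    """Drop leaked items, then backfill with eligible replacement samples."""
    kept = [x for x in base if x["id"] not in leaked_ids]
    n_removed = len(base) - len(kept)

    # Eligible replacements: non-MCQ, SCA-labeled learnable, from stack_exchange.
    eligible = [
        c for c in candidates
        if c["source"] == "stack_exchange"
        and not c["is_mcq"]
        and c["sca_label"] == "learnable"
        and c["id"] not in leaked_ids
    ]
    # Assumed policy: replace removed items one for one.
    return kept + eligible[:n_removed]


# Toy records (all values invented).
base = [{"id": "d1"}, {"id": "d2"}, {"id": "d3"}]
leaked = {"d2"}
candidates = [
    {"id": "s1", "source": "stack_exchange", "is_mcq": True, "sca_label": "learnable"},
    {"id": "s2", "source": "stack_exchange", "is_mcq": False, "sca_label": "learnable"},
    {"id": "s3", "source": "stack_exchange", "is_mcq": False, "sca_label": "too_easy"},
]

cleaned = build_decontaminated(base, leaked, candidates)
print([x["id"] for x in cleaned])  # ['d1', 'd3', 's2']
```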
The evaluation uses Qwen3-1.7B-Base and Qwen3-8B-Base with GRPO for RL training, 500 optimization steps, temperature 1.0 during rollout generation, Math-Verify rewards, and validation on Math500. The compared datasets are DeepScaleR, OpenR1-Math-220k, DeepMath-103K, DAPO-Math-17k, Skywork-OR1-RL-Data, and DAPO++. The benchmark suite comprises Math500, Minerva, Olympiad, HLE, AMC23, AIME24, AIME25, AMO, and the out-of-distribution benchmark GPQA-Diamond.
For mathematical reasoning, the paper reports that on Qwen3-1.7B the best baseline is DeepMath-103K with Average = 15.4, while DAPO++ reaches 15.7. On Qwen3-8B, the best baseline is DAPO-Math-17k with Average = 29.3, while DAPO++ reaches 29.6. On GPQA, DAPO++ attains Overall 35.9 for Qwen3-1.7B and 55.4 for Qwen3-8B. The paper therefore presents DAPO++ as consistently improving over the original DAPO and outperforming the reported baselines.
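The paper also introduces SRank as a lineage-aware ranking over multiple model scales. It weights scale-specific ranks by the cross-dataset standard deviation at each scale, aggregates the weighted ranks into a single SRank value per dataset, and reports the resulting ranking of the compared datasets.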
In the paper’s framing, this ranking reflects a lineage-aware benchmarking perspective in which datasets are evaluated not only by raw downstream performance but also by provenance, contamination, and source-level utility.
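One way to read a standard-deviation-weighted rank aggregation is sketched below. The normalization of the weights (each scale's standard deviation divided by the sum over scales), the use of a weighted sum of ranks, and the example scores are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import pstdev


def rank_desc(scores):
    """Rank datasets by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i for i, name in enumerate(ordered, start=1)}


def srank(per_scale):
    """Weighted average rank; scales with larger spread across datasets count more.

    per_scale: {scale_name: {dataset_name: score}}
    """
    sigmas = {m: pstdev(list(s.values())) for m, s in per_scale.items()}
    total = sum(sigmas.values())
    if total == 0:  # all scales uninformative: fall back to equal weights
        weights = {m: 1 / len(per_scale) for m in per_scale}
    else:
        weights = {m: sig / total for m, sig in sigmas.items()}
    ranks = {m: rank_desc(s) for m, s in per_scale.items()}
    datasets = list(next(iter(per_scale.values())))
    return {d: sum(weights[m] * ranks[m][d] for m in per_scale) for d in datasets}


# Hypothetical average scores for three datasets at two model scales.
per_scale = {
    "1.7B": {"A": 15.0, "B": 14.0, "C": 13.0},
    "8B": {"A": 28.0, "B": 30.0, "C": 26.0},
}
scores = srank(per_scale)
print(sorted(scores, key=scores.get))  # ['B', 'A', 'C']
```

In this toy case the 8B scale has twice the spread of the 1.7B scale, so dataset B, which ranks first at 8B, ends up ahead of dataset A despite ranking second at 1.7B.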
Ablation findings further indicate that converting MCQs into open-ended problems improves performance and that removing contaminated examples does not hurt, and often improves, results. This suggests that leaked samples need not be interpreted as beneficial hard examples; within the paper’s argument, they may instead be noisy, redundant, or spuriously helpful through memorization. The broader implication is that decontamination and atomic-source tracing are not merely auditing tools but components of RLVR dataset design itself.