ATLAS: Tracing RLVR Data Lineage

Updated 4 July 2026
  • The paper introduces ATLAS, which reconstructs data provenance in RLVR datasets through iterative canonicalization and semantic matching.
  • The framework identifies 20 atomic sources accounting for over 99.7% of 1.45M instances, mitigating provenance collapse and contamination.
  • ATLAS enables source-level counterfactual attribution and dataset scoring, paving the way for decontaminated training sets like DAPO++.

Atomic-source Tracing via Lineage-Aware Search (ATLAS) is a framework for tracing Reinforcement Learning from Verifiable Rewards (RLVR) datasets back to their atomic sources in order to address provenance collapse in a rapidly expanding dataset ecosystem. In the formulation introduced in "RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data" (Huang et al., 26 May 2026), ATLAS combines canonicalization, temporal index matching, semantic similarity matching, and iterative source recovery to reconstruct lineage at scale. Applied to RLVR corpora, it attributes over 99.7% of 1.45M instances to 20 atomic sources, identifies contamination pathways into downstream datasets, and provides the lineage substrate for source-level utility analysis, dataset scoring, and the construction of a decontaminated training set, DAPO++ (Huang et al., 26 May 2026).

1. Provenance collapse in RLVR data

The ATLAS framework is motivated by the paper’s diagnosis of provenance collapse in RLVR datasets. RLVR datasets are described as collections used to train models with reward signals that can be automatically checked, such as math problems with verifiable answers. The paper argues that many such datasets are built from earlier datasets, heavily filtered, rewritten, recomposed, or aggregated from multiple sources, while their lineage remains unclear. It defines provenance collapse as the loss of information about where the data originally came from, who modified it, and whether it can be trusted. In this setting, some datasets are characterized as “openly-closed”: publicly released, but with opaque internal provenance (Huang et al., 26 May 2026).

The consequences identified are concrete. Hidden overlap and reuse can make many ostensibly new datasets variants of the same upstream material. Contamination risk arises when evaluation benchmarks leak into training data. Dataset comparison becomes misleading if competing methods train on substantially overlapping corpora. Dataset curation also becomes harder because the absence of lineage obscures which parts of a training set carry genuinely new or useful signal.

Within this problem setting, ATLAS is not restricted to identifying immediate predecessors. Its stated goal is to recover the original atomic sources underlying RLVR datasets. This shifts the unit of analysis from named downstream releases to the smaller set of upstream sources that recur across the ecosystem. A plausible implication is that benchmarking and curation should be performed at the source level rather than only at the dataset-release level.

2. Core definitions and lineage representation

The paper distinguishes among several related terms. A Data Source is a collection not originally intended for RLVR training. An atomic source is a source with a singular origin or consistent construction standard. A Dataset is any collection of data, often identified by a Hugging Face dataset ID. An RLVR dataset is a dataset specifically curated for RLVR training, usually composed of several sources (Huang et al., 26 May 2026).

ATLAS operationalizes lineage through canonicalization. Because datasets differ in schema, formatting, prompt style, and question-answer structure, the authors manually inspect samples and canonicalize each instance into a standardized form,

$$d = (h, p, q, a, s, \text{id}, t),$$

where $h$ is the SHA-1 hash of the prompt, $p$ the prompt, $q$ the question, $a$ the answer, $s$ the source label, $\text{id}$ the dataset ID, and $t$ the timestamp or release time. Canonicalization is treated as essential because many datasets differ only in surface formatting.

This representation makes lineage a property of normalized instances rather than raw serialized records. In practical terms, it allows ATLAS to reason jointly over exact reuse, reformatted duplicates, and later semantic variants while preserving the temporal order needed for provenance reconstruction.
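A minimal sketch of such a canonical record in Python may help make the tuple concrete. The whitespace normalization, the field name `dataset_id`, and the ISO date string for the release time are illustrative assumptions; the paper's actual canonicalization rules were derived from manual inspection and are not reproduced here.

```python
import hashlib
from dataclasses import dataclass


def prompt_hash(prompt: str) -> str:
    """Return the SHA-1 hex digest of a UTF-8 encoded prompt."""
    return hashlib.sha1(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CanonicalInstance:
    h: str           # SHA-1 hash of the prompt
    p: str           # prompt
    q: str           # question
    a: str           # answer
    s: str           # source label
    dataset_id: str  # dataset ID ("id" in the tuple)
    t: str           # release time as an ISO date string (assumed format)


def canonicalize(prompt, question, answer, source, dataset_id, released):
    # Collapsing whitespace is an assumed, illustrative normalization step.
    p = " ".join(prompt.split())
    q = " ".join(question.split())
    return CanonicalInstance(prompt_hash(p), p, q, answer.strip(),
                             source, dataset_id, released)
```

Under this sketch, two records that differ only in spacing or line breaks receive the same hash, which is the property exact matching relies on.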

3. ATLAS framework and iterative search procedure

ATLAS is presented as an iterative framework with four practical stages: data collection and canonicalization, temporal index matching, semantic similarity matching, and iterative source recovery (Huang et al., 26 May 2026).

In the temporal index stage, each prompt $p$ is hashed as

$$h = \mathrm{SHA1}(p).$$

The dataset pool is then traversed in chronological order while maintaining a global lineage dictionary keyed by hash, with an occurrence list for each entry. When the same hash appears in a later dataset, its metadata are appended to that hash's occurrence list; this append step is the high-level update rule given in the paper.

The paper’s rationale is that exact or near-exact prompt reuse is common in RLVR data, so hashing avoids expensive string matching.
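The chronological traversal can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the `(dataset_id, release_time, prompts)` input shape, and the use of ISO date strings for ordering are all hypothetical.

```python
import hashlib
from collections import defaultdict


def build_lineage_index(datasets):
    """Build {hash: [(release_time, dataset_id), ...]} from a dataset pool.

    datasets: iterable of (dataset_id, release_time, prompts) tuples, where
    release_time is an ISO date string (an assumed format).
    Each occurrence list is in chronological order, so its first entry is
    the earliest known appearance of that prompt.
    """
    lineage = defaultdict(list)
    # Visit datasets oldest first; ISO dates sort correctly as strings.
    for dataset_id, release_time, prompts in sorted(datasets, key=lambda d: d[1]):
        for prompt in prompts:
            h = hashlib.sha1(prompt.encode("utf-8")).hexdigest()
            lineage[h].append((release_time, dataset_id))
    return dict(lineage)


def earliest_origin(lineage, prompt):
    """Return the dataset ID of the earliest occurrence, or None if unseen."""
    h = hashlib.sha1(prompt.encode("utf-8")).hexdigest()
    occurrences = lineage.get(h)
    return occurrences[0][1] if occurrences else None
```

Each prompt costs one hash and one dictionary lookup, which is why this stage scales to large pools without pairwise string comparison.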

Exact matching alone is insufficient because unresolved instances may have been rewritten, reformatted, lightly paraphrased, or converted from multiple-choice to open-ended forms. For the set of instances left unresolved by hashing, ATLAS applies semantic matching using Sentence-BERT embeddings, cosine similarity retrieval, and human auditing. In the appendix pseudocode, two instances are treated as candidates for merging when the cosine similarity between their Sentence-BERT embeddings reaches a similarity threshold. The paper emphasizes that this stage is not fully automatic: manual inspection verifies candidate matches case by case.
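The thresholded retrieval step can be sketched as follows, assuming embeddings have already been computed and are supplied as plain lists of floats. The threshold value `0.9`, the function names, and the brute-force search are assumptions made for illustration; the paper's threshold is not recoverable from this summary, and a real pipeline at million scale would use an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def propose_matches(unresolved, indexed, threshold=0.9):
    """Return (unresolved_key, best_indexed_key, similarity) candidates.

    unresolved, indexed: {key: embedding}. The output is a list of
    candidates for human audit, not a list of confirmed merges.
    threshold is a hypothetical value chosen for this sketch.
    """
    candidates = []
    for key, emb in unresolved.items():
        best_key, best_sim = None, -1.0
        for ref_key, ref_emb in indexed.items():
            sim = cosine_similarity(emb, ref_emb)
            if sim > best_sim:
                best_key, best_sim = ref_key, sim
        if best_key is not None and best_sim >= threshold:
            candidates.append((key, best_key, best_sim))
    return candidates
```

Returning candidates rather than merging directly mirrors the paper's point that every proposed match is checked by a person.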

If too many instances remain unmatched, ATLAS enters another source-recovery cycle. The remaining unmatched prompts are inspected, a likely missing source family is inferred, candidate datasets are searched, the dataset pool is augmented, and the earlier stages are rerun. Iteration stops when the unmatched fraction falls below a preset tolerance. Any residual instances are labeled unknown. This design makes ATLAS a lineage-aware search procedure rather than a one-pass deduplication system.
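The control flow of this loop can be sketched as follows. Several simplifications are assumed: set membership stands in for both matching stages, `candidate_pools` stands in for datasets that a human would locate by inspecting the leftovers, and the tolerance of `0.01` is a hypothetical value.

```python
def iterative_source_recovery(prompts, initial_pool, candidate_pools, tolerance=0.01):
    """Grow the dataset pool until the unmatched fraction is below tolerance.

    prompts: set of canonical prompts to trace.
    initial_pool: set of prompts covered by the starting dataset pool.
    candidate_pools: list of prompt sets to add, one per recovery cycle.
    Returns (unmatched_prompts, unmatched_fraction); whatever is left
    unmatched would be labeled unknown.
    """
    pool = set(initial_pool)
    remaining = list(candidate_pools)
    while True:
        unmatched = prompts - pool
        fraction = len(unmatched) / len(prompts) if prompts else 0.0
        # Stop when coverage is good enough or no candidates are left.
        if fraction < tolerance or not remaining:
            return unmatched, fraction
        pool |= remaining.pop(0)
```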

4. Provenance concentration and contamination analysis

Using ATLAS, the authors trace 1,450,827 RLVR instances back to 20 atomic sources, attributing over 99.7% of all instances to those sources and leaving fewer than 1% unknown (Huang et al., 26 May 2026). The source labels listed in the appendix include cn_k12, olympiads, aops_forum, stack_exchange, big_math, numina_math1.5, still, gsm8k, areal_boba, amc_aime, math, dapo, num_glue, lila_crawl, and basic_arithmetic, alongside synthetic families such as synthetic_math, orca_math, lila_synthetic, synthetic_amc, and meta_math, plus a special test_leak category.

The paper further notes that a few broad aggregated collections, especially the NuminaMath-CoT and NuminaMath-1.5 series, serve as major reuse hubs. This supports the paper’s conclusion that many RLVR datasets are derivative variants of a small set of shared upstream sources, with few introducing genuinely new data. Apparent corpus diversity at the release level can therefore substantially overstate true source diversity.

ATLAS is also used for million-scale pairwise similarity matching between RLVR datasets and 14 math evaluation benchmarks. The reported outcome is 36,148 leaked instances across datasets, with leakage often detectable only through semantic matching and many cases consisting of superficial formatting variants of benchmark test items. The paper explicitly includes exact duplicates, near-duplicates, and lightly rewritten or retyped benchmark problems within this leakage category. It argues that some apparent training improvements may be partly explained by contamination rather than genuine reasoning gains (Huang et al., 26 May 2026).

A common misconception is that contamination concerns only verbatim benchmark duplication. ATLAS is presented as evidence against that view, because the paper reports that many leaks become visible only under semantic matching rather than exact hash reuse. Another misconception is that benchmark leakage is a local property of a single dataset release. The lineage analysis indicates instead that leakage can be inherited through chains of reuse, so a leak in one source may propagate into many downstream datasets.

5. Source-level Counterfactual Attribution and dataset scoring

ATLAS establishes lineage; the paper then asks which atomic sources are actually useful for RLVR training. To answer that question, it introduces Source-level Counterfactual Attribution (SCA), which treats each atomic source as an intervention unit. For each source, an RL checkpoint is trained from a shared base model, so that the base model and the source-trained checkpoint form a counterfactual pair (Huang et al., 26 May 2026).

For each instance, correctness under the base model and the source-trained model yields four behavioral categories: wrong before and wrong after, labeled unsolvable; wrong before and right after, labeled genuinely learnable; right before and wrong after, labeled degrade case; and right before and right after, labeled too easy / already mastered. The paper defines a learnability score from the proportions of these categories, weighted by coefficients. Its interpretation is explicit: a good source should have many genuinely learnable examples, not too many unsolvable or already-mastered examples, and ideally few degrade cases.
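The four-way categorization can be sketched as follows. The category keys follow the labels in the text; the weighted sum in `learnability_score` and the example coefficients are assumptions, since the paper's exact functional form and coefficient values are not recoverable from this summary.

```python
from collections import Counter

# (correct under base model, correct under source-trained model) -> label
CATEGORIES = {
    (False, False): "unsolvable",
    (False, True): "learnable",
    (True, False): "degrade",
    (True, True): "too_easy",
}


def category_proportions(base_correct, trained_correct):
    """Return the share of instances falling in each behavioral category."""
    if len(base_correct) != len(trained_correct):
        raise ValueError("both correctness lists must have the same length")
    counts = Counter(
        CATEGORIES[(bool(b), bool(t))]
        for b, t in zip(base_correct, trained_correct)
    )
    n = len(base_correct)
    return {name: (counts[name] / n if n else 0.0) for name in CATEGORIES.values()}


def learnability_score(proportions, weights):
    """Assumed form: a weighted sum of category proportions."""
    return sum(weights[name] * share for name, share in proportions.items())


# Hypothetical coefficients: reward learnable cases, penalize the rest.
EXAMPLE_WEIGHTS = {"learnable": 1.0, "unsolvable": -0.25,
                   "too_easy": -0.25, "degrade": -1.0}
```

With coefficients of this shape, a source dominated by wrong-to-right transitions scores highest, which matches the interpretation given in the text.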

Building on ATLAS and SCA, the paper defines a composite dataset quality score. Its first axis covers static data quality through verifiability, learnability, and contamination robustness. The contamination robustness term is computed from the number of leaked samples and the total number of samples, and the overall static score combines the three components.
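As an illustration of a contamination robustness term, the sketch below assumes the simplest form consistent with the description: one minus the leaked fraction. This form is an assumption, not the paper's confirmed formula, and the function name is hypothetical.

```python
def contamination_robustness(n_leaked: int, n_total: int) -> float:
    """Assumed form: 1 - n_leaked / n_total, so 1.0 means no detected leakage."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_leaked <= n_total:
        raise ValueError("n_leaked must lie between 0 and n_total")
    return 1.0 - n_leaked / n_total
```

For example, a hypothetical dataset with 50 leaked items out of 1,000 would score 0.95 under this assumed form.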

The second and third axes capture sampling efficiency gain and capability boundary expansion through improvements in two performance metrics, normalized on Math500 and combined by interpolation with a scale-dependent weight.

The final score combines the three axes using weights that vary smoothly with model scale.

The paper reports strong Pearson and Spearman correlations between the composite score and downstream performance on both Qwen3-1.7B and Qwen3-8B (Huang et al., 26 May 2026). In the paper’s interpretation, this occurs because the score incorporates clean verifiable supervision, learnable examples, low benchmark contamination, source diversity, and empirical gains from source-level RL checkpoints.

6. DAPO++, evaluation protocol, and lineage-aware benchmarking

The empirical findings motivate the curation of DAPO++, described as a decontaminated version of DAPO-Math-17k. According to the paper, leaked or test-set-related instances are removed from DAPO and replaced with non-MCQ, SCA-annotated, learnable samples from the stack_exchange atomic source identified by ATLAS (Huang et al., 26 May 2026). This design is intended to preserve useful training signal while removing contamination.
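The remove-and-replace step can be sketched as follows. The record fields (`h`, `source`, `is_mcq`, `sca_label`), the function name, and the one-for-one replacement policy are assumptions for illustration; the summary does not state how many replacement samples the authors added.

```python
def build_decontaminated_set(base_set, replacement_pool, leaked_hashes):
    """Drop leaked items, then backfill from an eligible replacement pool.

    base_set, replacement_pool: lists of dicts with hypothetical keys
        "h" (prompt hash), "source", "is_mcq", "sca_label".
    leaked_hashes: set of prompt hashes flagged as benchmark leakage.
    """
    kept = [ex for ex in base_set if ex["h"] not in leaked_hashes]
    n_removed = len(base_set) - len(kept)
    eligible = [
        ex for ex in replacement_pool
        if ex["source"] == "stack_exchange"
        and not ex["is_mcq"]
        and ex["sca_label"] == "learnable"
        and ex["h"] not in leaked_hashes
    ]
    # One-for-one backfill is an assumed policy, not a stated one.
    return kept + eligible[:n_removed]
```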

The evaluation uses Qwen3-1.7B-Base and Qwen3-8B-Base with GRPO for RL training, 500 optimization steps, temperature 1.0 during rollout generation, Math-Verify rewards, and validation on Math500. The compared datasets are DeepScaleR, OpenR1-Math-220k, DeepMath-103K, DAPO-Math-17k, Skywork-OR1-RL-Data, and DAPO++. The benchmark suite comprises Math500, Minerva, Olympiad, HLE, AMC23, AIME24, AIME25, AMO, and the out-of-distribution benchmark GPQA-Diamond.

For mathematical reasoning, the paper reports that on Qwen3-1.7B the best baseline is DeepMath-103K with an Average of 15.4, while DAPO++ reaches 15.7. On Qwen3-8B, the best baseline is DAPO-Math-17k with an Average of 29.3, while DAPO++ reaches 29.6. On GPQA, DAPO++ attains Overall 35.9 for Qwen3-1.7B and 55.4 for Qwen3-8B. The paper therefore presents DAPO++ as consistently improving over the original DAPO and outperforming the reported baselines.

The paper also introduces SRank as a lineage-aware ranking over multiple model scales. It weights scale-specific ranks by cross-dataset standard deviation and aggregates them into a single ranking of the compared datasets. In the paper’s framing, this ranking reflects a lineage-aware benchmarking perspective in which datasets are evaluated not only by raw downstream performance but also by provenance, contamination, and source-level utility.
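One way such a deviation-weighted rank aggregation could work is sketched below. The normalization of weights to sum to one, the use of population standard deviation, and the function name `srank` are all assumptions; the paper's exact definition is not recoverable from this summary. Ties are not handled specially.

```python
from statistics import pstdev


def srank(scores):
    """Aggregate per-scale ranks into one weighted rank per dataset.

    scores: {scale: {dataset: score}}, higher score is better.
    Each scale's weight is its cross-dataset standard deviation divided by
    the sum over scales (assumed normalization), so scales that separate
    datasets more strongly count more. Lower output means better rank.
    """
    sigmas = {scale: pstdev(list(by_ds.values())) for scale, by_ds in scores.items()}
    total = sum(sigmas.values())
    n_scales = len(scores)
    weights = {
        scale: (sigmas[scale] / total if total > 0 else 1.0 / n_scales)
        for scale in scores
    }
    result = {}
    for scale, by_ds in scores.items():
        # Rank 1 goes to the highest-scoring dataset at this scale.
        ordered = sorted(by_ds, key=by_ds.get, reverse=True)
        for rank, dataset in enumerate(ordered, start=1):
            result[dataset] = result.get(dataset, 0.0) + weights[scale] * rank
    return result
```

The intuition behind weighting by spread is that a scale at which all datasets perform almost identically says little about which dataset is better.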

Ablation findings further indicate that converting MCQs into open-ended problems improves performance and that removing contaminated examples does not hurt, and often improves, results. This suggests that leaked samples need not be interpreted as beneficial hard examples; within the paper’s argument, they may instead be noisy, redundant, or spuriously helpful through memorization. The broader implication is that decontamination and atomic-source tracing are not merely auditing tools but components of RLVR dataset design itself.
