- The paper demonstrates that a single, well-crafted RL training sample can outperform comprehensive multi-thousand-sample RL, achieving up to a 14.5-point average gain on non-math benchmarks.
- It introduces the polymath learning paradigm, which uses both natural and LLM-synthesized samples to pack multidisciplinary STEM skills into a single training example.
- The study highlights that optimal sample design, with an emphasis on diverse algebra and precalculus skills, dramatically cuts data requirements while strengthening model self-verification.
Extreme Data Efficiency in RL Scaling: A Polymath Learning Paradigm
Introduction and Motivation
The current scaling paradigm for advancing LLM reasoning emphasizes ever-larger datasets and resource-intensive RL fine-tuning. "One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling" (2601.03111) challenges this conventional wisdom by asking how little data RL actually needs. The study introduces "polymath learning," showing that a single, meticulously designed training example can produce strong, transferable improvements in both mathematical and multidisciplinary reasoning tasks. Its central hypotheses concern (i) the cross-domain generalizability of one-shot reinforcement learning, (ii) the characteristics that determine optimal sample selection, and (iii) the synthesis of meta-samples with maximal multidisciplinary skill coverage.
Experimental Paradigm and Baselines
The work primarily employs Qwen2.5-7B-base, trained with the GRPO [shao2024deepseekmath] RL algorithm under a fixed batch regime. Evaluation spans extensive benchmarks, including MATH500, AIME, MinervaMath, SuperGPQA, MMLU-Pro, and SciBench. Polymath learning (single-sample RL) is systematically compared against strong baselines: zero-shot inference, in-context learning with 1-shot exemplars, and comprehensive RL on 1k–8k sample subsets (notably, selected with the LIMR data-selection method).
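GRPO is the algorithmic ingredient that makes single-prompt RL well defined: each step samples a group of completions for the same prompt and normalizes their rewards within the group, so no learned value critic is needed. A minimal sketch of the group-relative advantage (group size and the binary correctness reward are illustrative, not the paper's exact configuration):

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: normalize each rollout's reward by the
    group's mean and standard deviation (no learned value critic)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# With a single training sample, every RL step rolls out a group of
# completions for that one prompt and scores them, e.g. 0/1 correctness:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
```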
Models are exposed to two sample types:
- Natural polymath samples: Selected from mathematics domains (e.g., prealgebra, algebra) by their low LIMR scores, ensuring high RL learnability without over-specialization (see the selection sketch after this list).
- Synthetic polymath samples: Constructed via LLM-driven, instruction-following generation, emphasizing simultaneous coverage of salient skills from math, physics, chemistry, and biology.
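A minimal sketch of how LIMR-guided selection of a natural polymath sample might look, assuming candidates arrive with precomputed LIMR scores and skill annotations (the record fields, scores, and the tie-break toward skill breadth are hypothetical, not the paper's pipeline):

```python
# Hypothetical candidate records with precomputed LIMR scores and
# annotated salient skills; values are illustrative, not the paper's data.
candidates = [
    {"id": "prealgebra-017", "limr_score": 0.12, "skills": {"algebra", "ratios"}},
    {"id": "algebra-244", "limr_score": 0.08, "skills": {"algebra", "precalculus", "number_theory"}},
    {"id": "geometry-051", "limr_score": 0.31, "skills": {"geometry"}},
]

def pick_natural_polymath(pool):
    """Prefer the lowest LIMR score (the paper's stated criterion),
    breaking ties toward broader intra-sample skill coverage."""
    return min(pool, key=lambda c: (c["limr_score"], -len(c["skills"])))

print(pick_natural_polymath(candidates)["id"])  # -> "algebra-244"
```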
Cross-Domain Generalization Effects
The central empirical result is that RL on a single polymath sample confers outsized, robust improvements not only on math benchmarks but across physics, chemistry, engineering, and other science domains. The effect is most pronounced for math-distant domains, i.e., fields whose semantic representation (measured by subject-embedding distance) lies far from MATH500, where polymath learning posts a 14.5-point average gain over comprehensive RL baselines.
Figure 1: The subject-level performance of different learning strategies. OE denotes open-ended problems, and subjects are sorted by embedding distance from MATH500. Stars/triangles indicate best performances by in-context and polymath learning, respectively.
Notably, the best one-shot polymath RL models consistently outperform strong comprehensive RL models on non-math datasets and remain competitive on mathematics itself. These results hold across model initializations and are striking given the dramatically reduced supervision signal.
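The subject ordering in Figure 1 rests on embedding distance from MATH500. A sketch of one plausible way to compute such a distance, assuming an off-the-shelf sentence embedder (the paper's actual embedding model and reference text are not specified here):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed embedding model; the paper's actual choice is not specified here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def distance_from_math500(subject_texts: dict[str, str]) -> dict[str, float]:
    """Rank subjects by cosine distance between a subject description
    (or pooled problem text) and a MATH500 reference embedding."""
    ref = model.encode("MATH500 competition mathematics problems")
    out = {}
    for subject, text in subject_texts.items():
        emb = model.encode(text)
        cos = float(np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb)))
        out[subject] = 1.0 - cos
    return dict(sorted(out.items(), key=lambda kv: kv[1]))

print(distance_from_math500({"physics": "kinematics and circuits problems",
                             "biology": "genetics and cell biology questions"}))
```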
Characteristics of Optimal Polymath Samples
Analysis of high-impact samples reveals that rich algebra and precalculus skill coverage is a strong predictor of cross-domain reasoning gains. Intra-sample diversity, measured by the breadth of salient skills referenced by a single example, underpins that example's transferability.
Figure 2: Skill spectrum between natural and synthetic polymath samples; synthetic samples demonstrate markedly broader skill coverage.
Empirically, synthetic polymath samples engineered to integrate multidisciplinary skills (e.g., the "Synthetic Prime" sample) consistently outperform both domain-specific and natural polymath samples. Integrating concepts from diverse STEM domains within one sample strengthens the generalization and reasoning signal, yielding better sample efficiency than simply increasing dataset scale.
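One simple way to operationalize the breadth measure behind Figure 2, assuming each sample carries a list of skill tags (the tags and the entropy variant are illustrative, not the paper's exact metric):

```python
from collections import Counter
import math

def skill_diversity(skill_mentions: list[str]) -> tuple[int, float]:
    """Breadth (count of distinct salient skills) plus Shannon entropy of
    the within-sample skill distribution, as a simple diversity proxy."""
    counts = Counter(skill_mentions)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy

# Skill tags here are illustrative annotations, not the paper's taxonomy.
natural = ["algebra", "algebra", "precalculus"]
synthetic = ["algebra", "precalculus", "stoichiometry", "kinematics", "genetics"]
print(skill_diversity(natural), skill_diversity(synthetic))
```

By either measure, the synthetic sample scores higher, matching the broader skill spectrum the figure attributes to synthetic polymath samples.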
Fine-Grained Skill Analysis
Domain-specific skill prevalence further clarifies the mechanics of transfer. Quantitative fields (science, engineering) rely heavily on algebraic and precalculus skills, so effective samples must encode these foundational operations.
Figure 3: Average number of mathematical skills per problem in subject domains. Algebra and Precalculus dominate.
Synthetic specialist samples with targeted, multidisciplinary backgrounds excel in certain areas (e.g., the "Synthetic Prime" in physics/chemistry, geometry in engineering). This granularity opens the possibility of targeted skill injection via sample engineering to address identified weaknesses.
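A hypothetical prompt template for such skill-targeted synthesis, in the spirit of the LLM-driven, instruction-following generation described earlier (the wording, skill names, and answer format are assumptions, not the paper's prompt):

```python
# Hypothetical synthesis prompt; skill list and wording are assumptions.
TEMPLATE = """Write one self-contained word problem whose solution requires
ALL of the following skills, with a single verifiable numeric answer:
{skills}
State the problem, then give the final answer on its own line as 'Answer: <value>'."""

def build_synthesis_prompt(target_skills: list[str]) -> str:
    bullet_list = "\n".join(f"- {s}" for s in target_skills)
    return TEMPLATE.format(skills=bullet_list)

# Target a diagnosed weakness, e.g. engineering subjects that lean on geometry:
print(build_synthesis_prompt(["geometry", "algebra", "unit conversion"]))
```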
Self-Verification Mechanisms
Self-verification, a model's emergent ability to internally check or revise its outputs, appears more frequently and robustly after polymath learning than after comprehensive RL, particularly for samples rich in number theory or intermediate algebra skills. Patterns such as "re-evaluate," "recheck," and code-based validation are subject- and sample-dependent, revealing a complex interaction between learned verification routines and the reinforcement supervision signal.
Figure 4: Self-verification patterns under different comprehensive and polymath samples across all subjects. Distinct samples elicit different verification types.
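A minimal sketch for profiling these verification cues in model outputs, using the surface patterns named above plus a crude code-validation heuristic (the taxonomy and regexes are assumptions, not the paper's annotation protocol):

```python
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this snippet well-formed

# Surface cues named in the analysis ("re-evaluate", "recheck"), plus a
# crude cue for code-based validation; the exact taxonomy is assumed.
PATTERNS = {
    "re_evaluate": re.compile(r"\bre-?evaluat", re.IGNORECASE),
    "recheck": re.compile(r"\bre-?check", re.IGNORECASE),
    "code_check": re.compile(FENCE + r"(?:python)?.*?" + FENCE, re.DOTALL),
}

def verification_profile(response: str) -> dict[str, int]:
    """Count occurrences of each self-verification cue in one response."""
    return {name: len(pat.findall(response)) for name, pat in PATTERNS.items()}

reply = (f"Let me re-evaluate step 3, then recheck with code:\n"
         f"{FENCE}python\nassert 2 + 2 == 4\n{FENCE}")
print(verification_profile(reply))  # {'re_evaluate': 1, 'recheck': 1, 'code_check': 1}
```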
Implications and Future Outlook
This work has several impactful implications:
- Sample engineering is more effective than dataset scaling: Meticulous construction and skill balancing in even a single sample can outperform RL on thousands of examples. This finding contradicts the default of brute-force data scaling and calls for a more nuanced approach to dataset curation in RL for LLM reasoning.
- Enabling practical RL efficiency: The results dramatically reduce RL time and data costs for achieving competitive or superior generalization, opening pathways for RL alignment and reasoning enhancement with minimal computational resources.
- AI curriculum design: The approach provides concrete direction for designing compact, high-signal synthetic "curriculums," enabling targeted improvement on weak domains or skills by altering the internal composition of a single sample.
- Limits and future questions: Despite encouraging robustness, further work is needed to (a) automate skill-based meta-sample synthesis across broader domains, (b) ground sample-construction strategies in formal measures of skill diversity, (c) understand observed limitations (e.g., format overfitting), and (d) adapt rewards and selection to non-math domains where automatic supervision is unreliable.
Conclusion
"One Sample to Rule Them All" (2601.03111) definitively demonstrates that the data efficiency of RL for LLM reasoning can be improved by two orders of magnitude—down to a single well-crafted sample—if sample construction is treated as a problem of skill and diversity optimization. These results necessitate a shift from dataset scaling to high-precision sample engineering as the lingua franca for RL-based reasoning enhancement in LLMs. Future research should systematize and automate the design of these meta-samples, generalize the methodology outside math-heavy reasoning, and link sample structure to theoretical bounds on generalization and transfer.