Scalable RL with Curriculum Sampling
- The paper introduces a method that adaptively sequences training tasks to maximize sample efficiency and policy generalization.
- It details frameworks such as learning progress-based, teacher-student, and reverse curriculum strategies to optimize high-dimensional RL training.
- Empirical validations in robotics, navigation, and multimodal reasoning demonstrate faster convergence and improved performance over static curricula.
Scalable Reinforcement Learning with Curriculum Sampling (RLCS) refers to a class of reinforcement learning (RL) methodologies that systematically adapt the sequence and distribution of training tasks or data in large, complex, and often multi-dimensional problem spaces. The core objective of RLCS is to maximize sample efficiency and policy generalization by automatically generating curricula—ordered sequences or dynamic distributions of training instances—such that the RL agent is consistently exposed to tasks at an appropriate level of challenge. This adaptive focus on the "learning frontier" is achieved via formal mechanisms to estimate task difficulty or learning progress, often eschewing hand-tuned heuristics or monotonic difficulty assumptions. RLCS has enabled marked advances in domains with high task diversity, combinatorial structure, or sparse/complex reward landscapes, such as robotics, multimodal reasoning, autonomous driving, and mathematical reasoning.
1. Fundamental Principles and Motivation
The motivation for RLCS stems from the observation that uniform task sampling in vast or unstructured problem spaces frequently results in inefficient allocation of computational resources: most sampled instances are either trivial for the current policy (yielding no learning signal) or remain unsolvable (yielding uninformative gradients) (Li et al., 24 Jan 2026, Team et al., 1 Jul 2025). In contrast, curriculum learning seeks to sequence training experiences such that knowledge and skills acquired in easier tasks accelerate progress on subsequent, harder tasks. However, manually constructing such curricula quickly becomes infeasible as task spaces grow exponentially in dimension or complexity, and simple monotonic progressions ("from easy to hard") do not align with empirical learning dynamics where difficulty is multi-faceted and context-dependent.
RLCS frameworks address these limitations by constructing adaptive, data-driven curricula based on online estimations of the agent’s learning progress, success rates, or reward variance, often leveraging feedback mechanisms at the level of rollouts, tasks, or entire sub-task hierarchies.
2. Algorithmic Frameworks and Key Methods
Multiple algorithmic paradigms instantiate RLCS:
- Learning Progress-based RLCS (LP-ACRL): At each curriculum stage, the mean episodic reward for each task is tracked. The signed learning progress (LP), defined as the difference in mean reward between consecutive updates, is used to adapt the task-sampling distribution by emphasizing tasks with positive LP, thereby focusing sampling on instances where the agent demonstrates ongoing improvement. A tempered softmax over LP values yields the next curriculum distribution, optionally smoothed to prevent abrupt shifts. This procedure enables robust automatic curriculum generation in high-dimensional, multi-axis robotic locomotion, yielding state-of-the-art generalization across diverse terrains (Li et al., 24 Jan 2026).
- Teacher-Student and CMDP-based RLCS: The curriculum sequencing problem can be formalized as a Markov decision process (MDP) in which the "teacher" agent selects tasks for the "student" RL agent such that overall progress towards a target objective is maximized. The teacher’s policy is trained, typically via policy gradient methods, to select tasks based on aggregate learning progress, reward histories, or other meta-features of the student state. This approach empirically outperforms both tabula-rasa and heuristic curricula in multi-task domains (Schraner, 2022, Narvekar et al., 2018).
- Reverse Curriculum Generation: For goal-conditioned RL with sparse rewards, start states are selected by expanding backward from the goal via local random walks, repeatedly focusing training on states of intermediate success probability. The curriculum thus identifies and prioritizes a "funnel" through which the agent can reliably acquire the target skill (Florensa et al., 2017); a minimal code sketch of this start-state expansion appears after this list.
- Variance- and Reward-based Dynamic Sampling: In settings where tasks can be decomposed into sub-tasks with multi-valued reward signals (e.g., tool learning for LLMs), the RLCS process involves computing sample- and epoch-level statistics (mean, variance) of rewards per data point or sub-task. Data deemed over-mastered ("easy") or stagnant ("hard-stable") is filtered, while data within high-variance, informative regimes is prioritized. Sub-task curricula can be scheduled to promote asynchronous convergence, further increasing sample efficiency (Feng et al., 18 Sep 2025).
- Difficulty-Tiered and Accuracy-Frontier Sampling: For multimodal VLMs and mathematical reasoning, the training set is partitioned into discrete "tiers" of estimated difficulty (via pilot model accuracy or human labels). Curriculum sampling is performed to maintain the empirical batch-level accuracy near a target threshold (e.g., 0.5), focusing learning on the "Goldilocks" zone and dynamically modulating the batch composition to maintain gradient informativeness and computational efficiency (Team et al., 1 Jul 2025, Shi et al., 7 Apr 2025).
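To make the reverse-curriculum idea concrete, the following Python sketch maintains a pool of start states of intermediate difficulty. It is a minimal illustration under assumed interfaces: `env.step_random_from`, `env.rollout_success`, and the success-rate thresholds are hypothetical placeholders, not the exact procedure of Florensa et al. (2017).

```python
import random

# Hypothetical thresholds: keep start states whose estimated success rate is
# neither trivial nor hopeless (the exact values are illustrative).
R_MIN, R_MAX = 0.1, 0.9

def brownian_expand(state, env, steps=5):
    """Expand a start state backward via a short random walk.
    `env.step_random_from` is a stand-in for taking random actions from a
    given state and returning the resulting state."""
    for _ in range(steps):
        state = env.step_random_from(state)
    return state

def update_start_states(starts, env, policy, n_new=50, n_eval=10):
    """One reverse-curriculum iteration: expand the current start set,
    estimate per-state success under the current policy, and keep only
    states of intermediate difficulty (the 'learning frontier')."""
    candidates = [brownian_expand(random.choice(starts), env) for _ in range(n_new)]
    kept = []
    for s in starts + candidates:
        successes = sum(env.rollout_success(policy, start=s) for _ in range(n_eval))
        rate = successes / n_eval
        if R_MIN <= rate <= R_MAX:
            kept.append(s)
    return kept or starts  # never let the start-state set collapse to empty
```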
3. Mathematical Formulation
A generic RLCS protocol can be summarized as follows:
- Define a task space $\mathcal{T}$, with each task instance $\tau \in \mathcal{T}$ parameterized by categorical and (possibly discretized) continuous factors (e.g., terrain type, velocity, etc.) (Li et al., 24 Jan 2026).
- At each iteration or curriculum stage $t$, estimate for each task $\tau$:
- Mean reward $\bar{R}_t(\tau)$.
- Learning progress: $\mathrm{LP}_t(\tau) = \bar{R}_t(\tau) - \bar{R}_{t-1}(\tau)$.
- Task-sampling distribution updated according to a tempered softmax, $p_{t+1}(\tau) = \exp\!\big(\mathrm{LP}_t(\tau)/T\big) / \sum_{\tau'} \exp\!\big(\mathrm{LP}_t(\tau')/T\big)$, with optional exponential moving average smoothing.
- RL policy updates are performed over samples of tasks drawn from $p_{t+1}$, with the update rule appropriate to the base RL algorithm (e.g., PPO, TRPO).
- The curriculum feedback loop is repeated: after each block of policy updates, LP estimates and $p_{t+1}$ are recomputed, focusing data collection where learning is active; a minimal code sketch of this loop follows.
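The Python sketch below is a minimal illustration of this protocol rather than a reference implementation: tasks are assumed to be indexed in a list, `evaluate_mean_reward(policy, task)` is a placeholder for collecting per-task episodic returns, and the temperature and EMA coefficient are illustrative defaults.

```python
import numpy as np

def tempered_softmax(lp, temperature=0.1):
    """Turn signed learning-progress values into a sampling distribution."""
    z = lp / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def curriculum_step(prev_reward, policy, tasks, p_prev, evaluate_mean_reward,
                    temperature=0.1, ema=0.5):
    """One RLCS iteration: re-estimate per-task mean rewards, compute signed
    learning progress, and update the task-sampling distribution with
    exponential moving average smoothing to avoid abrupt shifts."""
    mean_reward = np.array([evaluate_mean_reward(policy, t) for t in tasks])
    lp = mean_reward - prev_reward                    # signed learning progress
    p_new = tempered_softmax(lp, temperature)
    p_smoothed = ema * p_prev + (1.0 - ema) * p_new
    return mean_reward, p_smoothed / p_smoothed.sum()

# Usage (schematic): tasks sampled from the current distribution feed the base
# RL algorithm (e.g., PPO); after a block of policy updates the loop repeats.
# sampled_idx = np.random.choice(len(tasks), size=batch_size, p=p_smoothed)
```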
Alternative approaches may use task success probability, reward-based variance, teacher-student rollouts, or explicit curriculum-MDPs to operationalize adaptive selection and progression (Tzannetos et al., 2023, Narvekar et al., 2018).
4. Empirical Validation and Scalability
Multiple studies demonstrate that RLCS methods confer substantial and scalable benefits:
- In scaled quadruped locomotion (ANYmal D), LP-ACRL enabled simultaneous mastery of 600 task instances (formed by the cross product of velocities, terrain types, and difficulty levels), with policies attaining high command-tracking performance on rough terrains and across commanded yaw rates. RLCS achieved task mastery within 1,500 PPO iterations, outperforming uniform and hand-crafted curricula by large margins in both coverage and sample efficiency (Li et al., 24 Jan 2026).
- For manipulation and navigation with sparse rewards, reverse curriculum generation increased success probabilities from 2–10% (uniform baseline) to 75–93% after 100–150 iterations, even in high-dimensional, continuous state spaces (Florensa et al., 2017).
- In tool-augmented LLM training (DSCL), dynamic sampling and staged sub-task curricula yielded consistent accuracy gains over strong baselines while substantially reducing redundant sample processing (Feng et al., 18 Sep 2025).
- For multimodal VLMs (GLM-4.5V), curriculum sampling achieved higher average accuracy than both uniform RL sampling and SFT alone, together with improved tokens-to-accuracy efficiency (Team et al., 1 Jul 2025).
- In end-to-end driving, agent-centric curriculum buffers prioritized scenarios with positive learning potential; success rates improved across traffic densities, converging with fewer updates and only a modest increase in wall-clock time over domain randomization (Abouelazm et al., 13 May 2025).
5. Theoretical Properties and Limitations
Several RLCS frameworks provide convergence or monotonic improvement guarantees under convexity and bounded step conditions (Klink et al., 2019, Tzannetos et al., 2023). Many derive selection rules by formalizing "one-step improvement" criteria for learning progress (as in ZPD-inspired ProCuRL) or optimizing explicit regularized objectives in policy/context spaces (as in self-paced contextual RL).
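For concreteness, the self-paced contextual RL objective referenced above can be written schematically as a KL-regularized trade-off between current performance and progression toward the target task distribution; this is a simplified form of the objective family in (Klink et al., 2019), not its exact statement:

$$\max_{\nu,\,\pi}\;\; \mathbb{E}_{c \sim \nu}\big[J(\pi, c)\big] \;-\; \alpha\, D_{\mathrm{KL}}\!\big(\nu \,\|\, \mu\big),$$

where $\nu$ is the intermediate context (task) distribution, $\mu$ the target context distribution, $J(\pi, c)$ the expected return of policy $\pi$ on context $c$, and $\alpha$ trades current learnability against progression toward $\mu$.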
Limitations include:
- Dependence on reliable estimation of learning progress or success rates, which can be noisy or biased in sparse or deceptive-reward domains.
- Scalability to continuous or extremely high-cardinality task spaces may require density modeling (e.g., normalizing flows for context distributions) or learned proxies for task difficulty.
- Some RLCS methods presume the ability to arbitrarily reset simulators or environments (e.g., for reverse curricula or CMDP states), which is impractical in many real-world settings and necessitates safeguards (Florensa et al., 2017).
- Formal sample-complexity bounds remain limited, though empirical scaling is routinely demonstrated.
6. Domain-Specific Extensions and Generalizability
RLCS is generalizable to a broad spectrum of domains:
- In mathematical reasoning, adaptive curriculum sampling (e.g., AdaRFT) tunes task selection to maintain a "Goldilocks" success rate, yielding faster convergence and systematic accuracy gains without modification to the reward function or architecture (Shi et al., 7 Apr 2025); a minimal sketch of this style of scheduling appears after this list.
- For tool learning with interdependent reward components, RLCS can be instantiated via multi-dimensional reward-statistic filtering and staged curriculum policies, yielding robust convergence even when sub-tasks are mastered asynchronously (Feng et al., 18 Sep 2025).
- In contextual RL, self-paced approaches regularize intermediate context distributions (via KL penalties) and use dual optimization to balance progress towards target distributions and smoothness of learning, suggesting RLCS is extensible to settings requiring continual domain adaptation (Klink et al., 2019).
- In vision-language and multimodal reasoning, difficulty tiers informed by pilot evaluations and per-tier online tracking enable RLCS to train stably on large example pools, balancing compute and gradient informativeness (Team et al., 1 Jul 2025).
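The following sketch illustrates the Goldilocks-style scheduling mentioned for AdaRFT above: a scalar target difficulty is nudged by the recent batch success rate, and problems are sampled near that target. The update rule, step size, and per-problem `difficulty` field are illustrative assumptions, not the published algorithm.

```python
import numpy as np

TARGET_SUCCESS = 0.5   # aim for the "Goldilocks" zone
STEP = 0.05            # illustrative step size for the difficulty target

def update_target_difficulty(target, batch_success_rate):
    """Raise the target difficulty when the policy succeeds too often,
    lower it when the policy fails too often."""
    return float(np.clip(target + STEP * (batch_success_rate - TARGET_SUCCESS),
                         0.0, 1.0))

def sample_batch(problems, target, batch_size=32, bandwidth=0.1):
    """Sample problems whose pre-estimated difficulty lies closest to the
    current target; `p["difficulty"]` in [0, 1] is an assumed annotation,
    e.g., one minus a pilot model's accuracy on the problem."""
    weights = np.array([np.exp(-((p["difficulty"] - target) / bandwidth) ** 2)
                        for p in problems])
    weights /= weights.sum()
    idx = np.random.choice(len(problems), size=batch_size, replace=False, p=weights)
    return [problems[i] for i in idx]
```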
7. Future Directions and Open Problems
RLCS continues to evolve, with open questions surrounding:
- Efficient estimation of learning progress and difficulty in partially observable, non-stationary, or real-world domains.
- Combining RLCS with meta-learning to dynamically learn not only curricula but also optimal curriculum update rules and schedules.
- Scaling to continuous task spaces: embedding-based curricula or context distributional shift methods require further advances to remain computationally tractable as task spaces grow.
- Robustness in the presence of adversarial or deceptive reward signals, where naive LP or variance-based scoring can be gamed.
- Formal convergence characterization for complex, stochastic, multi-stage RLCS frameworks remains limited and is an active area of research.
RLCS has become a foundational paradigm for scalable agent training in diverse, high-dimensional, and real-world domains, supplanting static or hand-engineered curricula and unlocking new levels of capability and efficiency in modern RL systems (Li et al., 24 Jan 2026, Team et al., 1 Jul 2025, Feng et al., 18 Sep 2025, Florensa et al., 2017).