
Self-Evolving Curriculum for LLM Reasoning (2505.14970v2)

Published 20 May 2025 in cs.AI and cs.LG

Abstract: Reinforcement learning (RL) has proven effective for fine-tuning LLMs, significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.

Summary

  • The paper introduces SEC, a framework that automatically sequences training problems to enhance RL fine-tuning for LLM reasoning tasks.
  • It formulates curriculum selection as a non-stationary Multi-Armed Bandit problem using the absolute advantage as a reward proxy.
  • Experiments show SEC outperforms random and fixed curricula, achieving up to a 33% relative accuracy gain on out-of-distribution tasks.

The paper "Self-Evolving Curriculum for LLM Reasoning" (2505.14970) introduces Self-Evolving Curriculum (SEC), an automatic curriculum learning framework designed to improve the effectiveness of Reinforcement Learning (RL) fine-tuning for LLMs on reasoning tasks. The core challenge addressed is the sequencing of training problems, which significantly impacts RL training success. Standard methods like random curricula are often suboptimal, manually designed curricula are labor-intensive, and some online methods are computationally expensive.

SEC formulates the curriculum selection process as a non-stationary Multi-Armed Bandit (MAB) problem that runs concurrently with the LLM's RL fine-tuning. The approach requires partitioning the training data into distinct categories (e.g., based on difficulty levels or problem types). Each category is treated as an 'arm' in the MAB.

A key contribution is the definition of a reward signal for the curriculum policy that serves as a proxy for the immediate learning gain. The paper proposes using the absolute value of the advantage function, $|\widehat{A}_t|$, from policy gradient methods. The intuition is that larger advantages indicate greater potential for parameter updates and thus higher learning gain. The paper demonstrates that, in the common scenario of RL with Verifiable Rewards (using a binary correctness signal), maximizing the expected absolute advantage is equivalent to prioritizing problems where the model has an approximate 50% success rate. This aligns with psychological theories like the Zone of Proximal Development, suggesting that problems neither too easy nor too hard yield the most learning.
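
To see why this reward favors problems near a 50% success rate, consider a simplified setting (a sketch of the argument, not the paper's exact derivation): a binary correctness reward $r \in \{0, 1\}$ and an advantage formed by subtracting the model's success probability $p$ on that problem as a baseline, so $\widehat{A} = r - p$. Then

```latex
\mathbb{E}\big[\,|\widehat{A}|\,\big]
  = p\,|1 - p| + (1 - p)\,|0 - p|
  = 2\,p\,(1 - p),
```

which is maximized at $p = 1/2$. Normalizing by the group standard deviation $\sqrt{p(1-p)}$, as GRPO-style estimators do, gives $2\sqrt{p(1-p)}$ and does not change the maximizer.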

The MAB policy learns an expected return $Q_t(c)$ for each category $c$ over time. This value is updated iteratively using the TD(0) method (an exponential moving average) based on the average absolute advantage observed for problems sampled from that category in the current training step. At each RL training step, categories are sampled according to a Boltzmann distribution over their current $Q_t(c)$ values, balancing exploration and exploitation. Training problems are then sampled uniformly from the chosen categories to form the training batch.
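
As a concrete illustration, here is a minimal Python sketch of this bandit component. The class and method names are our own (not from the paper's code); it assumes a scalar $Q$-value per category, Boltzmann (softmax) sampling, and the EMA-style TD(0) update described above.

```python
import math
import random


class SECBanditPolicy:
    """Sketch of the non-stationary MAB curriculum policy (names are illustrative)."""

    def __init__(self, categories, alpha=0.1, tau=1.0):
        self.q = {c: 0.0 for c in categories}  # Q_t(c), initialized to zero
        self.alpha = alpha                     # TD(0) / EMA learning rate
        self.tau = tau                         # Boltzmann temperature

    def sample_categories(self, k):
        """Draw k categories (with replacement) from a Boltzmann distribution over Q-values."""
        cats = list(self.q)
        max_q = max(self.q.values())  # subtract the max for numerical stability
        weights = [math.exp((self.q[c] - max_q) / self.tau) for c in cats]
        return random.choices(cats, weights=weights, k=k)

    def update(self, category, reward):
        """TD(0) update Q <- alpha * r + (1 - alpha) * Q, where r is the average
        |advantage| of problems drawn from this category in the current batch."""
        self.q[category] = self.alpha * reward + (1 - self.alpha) * self.q[category]
```

Sampling with replacement reflects that several problems in one batch may come from the same category; the temperature $\tau$ controls how greedily high-$Q$ categories are preferred over exploratory picks.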

The practical implementation involves the following steps; a code sketch of a single training step is given after the list:

  1. Data Categorization: Preprocessing the training data to assign problems to categories (e.g., based on existing difficulty labels, or estimated difficulty).
  2. Initialization: Initialize $Q$-values for all categories (typically to zero).
  3. Iterative Training:
    • At each step, use the current $Q$-values and a temperature parameter $\tau$ to sample categories via a softmax (Boltzmann) distribution.
    • Sample problems uniformly from the selected categories to form a batch.
    • Execute rollouts for the sampled problems using the current LLM policy.
    • Compute rewards and estimate advantages (e.g., using GRPO, PPO, or RLOO).
    • Update the LLM policy using the chosen RL algorithm and the computed advantages.
    • Calculate the reward for each sampled category as the average absolute advantage of problems drawn from that category in the batch.
    • Update the $Q$-value for each sampled category using the TD(0) rule $Q_{t+1}(c) = \alpha\, r_t(c) + (1-\alpha)\, Q_t(c)$, where $\alpha$ is the learning rate.
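
Putting the steps above together, the following hedged sketch shows how the curriculum policy wraps one RL fine-tuning step. It reuses the SECBanditPolicy sketch from earlier; rollout_and_estimate_advantages and apply_rl_update are hypothetical placeholders for the underlying RL algorithm (GRPO, PPO, or RLOO), not functions from the paper or any particular library, and each advantage is treated as a single scalar per problem for simplicity.

```python
import random
from collections import defaultdict


def sec_training_step(llm, policy, data_by_category, batch_size):
    """One RL fine-tuning step under SEC (sketch).

    `policy` is an SECBanditPolicy instance; `rollout_and_estimate_advantages`
    and `apply_rl_update` are hypothetical placeholders for the RL algorithm.
    """
    # 1. Sample categories from the Boltzmann distribution, then problems
    #    uniformly within each selected category.
    chosen_cats = policy.sample_categories(k=batch_size)
    batch = [(c, random.choice(data_by_category[c])) for c in chosen_cats]
    problems = [p for _, p in batch]

    # 2. Roll out the current LLM policy and estimate per-problem advantages.
    advantages = rollout_and_estimate_advantages(llm, problems)

    # 3. Update the LLM with the chosen RL algorithm and the computed advantages.
    apply_rl_update(llm, problems, advantages)

    # 4. Curriculum reward: average |advantage| per sampled category,
    #    followed by the TD(0) update of that category's Q-value.
    abs_adv_by_cat = defaultdict(list)
    for (c, _), adv in zip(batch, advantages):
        abs_adv_by_cat[c].append(abs(adv))
    for c, vals in abs_adv_by_cat.items():
        policy.update(c, sum(vals) / len(vals))
```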

Experimental results using Qwen2.5-3B and Qwen2.5-7B models across planning (Countdown, Zebra), inductive reasoning (ARC-1D), and mathematics (MATH) tasks show that SEC consistently outperforms standard random and difficulty-ordered curricula, particularly in improving generalization to harder, out-of-distribution problems. For instance, SEC achieved significant relative accuracy gains over random baselines on OOD sets like Countdown (+13% for 3B), Zebra (+21% for 3B), and AIME24 (+33% for 3B). The paper also demonstrates that SEC can effectively manage multi-task fine-tuning by defining categories based on combinations of problem type and difficulty, preventing performance collapse seen with random sampling. The method's effectiveness is also shown to generalize to different underlying RL algorithms beyond GRPO, such as PPO and RLOO.

The curriculum analysis reveals that SEC adaptively shifts the sampled problem difficulty, starting easier and gradually introducing harder problems as the model improves, confirming its dynamic nature.

While effective, SEC's limitations include the reliance on predefined curriculum categories and the introduction of additional hyperparameters ($\alpha$, $\tau$) that require tuning. Future work could explore automatically discovering categories or using more sophisticated methods to estimate learning gain.
