- The paper presents a co-evolutionary framework that integrates a Synthesizer and a Solver to generate adaptive question variants and enhance test-time reasoning without ground-truth labels.
- It employs a variance-aware, capability-adaptive reward mechanism that targets intermediate difficulty samples, ensuring robust gradient signals for effective online policy optimization.
- Experimental results show significant gains on mathematical reasoning benchmarks and improved transfer generalization compared to traditional majority-vote and static-curriculum approaches.
Test-Time Curriculum Synthesis for Self-Evolving LLMs: A Technical Summary
Introduction and Motivation
Test-Time Curriculum Synthesis (TTCS) addresses critical limitations in current test-time training (TTT) and self-evolving frameworks for LLMs, especially in reasoning-centric tasks such as mathematical problem solving. TTT and test-time reinforcement learning (TTRL) leverage unlabeled test instances and self-supervised objectives for online adaptation but fail on hard reasoning questions due to unreliable pseudo-labels and a lack of learnable intermediate samples. Traditional majority-voting for pseudo-labels is ineffective when the model’s inference distribution is far from the correct solution, generating spurious consensus and thus corrupting the reward signal. Existing approaches with static curricula or static synthetic data generators are also insufficient, as they cannot adaptively align training data difficulty with the model’s evolving capabilities, leading to sample inefficiency and model collapse.
TTCS Framework
TTCS proposes a co-evolutionary test-time training framework that instantiates two coupled policies from a shared backbone LLM: a Synthesizer, generating question variants, and a Solver, adapting to those questions through reinforcement learning under label-free constraints. Both are optimized online via Group Relative Policy Optimization (GRPO), a variance-aware policy gradient method suitable for unstable or noisy self-generated reward settings.
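The core of GRPO is a group-relative advantage: the rewards of a group of rollouts sampled from the same prompt are normalized by that group's own mean and standard deviation, so no learned critic is needed. Below is a minimal sketch of that computation, assuming the common formulation and omitting the clipping and KL-regularization terms of the full objective:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts:
    normalize each reward by the mean and std of its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts for one question, 3 of which match the majority answer.
print(grpo_advantages([1, 1, 1, 0, 0, 0, 0, 0]))
```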
Curriculum-Based Variant Synthesis
At each test-time training iteration, the Synthesizer receives a test question and, using a specialized prompting protocol, generates multiple structurally isomorphic but surface-distinct question variants. These variants are explicitly designed to lie near the Solver's capability frontier: neither trivially easy nor unsolvably hard, but in a regime where the model exhibits high prediction variance (self-consistency ∼0.5), thus maximizing the learning signal per gradient update. Synthetic questions are filtered to ensure both validity and sufficient novelty, with rewards penalizing excessive similarity to the reference test question and to other synthetic samples in the batch (e.g., using text/skeleton similarity and clustering metrics).
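A minimal sketch of such a novelty filter, using difflib's sequence similarity as a hypothetical stand-in for the paper's text/skeleton similarity and clustering metrics; the threshold max_sim is illustrative:

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1]; a stand-in for the paper's
    text/skeleton similarity metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_variants(seed: str, variants: list[str], max_sim: float = 0.8) -> list[str]:
    """Keep only variants sufficiently novel w.r.t. the seed question
    and w.r.t. variants already accepted in this batch."""
    kept: list[str] = []
    for v in variants:
        too_close_to_seed = text_similarity(v, seed) > max_sim
        too_close_to_batch = any(text_similarity(v, k) > max_sim for k in kept)
        if not (too_close_to_seed or too_close_to_batch):
            kept.append(v)
    return kept
```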
Capability-Adaptive Rewarding
The generated question variants are assigned rewards based on how well they differentiate the Solver’s current competence. A key technical aspect is the variance-driven reward structure:
$R_{\mathrm{cap}}(x') = \bigl(4\, s(x')\,(1 - s(x'))\bigr)^{\gamma}$
where s(x′) is the Solver's self-consistency (majority-agreement proportion) on question x′. This reward peaks at intermediate difficulty and vanishes for entirely solved or unsolved cases, aligning optimal sample selection with maximal policy-gradient signal and training stability. A composite reward also integrates the similarity penalties, ensuring curriculum diversity and mitigating model collapse from repeated or overly homogeneous samples.
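A short sketch of this reward computation, directly implementing R_cap; the composite combination and the penalty weights lam_seed and lam_batch are illustrative assumptions, not the paper's exact coefficients:

```python
def capability_reward(s: float, gamma: float = 1.0) -> float:
    """R_cap(x') = (4 * s * (1 - s)) ** gamma, where s is the Solver's
    self-consistency on x'; peaks at s = 0.5, vanishes at s = 0 or 1."""
    s = min(max(s, 0.0), 1.0)
    return (4.0 * s * (1.0 - s)) ** gamma

def composite_reward(s: float, sim_to_seed: float, sim_to_batch: float,
                     gamma: float = 1.0, lam_seed: float = 0.5,
                     lam_batch: float = 0.5) -> float:
    """Capability reward minus similarity penalties (weights are illustrative)."""
    return (capability_reward(s, gamma)
            - lam_seed * sim_to_seed
            - lam_batch * sim_to_batch)

# The capability term rewards intermediate difficulty most strongly:
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, capability_reward(s))  # 0.0, 0.75, 1.0, 0.75, 0.0
```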
Online Co-Evolution
The Solver is trained on a dynamically mixed dataset at each iteration, blending original test-set samples with the current batch of synthesized variants. Self-supervision is provided by majority-vote aggregation of the model's sampled outputs, which serve as pseudo-labels. Only samples close to the capability frontier, as measured by the agreement margin, are used for updates, preventing optimization on samples that provide little additional learning signal.
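A minimal sketch of the pseudo-labeling and frontier-filtering steps; the [low, high] agreement band is an illustrative choice for the capability-frontier criterion, not the paper's exact margin:

```python
from collections import Counter

def majority_pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Return the majority-vote answer and its agreement ratio s(x)."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)

def near_frontier(s: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Keep only samples whose agreement lies near the capability frontier."""
    return low <= s <= high

# Example: 8 sampled solutions to one question.
answers = ["42", "42", "17", "42", "13", "42", "17", "42"]
label, s = majority_pseudo_label(answers)
if near_frontier(s):
    print(f"train on this sample with pseudo-label {label} (s = {s:.2f})")
```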
Crucially, the Solver's performance feeds back into the Synthesizer: the Synthesizer is optimized to propose variants that challenge but do not overwhelm the Solver, driving a structured curriculum through closed-loop co-adaptation. This iterative strategy is supported both theoretically (targeting high outcome variance maximizes the expected gradient magnitude) and empirically.
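Tying these pieces together, one possible shape of the closed loop; the synthesize and solve callables, the frontier band, and the bare capability reward are illustrative stand-ins for the GRPO-trained Synthesizer and Solver policies and the full composite reward:

```python
from collections import Counter

def ttcs_step(synthesize, solve, test_questions, k=4, n=8, low=0.2, high=0.8):
    """One schematic TTCS iteration: synthesize variants, score them via the
    Solver's self-consistency, collect frontier samples (plus the original
    questions) for the Solver update, and collect capability rewards for the
    Synthesizer update."""
    solver_batch, synthesizer_rewards = [], []
    for q in test_questions:
        # Pseudo-label the original test question by majority vote.
        label, _ = Counter(solve(q, n)).most_common(1)[0]
        solver_batch.append((q, label))
        for variant in synthesize(q, k):
            v_label, v_count = Counter(solve(variant, n)).most_common(1)[0]
            s = v_count / n                              # self-consistency s(x')
            synthesizer_rewards.append(4 * s * (1 - s))  # capability reward term
            if low <= s <= high:                         # frontier filter
                solver_batch.append((variant, v_label))
    # In the full method, solver_batch drives a GRPO update of the Solver and
    # synthesizer_rewards (with diversity penalties) drive the Synthesizer update.
    return solver_batch, synthesizer_rewards

# Toy usage with stub policies:
stub_synthesize = lambda q, k: [f"{q} (variant {i})" for i in range(k)]
stub_solve = lambda q, n: ["42"] * (n // 2 + 1) + ["17"] * (n - n // 2 - 1)
batch, rewards = ttcs_step(stub_synthesize, stub_solve, ["What is 6 * 7?"])
print(len(batch), rewards)
```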
Experimental Results
TTCS demonstrates substantial quantitative improvements over baselines (Self-Consistency, TTRL, R-Zero) across extensive mathematical benchmarks, including AIME24/25, AMC23, MATH-500, Minerva, and OlympiadBench. For example, on Qwen2.5-Math-1.5B, TTCS reaches an average accuracy of 41.5%, a +24.2-point improvement over the backbone and +14 points over Self-Consistency. On the stronger Qwen2.5-Math-7B, it achieves 52.5%, outstripping TTRL by +4.1 points and Self-Consistency by +20.4 points.
Performance advantages are pronounced on the most challenging tasks (e.g., AIME24/25), where TTCS outperforms TTRL by over +5 points, confirming that the adaptive, curriculum-driven approach provides reliable gradient signals where majority consensus-based adaptation fails.
Generalization and Transfer
TTCS-trained Solvers generalize from mathematical self-evolution to broader reasoning benchmarks such as MMLU-Pro and SuperGPQA. Gains acquired in mathematical self-training transfer robustly, with TTCS surpassing baselines across general-domain tasks. Furthermore, models trained on specific benchmarks (e.g., MATH-500) show strong out-of-distribution improvements on harder benchmarks (e.g., AIME24), indicating that TTCS induces generalizable reasoning skills rather than overfitting.
Ablation and Case Studies
Ablations confirm the necessity of Synthesizer co-evolution, online data filtering at the capability boundary, and diversity-driven reward components. Removing any of these leads to substantial accuracy drops, particularly on challenging datasets. Using a static but stronger synthetic-data generator (e.g., Qwen2.5-14B-Instruct) only marginally improves results compared to a co-evolving, weaker Synthesizer, establishing that online curriculum adaptivity matters more than absolute generator capability.
Longitudinal case studies show that synthesized questions become more sophisticated through training, exhibiting structural variations, domain transfers, and shifts in complexity, reflecting meaningful curriculum progression.
Theoretical Contributions
TTCS grounds its capability-adaptive sampling in robust theoretical arguments. The reward structure aligns sample selection with maximal policy-gradient variance (the greatest learning opportunity), a claim justified via Hoeffding's inequality under a Bernoulli sampling assumption and an analysis of variance-driven policy updates under GRPO. This ensures that self-evolution focuses on the most instructive examples for representation refinement even in the absence of ground-truth labels.
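Under this Bernoulli sampling view (each sampled answer independently agrees with the consensus answer with some probability p), the argument can be stated compactly: the per-rollout outcome variance is p(1 − p), which is maximized exactly where R_cap peaks, and Hoeffding's inequality controls how far the empirical self-consistency s over n rollouts can stray from p. A sketch of the two facts, with the concentration bound in its standard form:

```latex
% Variance of a single Bernoulli(p) agreement indicator, maximized at p = 1/2:
\operatorname{Var}\!\left[\mathbf{1}\{\text{agree}\}\right] = p(1-p),
\qquad \arg\max_{p \in [0,1]} \, p(1-p) = \tfrac{1}{2}.

% Hoeffding bound on the empirical self-consistency s over n i.i.d. rollouts:
\Pr\!\left( \lvert s - p \rvert \ge \epsilon \right) \le 2 \exp\!\left(-2 n \epsilon^2\right).
```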
Implications and Future Directions
Practically, TTCS establishes a scalable framework for self-evolving LLMs in settings where supervision is unavailable, substantially lowering reliance on external labels or ground-truth data. Its curriculum-driven test-time adaptation strategy can be extended to other domains (vision-language, multi-modal reasoning) and can serve as a foundation for autonomous agentic LLMs that dynamically improve their capabilities during real-world deployment.
Theoretically, TTCS illuminates the value of coupling adaptive curriculum generation with outcome-variance aware updating in RL-driven self-improvement and provides a blueprint for preventing model collapse and reward corruption in purely synthetic or recursively-generated data regimes.
Future directions include cross-modal curriculum synthesis, application to longer-horizon reasoning and planning tasks, and tighter integration with agentic learning setups in which environmental interaction further expands the notion of exogenous supervision.
Conclusion
TTCS advances self-evolving LLMs by integrating closed-loop curriculum generation with effective test-time adaptation, enabling substantial gains in reasoning capabilities and transfer generalization without human annotation. The introduction of capability-aware synthetic sample selection and robust reward engineering is empirically and theoretically validated, providing an effective paradigm for future autonomous learning systems.