- The paper presents a co-evolutionary framework that integrates a Synthesizer and a Solver to generate adaptive question variants and enhance test-time reasoning without ground-truth labels.
- It employs a variance-aware, capability-adaptive reward mechanism that targets intermediate difficulty samples, ensuring robust gradient signals for effective online policy optimization.
- Experimental results show significant gains on mathematical reasoning benchmarks and improved transfer generalization compared to traditional majority-vote and static-curriculum approaches.
Test-Time Curriculum Synthesis for Self-Evolving LLMs: A Technical Summary
Introduction and Motivation
Test-Time Curriculum Synthesis (TTCS) addresses critical limitations in current test-time training (TTT) and self-evolving frameworks for LLMs, especially in reasoning-centric tasks such as mathematical problem solving. TTT and test-time reinforcement learning (TTRL) leverage unlabeled test instances and self-supervised objectives for online adaptation but fail on hard reasoning questions due to unreliable pseudo-labels and a lack of learnable intermediate samples. Traditional majority-voting for pseudo-labels is ineffective when the model’s inference distribution is far from the correct solution, generating spurious consensus and thus corrupting the reward signal. Existing approaches with static curricula or static synthetic data generators are also insufficient, as they cannot adaptively align training data difficulty with the model’s evolving capabilities, leading to sample inefficiency and model collapse.
TTCS Framework
TTCS proposes a co-evolutionary test-time training framework that instantiates two coupled policies from a shared backbone LLM: a Synthesizer, generating question variants, and a Solver, adapting to those questions through reinforcement learning under label-free constraints. Both are optimized online via Group Relative Policy Optimization (GRPO), a variance-aware policy gradient method suitable for unstable or noisy self-generated reward settings.
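The core of GRPO is a group-relative advantage: the rewards of a group of rollouts sampled from the same prompt are normalized by that group's own mean and standard deviation, so no learned critic is needed. Below is a minimal sketch of that computation, assuming the common formulation and omitting the clipping and KL-regularization terms of the full objective:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts:
    normalize each reward by the mean and std of its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts for one question, 3 of which match the majority answer.
print(grpo_advantages([1, 1, 1, 0, 0, 0, 0, 0]))
```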
Curriculum-Based Variant Synthesis
At each test-time training iteration, the Synthesizer receives a test question and, using a specialized prompting protocol, generates multiple structurally isomorphic but surface-distinct question variants. These variants are explicitly designed to lie near the Solver's capability frontier: neither trivially easy nor unsolvably hard, but in a regime where the model exhibits high prediction variance (self-consistency ∼0.5), thus maximizing the learning signal per gradient update. Synthetic questions are filtered to ensure both validity and sufficient novelty, with rewards penalizing excessive similarity to the reference test question and to other synthetic samples in the batch (e.g., using text/skeleton similarity and clustering metrics).
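A minimal sketch of such a novelty filter, using difflib's sequence similarity as a hypothetical stand-in for the paper's text/skeleton similarity and clustering metrics; the threshold max_sim is illustrative:

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1]; a stand-in for the paper's
    text/skeleton similarity metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_variants(seed: str, variants: list[str], max_sim: float = 0.8) -> list[str]:
    """Keep only variants sufficiently novel w.r.t. the seed question
    and w.r.t. variants already accepted in this batch."""
    kept: list[str] = []
    for v in variants:
        too_close_to_seed = text_similarity(v, seed) > max_sim
        too_close_to_batch = any(text_similarity(v, k) > max_sim for k in kept)
        if not (too_close_to_seed or too_close_to_batch):
            kept.append(v)
    return kept
```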
Capability-Adaptive Rewarding
The generated question variants are assigned rewards based on how well they differentiate the Solver’s current competence. A key technical aspect is the variance-driven reward structure:
$R_{\mathrm{cap}}(x') = \bigl(4\, s(x')\,(1 - s(x'))\bigr)^{\gamma}$
where s(x′) is the Solver's self-consistency (majority-agreement proportion) on question x′. This reward peaks at intermediate difficulty and vanishes for entirely solved or unsolved cases, aligning optimal sample selection with maximal policy-gradient signal and training stability. A composite reward also integrates the similarity penalties, ensuring curriculum diversity and mitigating model collapse from repeated or overly homogeneous samples.
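A short sketch of this reward computation, directly implementing R_cap; the composite combination and the penalty weights lam_seed and lam_batch are illustrative assumptions, not the paper's exact coefficients:

```python
def capability_reward(s: float, gamma: float = 1.0) -> float:
    """R_cap(x') = (4 * s * (1 - s)) ** gamma, where s is the Solver's
    self-consistency on x'; peaks at s = 0.5, vanishes at s = 0 or 1."""
    s = min(max(s, 0.0), 1.0)
    return (4.0 * s * (1.0 - s)) ** gamma

def composite_reward(s: float, sim_to_seed: float, sim_to_batch: float,
                     gamma: float = 1.0, lam_seed: float = 0.5,
                     lam_batch: float = 0.5) -> float:
    """Capability reward minus similarity penalties (weights are illustrative)."""
    return (capability_reward(s, gamma)
            - lam_seed * sim_to_seed
            - lam_batch * sim_to_batch)

# The capability term rewards intermediate difficulty most strongly:
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, capability_reward(s))  # 0.0, 0.75, 1.0, 0.75, 0.0
```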
Online Co-Evolution
The Solver is trained on a dynamically mixed dataset at each iteration, blending original test-set samples with the current batch of synthesized variants. Self-supervision is provided by majority-vote aggregation of the model's sampled outputs, which serve as pseudo-labels. Only samples close to the capability frontier, as measured by the agreement margin, are used for updates, preventing optimization on samples that provide little additional learning signal.
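A minimal sketch of the pseudo-labeling and frontier-filtering steps; the [low, high] agreement band is an illustrative choice for the capability-frontier criterion, not the paper's exact margin:

```python
from collections import Counter

def majority_pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Return the majority-vote answer and its agreement ratio s(x)."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)

def near_frontier(s: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Keep only samples whose agreement lies near the capability frontier."""
    return low <= s <= high

# Example: 8 sampled solutions to one question.
answers = ["42", "42", "17", "42", "13", "42", "17", "42"]
label, s = majority_pseudo_label(answers)
if near_frontier(s):
    print(f"train on this sample with pseudo-label {label} (s = {s:.2f})")
```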
Crucially, the Solver's performance feeds back into the Synthesizer: the Synthesizer is optimized to propose variants that challenge but do not overwhelm the Solver, driving a structured curriculum through closed-loop co-adaptation. This iterative strategy is supported both theoretically (targeting high outcome variance maximizes the expected gradient magnitude) and empirically.
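Tying these pieces together, one possible shape of the closed loop; the synthesize and solve callables, the frontier band, and the bare capability reward are illustrative stand-ins for the GRPO-trained Synthesizer and Solver policies and the full composite reward:

```python
from collections import Counter

def ttcs_step(synthesize, solve, test_questions, k=4, n=8, low=0.2, high=0.8):
    """One schematic TTCS iteration: synthesize variants, score them via the
    Solver's self-consistency, collect frontier samples (plus the original
    questions) for the Solver update, and collect capability rewards for the
    Synthesizer update."""
    solver_batch, synthesizer_rewards = [], []
    for q in test_questions:
        # Pseudo-label the original test question by majority vote.
        label, _ = Counter(solve(q, n)).most_common(1)[0]
        solver_batch.append((q, label))
        for variant in synthesize(q, k):
            v_label, v_count = Counter(solve(variant, n)).most_common(1)[0]
            s = v_count / n                              # self-consistency s(x')
            synthesizer_rewards.append(4 * s * (1 - s))  # capability reward term
            if low <= s <= high:                         # frontier filter
                solver_batch.append((variant, v_label))
    # In the full method, solver_batch drives a GRPO update of the Solver and
    # synthesizer_rewards (with diversity penalties) drive the Synthesizer update.
    return solver_batch, synthesizer_rewards

# Toy usage with stub policies:
stub_synthesize = lambda q, k: [f"{q} (variant {i})" for i in range(k)]
stub_solve = lambda q, n: ["42"] * (n // 2 + 1) + ["17"] * (n - n // 2 - 1)
batch, rewards = ttcs_step(stub_synthesize, stub_solve, ["What is 6 * 7?"])
print(len(batch), rewards)
```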
Experimental Results
TTCS demonstrates substantial quantitative improvements over baselines (Self-Consistency, TTRL, R-Zero) across extensive mathematical benchmarks, including AIME24/25, AMC23, MATH-500, Minerva, and OlympiadBench. For example, on Qwen2.5-Math-1.5B, TTCS reaches an average accuracy of 41.5%, a +24.2-point improvement over the backbone and +14 points over Self-Consistency. On the stronger Qwen2.5-Math-7B, it achieves 52.5%, outstripping TTRL by +4.1 points and Self-Consistency by +20.4 points.
Performance advantages are pronounced on the most challenging tasks (e.g., AIME24/25), where TTCS outperforms TTRL by over +5 points, confirming that the adaptive, curriculum-driven approach provides reliable gradient signals where majority consensus-based adaptation fails.
Generalization and Transfer
TTCS-trained Solvers generalize from mathematical self-evolution to broader reasoning benchmarks such as MMLU-Pro and SuperGPQA. Gains acquired in mathematical self-training transfer robustly, with TTCS surpassing baselines across general-domain tasks. Furthermore, models trained on specific benchmarks (e.g., MATH-500) show strong out-of-distribution improvements on harder benchmarks (e.g., AIME24), indicating that TTCS induces generalizable reasoning skills rather than overfitting.
Ablation and Case Studies
Ablations confirm the necessity of Synthesizer co-evolution, online data filtering at the capability boundary, and diversity-driven reward components. Removing any of these leads to substantial accuracy drops, particularly on challenging datasets. Using a static but stronger synthetic-data generator (e.g., Qwen2.5-14B-Instruct) only marginally improves results compared to a co-evolving, weaker Synthesizer, establishing that online curriculum adaptivity matters more than absolute generator capability.
Longitudinal case studies show that synthesized questions become more sophisticated through training, exhibiting structural variations, domain transfers, and shifts in complexity, reflecting meaningful curriculum progression.
Theoretical Contributions
TTCS grounds its capability-adaptive sampling in robust theoretical arguments. The reward structure aligns sample selection with maximal policy-gradient variance (the greatest learning opportunity), a claim justified via Hoeffding's inequality under a Bernoulli sampling assumption and an analysis of variance-driven policy updates under GRPO. This ensures that self-evolution focuses on the most instructive examples for representation refinement even in the absence of ground-truth labels.
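Under this Bernoulli sampling view (each sampled answer independently agrees with the consensus answer with some probability p), the argument can be stated compactly: the per-rollout outcome variance is p(1 − p), which is maximized exactly where R_cap peaks, and Hoeffding's inequality controls how far the empirical self-consistency s over n rollouts can stray from p. A sketch of the two facts, with the concentration bound in its standard form:

```latex
% Variance of a single Bernoulli(p) agreement indicator, maximized at p = 1/2:
\operatorname{Var}\!\left[\mathbf{1}\{\text{agree}\}\right] = p(1-p),
\qquad \arg\max_{p \in [0,1]} \, p(1-p) = \tfrac{1}{2}.

% Hoeffding bound on the empirical self-consistency s over n i.i.d. rollouts:
\Pr\!\left( \lvert s - p \rvert \ge \epsilon \right) \le 2 \exp\!\left(-2 n \epsilon^2\right).
```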
Implications and Future Directions
Practically, TTCS establishes a scalable framework for self-evolving LLMs in settings where supervision is unavailable, substantially lowering reliance on external labels or ground-truth data. Its curriculum-driven test-time adaptation strategy can be extended to other domains (vision-language, multi-modal reasoning) and can serve as a foundation for autonomous agentic LLMs that dynamically improve their capabilities during real-world deployment.
Theoretically, TTCS illuminates the value of coupling adaptive curriculum generation with outcome-variance aware updating in RL-driven self-improvement and provides a blueprint for preventing model collapse and reward corruption in purely synthetic or recursively-generated data regimes.
Future directions include cross-modal curriculum synthesis, application to longer-horizon reasoning and planning tasks, and tighter integration with agentic learning setups in which environmental interaction further expands the notion of exogenous supervision.
Conclusion
TTCS advances self-evolving LLMs by integrating closed-loop curriculum generation with effective test-time adaptation, enabling substantial gains in reasoning capabilities and transfer generalization without human annotation. The introduction of capability-aware synthetic sample selection and robust reward engineering is empirically and theoretically validated, providing an effective paradigm for future autonomous learning systems.