- The paper introduces a reasoning-driven, solver-adaptive framework that leverages chain-of-thought reasoning to guide problem design for large reasoning models.
- The method utilizes reward-model-free reinforcement learning with direct solver feedback, showing an average pass@1 improvement of up to 3.4% on various benchmarks.
- Empirical results demonstrate strong generalization and data efficiency across diverse mathematical and non-mathematical tasks through iterative co-evolution between the generator and solver.
Solver-Adaptive, Reasoning-Driven Problem Synthesis for Large Reasoning Models
Introduction
This work addresses two central limitations of current data synthesis methods for large reasoning model (LRM) training: the lack of explicit, pedagogically aware reasoning during problem generation, and the failure to adaptively calibrate the difficulty of synthetic data to the solver's evolving proficiency. The proposed method introduces a reasoning-driven, solver-adaptive framework for generating high-quality problem datasets that systematically foster solver progress. Explicit chain-of-thought (CoT) traces are employed not only for solution reasoning but also to scaffold and guide the problem design process itself.
Methodology
The framework builds on three main components: (1) reasoning-driven cold-start data generation via reverse-engineered problem-design CoT, (2) reward-model-free RL using direct solver feedback as the reward signal, and (3) co-evolutionary iteration between generator and solver.
Problem-Design CoT Cold-Start
Rather than indiscriminately synthesizing new problems or relying on basic data augmentation, the generator is bootstrapped from multi-part mathematical problems for which subquestion decompositions are available. From these, explicit problem-design CoT traces, each detailing the rationale for moving from one subproblem to a more complex or conceptually adjacent one, are elicited via a prompting protocol. The result is a scalable, pedagogically aligned corpus of problem-design demonstrations, which is used for supervised fine-tuning (SFT) of the initial generator.
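As a concrete illustration, the cold-start construction above might be sketched as follows. This is a minimal sketch, not the paper's code; the data layout, the prompt wording, and the `build_design_cot_prompts` helper are all assumptions. Consecutive subquestions of a multi-part problem are paired and wrapped in a prompt that asks a strong LLM to articulate the design rationale linking them.

```python
# Illustrative cold-start data construction (names and prompt text are
# assumptions, not the paper's actual pipeline).

def build_design_cot_prompts(multi_part_problems):
    """Turn subquestion decompositions into problem-design CoT elicitation prompts."""
    prompts = []
    for problem in multi_part_problems:
        subqs = problem["subquestions"]  # assumed ordered easy -> hard
        # Pair each subquestion with its harder successor.
        for easier, harder in zip(subqs, subqs[1:]):
            prompts.append(
                "You are a problem designer.\n"
                f"Simpler problem: {easier}\n"
                f"Harder problem: {harder}\n"
                "Explain, step by step, the design reasoning that transforms "
                "the simpler problem into the harder one, then restate the "
                "harder problem."
            )
    return prompts

example = [{"subquestions": [
    "Compute 7 * 8.",
    "Compute 7 * 8 * 9.",
    "Find the smallest n such that 7 * 8 * 9 divides n!.",
]}]
prompts = build_design_cot_prompts(example)
print(len(prompts))  # -> 2 (one prompt per consecutive pair)
```

The elicited responses (design rationale plus restated problem) would then form the SFT targets for the initial generator.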
Solver-Adaptive RL with Verifiable Solver Feedback
Traditional data synthesis for LRMs has relied either on reward models (prone to reward hacking and limited to coarse or inconsistent human preferences) or on costly pipelines built around auxiliary difficulty estimators. In this framework, the synthetic problem generator's reward is computed directly from the solver's empirical accuracy on both the original and the synthesized problems, thereby targeting the very "edge" of the solver's capability. The reward function

$$R_{\mathrm{acc}} = 1 - \left| a_{\mathrm{new}} - (1 - a_{\mathrm{ori}}) \right| + \min\left(a_{\mathrm{new}},\ 1 - a_{\mathrm{new}}\right),$$

where $a_{\mathrm{new}}$ is the solver's accuracy on a synthesized problem and $a_{\mathrm{ori}}$ its accuracy on the corresponding original problem, encourages the generator to propose problems that are maximally informative for learning, i.e., those for which solver accuracy hovers around 0.5 (maximal decision uncertainty), while the inversion term supports controlled inversion of difficulty for targeted curricular scaffolding. Outputs are further constrained by strict formatting rewards to ensure data fidelity.
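The accuracy reward is straightforward to implement once a_new and a_ori (the solver accuracies on the synthesized and original problems) are estimated. A minimal sketch, with the function name an assumption:

```python
def accuracy_reward(a_new, a_ori):
    """R_acc = 1 - |a_new - (1 - a_ori)| + min(a_new, 1 - a_new).

    First two terms: inversion term, maximized when a_new = 1 - a_ori.
    Last term: boundary term, maximized when a_new = 0.5.
    """
    return 1 - abs(a_new - (1 - a_ori)) + min(a_new, 1 - a_new)

# The reward peaks when the synthesized problem sits at the solver's
# decision boundary and inverts the original's difficulty:
print(accuracy_reward(0.5, 0.5))  # peak: 1 - 0 + 0.5 = 1.5
print(accuracy_reward(0.9, 0.5))  # too easy: roughly 0.7
```

Both terms are bounded, so problems that are trivially easy or impossibly hard for the current solver are penalized symmetrically.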
Training is accomplished with Group Relative Policy Optimization (GRPO), obviating the need for a learned value function and resulting in stable, efficient policy updates with strong generalization.
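GRPO replaces a learned value baseline with group-relative normalization of rewards over a group of sampled outputs for the same prompt. A minimal sketch of just the advantage computation, assuming the standard mean/std normalization (the full GRPO objective also includes a clipped policy-ratio term and a KL penalty not shown here):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: each sampled output's reward is normalized
    by the mean and std of its group, replacing a learned value function."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# Four sampled problems from the generator, scored by the solver-based reward:
advs = grpo_advantages([1.5, 0.7, 0.7, 0.3])
print(advs[0] > 0, advs[-1] < 0)  # best sample pushed up, worst pushed down
```

Because advantages are centered within each group, no separate critic needs to be trained, which is what makes the policy updates cheap and stable.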
Iterative Co-Evolution
To further compound gains, the generator and solver are trained in alternating rounds. The solver’s accuracy improvements on synthesized problems create a moving target for the generator, which, in turn, evolves to propose higher-quality and harder problems, enabling continual curriculum adaptation. Empirically, additional co-evolution rounds yield further solver improvement.
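The alternating rounds can be summarized schematically. The `generator`/`solver` interfaces below are hypothetical placeholders standing in for full RL training loops, not the paper's API:

```python
def co_evolve(generator, solver, seed_problems, rounds=3):
    """Alternate generator-RL and solver-RL rounds (schematic).

    `generator` and `solver` are assumed to expose the methods used below;
    each method stands in for an entire training or evaluation phase.
    """
    problems = list(seed_problems)
    for _ in range(rounds):
        # 1. Generator proposes new problems; solver accuracy drives its reward.
        new_problems = generator.propose(problems)
        rewards = [solver.accuracy(p) for p in new_problems]
        generator.update(new_problems, rewards)
        # 2. Solver trains on the freshly synthesized curriculum,
        #    shifting the "edge" the generator must target next round.
        solver.train(new_problems)
        problems = new_problems
    return generator, solver
```

Each round moves the solver's capability frontier, so the generator's reward landscape shifts with it, which is the mechanism behind the continual curriculum adaptation described above.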
Empirical Results
Benchmarks and Baselines
Extensive evaluation encompasses ten mathematical and general reasoning benchmarks, including MATH, GSM8K, AIME, Olympiad, AMC, Minerva, MMLU-Pro, SuperGPQA, and BBEH. Compared baselines include direct seed set training, Self-Instruct, CoT-Self-Instruct, R-Zero (decision-boundary reward), RLMT (preference-model reward), and standard cold-starts.
Across all mathematical reasoning datasets, the proposed framework achieves consistent, robust gains. For representative models (Qwen3-4B, Qwen3-8B), the average pass@1 improvement is 3.4% over the strongest alternative, with an additional 2.17% gain over recent preference-model-based RLMT approaches. These results hold under fixed-budget comparisons, indicating that the gains derive from data quality rather than from training-set size. Notably, the method also generalizes to a vision-language model (Qwen2.5-VL-7B) and to non-mathematical reasoning datasets, where a roughly 3% improvement in average score is observed.
Ablation studies confirm that including both the inversion and boundary terms in the reward yields the best performance, strongly supporting the designed reward structure.
Generalization and Data Efficiency
The problem generator generalizes well across seed sets derived from different datasets (MATH, GSM8K, DAPO), consistently enhancing solver performance relative to using seed data alone. Furthermore, the reward signal computed using solver consistency is shown to correlate strongly (r=0.89) with actual accuracy (pass@1), providing theoretical and empirical justification for this label-free RL approach.
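One common way to realize such a label-free signal is majority-vote agreement over sampled solver answers; the paper's exact consistency measure may differ, but a sketch along these lines conveys the idea:

```python
from collections import Counter

def consistency_score(sampled_answers):
    """Label-free proxy for solver accuracy: the fraction of sampled answers
    that agree with the majority answer. No ground-truth label is needed."""
    counts = Counter(sampled_answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(sampled_answers)

# Four solver samples for one problem; three agree on "42":
print(consistency_score(["42", "42", "41", "42"]))  # -> 0.75
```

When such a consistency score tracks true pass@1 closely (the reported r = 0.89), it can stand in for accuracy inside the reward without requiring labeled answers for synthesized problems.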
Case Analyses
Qualitative inspection demonstrates the system’s capacity for nuanced, context-sensitive adaptation. For instance, given a student mid-way through mastering permutations, the generator can scaffold from a straightforward to a constraint-augmented problem, targeting inclusion-exclusion reasoning at precisely the right stage of curriculum exposure.
Theoretical and Practical Implications
The work demonstrates that large models can be effectively trained for open-ended problem posing via explicit reasoning-driven data synthesis, augmenting solver-centric RL pipelines. It establishes that problem generation can be made both explicit and solver-adaptive, supporting continual curriculum learning without human-in-the-loop reward modeling.
Implications extend to:
- Educational Technology: Direct, adaptive curriculum and assessment problem generation enabling more fine-grained, personalized, and efficient AI tutoring systems.
- Data Efficiency: Reducing reliance on expensive, human-curated high-quality datasets through targeted synthetic data generation with strong control of value and coverage.
- Curriculum Co-Evolution: Enabling bidirectional, automated teacher-student co-evolution that reflects real didactic dynamics in large model training.
Conclusion
The proposed framework establishes a scalable, robust pipeline for reasoning-driven, solver-adaptive problem generation for large reasoning models. Empirical results show significant, consistent improvement in downstream reasoning, robust generalization across domains and modalities, and highlight the value of direct solver feedback as both a learning signal and generator objective. This work points to a future where advanced AI systems can autonomously generate and adapt high-value learning content for continual self-improvement and targeted educational interventions.
Reference:
"Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models" (2511.09907)