
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models

Published 13 Nov 2025 in cs.AI and cs.CV | (2511.09907v4)

Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.

Summary

  • The paper introduces a reasoning-driven, solver-adaptive framework that leverages chain-of-thought reasoning to guide problem design for large reasoning models.
  • The method utilizes reward-model-free reinforcement learning with direct solver feedback, showing an average pass@1 improvement of up to 3.4% on various benchmarks.
  • Empirical results demonstrate strong generalization and data efficiency across diverse mathematical and non-mathematical tasks through iterative co-evolution between the generator and solver.

Solver-Adaptive, Reasoning-Driven Problem Synthesis for Large Reasoning Models

Introduction

This work addresses two central limitations of current data synthesis methods for large reasoning model (LRM) training: the lack of explicit, pedagogically aware reasoning during problem generation, and the failure to adaptively calibrate the difficulty of synthetic data to the solver's evolving proficiency. The proposed method introduces a reasoning-driven, solver-adaptive framework for generating high-quality problem datasets that systematically foster solver progress. Explicit chain-of-thought (CoT) traces are employed not only for solution reasoning but also to scaffold and guide the problem design process itself.

Methodology

The framework builds on three main components: (1) reasoning-driven cold-start data generation via reverse-engineered problem-design CoT, (2) reward-model-free RL using direct solver feedback as the reward signal, and (3) co-evolutionary iteration between generator and solver.

Problem-Design CoT Cold-Start

Rather than indiscriminately synthesizing new problems or relying on basic data augmentation, the generator is bootstrapped using multi-part mathematical problems for which subquestion decompositions are available. From these, explicit problem-design CoT traces—detailing the rationale behind moving from one subproblem to a more complex or conceptually adjacent one—are elicited using a prompting protocol. This process results in a scalable, pedagogically aligned corpus of problem-design demonstrations, which is used for supervised fine-tuning (SFT) of the initial generator.
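
Conceptually, a cold-start SFT example pairs a seed problem with a design rationale and a target problem. The sketch below is illustrative only; the field names, prompt wording, and `<think>` delimiters are assumptions, not the paper's actual format.

```python
# Hypothetical assembly of one cold-start SFT example from a related
# problem pair plus a reverse-engineered problem-design CoT.
def build_sft_example(source_problem: str, design_cot: str, target_problem: str) -> dict:
    return {
        # Input: the seed (sub)problem the generator conditions on.
        "prompt": f"Design a harder, related problem.\nSeed: {source_problem}",
        # Target: the design reasoning followed by the new problem,
        # so the generator learns to reason before posing.
        "completion": f"<think>{design_cot}</think>\n{target_problem}",
    }
```

Training on such pairs teaches the generator to emit the design rationale before the problem itself, mirroring solution-side CoT.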

Solver-Adaptive RL with Verifiable Solver Feedback

Traditional data synthesis for LRMs has suffered from either reliance on reward models (prone to reward hacking and limited to coarse or inconsistent human preferences) or costly pipelines using auxiliary difficulty estimators. In this framework, the synthetic problem generator's reward is computed directly from the solver's empirical accuracy on both original and synthesized problems, thereby targeting the very "edge" of the solver's capability. The reward function

R_{acc} = 1 - |a_{\text{new}} - (1 - a_{\text{ori}})| + \min(a_{\text{new}}, 1 - a_{\text{new}})

encourages the generator to propose problems that are maximally informative for learning, i.e., those for which solver accuracy hovers around 0.5 (maximal decision uncertainty), while also supporting controlled inversion of difficulty for targeted curricular scaffolding. Outputs are further constrained by strict formatting rewards to ensure data fidelity.
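
The accuracy reward above can be sketched directly in code. This is a minimal transcription of the formula, where `a_new` is the solver's accuracy on the synthesized problem and `a_ori` its accuracy on the original seed problem; the function name is illustrative.

```python
# Sketch of R_acc from the paper's reward function.
def accuracy_reward(a_new: float, a_ori: float) -> float:
    # Inversion term: rewards synthesized problems whose difficulty
    # complements the original (a_new close to 1 - a_ori).
    inversion = 1.0 - abs(a_new - (1.0 - a_ori))
    # Boundary term: peaks when solver accuracy is near 0.5,
    # i.e. maximal decision uncertainty.
    boundary = min(a_new, 1.0 - a_new)
    return inversion + boundary
```

Note the maximum (1.5) is attained at `a_new = 0.5` with `a_ori = 0.5`, matching the stated goal of problems at the edge of the solver's competence.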

Training is accomplished with Group Relative Policy Optimization (GRPO), obviating the need for a learned value function and resulting in stable, efficient policy updates with strong generalization.
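
GRPO replaces a learned value baseline with a group-relative one: rewards for a group of sampled generations from the same prompt are normalized by the group's mean and standard deviation. The sketch below shows only this advantage computation, not the authors' full training loop.

```python
# Group-relative advantage estimation as used in GRPO (illustrative).
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Normalize each sampled generation's reward against the group,
    # so no separate value network is needed as a baseline.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Generations rewarded above the group mean receive positive advantages and are reinforced; below-mean generations are penalized.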

Iterative Co-Evolution

To further compound gains, the generator and solver are trained in alternating rounds. The solver’s accuracy improvements on synthesized problems create a moving target for the generator, which, in turn, evolves to propose higher-quality and harder problems, enabling continual curriculum adaptation. Empirically, additional co-evolution rounds yield further solver improvement.
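
The alternating dynamic can be illustrated with a deliberately toy simulation, where the "models" are just scalars (solver skill and generated difficulty) standing in for the actual GRPO and SFT stages; the update rules and constants here are invented for illustration, not taken from the paper.

```python
# Toy simulation of generator/solver co-evolution: the generator tracks
# the solver's moving competence level, and the solver improves fastest
# on problems near that level.
def co_evolve(solver_skill: float, gen_difficulty: float, rounds: int = 3):
    history = []
    for _ in range(rounds):
        # Generator round: move synthesized difficulty toward the
        # edge of the solver's current competence.
        gen_difficulty += 0.5 * (solver_skill - gen_difficulty)
        # Solver round: edge-of-competence problems yield the largest
        # improvement, creating a moving target for the generator.
        solver_skill += 0.1 * (1.0 - abs(solver_skill - gen_difficulty))
        history.append((round(gen_difficulty, 3), round(solver_skill, 3)))
    return history
```

Each round both quantities rise, mimicking the reported gains from additional co-evolution rounds.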

Empirical Results

Benchmarks and Baselines

Extensive evaluation encompasses ten mathematical and general reasoning benchmarks, including MATH, GSM8K, AIME, Olympiad, AMC, Minerva, MMLU-Pro, SuperGPQA, and BBEH. Compared baselines include direct seed set training, Self-Instruct, CoT-Self-Instruct, R-Zero (decision-boundary reward), RLMT (preference-model reward), and standard cold-starts.

Numerical Performance

Across all mathematical reasoning datasets, the proposed framework achieves consistent, robust gains. For representative models (Qwen3-4B, Qwen3-8B), the average pass@1 improvement is 3.4% over the strongest alternative, and an additional 2.17% over recent preference-model-based RLMT approaches. These results hold under fixed-budget comparisons, establishing that the observed gains derive from data quality rather than from artifacts of training set size. Notably, the method generalizes effectively to vision-language models (Qwen2.5-VL-7B) and non-mathematical reasoning datasets, where a ~3% improvement in average score is observed.

Ablation studies confirm that including both the inversion and boundary terms in the reward yields the best performance, strongly supporting the designed reward structure.

Generalization and Data Efficiency

The problem generator generalizes well across seed sets derived from different datasets (MATH, GSM8K, DAPO), consistently enhancing solver performance relative to using seed data alone. Furthermore, the reward signal computed using solver consistency is shown to correlate strongly (r = 0.89) with actual accuracy (pass@1), providing theoretical and empirical justification for this label-free RL approach.
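
A consistency signal of this kind can be computed without ground-truth labels from the agreement among sampled solver answers. The sketch below uses majority-vote agreement as the proxy; the exact consistency measure the paper uses may differ.

```python
# Label-free consistency proxy for solver accuracy: the fraction of
# sampled answers that agree with the majority answer.
from collections import Counter

def self_consistency(sampled_answers: list[str]) -> float:
    counts = Counter(sampled_answers)
    # Majority-vote agreement rate across the samples.
    return counts.most_common(1)[0][1] / len(sampled_answers)
```

On problems the solver has mastered (or finds impossible), samples agree and consistency is high; near the competence boundary, answers disperse and consistency drops, which is why it tracks pass@1 so closely.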

Case Analyses

Qualitative inspection demonstrates the system’s capacity for nuanced, context-sensitive adaptation. For instance, given a student mid-way through mastering permutations, the generator can scaffold from a straightforward to a constraint-augmented problem, targeting inclusion-exclusion reasoning at precisely the right stage of curriculum exposure.

Theoretical and Practical Implications

The work demonstrates that large models can be effectively trained in open-ended problem-posing via explicit reasoning-driven data synthesis, augmenting solver-centric RL pipelines. It conclusively establishes that problem-generation can be made both explicit and solver-adaptive, supporting continual curriculum learning without requiring human-in-the-loop reward modeling.

Implications extend to:

  • Educational Technology: Direct, adaptive curriculum and assessment problem generation enabling more fine-grained, personalized, and efficient AI tutoring systems.
  • Data Efficiency: Reducing reliance on expensive, human-curated high-quality datasets through targeted synthetic data generation with strong control of value and coverage.
  • Curriculum Co-Evolution: Enabling bidirectional, automated teacher-student co-evolution that reflects real didactic dynamics in large model training.

Conclusion

The proposed framework establishes a scalable, robust pipeline for reasoning-driven, solver-adaptive problem generation for large reasoning models. Empirical results show significant, consistent improvement in downstream reasoning, robust generalization across domains and modalities, and highlight the value of direct solver feedback as both a learning signal and generator objective. This work points to a future where advanced AI systems can autonomously generate and adapt high-value learning content for continual self-improvement and targeted educational interventions.

Reference:

"Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models" (2511.09907)
