Overview of "SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning"
The paper presents Self-aware Weakness-driven Problem Synthesis (SwS), a framework for improving LLM reasoning through reinforcement learning (RL). In contrast to traditional approaches that indiscriminately expand the problem set, SwS identifies model weaknesses and uses them to drive targeted problem augmentation.
Motivation and Approach
The motivation for SwS is twofold: well-crafted, human-labeled mathematical problems are scarce relative to what effective RL training demands, and conventional problem-synthesis methods generate data without regard to an individual model's capabilities. SwS addresses both issues by defining model weaknesses as problems the model persistently fails to solve during RL training. It then extracts the core concepts underlying these failure cases and synthesizes new problems that reinforce the deficient areas.
The framework operates in three core stages (a minimal code sketch of the full loop follows the list):
- Self-aware Weakness Identification: During an initial RL phase, the model's weaknesses are identified based on problems that it consistently fails to solve.
- Targeted Problem Synthesis: Core concepts are extracted from failure cases and strategically recombined to generate problems that target the model's deficient capabilities.
- Augmented Training: The model continues RL training on the augmented problem set, iteratively mitigating the identified weaknesses.
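To make the three stages concrete, here is a minimal Python sketch of the loop, assuming verifiable problems with reference answers. All helper names (`generate`, `grade`, `extract_concepts`, `synthesize_problem`, `rl_train`) and the numeric settings (rollout count, failure threshold, synthesis budget) are illustrative placeholders, not APIs or values from the paper.

```python
import random

# --- Hypothetical placeholders, not from the paper's released code ---
def generate(model, question):
    """Sample one candidate solution from the policy model."""
    raise NotImplementedError

def grade(solution, answer):
    """Return 1 if the solution matches the reference answer, else 0."""
    raise NotImplementedError

def extract_concepts(question):
    """LLM call: list the core concepts a problem exercises."""
    raise NotImplementedError

def synthesize_problem(concepts):
    """LLM call: compose a new problem combining the given concepts."""
    raise NotImplementedError

def rl_train(model, problems, steps):
    """Run one RL phase (e.g. a policy-gradient method) on the problems."""
    raise NotImplementedError


def pass_rate(model, problem, n_rollouts=8):
    """Fraction of sampled solutions that reach the reference answer."""
    correct = sum(
        grade(generate(model, problem["question"]), problem["answer"])
        for _ in range(n_rollouts)
    )
    return correct / n_rollouts

def identify_weaknesses(model, train_set, threshold=0.25):
    """Stage 1: flag problems the model persistently fails to solve."""
    return [p for p in train_set if pass_rate(model, p) <= threshold]

def synthesize_targeted(weak_problems, n_new=1000, concepts_per_problem=2):
    """Stage 2: extract core concepts from failures and recombine them."""
    concept_pool = []
    for p in weak_problems:
        concept_pool.extend(extract_concepts(p["question"]))
    new_problems = []
    for _ in range(n_new):
        combo = random.sample(
            concept_pool, k=min(concepts_per_problem, len(concept_pool))
        )
        new_problems.append(synthesize_problem(combo))
    return new_problems

def sws_training(model, train_set, rl_steps):
    """Full loop: initial RL, weakness identification, synthesis, retraining."""
    model = rl_train(model, train_set, rl_steps)       # initial RL phase
    weak = identify_weaknesses(model, train_set)       # Stage 1
    augmented = train_set + synthesize_targeted(weak)  # Stage 2
    return rl_train(model, augmented, rl_steps)        # Stage 3
```

One plausible reason to recombine extracted concepts rather than paraphrase failed problems, as the sketch does, is that recombination yields genuinely new questions in the weak areas instead of near-duplicates of the failure cases.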
Experimental Validation
Experiments across model sizes from 3B to 32B parameters and eight diverse reasoning benchmarks demonstrate the efficacy of the framework. Models trained with SwS achieve average performance gains of 10.0% for 7B models and 7.7% for 32B models over training on human-labeled problems alone. The improvements hold across both standard and competition-level benchmarks, underscoring that explicitly targeting a model's weaknesses strengthens its reasoning.
Implications and Future Directions
Conceptually, SwS advances our understanding of self-improvement mechanisms in AI: by enabling models to recognize and remedy their own deficiencies, it promotes more efficient learning and adaptation. Practically, it shows that significant performance gains are attainable without heavy reliance on external datasets, which are costly and slow to produce.
Looking forward, extending the SwS methodology beyond mathematical problem solving to other domains and tasks could yield further insight into the adaptability and generalization of LLMs in RL settings. Combining SwS with complementary strategies such as curriculum learning could further improve training efficiency.
Conclusion
SwS exemplifies a pragmatic approach to enhancing LLM reasoning through targeted, RL-driven problem synthesis. By focusing on model weaknesses, it makes learning more efficient and effective, and it offers a potential blueprint for self-improvement frameworks across AI applications. Future explorations of its scalability and applicability across diverse tasks promise to further advance the development of more capable, autonomous systems.