- The paper introduces Logic-RL, a framework using rule-based reinforcement learning and synthetic logic puzzles to train LLMs, focusing on enhancing reasoning skills rather than just reaching correct answers.
- Logic-RL employs a stringent format reward function and strategic system prompt engineering to guide LLMs to follow valid reasoning pathways and avoid relying on superficial patterns.
- A 7B parameter model trained with Logic-RL on 5K logic problems showed improved reasoning abilities and significant generalization performance on challenging math benchmarks like AIME and AMC.
Logic-RL: Rule-Based Reinforcement Learning for LLM Reasoning
The paper "Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning" (2502.14768) explores the integration of rule-based RL to enhance the reasoning capabilities of LLMs, drawing inspiration from DeepSeek-R1. The authors address the challenge of imbuing LLMs with robust reasoning skills by leveraging synthetic logic puzzles for training. The rationale for using synthetic data stems from its inherent controllability in terms of complexity and the ease of answer verification, which facilitates the RL training process.
Technical Contributions to RL Training
The paper makes several technical contributions to achieve effective and stable RL training of LLMs:
System Prompt Engineering
A critical component of the approach is a system prompt that explicitly directs the LLM to lay out its reasoning process before producing a final answer. This emphasis discourages shortcut strategies and promotes a more deliberate, structured thought process.
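As a rough illustration, an instruction of the kind described might look like the snippet below, in the DeepSeek-R1 style of separating thinking from the final answer. The exact wording is an assumption, not the paper's verbatim prompt.

```python
# Illustrative system prompt in the DeepSeek-R1 style; the wording is an assumed
# reconstruction, not Logic-RL's verbatim prompt.
SYSTEM_PROMPT = (
    "You are a helpful assistant. First think through the problem step by step, "
    "then give your final answer. Put all of your reasoning inside <think> </think> "
    "tags and put only the final answer inside <answer> </answer> tags, i.e. "
    "<think> reasoning process here </think> <answer> answer here </answer>."
)
```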
Reward Function Design
The authors implement a stringent format reward function that penalizes the model for taking shortcuts or deviating from the required output structure. This reward is crucial for ensuring that the model solves the logic puzzles through valid reasoning steps rather than by exploiting superficial patterns or biases in the training data.
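A minimal sketch of such a rule-based reward is shown below, assuming the <think>/<answer> layout from the prompt sketch above. The regular expressions, the answer-matching rule, and the specific reward values are assumptions chosen for illustration rather than the paper's exact scheme.

```python
import re

# Hypothetical rule-based reward: the tag layout mirrors the think/answer format,
# but the regexes and reward values below are illustrative assumptions.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Penalize any deviation from the required <think>/<answer> structure."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else -1.0


def answer_reward(completion: str, ground_truth: dict) -> float:
    """Grant credit only when the extracted answer states every role correctly."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1)
    # Require every character's role to appear verbatim, e.g. "Alice is a knight".
    correct = all(
        f"{name} is a {'knight' if is_knight else 'knave'}" in predicted
        for name, is_knight in ground_truth.items()
    )
    return 2.0 if correct else 0.0


def total_reward(completion: str, ground_truth: dict) -> float:
    """Combine the strict format check with answer correctness."""
    return format_reward(completion) + answer_reward(completion, ground_truth)
```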
Training Recipe and Convergence
The paper adopts a straightforward training recipe that nonetheless achieves stable convergence during RL training, emphasizing that a carefully tuned regime is needed to prevent divergence and to ensure the model learns the underlying reasoning principles.
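The toy example below illustrates one common stabilizer in this setting: a policy-gradient update combined with a KL penalty toward a frozen reference distribution, so that rule-based rewards shift probability toward correct answers without letting the policy drift arbitrarily far. The algorithm choice, coefficients, and the tiny categorical "policy" are assumptions for illustration, not the paper's reported recipe.

```python
import torch

# Toy sketch (an assumption, not Logic-RL's recipe): REINFORCE-style updates on a
# small categorical policy over candidate answers, regularized by a KL penalty
# toward a frozen reference distribution.
torch.manual_seed(0)

num_answers = 4
logits = torch.zeros(num_answers, requires_grad=True)       # trainable policy logits
reference = torch.distributions.Categorical(logits=torch.zeros(num_answers))  # frozen reference
rewards = torch.tensor([2.0, -1.0, -1.0, -1.0])              # rule-based reward per candidate
optimizer = torch.optim.Adam([logits], lr=0.1)
kl_coef = 0.05                                               # assumed value, illustration only

for step in range(200):
    policy = torch.distributions.Categorical(logits=logits)
    action = policy.sample()
    # Policy-gradient loss plus a KL term that keeps the policy near the reference.
    loss = -rewards[action] * policy.log_prob(action) \
           + kl_coef * torch.distributions.kl_divergence(policy, reference)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))  # probability mass shifts toward the rewarded answer
```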
Model Capabilities and Generalization
The results demonstrate that a 7B parameter model trained with the proposed Logic-RL framework develops advanced reasoning behaviors, including reflection, verification, and summarization. These behaviors are absent from the logic training corpus itself and emerge over the course of RL training. Remarkably, after training on only 5K logic problems, the model generalizes to the challenging math benchmarks AIME and AMC. This suggests that rule-based RL can transfer reasoning skills learned from synthetic logic puzzles to more complex, out-of-distribution problem domains.
Implications for Reasoning in LLMs
The research findings highlight the potential of rule-based RL as a means of enhancing the reasoning capabilities of LLMs. By focusing on the reasoning process and using a carefully designed reward function, the Logic-RL framework can promote the development of advanced reasoning skills that generalize to challenging benchmarks. The successful application of this approach to a relatively small 7B model suggests that it could be a promising avenue for improving the reasoning abilities of larger LLMs as well.
Conclusion
The Logic-RL framework, with its emphasis on system prompt engineering, stringent format reward functions, and a stable training recipe, offers a compelling approach to enhancing the reasoning capabilities of LLMs. The generalization of a 7B model trained on 5K logic problems to benchmarks like AIME and AMC underscores the potential of this method.