Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks (2506.13351v1)

Published 16 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advances in LLMs have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model's preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets -- ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark -- and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.

Summary

  • The paper introduces the Reasoning Reflection Reward (R3) to guide LLMs in refining their chain-of-thought reasoning without external evaluations.
  • It leverages the Group Relative Policy Optimization framework and dynamic data filtering to optimize open-ended reasoning tasks efficiently.
  • Empirical results demonstrate enhanced performance and a 45% reduction in training costs on datasets like ParaRev and FinQA.

Overview of "Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks"

The paper "Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks" proposes a framework for improving model reasoning on open-ended tasks. The framework, Direct Reasoning Optimization (DRO), introduces a reward mechanism called the Reasoning Reflection Reward (R3) and integrates it into a reinforcement learning (RL) loop, improving reasoning capabilities without relying on external evaluation mechanisms.

Reasoning Reflection Reward (R3)

The core contribution of the paper is the Reasoning Reflection Reward (R3), a dense, token-level reward signal that guides LLMs in fine-tuning their reasoning processes, particularly for open-ended tasks. R3 measures the LLM's internal self-certainty over the reference outcome conditioned on its earlier reasoning steps. This internal mechanism assesses the consistency between the model-generated chain-of-thought (CoT) reasoning and the reference result, promoting reasoning paths that lead to the desired outcome.
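
As a rough illustration of this idea, the sketch below scores the reference-outcome tokens under the policy itself, conditioned on the prompt and a model-generated CoT. It is not the authors' implementation: the model checkpoint, prompt layout, and function name are placeholders, and it uses a standard HuggingFace causal LM interface.

```python
# Minimal sketch (not the paper's code): per-token self-certainty of the
# reference outcome, conditioned on the model's own chain-of-thought.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def reference_token_certainty(prompt: str, cot: str, reference: str) -> torch.Tensor:
    """Probability the model assigns to each reference token, given the
    prompt plus its own reasoning trace."""
    prefix_ids = tokenizer(prompt + cot, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, ref_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits            # [1, seq_len, vocab]
    # Logits at position t predict token t+1; keep the slice covering the reference.
    ref_logits = logits[:, prefix_ids.size(-1) - 1 : -1, :]
    log_probs = torch.log_softmax(ref_logits.float(), dim=-1)
    token_logps = log_probs.gather(-1, ref_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.exp().squeeze(0)             # self-certainty per reference token
```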

R3 addresses two primary challenges: identifying the reasoning-reflective tokens in the reference, and preventing easily predicted reference tokens from compensating for poor reasoning. It tackles these issues by emphasizing tokens whose certainty varies most with the preceding reasoning and by propagating reasoning certainty from those impactful tokens throughout the generation sequence, as sketched below.
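
The token-weighting step can be pictured as follows: reference tokens whose self-certainty swings the most across different sampled reasoning traces are treated as reasoning-reflective and weighted more heavily. The variance-based weighting below is an assumed, simplified stand-in for the paper's exact formulation.

```python
# Hedged illustration of reasoning-reflective token weighting (assumed scheme).
import torch

def r3_reward(certainties_per_cot: list[torch.Tensor]) -> torch.Tensor:
    """certainties_per_cot: one tensor of per-reference-token probabilities
    per sampled chain-of-thought (all the same length).
    Returns one scalar reward per chain-of-thought."""
    probs = torch.stack(certainties_per_cot)     # [num_cots, ref_len]
    token_variance = probs.var(dim=0)            # how strongly each token reacts to the reasoning
    weights = token_variance / (token_variance.sum() + 1e-8)
    return (probs * weights).sum(dim=-1)         # variance-weighted certainty per CoT
```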

Architecture of Direct Reasoning Optimization (DRO)

DRO applies R3 within the Group Relative Policy Optimization (GRPO) framework, an RL strategy that scores each rollout relative to a group of rollouts sampled from the same prompt. Because R3 is computed by the model being trained, this configuration lets LLMs optimize their reasoning quality internally, yielding a self-contained reinforcement learning setup without an external reward model. Another pivotal component of DRO is its dynamic data filtering strategy, which identifies and excludes samples with limited reasoning diversity or excessive difficulty, reducing training cost while improving downstream performance.
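
A minimal sketch of how R3 scores could feed GRPO-style group-relative advantages, together with a simple spread-based data filter, is given below; the normalization and the threshold are assumptions for illustration rather than the paper's exact recipe.

```python
# Illustrative sketch: group-relative advantages from R3 scores, plus a
# simple filter that drops prompts whose rollouts all score alike.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: R3 scores for a group of rollouts sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def keep_prompt(rewards: torch.Tensor, min_spread: float = 0.05) -> bool:
    """Dynamic data filtering: skip prompts with almost no reward spread,
    since they provide little learning signal (threshold is an assumed value)."""
    return (rewards.max() - rewards.min()).item() > min_spread
```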

Empirical Evaluation and Results

The evaluation of DRO involves two distinct datasets: ParaRev, focused on scientific paragraph revisions, and FinQA, centered on numerical reasoning in financial data. Findings demonstrate DRO's superior performance over various baselines, including models reliant on standard similarity-based metrics like ROUGE for open-ended tasks. Notably, DRO achieves significant improvement in win rates (against models such as GPT-4o) and reduces training cost by approximately 45%.

On the FinQA dataset, DRO achieves results comparable to those guided by correctness-based rewards, showcasing its adaptability to both structured and open-ended reasoning tasks. The paper effectively argues for DRO's scalable application across diverse domains, emphasizing its potential beyond currently verifiable tasks.

Implications and Future Work

The proposed DRO framework has significant implications for enhancing the reasoning capabilities of LLMs, particularly in domains lacking explicit reward verification. The internal reward mechanism simplifies the training process, avoiding reliance on external evaluators and reward models which risk vulnerabilities like reward hacking. This approach highlights a paradigm shift towards self-supervised learning in LLMs, paving the way for more adaptable, autonomous systems.

Further research could explore the extension of DRO to real-world applications involving complex decision-making processes, large-scale data analysis, and interactive dialogues. Future developments in self-rewarding strategies may enhance LLM scalability, enabling more nuanced understanding and optimization in broader AI applications.

In conclusion, the paper's contributions offer a compelling framework for refining LLM reasoning through self-contained reward mechanisms, overcoming traditional limitations in reward modelling for open-ended tasks, and establishing promising avenues for future research and application in AI technologies.
