ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs (2506.18896v1)

Published 23 Jun 2025 in cs.CL

Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in LLMs. Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Projects: https://github.com/Gen-Verse/ReasonFlux

Summary

  • The paper introduces a novel PRM that integrates step-level and trajectory-level rewards to supervise both intermediate reasoning trajectories and final responses.
  • It demonstrates consistent performance improvements, including an average 12.1% gain in supervised fine-tuning, along with better data selection and test-time scaling.
  • The model enhances alignment, interpretability, and data efficiency, supporting robust fine-tuning and reinforcement learning with dense process-level rewards.

ReasonFlux-PRM: Trajectory-Aware Process Reward Models for Long Chain-of-Thought Reasoning in LLMs

The paper introduces ReasonFlux-PRM, a trajectory-aware process reward model (PRM) designed to provide fine-grained supervision for both intermediate reasoning trajectories and final responses in LLMs engaged in long chain-of-thought (CoT) reasoning. This work addresses the limitations of existing PRMs, which are primarily trained on final model outputs and are ill-suited for evaluating the increasingly prevalent trajectory–response outputs generated by advanced reasoning models such as Deepseek-R1 and OpenAI-o1.

Motivation and Problem Formulation

Recent advances in LLMs have led to the adoption of trajectory–response output formats, where a model first generates an extended, often unstructured, intermediate reasoning trajectory, followed by a concise, step-by-step final response. These trajectory–response pairs are widely used for distillation, enabling smaller models to emulate the reasoning capabilities of larger models. However, existing PRMs, trained on final responses, are not calibrated to evaluate the quality of intermediate thinking trajectories, which are structurally distinct and often noisier than final outputs. Empirical analysis in the paper demonstrates that current PRMs struggle to distinguish high- and low-quality trajectories and that their use in data selection can degrade downstream model performance.
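To make the format concrete, a minimal sketch of a trajectory–response training example is shown below; the field names and the `<think>` delimiter are illustrative assumptions rather than the paper's exact data schema.

```python
# Illustrative trajectory-response pair; field names and the <think> delimiter
# are assumptions for this sketch, not the paper's exact schema.
example = {
    "prompt": "If 3x + 5 = 20, what is x?",
    # Long, free-form intermediate thinking trajectory (often noisy).
    "trajectory": "<think>Subtract 5 from both sides: 3x = 15. "
                  "Divide by 3, so x = 5. Check: 3*5 + 5 = 20.</think>",
    # Concise, step-by-step final response.
    "response": "Step 1: 3x = 20 - 5 = 15.\nStep 2: x = 15 / 3 = 5.\nAnswer: x = 5.",
}
```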

ReasonFlux-PRM: Model and Reward Design

ReasonFlux-PRM is explicitly designed to address the structural and semantic mismatch between intermediate trajectories and final responses. The model introduces a novel reward design that integrates both step-level and trajectory-level supervision:

  • Step-Level Rewards: Each step in the thinking trajectory is evaluated using three components:
    • Alignment Score: Measures semantic similarity between each trajectory step and the corresponding step in the final response, using a pretrained encoder and cosine similarity.
    • Quality Score: Employs a strong LLM (e.g., GPT-4o) as a judge to assess the logical soundness and correctness of each step in context.
    • Coherence Score: Enforces local consistency between adjacent steps via a contrastive mutual information objective, penalizing incoherent transitions.

These components are aggregated using a softmax-based weighting scheme to produce a unified step-level reward.

  • Trajectory-Level Reward: To capture the global problem-solving strategy, ReasonFlux-PRM introduces a template-guided reward. An expert LLM extracts a high-level reasoning template from the trajectory–response pair, and a policy model is tasked with solving the problem by following this template. The trajectory-level reward is defined as the average correctness of the generated responses, reflecting the generalizability and soundness of the overall reasoning strategy.
  • Joint Training Objective: The model is trained to minimize the discrepancy between predicted and reference rewards at both the step and trajectory levels, using a weighted sum of mean squared errors; a minimal sketch of the reward computation and this objective follows this list.
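The sketch below summarizes the reward design described above. It assumes softmax weights over the three step-level scores, a sampling-based estimate of the template-guided trajectory reward, and a balancing coefficient `alpha` in the joint objective; these names and the exact parameterization are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def step_level_reward(align, quality, coherence, weight_logits):
    """Combine the three per-step scores with softmax weights.

    align, quality, coherence: tensors of shape (num_steps,).
    weight_logits: three logits; softmax ensures the weights sum to 1.
    """
    w = torch.softmax(weight_logits, dim=0)                   # (3,)
    scores = torch.stack([align, quality, coherence], dim=0)  # (3, num_steps)
    return (w[:, None] * scores).sum(dim=0)                   # unified per-step reward

def trajectory_level_reward(template, problem, policy_generate, is_correct, n=4):
    """Template-guided reward: average correctness of n policy responses
    generated by following the extracted high-level reasoning template."""
    responses = [policy_generate(problem, template) for _ in range(n)]
    return sum(float(is_correct(r)) for r in responses) / n

def joint_loss(pred_step, ref_step, pred_traj, ref_traj, alpha=0.5):
    """Weighted sum of step-level and trajectory-level MSE terms.
    alpha is a hypothetical balancing coefficient, not the paper's notation."""
    step_term = F.mse_loss(pred_step, ref_step)
    traj_term = F.mse_loss(pred_traj, ref_traj)
    return alpha * step_term + (1.0 - alpha) * traj_term
```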

Application Scenarios

ReasonFlux-PRM is evaluated in both offline and online settings:

  • Offline Data Selection: The model assigns composite rewards to trajectory–response pairs, enabling the selection of high-quality data for supervised fine-tuning (SFT) of smaller models. This approach is shown to outperform both strong PRM baselines and human-curated datasets, achieving up to 6.0% and 6.1% gains on MATH500 and GPQA-Diamond, respectively.
  • Online Reward Modeling: ReasonFlux-PRM provides dense process-level rewards for reinforcement learning (RL) policy optimization (e.g., via GRPO). The integration of ReasonFlux-PRM rewards leads to consistent improvements over rule-based and prior PRM-based signals, with gains of up to 5.9% on challenging benchmarks.
  • Test-Time Scaling: The model supports reward-guided Best-of-N selection, enabling more effective inference-time scaling. ReasonFlux-PRM demonstrates a superior ability to identify high-quality reasoning traces, maintaining strong performance as the number of sampled candidates increases (see the selection sketch after this list).
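Both the offline data-selection and Best-of-N use cases reduce to ranking candidates by their PRM reward. A minimal sketch follows; `score_fn` stands in for a call to the ReasonFlux-PRM scorer, whose actual interface is not reproduced here.

```python
from typing import Callable, Dict, List

# score_fn(prompt, trajectory, response) -> float is a stand-in for the PRM;
# the real scoring interface is an assumption of this sketch.
ScoreFn = Callable[[str, str, str], float]

def best_of_n(prompt: str, candidates: List[Dict[str, str]], score_fn: ScoreFn) -> Dict[str, str]:
    """Test-time scaling: return the highest-reward candidate among N samples."""
    return max(candidates, key=lambda c: score_fn(prompt, c["trajectory"], c["response"]))

def select_for_sft(dataset: List[Dict[str, str]], score_fn: ScoreFn, k: int) -> List[Dict[str, str]]:
    """Offline data selection: keep the k highest-reward trajectory-response pairs."""
    ranked = sorted(
        dataset,
        key=lambda ex: score_fn(ex["prompt"], ex["trajectory"], ex["response"]),
        reverse=True,
    )
    return ranked[:k]
```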

Empirical Results

Extensive experiments on AIME, MATH500, and GPQA-Diamond benchmarks validate the effectiveness of ReasonFlux-PRM:

  • Supervised Fine-Tuning: Data selected by ReasonFlux-PRM-7B yields an average 12.1% improvement over baselines.
  • Reinforcement Learning: Incorporation of ReasonFlux-PRM-7B rewards results in a 4.5% average gain.
  • Test-Time Scaling: Best-of-N selection with ReasonFlux-PRM-7B achieves a 6.3% improvement.

Ablation studies confirm the importance of balancing step-level and trajectory-level rewards, and scaling analyses show that larger reward models provide more accurate supervision.

Implementation and Efficiency

ReasonFlux-PRM is implemented using Qwen2.5-1.5B and Qwen2.5-7B as base models, with training conducted on curated trajectory–response datasets (e.g., OpenThoughts-114K). The approach is also efficient: fine-tuning on 1k ReasonFlux-PRM-selected samples outperforms training on 59k raw samples, and the additional overhead of PRM-based rewards in RL training is moderate relative to the performance gains.

The code and models are publicly available, facilitating adoption in both research and production settings, including resource-constrained and edge deployments.

Implications and Future Directions

ReasonFlux-PRM advances the state of process reward modeling by enabling robust, fine-grained supervision of both intermediate and final reasoning steps. This has several practical and theoretical implications:

  • Data Efficiency: High-quality data selection via trajectory-aware rewards enables smaller models to achieve strong reasoning performance with less data.
  • Alignment and Interpretability: Step-level and trajectory-level rewards provide interpretable signals for model alignment and error analysis, supporting safer and more reliable LLM deployment.
  • Generalization: The template-guided trajectory-level reward encourages models to learn generalizable problem-solving strategies, not just surface-level step correctness.

Future work may extend ReasonFlux-PRM to more open-ended domains (e.g., code generation, dialogue), explore dynamic weighting of reward components, and investigate integration with other RL and search-based optimization frameworks. The trajectory-aware reward modeling paradigm established by ReasonFlux-PRM is likely to inform the next generation of LLM alignment and reasoning supervision techniques.