- The paper introduces ReasonFlux-PRM, a trajectory-aware process reward model (PRM) that integrates step-level and trajectory-level rewards to supervise both intermediate reasoning trajectories and final responses.
- It demonstrates consistent performance improvements, including an average 12.1% gain in supervised fine-tuning, a 4.5% gain in reinforcement learning, and a 6.3% gain in reward-guided test-time scaling.
- The model enhances alignment, interpretability, and data efficiency, supporting robust fine-tuning and reinforcement learning with dense process-level rewards.
ReasonFlux-PRM: Trajectory-Aware Process Reward Models for Long Chain-of-Thought Reasoning in LLMs
The paper introduces ReasonFlux-PRM, a trajectory-aware process reward model (PRM) designed to provide fine-grained supervision for both intermediate reasoning trajectories and final responses in LLMs engaged in long chain-of-thought (CoT) reasoning. This work addresses the limitations of existing PRMs, which are primarily trained on final model outputs and are ill-suited for evaluating the increasingly prevalent trajectory–response outputs generated by advanced reasoning models such as DeepSeek-R1 and OpenAI-o1.
Recent advances in LLMs have led to the adoption of trajectory–response output formats, where a model first generates an extended, often unstructured, intermediate reasoning trajectory, followed by a concise, step-by-step final response. These trajectory–response pairs are widely used for distillation, enabling smaller models to emulate the reasoning capabilities of larger models. However, existing PRMs, trained on final responses, are not calibrated to evaluate the quality of intermediate thinking trajectories, which are structurally distinct and often noisier than final outputs. Empirical analysis in the paper demonstrates that current PRMs struggle to distinguish high- and low-quality trajectories and that their use in data selection can degrade downstream model performance.
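For concreteness, a trajectory–response pair can be pictured as a prompt plus two lists of steps. The sketch below is purely illustrative; the field names and the step splitting are assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryResponsePair:
    """One distillation sample: the prompt, the model's long (and often
    loosely structured) thinking trajectory, and the concise step-by-step
    final response derived from it."""
    prompt: str
    trajectory_steps: List[str]  # intermediate "thinking" steps
    response_steps: List[str]    # polished steps shown in the final answer

# Abbreviated example of the format such pairs take:
pair = TrajectoryResponsePair(
    prompt="Compute 17 * 24.",
    trajectory_steps=[
        "Maybe split 24 into 20 + 4.",
        "17 * 20 = 340 and 17 * 4 = 68.",
        "Adding gives 340 + 68 = 408.",
    ],
    response_steps=["17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."],
)
```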
ReasonFlux-PRM: Model and Reward Design
ReasonFlux-PRM is explicitly designed to address the structural and semantic mismatch between intermediate trajectories and final responses. The model introduces a novel reward design that integrates both step-level and trajectory-level supervision:
- Step-Level Rewards: Each step in the thinking trajectory is evaluated using three components:
  - Alignment Score: Measures semantic similarity between each trajectory step and the corresponding step in the final response, using a pretrained encoder and cosine similarity.
  - Quality Score: Employs a strong LLM (e.g., GPT-4o) as a judge to assess the logical soundness and correctness of each step in context.
  - Coherence Score: Enforces local consistency between adjacent steps via a contrastive mutual information objective, penalizing incoherent transitions.
  These components are aggregated using a softmax-based weighting scheme to produce a unified step-level reward (see the sketch after this list).
- Trajectory-Level Reward: To capture the global problem-solving strategy, ReasonFlux-PRM introduces a template-guided reward. An expert LLM extracts a high-level reasoning template from the trajectory–response pair, and a policy model is tasked with solving the problem by following this template. The trajectory-level reward is defined as the average correctness of the generated responses, reflecting the generalizability and soundness of the overall reasoning strategy.
- Joint Training Objective: The model is trained to minimize the discrepancy between predicted and reference rewards at both the step and trajectory levels, using a weighted sum of mean squared errors.
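The sketch below shows one way these quantities could be computed, assuming step embeddings, LLM-judge quality scores, and rollout correctness labels are already available. All function names, the max-matching form of the alignment term, and the equal-weight softmax default are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def alignment_score(step_emb: np.ndarray, response_embs: list) -> float:
    """Alignment: similarity of a trajectory step to its counterpart in the
    final response; approximated here by the best match over response steps."""
    return max(cosine(step_emb, r) for r in response_embs)

def step_reward(align: float, quality: float, coherence: float,
                weight_logits=None) -> float:
    """Aggregate alignment, LLM-judge quality, and coherence scores into a
    single step-level reward via softmax weighting."""
    logits = np.zeros(3) if weight_logits is None else np.asarray(weight_logits)
    w = np.exp(logits) / np.exp(logits).sum()  # softmax weights over the three terms
    return float(w @ np.array([align, quality, coherence]))

def trajectory_reward(rollouts_correct: list) -> float:
    """Template-guided trajectory-level reward: average correctness of policy
    rollouts that solve the problem by following the extracted template."""
    return float(np.mean(rollouts_correct))

def joint_loss(pred_step_rewards, ref_step_rewards,
               pred_traj_reward, ref_traj_reward,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Training objective: weighted sum of step-level and trajectory-level MSE
    between predicted and reference rewards."""
    step_mse = float(np.mean((np.asarray(pred_step_rewards)
                              - np.asarray(ref_step_rewards)) ** 2))
    traj_mse = (pred_traj_reward - ref_traj_reward) ** 2
    return alpha * step_mse + beta * traj_mse
```

The relative weighting of the step-level and trajectory-level terms (alpha and beta here) is exactly the balance examined in the ablation studies discussed below.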
Application Scenarios
ReasonFlux-PRM is evaluated in both offline and online settings:
- Offline Data Selection: The model assigns composite rewards to trajectory–response pairs, enabling the selection of high-quality data for supervised fine-tuning (SFT) of smaller models. This approach is shown to outperform both strong PRM baselines and human-curated datasets, achieving gains of up to 6.0% on MATH500 and 6.1% on GPQA-Diamond (see the selection sketch after this list).
- Online Reward Modeling: ReasonFlux-PRM provides dense process-level rewards for reinforcement learning (RL) policy optimization (e.g., via GRPO). The integration of ReasonFlux-PRM rewards leads to consistent improvements over rule-based and prior PRM-based signals, with gains of up to 5.9% on challenging benchmarks.
- Test-Time Scaling: The model supports reward-guided Best-of-N selection, enabling more effective inference-time scaling. ReasonFlux-PRM demonstrates superior ability to identify high-quality reasoning traces, maintaining strong performance as the number of sampled candidates increases.
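Both the offline selection and the Best-of-N use cases reduce to ranking candidates by the PRM's composite score. The sketch below makes that assumption explicit; score_pair and score_response stand in for the actual model call, which is not specified here.

```python
from typing import Callable, List, Tuple

def select_top_k(pairs: List[dict],
                 score_pair: Callable[[dict], float],
                 k: int) -> List[dict]:
    """Offline data selection: keep the k trajectory-response pairs with the
    highest composite PRM reward for supervised fine-tuning."""
    return sorted(pairs, key=score_pair, reverse=True)[:k]

def best_of_n(candidates: List[str],
              score_response: Callable[[str], float]) -> Tuple[str, float]:
    """Test-time scaling: score each of the N sampled candidates with the PRM
    and return the highest-scoring one along with its score."""
    scored = [(score_response(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda sc: sc[0])
    return best, best_score
```

For online RL, the same per-step rewards would instead enter the policy-optimization objective (e.g., GRPO) as dense shaping signals rather than being used for ranking.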
Empirical Results
Extensive experiments on AIME, MATH500, and GPQA-Diamond benchmarks validate the effectiveness of ReasonFlux-PRM:
- Supervised Fine-Tuning: Fine-tuning on data selected by ReasonFlux-PRM-7B yields an average 12.1% improvement over baselines.
- Reinforcement Learning: Incorporation of ReasonFlux-PRM-7B rewards results in a 4.5% average gain.
- Test-Time Scaling: Best-of-N selection with ReasonFlux-PRM-7B achieves a 6.3% improvement.
Ablation studies confirm the importance of balancing step-level and trajectory-level rewards, and scaling analyses show that larger reward models provide more accurate supervision.
Implementation and Efficiency
ReasonFlux-PRM is implemented using Qwen2.5-1.5B and Qwen2.5-7B as base models, with training conducted on curated trajectory–response datasets (e.g., OpenThoughts-114K). The approach is also data- and compute-efficient: fine-tuning on 1k ReasonFlux-PRM-selected samples outperforms training on 59k raw samples, and the additional reward-model overhead in RL training is moderate relative to the performance gains.
The code and models are publicly available, facilitating adoption in both research and production settings, including resource-constrained and edge deployments.
Implications and Future Directions
ReasonFlux-PRM advances the state of process reward modeling by enabling robust, fine-grained supervision of both intermediate and final reasoning steps. This has several practical and theoretical implications:
- Data Efficiency: High-quality data selection via trajectory-aware rewards enables smaller models to achieve strong reasoning performance with less data.
- Alignment and Interpretability: Step-level and trajectory-level rewards provide interpretable signals for model alignment and error analysis, supporting safer and more reliable LLM deployment.
- Generalization: The template-guided trajectory-level reward encourages models to learn generalizable problem-solving strategies, not just surface-level step correctness.
Future work may extend ReasonFlux-PRM to more open-ended domains (e.g., code generation, dialogue), explore dynamic weighting of reward components, and investigate integration with other RL and search-based optimization frameworks. The trajectory-aware reward modeling paradigm established by ReasonFlux-PRM is likely to inform the next generation of LLM alignment and reasoning supervision techniques.