
ReasonFlux-PRM: Trajectory-Aware Process Rewards

Updated 30 June 2025
  • ReasonFlux-PRM is a trajectory-aware process reward model that evaluates both step-level and overall chain-of-thought reasoning in LLMs.
  • It integrates fine-grained alignment, quality, and coherence scores with trajectory-level template supervision to improve data curation, reinforcement learning, and output re-ranking.
  • Empirical benchmarks demonstrate significant gains in supervised fine-tuning, RL policy optimization, and Best-of-N (BoN) test-time scaling across math and science domains.

ReasonFlux-PRM refers to a trajectory-aware Process Reward Model (PRM) designed to evaluate and supervise long chain-of-thought reasoning in LLMs, particularly for settings where models produce explicit, often unstructured or exploratory, intermediate thinking trajectories before composing stepwise answers. ReasonFlux-PRM integrates both step-level and global trajectory-level supervision, directly targeting the unique challenges in current LLM reasoning: robust data curation for supervised fine-tuning, dense process-aligned rewards for reinforcement learning, and effective model selection in inference-time scaling. Its practical efficacy is demonstrated on benchmarks spanning mathematics and science domains, outperforming strong existing PRMs and even human curation in several scenarios.

1. Motivation and Conceptual Foundations

The recent proliferation of trajectory–response outputs from advanced LLMs (e.g., DeepSeek-R1, OpenAI o1-preview) has exposed limitations in prior PRMs, which are predominantly trained on only final responses and are ill-suited for evaluating complex, branching, and sometimes noisy intermediate reasoning traces. These models often misalign reward assignment for intermediate steps, struggle with feedback for unstructured “trajectories” of thought, and can degrade downstream model quality if used naïvely for data selection.

ReasonFlux-PRM is specifically designed for this emerging landscape. It provides dual-granularity reward assessment—both at the fine-grained step level (how each intermediate thought aligns with the corresponding answer step) and at the trajectory (template/strategy) level (whether the overarching reasoning strategy is sound, transferable, and logically coherent). This architecture enables the selection, supervision, and optimization of high-quality complex reasoning traces that reflect not only surface correctness but deeper coherence and generalizability.

2. Architecture and Reward Modeling Approach

Model Backbone and Training

ReasonFlux-PRM is instantiated in two model scales: 1.5B and 7B parameters, based on Qwen2.5-Instruct backbones. Models are trained using a curated dataset of 10,000 trajectory–response pairs, extracted from sources such as OpenThoughts-114K. Supervision signals comprise:

  • Step-level rewards: Integration of three primary proxies for each step $s_t$:
    • Alignment score: Cosine similarity between learned embeddings of trajectory step and matching response step, $r_t^{\text{align}} = \mathrm{sim}(\Phi(s_t), \Phi(a_t))$.
    • LLM-as-a-judge quality: Raw quality estimate, $r_t^{\text{qual}} = J(s_t \mid x, s_{<t}, a)$.
    • Coherence penalty/bonus: Contrastive mutual information for logical step continuity,

    $$r_t^{\text{coh}} = \log \frac{\exp\!\big(\mathrm{sim}(\Phi(s_{t-1}), \Phi(s_t)) / \tau\big)}{\sum_{s' \in \mathcal{N}} \exp\!\big(\mathrm{sim}(\Phi(s_{t-1}), \Phi(s')) / \tau\big)}$$

  • Aggregate step reward:

$$r_t^{\text{step}} = \sum_{k \in \{\text{align, qual, coh}\}} \mathrm{softmax}\big(r_t^{\text{align}}, r_t^{\text{qual}}, r_t^{\text{coh}}\big)_k \cdot r_t^{k}$$

  • Trajectory-level (template) supervision: An LLM is used to extract an abstract reasoning template $\mathcal{T}$ from the full trajectory–response. The model then assesses generalizability by solving new instances guided by $\mathcal{T}$, recording the empirical correctness rate:

$$r^{\text{final}} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big(y^{(j)} \text{ is correct}\big)$$

  • Overall loss:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{step}} \cdot \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{\text{step}}\big(R_\phi(s_t \mid x, s_{<t}, a),\, r_t^{\text{step}}\big) + \lambda_{\text{final}} \cdot \mathcal{L}_{\text{final}}\big(R_\phi(x, y),\, r^{\text{final}}\big)$$

Here, $R_\phi$ is the PRM’s predicted reward, and $\lambda_{\text{step}}, \lambda_{\text{final}}$ are aggregation coefficients.

For scoring, a composite trajectory–response score is employed:

$$\hat{r} = \frac{1}{T} \sum_{t=1}^{T} \hat{r}_t^{\text{step}} + \alpha \cdot \hat{r}^{\text{final}}$$
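
The following minimal sketch shows how these quantities fit together. It is an illustration rather than the reference implementation: the embedding function $\Phi$, the judge score, the temperature value, and the squared-error form used for $\mathcal{L}_{\text{step}}$ and $\mathcal{L}_{\text{final}}$ are assumptions for demonstration; only the aggregation formulas themselves come from the definitions above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(a, b):
    """Cosine similarity sim(., .) between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def coherence_reward(emb_prev, emb_curr, emb_candidates, tau=0.1):
    """Contrastive coherence term r_t^coh: log-ratio of the positive pair
    against a candidate set N (assumed here to include the positive step);
    tau = 0.1 is an assumed temperature."""
    pos = np.exp(cosine(emb_prev, emb_curr) / tau)
    denom = sum(np.exp(cosine(emb_prev, e) / tau) for e in emb_candidates)
    return float(np.log(pos / (denom + 1e-8)))

def step_reward(r_align, r_qual, r_coh):
    """Aggregate step reward: softmax-weighted sum of the three proxies,
    r_t^step = sum_k softmax(r_align, r_qual, r_coh)_k * r_t^k."""
    r = np.array([r_align, r_qual, r_coh], dtype=float)
    return float(softmax(r) @ r)

def final_reward(correct_flags):
    """Trajectory-level reward r^final: empirical correctness rate when new
    problems are solved under the extracted template T."""
    return float(np.mean(correct_flags))

def composite_score(step_rewards, r_final, alpha=1.0):
    """Composite trajectory-response score: mean step reward + alpha * r^final."""
    return float(np.mean(step_rewards) + alpha * r_final)

def total_loss(pred_step, target_step, pred_final, target_final,
               lam_step=1.0, lam_final=1.0):
    """Training objective L_total; squared error is an assumed stand-in for
    the abstract L_step and L_final terms."""
    step_term = float(np.mean((np.asarray(pred_step) - np.asarray(target_step)) ** 2))
    final_term = (pred_final - target_final) ** 2
    return lam_step * step_term + lam_final * final_term
```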

3. Data Selection, RL Rewarding, and Inference Scaling

Three Principal Use Cases

  1. Offline Data Curation for Supervised Fine-Tuning:

    • ReasonFlux-PRM scores trajectory–response pairs and selects the highest-quality samples for use as supervised fine-tuning targets.
    • Empirically, small sets (1k) of PRM-selected data consistently outperform not only random selection but also large-scale human-curated and prior PRM-based selection in improving downstream model accuracy.
  2. Policy Optimization in Reinforcement Learning:
    • During online training, ReasonFlux-PRM provides dense process-aligned rewards, combining both step-level and trajectory-level guidance.
    • The reward signal is incorporated as:

    $$r_{\text{new}} = (1-\beta) \cdot r_{\text{out}} + \beta \cdot \hat{r}$$

    where $r_{\text{out}}$ is a baseline reward (e.g., correct/incorrect final answer), and $\beta$ controls trajectory awareness.
    • This scheme enables policy models to receive more actionable feedback even in the absence of human stepwise annotations.

  3. Test-Time Best-of-N (BoN) Scaling:

    • The model can be used as a re-ranking function over multiple candidate outputs, yielding larger BoN accuracy gains than majority voting or conventional PRMs (a minimal sketch of all three uses follows this list).
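
The sketch below illustrates, under simplifying assumptions, how a single composite score $\hat{r}$ can drive all three uses. The `score` callable stands in for the PRM's composite scoring of a trajectory–response pair; the function names and interfaces are illustrative, not the released API.

```python
from typing import Callable, List, Sequence, Tuple

# `Score` stands in for the PRM's composite score r_hat over a
# (trajectory, response) pair; its interface here is an assumption.
Score = Callable[[str, str], float]

def select_top_k(pairs: List[Tuple[str, str]], score: Score, k: int = 1000):
    """Use case 1: offline data curation -- keep the k highest-scoring
    trajectory-response pairs as supervised fine-tuning targets."""
    return sorted(pairs, key=lambda p: score(*p), reverse=True)[:k]

def blended_reward(r_out: float, r_hat: float, beta: float) -> float:
    """Use case 2: dense RL reward r_new = (1 - beta) * r_out + beta * r_hat,
    where r_out is a baseline outcome reward (e.g., 0/1 answer correctness)."""
    return (1.0 - beta) * r_out + beta * r_hat

def best_of_n(candidates: Sequence[Tuple[str, str]], score: Score) -> Tuple[str, str]:
    """Use case 3: Best-of-N test-time scaling -- re-rank N sampled
    trajectory-response candidates and return the top-scoring one."""
    return max(candidates, key=lambda c: score(*c))
```

In each case, the only interface required of the PRM is a scalar score over a trajectory–response pair, which is what allows the same model to be reused across data curation, RL rewarding, and test-time re-ranking.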

Each of these three uses is validated in experiments showing robust gains (average +12.1% in SFT, +4.5% in RL, and +6.3% in BoN test-time scaling), outperforming prior PRMs such as Qwen2.5-Math-PRM-72B as well as human-annotation-based curation.

4. Empirical Performance and Analysis

Benchmark Results

  • Tested on AIME24, AIME25, MATH500, and GPQA-Diamond:
    • Offline SFT: Fine-tuning on 1k ReasonFlux-PRM-7B-selected traces yields SFT accuracy improvements of up to 12.1% over strong baselines (including those using 59k raw samples or human curation).
    • Reinforcement Learning: On DeepSeek-R1-Distill-Qwen-7B, the ReasonFlux-PRM-7B reward achieves 94.8% on MATH500 (vs. 89.6% for a rule-based reward and 92.8% for the prior Qwen2.5-Math-PRM) and raises GPQA-Diamond accuracy by over four points.
    • BoN Test-Time Inference: Consistently leads to the highest accuracy among strong competitors, with further improvement observed as N increases.

Resource Efficiency

  • ReasonFlux-PRM-1.5B: Designed for edge and resource-constrained settings; delivers comparable RL and SFT performance to the 7B model with notably reduced compute and memory overhead.
  • Small, curated PRM-selected sets confer greater utility (and higher downstream performance) than large random or even human-labeled datasets, reducing fine-tuning cost.

Analytical Observations

  • Score histograms reveal significantly enhanced separation of high- and low-quality trajectories compared to previous PRMs, avoiding the “overlap” problem (good and bad answers assigned similar scores).
  • Step-level and trajectory-level signals together ensure models reward both local logical integrity and global transferability of reasoning strategy.

5. Broader Significance and Prospects

Impact

ReasonFlux-PRM establishes trajectory-aware, dual-granularity reward modeling as an effective paradigm for process supervision in LLMs. It broadens reward modeling from end-point correctness to encompass the entire chain-of-thought, facilitating both automated data curation and robust RL training.

Future Directions

  • Domain Extension: Prospective adaptation to more open-ended domains such as dialogue, code generation, or tool-use scenarios.
  • Adaptive Aggregation: Making the reward aggregation coefficients learnable or context-sensitive.
  • Scalability: Further reduction of parameter count for extreme edge deployment or integration with planning-based reasoning frameworks.

6. Summary Table: Main Innovations and Outcomes

Feature | Implementation/Outcome
--- | ---
Dual-level rewards | Step-level (align/qual/coh) + trajectory-level (template transferability)
Benchmark performance improvement | +12.1% (SFT), +4.5% (RL), +6.3% (BoN scaling) vs. strong PRM/human baselines
Resource-efficient variants | 1.5B PRM model for edge use; nearly matches 7B performance
Data curation efficiency | Small, targeted sets outperform large random/human-labeled sets
Open-source code/models | https://github.com/Gen-Verse/ReasonFlux

7. References and Implementation Resources

Open-source code and models are available at https://github.com/Gen-Verse/ReasonFlux. ReasonFlux-PRM therefore represents a trajectory-aware advance in PRM construction, enabling robust, fine-grained, and sample-efficient alignment of LLMs in long-chain, process-centric reasoning domains.