Reasoning Quality Reward in Process Q-Value Models
- Reasoning quality reward is a supervisory signal that evaluates intermediate logical steps to assign fine-grained credit in a multi-step chain-of-thought process.
- The Process Q-Value Model (PQM) uses a comparative loss function that enforces the proper ordering of step-level Q-values, yielding an 11.6% absolute gain in verification accuracy on MATH500 with Llama-3-70B-Instruct.
- PQM integrates MDP-based state representations and Monte Carlo rollout sampling, offering enhanced interpretability and robust performance in reasoning-intensive tasks.
A reasoning quality reward is a supervisory signal—often scalar-valued—assigned to intermediate steps or trajectories within a multi-step reasoning process, designed to shape complex LLM outputs toward logically correct, coherent, and causally aligned solutions. Unlike traditional outcome-only rewards, reasoning quality rewards explicitly evaluate the internal consistency, anticipatory value, and sequential interdependency of intermediate steps, providing finer-grained credit assignment for the internal “chain-of-thought” (CoT) or reasoning strategy that leads to a conclusion. This concept is fundamental to process reward modeling (PRM) and is core to modern advances in alignment, mathematical reasoning, and complex decision-making with LLMs.
1. Problem Formulation: Reasoning as Markov Decision Process and Q-value Ranking
The Process Q-value Model (PQM) reframes process reward modeling within the Markov Decision Process (MDP) formalism. Here, a partially generated reasoning trajectory is represented by a state $s_t = (q, a_1, \dots, a_{t-1})$, where $q$ is the original instruction or question and $a_1, \dots, a_{t-1}$ are prior reasoning actions (steps). The next action $a_t$ is selected by a policy $\pi(a_t \mid s_t)$. The Q-value function estimates, for each possible next step, the expected (transformed) probability of obtaining a correct final solution, conditioned on the current partial solution: $Q(s_t, a_t) = \sigma^{-1}\big(\mathbb{E}[r \mid s_t, a_t]\big)$, where $\sigma^{-1}$ is the inverse sigmoid, ensuring $\sigma(Q(s_t, a_t)) \in (0, 1)$ recovers the success probability, and $r = \mathbb{1}[\text{final answer is correct}]$ is an indicator of success. This allows process rewards to measure not merely correctness at a particular step, but the expected value of its influence on the entire future reasoning trajectory, preserving inter-step dependencies.
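To make the definition concrete, the following minimal sketch (this article's illustration, not code released with PQM) turns a Monte Carlo estimate of the success probability into a Q-value target via the inverse sigmoid; the function name `q_value_target` and the clamping constant `eps` are assumptions.

```python
import math

def q_value_target(n_success: int, n_rollouts: int, eps: float = 1e-3) -> float:
    """Inverse-sigmoid (logit) of an empirical success probability.

    n_success / n_rollouts estimates P(correct final answer | s_t, a_t),
    e.g. from Monte Carlo completions of the partial solution. The logit
    maps that probability onto an unbounded Q-value score.
    """
    p = n_success / n_rollouts
    p = min(max(p, eps), 1.0 - eps)  # keep the logit finite at 0% or 100% success
    return math.log(p / (1.0 - p))
```

For example, 6 successes out of 8 rollouts gives a target of roughly $\log(0.75/0.25) \approx 1.10$.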
PQM contrasts with prior classification-based PRMs, which treat each step as an independent binary decision and optimize via cross-entropy loss, neglecting downstream impact and stepwise interrelations.
2. Comparative Loss Function: Capturing Step Dependencies and Ordering
Recognizing that correct reasoning steps should increase the model's chance of eventual success, PQM introduces a comparative loss to directly enforce the correct relative ordering between Q-values of correct and incorrect steps. The loss ensures $Q(s_{c_i}, a_{c_i}) > Q(s_{w_j}, a_{w_j})$ for all pairs $(i, j)$, where $c_i$ and $w_j$ index correct and incorrect steps, respectively.
Two loss variants are defined:
- Theoretical comparative loss: a ranking objective over per-step Q-values that drives every correct step's Q-value above every incorrect step's Q-value, matching the optimal Q-value ordering derived from the MDP formulation.
- Practical variant: the same ranking objective augmented with a margin hyperparameter $\zeta$, which pushes correct-step Q-values to exceed incorrect-step Q-values by at least $\zeta$. This design accentuates the Q-value gap between correct and incorrect reasoning, penalizes misleading or premature step decisions, and encourages the Q-value trajectory to rise with continued correctness and drop sharply at incorrect steps (an illustrative implementation follows this list).
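As one plausible instantiation of the practical variant, the sketch below implements a margin-augmented ranking loss in PyTorch. The function `comparative_q_loss`, its signature, and the particular softmax-style formulation are assumptions of this article rather than the exact published objective; the sketch simply enforces that each correct-step Q-value exceeds every incorrect-step Q-value by roughly the margin.

```python
import torch

def comparative_q_loss(q_correct: torch.Tensor,
                       q_wrong: torch.Tensor,
                       margin: float = 0.5) -> torch.Tensor:
    """Margin-augmented ranking loss over per-step Q-values (illustrative).

    q_correct: predicted Q-values of steps labeled correct, shape (m,)
    q_wrong:   predicted Q-values of steps labeled incorrect, shape (n,)
    Each correct step competes against all margin-shifted incorrect steps;
    the loss is near zero once every correct Q exceeds every wrong Q + margin.
    """
    shifted_wrong = q_wrong + margin
    per_step_losses = []
    for q_c in q_correct:
        # negative log of the softmax probability assigned to the correct step
        logits = torch.cat([q_c.view(1), shifted_wrong])
        per_step_losses.append(torch.logsumexp(logits, dim=0) - q_c)
    return torch.stack(per_step_losses).mean()
```

Minimizing this objective drives the entries of `q_correct` above `q_wrong + margin`, reproducing the intended rise/drop behavior of the Q-value trajectory along a solution.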
3. Empirical Outcomes and Performance Metrics
PQM’s reasoning quality reward was evaluated on multi-step, mathematically intensive benchmarks (e.g., MATH500, GSM-Plus) and using various LLM architectures (MetaMath-Mistral-7B, MuggleMath-13B, Llama-3-70B-Instruct) and sampling policies (Best-of-N sampling). PQM uses the minimum Q-value across all steps as the verification score for a sampled reasoning trajectory, reflecting the critical vulnerability of the entire solution to flawed intermediate steps.
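A minimal sketch of this min-Q verification rule, assuming a hypothetical `step_scorer` that returns per-step Q-values from a trained PQM for one trajectory:

```python
from typing import Callable, List, Sequence

def best_of_n(candidates: Sequence[List[str]],
              step_scorer: Callable[[List[str]], List[float]]) -> int:
    """Select the trajectory whose weakest step is strongest.

    candidates:  N sampled reasoning trajectories, each a list of step strings.
    step_scorer: maps one trajectory to its per-step Q-values (e.g. a PQM head).
    The trajectory score is the minimum step Q-value, so a single flawed step
    drags the whole solution down; the index of the best candidate is returned.
    """
    scores = [min(step_scorer(steps)) for steps in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

The same scoring rule applies unchanged across the BON@8/16/32 settings reported below; only the number of sampled candidates varies.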
Key empirical results include:
- On Llama-3-70B-Instruct and the MATH500 benchmark, PQM achieved a verification accuracy of 51.4%, compared to 39.8% for a baseline using binary cross-entropy loss—a substantial 11.6% absolute improvement.
- Across BON@k settings (e.g., BON@8, BON@16, BON@32) and all model backbones, PQM consistently outperformed classification-based and regression-based (MSE) reward models, validating its superiority for both step-level and trajectory-level reward granularity.
Ablation studies underscored the significance of the margin $\zeta$, with moderate values providing optimal separation of correct and incorrect step reward signals.
4. Theoretical Guarantees and Interpretability
PQM’s comparative loss function is underpinned by rigorous theoretical guarantees:
- The derived optimal Q-value ranking (Theorem 1) ensures that, under mild assumptions (such as correct steps leading to higher expected future success), the model converges to a reward assignment that matches the natural causal structure of sequential reasoning.
- The framework demonstrates that classification-based PRMs are a limiting case of the MDP/Q-value approach, applicable when the transition probabilities for correct next steps are degenerate (extreme).
- The comparative loss’s structure directly aligns with the partial ordering of reasoning correctness, and margin tuning enables flexible control over the sensitivity of reward separation between correct and incorrect intermediate steps.
Empirically, Q-value curves visibly “rise” with correct step continuation and “drop” at points of error, providing interpretability to the model’s scoring function.
5. Implementation and Computational Considerations
Implementing PQM requires:
- Formulating reasoning as an MDP, exposing a state space (the context plus prior steps) and an action space (candidate next steps).
- Training via the comparative loss, which necessitates label assignment not only per-step but also for stepwise Q-value ranking.
- Efficient sampling to estimate future correctness probabilities for the Q-value target, typically via Monte Carlo rollouts (see the sketch after this list).
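The sketch below illustrates the rollout-based estimation referenced above: for each prefix of a solution it samples k continuations and records the empirical success rate, which can then be passed through the inverse-sigmoid transform to form Q-value targets. `sample_completion` and `is_correct` are hypothetical hooks standing in for the policy LLM and an answer checker.

```python
from typing import Callable, List

def estimate_prefix_success(question: str,
                            steps: List[str],
                            sample_completion: Callable[[str], str],
                            is_correct: Callable[[str], bool],
                            k: int = 8) -> List[float]:
    """Monte Carlo estimate of P(correct final answer | prefix) for each step.

    sample_completion: draws one full continuation from the policy given the
                       question plus the reasoning steps so far (hypothetical).
    is_correct:        checks the final answer of a completed solution (hypothetical).
    Returns one success rate per prefix, to be turned into Q-value targets.
    """
    rates = []
    for t in range(1, len(steps) + 1):
        prompt = question + "\n" + "\n".join(steps[:t])
        wins = sum(is_correct(sample_completion(prompt)) for _ in range(k))
        rates.append(wins / k)
    return rates
```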
In practice, PQM can be integrated into the post-hoc verification pipelines of reasoning agents, used for reward-model fine-tuning in RLHF/RLVR setups, or employed at inference time for process-based trajectory selection under Best-of-N sampling. Because the model optimizes a relative ordering rather than absolute probabilities, PQM can leverage weaker labels or pairwise step comparisons, reducing annotation bottlenecks.
Scaling PQM to LLMs involves computational considerations in rollout sampling and Q-value estimation. The use of analytic approximations or small-batch sample estimation balances resource efficiency with statistical stability.
6. Practical Implications and Future Applications
PQM's Q-ranking-based process reward robustly improves multi-step reasoning across model architectures, outperforming classification- and regression-based reward models. In practice, PQM:
- Enables fine-grained trajectory filtering in solution verification, critical for math, science, and code generation tasks where a single flawed step invalidates an entire output.
- Offers a semantically interpretable signal for debugging, curriculum design, and human-in-the-loop feedback collection by pinpointing precisely where reasoning failures occur.
- Lays the foundation for consistent, reliable reward modeling in domains requiring stepwise verifiability and robustness against error propagation.
Broader extensions include exploring improved comparative objectives for tasks with non-binary intermediate feedback, combining PQM with causality-aware reward designs, and applying analogous principles to domains beyond text, such as reasoning in multimodal or hierarchical settings.
In sum, the reasoning quality reward formalized by PQM is grounded in an MDP-based Q-value trajectory, employs a principled comparative loss to enforce correct step ordering, is empirically validated across challenging reasoning tasks, and provides both interpretability and practical effectiveness as a reward signal in reasoning-intensive settings (Li et al., 15 Oct 2024).