Process Q-value Model (PQM)

Updated 29 November 2025

Process Q-value Model (PQM) is a framework that maps multi-step trajectories to scalar Q-values for predicting long-term outcomes under MDP/POMDP settings.
It leverages ranking, optimization, and transfer learning to improve decision making, reward modeling, and policy robustness across various domains.
PQM has been applied in multi-task RL, LLM agent control, and robust planning, demonstrating enhanced sample efficiency and empirical performance improvements.

The Process Q-value Model (PQM) refers to multiple technically distinct but conceptually related frameworks for constructing and utilizing Q-value functions over multi-step processes, reasoning chains, or trajectories. These models are unified by their focus on capturing long-term, step-level or process-level value information under Markov Decision Process (MDP) or partially observed MDP (POMDP) settings, and often leverage ranking, optimization, and transfer methodologies to enhance decision making, reward modeling, or policy robustness. PQM has been instantiated in multiple domains: multi-task reinforcement learning, LLM agent control, process-based reward modeling, and learning-augmented robust MDP planning.

1. Mathematical Formulation of PQM

The PQM is formally defined as a function mapping state-action (and, in multi-goal RL, state-state-action) tuples to scalar values denoting expected long-term process outcomes. The canonical formulation in multi-task RL is the planning quasi-metric: $f: \mathcal{S}\times\mathcal{S}\times\mathcal{A}\ \rightarrow\ \mathbb{R}_+$

$f(s,s',a) = \min_\pi\,\mathbb{E}\left[\#\{\text{steps to reach } s' | s_0=s, a_0=a, \pi\}\right]$

This quasi-metric quantifies the expected minimal steps required to reach $s'$ from $s$ by starting with action $a$ and subsequently following an optimal policy. Importantly, $f$ need not be symmetric due to irreversibility in environments (Micheli et al., 2020).

For LLM agents and process reasoning, PQM generalizes the step-level Q-value: $Q^\pi(u, \tau_t, a_t) = \mathbb{E}_{\tau_{t+1:T}\sim\pi}\left[\sum_{k=t}^T r(u, \tau)\ \big|\ \tau_t, a_t\right]$ Here, $\tau_t$ denotes history, $u$ is task instruction, and $r$ is (possibly sparse) reward (Zhai et al., 14 Sep 2024, Li et al., 15 Oct 2024).

For reward modeling in process verification, PQM is defined via ranking of future success probabilities over entire solution chains: $Q^\pi(s_t, a_t) = \sigma^{-1}\left(\mathbb{E}_{a_{t+1:H}\sim\pi}[I(x, a_1:H)]\right)$ with $I(x,a_1:H)$ an indicator for correct final outcome and $\sigma$ the sigmoid function (Li et al., 15 Oct 2024).

2. Objective Functions and Training Methodologies

The core approach in all PQM instantiations is to train a model—typically function approximators or neural networks—to minimize a theoretically motivated loss reflecting process-level ranking or Bellman consistency.

In planning quasi-metrics, the training loss is a squared Bellman error enforcing consistency across transitions: $L_f(w_f) = (f_{w_f}(s_t,s_{t+1},a_t) - 1)^2 + [f_{w_f}(s_t,s',a_t) - (1 + f_{\tilde{w}_f}(s_{t+1},s',a^*))]^2$ where $a^* = \arg\min_\alpha f_{\tilde{w}_f}(s_{t+1},s',\alpha)$ and $\tilde{w}_f$ denotes target-network parameters (Micheli et al., 2020).

For LLM agents, step-level Q-value fitting utilizes preference pairs and Direct Policy Optimization (DPO) objectives: $\mathcal{L}_{\rm step}(\theta) = - \mathbb{E}_{(u,\tau_t,a^w,a^l)} \log \sigma(\Delta_t)$ with $\Delta_t$ a Bradley–Terry difference of log-probabilities under trained vs reference policy, regularized by $\beta$ (Zhai et al., 14 Sep 2024).

In process reward modeling, PQM exploits a margin-aware list-wise Plackett–Luce loss: $\mathcal{L}_{\text{PQM}}(\tau) = -\frac{1}{|C|} \sum_{t=0}^{|C|} \log \frac{\exp(Q_{c_t})}{\sum_{q=0}^t \exp(Q_{c_q}) + \sum_{w\in W} \exp(Q_w+\zeta)}$ Here $C$ and $W$ index correct and wrong steps, and $\zeta$ enforces separation (Li et al., 15 Oct 2024).

3. PQM in Multi-Task RL and Transfer Learning

In multi-task RL, PQM decomposes goal-reaching policies into a task-agnostic world model (the quasi-metric $f_{w_f}$ ) and lightweight, task-specific aimers ( $h_{w_h}$ ). The aimer chooses the most reachable target within the goal set: $h(s, g) = \arg\min_{s'\in g} \min_a f(s, s', a)$ After training, the online policy composes these modules to select

$a_t = a_{w_a}(s_t, h_{w_h}(s_t, g))$

This structure enables pre-training of $f_{w_f}$ for sample-efficient transfer across disparate tasks by relearning only the aimer (Micheli et al., 2020). Empirical evaluations on bit-flip and MuJoCo settings demonstrate multi-fold speedups in early convergence and strong correlation of PQM metrics to true environment step distances.

4. Step-Level and Process-Level PQM for LLM Agents

PQM architectures are increasingly used for LLM-based agents performing multi-step decision-making or complex reasoning. The step-level Q-value model allows preferential action selection based on process outcomes, rather than independent token or step correctness. Data is annotated by Monte Carlo Tree Search (MCTS), producing Q-value targets leveraged in preference-based DPO training (Zhai et al., 14 Sep 2024). At inference, candidate actions sampled from the LLM are ranked by PQM-derived Q scores: $Q(u, \tau_t, a) = \beta \ln \pi_\theta(a|u, \tau_t) - \beta \ln \pi_{\rm ref}(a|u, \tau_t)$ The resulting policy achieves substantial improvements on multi-step benchmarks (WebShop, HotPotQA), outperforming fine-tuning baselines and supporting plug-and-play integration across diverse agent architectures.

5. Process Reward Modeling and Q-Value Rankings

PQM generalizes standard cross-entropy or BCE-based process reward models by globally ranking the Q-values assigned to the entire chain of decision or reasoning steps. Instead of independently scoring each step, PQM enforces a process-consistent ordering, where Q-values for correct prefixes dominate those for incorrect prefixes. This ranking-based loss yields robust empirical gains in best-of-n verification accuracy, especially on challenging mathematical reasoning tasks (Li et al., 15 Oct 2024). Comparative ablations further validate the necessity of margin hyperparameters ( $\zeta$ ) and demonstrate the scalability of PQM to large-scale LLM corpora.

6. PQM in Robust Learning-Augmented Planning

In learning-augmented single-trajectory MDPs, PQM (termed PROjection Pursuit Model) integrates untrusted Q-value advice with robust baseline policies. Both black-box and “grey-box” strategies are analyzed:

Black-box PQM uses fixed trust regions proportional to deviations from the baseline policy, producing an explicit consistency–robustness tradeoff parameterized by $\lambda$ .
Grey-box PQM dynamically tracks TD errors from oracle Q-value estimates, adapting trust regions and achieving 1-consistency and near-optimal robustness (Li et al., 2023).

The main performance theorems show PQM can interpolate between oracle consistency and baseline robustness, outperforming black-box or worst-case strategies in terms of dynamic regret and ratio of expectations.

PQM Instantiation	Domain	Training Objective
Quasi-metric RL (Micheli et al., 2020)	Multi-task RL	Bellman quasi-metric loss
Step-level Q (LLM) (Zhai et al., 14 Sep 2024)	LLM agent control	DPO preference/ranking loss
Process reward (Li et al., 15 Oct 2024)	Reasoning verification	Plackett–Luce margin ranking loss
Projection Pursuit (Li et al., 2023)	Robust RL planning	Consistency–robustness tradeoff

7. Limitations, Open Questions, and Future Research

Several open issues remain in PQM development and deployment:

Annotation noise: Automatic step correctness pipelines introduce inaccuracies in ranking losses; human or hybrid labeling may tighten empirical gaps (Li et al., 15 Oct 2024).
Verification ceilings: PQM best-of-n accuracy lags the theoretical optimum; further integration with inter-solution or tree-based reasoning may be beneficial.
Online RL integration: While PQM is theoretically compatible with RL objectives, existing applications in process reward modeling are largely offline; full RL integration is an open area (Li et al., 15 Oct 2024).
Robust value scaling: Grey-box PQM enables dynamic scaling of trust in oracle Q-value advice, but analysis of mixing times and practical adaptation in non-ergodic MDPs remains ongoing (Li et al., 2023).

A plausible implication is that PQM, by leveraging process- and chain-level value signals rather than step-local correctness, can form the backbone for increasingly sample-efficient, transferable, and robust agents across reasoning, planning, and control domains. However, precise empirical and theoretical boundaries depend on the quality of value annotation, stability of ranking losses, and compatibility with online learning algorithms.