
GroundedPRM: Process Reward Modeling Innovations

Updated 18 May 2026
  • GroundedPRM is a framework that grounds intermediate process steps using objective signals from tools, UI percepts, or physical scene feedback.
  • In LLM reasoning, it employs a tree-guided Monte Carlo search with tool-verified step validation to enhance factual fidelity.
  • For GUI tasks and robotic manipulation, it uses dynamic memory and spatially-grounded action representations to improve generalization and sim-to-real transfer.

GroundedPRM refers to a set of process reward modeling innovations spanning multi-step reasoning with LLMs, GUI-based task automation, and spatially grounded robotic manipulation. The term encompasses three distinct, independently motivated frameworks: (1) GroundedPRM for step-level LLM supervision with tool-verified fidelity and Monte Carlo tree search, (2) GroundedPRM for GUI-based tasks with perceptual and memory grounding, and (3) Grounded Parameterized Motion Primitives for manipulation in continuous robotic control. Across these domains, the unifying principle is the grounding of process-level reward signals in objective, executable, or physically verifiable feedback rather than subjective or outcome-only estimation.

1. Motivation and Background

Traditional process reward models (PRMs) seek to improve multi-step reasoning by assigning rewards to each intermediate process step, using scalar signals to promote correct reasoning trajectories. However, prior approaches are limited by noisy credit assignment (Monte Carlo rollouts), low factual fidelity (LLM self-judging hallucinations), and misalignment with true process structure. In long-horizon GUI tasks, standard PRMs also fail due to context overloading (“lost in the middle”) and a lack of awareness of dynamic environment changes. In robotic manipulation, generalization is impaired by action representations that are insufficiently grounded in the physical scene.

GroundedPRM frameworks are developed to address these limitations by explicitly grounding reward evaluation—either in external tools, executable percepts, or physical geometry—and by incorporating structured reasoning traces and dynamic context distillation (Zhang et al., 16 Oct 2025, Xiong et al., 27 Sep 2025, Jiang et al., 2024).

2. Tree-Guided, Fidelity-Aware Reward Modeling for LLM Reasoning

GroundedPRM for LLM reasoning (Zhang et al., 16 Oct 2025) introduces a fidelity-aware supervision pipeline centered on three innovations:

  • Structured Reasoning with MCTS: Unlike flat Monte Carlo rollouts, GroundedPRM builds tree-structured reasoning paths using Monte Carlo Tree Search (MCTS). Each node represents a partial reasoning trace; expansions generate alternative completions via the base LLM. Node selection uses Upper Confidence bounds for Trees (UCT), with $Q(s, a)$ tracking aggregate returns and $N(s, a)$ tracking visitation counts.
  • Tool-Verified Step Validation: Each intermediate step $s_j$ is verified using an external tool (e.g., Wolfram Alpha, SymPy). Steps are cast as structured queries, and executable correctness is established by binary verification ($v_j = +1$ for correct, $-1$ for incorrect).
  • Hybrid Reward Aggregation: Stepwise tool feedback ($v_j$) and the global outcome ($F$) are combined:

$$u_i = \frac{1}{T-1-i} \sum_{j=i+1}^{T-1} \gamma^{j-i-1} v_j + \beta F$$

The final per-step reward is

$$r_\text{total}(s_i \to s_{i+1}) = \alpha\, r_\text{tool}(s_i \to s_{i+1}) + \beta\, r_\text{tree}(s_i \to s_{i+1}),$$

with $(\alpha, \beta)$ tuned for factual fidelity and global trajectory compatibility.
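The hybrid aggregation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the default values of `gamma`, `alpha`, and `beta`, and the handling of the final step (where the averaging window is empty) are all assumptions made for the example.

```python
from typing import Sequence


def hybrid_step_value(v: Sequence[int], i: int, outcome: float,
                      gamma: float = 0.9, beta: float = 0.5) -> float:
    """u_i = 1/(T-1-i) * sum_{j=i+1}^{T-1} gamma^(j-i-1) * v_j + beta * F."""
    T = len(v)
    if not 0 <= i < T:
        raise IndexError("step index out of range")
    horizon = T - 1 - i  # number of steps that follow step i
    if horizon == 0:
        # Assumption: the last step has no future steps, so only the
        # outcome term contributes (the source does not define this case).
        return beta * outcome
    # Discounted sum of tool verdicts (+1 / -1) over the future steps.
    discounted = sum(gamma ** (j - i - 1) * v[j] for j in range(i + 1, T))
    return discounted / horizon + beta * outcome


def total_step_reward(r_tool: float, r_tree: float,
                      alpha: float = 0.5, beta: float = 0.5) -> float:
    """r_total = alpha * r_tool + beta * r_tree."""
    return alpha * r_tool + beta * r_tree
```

For example, `hybrid_step_value([1, 1, -1, 1], i=0, outcome=1.0)` averages the discounted verdicts of steps 1 to 3 and adds the weighted outcome term; a single failed tool check lowers the value of every earlier step without zeroing it.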

Rationale-enhanced supervision is introduced: instead of binary labels only, the model is autoregressively trained to generate both per-step correctness labels and natural-language rationales, augmenting interpretability and instruction-finetuning compatibility.
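The UCT-based node selection used during tree construction can be illustrated with a small sketch. It uses the standard UCT rule (mean return plus an exploration bonus); the `Node` class, the exploration constant `c`, and the choice to store a running sum of returns are assumptions for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                      # partial reasoning step held by this node
    visits: int = 0                # N(s, a)
    total_return: float = 0.0      # running sum of backed-up returns
    children: List["Node"] = field(default_factory=list)

    @property
    def q(self) -> float:
        """Mean return Q(s, a); zero for unvisited nodes."""
        return self.total_return / self.visits if self.visits else 0.0


def uct_score(parent_visits: int, child: Node, c: float = 1.4) -> float:
    """Standard UCT: exploitation term plus exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return child.q + bonus


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(parent.visits, ch, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Update visit counts and returns along the selected path."""
    for node in path:
        node.visits += 1
        node.total_return += reward
```

A rarely visited child with a moderate mean return can outrank a frequently visited child with a higher mean, which is what lets the search explore alternative reasoning branches instead of committing to the first promising one.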

3. GroundedPRM in GUI-based Process Supervision

In the GUI domain (Xiong et al., 27 Sep 2025), GroundedPRM (also termed GUI-PRA) addresses unique process supervision challenges by explicitly grounding reward evaluation both in distilled history and in perceptual UI evidence:

  • Dynamic Memory Mechanism: To prevent “lost in the middle” failures, a compressed history is constructed via:
    • Relevance-based Retrieval: Retaining only the most recent steps from the full agent transcript.
    • Progressive Summarization: Earlier contextual interactions are reduced via a one-sentence LLM-based summarization.
  • Adaptive UI Perception: Action evaluations are grounded in the observed effects on the GUI:
    • At each step, a multi-tool perceive-reason-verify loop collects UI state change evidence using tools like OmniParser (structured OCR) and Point (vision-language pointer grounding).
    • The judge’s prompt receives both the distilled context and the perceptual evidence, and the judge scores each candidate process step.

This approach enables process rewards that are both aware of historical context and grounded in observable, dynamic UI changes.
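The dynamic memory mechanism described above can be sketched as a simple history-compression routine. The function name, the `keep_recent` window size, and the `summarize` callable are hypothetical; in a real system `summarize` would wrap an LLM call that returns a one-sentence summary, which is replaced here by a plain Python stand-in.

```python
from typing import Callable, List


def build_compressed_history(
    transcript: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Keep the last `keep_recent` steps verbatim and summarize the rest."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    older = transcript[:-keep_recent]    # steps to be summarized
    recent = transcript[-keep_recent:]   # steps retained in full
    if not older:
        return list(recent)
    # One summary line stands in for all earlier interactions.
    return [f"Summary of earlier steps: {summarize(older)}"] + list(recent)


def count_summary(steps: List[str]) -> str:
    """Stand-in for an LLM summarizer (assumption for illustration only)."""
    return f"{len(steps)} earlier steps"
```

Calling `build_compressed_history(transcript, keep_recent=2, summarize=count_summary)` on a four-step transcript yields one summary line followed by the two most recent steps, so the judge's prompt stays short regardless of episode length.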

4. Grounded Parameterized Motion Primitives in Robotic Manipulation

GroundedPRM in robotic manipulation (Jiang et al., 2024) formalizes action selection as the grounding of parameterized motion primitives in scene geometry:

  • Action Space Construction: Each action is a composite of three parts:
    • a discrete primitive (e.g., grasp, push),
    • a grounding point selected from a segmented point cloud,
    • continuous primitive-specific motion parameters.
  • Actor-Critic Policy Architecture:
    • Parallel PointNet++-like modules extract actor and critic pointwise features.
    • For each (point, primitive) pair, an MLP outputs candidate motion parameters; a critic map scores all point-primitive combinations.
    • At execution, the policy selects the highest-scoring (point, primitive) pair and applies the corresponding motion parameters.
  • Spatial Grounding and Generalization: Enforcing choices over discrete grounded points in the observed scene induces invariance to object pose, leading to strong generalization across previously unseen instances and categories, with higher success rates on novel categories than non-grounded baselines.
  • Sim-to-Real Robustness: The representation’s physical grounding supports sim-to-real transfer (73% zero-shot success rate in real trials), which is attributed to geometric and perceptual alignment.
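The execution-time selection over grounded actions can be sketched as an argmax over a critic map. This is an illustrative sketch only: the nested-list layout (`critic_map[point][primitive]`), the function name, and the assumption that the policy takes the single highest-scoring pair are choices made for the example, and a real implementation would operate on network outputs rather than Python lists.

```python
from typing import Sequence, Tuple


def select_grounded_action(
    critic_map: Sequence[Sequence[float]],
    motion_params: Sequence[Sequence[Sequence[float]]],
    primitives: Sequence[str],
) -> Tuple[str, int, Sequence[float]]:
    """Return (primitive name, point index, motion parameters) with top score.

    critic_map[p][k]    : critic score for point p and primitive k
    motion_params[p][k] : actor output (continuous parameters) for that pair
    """
    best = None  # (score, point index, primitive index)
    for p_idx, row in enumerate(critic_map):
        for k_idx, score in enumerate(row):
            if best is None or score > best[0]:
                best = (score, p_idx, k_idx)
    if best is None:
        raise ValueError("critic map is empty")
    _, p_idx, k_idx = best
    return primitives[k_idx], p_idx, motion_params[p_idx][k_idx]
```

Because the action is indexed by a point in the observed cloud, moving or rotating the object changes which point is selected but not the structure of the decision, which is the pose-invariance argument made above.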

5. Training Protocols and Empirical Evaluation

LLM Reasoning (Math Domain)

  • Dataset: A pipeline combining MCTS search and tool validation annotates math problem traces, using only a fraction of the training data volume of Math-Shepherd-PRM-7B.
  • Model: Qwen2.5-7B-Instruct finetuned with LoRA (rank 128), batch size 32, 6 epochs, cosine LR.
  • Results: On ProcessBench, GroundedPRM achieves F1 = 39.7 (vs. 31.5 for Math-Shepherd-PRM-7B and 31.3 for EurusPRM-Stage2), a relative improvement of roughly 26%. Reward-guided greedy search is reported as state of the art among the compared PRM-supervised decoding methods.
  • Ablations: Removal of outcome or rationale signals significantly degrades performance, confirming the value of hybrid reward and generative rationale.
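The relative improvement implied by the ProcessBench F1 scores above can be checked with a short calculation. The helper name is hypothetical; the numbers are the F1 values quoted in the results.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# F1 scores quoted above for ProcessBench.
grounded_prm = 39.7
math_shepherd = 31.5
eurus_stage2 = 31.3

print(f"vs Math-Shepherd-PRM-7B: {relative_improvement(grounded_prm, math_shepherd):.1%}")
print(f"vs EurusPRM-Stage2:      {relative_improvement(grounded_prm, eurus_stage2):.1%}")
```

Both comparisons land at roughly 26 to 27 percent, since the two baselines differ by only 0.2 F1 points.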

GUI Process Supervision

  • Evaluation on AndroidWorld and Mobile-MiniWoB++ benchmarks confirms double-digit improvement in success rates versus both base and standard PRM-augmented models. Ablations demonstrate necessity of all components (memory, perception, tool diversity).

Robotic Manipulation

  • DoubleBin 6D pose alignment benchmark: GroundedPRM outperforms the P-DQN and RAPS baselines in success rate on unseen categories after 30 steps.
  • Zero-shot sim-to-real: Achieves 73% real-world success across diverse object instances even with significant domain shift.

6. Key Comparative Table

| Domain | What is GroundedPRM? | Unique Grounding Signal(s) |
| --- | --- | --- |
| LLM Reasoning | Tree-guided, tool-verified PRM | Executable step verification |
| GUI Automation | Judge with memory + UI awareness | Summarized context + UI parses |
| Robotic Manipulation | Spatially-grounded action space | Physical scene, 3D contact point |

This table clarifies that "GroundedPRM" spans distinct implementations, each centered on an explicit, objective signal grounded in execution or perception.

7. Limitations and Future Directions

GroundedPRM approaches depend on the coverage and accuracy of the chosen grounding mechanism—external tools, perceptual pipelines, or point cloud quality. Current limitations include:

  • Incomplete generalization for extremely long-horizon tasks without adaptive memory (GUI-PRA) (Xiong et al., 27 Sep 2025).
  • Potential failure cases if tool or perception modules err, or if MCTS coverage is insufficient.
  • Scalability challenges for domains requiring hierarchical or multi-pass summarization or perception.
  • No end-to-end fine-tuning of entire pipelines (e.g., GUI-PRA relies on pre-trained components without joint optimization).

Potential extensions involve learning task-adaptive history summarizers, integrating richer perception modules, and jointly training the process model with grounding primitives or perception-to-reward adapters.
