GroundedPRM: Process Reward Modeling Innovations

Updated 18 May 2026

GroundedPRM is a framework that grounds intermediate process steps using objective signals from tools, UI percepts, or physical scene feedback.
In LLM reasoning, it employs a tree-guided Monte Carlo search with tool-verified step validation to enhance factual fidelity.
For GUI tasks and robotic manipulation, it uses dynamic memory and spatially-grounded action representations to improve generalization and sim-to-real transfer.

GroundedPRM refers to a set of process reward modeling innovations spanning multi-step reasoning with LLMs, GUI-based tasks, and spatially-grounded robotic manipulation. The term encompasses three distinct, independently motivated frameworks: (1) GroundedPRM for step-level LLM supervision with tool-verified fidelity and Monte Carlo tree search, (2) GroundedPRM for GUI-based tasks with perceptual and memory grounding, and (3) Grounded Parameterized Motion Primitives for manipulation in continuous robotic control. Across these domains, the unifying principle is the grounding of process-level reward signals in objective, executable, or physically verifiable feedback rather than subjective or outcome-only estimation.

1. Motivation and Background

Traditional process reward models (PRMs) seek to improve multi-step reasoning by assigning rewards to each intermediate process step, using scalar signals to promote correct reasoning trajectories. However, prior approaches are limited by noisy credit assignment (Monte Carlo rollouts), low factual fidelity (LLM self-judging hallucinations), and misalignment with true process structure. In long-horizon GUI tasks, standard PRMs also fail due to context overloading (“lost in the middle”) and a lack of awareness of dynamic environment changes. In robotic manipulation, generalization is impaired by action representations that are insufficiently grounded in the physical scene.

GroundedPRM frameworks are developed to address these limitations by explicitly grounding reward evaluation—either in external tools, executable percepts, or physical geometry—and by incorporating structured reasoning traces and dynamic context distillation (Zhang et al., 16 Oct 2025, Xiong et al., 27 Sep 2025, Jiang et al., 2024).

2. Tree-Guided, Fidelity-Aware Reward Modeling for LLM Reasoning

GroundedPRM for LLM reasoning (Zhang et al., 16 Oct 2025) introduces a fidelity-aware supervision pipeline centered on three innovations:

Structured Reasoning with MCTS: Unlike flat Monte Carlo rollouts, GroundedPRM builds tree-structured reasoning paths using Monte Carlo Tree Search (MCTS). Each node represents a partial reasoning trace; expansions generate alternative completions via the base LLM. Node selection uses Upper Confidence bounds for Trees (UCT), with $Q(s, a)$ tracking aggregate returns and $N(s, a)$ visitation counts.
Tool-Verified Step Validation: Each intermediate step $s_j$ is verified using an external tool (e.g., Wolfram Alpha, SymPy). Steps are cast as structured queries, and executable correctness is established by binary verification ($v_j = +1$ for correct, $v_j = -1$ for incorrect).
Hybrid Reward Aggregation: Stepwise tool feedback ($v_j$) and the global outcome ($F$) are combined:

$u_i = \frac{1}{T-1-i} \sum_{j=i+1}^{T-1} \gamma^{j-i-1} v_j + \beta F$

The final per-step reward is

$r_\text{total}(s_i \to s_{i+1}) = \alpha r_\text{tool}(s_i \to s_{i+1}) + \beta r_\text{tree}(s_i \to s_{i+1}),$

with $(\alpha, \beta)$ tuned for factual fidelity and global trajectory compatibility.
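The aggregation above can be sketched in a few lines of Python. This is a minimal illustration of the two formulas as written here, not the authors' implementation: the function names, the default values of `gamma`, `alpha`, and `beta`, and the encoding of the outcome $F$ as $+1$ or $-1$ are all assumptions made for the example.

```python
from typing import Sequence


def hybrid_return(v: Sequence[int], F: float, i: int,
                  gamma: float = 0.9, beta: float = 1.0) -> float:
    """Compute u_i from tool verdicts v[0..T-1] (+1 / -1) and global outcome F.

    Implements u_i = 1/(T-1-i) * sum_{j=i+1}^{T-1} gamma^(j-i-1) * v_j + beta * F.
    The default gamma and beta are illustrative assumptions.
    """
    T = len(v)
    if not 0 <= i < T - 1:
        # The normaliser T-1-i is zero at the last step, so u_i is undefined there.
        raise ValueError("u_i is only defined for 0 <= i < T - 1")
    discounted = sum(gamma ** (j - i - 1) * v[j] for j in range(i + 1, T))
    return discounted / (T - 1 - i) + beta * F


def total_step_reward(r_tool: float, r_tree: float,
                      alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted per-step reward: alpha * r_tool + beta * r_tree (weights assumed)."""
    return alpha * r_tool + beta * r_tree


# Example: four steps, the third one fails tool verification, final answer correct.
verdicts = [+1, +1, -1, +1]
u0 = hybrid_return(verdicts, F=1.0, i=0, gamma=0.5, beta=1.0)
```

In this example the later failed step lowers $u_0$ only partially, because its verdict is discounted by $\gamma$ and averaged with the other future verdicts before the outcome term is added.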

Rationale-enhanced supervision is introduced: instead of binary labels only, the model is autoregressively trained to generate both per-step correctness labels and natural-language rationales, augmenting interpretability and instruction-finetuning compatibility.
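The UCT-based node selection mentioned earlier in this section can be illustrated with the standard UCT rule, which adds an exploration bonus to the mean return $Q(s, a) / N(s, a)$. The sketch below is a generic textbook form under stated assumptions: the exploration constant `c`, the dictionary layout of `children`, and the function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Hashable, Tuple


def uct_score(q_sum: float, n_sa: int, n_parent: int, c: float = 1.4) -> float:
    """Standard UCT: mean return plus an exploration bonus.

    q_sum    -- aggregate return Q(s, a) accumulated for this child
    n_sa     -- visit count N(s, a) for this child
    n_parent -- total visits to the parent node
    c        -- exploration constant (assumed value)
    """
    if n_sa == 0:
        return math.inf  # unvisited children are always tried first
    return q_sum / n_sa + c * math.sqrt(math.log(n_parent) / n_sa)


def select_child(children: Dict[Hashable, Tuple[float, int]], c: float = 1.4) -> Hashable:
    """Return the action whose (q_sum, n_sa) statistics maximise the UCT score."""
    n_parent = max(sum(n for _, n in children.values()), 1)
    return max(children, key=lambda a: uct_score(*children[a], n_parent, c))


# A rarely visited child can win over a well-explored one because of the bonus term.
stats = {"step_a": (9.0, 10), "step_b": (0.5, 1)}
chosen = select_child(stats)
```

Setting `c=0` reduces the rule to pure exploitation of the mean return, which is a quick way to see how much the exploration term influences a given selection.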

3. GroundedPRM in GUI-based Process Supervision

In the GUI domain (Xiong et al., 27 Sep 2025), GroundedPRM (also termed GUI-PRA) addresses unique process supervision challenges by explicitly grounding reward evaluation both in distilled history and in perceptual UI evidence:

Dynamic Memory Mechanism: To prevent “lost in the middle” failures, a compressed history is constructed via:
- Relevance-based Retrieval: Retaining only a limited window of the most recent steps from the full agent transcript.
- Progressive Summarization: Earlier contextual interactions are reduced via a one-sentence LLM-based summarization.
Adaptive UI Perception: Action evaluations are grounded in the observed effects on the GUI:
- At each step, a multi-tool perceive-reason-verify loop collects UI state change evidence using tools like OmniParser (structured OCR) and Point (vision-language pointer grounding).
- The judge’s prompt receives both the distilled context and the perceptual evidence, and the judge assigns a process score to each candidate step.

This approach enables process rewards that are both aware of historical context and grounded in observable, dynamic UI changes.
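The dynamic memory mechanism described above can be sketched as follows: recent steps are kept verbatim and everything older is collapsed into a single summary line. This is an illustrative sketch only. The window size `k`, the function names, and the `summarize` callable are assumptions; in the described system the summary comes from an LLM, which is replaced here by a trivial stand-in so the example runs on its own.

```python
from typing import Callable, List


def build_compressed_history(steps: List[str], k: int,
                             summarize: Callable[[List[str]], str]) -> List[str]:
    """Keep the k most recent steps verbatim and summarise everything earlier.

    `summarize` stands in for the one-sentence LLM summariser; it is a
    hypothetical hook, not an API from the GUI-PRA paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(steps) <= k:
        return list(steps)  # nothing old enough to summarise
    older, recent = steps[:-k], steps[-k:]
    return ["Summary: " + summarize(older)] + recent


def naive_summarizer(older: List[str]) -> str:
    """Placeholder summariser used only to make the sketch self-contained."""
    return f"{len(older)} earlier steps, starting with '{older[0]}'"


transcript = ["open app", "tap search", "type query", "tap result"]
history = build_compressed_history(transcript, k=2, summarize=naive_summarizer)
```

The resulting `history` is what would be placed in the judge's prompt alongside the perceptual evidence, so its length stays bounded regardless of how long the episode runs.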

4. Grounded Parameterized Motion Primitives in Robotic Manipulation

GroundedPRM in robotic manipulation (Jiang et al., 2024) formalizes action selection as the grounding of parameterized motion primitives in scene geometry:

Action Space Construction: Each action is a composite of three elements:
- a discrete primitive (e.g., grasp, push),
- a grounding point selected from a segmented point cloud,
- continuous primitive-specific motion parameters.
Actor-Critic Policy Architecture:
- Parallel PointNet++-like modules extract actor and critic pointwise features.
- For each (point, primitive) pair, an MLP outputs candidate motion parameters; a critic map scores all point-primitive combinations.
- At execution, the policy selects a (point, primitive) pair based on the critic scores and applies the corresponding motion parameters.
Spatial Grounding and Generalization: Enforcing choices over discrete grounded points in the observed scene induces invariance to object pose, leading to strong generalization across previously unseen instances and categories (a higher success rate on novel categories than for non-grounded baselines).
Sim-to-Real Robustness: The representation’s physical grounding supports sim-to-real transfer (a zero-shot 73% success rate in real trials) due to geometric and perceptual alignment.
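The selection step in the actor-critic architecture above can be sketched as a search over the critic map. The sketch assumes greedy selection (taking the highest-scoring point-primitive pair), which is a simplifying assumption for illustration; the nested-list layout, the function name, and the example values are hypothetical, and the real system operates on learned per-point features rather than hand-written scores.

```python
from typing import List, Sequence, Tuple


def select_action(critic_map: Sequence[Sequence[float]],
                  motion_params: Sequence[Sequence[Tuple[float, ...]]]
                  ) -> Tuple[int, int, Tuple[float, ...]]:
    """Pick the (primitive, point) pair with the highest critic score.

    critic_map[p][k]    -- score for grounding point p and primitive k
    motion_params[p][k] -- actor output (continuous parameters) for that pair
    Returns (primitive_index, point_index, parameters).
    Greedy argmax is an assumption made for this sketch.
    """
    best = None
    for p, row in enumerate(critic_map):
        for k, score in enumerate(row):
            if best is None or score > best[0]:
                best = (score, p, k)
    if best is None:
        raise ValueError("critic map is empty")
    _, p, k = best
    return k, p, motion_params[p][k]


# Three candidate points, two primitives (0 = grasp, 1 = push; labels assumed).
scores: List[List[float]] = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.8]]
params = [[(0.0, 0.1), (0.2, 0.0)],
          [(0.5, 0.5), (0.1, 0.1)],
          [(0.3, 0.3), (0.7, 0.2)]]
primitive, point, theta = select_action(scores, params)
```

Because the action is indexed by a point in the observed scene rather than by absolute coordinates, the same selection logic applies unchanged when the object appears in a different pose.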

5. Training Protocols and Empirical Evaluation

LLM Reasoning (Math Domain)

Dataset: A pipeline combining MCTS search and tool validation annotates math problem traces, using only a fraction of the training data behind Math-Shepherd-PRM-7B.
Model: Qwen2.5-7B-Instruct finetuned with LoRA (rank 128), batch size 32, 6 epochs, cosine LR.
Results: On ProcessBench, GroundedPRM achieves F1=39.7 (vs. 31.5 for Math-Shepherd-PRM-7B and 31.3 for EurusPRM-Stage2), a relative improvement of roughly 26% over Math-Shepherd-PRM-7B. Reward-guided greedy search with GroundedPRM is reported as state of the art among PRM-supervised decoding methods.
Ablations: Removal of outcome or rationale signals significantly degrades performance, confirming the value of hybrid reward and generative rationale.

GUI Process Supervision

Evaluation on AndroidWorld and Mobile-MiniWoB++ benchmarks confirms double-digit improvement in success rates versus both base and standard PRM-augmented models. Ablations demonstrate necessity of all components (memory, perception, tool diversity).

Robotic Manipulation

DoubleBin 6D pose alignment benchmark: GroundedPRM achieves a higher success rate on unseen categories after 30 steps than the P-DQN and RAPS baselines.
Zero-shot sim-to-real: Achieves 73% real-world success across diverse object instances even with significant domain shift.

6. Key Comparative Table

Domain	What is GroundedPRM?	Unique Grounding Signal(s)
LLM Reasoning	Tree-guided, tool-verified PRM	Executable step verification
GUI Automation	Judge with memory + UI awareness	Summarized context + UI parses
Robotic Manipulation	Spatially-grounded action space	Physical scene, 3D contact point

This table clarifies that "GroundedPRM" spans distinct implementations, each centered on an explicit, objective grounding signal obtained from execution or perception.

7. Limitations and Future Directions

GroundedPRM approaches depend on the coverage and accuracy of the chosen grounding mechanism—external tools, perceptual pipelines, or point cloud quality. Current limitations include:

Incomplete generalization for extremely long-horizon tasks without adaptive memory (GUI-PRA) (Xiong et al., 27 Sep 2025).
Potential failure cases if tool or perception modules err, or if MCTS coverage is insufficient.
Scalability challenges for domains requiring hierarchical or multi-pass summarization or perception.
No end-to-end fine-tuning of entire pipelines (e.g., GUI-PRA relies on pre-trained components without joint optimization).

Potential extensions involve learning task-adaptive history summarizers, integrating richer perception modules, and jointly training the process model with grounding primitives or perception-to-reward adapters.

References

"GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning" (Zhang et al., 16 Oct 2025)
"GUI-PRA: Process Reward Agent for GUI Tasks" (Xiong et al., 27 Sep 2025)
"HACMan++: Spatially-Grounded Motion Primitives for Manipulation" (Jiang et al., 2024)
