GroundedPRM: Process Reward Modeling Innovations
- GroundedPRM is a framework that grounds intermediate process steps using objective signals from tools, UI percepts, or physical scene feedback.
- In LLM reasoning, it employs a tree-guided Monte Carlo search with tool-verified step validation to enhance factual fidelity.
- For GUI tasks and robotic manipulation, it uses dynamic memory and spatially-grounded action representations to improve generalization and sim-to-real transfer.
GroundedPRM refers to a set of process reward modeling innovations spanning multi-step reasoning with LLMs, GUI-based task supervision, and spatially grounded robotic manipulation. The term encompasses three distinct, independently motivated frameworks: (1) GroundedPRM for step-level LLM supervision with tool-verified fidelity and Monte Carlo tree search, (2) GroundedPRM for GUI-based tasks with perceptual and memory grounding, and (3) Grounded Parameterized Motion Primitives for manipulation in continuous robotic control. Across these domains, the unifying principle is that process-level reward signals are grounded in objective, executable, or physically verifiable feedback rather than in subjective or outcome-only estimates.
1. Motivation and Background
Traditional process reward models (PRMs) seek to improve multi-step reasoning by assigning rewards to each intermediate process step, using scalar signals to promote correct reasoning trajectories. However, prior approaches are limited by noisy credit assignment (Monte Carlo rollouts), low factual fidelity (LLM self-judging hallucinations), and misalignment with true process structure. In long-horizon GUI tasks, standard PRMs also fail due to context overloading (“lost in the middle”) and a lack of awareness of dynamic environment changes. In robotic manipulation, generalization is impaired by action representations that are insufficiently grounded in the physical scene.
GroundedPRM frameworks are developed to address these limitations by explicitly grounding reward evaluation in external tools, executable percepts, or physical geometry, and by incorporating structured reasoning traces and dynamic context distillation (Zhang et al., 16 Oct 2025, Xiong et al., 27 Sep 2025, Jiang et al., 2024).
2. Tree-Guided, Fidelity-Aware Reward Modeling for LLM Reasoning
GroundedPRM for LLM reasoning (Zhang et al., 16 Oct 2025) introduces a fidelity-aware supervision pipeline centered on three innovations:
- Structured Reasoning with MCTS: Unlike flat Monte Carlo rollouts, GroundedPRM builds tree-structured reasoning paths using Monte Carlo Tree Search (MCTS). Each node represents a partial reasoning trace; expansions generate alternative completions via the base LLM. Node selection uses Upper Confidence bounds for Trees (UCT), with each node tracking its aggregate return and visitation count.
- Tool-Verified Step Validation: Each intermediate step is verified using an external tool (e.g., Wolfram Alpha, SymPy). Steps are cast as structured queries, and executable correctness is established by a binary verification signal that marks each step as correct or incorrect.
- Hybrid Reward Aggregation: Stepwise tool feedback and the global outcome are combined into the final per-step reward, with the relative weighting tuned for factual fidelity and global trajectory compatibility.
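The selection and aggregation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the exploration constant `c`, the ±1 encoding of step and outcome signals, and the `beta`-weighted form of `hybrid_reward` are all assumptions made for the example. Only the use of UCT with per-node returns and visit counts, and the idea of combining step-level and outcome signals, come from the description.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A partial reasoning trace in the search tree (hypothetical structure)."""
    step: str
    total_return: float = 0.0  # aggregate return backed up through this node
    visits: int = 0            # visitation count
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean return plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    mean_return = child.total_return / child.visits
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_return + bonus


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


def hybrid_reward(step_signals: List[int], outcome: int, beta: float = 1.0) -> float:
    """ASSUMED aggregation: mean tool signal (+1/-1) plus beta * outcome (+1/-1)."""
    step_term = sum(step_signals) / len(step_signals) if step_signals else 0.0
    return step_term + beta * outcome


root = Node("root", visits=10, children=[Node("a", 3.0, 5), Node("b", 1.0, 1)])
print(select_child(root).step)               # "b": rarely visited, large bonus
print(hybrid_reward([1, 1, -1], 1, beta=0.5))
```

In this sketch the exploration bonus favors the rarely visited child even though its sibling has more accumulated return, which is the behavior that distinguishes tree-guided search from flat rollouts.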
Rationale-enhanced supervision is introduced: instead of binary labels only, the model is autoregressively trained to generate both per-step correctness labels and natural-language rationales, augmenting interpretability and instruction-finetuning compatibility.
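To make the tool-verified step validation in this section concrete, the sketch below checks arithmetic claims of the form `lhs = rhs` exactly. It is a standard-library stand-in for an external tool such as SymPy or Wolfram Alpha, not the pipeline's actual verifier; the function names, the claim format, and the +1/-1 labels are assumptions for illustration.

```python
import ast
import operator
from fractions import Fraction

# Only plain arithmetic is allowed; names and function calls are rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> Fraction:
    """Exactly evaluate an arithmetic expression using rational numbers."""
    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval"))


def verify_step(claim: str) -> int:
    """Return +1 if a claim 'lhs = rhs' holds exactly, else -1 (assumed labels)."""
    try:
        lhs, rhs = claim.split("=")
        return 1 if evaluate(lhs) == evaluate(rhs) else -1
    except (ValueError, SyntaxError, ZeroDivisionError):
        return -1  # unparseable or ill-defined steps count as unverified


print(verify_step("3 * (4 + 1) = 15"))   # 1
print(verify_step("1/3 + 1/6 = 0.5"))    # 1
print(verify_step("2 + 2 = 5"))          # -1
```

Treating unparseable steps as incorrect is a design choice of this sketch; a real pipeline might instead route them to a different tool or skip them.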
3. GroundedPRM in GUI-based Process Supervision
In the GUI domain (Xiong et al., 27 Sep 2025), GroundedPRM (also termed GUI-PRA) addresses unique process supervision challenges by explicitly grounding reward evaluation both in distilled history and in perceptual UI evidence:
- Dynamic Memory Mechanism: To prevent “lost in the middle” failures, a compressed history is constructed via:
  - Relevance-based Retrieval: Retaining only the most recent steps from the full agent transcript.
  - Progressive Summarization: Earlier contextual interactions are reduced via a one-sentence LLM-based summarization.
- Adaptive UI Perception: Action evaluations are grounded in the observed effects on the GUI:
  - At each step, a multi-tool perceive-reason-verify loop collects UI state change evidence using tools like OmniParser (structured OCR) and Point (vision-language pointer grounding).
  - The judge’s prompt receives both the distilled context and the perceptual evidence, and the judge scores each candidate process step.
This approach enables process rewards that are both aware of historical context and grounded in observable, dynamic UI changes.
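The dynamic memory mechanism can be illustrated with a short sketch, assuming the history is a list of step strings. The function name `distill_history`, the `keep_recent` parameter, and `toy_summarizer` are hypothetical; in the described system the summary would come from an LLM call rather than a string template.

```python
from typing import Callable, List


def distill_history(
    steps: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Compress a transcript: one summary line for old steps, recent steps verbatim."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    older, recent = steps[:-keep_recent], steps[-keep_recent:]
    if not older:
        return list(recent)  # nothing to summarize yet
    return [f"Summary of earlier steps: {summarize(older)}"] + list(recent)


def toy_summarizer(older: List[str]) -> str:
    """Placeholder for the one-sentence LLM summary (assumption)."""
    return f"{len(older)} earlier actions, ending with '{older[-1]}'."


transcript = ["open settings", "tap wifi", "toggle wifi", "go back"]
for line in distill_history(transcript, keep_recent=2, summarize=toy_summarizer):
    print(line)
```

The judge's prompt then grows with `keep_recent` plus one summary line rather than with the full length of the episode.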
4. Grounded Parameterized Motion Primitives in Robotic Manipulation
GroundedPRM in robotic manipulation (Jiang et al., 2024) formalizes action selection as the grounding of parameterized motion primitives in scene geometry:
- Action Space Construction: Each action is a composite of three components:
  - a discrete primitive (e.g., grasp, push),
  - a grounding point selected from a segmented point cloud,
  - continuous primitive-specific motion parameters.
- Actor-Critic Policy Architecture:
- Spatial Grounding and Generalization: Enforcing choices over discrete grounded points in the observed scene induces invariance to object pose, leading to strong generalization across previously unseen instances and categories (a higher success rate on novel categories than non-grounded baselines).
- Sim-to-Real Robustness: The representation’s physical grounding supports sim-to-real transfer (a 73% zero-shot success rate in real-world trials) through geometric and perceptual alignment.
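The grounded action space described above can be sketched as a small data structure plus a greedy selection rule. This is an illustration under stated assumptions: the text does not specify the policy architecture, so the per-point `scores` and `params` dictionaries are hypothetical stand-ins for critic and actor outputs, and `GroundedAction` and `select_action` are names invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class GroundedAction:
    primitive: str             # discrete primitive, e.g. "grasp" or "push"
    point: Point               # grounding point taken from the point cloud
    params: Tuple[float, ...]  # continuous primitive-specific parameters


def select_action(
    points: Sequence[Point],
    scores: Dict[str, Sequence[float]],
    params: Dict[str, Sequence[Tuple[float, ...]]],
) -> GroundedAction:
    """Greedy choice over every (primitive, point) pair by score (assumed rule)."""
    prim, i = max(
        ((p, idx) for p in scores for idx in range(len(points))),
        key=lambda pair: scores[pair[0]][pair[1]],
    )
    return GroundedAction(prim, points[i], params[prim][i])


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2)]
scores = {"grasp": [0.2, 0.9], "push": [0.5, 0.1]}
params = {"grasp": [(0.0,), (1.5,)], "push": [(0.3, 0.3), (0.0, 0.0)]}
print(select_action(points, scores, params))
```

Because the chosen location is an index into the observed point cloud rather than an absolute coordinate, moving the object moves the selected point with it, which is the intuition behind the pose invariance claimed above.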
5. Training Protocols and Empirical Evaluation
LLM Reasoning (Math Domain)
- Dataset: A pipeline combining MCTS and tool validation automatically annotates the math problem traces used for training, which amount to only a fraction of the data used by Math-Shepherd-PRM-7B.
- Model: Qwen2.5-7B-Instruct finetuned with LoRA (rank 128), batch size 32, 6 epochs, cosine LR.
- Results: On ProcessBench, GroundedPRM achieves F1=39.7 (vs. 31.5 for Math-Shepherd-PRM-7B and 31.3 for EurusPRM-Stage2), a relative improvement of about 26% over Math-Shepherd-PRM-7B. In reward-guided greedy search, it is reported as state of the art among PRM-supervised decoding methods.
- Ablations: Removal of outcome or rationale signals significantly degrades performance, confirming the value of hybrid reward and generative rationale.
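As a quick check on the ProcessBench comparison, the relative improvement can be computed directly from the reported F1 scores. The helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# F1 scores as reported above: GroundedPRM vs. Math-Shepherd-PRM-7B
gain = relative_improvement(39.7, 31.5)
print(f"{gain:.1%}")  # 26.0%
```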
GUI Process Supervision
- Evaluation on AndroidWorld and Mobile-MiniWoB++ benchmarks confirms double-digit improvement in success rates versus both base and standard PRM-augmented models. Ablations demonstrate necessity of all components (memory, perception, tool diversity).
Robotic Manipulation
- DoubleBin 6D pose alignment benchmark: GroundedPRM outperforms the P-DQN and RAPS baselines in success rate on unseen categories after 30 steps.
- Zero-shot sim-to-real: Achieves 73% real-world success across diverse object instances even with significant domain shift.
6. Key Comparative Table
| Domain | What is GroundedPRM? | Unique Grounding Signal(s) |
|---|---|---|
| LLM Reasoning | Tree-guided, tool-verified PRM | Executable step verification |
| GUI Automation | Judge with memory + UI awareness | Summarized context + UI parses |
| Robotic Manipulation | Spatially-grounded action space | Physical scene, 3D contact point |
This table clarifies that "GroundedPRM" spans distinct implementations, each centered on an explicit, objective signal grounded in execution or perception.
7. Limitations and Future Directions
GroundedPRM approaches depend on the coverage and accuracy of the chosen grounding mechanism—external tools, perceptual pipelines, or point cloud quality. Current limitations include:
- Incomplete generalization for extremely long-horizon tasks without adaptive memory (GUI-PRA) (Xiong et al., 27 Sep 2025).
- Potential failure cases if tool or perception modules err, or if MCTS coverage is insufficient.
- Scalability challenges for domains requiring hierarchical or multi-pass summarization or perception.
- No end-to-end fine-tuning of entire pipelines (e.g., GUI-PRA relies on pre-trained components without joint optimization).
Potential extensions involve learning task-adaptive history summarizers, integrating richer perception modules, and jointly training the process model with grounding primitives or perception-to-reward adapters.
References
- "GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning" (Zhang et al., 16 Oct 2025)
- "GUI-PRA: Process Reward Agent for GUI Tasks" (Xiong et al., 27 Sep 2025)
- "HACMan++: Spatially-Grounded Motion Primitives for Manipulation" (Jiang et al., 2024)