MGPO: Multi-turn Grounding Policy Optimization
- MGPO is a reinforcement learning framework that uses multi-turn grounding to optimize agent behavior in interactive, long-horizon tasks.
- It introduces trajectory-level updates and hierarchical policy decomposition to improve credit assignment and sample efficiency.
- MGPO variants have demonstrated strong performance on high-resolution visual reasoning, document-grounded dialogue, and spatio-temporal grounding benchmarks by optimizing multi-turn interaction directly.
Multi-turn Grounding-based Policy Optimization (MGPO) is a family of reinforcement learning (RL) algorithms designed to optimize interactive agent behavior—especially large (multi-modal) models—across complex, long-horizon, multi-turn tasks with explicit or emergent grounding. MGPO frameworks address the limitations of single-turn RL by introducing multi-turn (or trajectory-level) policy updates that facilitate reasoning, information-seeking, and region selection, all while maintaining strong sample efficiency and robust credit assignment. This paradigm is especially relevant in settings such as high-resolution visual reasoning, document-grounded dialogue, spatio-temporal grounding in videos, and multi-turn language agent tasks.
1. Motivation and Definitional Scope
MGPO methods were introduced as a response to the challenges faced by large language and multi-modal models (LMMs, LLMs) that must operate over interactive sequences requiring long-term planning, precise grounding of intermediate actions (e.g., spatial region selection or reference resolution), and effective credit assignment. Classical RL approaches for LLMs—such as RLHF with single-turn rewards or per-turn fine-tuning—are insufficient for multi-turn, context-dependent scenarios. These approaches typically:
- Struggle to perform long-horizon credit assignment
- Fail to ground intermediate actions (such as spatial or referential selections)
- Are sample-inefficient due to reliance on on-policy data or per-action reward signals
MGPO generalizes and builds upon multi-level RL and preference optimization, incorporating both hierarchical policy decomposition and explicit multi-turn dialogue structure. A defining feature is the use of multi-turn, iterative interaction loops where intermediate grounding decisions (visual, spatial, referential, etc.) are explicitly modeled and optimized via RL.
2. Canonical Architectures and Algorithms
MGPO encompasses a family of architectures, including hierarchical RL, multi-turn preference optimization, and trajectory-level policy learning. A prototypical instantiation is provided by the high-resolution visual reasoning MGPO algorithm (2507.05920):
- Multi-Turn Dialogue Framework: The model participates in a structured, multi-turn exchange (a schematic rollout-and-update loop is sketched after this list). At each turn $t$:
  - The model receives the current context (e.g., an image or dialogue history).
  - It predicts a grounding action (e.g., cropping coordinates in visual reasoning), which is used to extract a sub-image for the next turn.
- Grounding Action Normalization: The predicted coordinates are normalized from the preprocessed (possibly resized) image dimensions back to the original high-resolution space according to
  $$x_{\text{orig}} = x_{\text{pred}} \cdot \frac{W_{\text{orig}}}{W_{\text{resize}}}, \qquad y_{\text{orig}} = y_{\text{pred}} \cdot \frac{H_{\text{orig}}}{H_{\text{resize}}},$$
  where $(W_{\text{resize}}, H_{\text{resize}})$ and $(W_{\text{orig}}, H_{\text{orig}})$ are the resized and original image sizes, respectively.
- Policy Optimization: The RL agent maximizes the expected reward with a policy-gradient update of the form
  $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau_i \sim \pi_\theta}\!\left[(r_i - b)\,\frac{1}{T_i}\sum_{t=1}^{T_i}\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right],$$
  where $r_i$ is the binary reward for a successful answer, $b$ is a running-average baseline to reduce variance, and $T_i$ is the number of steps in the $i$-th interaction.
- Multi-Turn Template and Cold Start Handling: The use of fixed conversational templates ensures stable triggering of grounding behavior, addressing the cold start problem where the model does not initially ground its predictions during rollouts.
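The following minimal sketch ties the pieces above together: a multi-turn rollout in which the policy predicts crop coordinates on a resized view, those coordinates are rescaled to the original resolution, the cropped region is appended to the context for the next turn, and a REINFORCE-with-baseline update is applied with a binary outcome reward. The policy object, its `predict_grounding`/`answer` methods, the `baseline` tracker, and the 1024×1024 preprocessing size are assumptions for illustration rather than the published implementation.

```python
# Schematic MGPO-style rollout and update for high-resolution visual reasoning.
# The policy, its methods, the baseline tracker, and the preprocessing size are
# illustrative assumptions; the update mirrors the objective above (binary
# reward r_i, running-average baseline b, 1/T_i normalization per interaction).
from PIL import Image


def rescale_box(box, resized_size, original_size):
    """Map a box (x1, y1, x2, y2) predicted on the resized image back to the
    original high-resolution image (the normalization step above)."""
    rw, rh = resized_size
    ow, oh = original_size
    x1, y1, x2, y2 = box
    return (x1 * ow / rw, y1 * oh / rh, x2 * ow / rw, y2 * oh / rh)


def mgpo_episode(policy, image_path, question, max_turns=2):
    """One multi-turn interaction: ground (crop) on intermediate turns, then answer."""
    original = Image.open(image_path)
    resized = original.resize((1024, 1024))            # preprocessing; size is an assumption
    context = [("question", question), ("image", resized)]
    log_probs = []                                     # assumed to be differentiable tensors
    for _ in range(max_turns - 1):
        box, lp = policy.predict_grounding(context)    # assumed API: box + log-prob
        log_probs.append(lp)
        crop = original.crop(rescale_box(box, resized.size, original.size))
        context.append(("image", crop))                # the cropped region feeds the next turn
    answer, lp = policy.answer(context)                # assumed API: final answer + log-prob
    log_probs.append(lp)
    return answer, log_probs


def mgpo_update(policy_optimizer, episodes, gold_answers, baseline):
    """REINFORCE-with-baseline update over a batch of multi-turn episodes."""
    loss = 0.0
    for (answer, log_probs), gold in zip(episodes, gold_answers):
        r = 1.0 if answer == gold else 0.0             # binary outcome reward r_i
        advantage = r - baseline.value                 # running-average baseline b
        loss = loss - advantage * sum(log_probs) / len(log_probs)  # 1/T_i normalization
        baseline.update(r)                             # assumed running-average tracker
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```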
In contrast, language-only MGPO-inspired frameworks (e.g., ArCHer (2402.19446), DMPO (2406.14868)) may employ a hierarchical MDP, with high-level utterance selection guided by off-policy critics and low-level token generation optimized by policy gradient methods. For multi-turn RL from preferences, mirror descent updates or occupancy-measure constraints are used to align policies with trajectory-level (dialogue-wide) preferences.
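To make the hierarchical decomposition concrete, the sketch below pairs an utterance-level value critic trained off-policy with a TD(0) target against a token-level REINFORCE-style loss weighted by the critic's advantage. Module names, shapes, and training details are assumptions for illustration and do not reproduce the ArCHer or DMPO training code.

```python
import torch
import torch.nn as nn


class UtteranceCritic(nn.Module):
    """Value head over a fixed-size encoding of the dialogue state."""
    def __init__(self, dim: int):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, state_emb: torch.Tensor) -> torch.Tensor:
        return self.v(state_emb).squeeze(-1)


def critic_td_loss(critic, state_emb, next_state_emb, reward, done, gamma=0.99):
    """Off-policy TD(0) regression at the utterance level:
    target = r + gamma * V(s') for non-terminal turns, r at terminal turns."""
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_state_emb)
    return ((critic(state_emb) - target) ** 2).mean()


def utterance_advantage(critic, state_emb, next_state_emb, reward, done, gamma=0.99):
    """A(s, a) ~= r + gamma * V(s') - V(s), used to weight the low-level loss."""
    with torch.no_grad():
        return reward + gamma * (1.0 - done) * critic(next_state_emb) - critic(state_emb)


def token_policy_loss(token_logprobs, advantage):
    """Low-level REINFORCE-style loss: the (already detached) utterance-level
    advantage scales the summed log-probability of that utterance's tokens."""
    return -(advantage * token_logprobs.sum(dim=-1)).mean()
```

In a full system, `state_emb` would typically be an encoding of the dialogue history, and the token-level loss would be applied only to tokens generated by the agent.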
3. Credit Assignment and Grounding Mechanisms
MGPO makes intermediate actions explicit targets of optimization: agents learn to ground their intermediate policy outputs (such as focus regions, referential resolutions, or reasoning steps) directly through RL. Solutions to the long-horizon credit assignment problem include:
- Hierarchical Decomposition: Partitioning the optimization problem into high-level (utterance or region selection) and low-level (token- or pixel-level generation) subproblems, with critics mediating the reward flow from long-term outcomes to immediate actions (2402.19446).
- Occupancy Measure Constraints: By leveraging state–action occupancy in loss construction (2406.14868), the MGPO family mitigates compounding error and facilitates multi-turn credit assignment.
- Preference-based and Multi-Turn Feedback: Instead of per-turn rewards, MGPO techniques may use feedback signals over complete task trajectories, enhancing agents' planning and grounding abilities (2405.14655); a schematic objective of this form is given after this list.
- Multi-Turn RL Adaptation and On-Policy Correction: REFUEL (2410.04612) adopts a regression-based update to estimate advantages at all dialogue turns using on-policy rollouts, ensuring that the context distribution for optimization closely matches that expected at deployment.
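As a simplified illustration of the second and third mechanisms above (occupancy/length-aware losses and trajectory-level preference feedback), a multi-turn, length-normalized DPO-style objective can be written as follows; this schematic form is an expository assumption and does not reproduce the exact DMPO or MTPO losses:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(\tau^{+},\,\tau^{-})}\left[\log \sigma\!\left(\beta\left(\frac{1}{|\tau^{+}|}\sum_{t}\log\frac{\pi_\theta(a_t^{+}\mid s_t^{+})}{\pi_{\text{ref}}(a_t^{+}\mid s_t^{+})} \;-\; \frac{1}{|\tau^{-}|}\sum_{t}\log\frac{\pi_\theta(a_t^{-}\mid s_t^{-})}{\pi_{\text{ref}}(a_t^{-}\mid s_t^{-})}\right)\right)\right],$$

where $\tau^{+}$ and $\tau^{-}$ are the preferred and dispreferred trajectories of a complete interaction, $|\tau|$ is the number of actions in a trajectory, $\pi_{\text{ref}}$ is a reference policy, and $\beta$ is a temperature. Dividing each sum by $|\tau|$ removes the length bias that unnormalized trajectory-level sums would otherwise introduce.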
4. Empirical Performance and Benchmarking
MGPO-based algorithms have demonstrated state-of-the-art results across multiple complex benchmarks:
- On high-resolution visual reasoning tasks, MGPO trained with dialogue-round cropping improves in-distribution (MME-Realworld) and out-of-distribution (V* Bench) performance by 5.4% and 5.2%, respectively, over prior methods, enabling 7B-parameter models to outperform commercial baselines such as GPT-4o (2507.05920).
- In document-grounded dialogue and synthetic data generation, multi-turn grounding techniques employing taxonomy-driven CoT prompting and iterative document retrieval achieve superior performance on CoQA, MultiDoc2Dial, QuAC, and OR-QuAC, surpassing models trained on purely human-authored data (2409.11500).
- For multi-turn RL with preference feedback, both mirror-descent-based and DMPO algorithms outperform single-turn baselines (e.g., RLHF, DPO) on multi-turn interactive environments such as WebShop, ScienceWorld, and Education Dialogue; these methods also demonstrate robust handling of noisy supervision and trajectory-length disparities (2405.14655, 2406.14868).
5. Theoretical Guarantees and Sample Efficiency
MGPO frameworks introduce theoretical advances that ensure policy convergence, sample reuse, and robust optimization:
- Convergence to Nash Equilibrium: Mirror-descent-based multi-turn preference optimization (MTPO, MTPO-τ) converges to a unique Nash equilibrium for the regularized preference game, with formal KL-divergence bounds (2405.14655).
- Variance and Error Control: By introducing occupancy measure constraints and length normalization (DMPO), MGPO-style objectives avoid partition function and trajectory-length pitfalls, providing theoretical justification for improved robustness and sample efficiency (2406.14868).
- Off-Policy and On-Policy Efficiency: Hierarchical architectures (e.g., ArCHer (2402.19446)) capitalize on off-policy temporal-difference learning at the utterance level, enabling roughly 100-fold improvements in sample efficiency over on-policy baselines; on-policy data collection in REFUEL further eliminates covariate shift (2410.04612).
6. Practical Implementation Considerations
Efficient deployment of MGPO requires attention to the following aspects:
- Dialogue and Interaction Template Engineering: Fixed, multi-turn prompt structures with separate stages for grounding and answering stabilize learning and support reproducibility (2507.05920); an illustrative template and loss-mask sketch follows this list.
- Reward Structure Design: MGPO methods achieve emergent grounding from binary outcome rewards, sidestepping the need for expensive region-label or trajectory-level annotation.
- Cold Start and Stability: Specific template-enforced output formats and restriction of policy loss computation to valid turns are necessary to avoid degenerate behavior during training.
- Module Composition and Extension: MGPO architectures are composable—enhancements to the policy optimization routine (e.g., critic regularization, advanced baseline methods), reward shaping (e.g., for intermediate reasoning steps), or incorporation of additional modalities (e.g., video or multi-document retrieval) are readily supported.
- Scaling to Real-World Applications: MGPO offers distinct advantages in computational efficiency, and its independence from dense reward or grounding annotation suggests scalability to real-world multimodal systems where annotation costs are prohibitive.
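The sketch below illustrates the template-engineering and valid-turn points referenced above: a fixed two-stage conversation with separate grounding and answering turns, and a mask that drops the policy loss on turns whose grounding output is malformed. The template wording, placeholder token, and masking convention are assumptions for illustration, not the published prompt or training code.

```python
# Illustrative multi-turn template and valid-turn loss mask. The prompt text,
# the <image_crop> placeholder, and the turn indexing are assumptions.
GROUNDING_TURN = (
    "First, identify the image region relevant to the question and output its "
    "bounding box as [x1, y1, x2, y2]."
)
ANSWER_TURN = (
    "Now answer the question using the cropped region returned in the previous turn."
)


def build_conversation(question, crop_placeholder="<image_crop>"):
    """Fixed two-stage structure: a grounding turn, then an answering turn."""
    return [
        {"role": "user", "content": f"{question}\n{GROUNDING_TURN}"},
        {"role": "assistant", "content": None},   # grounding output (trained)
        {"role": "user", "content": f"{crop_placeholder}\n{ANSWER_TURN}"},
        {"role": "assistant", "content": None},   # final answer (trained)
    ]


def valid_turn_mask(conversation, grounding_ok):
    """Restrict the policy loss to valid assistant turns: if the grounding turn
    failed to produce a parsable box, exclude it from the loss rather than
    reinforcing a degenerate output."""
    mask = []
    for i, turn in enumerate(conversation):
        if turn["role"] != "assistant":
            mask.append(False)                    # never train on user/template turns
        elif i == 1:                              # grounding turn
            mask.append(grounding_ok)
        else:
            mask.append(True)                     # answer turn always contributes to the loss
    return mask
```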
7. Connections, Impact, and Future Directions
MGPO sits at the intersection of hierarchical RL, preference-based policy optimization, multi-modal grounding, and multi-turn dialogue modeling. Its principles are already informing:
- New benchmarks for multi-turn multi-modal dialogue and visual grounding, such as SAMA-Bench and SAMA-239K for video chat (2505.18812).
- Modular environments and experimentation frameworks enabling rapid benchmarking and analysis of trajectory-level policy learning dynamics (2504.20073).
- Enhanced RL paradigms for self-correction, policy verification, and error recovery, as in the Policy as Generative Verifier (PAG) framework (2506.10406).
Ongoing research directions include the integration of more advanced model-based planning at low levels, improved stabilization in extremely long-horizon tasks, richer reward architectures for intermediate reasoning, and further reductions in dependency on explicit annotation or dense feedback.
MGPO represents a class of approaches essential for developing next-generation LMMs and LLMs capable of grounded, robust, and sample-efficient multi-turn reasoning and interaction across modalities and domains.