High-Res Visual Reasoning: MGPO
- The paper introduces the MGPO framework, which iteratively grounds and crops task-relevant regions of high-resolution images via multi-turn reinforcement learning.
- MGPO reduces computational load by dynamically cropping images to focus on essential visual tokens, thereby mitigating the token explosion issue in standard LMMs.
- The algorithm demonstrates superior benchmark performance and interpretability, making it valuable for applications in autonomous driving, remote sensing, and medical imaging.
The High-Resolution Visual Reasoning MGPO Algorithm, i.e., the Multi-turn Grounding-based Policy Optimization (MGPO) framework (Huang et al., 8 Jul 2025), empowers large multimodal models (LMMs) to perform detailed, selective, and interpretable reasoning over high-resolution images. MGPO overcomes a fundamental bottleneck of standard LMMs, which typically produce an overwhelming number of visual tokens from high-resolution inputs, by learning to localize, extract, and iteratively reason about the most task-relevant regions through reinforcement learning (RL)-guided multi-turn grounding.
1. MGPO: Multi-turn Grounding-based Policy Optimization Framework
MGPO implements an end-to-end RL system that instantiates a multi-turn conversational loop between the LMM and its environment. The process begins by prompting the model to generate explicit grounding coordinates (bounding boxes) for the image region most salient for a given question. The system then crops out this region from the high-resolution input and appends the sub-image to the conversational history; this process may be repeated over multiple rounds. The final answer is generated only after one or more grounding/refinement iterations.
MGPO operates on backbones such as Qwen2.5-VL-7B equipped with Native Resolution Vision Transformers (NaViT), but instead of exhaustively tokenizing the entire high-resolution image, it dynamically focuses modeling and reasoning resources onto successively refined image crops, guided by the model’s own predicted grounding coordinates and the conversational history.
Key process steps:
- Grounding prediction: In the first turn, the model is instructed (via a fixed prompt) to output coordinates [x₁, y₁, x₂, y₂] of the relevant region.
- Cropping and recursion: The system extracts the corresponding sub-image (P_crop) and updates the state, repeating the questioning as needed.
- Final answer stage: After one or more grounding iterations, or once the model judges the context sufficient, it generates the answer, drawing on both the high-resolution context and the history of extracted regions.
All intermediate grounding and cropping actions are reinforced via policy gradients according to the success of the final answer, forming the basis for policy optimization.
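The loop can be made concrete with a short sketch. The following Python code assumes a generic chat-style `model.generate(messages)` interface and an illustrative `GROUNDING_PROMPT`; it is a minimal sketch of the rollout structure described above, not the authors' released implementation.

```python
import re
from PIL import Image

GROUNDING_PROMPT = (
    "First output the bounding box [x1, y1, x2, y2] of the image region "
    "most relevant to the question."
)

def parse_bbox(text):
    """Extract the first '[x1, y1, x2, y2]' box from a model reply, if present."""
    m = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return tuple(int(v) for v in m.groups()) if m else None

def mgpo_rollout(model, image_path, question, max_turns=2):
    """Multi-turn grounding loop: predict a box, crop it, then answer."""
    image = Image.open(image_path)
    messages = [{"role": "user", "content": [image, GROUNDING_PROMPT, question]}]

    for _ in range(max_turns - 1):
        reply = model.generate(messages)            # assumed chat interface
        messages.append({"role": "assistant", "content": reply})
        box = parse_bbox(reply)
        if box is None:                             # no grounding emitted; go answer
            break
        crop = image.crop(box)                      # extract the sub-image P_crop
        # Append the crop plus the original question as a new user turn.
        messages.append({"role": "user", "content": [crop, question]})

    messages.append({"role": "user", "content": "Now answer the question."})
    return model.generate(messages)                 # final answer turn
```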
2. Efficient Visual Token Processing
Processing high-resolution images in transformer-based architectures yields a token count that grows quadratically with image side length (linearly with pixel count); a 1K×1K image, for example, can easily generate well over 1,024 visual tokens. Most of these tokens encode uninformative background or distractors. MGPO addresses this by learning to focus on regions directly related to the reasoning or VQA task.
Tokenization and cropping follow:
- An image I of height H and width W is divided into non-overlapping patches of size p×p.
- Patches are merged in m×m groups to keep the token sequence length manageable, giving approximately N_tokens ≈ ⌈H/(p·m)⌉ × ⌈W/(p·m)⌉ visual tokens.
- When a region is selected and cropped (via predicted coordinates), the extraction rescales the coordinates from the resized model input back to the original image to ensure correct localization and cropping.
- Let S_input and S_ori be the input and original image sizes; the predicted box is rescaled as [x₁, y₁, x₂, y₂] · (S_ori / S_input) before P_crop is extracted from the original high-resolution image.
By recurrently focusing on sub-images, MGPO keeps the number of processed tokens minimal and only retains maximum-resolution details for the most relevant regions.
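A small numeric sketch of this accounting, with illustrative patch size p = 14 and merge factor m = 2 (the exact values depend on the backbone's vision encoder, so treat them as assumptions):

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """Approximate visual tokens after p×p patching and m×m patch merging."""
    return math.ceil(height / (patch * merge)) * math.ceil(width / (patch * merge))

def rescale_box(box, input_size, original_size):
    """Map a box predicted on the resized model input back to the original image."""
    (in_w, in_h), (ori_w, ori_h) = input_size, original_size
    sx, sy = ori_w / in_w, ori_h / in_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# A 1024×1024 image yields ~1,369 tokens at p=14, m=2; the crop needs only a fraction.
full_tokens = visual_token_count(1024, 1024)
crop = rescale_box((100, 120, 260, 300), input_size=(728, 728), original_size=(1024, 1024))
crop_tokens = visual_token_count(int(crop[3] - crop[1]), int(crop[2] - crop[0]))
print(full_tokens, crop_tokens)
```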
3. Iterative Visual Grounding and Multi-turn Reasoning
Standard supervised fine-tuning (SFT) approaches require annotated grounding data (bounding boxes or masks for VQA regions of interest) to teach models to localize important regions. MGPO instead triggers visual grounding actions during RL rollouts by explicit multi-turn prompting: in the first round, the model must output grounding coordinates; in the next, it's provided with the cropped region for answer generation.
The algorithmic structure:
- Initial prompt: User asks a question; the model replies with bounding box coordinates.
- System extracts the region, providing the sub-image and the same question in a new turn.
- Model generates an answer, utilizing both the region and the overall image.
- (Optionally) Multiple rounds can be supported, incrementally collecting more focused regions.
- Only the correctness of the final answer (binary reward: correct/incorrect) guides policy gradients, which are assigned to all turns in the dialogue.
This multi-turn format is crucial: it addresses the empirically observed cold-start problem, in which models rarely invoke grounding spontaneously during RL rollouts, so explicit prompting is required to ensure grounding actions are learned.
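A minimal sketch of how such a two-turn rollout could be collected for RL training follows; `policy.sample`, `crop_from_box`, and `answer_checker` are assumed interfaces standing in for the actual training harness:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    prompt: str          # what the model saw at this step
    output: str          # bounding box string or final answer
    logprob: float       # log-probability of the output under the current policy

@dataclass
class Rollout:
    turns: list = field(default_factory=list)
    reward: float = 0.0  # binary: 1.0 if the final answer is correct, else 0.0

def collect_rollout(policy, image, question, answer_checker):
    """Two-turn rollout: forced grounding turn, then answer turn; one shared reward."""
    rollout = Rollout()

    # Turn 1: a fixed prompt forces a grounding action (mitigating the cold-start
    # problem where the policy never emits boxes spontaneously).
    grounding_prompt = (
        f"Question: {question}\nOutput the box [x1, y1, x2, y2] of the relevant region."
    )
    box_out, box_lp = policy.sample(image, grounding_prompt)
    rollout.turns.append(Turn(grounding_prompt, box_out, box_lp))

    # Turn 2: the cropped region plus the same question; in practice the full
    # conversational history (original image and earlier turns) is also provided.
    crop = crop_from_box(image, box_out)          # assumed helper, see Section 2
    answer_out, ans_lp = policy.sample(crop, question)
    rollout.turns.append(Turn(question, answer_out, ans_lp))

    # One binary reward for the whole trajectory, shared by every turn.
    rollout.reward = 1.0 if answer_checker(answer_out) else 0.0
    return rollout
```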
4. Reinforcement Learning Objective and Policy Optimization
The training leverages a policy gradient objective that associates the reward for a correct answer not just with the final output, but with every intermediate grounding (cropping) and reasoning action in the multi-turn sequence.
For each rollout g with kᵍ dialogue steps:
- Rᵍ = 1 if the final answer is correct, 0 otherwise
- R̄ is the average reward over the group of candidate rollouts, giving a group-relative advantage Aᵍ = Rᵍ − R̄
- oⱼᵍ is the model output (bounding box or answer) at step j; the same advantage Aᵍ weights the log-likelihood gradient of every oⱼᵍ for j = 1, …, kᵍ
No explicit supervision for the grounding coordinates is necessary: the correctness of the final answer suffices to optimize both grounding and answering steps.
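In spirit, the update is a group-relative policy gradient in which a single trajectory-level advantage is broadcast to every turn. The PyTorch sketch below keeps only this credit-assignment structure and omits the clipping/ratio machinery of the full GRPO objective; the function name and tensor layout are illustrative assumptions:

```python
import torch

def mgpo_policy_loss(group_logprobs, group_rewards, eps=1e-6):
    """
    Group-relative policy gradient over multi-turn rollouts.

    group_logprobs: list over G rollouts; each entry is a 1-D tensor of summed
                    token log-probs for every turn j = 1..k_g (boxes and answer).
    group_rewards:  tensor of shape (G,) with binary final-answer rewards R^g.
    """
    rewards = group_rewards.float()
    # Advantage of each rollout relative to the group average (GRPO-style baseline).
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

    losses = []
    for logprobs, adv in zip(group_logprobs, advantages):
        # The same trajectory-level advantage weights every turn's log-likelihood,
        # so grounding turns are credited or penalized along with the final answer.
        losses.append(-(adv * logprobs).sum())
    return torch.stack(losses).mean()
```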
5. Empirical Performance and Benchmark Results
MGPO demonstrates significant gains on high-resolution benchmarks compared to baseline methods such as GRPO, and it even surpasses models with much larger parameter counts:
| Model | MME-Realworld (ID) | V* Bench (OOD) |
|---|---|---|
| SFT / GRPO baselines (Qwen2.5-VL-7B) | baseline | baseline |
| MGPO (Qwen2.5-VL-7B) | +5.4% over GRPO | +5.2% over GRPO |
| OpenAI o1 | - | below MGPO |
| GPT-4o | - | below MGPO |
MGPO was trained on 21K standard VQA samples (no grounding annotation) and outperformed both OpenAI's o1 and GPT-4o models on OOD visual reasoning in V* Bench (Huang et al., 8 Jul 2025).
6. Interpretability, Practical Implications, and Applications
The explicit grounding mechanism in MGPO enforces interpretable intermediate steps: the bounding boxes predicted per round serve as transparent evidence for decision-making, improving explainability. This feature is valuable in deployment contexts sensitive to model trustworthiness.
Practical application domains benefiting from MGPO include:
- Autonomous driving: focusing on traffic signs or obstacles within large camera frames
- Remote sensing: isolating regions with significant geospatial changes
- Medical imaging: zooming in on diagnostically relevant areas
- OCR and document analysis: localizing and extracting fine-grained text regions from dense pages
- Surveillance: detailed inspection of high-resolution video streams
By leveraging multi-turn, grounding-focused reasoning, MGPO avoids the cost—both computational and annotation-wise—of dense supervision and achieves strong generalization and data efficiency.
7. Relationship to Prior and Parallel Work in High-Resolution Visual Reasoning
MGPO builds upon and extends paradigms for modular, active, and multi-turn visual reasoning (Kim et al., 2018, Chen et al., 28 Mar 2024, Kolner et al., 30 Sep 2024):
- It addresses token explosion and redundancy not by post-hoc token compression (e.g., via visual registers (Zhang et al., 27 Jan 2025)) but by learning to crop out and recursively attend to essential regions via RL.
- The iterative, conversational, and self-correcting approach distinguishes it from conventional one-pass or SFT-trained LMMs.
- MGPO’s policy optimization approach could synergize with other modular and hierarchical techniques (e.g., multi-granularity reasoning, plug-and-play grounding) by introducing RL-based self-grounding into broader frameworks.
MGPO’s strategy—emergent grounding without explicit supervision, multi-turn fused answer prediction, and RL-driven cropping—sets a new standard for sample-efficient, interpretable, and high-performing high-resolution visual reasoning in LMMs.