GeM-VG: Multi-Image Visual Grounding
- The paper presents GeM-VG, a multimodal LLM that unifies multi-image visual grounding with dual reasoning heads and a novel hybrid reinforcement learning strategy.
- It introduces MG-Data-240K, a large-scale dataset covering referential, semantic, spatial, and temporal associations to address past grounding limitations.
- The model achieves state-of-the-art results, showing significant improvements across benchmarks such as MIG-Bench and ODINW over previous methods.
GeM-VG is a Multimodal LLM (MLLM) specifically designed for Generalized Multi-image Visual Grounding. The framework overcomes previous limitations in grounding tasks involving multiple images, such as single-target localization and restricted task diversity. GeM-VG introduces a unified formalism for multi-image grounding, a novel large-scale dataset for training, a dual-mode reasoning/output architecture, and a new hybrid reinforcement learning (RL) strategy integrating chain-of-thought (CoT) and direct answering. The model demonstrates strong results across diverse grounding and multimodal understanding benchmarks (Zheng et al., 8 Jan 2026).
1. Problem Formalization and Learning Objective
Generalized Multi-image Visual Grounding (VG) requires the system to locate, from a set of images $\mathcal{I} = \{I_1, \dots, I_N\}$ and a textual query $Q$, a set of target instances. Each instance $k$ has a bounding box $b_k$ and a corresponding image index $i_k \in \{1, \dots, N\}$:

$$f(\mathcal{I}, Q) = \{(b_k, i_k)\}_{k=1}^{K}.$$
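To make the output space concrete, the mapping above can be serialized as a list of (image index, box) pairs. The JSON layout below is an illustrative assumption, not the paper's exact template:

```python
# Hypothetical serialization of a multi-image grounding prediction:
# a JSON list of {"image": index, "bbox": [x1, y1, x2, y2]} entries.
import json

def serialize_grounding(instances):
    """instances: list of (image_index, [x1, y1, x2, y2]) tuples."""
    return json.dumps(
        [{"image": idx, "bbox": [round(v, 1) for v in box]}
         for idx, box in instances]
    )

def parse_grounding(text):
    """Inverse of serialize_grounding; returns (index, box) tuples."""
    return [(d["image"], d["bbox"]) for d in json.loads(text)]
```

A round trip preserves the instance set, which is what the cross-entropy loss over the serialized output implicitly relies on.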
GeM-VG addresses both single-target and multi-target scenarios, requiring cross-image comparison, association, and reasoning. Supervised finetuning employs cross-entropy loss over the serialized output. For RL-based finetuning, the model uses Group Relative Policy Optimization (GRPO):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right) - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},$$

where $\hat{A}_i$ represents the normalized advantage for sampled rollout $o_i$ in response to prompt $q$.
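The group-relative normalization can be sketched in a few lines: each rollout's reward is standardized against the other rollouts sampled for the same prompt (the small epsilon guarding against zero variance is an implementation assumption):

```python
# Minimal sketch of GRPO-style group-normalized advantages.
def grpo_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for G rollouts of one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Standardize within the group; better-than-average rollouts get A > 0.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed within each group, no learned value function is needed, which is what makes rule-based rewards practical here.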
2. MG-Data-240K: A Comprehensive Multi-image Grounding Dataset
To address the limitations of prior datasets, GeM-VG introduces MG-Data-240K, consisting of 240,000 multi-image grounding samples categorized by their cross-image dependency and reasoning requirement:
| Task Type | #Samples | Source Datasets |
|---|---|---|
| Referential Retrieval | 97K | D, COCO |
| Semantic Association | 77K | COCO |
| Spatial Association | 20K | Ego-Exo4D, MVTrack |
| Temporal Association | 46K | STAR |
- Referential Retrieval: Locates single or multiple targets using explicit descriptions, no cross-image cues.
- Semantic Association: Matches objects of the same class across images.
- Spatial Association: Matches the same object under varying viewpoints.
- Temporal Association: Matches objects across video frames.
Image groups range from two to four images (mean 3), with both single- and multi-target examples. This composition addresses issues of target quantity, image relation, and task diversity present in earlier benchmarks.
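An individual sample therefore couples an image group, a query, and one or more (image, box) targets. The field names below are illustrative assumptions, not MG-Data-240K's actual schema:

```python
# Hypothetical structure of one multi-image grounding sample
# (field names are assumptions; the dataset's real schema is not shown here).
sample = {
    "task": "semantic_association",
    "images": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],  # groups of 2-4 images
    "query": "Find all objects of the same class as the dog in the first image.",
    "targets": [
        {"image": 1, "bbox": [34.0, 50.0, 120.0, 200.0]},
        {"image": 2, "bbox": [10.0, 12.0, 88.0, 150.0]},
    ],
}
```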
3. Model Architecture and Dual Reasoning Heads
GeM-VG builds upon the Qwen2-VL MLLM backbone, integrating vision and language via three stages:
- Vision Encoding: Each image $I_n$ is processed by the visual encoder to produce a feature map $F_n$.
- Projection: Features are mapped to the LLM embedding space through a linear projection $W$, giving $E_n = W F_n$.
- Multimodal Fusion: The visual embeddings $\{E_n\}$ and the text embedding of $Q$ are concatenated as input to the LLM decoder, which attends jointly over visual and text tokens.
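At the shape level, the three stages reduce to an encode-project-concatenate pipeline. The dimensions below are illustrative assumptions, not Qwen2-VL's actual widths:

```python
# Shape-level sketch of the three fusion stages (all dimensions illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm = 1024, 3584            # assumed encoder / LLM hidden widths

def encode_image(num_patches):
    """Stage 1: stand-in visual encoder, one feature vector per patch."""
    return rng.standard_normal((num_patches, d_vis))

W = rng.standard_normal((d_vis, d_llm)) * 0.01   # Stage 2: linear projection

features = [encode_image(256) for _ in range(3)]  # a group of three images
projected = [f @ W for f in features]             # map into LLM embedding space
text_emb = rng.standard_normal((32, d_llm))       # embedded query tokens

# Stage 3: one concatenated token sequence feeds the LLM decoder.
fused = np.concatenate(projected + [text_emb], axis=0)
```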
For output flexibility, GeM-VG supports both:
- Chain-of-Thought (CoT) Head: Outputs an intermediate reasoning chain culminating in the bounding boxes.
- Direct-Answer Head: Predicts bounding boxes and image indices directly, without intermediate explanation.
Both heads are implemented by the same transformer decoder without introducing extra parameters. The architecture enables the model to switch between explicit reasoning and concise answers depending on the prompt and data.
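Since both heads share one decoder, mode selection is purely prompt-side. The templates below are a minimal sketch of that switching; the tag names and wording are assumptions, not the paper's exact instructions:

```python
# Hypothetical instruction templates toggling CoT vs. direct output modes.
COT_TEMPLATE = (
    "{query}\nFirst reason step by step inside <think>...</think>, "
    "then give the final boxes inside <answer>...</answer>."
)
DIRECT_TEMPLATE = "{query}\nAnswer directly with the boxes inside <answer>...</answer>."

def build_prompt(query, use_cot):
    """Select the output mode without touching any model parameters."""
    template = COT_TEMPLATE if use_cot else DIRECT_TEMPLATE
    return template.format(query=query)
```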
4. Hybrid Reinforcement Finetuning Incorporating CoT and Direct Output
Following supervised finetuning for basic grounding and CoT conditioning, GeM-VG applies a two-stage, rule-based R1-like RL protocol. Central steps include:
- Rollout and Advantage Estimation: For prompt $q$, $G$ sampled completions $\{o_i\}$ are scored by a rule-based reward $r_i$; normalized advantages are $\hat{A}_i = \dfrac{r_i - \text{mean}(\{r_j\}_{j=1}^{G})}{\text{std}(\{r_j\}_{j=1}^{G})}$.
- Reward Composition: The total reward is
  $$R = R_{\text{fmt}} + R_{\text{img}} + R_{\text{IoU}} + R_{\text{rec}},$$
  where:
  - $R_{\text{fmt}}$: Output syntax correctness.
  - $R_{\text{img}}$: Correct selection of images containing targets.
  - $R_{\text{IoU}}$: Average IoU for matched predictions.
  - $R_{\text{rec}}$: Fraction of ground-truth targets matched.
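The four reward terms can be sketched as follows; the greedy IoU matching scheme, the unit weights, and the 0.5 matching threshold are assumptions for illustration:

```python
# Sketch of a four-part rule-based grounding reward (matching scheme assumed).
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, fmt_ok, thr=0.5):
    """pred/gt: lists of (image_index, box). Returns the summed reward."""
    r_fmt = 1.0 if fmt_ok else 0.0                                  # syntax
    r_img = 1.0 if {i for i, _ in pred} == {i for i, _ in gt} else 0.0  # image set
    matched, ious = set(), []
    for pi, pb in pred:                    # greedily match each prediction
        best, best_j = 0.0, None
        for j, (gi, gb) in enumerate(gt):
            if j in matched or gi != pi:
                continue
            v = iou(pb, gb)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            ious.append(best)
    r_iou = sum(ious) / len(pred) if pred else 0.0   # precision-style term
    r_rec = len(matched) / len(gt) if gt else 0.0    # recall term
    return r_fmt + r_img + r_iou + r_rec
```

A perfect prediction set scores 4.0 under these unit weights; dropping either the precision-style or the recall term would let degenerate outputs (over- or under-prediction) go unpunished, matching the ablation finding below that both are necessary.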
- Reward Modulation:
- Early-stage: Adjusts rewards to keep CoT and direct answers balanced, targeting a 0.5/0.5 sampling ratio.
- Late-stage: Adds a length-aware bonus to inexact CoT answers, encouraging detailed reasoning.
The final adjusted reward is used within the GRPO update.
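The early-stage balance modulation can be sketched as a small corrective term applied to each rollout's base reward; the linear form and the strength coefficient are assumptions, not the paper's exact rule:

```python
# Sketch of early-stage reward modulation: nudge per-rollout rewards so that
# CoT and direct rollouts converge toward a 50/50 ratio (coefficients assumed).
def modulate_rewards(rewards, is_cot, target=0.5, strength=0.1):
    """rewards: base rewards per rollout; is_cot: per-rollout mode flags."""
    ratio = sum(is_cot) / len(is_cot)       # observed fraction of CoT rollouts
    bonus = strength * (target - ratio)     # positive when CoT is under-used
    return [r + (bonus if c else -bonus) for r, c in zip(rewards, is_cot)]
```

The modulated rewards then replace the raw ones when computing the group-normalized advantages in the GRPO update.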
5. Experimental Evaluation Across Diverse Benchmarks
GeM-VG is evaluated on a variety of multi-image grounding, single-image grounding, video grounding, and general multi-image understanding tasks. All localization metrics use an IoU threshold of 0.5.
| Benchmark | Metric | GeM-VG Score | Prior State-of-the-Art | Improvement |
|---|---|---|---|---|
| MIG-Bench | Acc@0.5 | 74.65% | UniVG-R1: 72.64% | +2.0 points |
| MC-Bench | AP | 32.8 | Migician: 23.1 | +9.7 points |
| ODINW (rare classes) | AP | 41.1 | Qwen2-VL-7B: 32.0 | +9.1 points |
| ReasonVOS (video) | Acc@0.5 | 64.41% | UniVG-R1: 58.73% | +5.68 points |
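For reference, an Acc@0.5-style metric counts a prediction as correct when it selects the right image and overlaps the ground truth with IoU at or above 0.5; the one-prediction-per-example setup below is a simplifying assumption:

```python
# Sketch of an Acc@0.5-style localization metric.
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def acc_at_05(preds, gts):
    """preds/gts: parallel lists of (image_index, box), one pair per example."""
    hits = sum(
        1 for (pi, pb), (gi, gb) in zip(preds, gts)
        if pi == gi and iou(pb, gb) >= 0.5
    )
    return hits / len(gts)
```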
Additional results confirm no loss in general multimodal understanding: on MuirBench, MIBench, BLINK, and MMIU, GeM-VG matches or outperforms previous models.
6. Ablation Studies and Additional Insights
Several ablation studies in (Zheng et al., 8 Jan 2026) quantify the design choices:
- CoT vs. Direct Output: CoT-only stage 2 SFT yields 71.46 (MIG-Bench), direct-only 69.45, and balanced mix 71.97, supporting complementary benefit.
- Hybrid RL Components: Full hybrid reward modulation achieves the best performance (74.31 MIG-Bench), outperforming early- or late-stage only, as well as unmodulated RL.
- Reward Terms: Omitting precision or recall components reduces AP on ODINW; both are necessary for robust localization.
- MG-Data-240K Contribution: Removing the dataset drops MIG-Bench from 74.65 to 72.04 and MC-Bench from 32.8 to 22.4, underlining the necessity of task and cue diversity.
- Backbone Generalization: The protocol transfers effectively; applying the same SFT+RL pipeline to Qwen2.5-VL-7B yields notable improvements (69.7 MIG-Bench, 39.7 MC-Bench).
In summary, GeM-VG constitutes a unified approach to generalized multi-image visual grounding. Its unique combination of data composition, dual-output architecture, and hybrid reinforcement learning strategies yields state-of-the-art results and strong transfer across datasets and model backbones (Zheng et al., 8 Jan 2026).