GeM-VG: Multi-Image Visual Grounding
- The paper presents GeM-VG, a multimodal LLM that unifies multi-image visual grounding with dual reasoning heads and a novel hybrid reinforcement learning strategy.
- It introduces MG-Data-240K, a large-scale dataset covering referential, semantic, spatial, and temporal associations to address past grounding limitations.
- The model achieves state-of-the-art results, showing significant improvements across benchmarks such as MIG-Bench and ODINW over previous methods.
GeM-VG is a Multimodal LLM (MLLM) specifically designed for Generalized Multi-image Visual Grounding. The framework overcomes previous limitations in grounding tasks involving multiple images, such as single-target localization and restricted task diversity. GeM-VG introduces a unified formalism for multi-image grounding, a novel large-scale dataset for training, a dual-mode reasoning/output architecture, and a new hybrid reinforcement learning (RL) strategy integrating chain-of-thought (CoT) and direct answering. The model demonstrates strong results across diverse grounding and multimodal understanding benchmarks (Zheng et al., 8 Jan 2026).
1. Problem Formalization and Learning Objective
Generalized Multi-image Visual Grounding (VG) requires the system to locate, from a set of images $\mathcal{I} = \{I_1, \dots, I_N\}$ and a textual query $Q$, a set of target instances. Each instance $k$ has a bounding box $b_k$ and a corresponding image index $i_k \in \{1, \dots, N\}$:

$$f(\mathcal{I}, Q) = \{(b_k, i_k)\}_{k=1}^{K}.$$
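To make the output space concrete, the mapping above can be serialized as a list of (image index, box) pairs. The JSON layout below is an illustrative assumption, not the paper's exact template:

```python
# Hypothetical serialization of a multi-image grounding prediction:
# a JSON list of {"image": index, "bbox": [x1, y1, x2, y2]} entries.
import json

def serialize_grounding(instances):
    """instances: list of (image_index, [x1, y1, x2, y2]) tuples."""
    return json.dumps(
        [{"image": idx, "bbox": [round(v, 1) for v in box]}
         for idx, box in instances]
    )

def parse_grounding(text):
    """Inverse of serialize_grounding; returns (index, box) tuples."""
    return [(d["image"], d["bbox"]) for d in json.loads(text)]
```

A round trip preserves the instance set, which is what the cross-entropy loss over the serialized output implicitly relies on.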
GeM-VG addresses both single-target and multi-target scenarios, requiring cross-image comparison, association, and reasoning. Supervised finetuning employs cross-entropy loss over the serialized output. For RL-based finetuning, the model uses Group Relative Policy Optimization (GRPO):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right) - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},$$

where $\hat{A}_i$ represents the normalized advantage for sampled rollout $o_i$ in response to prompt $q$.
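The group-relative normalization can be sketched in a few lines: each rollout's reward is standardized against the other rollouts sampled for the same prompt (the small epsilon guarding against zero variance is an implementation assumption):

```python
# Minimal sketch of GRPO-style group-normalized advantages.
def grpo_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for G rollouts of one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Standardize within the group; better-than-average rollouts get A > 0.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed within each group, no learned value function is needed, which is what makes rule-based rewards practical here.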
2. MG-Data-240K: A Comprehensive Multi-image Grounding Dataset
To address the limitations of prior datasets, GeM-VG introduces MG-Data-240K, consisting of 240,000 multi-image grounding samples categorized by their cross-image dependency and reasoning requirement:
| Task Type | #Samples | Source Datasets |
|---|---|---|
| Referential Retrieval | 97K | D, COCO |
| Semantic Association | 77K | COCO |
| Spatial Association | 20K | Ego-Exo4D, MVTrack |
| Temporal Association | 46K | STAR |
- Referential Retrieval: Locates single or multiple targets using explicit descriptions, no cross-image cues.
- Semantic Association: Matches objects of the same class across images.
- Spatial Association: Matches the same object under varying viewpoints.
- Temporal Association: Matches objects across video frames.
Image groups range from two to four images (mean 3), with both single- and multi-target examples. This composition addresses issues of target quantity, image relation, and task diversity present in earlier benchmarks.
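An individual sample therefore couples an image group, a query, and one or more (image, box) targets. The field names below are illustrative assumptions, not MG-Data-240K's actual schema:

```python
# Hypothetical structure of one multi-image grounding sample
# (field names are assumptions; the dataset's real schema is not shown here).
sample = {
    "task": "semantic_association",
    "images": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],  # groups of 2-4 images
    "query": "Find all objects of the same class as the dog in the first image.",
    "targets": [
        {"image": 1, "bbox": [34.0, 50.0, 120.0, 200.0]},
        {"image": 2, "bbox": [10.0, 12.0, 88.0, 150.0]},
    ],
}
```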
3. Model Architecture and Dual Reasoning Heads
GeM-VG builds upon the Qwen2-VL MLLM backbone, integrating vision and language via three stages:
- Vision Encoding: Each image $I_n$ is processed by the visual encoder to produce a feature map $F_n$.
- Projection: Features are mapped to the LLM embedding space through a linear projection $W$, giving $E_n = W F_n$.
- Multimodal Fusion: The visual embeddings $\{E_n\}$ and the text embedding of $Q$ are concatenated as input to the LLM decoder, which attends jointly over visual and text tokens.
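At the shape level, the three stages reduce to an encode-project-concatenate pipeline. The dimensions below are illustrative assumptions, not Qwen2-VL's actual widths:

```python
# Shape-level sketch of the three fusion stages (all dimensions illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm = 1024, 3584            # assumed encoder / LLM hidden widths

def encode_image(num_patches):
    """Stage 1: stand-in visual encoder, one feature vector per patch."""
    return rng.standard_normal((num_patches, d_vis))

W = rng.standard_normal((d_vis, d_llm)) * 0.01   # Stage 2: linear projection

features = [encode_image(256) for _ in range(3)]  # a group of three images
projected = [f @ W for f in features]             # map into LLM embedding space
text_emb = rng.standard_normal((32, d_llm))       # embedded query tokens

# Stage 3: one concatenated token sequence feeds the LLM decoder.
fused = np.concatenate(projected + [text_emb], axis=0)
```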
For output flexibility, GeM-VG supports both:
- Chain-of-Thought (CoT) Head: Outputs an intermediate reasoning chain culminating in the bounding boxes.
- Direct-Answer Head: Predicts bounding boxes and image indices directly, without intermediate explanation.
Both heads are implemented by the same transformer decoder without introducing extra parameters. The architecture enables the model to switch between explicit reasoning and concise answers depending on the prompt and data.
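Since both heads share one decoder, mode selection is purely prompt-side. The templates below are a minimal sketch of that switching; the tag names and wording are assumptions, not the paper's exact instructions:

```python
# Hypothetical instruction templates toggling CoT vs. direct output modes.
COT_TEMPLATE = (
    "{query}\nFirst reason step by step inside <think>...</think>, "
    "then give the final boxes inside <answer>...</answer>."
)
DIRECT_TEMPLATE = "{query}\nAnswer directly with the boxes inside <answer>...</answer>."

def build_prompt(query, use_cot):
    """Select the output mode without touching any model parameters."""
    template = COT_TEMPLATE if use_cot else DIRECT_TEMPLATE
    return template.format(query=query)
```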
4. Hybrid Reinforcement Finetuning Incorporating CoT and Direct Output
Following supervised finetuning for basic grounding and CoT conditioning, GeM-VG applies a two-stage, rule-based R1-like RL protocol. Central steps include:
- Rollout and Advantage Estimation: For prompt $q$, $G$ sampled completions $\{o_i\}$ are scored by a rule-based reward $r_i$; normalized advantages are $\hat{A}_i = \dfrac{r_i - \text{mean}(\{r_j\}_{j=1}^{G})}{\text{std}(\{r_j\}_{j=1}^{G})}$.
- Reward Composition: The total reward is
  $$R = R_{\text{fmt}} + R_{\text{img}} + R_{\text{IoU}} + R_{\text{rec}},$$
  where:
  - $R_{\text{fmt}}$: Output syntax correctness.
  - $R_{\text{img}}$: Correct selection of images containing targets.
  - $R_{\text{IoU}}$: Average IoU for matched predictions.
  - $R_{\text{rec}}$: Fraction of ground-truth targets matched.
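The four reward terms can be sketched as follows; the greedy IoU matching scheme, the unit weights, and the 0.5 matching threshold are assumptions for illustration:

```python
# Sketch of a four-part rule-based grounding reward (matching scheme assumed).
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, fmt_ok, thr=0.5):
    """pred/gt: lists of (image_index, box). Returns the summed reward."""
    r_fmt = 1.0 if fmt_ok else 0.0                                  # syntax
    r_img = 1.0 if {i for i, _ in pred} == {i for i, _ in gt} else 0.0  # image set
    matched, ious = set(), []
    for pi, pb in pred:                    # greedily match each prediction
        best, best_j = 0.0, None
        for j, (gi, gb) in enumerate(gt):
            if j in matched or gi != pi:
                continue
            v = iou(pb, gb)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            ious.append(best)
    r_iou = sum(ious) / len(pred) if pred else 0.0   # precision-style term
    r_rec = len(matched) / len(gt) if gt else 0.0    # recall term
    return r_fmt + r_img + r_iou + r_rec
```

A perfect prediction set scores 4.0 under these unit weights; dropping either the precision-style or the recall term would let degenerate outputs (over- or under-prediction) go unpunished, matching the ablation finding below that both are necessary.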
- Reward Modulation:
- Early-stage: Adjusts rewards to keep CoT and direct answers balanced, targeting a 0.5/0.5 sampling ratio.
- Late-stage: Adds a length-aware bonus to inexact CoT answers, encouraging detailed reasoning.
The final adjusted reward is used within the GRPO update.
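The early-stage balance modulation can be sketched as a small corrective term applied to each rollout's base reward; the linear form and the strength coefficient are assumptions, not the paper's exact rule:

```python
# Sketch of early-stage reward modulation: nudge per-rollout rewards so that
# CoT and direct rollouts converge toward a 50/50 ratio (coefficients assumed).
def modulate_rewards(rewards, is_cot, target=0.5, strength=0.1):
    """rewards: base rewards per rollout; is_cot: per-rollout mode flags."""
    ratio = sum(is_cot) / len(is_cot)       # observed fraction of CoT rollouts
    bonus = strength * (target - ratio)     # positive when CoT is under-used
    return [r + (bonus if c else -bonus) for r, c in zip(rewards, is_cot)]
```

The modulated rewards then replace the raw ones when computing the group-normalized advantages in the GRPO update.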
5. Experimental Evaluation Across Diverse Benchmarks
GeM-VG is evaluated on a variety of multi-image grounding, single-image grounding, video grounding, and general multi-image understanding tasks. All localization metrics use an IoU threshold of 0.5.
| Benchmark | Metric | GeM-VG Score | Prior State-of-the-Art | Improvement |
|---|---|---|---|---|
| MIG-Bench | Acc@0.5 | 74.65% | UniVG-R1: 72.64% | +2.0 points |
| MC-Bench | AP | 32.8 | Migician: 23.1 | +9.7 points |
| ODINW (rare classes) | AP | 41.1 | Qwen2-VL-7B: 32.0 | +9.1 points |
| ReasonVOS (video) | Acc@0.5 | 64.41% | UniVG-R1: 58.73% | +5.68 points |
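For reference, an Acc@0.5-style metric counts a prediction as correct when it selects the right image and overlaps the ground truth with IoU at or above 0.5; the one-prediction-per-example setup below is a simplifying assumption:

```python
# Sketch of an Acc@0.5-style localization metric.
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def acc_at_05(preds, gts):
    """preds/gts: parallel lists of (image_index, box), one pair per example."""
    hits = sum(
        1 for (pi, pb), (gi, gb) in zip(preds, gts)
        if pi == gi and iou(pb, gb) >= 0.5
    )
    return hits / len(gts)
```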
Additional results confirm no loss in general multimodal understanding: on MuirBench, MIBench, BLINK, and MMIU, GeM-VG matches or outperforms previous models.
6. Ablation Studies and Additional Insights
Several ablation studies in (Zheng et al., 8 Jan 2026) quantify the design choices:
- CoT vs. Direct Output: CoT-only stage 2 SFT yields 71.46 (MIG-Bench), direct-only 69.45, and balanced mix 71.97, supporting complementary benefit.
- Hybrid RL Components: Full hybrid reward modulation achieves the best performance (74.31 MIG-Bench), outperforming early- or late-stage only, as well as unmodulated RL.
- Reward Terms: Omitting precision or recall components reduces AP on ODINW; both are necessary for robust localization.
- MG-Data-240K Contribution: Removing the dataset drops MIG-Bench from 74.65 to 72.04 and MC-Bench from 32.8 to 22.4, underlining the necessity of task and cue diversity.
- Backbone Generalization: The protocol transfers effectively; applying the same SFT+RL pipeline to Qwen2.5-VL-7B yields notable improvements (69.7 MIG-Bench, 39.7 MC-Bench).
In summary, GeM-VG constitutes a unified approach to generalized multi-image visual grounding. Its unique combination of data composition, dual-output architecture, and hybrid reinforcement learning strategies yields state-of-the-art results and strong transfer across datasets and model backbones (Zheng et al., 8 Jan 2026).