GeM-VG: Multi-Image Visual Grounding

Updated 15 January 2026
  • The paper presents GeM-VG, a multimodal LLM that unifies multi-image visual grounding with dual reasoning heads and a novel hybrid reinforcement learning strategy.
  • It introduces MG-Data-240K, a large-scale dataset covering referential, semantic, spatial, and temporal associations to address past grounding limitations.
  • The model achieves state-of-the-art results, showing significant improvements across benchmarks such as MIG-Bench and ODINW over previous methods.

GeM-VG is a Multimodal LLM (MLLM) designed for Generalized Multi-image Visual Grounding. The framework overcomes key limitations of prior multi-image grounding work, such as the restriction to single-target localization and narrow task diversity. GeM-VG introduces a unified formalism for multi-image grounding, a large-scale training dataset, a dual-mode reasoning/output architecture, and a hybrid reinforcement learning (RL) strategy that integrates chain-of-thought (CoT) and direct answering. The model demonstrates strong results across diverse grounding and multimodal understanding benchmarks (Zheng et al., 8 Jan 2026).

1. Problem Formalization and Learning Objective

Generalized Multi-image Visual Grounding (VG) requires the system to locate, given a set of $m$ images $V = \{V_1, \dots, V_m\}$ and a textual query $T$, a set $O$ of $n$ instances. Each instance consists of a bounding box $b_k = (x_1^k, y_1^k, x_2^k, y_2^k)$ and a corresponding image index $i_k \in \{1, \dots, m\}$:

$$O = \{(b_1, i_1), \dots, (b_n, i_n)\} = \mathcal{M}(V, T)$$
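As an illustration, the output set $O$ can be serialized as structured text. The JSON-style format below is an assumption for clarity; the paper's exact serialization may differ:

```python
import json

def serialize_grounding_output(instances):
    """Serialize (bounding box, image index) pairs as a JSON string.

    Each instance is ((x1, y1, x2, y2), image_index). The exact output
    syntax used by GeM-VG is an assumption here.
    """
    return json.dumps([
        {"image": idx, "box": list(box)} for box, idx in instances
    ])

# Two targets located in images 1 and 3 of the group.
out = serialize_grounding_output([((10, 20, 110, 220), 1),
                                  ((5, 5, 50, 60), 3)])
```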

GeM-VG addresses both single-target and multi-target scenarios, requiring cross-image comparison, association, and reasoning. Supervised finetuning employs cross-entropy loss over the serialized output. For RL-based finetuning, the model uses Group Relative Policy Optimization (GRPO):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)}\, A_i - \beta\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid q)\,\|\,\pi_\text{ref}(\cdot \mid q)\right) \right]$$

where $A_i$ is the normalized advantage of sampled rollout $o_i$ in response to prompt $q$.
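The group-relative advantage and the per-group objective above can be sketched in NumPy. This is a minimal illustration, not the training implementation; the KL term uses a simple per-sample log-ratio estimate, which is an assumption, and `beta` is an assumed penalty weight:

```python
import numpy as np

def grpo_advantages(rewards):
    """Normalize rollout rewards within a group: A_i = (r_i - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, logp_ref, rewards, beta=0.04):
    """Per-group GRPO objective: mean of ratio * A_i minus a KL penalty.

    logp_* are sequence log-probabilities of each rollout under the current,
    old, and reference policies (stand-in inputs for this sketch).
    """
    adv = grpo_advantages(rewards)
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    kl = np.asarray(logp_new) - np.asarray(logp_ref)  # simple KL estimate
    return float(np.mean(ratio * adv - beta * kl))
```

When current, old, and reference policies coincide, the ratio is 1 and the KL term vanishes, so the objective reduces to the mean normalized advantage (zero by construction).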

2. MG-Data-240K: A Comprehensive Multi-image Grounding Dataset

To address the limitations of prior datasets, GeM-VG introduces MG-Data-240K, consisting of 240,000 multi-image grounding samples categorized by their cross-image dependency and reasoning requirement:

| Task Type | #Samples | Source Datasets |
|---|---|---|
| Referential Retrieval | 97K | D³, COCO |
| Semantic Association | 77K | COCO |
| Spatial Association | 20K | Ego-Exo4D, MVTrack |
| Temporal Association | 46K | STAR |

  • Referential Retrieval: Locates single or multiple targets using explicit descriptions, no cross-image cues.
  • Semantic Association: Matches objects of the same class across images.
  • Spatial Association: Matches the same object under varying viewpoints.
  • Temporal Association: Matches objects across video frames.

Image groups range from two to four images (mean $\sim 3$), with both single- and multi-target examples. This composition addresses the issues of target quantity, image relation, and task diversity present in earlier benchmarks.
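To make the dataset composition concrete, here is a hypothetical structure for one MG-Data-240K sample; the field names and layout are assumptions for illustration, not the released format:

```python
# Hypothetical structure of one multi-image grounding sample.
# Field names and serialization are assumptions; the released
# MG-Data-240K format may differ.
sample = {
    "task_type": "semantic_association",  # one of the four task categories
    "images": ["img_001.jpg", "img_002.jpg", "img_003.jpg"],  # 2-4 per group
    "query": "Find all dogs of the same breed as the one in the first image.",
    "targets": [  # each target: containing-image index + bounding box
        {"image_index": 2, "box": [34, 50, 210, 310]},
        {"image_index": 3, "box": [12, 80, 150, 260]},
    ],
}

# Groups contain between two and four images.
assert 2 <= len(sample["images"]) <= 4
```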

3. Model Architecture and Dual Reasoning Heads

GeM-VG builds upon the Qwen2-VL MLLM backbone, integrating vision and language via three stages:

  1. Vision Encoding: Each image $V_j$ is processed by a visual encoder (CNN or transformer) to produce a feature map $F_j$.
  2. Projection: Features are mapped into the LLM embedding space through a linear projection $W_p$, giving $H_v^{(j)} = W_p\,\mathrm{Flatten}(F_j)$.
  3. Multimodal Fusion: $\{H_v^{(j)}\}$ and the text embedding $H_T$ of $T$ are concatenated as input to the LLM decoder with cross-attention.

For output flexibility, GeM-VG supports both:

  • Chain-of-Thought (CoT) Head: Outputs an intermediate reasoning chain culminating in the bounding boxes.
  • Direct-Answer Head: Predicts bounding boxes and image indices directly, without intermediate explanation.

Both heads are implemented by the same transformer decoder without introducing extra parameters. The architecture enables the model to switch between explicit reasoning and concise answers depending on the prompt and data.
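The three stages above can be sketched with stand-in components. The random "encoder", the feature and embedding widths, and the patch count are all assumptions; the sketch only shows how the per-image projections and the text embedding are assembled into one decoder input sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm = 256, 512  # assumed visual-feature / LLM embedding widths

def encode_image(image, n_patches=16):
    """Stand-in vision encoder: maps an image to a patch feature map F_j."""
    return rng.standard_normal((n_patches, d_vis))

W_p = rng.standard_normal((d_vis, d_llm)) * 0.02  # linear projection W_p

def build_llm_input(images, text_embeddings):
    """Project each F_j into the LLM space and concatenate with H_T."""
    H_v = [encode_image(img) @ W_p for img in images]  # H_v^{(j)} per image
    return np.concatenate(H_v + [text_embeddings], axis=0)

H_T = rng.standard_normal((8, d_llm))  # stand-in text embedding of T
seq = build_llm_input(["img1", "img2", "img3"], H_T)
```

With three images of 16 patches each plus 8 text tokens, the fused sequence has 56 rows in the LLM embedding space.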

4. Hybrid Reinforcement Finetuning Incorporating CoT and Direct Output

Following supervised finetuning for basic grounding and CoT conditioning, GeM-VG applies a two-stage, rule-based R1-like RL protocol. Central steps include:

  • Rollout and Advantage Estimation: For prompt $q$, $N$ sampled completions $\{o_i\}$ are scored by a rule-based reward $R(q, a, o_i)$; the normalized advantages are:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_j\})}{\mathrm{std}(\{r_j\})}$$

  • Reward Composition: The total reward is

$$R = R_\text{format} + R_\text{image} + R_\text{precision} + R_\text{recall}$$

where:
  • $R_\text{format} \in \{0, 1\}$: output syntax correctness.
  • $R_\text{image} \in \{0, 1\}$: correct selection of the images containing targets.
  • $R_\text{precision} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{IoU}_i$: average IoU over the $M$ matched predictions.
  • $R_\text{recall} = \frac{1}{|\mathrm{GT}|} \sum_{i=1}^{M} \mathbb{I}(\mathrm{IoU}_i \ge 0.5)$: fraction of ground-truth targets matched.

  • Reward Modulation:
    • Early-stage: Adjusts to maintain balance of CoT and direct answers, targeting a 0.5/0.5 ratio.
    • Late-stage: Adds a length-aware bonus to inexact CoT answers, encouraging detailed reasoning.

The final adjusted reward is used within the GRPO update.
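The base reward terms can be sketched as follows. For simplicity this sketch assumes predictions are already matched one-to-one with ground-truth boxes by index, which simplifies the paper's matching step:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(preds, gts, format_ok=True, images_ok=True):
    """Sum of format, image, precision, and recall rewards.

    preds / gts are lists of boxes assumed already matched by index,
    a simplification of the matching used in the paper.
    """
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    r_precision = sum(ious) / len(ious) if ious else 0.0           # mean IoU
    r_recall = sum(v >= 0.5 for v in ious) / len(gts) if gts else 0.0
    return float(format_ok) + float(images_ok) + r_precision + r_recall
```

A perfect single-target prediction with correct format and image selection scores the maximum reward of 4.0 under this composition.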

5. Experimental Evaluation Across Diverse Benchmarks

GeM-VG is evaluated on a variety of multi-image grounding, single-image grounding, video grounding, and general multi-image understanding tasks. All localization metrics use an IoU threshold of $\tau = 0.5$.

| Benchmark | Metric | GeM-VG Score | Prior State-of-the-Art | Improvement |
|---|---|---|---|---|
| MIG-Bench | [email protected] | 74.65% | UniVG-R1: 72.64% | +2.0 points |
| MC-Bench | AP50 | 32.8 | Migician: 23.1 | +9.7 points |
| ODINW (rare classes) | AP50 | 41.1 | Qwen2-VL-7B: 32.0 | +9.1 points |
| ReasonVOS (video) | [email protected] | 64.41% | UniVG-R1: 58.73% | +5.68 points |

Additional results confirm no loss in general multimodal understanding: on MuirBench, MIBench, BLINK, and MMIU, GeM-VG matches or outperforms previous models.

6. Ablation Studies and Additional Insights

Several ablation studies in (Zheng et al., 8 Jan 2026) quantify the design choices:

  • CoT vs. Direct Output: CoT-only stage 2 SFT yields 71.46 (MIG-Bench), direct-only 69.45, and balanced mix 71.97, supporting complementary benefit.
  • Hybrid RL Components: Full hybrid reward modulation achieves the best performance (74.31 MIG-Bench), outperforming early- or late-stage only, as well as unmodulated RL.
  • Reward Terms: Omitting the precision or recall component reduces AP50 on ODINW; both are necessary for robust localization.
  • MG-Data-240K Contribution: Removing the dataset drops MIG-Bench from 74.65 to 72.04 and MC-Bench from 32.8 to 22.4, underlining the necessity of task and cue diversity.
  • Backbone Generalization: The protocol transfers effectively; applying the same SFT+RL pipeline to Qwen2.5-VL-7B yields notable improvements (69.7 MIG-Bench, 39.7 MC-Bench).

In summary, GeM-VG constitutes a unified approach to generalized multi-image visual grounding. Its unique combination of data composition, dual-output architecture, and hybrid reinforcement learning strategies yields state-of-the-art results and strong transfer across datasets and model backbones (Zheng et al., 8 Jan 2026).
