Differential Grounding (DiG) for Fine-Grained MLLMs

Updated 12 June 2026
  • Differential Grounding (DiG) is a framework that requires MLLMs to detect and localize subtle differences in image pairs, thereby bolstering object-level perception.
  • It leverages a fully automated 3D rendering pipeline to generate paired images with precise bounding box annotations, ensuring scalable and controlled supervision.
  • A curriculum learning strategy with hybrid F1 and IoU rewards enables stable RL convergence and leads to significant performance improvements on multimodal spatial reasoning benchmarks.

Differential Grounding (DiG) is a proxy-task framework for multimodal LLMs (MLLMs) designed to advance fine-grained perception and spatial reasoning. The DiG task requires an MLLM to identify and localize every difference between two visually similar images without prior information about the number of differences, thereby explicitly supervising the development of object-level and compositional perception capabilities (Tao et al., 14 Dec 2025).

1. Formal Problem Definition and Objective

Given a pair of images, $I_a$ (reference) and $I_b$ (perturbed), which differ by an unknown and variable number $M$ of localized changes, the Differential Grounding task requires the model to (i) estimate $\hat{M}$, the total number of differences, and (ii) produce spatial localizations of each difference as bounding boxes. The input is formally:

  • $X = (I_a, I_b, P)$, where $I_a, I_b \in \mathbb{R}^{H \times W \times 3}$ and $P$ is a prompt such as “find all differences.”

The model outputs a token sequence $O = (o_1, \ldots, o_T)$, which is decoded into a predicted set of bounding boxes:

$$B_{pred} = \{b_i\}_{i=1}^{\hat{N}}, \;\; b_i = [x_{min}, y_{min}, x_{max}, y_{max}]$$

with ground truth:

$$B_{gt} = \{b_j^{gt}\}_{j=1}^{M}$$

The optimization objective maximizes a composite reward with two components: an output-format indicator and an accuracy term. The accuracy term combines a detection-level F1 score (computed with Hungarian matching between predicted and ground-truth boxes) with the average generalized Intersection-over-Union (IoU) over correctly matched pairs.
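The F1-plus-IoU accuracy term described above can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: plain IoU stands in for the generalized variant, a brute-force search over assignments stands in for the Hungarian algorithm (both return an optimal one-to-one matching, and brute force is adequate for a handful of boxes), and the names `accuracy_reward`, `iou_thresh`, and `alpha`, along with their default values, are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Plain IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_matching(pred, gt):
    """Optimal one-to-one matching that maximizes total IoU.

    Brute force over assignments; a stand-in for Hungarian matching that is
    only practical for small box sets.
    """
    if not pred or not gt:
        return []
    swapped = len(pred) > len(gt)
    small, large = (gt, pred) if swapped else (pred, gt)
    best, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best = [(j, i) if swapped else (i, j) for i, j in enumerate(perm)]
    return best  # list of (pred_index, gt_index)


def accuracy_reward(pred, gt, iou_thresh=0.5, alpha=0.5):
    """Hypothetical hybrid reward: alpha * F1 + (1 - alpha) * mean matched IoU.

    iou_thresh and alpha are illustrative values, not taken from the source.
    """
    ious = [iou(pred[p], gt[g]) for p, g in best_matching(pred, gt)]
    hits = [v for v in ious if v >= iou_thresh]  # correctly matched pairs
    tp = len(hits)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    mean_iou = sum(hits) / tp if tp else 0.0
    return alpha * f1 + (1 - alpha) * mean_iou


# One spurious box next to a perfect one lowers F1 but not the matched IoU.
print(accuracy_reward([[0, 0, 10, 10], [20, 20, 30, 30]], [[0, 0, 10, 10]]))
```

The F1 term penalizes over- and under-counting differences, while the IoU term rewards tight localization of the boxes that do match, which is one way to read why a hybrid of the two might behave better than either alone.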

Training uses Group Relative Policy Optimization (GRPO) with a KL-regularized RL loss. A clipped surrogate objective is used to stabilize optimization.
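As a rough illustration of the two ingredients named above, the sketch below computes group-relative advantages (each sampled response's reward standardized against the other responses for the same prompt) and a PPO-style clipped surrogate term. It follows the commonly published GRPO formulation, not necessarily the exact loss used for DiG; the KL penalty is omitted, and `clip_eps=0.2` and the function names are assumed for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample (to be maximized).

    ratio is pi_new(o | x) / pi_old(o | x); clip_eps is an assumed value.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A group in which only one of four responses earned any reward.
advantages = group_relative_advantages([0.0, 0.0, 0.8, 0.0])
print(advantages)

# If every response in the group gets the same reward, all advantages are zero
# and the group contributes no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why sparse rewards are a problem for group-relative methods: a group whose responses all score zero yields zero advantages.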

2. Automated Synthetic Data Generation Pipeline

The DiG framework employs a fully automated, 3D rendering-based pipeline to generate large-scale paired datasets with precise supervision:

  • Base Scene Generation: Scenes are rendered in Blender with a sampled number of objects per scene. Object attributes (shape, color, size, material) are drawn from discrete, controlled sets, and 3D positions are randomized within a tabletop region. 2D bounding box projections are computed for annotation.
  • Perturbation: For each example, $M$ objects/regions are selected for perturbation, with $M$ set by the current curriculum phase. Valid perturbations include discrete changes in shape, color, size, or material, as well as object addition or removal. The perturbed scene is rendered as $I_b$, and the exact differences are recorded as $B_{gt}$ via projection.

This approach supports exhaustive control over the visual content and difference types, ensures precise ground-truth annotations, and enables scalable data synthesis without manual labeling.
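The annotation step, projecting a 3D object to a 2D bounding box, can be sketched with a simple pinhole camera model. This is an illustrative stand-in and not the pipeline's code: the focal length, image size, camera-space coordinates, and the function names are all assumptions, and each object is approximated by an axis-aligned cube. Inside Blender, a helper such as `bpy_extras.object_utils.world_to_camera_view` would typically handle the camera projection instead.

```python
from itertools import product


def project_point(point, focal, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) with z > 0."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def project_bbox(center, half_extent, focal=500.0, width=640, height=480):
    """2D box [xmin, ymin, xmax, ymax] enclosing a cube's projected corners.

    center is the cube center in camera coordinates; all eight corners are
    assumed to lie in front of the camera. Camera parameters are hypothetical.
    """
    cx, cy = width / 2.0, height / 2.0
    corners = [
        tuple(c + s * half_extent for c, s in zip(center, signs))
        for signs in product((-1.0, 1.0), repeat=3)
    ]
    pts = [project_point(p, focal, cx, cy) for p in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def clamp(value, upper):
        # Keep the box inside the image bounds.
        return max(0.0, min(float(upper), value))

    return [clamp(min(xs), width), clamp(min(ys), height),
            clamp(max(xs), width), clamp(max(ys), height)]


# A unit cube centered 5 units in front of the camera.
print(project_bbox((0.0, 0.0, 5.0), 0.5))
```

Because the boxes come from the scene geometry rather than from human annotators, the ground truth stays exact however small the attribute change is.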

| Component | Function | Key Options/Parameters |
|---|---|---|
| Base Scene Sampling | Generate random scene in Blender | Shapes: Cube/Sphere/Cylinder; Colors: {blue, …, yellow}; Sizes: {0.4, 0.6, 0.8}; Materials: metallic, matte |
| Perturbation Mechanism | Edit $M$ items per scene | Change discrete attribute or add/remove object |
| Bounding Box Annotation | 3D→2D projection for supervision | Exact location of all differences |
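The sampling and perturbation logic summarized in the table can be sketched without any rendering. The attribute sets below are taken from the table, except that the color set is elided there, so only its two named endpoints appear. The scene representation, the position range, the function names, and the omission of object addition are simplifying assumptions.

```python
import copy
import random

SHAPES = ["cube", "sphere", "cylinder"]
COLORS = ["blue", "yellow"]  # the source lists {blue, …, yellow}; the rest is elided
SIZES = [0.4, 0.6, 0.8]
MATERIALS = ["metallic", "matte"]
ATTRS = {"shape": SHAPES, "color": COLORS, "size": SIZES, "material": MATERIALS}


def sample_scene(n_objects, rng):
    """Random scene description; the position range is a hypothetical tabletop."""
    return [
        {
            "id": i,
            "shape": rng.choice(SHAPES),
            "color": rng.choice(COLORS),
            "size": rng.choice(SIZES),
            "material": rng.choice(MATERIALS),
            "pos": (rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)),
        }
        for i in range(n_objects)
    ]


def perturb(scene, m, rng):
    """Apply exactly m differences (attribute change or removal) to a copy.

    Returns (perturbed_scene, changes), where changes lists
    (object_id, operation, new_value). Object addition is left out for brevity.
    """
    perturbed = copy.deepcopy(scene)
    removed, changes = set(), []
    for idx in rng.sample(range(len(perturbed)), m):  # m distinct objects
        obj = perturbed[idx]
        op = rng.choice(list(ATTRS) + ["remove"])
        if op == "remove":
            removed.add(obj["id"])
            changes.append((obj["id"], "remove", None))
        else:
            new_value = rng.choice([v for v in ATTRS[op] if v != obj[op]])
            obj[op] = new_value
            changes.append((obj["id"], op, new_value))
    return [o for o in perturbed if o["id"] not in removed], changes


rng = random.Random(0)
scene_a = sample_scene(5, rng)
scene_b, diffs = perturb(scene_a, 2, rng)
print(diffs)
```

In a full pipeline, both scene descriptions would be rendered and each entry in `diffs` projected to a 2D box to form the ground-truth annotation.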

3. Curriculum Learning for Stable Optimization

Reward sparsity arising from rare and spatially precise difference signals is addressed through a curriculum learning procedure comprising three phases with progressively increasing task complexity:

  • Phase 1: $M = 1$ (single-difference)
  • Phase 2: $M = 2$ (double-difference)
  • Phase 3: mixed difference counts up to a fixed maximum, fine-tuning for 30k steps

This staged progression from dense reward (single-difference) to sparse, compositional regimes (multiple-differences) enables stable RL convergence and gradual acquisition of generalized spatial reasoning.
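One way to express such a staged schedule is a function that maps the global training step to the number of differences for the next sample. The phase lengths, the maximum difference count, and uniform sampling in the mixed phase are all assumptions made for illustration; the function and parameter names are hypothetical.

```python
import random


def sample_num_differences(step, phase1_steps, phase2_steps, max_m, rng):
    """Number of differences M for the training sample at a given step.

    Phase 1 uses M = 1, Phase 2 uses M = 2, and the final phase draws M
    uniformly from 1..max_m (the uniform draw is an assumption).
    """
    if step < phase1_steps:
        return 1
    if step < phase1_steps + phase2_steps:
        return 2
    return rng.randint(1, max_m)


rng = random.Random(0)
# Hypothetical budgets: 10 steps per early phase, then the mixed phase.
schedule = [sample_num_differences(s, 10, 10, 4, rng) for s in range(0, 30, 5)]
print(schedule)
```

Starting with single-difference pairs makes it more likely that at least one response in a sampled group earns a nonzero reward, which is what gives the policy a usable gradient early in training.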

4. Training Loop Overview

The training loop alternates between data generation and model update within each curriculum stage. The process takes a pretrained MLLM and sequentially adapts it via RL on the DiG task under group advantage feedback, following the staged curriculum over the number of differences $M$.
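The overall control flow can be summarized in a short skeleton. It is a sketch of the loop structure only: the data generator, policy sampler, reward function, and update step are passed in as callables, and every name and default here (`train_dig`, `group_size=4`, the phase tuples) is hypothetical.

```python
import math
import random


def train_dig(phases, sample_pair, policy_sample, reward_fn, policy_update,
              group_size=4, seed=0):
    """Curriculum RL loop skeleton.

    phases: list of (name, num_steps, draw_m), where draw_m(rng) returns M.
    sample_pair(m, rng) -> (image_pair, gt_boxes)
    policy_sample(image_pair, rng) -> predicted boxes for one response
    reward_fn(pred_boxes, gt_boxes) -> float
    policy_update(outputs, advantages) -> None
    """
    rng = random.Random(seed)
    log = []
    for name, num_steps, draw_m in phases:
        for _ in range(num_steps):
            m = draw_m(rng)                         # curriculum picks M
            pair, gt_boxes = sample_pair(m, rng)    # data generation
            outputs = [policy_sample(pair, rng) for _ in range(group_size)]
            rewards = [reward_fn(o, gt_boxes) for o in outputs]
            mean = sum(rewards) / group_size
            std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
            advantages = [(r - mean) / (std + 1e-8) for r in rewards]
            policy_update(outputs, advantages)      # model update
            log.append((name, m, mean))
    return log


# Dry run with stub components in place of the renderer and the MLLM.
history = train_dig(
    phases=[("phase1", 2, lambda r: 1), ("phase2", 1, lambda r: 2)],
    sample_pair=lambda m, r: (("I_a", "I_b"), [[0, 0, 1, 1]] * m),
    policy_sample=lambda pair, r: [],
    reward_fn=lambda pred, gt: 0.0,
    policy_update=lambda outputs, advantages: None,
)
print(history)
```

Passing the components as callables keeps the curriculum logic separate from the renderer and the model, so the same loop could be exercised with stubs, as in the dry run.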

5. Experimental Evaluation and Benchmark Results

The DiG method yields substantial improvements across a range of fine-grained and general multimodal benchmarks (here presented for Qwen3-VL-4B/8B).

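| Task/Benchmark | Improvement (4B) | Improvement (8B) |
|---|---|---|
| HalBench | +3.8 | +3.4 |
| HRB8K | +1.7 | +0.5 |
| VSR | +1.4 | +0.4 |
| RefCOCO+ val@50 | +2.4 | +2.4 |
| RefCOCOg val@50 | +2.2 | +1.4 |
| MMBench | +0.8 | +2.2 |

Additional gains are reported as single values, without a per-model breakdown: RefCOCO testA@50 IoU up to +3.4, MMStar +3.5, and ScienceQA +2.3.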

Ablation studies indicate:

  • Performance increases progressively from DiG-1 (single-difference curriculum) through DiG-2 and DiG-Mix.
  • Hybrid rewards combining F1 and IoU terms consistently outperform either metric alone.

6. High-Level Insights and Broader Impact

  • Explicit differential supervision ensures the model's attention is directed toward subtle local changes, addressing known limitations of holistic global-alignment objectives in MLLMs.
  • Synthetic controllable data mitigates annotation bottlenecks and covers challenging edge cases, including minor attribute changes, under consistent rendering.
  • Curriculum learning dynamically shapes the reward landscape, enabling stable policy optimization and facilitating acquisition of multi-object compositionality.
  • DiG training yields domain-general improvements on standard multimodal perception and grounding benchmarks, indicating effective transfer of fine-grained and spatially precise representations.

A plausible implication is that differential grounding serves as a scalable, robust proxy for developing and benchmarking fine-grained visual reasoning skills in future generations of multimodal LLMs (Tao et al., 14 Dec 2025).
