Differential Grounding (DiG) for Fine-Grained MLLMs

Updated 12 June 2026

Differential Grounding (DiG) is a framework that requires MLLMs to detect and localize subtle differences in image pairs, thereby bolstering object-level perception.
It leverages a fully automated 3D rendering pipeline to generate paired images with precise bounding box annotations, ensuring scalable and controlled supervision.
A curriculum learning strategy with hybrid F1 and IoU rewards enables stable RL convergence and leads to significant performance improvements on multimodal spatial reasoning benchmarks.

Differential Grounding (DiG) is a proxy-task framework for multimodal LLMs (MLLMs) designed to advance fine-grained perception and spatial reasoning. The DiG task requires an MLLM to identify and localize every difference between two visually similar images without prior information about the number of differences, thereby explicitly supervising the development of object-level and compositional perception capabilities (Tao et al., 14 Dec 2025).

1. Formal Problem Definition and Objective

Given a pair of images, $I_a$ (reference) and $I_b$ (perturbed), that differ by an unknown and variable number $M$ of localized changes, the Differential Grounding task requires the model to (i) estimate $\hat{M}$, the total number of differences, and (ii) localize each difference with a bounding box. Formally, the input is:

$X = (I_a, I_b, P)$, where $I_a, I_b \in \mathbb{R}^{H \times W \times 3}$ and $P$ is a prompt such as “find all differences.”

The model outputs a token sequence $O = (o_1, \dots, o_T)$, which is decoded into a predicted set of bounding boxes:

$B_{pred} = \{b_i\}_{i=1}^{\hat{M}}, \;\; b_i = [x_{min}, y_{min}, x_{max}, y_{max}]$

with ground truth:

$B_{gt} = \{b_j^{gt}\}_{j=1}^{M}$
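To make the decoding step concrete, here is a minimal sketch of turning a generated string into a list of predicted boxes. The output syntax (bracketed groups of four numbers) and the helper name `parse_boxes` are assumptions made for illustration; the source does not specify the model's output format.

```python
import re

# Hypothetical output syntax: each box is written as [x_min, y_min, x_max, y_max].
_NUM = r"(-?\d+(?:\.\d+)?)"
_BOX = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")

def parse_boxes(text):
    """Return all well-formed boxes found in `text` as float 4-tuples."""
    boxes = []
    for match in _BOX.finditer(text):
        x0, y0, x1, y1 = (float(g) for g in match.groups())
        if x0 < x1 and y0 < y1:  # assumption: drop boxes without positive area
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Under this convention, the number of parsed boxes is the model's implied count of differences, so no separate count needs to be emitted.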

The optimization objective maximizes a composite reward with two components: an output-format indicator and an accuracy term. The accuracy term combines the detection-level F1 score (computed with Hungarian matching) and the average generalized Intersection-over-Union (IoU) over correctly matched pairs.
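A hybrid F1 and IoU reward of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses plain IoU rather than a generalized variant, an exhaustive search in place of Hungarian matching (it finds the same optimal assignment, but only scales to a handful of boxes), and the match threshold `iou_thresh` and mixing weight `alpha` are hypothetical parameters.

```python
from itertools import permutations

def iou(a, b):
    """Plain IoU of two [x_min, y_min, x_max, y_max] boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def best_matching(pred, gt):
    """One-to-one assignment maximizing total IoU, found by exhaustive search."""
    swapped = len(pred) > len(gt)
    small, large = (gt, pred) if swapped else (pred, gt)
    best, best_pairs = -1.0, []
    for perm in permutations(range(len(large)), len(small)):
        total = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if total > best:
            best = total
            best_pairs = [(j, i) if swapped else (i, j) for i, j in enumerate(perm)]
    return best_pairs  # list of (pred_index, gt_index)

def hybrid_reward(pred, gt, iou_thresh=0.5, alpha=0.5):
    """alpha * F1 + (1 - alpha) * mean IoU over matched pairs above the threshold."""
    if not pred or not gt:
        return 0.0
    ious = [iou(pred[p], gt[g]) for p, g in best_matching(pred, gt)]
    hits = [v for v in ious if v >= iou_thresh]
    if not hits:
        return 0.0
    precision, recall = len(hits) / len(pred), len(hits) / len(gt)
    f1 = 2 * precision * recall / (precision + recall)
    return alpha * f1 + (1 - alpha) * sum(hits) / len(hits)
```

For larger box sets, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual replacement for the exhaustive search.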

Training uses Group Relative Policy Optimization (GRPO) with a KL-regularized RL loss; a clipped surrogate objective is used to stabilize optimization.
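For reference, the GRPO objective as it is commonly written in the literature is sketched below in simplified, sequence-level notation. The symbols $G$ (group size), $r_i$ (reward of the $i$-th sampled output $O_i$), $\epsilon$ (clip range), $\beta$ (KL weight), and $\pi_{ref}$ (reference policy) are notation assumed here; the exact loss used for DiG may differ, for example in its token-level normalization or KL estimator.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(O_i \mid X)}{\pi_{\theta_{old}}(O_i \mid X)}

J(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\left( \rho_i(\theta)\, \hat{A}_i,\;
  \operatorname{clip}\left(\rho_i(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_i \right)
  - \beta\, D_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) \right]
```

The clipping bounds how far a single update can move the policy ratio, and the KL term keeps the policy close to the reference model.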

2. Automated Synthetic Data Generation Pipeline

The DiG framework employs a fully automated, 3D rendering-based pipeline to generate large-scale paired datasets with precise supervision:

Base Scene Generation: Scenes are rendered in Blender with a varying number of objects. Object attributes (shape, color, size, material) are drawn from discrete, controlled sets, and 3D positions are randomized within a tabletop region. 2D bounding box projections are computed for annotation.
Perturbation: For each example, $M$ objects or regions are selected for perturbation, with $M$ set by the current curriculum phase. Valid perturbations are discrete changes in shape, color, size, or material, as well as object addition or removal. The perturbed scene is rendered as $I_b$, and the exact differences are recorded as $B_{gt}$ via projection.

This approach supports exhaustive control over the visual content and difference types, ensures precise ground-truth annotations, and enables scalable data synthesis without manual labeling.

Component	Function	Key Options/Parameters
Base Scene Sampling	Generate a random scene in Blender	Shapes: cube, sphere, cylinder; Colors: {blue, …, yellow}; Sizes: {0.4, 0.6, 0.8}; Materials: metallic, matte
Perturbation Mechanism	Edit $M$ items per scene	Change a discrete attribute, or add/remove an object
Bounding Box Annotation	3D→2D projection for supervision	Exact location of all differences
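The sampling and perturbation logic in the table above can be sketched without a renderer. This is a simplified illustration, not the authors' pipeline: it covers attribute changes only (no object addition or removal), omits 3D placement and Blender rendering, and the function names and the object count of 5 are hypothetical.

```python
import random

# Attribute sets named in the table above. The source elides most of the
# colour list, so only the two colours it names are used here.
ATTRS = {
    "shape": ["cube", "sphere", "cylinder"],
    "color": ["blue", "yellow"],
    "size": [0.4, 0.6, 0.8],
    "material": ["metallic", "matte"],
}

def sample_scene(num_objects, rng):
    """Draw each object's attributes independently from the discrete sets."""
    return [{k: rng.choice(v) for k, v in ATTRS.items()} for _ in range(num_objects)]

def perturb(scene, m, rng):
    """Change exactly one attribute on each of `m` distinct objects.

    Returns the perturbed scene and the indices of the edited objects, which
    is where ground-truth boxes would be projected after rendering.
    """
    perturbed = [dict(obj) for obj in scene]
    changed = sorted(rng.sample(range(len(scene)), m))
    for idx in changed:
        attr = rng.choice(list(ATTRS))
        options = [v for v in ATTRS[attr] if v != perturbed[idx][attr]]
        perturbed[idx][attr] = rng.choice(options)
    return perturbed, changed

rng = random.Random(0)
scene_a = sample_scene(5, rng)               # 5 objects is an arbitrary choice
scene_b, changed = perturb(scene_a, 2, rng)  # two differences
```

Because every edit is made programmatically, the list of changed objects is known exactly, which is what removes the need for manual labeling.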

3. Curriculum Learning for Stable Optimization

Reward sparsity arising from rare and spatially precise difference signals is addressed through a curriculum learning procedure comprising three phases with progressively increasing task complexity:

Phase 1: $M = 1$ (single-difference)
Phase 2: $M = 2$ (double-difference)
Phase 3: mixed $M$ up to a maximum $M_{\max}$, fine-tuning (30k steps)

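Mathematically, with $k$ indexing the curriculum phase and $M_{\max}$ denoting the largest number of differences in the mixed phase:

$M_k \in \begin{cases} \{1\}, & k = 1 \\ \{2\}, & k = 2 \\ \{m : m \le M_{\max}\}, & k = 3 \end{cases}$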

This staged progression from dense reward (single-difference) to sparse, compositional regimes (multiple-differences) enables stable RL convergence and gradual acquisition of generalized spatial reasoning.
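The staged schedule can be sketched as a small sampler. The uniform draw in the mixed phase, the `max_m` parameter, and the step boundaries passed to `phase_for_step` are assumptions for illustration; the source states only that the final phase mixes difference counts.

```python
import random

def sample_num_differences(phase, max_m, rng):
    """Number of differences M for one training example in the given phase."""
    if phase == 1:
        return 1
    if phase == 2:
        return 2
    if phase == 3:
        # Assumption: uniform over 1..max_m; the source only says "mixed".
        return rng.randint(1, max_m)
    raise ValueError(f"unknown curriculum phase: {phase}")

def phase_for_step(step, boundaries):
    """Map a global step to a 1-based phase, given cumulative step boundaries."""
    for phase, end in enumerate(boundaries, start=1):
        if step < end:
            return phase
    return len(boundaries)  # steps past the last boundary stay in the final phase
```

For example, with hypothetical boundaries `(10, 20, 30)`, steps 0 to 9 fall in phase 1, steps 10 to 19 in phase 2, and later steps in phase 3.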

4. Training Loop Overview

The training loop alternates between data generation and model update within each curriculum stage.

This process takes a pretrained MLLM and sequentially adapts it via RL on the DiG task under group advantage feedback, following the staged curriculum over $M$.
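The loop described above can be sketched as a skeleton with the model and data generator passed in as callables. Everything here is a hypothetical scaffold: the callable names, the `(phase, num_steps)` schedule format, and the group size are assumptions, and the actual policy update (clipped surrogate plus KL term) is left to `policy_update`.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def train(policy_sample, policy_update, make_example, reward_fn, phases, group_size=4):
    """For each curriculum phase: generate a pair, sample a group, update."""
    for phase, num_steps in phases:
        for _ in range(num_steps):
            images, gt_boxes = make_example(phase)         # data generation
            outputs = [policy_sample(images) for _ in range(group_size)]
            rewards = [reward_fn(o, gt_boxes) for o in outputs]
            advantages = group_advantages(rewards)         # group-relative feedback
            policy_update(images, outputs, advantages)     # model update
```

When every output in a group earns the same reward, all advantages are zero and the update carries no signal, which is one way to see why starting with the denser single-difference phase helps.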

5. Experimental Evaluation and Benchmark Results

The DiG method yields substantial improvements across a range of fine-grained and general multimodal benchmarks (here presented for Qwen3-VL-4B/8B).

Task/Benchmark	Improvement (4B)	Improvement (8B)
HalBench	+3.8	+3.4
HRB8K	+1.7	+0.5
VSR	+1.4	+0.4
RefCOCO testA@50 IoU	–	up to +3.4
RefCOCO+ val@50	+2.4	+2.4
RefCOCOg val@50	+2.2	+1.4
MMBench	+0.8	+2.2
MMStar	+3.5	–
ScienceQA	+2.3	–

Ablation studies indicate:

Performance increases progressively from DiG-1 (single-difference curriculum) through DiG-2 and DiG-Mix.
Hybrid rewards combining F1 and IoU terms consistently outperform either metric alone.

6. High-Level Insights and Broader Impact

Explicit differential supervision ensures the model's attention is directed toward subtle local changes, addressing known limitations of holistic global-alignment objectives in MLLMs.
Synthetic controllable data mitigates annotation bottlenecks and covers challenging edge cases, including minor attribute changes, under consistent rendering.
Curriculum learning dynamically shapes the reward landscape, enabling stable policy optimization and facilitating acquisition of multi-object compositionality.
DiG training yields domain-general improvements on standard multimodal perception and grounding benchmarks, indicating effective transfer of fine-grained and spatially precise representations.

A plausible implication is that differential grounding serves as a scalable, robust proxy for developing and benchmarking fine-grained visual reasoning skills in future generations of multimodal LLMs (Tao et al., 14 Dec 2025).

References

Tao et al. DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model (14 Dec 2025)
