DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Published 14 Dec 2025 in cs.CV and cs.AI | (2512.12633v2)

Abstract: Multimodal LLMs have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.

Summary

  • The paper introduces DiG, a framework that enhances fine-grained perception by training models to detect and localize differences between paired images, using curriculum learning and reinforcement learning.
  • It employs a Blender-based 3D rendering pipeline to generate high-quality paired images with controlled visual discrepancies for precise spatial reasoning.
  • Experimental results show improved grounding accuracy and reduced hallucination rates, with effective transfer to various multimodal benchmarks.

Summary of "DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal LLM" (2512.12633)

Introduction

This paper introduces Differential Grounding (DiG), a novel framework designed to enhance fine-grained visual perception and precise spatial reasoning in Multimodal LLMs (MLLMs). MLLMs have shown significant progress in vision-language tasks; however, they still struggle with subtle visual cues and detailed spatial relationships. DiG addresses these limitations by formulating a proxy task where the model identifies and localizes all differences between similar image pairs without prior knowledge of their number.

Figure 1: Illustration of Differential Grounding (DiG).

DiG leverages an automated 3D rendering-based data generation pipeline to produce high-quality paired images with controllable visual discrepancies. To tackle the challenge of sparsity in difference signals, the paper employs curriculum learning, progressively increasing the complexity from single to multiple differences. This strategy achieves stable optimization and enables the transfer of fine-grained perception skills to standard downstream tasks.
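The single-to-multiple progression can be pictured as a schedule that caps how many differences a training pair may contain at each step. The following is a minimal sketch under stated assumptions, not the paper's schedule: the function names, the linear staging, and the default cap of 4 differences are all hypothetical and chosen only for illustration.

```python
import random


def max_differences_at(step: int, total_steps: int, max_differences: int = 4) -> int:
    """Upper bound on differences per image pair at a given training step.

    Hypothetical linear staging: training is split into `max_differences`
    equal stages, and stage k (0-based) allows up to k + 1 differences.
    """
    if total_steps <= 0 or max_differences <= 0:
        raise ValueError("total_steps and max_differences must be positive")
    # Clamp the step so progress always lies in [0, 1).
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / total_steps
    return 1 + int(progress * max_differences)


def sample_num_differences(step: int, total_steps: int,
                           rng: random.Random, max_differences: int = 4) -> int:
    """Sample how many differences to render for the next training pair.

    Early steps always return 1 (single-difference scenes); later steps
    mix easy and hard pairs by sampling uniformly from 1..cap.
    """
    cap = max_differences_at(step, total_steps, max_differences)
    return rng.randint(1, cap)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 30, 60, 99):
        cap = max_differences_at(step, 100)
        n = sample_num_differences(step, 100, rng)
        print(f"step {step}: cap={cap}, sampled={n}")
```

Sampling from the full range `1..cap` in later stages, rather than only the hardest setting, keeps easy pairs in the mix so that some rollouts still earn reward; whether the paper does the same is not stated in this summary.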

Methodology

The DiG framework involves the following key components:

  • Data Construction Pipeline: A Blender-based 3D rendering engine automatically generates paired images with precisely controlled visual differences, such as changes in shape, color, and size. This ensures realistic, diverse datasets with accurate ground-truth annotations.

    Figure 2: Overview of the proposed DiG framework.

  • Reinforcement Learning Approach: The model is optimized using Group Relative Policy Optimization (GRPO), incorporating a carefully designed reward function composed of format validity, detection accuracy, and spatial precision. The accuracy reward is calculated using a bipartite matching problem solved by the Hungarian algorithm to assess one-to-one correspondence between predictions and ground truths.
  • Curriculum-Based Training Strategy: A curriculum learning approach addresses reward sparsity by gradually increasing task complexity. The training progresses from single-difference scenarios to complex mixed-difference scenes, ensuring that models develop fine-grained perceptual reasoning capabilities scalably and effectively.

    Figure 3: Ablation on curriculum scheduling in DiG training.
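To make the matching-based accuracy reward concrete, here is a minimal sketch, assuming boxes in `(x1, y1, x2, y2)` format, IoU as the match score, a 0.5 hit threshold, and normalization by the larger of the two box counts. None of these specifics are given in this summary, and all function names are hypothetical. For self-containment the sketch finds the optimal one-to-one assignment by brute force over permutations instead of the Hungarian algorithm; the Hungarian algorithm returns an assignment with the same optimal total score in polynomial time (in practice, for example, via `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_matching(preds, gts):
    """Optimal one-to-one matching that maximizes total IoU.

    Returns (total_iou, pairs) with pairs as (pred_index, gt_index).
    Brute force: fine for a handful of boxes, exponential in general.
    """
    if not preds or not gts:
        return 0.0, []
    # Assign every box on the shorter side to a distinct box on the longer side.
    swap = len(preds) > len(gts)
    small, large = (gts, preds) if swap else (preds, gts)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(large)), len(small)):
        total = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
    return best_total, best_pairs


def accuracy_reward(preds, gts, iou_threshold=0.5):
    """Fraction of matched pairs above the IoU threshold.

    Dividing by max(len(preds), len(gts)) is an assumption of this sketch:
    it penalizes both missed differences and spurious extra boxes.
    """
    if not preds and not gts:
        return 1.0  # assumption: correctly predicting "no differences" is a full hit
    _, pairs = best_matching(preds, gts)
    hits = sum(1 for p, g in pairs if iou(preds[p], gts[g]) >= iou_threshold)
    return hits / max(len(preds), len(gts))


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(20, 20, 30, 30), (0, 0, 10, 10), (50, 50, 60, 60)]
    print(accuracy_reward(preds, gts))
```

Because the model does not know the number of differences in advance, a reward of this shape has to score variable-length predictions; one-to-one matching prevents several predicted boxes from all claiming credit for the same ground-truth difference.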

Experimental Results

The effectiveness of the DiG framework was validated through extensive experiments across various visual perception benchmarks, grounding datasets, and general multimodal reasoning tasks. Notable highlights include:

  • Perception Benchmarks: Models trained with DiG demonstrated significant performance gains, with improvements in sensitivity to fine-grained visual cues and a reduction in hallucination rates. This was evidenced by substantial improvements on benchmarks such as HalBench and VSR.
  • General Multimodal Benchmarks: The learned skills transferred effectively to other multimodal tasks, enhancing overall reasoning and perception capabilities in models. DiG-equipped models outperformed several larger proprietary models, suggesting that the improvements stem from more effective representation learning, not merely increased model size.
  • Grounding Tasks: DiG's impact was significant in referring expression comprehension across datasets like RefCOCO, RefCOCO+, and RefCOCOg, demonstrating improved localization accuracy and spatial reasoning.

    Figure 4: Visualization of differential grounding dynamics.

Conclusion and Implications

DiG represents a scalable approach to enhance fine-grained perceptual capabilities in MLLMs. The framework's ability to train models to focus on difference detection and spatial reasoning offers a promising path toward perceptually aligned multimodal intelligence. This work contributes to the ongoing development of MLLMs by proposing a new paradigm that advances both theoretical understanding and practical applications.

The research suggests that future developments may explore additional proxy tasks and further refinement of reinforcement learning techniques to continue improving the alignment between visual perception and language processing in AI systems.
