InstanceVG: Instance-Aware Generalized Grounding
- InstanceVG is an instance-aware multi-task framework that fuses generalized referring expression comprehension (GREC) with segmentation (GRES) using spatially anchored queries.
- It uses a shared BEiT-3 encoder together with two novel modules, the Attention-based Point-Prior Decoder (APD) and the Point-Guided Instance-aware Perception Head (PIPH), to produce consistent box, mask, and point predictions.
- Experimental results across ten benchmark datasets demonstrate state-of-the-art performance, improving both multi-target detection and precise segmentation.
InstanceVG is an instance-aware multi-task framework for generalized visual grounding, jointly addressing Generalized Referring Expression Comprehension (GREC) and Generalized Referring Expression Segmentation (GRES). It introduces instance queries anchored by spatial priors to unify and streamline predictions of boxes, masks, and points, offering improved capacity for multi-target and non-target scenarios while enforcing consistency across all localization granularities. InstanceVG is distinguished as the first framework to equip generalized visual grounding with explicit instance awareness, achieving state-of-the-art performance across ten datasets and four tasks (Dai et al., 17 Sep 2025).
1. Problem Formulation and Objectives
InstanceVG targets two central tasks within generalized visual grounding. GREC identifies all referential objects in an image at the bounding-box level, while GRES produces fine-grained, pixel-level segmentation of the referred objects.
Given an image and a referring expression, the framework must handle multiple referred instances as well as cases where no referent exists. For each predicted instance it outputs a bounding box and a segmentation mask; multi-target and non-target scenarios are formalized by requiring variable-size output sets for both boxes and masks. The system is evaluated using the multi-target F1-score at the bounding-box level (at a fixed IoU threshold), N-acc (no-target accuracy), mean IoU at the mask level, and class-wise IoU for segmentation.
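The variable-size prediction setting changes how box-level evaluation works compared to classic REC: an F1-style score must match two sets of boxes, and an empty prediction on a no-target sample counts as correct. The sketch below illustrates this with a greedy IoU matcher (the benchmark's exact matching procedure and its IoU threshold are assumptions here, not taken from the paper):

```python
def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grec_f1(pred_boxes, gt_boxes, iou_thr=0.5):
    """Set-to-set F1 for one sample, with variable-size sets on both sides.

    An empty prediction on a no-target sample scores 1.0; any prediction on
    a no-target sample (or vice versa) scores 0.0.
    """
    if not gt_boxes and not pred_boxes:   # correct no-target prediction
        return 1.0
    matched, used = 0, set()
    for p in pred_boxes:                  # greedy one-to-one matching
        for j, g in enumerate(gt_boxes):
            if j not in used and box_iou(p, g) >= iou_thr:
                matched += 1
                used.add(j)
                break
    prec = matched / len(pred_boxes) if pred_boxes else 0.0
    rec = matched / len(gt_boxes) if gt_boxes else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
```

N-acc is then simply the fraction of no-target samples on which the model predicts an empty set.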
2. Architecture: Modalities and Instance Queries
InstanceVG uses a shared BEiT-3 encoder to fuse visual (image patches) and textual (expression tokens) modalities. The architecture branches after encoding:
- Global-Semantic Branch: Employs SimFPN followed by a UNet-style decoder to predict a coarse semantic mask and perform existence classification for non-target detection.
- Instance Branch: Utilizes an Attention-based Point-Prior Decoder (APD) that generates instance queries anchored to specific image locations.
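The two-branch layout after the shared encoder can be illustrated at the level of tensor shapes. This is a toy numpy stand-in: the dimensions, the random projections, and the single attention step are all placeholders for the learned BEiT-3 encoder, SimFPN/UNet decoder, and APD, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: hidden dim, image patches (14x14), text tokens, queries.
D, N_IMG, N_TXT, N_Q = 32, 196, 12, 10

# Shared multimodal encoding: image patches and text tokens from one encoder.
img_tokens = rng.standard_normal((N_IMG, D))
txt_tokens = rng.standard_normal((N_TXT, D))

# Global-Semantic Branch: coarse mask logits over the patch grid,
# plus a single existence logit for non-target detection.
global_mask_logits = img_tokens @ rng.standard_normal(D)          # (196,)
exists_logit = txt_tokens.mean(axis=0) @ rng.standard_normal(D)   # scalar

# Instance Branch: N_Q instance queries refined against image features
# (one dot-product attention step stands in for the deformable decoder).
queries = rng.standard_normal((N_Q, D))
decoded = queries + (queries @ img_tokens.T) @ img_tokens / N_IMG  # (N_Q, D)
```

The point is the data flow: both branches consume the same fused features, and the per-query outputs of the instance branch are what PIPH later turns into boxes, masks, and points.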
Attention-based Point-Prior Decoder (APD)
APD comprises two sub-modules:
- Instance Query Generator (IQG): Selects the top-k text tokens (by L2 norm) and localizes spatial anchors for them via cross-attention with image features. Dynamic point selection ensures queries are spatially diverse and aligned with high-response regions.
- Point-Prior Multiscale Deformable Decoder: Adopts deformable DETR-style attention, replacing the fixed grid queries with dynamic spatial points for each instance.
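The IQG step above can be sketched as follows. This is a minimal numpy version under assumed shapes: tokens are selected by L2 norm, and each anchor point is taken as the attention-weighted expected 2-D location over the patch grid (the paper's actual anchor extraction may differ):

```python
import numpy as np

def instance_query_anchors(txt_tokens, img_tokens, grid_hw, k=4):
    """Sketch of the Instance Query Generator (IQG).

    txt_tokens: (N_txt, D) text token features
    img_tokens: (h*w, D) image patch features, row-major over grid_hw
    Returns the indices of the top-k text tokens and a normalized
    anchor point in [0, 1]^2 for each.
    """
    norms = np.linalg.norm(txt_tokens, axis=-1)            # (N_txt,)
    top_idx = np.argsort(norms)[-k:]                       # strongest text tokens
    q = txt_tokens[top_idx]                                # (k, D) query tokens
    # Cross-attention of text queries over image patches (softmax over patches).
    attn = q @ img_tokens.T / np.sqrt(q.shape[-1])         # (k, h*w)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    h, w = grid_hw
    ys, xs = np.divmod(np.arange(h * w), w)                # row, col per patch
    # Expected 2-D location under the attention map = spatial anchor.
    points = np.stack([(attn * xs).sum(-1) / (w - 1),
                       (attn * ys).sum(-1) / (h - 1)], axis=-1)
    return top_idx, points
```

These anchors then replace the fixed grid reference points in the deformable-attention decoder, so each query attends around its own image location.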
Point-Guided Instance-aware Perception Head (PIPH)
PIPH processes decoded instance queries and the global semantic mask, producing:
- Per-query response masks via sigmoid activation,
- Foreground classification scores,
- Bounding box regressions,
- Instance segmentation masks.
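The four PIPH outputs can be sketched per query as below. The random projections stand in for learned layers, and the gating of instance masks by the global semantic features is an assumption about how the two branches interact, not the paper's exact formulation:

```python
import numpy as np

def piph_head(decoded_queries, global_mask_feat, rng):
    """Sketch of the Point-Guided Instance-aware Perception Head (PIPH).

    decoded_queries: (n_q, d) instance queries from the APD
    global_mask_feat: (n_pix, d) per-pixel features from the global branch
    """
    n_q, d = decoded_queries.shape
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Per-query response mask: query-conditioned dot product with pixel features.
    resp = sigmoid(decoded_queries @ global_mask_feat.T)           # (n_q, n_pix)
    # Foreground classification score per query.
    score = sigmoid(decoded_queries @ rng.standard_normal(d))      # (n_q,)
    # Box regression in normalized (cx, cy, w, h).
    box = sigmoid(decoded_queries @ rng.standard_normal((d, 4)))   # (n_q, 4)
    # Instance mask: response gated by the global semantic mask (assumed).
    inst_mask = resp * sigmoid(global_mask_feat.mean(axis=-1))     # (n_q, n_pix)
    return resp, score, box, inst_mask
```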
3. Matching and Consistency Strategies
InstanceVG ensures that each predicted instance query consistently aligns its point, box, and mask. This is achieved through:
- Point-Guided Matching: Implements one-to-one Hungarian matching between predictions and ground-truth instances using a composite cost that combines point-distance, classification, box-regression, and GIoU terms, each with its own default weight.
- Query-Mask Alignment: Once instance queries are matched to ground-truth boxes, their assignment propagates to masks, enforcing consistency between all localization outputs (point, box, and mask).
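The matching step can be sketched as follows. For clarity the exact one-to-one assignment is done by brute force over permutations (the paper uses the Hungarian algorithm, for which `scipy.optimize.linear_sum_assignment` is the usual implementation), and the composite cost is reduced to two illustrative terms with assumed weights:

```python
import itertools
import numpy as np

def one_to_one_match(cost):
    """Exact minimum-cost one-to-one assignment by brute force.

    cost: (n_pred, n_gt) matrix, n_gt <= n_pred. Fine for small n;
    real implementations use the Hungarian algorithm.
    Returns a list of (pred_idx, gt_idx) pairs.
    """
    n_pred, n_gt = cost.shape
    best, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[p, g] for g, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return [(p, g) for g, p in enumerate(best_perm)]

def composite_cost(pred_pts, pred_scores, gt_pts, w_point=1.0, w_cls=1.0):
    """Illustrative cost: point L2 distance + (1 - foreground score).

    The full cost also carries box L1 and GIoU terms; the weights here
    are placeholders, not the paper's defaults.
    """
    dist = np.linalg.norm(pred_pts[:, None] - gt_pts[None], axis=-1)
    return w_point * dist + w_cls * (1.0 - pred_scores)[:, None]
```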
4. Training Protocol and Loss Functions
The total loss is a weighted sum comprising:
- Detection loss: standard detection objective for matched instances, including classification, box regression, and GIoU terms.
- Global segmentation loss: global semantic segmentation using binary cross-entropy and Dice loss.
- Instance mask loss: instance-level mask supervision, with positive (matched) and negative (unmatched or background) queries.
- Existence loss: binary cross-entropy loss for referent existence.
Each loss term carries a fixed weight. Training uses the Adam optimizer, with separate learning rates for the encoder and the remaining modules, and scheduled decay.
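The loss composition can be sketched directly. The uniform weights below are placeholders (the paper's values are not reproduced here), and the BCE + Dice term matches the global segmentation loss described above:

```python
import numpy as np

def total_loss(l_det, l_sem, l_inst, l_exist,
               w_det=1.0, w_sem=1.0, w_inst=1.0, w_exist=1.0):
    """Weighted sum of the four loss terms; uniform weights are a placeholder."""
    return w_det * l_det + w_sem * l_sem + w_inst * l_inst + w_exist * l_exist

def bce_dice(pred, target, eps=1e-6):
    """Global semantic segmentation loss: binary cross-entropy + Dice.

    pred: predicted probabilities in (0, 1); target: binary mask.
    """
    pred = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    dice = 1 - (2 * (pred * target).sum() + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```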
5. Multi-task Joint Learning Strategy
Joint optimization of GREC and GRES tasks is facilitated by shared encoding and feature extraction. Branches for global segmentation and instance-aware prediction are trained in parallel. The APD and PIPH enforce structural consistency, aligning spatial points, boxes, and masks per instance. This approach yields improvements in both multi-target detection and segmentation granularity, streamlining the overall process compared to independent task training.
A plausible implication is that this joint architecture resolves the inconsistencies of earlier pipelines and improves efficiency in real-world referential understanding.
6. Experimental Evaluation and Benchmarks
InstanceVG is evaluated on ten datasets spanning REC/RES (RefCOCO, RefCOCO+, RefCOCOg), GREC (gRefCOCO), and GRES/region-aware (gRefCOCO, Ref-ZOM, R-RefCOCO+). Key metrics reported include:
- On RefCOCO/+/g-REC (box detection), InstanceVG (ViT-L) achieves up to 96.04%/92.89%/90.62% vs SimVG-L's 94.70%/91.64%/89.15%.
- On RefCOCO/+/g-RES (mask segmentation), mIoU 87.12/84.33/82.27 vs prior best 86.64/83.74/81.32.
- On gRefCOCO-GRES, InstanceVG-B improves over CoHD by +4.9%, +2.5%, +3.1% in gIoU across splits.
- On Ref-ZOM: oIoU 71.12% (vs 68.99%), mIoU 71.52% (vs 69.81%), Acc 97.42% (vs 93.34%).
- On R-RefCOCO(+/g): rIoU up to 62.41/59.13/54.36 (vs 62.34/59.04/55.09).
- On gRefCOCO-GREC: F1 73.5/70.2/60.8 vs PropVG's 72.2/68.8/59.0.
Ablation studies confirm the importance of multi-task joint learning (+1.2% F1, +2.3% gIoU), APD (+2.0% F1, +1.9% gIoU), and PIPH (+3.0% F1, +2.3% gIoU). The number of instance queries trades off coverage of multi-target cases against matching ambiguity.
Qualitative findings demonstrate consistent spatial anchors for points, boxes, and masks per instance, robust multi-target and non-target handling, and fine-grained segmentation within globally consistent masks.
7. Context and Significance
InstanceVG fundamentally advances generalized visual grounding by transitioning from semantic to instance-aware perception, which previous frameworks typically neglected. The incorporation of point-based instance queries and joint box-mask prediction resolves inconsistencies between detection and segmentation granularity. The joint learning paradigm results in improved prediction alignment and efficiency, directly impacting tasks that require robust multi-target and zero-target handling. State-of-the-art results on diverse and challenging benchmarks highlight the practical gains achieved.
This suggests that future generalized grounding methods may integrate multi-modal, multi-granular supervision and spatial anchors to further boost consistency and performance, especially in settings with ambiguous or complex referential language.