InstanceVG: Instance-Aware Generalized Grounding

Updated 29 January 2026
  • InstanceVG is an instance-aware multi-task framework that fuses generalized referring expression comprehension (GREC) with segmentation (GRES) using spatially anchored queries.
  • It uses a shared BEiT-3 encoder together with two novel modules, the Attention-based Point-Prior Decoder (APD) and the Point-Guided Instance-aware Perception Head (PIPH), to produce consistent box, mask, and point predictions.
  • Experimental results across ten benchmark datasets demonstrate state-of-the-art performance, improving both multi-target detection and precise segmentation.

InstanceVG is an instance-aware multi-task framework for generalized visual grounding, jointly addressing Generalized Referring Expression Comprehension (GREC) and Generalized Referring Expression Segmentation (GRES). It introduces instance queries anchored by spatial priors to unify and streamline predictions of boxes, masks, and points, offering improved capacity for multi-target and non-target scenarios as well as enforcing consistency between all localization granularities. InstanceVG is distinguished as the first framework to equip generalized visual grounding with explicit instance awareness, achieving state-of-the-art performance across ten datasets and four tasks (Dai et al., 17 Sep 2025).

1. Problem Formulation and Objectives

InstanceVG targets two central tasks within generalized visual grounding: GREC, which identifies all referential objects in an image at the bounding-box level, and GRES, which produces fine-grained, pixel-level segmentation of the referred objects.

Given an image $I \in \mathbb{R}^{H \times W \times 3}$ and a referring expression $T = [t_1, \dots, t_{N_t}]$, the framework handles a variable number of referents $K \geq 0$, covering multiple instances as well as cases where no referent exists. The outputs for each instance $j$ include a bounding box $b_j = (x_j, y_j, w_j, h_j)$ and a mask $m_j \in \{0,1\}^{H \times W}$. Multi-target and non-target scenarios are formalized by requiring variable-size output sets for both boxes and masks. The system is evaluated using the multi-tag F1-score at the bounding-box level (IoU $\geq 0.5$), N-acc (no-target accuracy), mean IoU at the mask level, and class-wise IoU for segmentation.
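The box-level metric can be sketched as follows. This is a hedged simplification: it uses greedy rather than optimal matching, and the benchmark's exact no-target scoring protocol may differ; here an empty prediction set against an empty ground-truth set counts as a perfect score.

```python
def box_iou(a, b):
    """IoU between two boxes in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def multi_tag_f1(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily match predictions to ground truth at IoU >= iou_thresh,
    then report F1 over the image. A correctly predicted no-target case
    (both sets empty) scores 1.0."""
    if not pred_boxes and not gt_boxes:
        return 1.0
    matched_gt = set()
    tp = 0
    for p in pred_boxes:
        best, best_j = 0.0, -1
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = box_iou(p, g)
            if iou > best:
                best, best_j = iou, j
        if best >= iou_thresh:
            matched_gt.add(best_j)
            tp += 1
    prec = tp / len(pred_boxes) if pred_boxes else 0.0
    rec = tp / len(gt_boxes) if gt_boxes else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
```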

2. Architecture: Modalities and Instance Queries

InstanceVG uses a shared BEiT-3 encoder to fuse visual (image patches) and textual (expression tokens) modalities. The architecture branches after encoding:

  • Global-Semantic Branch: Employs SimFPN followed by a UNet-style decoder to predict a coarse semantic mask $\hat{S}_{global}$ and perform existence classification for non-target detection.
  • Instance Branch: Utilizes an Attention-based Point-Prior Decoder (APD) that generates instance queries anchored to specific image locations.

Attention-based Point-Prior Decoder (APD)

APD comprises two sub-modules:

  • Instance Query Generator (IQG): Selects the top-$N_q$ text tokens (by L2 norm) and localizes spatial anchors via cross-attention with image features. Dynamic point selection ensures queries are spatially diverse and aligned with high-response regions.
  • Point-Prior Multiscale Deformable Decoder: Adopts deformable DETR-style attention, replacing fixed grid queries with dynamic spatial points $P_r[i]$ for each instance.
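The IQG step can be sketched as below. The function name `instance_query_anchors` is hypothetical, and this is a deliberate simplification: the real APD uses learned softmax cross-attention and a dynamic point-selection rule that enforces spatial diversity among anchors, both of which are omitted here in favor of a plain dot-product argmax.

```python
import numpy as np

def instance_query_anchors(text_feats, image_feats, num_queries=10):
    """Pick the num_queries text tokens with the largest L2 norm, then anchor
    each one at the image location with the highest cross-attention response.

    text_feats:  (N_t, D) token embeddings
    image_feats: (H*W, D) flattened patch embeddings
    Returns query features (num_queries, D) and flat anchor indices (num_queries,).
    """
    norms = np.linalg.norm(text_feats, axis=1)
    top = np.argsort(-norms)[:num_queries]   # strongest text tokens
    queries = text_feats[top]                # (Nq, D)
    attn = queries @ image_feats.T           # (Nq, H*W) similarity map
    anchors = attn.argmax(axis=1)            # peak-response location per query
    return queries, anchors
```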

Point-Guided Instance-aware Perception Head (PIPH)

PIPH processes decoded instance queries and the global semantic mask, producing:

  • Per-query response masks via sigmoid activation,
  • Foreground classification scores,
  • Bounding box regressions $(x, y, w, h)$,
  • Instance segmentation masks.
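A shape-level sketch of a PIPH-style head is given below; random weights stand in for the learned projections, so this only illustrates how each decoded query yields the four aligned outputs, not the head's actual parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def piph_forward(queries, global_mask_feats, seed=0):
    """Per-query outputs of a PIPH-like head (illustrative only).

    queries:           (Nq, D) decoded instance queries
    global_mask_feats: (D, H, W) per-pixel features from the global branch
    """
    rng = np.random.default_rng(seed)
    nq, d = queries.shape
    _, h, w = global_mask_feats.shape
    w_cls = rng.normal(size=(d, 1))          # stand-in classification head
    w_box = rng.normal(size=(d, 4))          # stand-in box-regression head
    flat = global_mask_feats.reshape(d, h * w)

    response = sigmoid(queries @ flat).reshape(nq, h, w)  # per-query response masks
    scores = sigmoid(queries @ w_cls).squeeze(-1)         # foreground scores
    boxes = sigmoid(queries @ w_box)                      # (x, y, w, h) in [0, 1]
    masks = (response > 0.5).astype(np.uint8)             # instance segmentation masks
    return response, scores, boxes, masks
```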

3. Matching and Consistency Strategies

InstanceVG ensures that each predicted instance query consistently aligns its point, box, and mask. This is achieved through:

  • Point-Guided Matching: Implements one-to-one Hungarian matching between predictions and ground-truth instances using a composite cost matrix:

$$C_{ij} = \lambda_{cls}\,\mathrm{CE}(p_i^{cls}, y_j) + \lambda_{box}\,L_1(p_i^{box}, b_j) + \lambda_{giou}\,\mathrm{GIoU}(p_i^{box}, b_j) + \lambda_{point}\,L_1(P_r[i], \mathrm{center}(b_j))$$

Default weights: $\lambda_{cls} = 1.0$, $\lambda_{box} = 5.0$, $\lambda_{giou} = 2.0$, $\lambda_{point} = 2.0$.

  • Query-Mask Alignment: Once instance queries are matched to ground-truth boxes, their assignment propagates to masks, enforcing consistency between all localization outputs (point, box, and mask).
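The matching step can be sketched as follows. Two conventions here are assumptions: the CE term is approximated by $-\log p$ for the positive class, and the GIoU term enters the cost as $(1 - \mathrm{GIoU})$, following the usual DETR convention. Brute-force search over permutations stands in for the Hungarian algorithm, which is fine for small $N_q$; boxes are taken in $(c_x, c_y, w, h)$ format.

```python
import numpy as np
from itertools import permutations

def box_giou(a, b):
    """Generalized IoU for boxes in (cx, cy, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    iou = inter / union if union > 0 else 0.0
    c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c - union) / c if c > 0 else iou

def match(p_cls, p_box, p_point, gt_box,
          l_cls=1.0, l_box=5.0, l_giou=2.0, l_point=2.0):
    """One-to-one assignment minimising the composite cost (requires Nq >= K).
    Returns (query, gt) index pairs and the full cost matrix."""
    nq, k = len(p_cls), len(gt_box)
    C = np.zeros((nq, k))
    for i in range(nq):
        for j in range(k):
            centre = gt_box[j][:2]                            # GT box centre (cx, cy)
            C[i, j] = (l_cls * -np.log(p_cls[i] + 1e-8)       # CE for the positive class
                       + l_box * np.abs(p_box[i] - gt_box[j]).sum()
                       + l_giou * (1.0 - box_giou(p_box[i], gt_box[j]))
                       + l_point * np.abs(p_point[i] - centre).sum())
    best = min(permutations(range(nq), k),
               key=lambda perm: sum(C[perm[j], j] for j in range(k)))
    return [(best[j], j) for j in range(k)], C
```

In practice one would replace the permutation search with `scipy.optimize.linear_sum_assignment` over the same cost matrix.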

4. Training Protocol and Loss Functions

The total loss is a weighted sum: $$L_{total} = \lambda_{detr} L_{detr} + \lambda_{seg} L_{seg_{global}} + \lambda_{instance} L_{ins_{seg}} + \lambda_{exist} L_{exist}$$

  • $L_{detr}$ (detection loss): Standard loss over matched instances, comprising classification, box-regression, and GIoU terms.
  • $L_{seg_{global}}$: Global semantic segmentation loss using binary cross-entropy and Dice loss.
  • $L_{ins_{seg}}$: Instance-level mask supervision, with positive (matched) and negative (unmatched or background) queries.
  • $L_{exist}$: Binary cross-entropy loss for referent existence.

Loss weights are set as $\lambda_{detr} = 0.1$, $\lambda_{seg} = 1.0$, $\lambda_{instance} = 1.0$, $\lambda_{exist} = 0.2$. Training uses the Adam optimizer with learning rates of $5 \times 10^{-5}$ for the encoder and $5 \times 10^{-4}$ for the other modules, with scheduled decay.
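The global segmentation term can be sketched as a BCE-plus-Dice loss; note that the 1:1 balance between the two terms is an assumption, as the source does not state the internal weighting.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-6):
    """Global segmentation loss: binary cross-entropy plus Dice, weighted 1:1.

    pred:   (H, W) predicted probabilities in (0, 1)
    target: (H, W) binary ground-truth mask
    """
    bce = -np.mean(target * np.log(pred + eps)
                   + (1 - target) * np.log(1 - pred + eps))
    inter = (pred * target).sum()
    dice = 1.0 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```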

5. Multi-task Joint Learning Strategy

Joint optimization of GREC and GRES tasks is facilitated by shared encoding and feature extraction. Branches for global segmentation and instance-aware prediction are trained in parallel. The APD and PIPH enforce structural consistency, aligning spatial points, boxes, and masks per instance. This approach yields improvements in both multi-target detection and segmentation granularity, streamlining the overall process compared to independent task training.

A plausible implication is that this architectural coupling resolves the inconsistencies that arise when the tasks are trained independently and improves efficiency in real-world referential understanding.

6. Experimental Evaluation and Benchmarks

InstanceVG is evaluated on ten datasets spanning REC/RES (RefCOCO, RefCOCO+, RefCOCOg), GREC (gRefCOCO), and GRES/region-aware (gRefCOCO, Ref-ZOM, R-RefCOCO+). Key metrics reported include:

  • On RefCOCO/+/g-REC (box detection), InstanceVG (ViT-L) achieves up to 96.04%/92.89%/90.62% vs SimVG-L's 94.70%/91.64%/89.15%.
  • On RefCOCO/+/g-RES (mask segmentation), mIoU 87.12/84.33/82.27 vs prior best 86.64/83.74/81.32.
  • On gRefCOCO-GRES, InstanceVG-B improves over CoHD by +4.9%, +2.5%, +3.1% in gIoU across splits.
  • On Ref-ZOM: oIoU 71.12% (vs 68.99%), mIoU 71.52% (vs 69.81%), Acc 97.42% (vs 93.34%).
  • On R-RefCOCO(+/g): rIoU up to 62.41/59.13/54.36 (vs 62.34/59.04/55.09).
  • On gRefCOCO-GREC: F1 73.5/70.2/60.8 vs PropVG's 72.2/68.8/59.0.

Ablation studies confirm the importance of multi-task joint learning (+1.2% F1, +2.3% gIoU), APD (+2.0% F1, +1.9% gIoU), and PIPH (+3.0% F1, +2.3% gIoU). Setting the number of queries to $N_q = 10$ balances coverage and ambiguity.

Qualitative findings demonstrate consistent spatial anchors for points, boxes, and masks per instance, robust multi-target and non-target handling, and fine-grained segmentation within globally consistent masks.

7. Context and Significance

InstanceVG fundamentally advances generalized visual grounding by transitioning from semantic to instance-aware perception, which previous frameworks typically neglected. The incorporation of point-based instance queries and joint box-mask prediction resolves inconsistencies between detection and segmentation granularity. The joint learning paradigm results in improved prediction alignment and efficiency, directly impacting tasks that require robust multi-target and zero-target handling. State-of-the-art results on diverse and challenging benchmarks highlight the practical gains achieved.

This suggests that future generalized grounding methods may integrate multi-modal, multi-granular supervision and spatial anchors to further boost consistency and performance, especially in settings with ambiguous or complex referential language.
