
UniRES++: Unified Multimodal RES Model

Updated 2 April 2026
  • The paper introduces a unified end-to-end pipeline for multi-granularity segmentation that leverages shared parameters and targeted feature exploration to achieve state-of-the-art results.
  • UniRES++ integrates dual-stream inputs and cascaded cross-attention to combine coarse global context with fine-grained part features, improving mIoU and cIoU over prior models.
  • Ablation studies indicate that segmentation accuracy is highest when object and part token flows, high-resolution imagery, and large-scale training data are combined.

UniRES++ is a unified multimodal large language model (MLLM) for referring expression segmentation (RES) across all levels of visual target granularity, including multi-object, single-object, and part-level segmentation. In contrast to prior approaches that treat different granularities as separate tasks, UniRES++ implements a single end-to-end pipeline for both object- and part-level visual grounding, using shared parameters and multi-granularity routing. To address practical vision-language grounding scenarios, it combines a new architecture, large-scale datasets annotated at multiple levels, and targeted feature exploitation mechanisms to achieve state-of-the-art (SOTA) performance on established and new RES benchmarks (Liu et al., 2 Apr 2025).

1. Unified Model Architecture

UniRES++ ingests an image $I$ and a language expression $S$, producing a pixel-wise mask at the level of granularity (object or part) specified by the referring expression. The architecture integrates vision and language via the following components:

  • Dual-stream Image Input: A low-resolution image $I_l$ (336×336) encodes global scene information, while a high-resolution image $I_h$ (1024×1024) captures fine visual details.
  • Multi-Granularity Vision Flow (MGVF):
    • $E_l$ (CLIP ViT-L/14) encodes $I_l$ to yield coarse global tokens $F_l \in \mathbb{R}^{N_l \times C}$.
    • $E_h$ (ConvNeXt-L) encodes $I_h$ to produce $F_h \in \mathbb{R}^{N_h \times C}$.
    • Open-domain detectors propose object boxes SS0 and part boxes SS1 on SS2.
    • ROIAlign extracts object tokens SS3 and part tokens SS4 from SS5.
    • Context is propagated by:

    SS6

  • Grounding Encoder: A frozen SAM encoder outputs patch tokens SS7.

  • Vision–Text Projection: Concatenate SS8 and project via SS9 to IlI_l0 for LLM ingestion.

  • LLM Backbone: Vicuna-7B (with LoRA-8) processes IlI_l1 and the text sequence IlI_l2, autoregressively emitting tokens until a dedicated [SEG] token.

  • Multi-Granularity Feature Exploitation (MGFE):

    • [SEG] decouples into [SEG_OBJECT] or [SEG_PART] via classifier IlI_l3.
    • Routing directs computation to object or part pathways using gate IlI_l4.
    • Re-weighting uses cross-attention: IlI_l5.
  • Pixel Decoder: Combines [SEG] embedding and IlI_l6 to predict a segmentation mask.
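
As a rough check on the token budget implied by the dual-stream input, the sketch below computes the size of each patch grid. The low-resolution count follows directly from CLIP ViT-L/14 using 14-pixel patches on a 336×336 image. The stride of 32 for the high-resolution stream is an assumption (the final-stage stride of a standard ConvNeXt); the source does not state which encoder stage supplies the high-resolution tokens. The function name `grid_tokens` is illustrative.

```python
def grid_tokens(image_size: int, stride: int) -> int:
    """Number of spatial tokens for a square image at a given patch size / stride."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return side * side


# Low-resolution stream: CLIP ViT-L/14 on a 336x336 image (14-pixel patches).
n_l = grid_tokens(336, 14)

# High-resolution stream: 1024x1024 image. The stride of 32 is an ASSUMPTION,
# not a value given in the source.
n_h = grid_tokens(1024, 32)

print(n_l, n_h)
```

Under these assumptions both streams stay within a few thousand tokens in total, even though the high-resolution image has roughly nine times as many pixels as the low-resolution one.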

2. Multi-Granularity Pathways and Routing

The architecture enables explicit granularity selection through [SEG_OBJECT] and [SEG_PART] tokens in the LLM vocabulary. During inference, the model predicts the appropriate granularity by routing via the gating variable IlI_l7, which applies either object features (IlI_l8) or part features (IlI_l9):

  • If IhI_h0: object-level pathway is chosen.
  • If IhI_h1: part-level features dominate.
  • Routing and re-weighting allow unified inference for one object, multiple objects, or part of an object, without separate specialist models.

This design allows the system to handle any described target granularity with a single set of model parameters and joint training objectives.
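
The routing idea can be illustrated with a minimal NumPy sketch. It assumes a hard gate: a linear classifier over the [SEG] embedding picks the object or part pathway by argmax. The source does not specify whether the gate is hard or soft, and the names `route_features` and `w_cls` are hypothetical, so this is only one plausible reading.

```python
import numpy as np


def route_features(seg_embedding, w_cls, f_obj, f_part):
    """Choose the object or part pathway from a [SEG] embedding.

    ASSUMPTION: a hard argmax gate over a 2-way linear classifier.
    """
    logits = seg_embedding @ w_cls              # shape (2,): [object, part]
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs = probs / probs.sum()
    gate = int(np.argmax(probs))                # 0 -> object, 1 -> part
    token = "[SEG_OBJECT]" if gate == 0 else "[SEG_PART]"
    selected = f_obj if gate == 0 else f_part
    return token, gate, selected, probs


# Toy inputs: a 2-dim [SEG] embedding and a classifier that favours "object".
seg = np.array([1.0, 0.0])
w_cls = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
f_obj = np.ones((3, 4))      # 3 object tokens, 4 channels
f_part = np.zeros((5, 4))    # 5 part tokens, 4 channels

token, gate, selected, probs = route_features(seg, w_cls, f_obj, f_part)
print(token, selected.shape)
```

A soft variant would instead weight the two feature sets by `probs`, which keeps the routing differentiable; the hard gate shown here is simply the easiest form to read.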

3. Formulations and Training Objective

Key mathematical formulations in UniRES++ include:

  • Multi-Granularity Vision Flow:

IhI_h2

IhI_h3

IhI_h4

  • Projection and Routing:

IhI_h5

IhI_h6

IhI_h7

  • Training Loss:

IhI_h8

where:

IhI_h9

ElE_l0

ElE_l1

ElE_l2

4. Fine-Grained Visual Feature Exploration

UniRES++ employs targeted strategies for fine-grained feature capture:

  • Two-scale Encoding: Uses 336×336 resolution for global scene understanding and 1024×1024 for high-detail part-level grounding.
  • Open-Domain Box Proposals: Object- and part-level boxes allow region-specific feature extraction.
  • ROIAlign and Cascaded Cross-Attention: Token extraction from high-res features, followed by sequential context propagation (image → object → part) for progressively finer localization.
  • Vision–Language Self- and Cross-Attention: The LLM fuses vision tokens and text context, allowing dynamic focus on appropriate detail levels per query.

This approach facilitates hierarchical feature aggregation, with the LLM dynamically determining the requisite granularity during inference.
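
To make the image → object → part cascade concrete, here is a minimal NumPy sketch of two chained cross-attention steps. It is deliberately simplified and should not be read as the paper's exact formulation: it uses a single head, no learned projections, and a plain residual connection, all of which are assumptions. The token counts and channel width are arbitrary illustrative values.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention with a residual add.

    ASSUMPTION: no learned query/key/value projections, for brevity.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (n_q, n_ctx)
    return queries + weights @ context                   # (n_q, d)


rng = np.random.default_rng(0)
C = 16                                   # illustrative channel width
f_img = rng.normal(size=(576, C))        # coarse global image tokens
f_obj = rng.normal(size=(5, C))          # object tokens (e.g. from ROIAlign)
f_part = rng.normal(size=(12, C))        # part tokens (e.g. from ROIAlign)

# Stage 1: object tokens gather global scene context.
f_obj_ctx = cross_attention(f_obj, f_img)
# Stage 2: part tokens gather context from the already-contextualised objects.
f_part_ctx = cross_attention(f_part, f_obj_ctx)

print(f_obj_ctx.shape, f_part_ctx.shape)
```

The ordering matters: because part tokens attend to object tokens that have already absorbed image-level context, scene information reaches the parts indirectly through the objects.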

5. Training Regime and Data

Training follows a two-stage process:

  1. Pre-training: Initial grounding on the GranD dataset (introduced with GLaMM, the “Pixel Grounding Large Multimodal Model”) for basic skill acquisition.
  2. Fine-Tuning: Multi-task supervision using:
    • Classic RES datasets: RefCOCO, RefCOCO+, RefCOCOg
    • Generalized RES: gRefCOCO
    • Multi-granularity RES: MRES-32M (32.2M masks and captions across 1M images)
    • Region captioning datasets: Visual Genome, RefCOCO, etc.

Key training hyperparameters:

  • Vision encoders (CLIP ViT-L/14, ConvNeXt-L, SAM) are frozen.
  • LoRA-8 tuning applied to the LLM backbone (Vicuna-7B).
  • Batch size: 256; AdamW optimizer; initial learning rate ElE_l3; linear warm-up for 100 steps; cosine decay.
  • Loss weights: ElE_l4, ElE_l5, ElE_l6, ElE_l7.
  • Each epoch: ~2000 steps, trained over 3 epochs.
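
The learning-rate schedule described above (linear warm-up for 100 steps, then cosine decay) can be sketched as follows. The total of 6000 steps is derived from the stated 3 epochs of roughly 2000 steps each. `base_lr` is a hypothetical parameter, since the peak learning rate is not legible in this text, and the exact warm-up starting value is also an assumption.

```python
import math


def lr_at_step(step, base_lr, warmup_steps=100, total_steps=6000):
    """Linear warm-up followed by cosine decay to zero.

    ASSUMPTIONS: warm-up ramps from base_lr / warmup_steps up to base_lr,
    and the cosine phase decays all the way to 0.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# With base_lr = 1.0 the returned values are multipliers of the peak rate.
for s in (0, 99, 100, 3050, 6000):
    print(s, round(lr_at_step(s, base_lr=1.0), 4))
```

Using `base_lr=1.0` shows the shape of the schedule as a fraction of the peak rate without committing to a specific value.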

At inference time, the model receives an image and a text expression, decodes autoregressively until it emits [SEG], determines the target granularity, and outputs the segmentation mask.

6. Empirical Performance

UniRES++ achieves SOTA results across classic, generalized, and multi-granularity RES benchmarks:

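| Dataset | Metric | UniRES++ | Prior Best |
|---|---|---|---|
| RefCOCOm (MRES) | Val mIoU (obj+part) | 40.8% | 34.3% (“UniRES”) |
| RefCOCOm (MRES) | Val mIoU (part) | 27.7% | 19.6% |
| gRefCOCO (GRES) | Val cIoU | 69.9% | 68.7% (GLaMM) |
| RefCOCO (classic RES) | Val oIoU | 80.2% | 77.4% |
| RefCOCO (classic RES) | Val mIoU | 80.8% | 74.5% |
| RefCOCO+ (classic RES) | Val | 71.6%/73.6% | |
| RefCOCOg (classic RES) | Val | 73.8%/74.4% | |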

Performance gains are attributed to hierarchical token flows, unified routing, large-scale part-level data (MRES-32M), and synergistic multi-task joint training. The model outperforms GLaMM and GSVA-FT on all major metrics (Liu et al., 2 Apr 2025).
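
Since the results mix mIoU and cIoU, a short sketch of how the two differ may help. As these metrics are commonly defined in the RES literature, mIoU averages the per-sample IoU, while cIoU divides the cumulative intersection by the cumulative union over all samples; oIoU is typically computed the same cumulative way. Scoring an empty prediction against an empty target as 1.0 is an assumption made here for simplicity; benchmarks with no-target samples may handle that case differently.

```python
import numpy as np


def iou_metrics(preds, targets):
    """Return (mIoU, cIoU) for paired lists of binary masks."""
    per_sample = []
    total_inter = 0
    total_union = 0
    for p, t in zip(preds, targets):
        p = np.asarray(p, dtype=bool)
        t = np.asarray(t, dtype=bool)
        inter = int(np.logical_and(p, t).sum())
        union = int(np.logical_or(p, t).sum())
        # ASSUMPTION: empty prediction vs. empty target counts as a perfect match.
        per_sample.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    miou = float(np.mean(per_sample))
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return miou, ciou


# Two toy samples: one half-correct small mask, one perfect large mask.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]

miou, ciou = iou_metrics(preds, targets)
print(miou, ciou)
```

In this toy case the per-sample IoUs are 0.5 and 1.0, so mIoU is 0.75, while cIoU is 5/6 because the larger mask contributes more pixels. This is why cIoU tends to favour large targets and mIoU is more sensitive to small ones such as parts.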

7. Ablation Results and Qualitative Analysis

Systematic ablation studies confirm that multi-granularity token pathways and fine-grained region features are both critical:

  • MGVF (granularity ablation): Sequentially adding object- and part-level tokens improves mIoU and cIoU; best performance with both, not either alone.
  • High-Resolution Impact: A 1024×1024 high-resolution input yields the best segmentation performance; deviating from this resolution reduces accuracy.
  • Data Scale: mIoU rises monotonically as more MRES-32M data is included.
  • MGFE Modules: Staged introduction of [SEG] decoupling, adjacent interaction, and decoder re-weighting each contribute incremental improvements, with all combined yielding highest mIoU and cIoU.
  • Training Data Complementarity: Joint learning on single-object (RefCOCO), part-level (MRES-32M), and multiple-object data (gRefCOCO) significantly outperforms single-task or specialist models.

Qualitative results demonstrate precise part-masking for challenging queries (e.g., “the top half of the red mug handle”), strong multi-object detection, and correct no-target classifications. Noted failure cases include extremely small/irregular parts or incomplete masks for complex, multi-instance requests, which suggests future data or capacity scaling may further enhance performance.


UniRES++ is a unified large multimodal model and training paradigm for referring expression segmentation at every visual granularity. It integrates hierarchical visual features, large-scale joint supervision, and dynamic token routing and re-weighting in a single framework, and it sets new state-of-the-art results on multi-level language-guided segmentation tasks (Liu et al., 2 Apr 2025).
