UniRES++: Unified Multimodal RES Model
- The paper introduces a unified end-to-end pipeline for multi-granularity segmentation that leverages shared parameters and targeted feature exploration to achieve state-of-the-art results.
- UniRES++ integrates dual-stream inputs and cascaded cross-attention to combine coarse global context with fine-grained part features, significantly improving mIoU and cIoU metrics.
- Extensive ablation studies confirm that combining object and part token flows, high-resolution imagery, and large-scale datasets optimally enhances segmentation accuracy.
UniRES++ is a unified multimodal LLM (MLLM) for referring expression segmentation (RES) across omni-level visual target granularities, including multi-object, single-object, and part-level segmentation. In contrast to prior approaches that treat different granularities as separate tasks, UniRES++ implements a single, end-to-end pipeline built for both object- and part-level visual grounding using shared parameters and multi-granularity routing. Designed to address practical scenarios in vision-language grounding, UniRES++ leverages a novel architecture, large-scale datasets annotated at multiple levels, and targeted feature exploitation mechanisms to achieve state-of-the-art (SOTA) performance on established and new RES benchmarks (Liu et al., 2 Apr 2025).
1. Unified Model Architecture
UniRES++ ingests an image and a language expression, producing a pixel-wise mask at the level of granularity (object or part) specified by the referring expression. The architecture integrates vision and language via the following components:
- Dual-stream Image Input: A low-resolution image (336×336) encodes global scene information, while a high-resolution image (1024×1024) captures fine visual details.
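- Multi-Granularity Vision Flow (MGVF):
  - A CLIP ViT-L/14 encoder processes the low-resolution image to yield coarse global tokens.
  - A ConvNeXt-L encoder processes the high-resolution image to produce fine-grained features.
  - Open-domain detectors propose object boxes and part boxes on the image.
  - ROIAlign extracts object tokens and part tokens from the high-resolution features.
  - Context is propagated by cascaded cross-attention, from the global image tokens to the object tokens and then to the part tokens.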
- Grounding Encoder: A frozen SAM encoder outputs patch tokens.
- Vision–Text Projection: The multi-granularity vision tokens are concatenated and projected into the LLM input space.
- LLM Backbone: Vicuna-7B (with LoRA-8) processes the projected vision tokens and the text sequence, autoregressively emitting tokens until a dedicated [SEG] token.
- Multi-Granularity Feature Exploitation (MGFE):
  - [SEG] decouples into [SEG_OBJECT] or [SEG_PART] via a classifier.
  - Routing directs computation to the object or part pathway using a gate.
  - Re-weighting uses cross-attention.
- Pixel Decoder: Combines the [SEG] embedding with visual features to predict a segmentation mask.
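The cascaded context propagation in the vision flow (image to object to part) can be illustrated with a minimal sketch. It assumes single-head attention with no learned projections and a residual connection, none of which is confirmed by the source; the function names `cross_attend` and `cascade` and the toy token values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query token attends over the context tokens.

    Simplification (assumption): no learned query/key/value projections,
    and the attended vector is added back to the query (residual).
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, context)) for j in range(dim)]
        updated.append([a + b for a, b in zip(q, attended)])
    return updated

def cascade(global_tokens, object_tokens, part_tokens):
    """Propagate context image -> object -> part."""
    objects = cross_attend(object_tokens, global_tokens)
    parts = cross_attend(part_tokens, objects)
    return objects, parts

# Toy 2-dimensional tokens, purely illustrative.
global_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_tokens = [[0.5, 0.5]]
part_tokens = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
objects, parts = cascade(global_tokens, object_tokens, part_tokens)
```

The ordering is the point of the sketch: part tokens never see the global tokens directly, only the object tokens that have already absorbed global context.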
2. Multi-Granularity Pathways and Routing
The architecture enables explicit granularity selection through [SEG_OBJECT] and [SEG_PART] tokens in the LLM vocabulary. During inference, the model predicts the appropriate granularity and routes via a gating variable, which applies either object features or part features:
- When the gate selects the object level, the object-level pathway is chosen.
- When the gate selects the part level, part-level features dominate.
- Routing and re-weighting allow unified inference for one object, multiple objects, or part of an object, without separate specialist models.
This design allows the system to handle any described target granularity with a single set of model parameters and joint training objectives.
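A hedged sketch of this routing step follows. It assumes a two-way classifier that emits one logit per granularity and a hard gate with the convention 1 = part, 0 = object; the source does not state the gate's encoding, and `route` is a hypothetical helper.

```python
import math

def route(seg_logits, object_feats, part_feats):
    """Pick a granularity pathway from two classifier logits.

    seg_logits: (object_logit, part_logit) from a hypothetical classifier.
    Returns the chosen token name, the selected features, and P(part).
    """
    object_logit, part_logit = seg_logits
    # A two-way softmax reduces to a sigmoid of the logit difference.
    p_part = 1.0 / (1.0 + math.exp(object_logit - part_logit))
    gate = 1 if p_part > 0.5 else 0  # assumed convention: 1 = part, 0 = object
    token = "[SEG_PART]" if gate else "[SEG_OBJECT]"
    feats = part_feats if gate else object_feats
    return token, feats, p_part

object_choice = route((2.0, 0.0), ["object tokens"], ["part tokens"])
part_choice = route((0.0, 3.0), ["object tokens"], ["part tokens"])
```

A hard gate is used here for readability; a soft gate that blends both feature sets by `p_part` would be an equally plausible reading of "part-level features dominate".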
3. Formulations and Training Objective
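The key mathematical formulations in UniRES++ fall into three groups:
- Multi-Granularity Vision Flow: three equations for the multi-granularity token flow.
- Projection and Routing: three equations for the vision–text projection and the granularity routing.
- Training Loss: a total objective defined through four component terms.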
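As a rough illustration of the kind of formulation this section covers, the following sketch writes the cascaded context propagation and a weighted multi-term objective in assumed notation. The symbols $G$, $O$, $P$ (global, object, and part tokens), the residual form, the absence of learned projections, and the index $k$ over four loss terms are all assumptions made for this sketch and are not taken from the source.

```latex
% Assumed notation: G = global tokens, O = object tokens, P = part tokens,
% d = token dimension, \lambda_k = loss weights, \mathcal{L}_k = loss terms.
\begin{aligned}
\mathrm{CA}(Q, K) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) K \\
O' &= O + \mathrm{CA}(O, G) \\
P' &= P + \mathrm{CA}(P, O') \\
\mathcal{L} &= \sum_{k=1}^{4} \lambda_k \, \mathcal{L}_k
\end{aligned}
```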
4. Fine-Grained Visual Feature Exploration
UniRES++ employs targeted strategies for fine-grained feature capture:
- Two-scale Encoding: Uses 336×336 resolution for global scene understanding and 1024×1024 for high-detail part-level grounding.
- Open-Domain Box Proposals: Object- and part-level boxes allow region-specific feature extraction.
- ROIAlign and Cascaded Cross-Attention: Token extraction from high-res features, followed by sequential context propagation (image → object → part) for progressively finer localization.
- Vision–Language Self- and Cross-Attention: The LLM fuses vision tokens and text context, allowing dynamic focus on appropriate detail levels per query.
This approach facilitates hierarchical feature aggregation, with the LLM dynamically determining the requisite granularity during inference.
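To make the region-token idea concrete, here is a deliberately simplified stand-in for ROIAlign: it averages the feature-grid cells covered by a box instead of performing bilinear sampling, so it is not ROIAlign itself. The helper `box_to_token`, the 4×4 feature grid, and the box coordinates are hypothetical.

```python
import math

def box_to_token(feature_map, box, image_size):
    """Average the feature cells covered by a pixel-space box (x0, y0, x1, y1).

    feature_map is a rows x cols x channels nested list computed from a
    square image of side image_size. Plain average pooling, not ROIAlign.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    x0, y0, x1, y1 = box
    # Map pixel coordinates onto the feature grid, keeping at least one cell.
    c0 = min(cols - 1, max(0, int(x0 * cols / image_size)))
    r0 = min(rows - 1, max(0, int(y0 * rows / image_size)))
    c1 = min(cols, max(c0 + 1, math.ceil(x1 * cols / image_size)))
    r1 = min(rows, max(r0 + 1, math.ceil(y1 * rows / image_size)))
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]

# Toy features: each cell stores its own (row, column) index as two channels.
feature_map = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
object_token = box_to_token(feature_map, (0, 0, 1024, 1024), 1024)
part_token = box_to_token(feature_map, (512, 0, 1024, 256), 1024)
```

The object box spans the whole grid while the part box covers only two cells, which shows why part tokens benefit from the high-resolution feature map: at a coarser grid the part box would collapse into a single cell.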
5. Training Regime and Data
Training follows a two-stage process:
- Pre-training: Initial grounding on the GranD dataset (introduced with GLaMM, “Pixel Grounding Large Multimodal Model”) for basic skill acquisition.
- Fine-Tuning: Multi-task supervision using:
  - Classic RES datasets: RefCOCO, RefCOCO+, RefCOCOg
  - Generalized RES: gRefCOCO
  - Multi-granularity RES: MRES-32M (32.2M masks and captions across 1M images)
  - Region captioning datasets: Visual Genome, RefCOCO, etc.
Key training hyperparameters:
- Vision encoders (CLIP ViT-L/14, ConvNeXt-L, SAM) are frozen.
- LoRA-8 tuning applied to the LLM backbone (Vicuna-7B).
- Batch size: 256; AdamW optimizer; linear warm-up for 100 steps, followed by cosine decay.
- Loss weights: one coefficient per loss term.
- Each epoch: ~2000 steps, trained over 3 epochs.
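The warm-up and decay schedule listed above can be sketched as follows. The total of 6000 steps follows from roughly 2000 steps per epoch over 3 epochs; the base learning rate is left as a parameter because its value is not given here, and decaying all the way to zero is an assumption. `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, base_lr, warmup_steps=100, total_steps=6000):
    """Learning rate at a given step: linear warm-up, then cosine decay.

    base_lr is a placeholder supplied by the caller; decay to zero at
    total_steps is an assumption of this sketch.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Normalized schedule (base_lr = 1.0) sampled at a few steps.
schedule = [lr_at(s, 1.0) for s in (0, 99, 100, 3050, 6000)]
```

With `base_lr = 1.0` the returned values are multipliers, which can be scaled by whatever learning rate the optimizer is configured with.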
Inference involves feeding an image and text to the model, which autoregresses to [SEG], determines granularity, and outputs the segmentation mask.
6. Empirical Performance
UniRES++ achieves SOTA results across classic, generalized, and multi-granularity RES benchmarks:
| Dataset | Metric | UniRES++ | Prior Best |
|---|---|---|---|
| RefCOCOm (MRES) | Val mIoU (obj+part) | 40.8% | 34.3% (“UniRES”) |
| RefCOCOm (MRES) | Val mIoU (part) | 27.7% | 19.6% |
| gRefCOCO (GRES) | Val cIoU | 69.9% | 68.7% (GLaMM) |
| Classic RES | RefCOCO Val oIoU | 80.2% | 77.4% |
| Classic RES | RefCOCO Val mIoU | 80.8% | 74.5% |
| Classic RES | RefCOCO+ Val | 71.6%/73.6% | n/a |
| Classic RES | RefCOCOg Val | 73.8%/74.4% | n/a |
Performance gains are attributed to hierarchical token flows, unified routing, large-scale part-level data (MRES-32M), and synergistic multi-task joint training. The model outperforms GLaMM and GSVA-FT on all major metrics (Liu et al., 2 Apr 2025).
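For readers comparing the metrics in the table, the sketch below shows the usual distinction: mIoU averages the per-sample IoU, while cumulative IoU (cIoU) divides the total intersection by the total union across all samples, which is also how overall IoU (oIoU) is commonly defined. The function names and toy masks are hypothetical, and scoring an empty prediction against an empty ground truth as 1.0 is an assumption of this sketch rather than the benchmarks' official protocol.

```python
def intersection_and_union(pred, gt):
    """Count intersecting and united pixels of two binary masks (lists of 0/1 rows)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += p & g
            union += p | g
    return inter, union

def mean_iou(samples):
    """mIoU: average of per-sample IoU; empty-vs-empty counts as 1.0 (assumption)."""
    ious = []
    for pred, gt in samples:
        inter, union = intersection_and_union(pred, gt)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)

def cumulative_iou(samples):
    """cIoU: total intersection over total union across all samples."""
    total_inter = total_union = 0
    for pred, gt in samples:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0

samples = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # small target, IoU 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # large target, IoU 4/4
]
miou = mean_iou(samples)
ciou = cumulative_iou(samples)
```

Because cIoU pools pixels, large targets dominate it, whereas mIoU weights every sample equally; this is one reason part-level results are typically reported in mIoU.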
7. Ablation Results and Qualitative Analysis
Systematic ablation studies confirm that multi-granularity token pathways and fine-grained region features are both critical:
- MGVF (granularity ablation): Sequentially adding object- and part-level tokens improves mIoU and cIoU; best performance with both, not either alone.
- High-Resolution Impact: 1024×1024 yields peak segmentation performance; deviations reduce accuracy.
- Data Scale: mIoU rises monotonically as more MRES-32M data is included.
- MGFE Modules: Staged introduction of [SEG] decoupling, adjacent interaction, and decoder re-weighting each contribute incremental improvements, with all combined yielding highest mIoU and cIoU.
- Training Data Complementarity: Joint learning on single-object (RefCOCO), part-level (MRES-32M), and multiple-object data (gRefCOCO) significantly outperforms single-task or specialist models.
Qualitative results demonstrate precise part-masking for challenging queries (e.g., “the top half of the red mug handle”), strong multi-object detection, and correct no-target classifications. Noted failure cases include extremely small/irregular parts or incomplete masks for complex, multi-instance requests, which suggests future data or capacity scaling may further enhance performance.
UniRES++ represents a unified large multimodal model and training paradigm for referring expression segmentation at every visual granularity, integrating hierarchical visual features, large-scale joint supervision, and dynamic token routing and re-weighting in a single framework that establishes new performance standards for multi-level language-guided segmentation tasks (Liu et al., 2 Apr 2025).