
Omni-Referring Image Segmentation (OmniRIS)

Updated 10 December 2025
  • OmniRIS is a unified paradigm that segments all visual entities using diverse omni-modal inputs such as text, mask, box, and scribble.
  • The OmniRef dataset, with 30,956 images and 186,939 prompts, enables evaluation across single, multi, and no-target segmentation scenarios.
  • OmniSegNet, a dual-path transformer architecture, integrates cross-modal cues to achieve state-of-the-art performance in referent existence and mask accuracy.

Omni-Referring Image Segmentation (OmniRIS) describes the generalized task of segmenting all visual entities within an image guided by arbitrary combinations of textual instructions and visual reference cues. OmniRIS subsumes and extends classical referring image segmentation (RIS) and visual prompt-based segmentation by supporting omni-modal inputs (text, image, region, mask, box, scribble) and encompassing diverse referential relations (including many-to-many, one-to-many, and non-referent cases) in a unified framework. The foundation of this paradigm is established in “Omni-Referring Image Segmentation” (Zheng et al., 7 Dec 2025), which introduces a formal definition, a comprehensive dataset (OmniRef), and the OmniSegNet baseline model.

1. Formal Definition and Scope

Let $I_t \in \mathbb{R}^{H \times W \times 3}$ denote the target image. OmniRIS introduces an omni-prompt set $\mathcal{P} = \{T, (I_r, P_s)\}$, where $T$ is an optional text instruction and $(I_r, P_s)$ contains a reference image $I_r$ and a spatial prompt $P_s \in \{0,1\}^{H \times W}$ (mask, box, or scribble), with any subset potentially omitted. The task is to learn a function $f_\theta$ mapping

$(I_t, \mathcal{P}) \rightarrow (\{M_k\}_{k=1}^K, y)$

such that $\{M_k\}$ are the predicted binary masks for the referred objects and $y \in \{0,1\}$ signals referent existence. The supported grounding regimes include:

  • One-vs-one: A single prompt maps to one mask
  • One-vs-many: Single prompt refers to multiple instances
  • Many-vs-many: Multiple prompts each refer to (possibly different) targets
  • No-target: The prompt refers to an absent entity ($y = 0$, $M_k = 0$)

The model must learn the conditional distribution $p(\{M_k\}, y \mid I_t, \mathcal{P})$. This generalization lifts the closed-category and single-modality input restrictions of classical RIS, handles multi-instance and non-referent settings, and supports both textually fine-grained and visually grounded referential instructions (Zheng et al., 7 Dec 2025).
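
As a concrete reading of this formulation, the following is a minimal sketch of the task interface; the class names (OmniPrompt, OmniRISOutput) and field layout are illustrative assumptions rather than artifacts of the paper.

```python
# Minimal, illustrative sketch of the OmniRIS interface; all names are hypothetical
# and chosen to mirror the formal definition (I_t, P = {T, (I_r, P_s)}) -> ({M_k}, y).
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class OmniPrompt:
    text: Optional[str] = None                    # T: optional textual instruction
    reference_image: Optional[np.ndarray] = None  # I_r: H x W x 3 reference image
    spatial_prompt: Optional[np.ndarray] = None   # P_s: H x W binary mask / box / scribble


@dataclass
class OmniRISOutput:
    masks: List[np.ndarray]   # {M_k}: predicted binary masks; empty list if no referent
    referent_exists: bool     # y: whether any referred entity is present in I_t


def omniris(target_image: np.ndarray, prompt: OmniPrompt) -> OmniRISOutput:
    """f_theta: (I_t, P) -> ({M_k}, y); placeholder for a trained model."""
    raise NotImplementedError
```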

2. The OmniRef Dataset: Construction and Structure

The OmniRef dataset operationalizes the OmniRIS paradigm by exhaustively annotating 30,956 images with 186,939 omni-prompts that span text, visual (mask/box/scribble), and hybrid (text+visual) modalities. The construction pipeline is as follows (Zheng et al., 7 Dec 2025):

  • Step I—Image Selection: MSCOCO images filtered for at least two semantic categories with spatial diversity, resulting in 30,956 images.
  • Step II—Visual Annotation: 26,859 COCO images (with instance masks) are used to derive spatial prompts. Masks yield tight boxes for box prompts or simulated scribbles for scribble prompts. Reference images for each target are matched by category (positives) or specifically chosen to be absent (negatives).
  • Step III—Text Annotation: Prompts from gRefCOCO (for multi/none) and RefCOCOg (for complex single-targets) are repurposed, yielding multiple textual referents per image.
  • Step IV—Omni-Annotation Fusion: For 23,709 test samples, each target is paired with both a text and a visual prompt under identical semantics, manually audited for correctness.

The dataset is split into Omni-Train (24,407 images, 108,354 prompts) and three Test sets: Text-test (25,795 text prompts), Visual-test (29,081 visual prompts), and Omni-test (23,709 combined prompts). Detailed support for single-target, multi-target, and no-target scenarios is preserved in all splits, enabling granular benchmarking of referent existence and grounding settings.

Split        Images / Prompts    Prompt Types                 Targets
Omni-Train   24,407 / 108,354    text, mask, box, scribble    35,987 single, 47,960 multi, 24,407 none
Text-Test    6,549 / 25,795      text                         single / multi / none
Visual-Test  6,549 / 29,081      mask, box, scribble          single / multi / none
Omni-Test    6,549 / 23,709      text + visual                single / multi / none

This dataset enables systematic evaluation of both textual and visually conditioned segmentation, robustly spanning high-level (attribute, count) and uncommon object grounding scenarios.
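
For illustration, a hypothetical annotation record consistent with the structure described above might look as follows; all field names are assumptions rather than the dataset's actual schema.

```python
# Hypothetical example of a single OmniRef annotation record; field names and layout
# are illustrative only and do not reflect the dataset's actual schema.
sample = {
    "image_id": "coco_000000123456",            # MSCOCO-derived target image
    "prompt_type": "omni",                      # text / mask / box / scribble / omni
    "text": "the two zebras on the left",       # optional textual referent
    "reference_image_id": "coco_000000654321",  # reference image for the visual prompt
    "spatial_prompt": {"type": "box", "xyxy": [34, 80, 210, 305]},
    "target_instance_ids": [3, 7],              # multi-target case; [] encodes no-target
    "referent_exists": True,                    # y = 0 for negative (absent-entity) prompts
}
```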

3. OmniSegNet: Baseline Architecture for OmniRIS

OmniSegNet is designed as a dual-path transformer architecture accommodating all omni-prompt modalities (Zheng et al., 7 Dec 2025):

  • Image Backbone & Pixel Encoder: Swin-B serves as the image encoder, with output features $\{F_m^i\}_{i=0..3}$ feeding multi-scale MaskDecoder blocks and a final MaskHead.
  • Text Prompt Path: A BERT encoder produces $F_t$ (dimension 768) for MaskDecoder consumption.
  • Visual Prompt Path (Omni-Prompt Encoder): A reference backbone processes the reference image, producing four scale-matched features $F_r^i$. The Prompt Embed Module (PEM) integrates each spatial prompt $P^i$ (mask, box, scribble) as $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$.
  • Prompt Generator: Three layers of deformable cross-attention, self-attention, and FFN aggregate prompt semantics into the embedding $F_p$ (with $n$ learnable queries).
  • Mask Decoder: Each scale's MaskDecoder receives both pixel features $F_m^i$ and (textual or visual) prompt features, fusing them before propagating to the next stage.
  • Output Heads: Fused features pass to the MaskHead for pixelwise mask prediction ($M_{\mathrm{pred}}$) and to an MLP for the referent existence logit ($y_{\mathrm{pred}}$).

Key stages:

  1. Prompt embedding: $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$.
  2. Prompt generator: $F_q' = \mathrm{DeformCrossAttn}(F_q, F_r')$.
  3. MaskDecoder block: sequential cross-attention with $F_m^i$ and the prompt features, self-attention, and an FFN.

This structure enables text-only, visual-only, or omni-modal prompt scenarios and supports referent multiplicity and absence detection.
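
The following is a schematic PyTorch sketch of the PEM fusion step $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$ described above; channel sizes and module layout are assumptions for illustration, not the released implementation.

```python
# Schematic sketch of the Prompt Embed Module step F_s'^i = F_r^i + Conv_i(P^i).
# Channel sizes and the module layout are illustrative assumptions, not the paper's code.
import torch.nn as nn
import torch.nn.functional as F


class PromptEmbedModule(nn.Module):
    def __init__(self, channels=(128, 256, 512, 1024)):
        super().__init__()
        # One 1-channel -> C_i projection per feature scale i
        self.prompt_convs = nn.ModuleList(
            [nn.Conv2d(1, c, kernel_size=3, padding=1) for c in channels]
        )

    def forward(self, ref_feats, spatial_prompt):
        """ref_feats: list of F_r^i tensors of shape (B, C_i, H_i, W_i);
        spatial_prompt: (B, 1, H, W) binary mask / box / scribble map."""
        fused = []
        for conv, f_r in zip(self.prompt_convs, ref_feats):
            # Resize the spatial prompt to the current feature scale, then add
            p_i = F.interpolate(spatial_prompt.float(), size=f_r.shape[-2:], mode="nearest")
            fused.append(f_r + conv(p_i))  # F_s'^i
        return fused
```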

4. Training Paradigm and Loss Functions

OmniSegNet is trained in three progressive stages (Zheng et al., 7 Dec 2025):

  1. Vision–Language Alignment: Pretraining on traditional RIS datasets (e.g., RefCOCO, gRefCOCO) with text prompts to instill language grounding.
  2. Visual Instruction Tuning: Freezing the text path, the model is specifically tuned on visual prompts from OmniRef to learn correspondence from masks, boxes, or scribbles.
  3. Joint Omni-Modal Training: Both prompt encoders are unfrozen, and batches of text, visual, and omni-modal prompts are mixed (a 7:2 text:visual ratio yields the best cross-modal generalization; see ablation Table 8).
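
A minimal sketch of how this staged schedule and the 7:2 batch mixing could be wired up, assuming hypothetical module names (model.text_encoder, model.omni_prompt_encoder):

```python
# Illustrative sketch of the staged schedule; attribute names such as model.text_encoder
# and model.omni_prompt_encoder are hypothetical, not taken from the released code.
import random


def configure_stage(model, stage: int):
    if stage == 1:    # vision-language alignment: train with text prompts only
        for p in model.omni_prompt_encoder.parameters():
            p.requires_grad = False
    elif stage == 2:  # visual instruction tuning: freeze the text path
        for p in model.text_encoder.parameters():
            p.requires_grad = False
        for p in model.omni_prompt_encoder.parameters():
            p.requires_grad = True
    else:             # joint omni-modal training: everything trainable
        for p in model.parameters():
            p.requires_grad = True


def sample_batch_modality():
    # Stage 3 batch mixing at the reported 7:2 text:visual ratio
    return "text" if random.random() < 7 / 9 else "visual"
```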

The loss function is a weighted sum: $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{mask}} + \lambda_2 \mathcal{L}_{\text{region}} + \lambda_3 \mathcal{L}_{\text{nt}}$

  • $\mathcal{L}_{\text{mask}}$: pixelwise cross-entropy against the ground-truth masks ($M_{\mathrm{gt}}$).
  • $\mathcal{L}_{\text{region}}$: region-level cross-entropy between the downsampled mask and regional features.
  • $\mathcal{L}_{\text{nt}}$: existence (no-target) classification loss.

Hyperparameters are set empirically. Input images are $480 \times 480$; text is truncated to 20 tokens. Optimizer: AdamW.
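
A hedged sketch of the combined objective follows; the loss weights and the exact form of each term are placeholders rather than the paper's reported settings.

```python
# Sketch of the weighted objective L_total = l1*L_mask + l2*L_region + l3*L_nt.
# The lambda weights and the exact form of L_region are illustrative placeholders.
import torch.nn.functional as F


def omnisegnet_loss(mask_logits, mask_gt, region_logits, region_gt,
                    nt_logit, nt_gt, lambdas=(1.0, 1.0, 1.0)):
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)        # pixelwise term
    l_region = F.binary_cross_entropy_with_logits(region_logits, region_gt)  # region-level term
    l_nt = F.binary_cross_entropy_with_logits(nt_logit, nt_gt)               # existence term
    l1, l2, l3 = lambdas
    return l1 * l_mask + l2 * l_region + l3 * l_nt


# Optimizer as stated in the text; the learning rate is an assumed placeholder:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```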

5. Evaluation Metrics and Experimental Results

OmniRIS performance is evaluated using task-specific and unified metrics (Zheng et al., 7 Dec 2025):

  • IoU: $|M_{\mathrm{pred}} \cap M_{\mathrm{gt}}| / |M_{\mathrm{pred}} \cup M_{\mathrm{gt}}|$ per sample.
  • Cumulative IoU (cIoU): IoU computed from intersections and unions accumulated over samples with at least one referent.
  • Generalized IoU (gIoU): Mean per-sample IoU over all samples, assigning 1 for a correctly predicted no-target and 0 otherwise.
  • No-Target Accuracy (N_acc): Fraction of no-target samples with $y_{\mathrm{pred}} = y_{\mathrm{gt}}$.
  • Precision@$X$: Percentage of samples with IoU $> X$.
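
A minimal sketch of these metrics, assuming binary mask arrays and a GRES-style cumulative accumulation for cIoU; the exact accumulation protocol used in the paper is an assumption here.

```python
# Minimal sketch of the metrics; boolean mask arrays are assumed, and cIoU follows the
# GRES-style cumulative-intersection-over-cumulative-union convention described above.
import numpy as np


def sample_iou(pred, gt):
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0


def evaluate(samples):
    """samples: iterable of (pred_mask, gt_mask, y_pred, y_gt) with boolean masks."""
    inter_sum = union_sum = 0
    g_scores, nt_correct, nt_total = [], 0, 0
    for pred, gt, y_pred, y_gt in samples:
        if y_gt == 0:                                     # no-target sample
            nt_total += 1
            nt_correct += int(y_pred == y_gt)
            g_scores.append(1.0 if y_pred == 0 else 0.0)  # gIoU convention for no-target
        else:
            inter_sum += np.logical_and(pred, gt).sum()
            union_sum += np.logical_or(pred, gt).sum()
            g_scores.append(sample_iou(pred, gt))
    return {
        "cIoU": inter_sum / union_sum if union_sum > 0 else 0.0,
        "gIoU": float(np.mean(g_scores)) if g_scores else 0.0,
        "N_acc": nt_correct / nt_total if nt_total > 0 else 0.0,
    }
```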

Key empirical results on OmniRef (Table 1, (Zheng et al., 7 Dec 2025)):

Method       Text-test (cIoU / gIoU / N_acc)   Visual-test (cIoU / gIoU / N_acc)   Omni-test (cIoU / gIoU / N_acc)
LISA-7B†     64.95 / 66.02 / –                 –                                   –
GSVA-7B†     65.30 / 67.57 / 63.44             –                                   –
ReLA         63.40 / 64.75 / 57.97             –                                   –
VRP-SAM      –                                 55.70 / 52.74 / 78.63               –
DCAMA        –                                 60.23 / 49.91 / 80.46               –
OmniSegNet   64.92 / 66.44 / 62.56             76.63 / 68.87 / 90.81               69.27 / 67.80 / 57.69

† Indicates use of Vicuna-7B as language backbone.

OmniSegNet consistently outperforms MLLM-derived and specialist baselines, achieves cIoU 76.63 and N_acc 90.81 on visual prompts, and maintains competitive performance on text/omni settings. Multi-shot, prompt fusion, and batch ratio ablations demonstrate best performance for mask prompts and 7:2 text:visual training (see Table 2a/2b/8).

6. Comparative Perspective and Extensions

OmniRIS unifies and extends several lines of research:

  • Beyond One-to-One: Supporting zero, one, or many referents per prompt, as advocated in DMMI (Hu et al., 2023) and DeRIS (Dai et al., 2 Jul 2025), with architectures built to handle multi-instance and no-referent segmentation without specialized branches or toggles.
  • Omni-supervised Training: Incorporating labeled, weakly labeled, and unlabeled data, as in Omni-RES (Huang et al., 2023), to harness large-scale vision-language resources via, e.g., teacher-student filtering (APLR), with demonstrated +2-9% mIoU gains at low annotation budgets.
  • Multi-modal Task Generalization: As in UniRef++ (Wu et al., 2023), the core modules can support RIS, few-shot segmentation, video-object segmentation, and various prompt modalities via unified encoding and multiway-fusion blocks.
  • Group-wise and Negative Mining: Insights from GRES/GRSer (Wu et al., 2023) on handling group retrieval, negative cases, and anti-expressions inform OmniRIS's robust negative handling and instance presence prediction.

OmniRIS is positioned as the canonical open-vocabulary, multi-modal, multi-referent segmentation setting.

7. Limitations and Future Directions

Several challenges and open problems remain for OmniRIS (Zheng et al., 7 Dec 2025):

  • Long-Tail and Out-of-Distribution Generalization: Reliance on MSCOCO categories in OmniRef constrains coverage of rare or open-world categories. Visual prompt diversity (scribble generation vs. real user input) may not transfer directly to practical deployments.
  • Computational Complexity: Dual-backbone and multi-decoders incur significant memory and training costs.
  • Scalability: Efficient negative mining, dynamic prompt allocation, and large-scale group processing are research frontiers (see (Wu et al., 2023, Dai et al., 2 Jul 2025)).
  • Integrated Foundation Models: Incorporation of large-scale VLMs (e.g., BLIP, SAM2) is proposed to close the gap on highly open-vocabulary settings; adapting OmniRIS frameworks to spatiotemporal (video), 3D, and self/weakly supervised contexts is a priority.
  • Prompt-Agnostic Training: Future research may unify all prompt modalities, facilitating continuous interpolation between text, sketch, point, box, and mask signals and robust fusion.

Potential applications include interactive image editing, human-computer interaction, image annotation, robotics, and domain-specific few-shot segmentation.


In summary, Omni-Referring Image Segmentation (OmniRIS) specifies a rigorous and highly general paradigm for conditional visual segmentation given omni-modal prompts, unifying textual and visual reference frameworks and robustly supporting a comprehensive spectrum of referent cases, as instantiated by the OmniRef dataset and OmniSegNet model (Zheng et al., 7 Dec 2025).
