
Omni-Referring Image Segmentation (OmniRIS)

Updated 10 December 2025
  • OmniRIS is a unified paradigm that segments all visual entities using diverse omni-modal inputs such as text, mask, box, and scribble.
  • The OmniRef dataset, with 30,956 images and 186,939 prompts, enables evaluation across single, multi, and no-target segmentation scenarios.
  • OmniSegNet, a dual-path transformer architecture, integrates cross-modal cues to achieve state-of-the-art performance in referent existence and mask accuracy.

Omni-Referring Image Segmentation (OmniRIS) describes the generalized task of segmenting all visual entities within an image guided by arbitrary combinations of textual instructions and visual reference cues. OmniRIS subsumes and extends classical referring image segmentation (RIS) and visual prompt-based segmentation by supporting omni-modal inputs (text, image, region, mask, box, scribble) and encompassing diverse referential relations (including many-to-many, one-to-many, and non-referent cases) in a unified framework. The foundation of this paradigm is established in “Omni-Referring Image Segmentation” (Zheng et al., 7 Dec 2025), which introduces a formal definition, a comprehensive dataset (OmniRef), and the OmniSegNet baseline model.

1. Formal Definition and Scope

Let $I_t \in \mathbb{R}^{H \times W \times 3}$ denote the target image. OmniRIS introduces an omni-prompt set $\mathcal{P} = \{T, (I_r, P_s)\}$, where $T$ is an optional text instruction and $(I_r, P_s)$ contains a reference image $I_r$ and a spatial prompt $P_s \in \{0,1\}^{H \times W}$ (mask, box, or scribble), with any subset potentially omitted. The task is to learn a function $f_\theta$ mapping

$(I_t, \mathcal{P}) \rightarrow (\{M_k\}_{k=1}^K, y)$

such that $\{M_k\}$ are the predicted binary masks for the referred objects and $y \in \{0,1\}$ signals referent existence. The supported grounding regimes include:

  • One-vs-one: A single prompt maps to one mask
  • One-vs-many: Single prompt refers to multiple instances
  • Many-vs-many: Multiple prompts each refer to (possibly different) targets
  • No-target: The prompt refers to an absent entity ($y = 0$, $M_k = 0$)

The model must learn the conditional distribution $p(\{M_k\}, y \mid I_t, \mathcal{P})$. This generalization lifts the closed-category and single-modality input restrictions of classical RIS, handles multi-instance and non-referent settings, and supports both textually fine-grained and visually grounded referential instructions (Zheng et al., 7 Dec 2025).
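
As a concrete reading of this formulation, the following is a minimal sketch of the task interface; the class names (OmniPrompt, OmniRISOutput) and field layout are illustrative assumptions rather than artifacts of the paper.

```python
# Minimal, illustrative sketch of the OmniRIS interface; all names are hypothetical
# and chosen to mirror the formal definition (I_t, P = {T, (I_r, P_s)}) -> ({M_k}, y).
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class OmniPrompt:
    text: Optional[str] = None                    # T: optional textual instruction
    reference_image: Optional[np.ndarray] = None  # I_r: H x W x 3 reference image
    spatial_prompt: Optional[np.ndarray] = None   # P_s: H x W binary mask / box / scribble


@dataclass
class OmniRISOutput:
    masks: List[np.ndarray]   # {M_k}: predicted binary masks; empty list if no referent
    referent_exists: bool     # y: whether any referred entity is present in I_t


def omniris(target_image: np.ndarray, prompt: OmniPrompt) -> OmniRISOutput:
    """f_theta: (I_t, P) -> ({M_k}, y); placeholder for a trained model."""
    raise NotImplementedError
```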

2. The OmniRef Dataset: Construction and Structure

The OmniRef dataset operationalizes the OmniRIS paradigm by exhaustively annotating 30,956 images with 186,939 omni-prompts that span text, visual (mask/box/scribble), and hybrid (text+visual) modalities. The construction pipeline is as follows (Zheng et al., 7 Dec 2025):

  • Step I—Image Selection: MSCOCO images filtered for at least two semantic categories with spatial diversity, resulting in 30,956 images.
  • Step II—Visual Annotation: 26,859 COCO images (with instance masks) are used to derive spatial prompts. Masks yield tight boxes for box prompts or simulated scribbles for scribble prompts. Reference images for each target are matched by category (positives) or specifically chosen to be absent (negatives).
  • Step III—Text Annotation: Prompts from gRefCOCO (for multi/none) and RefCOCOg (for complex single-targets) are repurposed, yielding multiple textual referents per image.
  • Step IV—Omni-Annotation Fusion: For 23,709 test samples, each target is paired with both a text and a visual prompt under identical semantics, manually audited for correctness.

The dataset is split into Omni-Train (24,407 images, 108,354 prompts) and three Test sets: Text-test (25,795 text prompts), Visual-test (29,081 visual prompts), and Omni-test (23,709 combined prompts). Detailed support for single-target, multi-target, and no-target scenarios is preserved in all splits, enabling granular benchmarking of referent existence and grounding settings.

Split        Images / Prompts    Prompt Types                 Targets
Omni-Train   24,407 / 108,354    text, mask, box, scribble    35,987 single, 47,960 multi, 24,407 none
Text-Test    6,549 / 25,795      text                         single / multi / none
Visual-Test  6,549 / 29,081      mask, box, scribble          single / multi / none
Omni-Test    6,549 / 23,709      text + visual                single / multi / none

This dataset enables systematic evaluation of both textual and visually conditioned segmentation, robustly spanning high-level (attribute, count) and uncommon object grounding scenarios.
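
For illustration, a hypothetical annotation record consistent with the structure described above might look as follows; all field names are assumptions rather than the dataset's actual schema.

```python
# Hypothetical example of a single OmniRef annotation record; field names and layout
# are illustrative only and do not reflect the dataset's actual schema.
sample = {
    "image_id": "coco_000000123456",            # MSCOCO-derived target image
    "prompt_type": "omni",                      # text / mask / box / scribble / omni
    "text": "the two zebras on the left",       # optional textual referent
    "reference_image_id": "coco_000000654321",  # reference image for the visual prompt
    "spatial_prompt": {"type": "box", "xyxy": [34, 80, 210, 305]},
    "target_instance_ids": [3, 7],              # multi-target case; [] encodes no-target
    "referent_exists": True,                    # y = 0 for negative (absent-entity) prompts
}
```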

3. OmniSegNet: Baseline Architecture for OmniRIS

OmniSegNet is designed as a dual-path transformer architecture accommodating all omni-prompt modalities (Zheng et al., 7 Dec 2025):

  • Image Backbone & Pixel Encoder: Swin-B serves as the image encoder, with output features $\{F_m^i\}_{i=0..3}$ feeding multi-scale MaskDecoder blocks and a final MaskHead.
  • Text Prompt Path: A BERT encoder produces $F_t$ (dimension 768) for MaskDecoder consumption.
  • Visual Prompt Path (Omni-Prompt Encoder): A reference backbone processes the reference image, producing four scale-matched features $F_r^i$. The Prompt Embed Module (PEM) integrates each spatial prompt $P^i$ (mask, box, scribble) as $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$.
  • Prompt Generator: Three layers of deformable cross-attention, self-attention, and FFN aggregate prompt semantics into the embedding $F_p$ (with $n$ learnable queries).
  • Mask Decoder: Each scale's MaskDecoder receives both pixel features $F_m^i$ and (textual or visual) prompt features, fusing them before propagating to the next stage.
  • Output Heads: Fused features pass to the MaskHead for pixelwise mask prediction ($M_{\mathrm{pred}}$) and to an MLP for the referent existence logit ($y_{\mathrm{pred}}$).

Key stages:

  1. Prompt embedding: $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$.
  2. Prompt generator: $F_q' = \mathrm{DeformCrossAttn}(F_q, F_r')$.
  3. MaskDecoder block: sequential cross-attention with $F_m^i$ and the prompt features, self-attention, and an FFN.

This structure enables text-only, visual-only, or omni-modal prompt scenarios and supports referent multiplicity and absence detection.
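
The following is a schematic PyTorch sketch of the PEM fusion step $F_s'^i = F_r^i + \mathrm{Conv}_i(P^i)$ described above; channel sizes and module layout are assumptions for illustration, not the released implementation.

```python
# Schematic sketch of the Prompt Embed Module step F_s'^i = F_r^i + Conv_i(P^i).
# Channel sizes and the module layout are illustrative assumptions, not the paper's code.
import torch.nn as nn
import torch.nn.functional as F


class PromptEmbedModule(nn.Module):
    def __init__(self, channels=(128, 256, 512, 1024)):
        super().__init__()
        # One 1-channel -> C_i projection per feature scale i
        self.prompt_convs = nn.ModuleList(
            [nn.Conv2d(1, c, kernel_size=3, padding=1) for c in channels]
        )

    def forward(self, ref_feats, spatial_prompt):
        """ref_feats: list of F_r^i tensors of shape (B, C_i, H_i, W_i);
        spatial_prompt: (B, 1, H, W) binary mask / box / scribble map."""
        fused = []
        for conv, f_r in zip(self.prompt_convs, ref_feats):
            # Resize the spatial prompt to the current feature scale, then add
            p_i = F.interpolate(spatial_prompt.float(), size=f_r.shape[-2:], mode="nearest")
            fused.append(f_r + conv(p_i))  # F_s'^i
        return fused
```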

4. Training Paradigm and Loss Functions

OmniSegNet is trained in three progressive stages (Zheng et al., 7 Dec 2025):

  1. Vision–Language Alignment: Pretraining on traditional RIS datasets (e.g., RefCOCO, gRefCOCO) with text prompts to instill language grounding.
  2. Visual Instruction Tuning: Freezing the text path, the model is specifically tuned on visual prompts from OmniRef to learn correspondence from masks, boxes, or scribbles.
  3. Joint Omni-Modal Training: Both prompt encoders are unfrozen, and batches of text, visual, and omni-modal prompts are mixed (a 7:2 text:visual ratio yields the best cross-modal generalization; see ablation Table 8).
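
A minimal sketch of how this staged schedule and the 7:2 batch mixing could be wired up, assuming hypothetical module names (model.text_encoder, model.omni_prompt_encoder):

```python
# Illustrative sketch of the staged schedule; attribute names such as model.text_encoder
# and model.omni_prompt_encoder are hypothetical, not taken from the released code.
import random


def configure_stage(model, stage: int):
    if stage == 1:    # vision-language alignment: train with text prompts only
        for p in model.omni_prompt_encoder.parameters():
            p.requires_grad = False
    elif stage == 2:  # visual instruction tuning: freeze the text path
        for p in model.text_encoder.parameters():
            p.requires_grad = False
        for p in model.omni_prompt_encoder.parameters():
            p.requires_grad = True
    else:             # joint omni-modal training: everything trainable
        for p in model.parameters():
            p.requires_grad = True


def sample_batch_modality():
    # Stage 3 batch mixing at the reported 7:2 text:visual ratio
    return "text" if random.random() < 7 / 9 else "visual"
```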

The loss function is a weighted sum: $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{mask}} + \lambda_2 \mathcal{L}_{\text{region}} + \lambda_3 \mathcal{L}_{\text{nt}}$

  • $\mathcal{L}_{\text{mask}}$: pixelwise cross-entropy against the ground-truth masks ($M_{\mathrm{gt}}$).
  • $\mathcal{L}_{\text{region}}$: region-level cross-entropy between the downsampled mask and regional features.
  • $\mathcal{L}_{\text{nt}}$: existence (no-target) classification loss.

Hyperparameters are set empirically. Input images are $480 \times 480$; text is truncated to 20 tokens. Optimizer: AdamW.
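
A hedged sketch of the combined objective follows; the loss weights and the exact form of each term are placeholders rather than the paper's reported settings.

```python
# Sketch of the weighted objective L_total = l1*L_mask + l2*L_region + l3*L_nt.
# The lambda weights and the exact form of L_region are illustrative placeholders.
import torch.nn.functional as F


def omnisegnet_loss(mask_logits, mask_gt, region_logits, region_gt,
                    nt_logit, nt_gt, lambdas=(1.0, 1.0, 1.0)):
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)        # pixelwise term
    l_region = F.binary_cross_entropy_with_logits(region_logits, region_gt)  # region-level term
    l_nt = F.binary_cross_entropy_with_logits(nt_logit, nt_gt)               # existence term
    l1, l2, l3 = lambdas
    return l1 * l_mask + l2 * l_region + l3 * l_nt


# Optimizer as stated in the text; the learning rate is an assumed placeholder:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```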

5. Evaluation Metrics and Experimental Results

OmniRIS performance is evaluated using task-specific and unified metrics (Zheng et al., 7 Dec 2025):

  • IoU: $|M_{\mathrm{pred}} \cap M_{\mathrm{gt}}| / |M_{\mathrm{pred}} \cup M_{\mathrm{gt}}|$ per sample.
  • Cumulative IoU (cIoU): IoU computed from intersections and unions accumulated over samples with at least one referent.
  • Generalized IoU (gIoU): Mean per-sample IoU over all samples, assigning 1 for a correctly predicted no-target and 0 otherwise.
  • No-Target Accuracy (N_acc): Fraction of no-target samples with $y_{\mathrm{pred}} = y_{\mathrm{gt}}$.
  • Precision@$X$: Percentage of samples with IoU $> X$.
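
A minimal sketch of these metrics, assuming binary mask arrays and a GRES-style cumulative accumulation for cIoU; the exact accumulation protocol used in the paper is an assumption here.

```python
# Minimal sketch of the metrics; boolean mask arrays are assumed, and cIoU follows the
# GRES-style cumulative-intersection-over-cumulative-union convention described above.
import numpy as np


def sample_iou(pred, gt):
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0


def evaluate(samples):
    """samples: iterable of (pred_mask, gt_mask, y_pred, y_gt) with boolean masks."""
    inter_sum = union_sum = 0
    g_scores, nt_correct, nt_total = [], 0, 0
    for pred, gt, y_pred, y_gt in samples:
        if y_gt == 0:                                     # no-target sample
            nt_total += 1
            nt_correct += int(y_pred == y_gt)
            g_scores.append(1.0 if y_pred == 0 else 0.0)  # gIoU convention for no-target
        else:
            inter_sum += np.logical_and(pred, gt).sum()
            union_sum += np.logical_or(pred, gt).sum()
            g_scores.append(sample_iou(pred, gt))
    return {
        "cIoU": inter_sum / union_sum if union_sum > 0 else 0.0,
        "gIoU": float(np.mean(g_scores)) if g_scores else 0.0,
        "N_acc": nt_correct / nt_total if nt_total > 0 else 0.0,
    }
```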

Key empirical results on OmniRef (Table 1, (Zheng et al., 7 Dec 2025)):

Method       Text-test (cIoU / gIoU / N_acc)   Visual-test (cIoU / gIoU / N_acc)   Omni-test (cIoU / gIoU / N_acc)
LISA-7B†     64.95 / 66.02 / –                 –                                   –
GSVA-7B†     65.30 / 67.57 / 63.44             –                                   –
ReLA         63.40 / 64.75 / 57.97             –                                   –
VRP-SAM      –                                 55.70 / 52.74 / 78.63               –
DCAMA        –                                 60.23 / 49.91 / 80.46               –
OmniSegNet   64.92 / 66.44 / 62.56             76.63 / 68.87 / 90.81               69.27 / 67.80 / 57.69

† Indicates use of Vicuna-7B as language backbone.

OmniSegNet consistently outperforms MLLM-derived and specialist baselines, achieves cIoU 76.63 and N_acc 90.81 on visual prompts, and maintains competitive performance on text/omni settings. Multi-shot, prompt fusion, and batch ratio ablations demonstrate best performance for mask prompts and 7:2 text:visual training (see Table 2a/2b/8).

6. Comparative Perspective and Extensions

OmniRIS unifies and extends several lines of research:

  • Beyond One-to-One: Supporting zero, one, or many referents per prompt, as advocated in DMMI (Hu et al., 2023) and DeRIS (Dai et al., 2 Jul 2025), with architectures built to handle multi-instance and no-referent segmentation without specialized branches or toggles.
  • Omni-supervised Training: Incorporating labeled, weakly labeled, and unlabeled data, as in Omni-RES (Huang et al., 2023), to harness large-scale vision-language resources via, e.g., teacher-student filtering (APLR), with demonstrated +2-9% mIoU gains at low annotation budgets.
  • Multi-modal Task Generalization: As in UniRef++ (Wu et al., 2023), the core modules can support RIS, few-shot segmentation, video-object segmentation, and various prompt modalities via unified encoding and multiway-fusion blocks.
  • Group-wise and Negative Mining: Insights from GRES/GRSer (Wu et al., 2023) on handling group retrieval, negative cases, and anti-expressions inform OmniRIS's robust negative handling and instance presence prediction.

OmniRIS is positioned as the canonical open-vocabulary, multi-modal, multi-referent segmentation setting.

7. Limitations and Future Directions

Several challenges and open problems remain for OmniRIS (Zheng et al., 7 Dec 2025):

  • Long-Tail and Out-of-Distribution Generalization: Reliance on MSCOCO categories in OmniRef constrains coverage of rare or open-world categories. Visual prompt diversity (scribble generation vs. real user input) may not transfer directly to practical deployments.
  • Computational Complexity: Dual-backbone and multi-decoders incur significant memory and training costs.
  • Scalability: Efficient negative mining, dynamic prompt allocation, and large-scale group processing are research frontiers (see (Wu et al., 2023, Dai et al., 2 Jul 2025)).
  • Integrated Foundation Models: Incorporation of large-scale VLMs (e.g., BLIP, SAM2) is proposed to close the gap on highly open-vocabulary settings; adapting OmniRIS frameworks to spatiotemporal (video), 3D, and self/weakly supervised contexts is a priority.
  • Prompt-Agnostic Training: Future research may unify all prompt modalities, facilitating continuous interpolation between text, sketch, point, box, and mask signals and robust fusion.

Potential applications include interactive image editing, human-computer interaction, image annotation, robotics, and domain-specific few-shot segmentation.


In summary, Omni-Referring Image Segmentation (OmniRIS) specifies a rigorous and highly general paradigm for conditional visual segmentation given omni-modal prompts, unifying textual and visual reference frameworks and robustly supporting a comprehensive spectrum of referent cases, as instantiated by the OmniRef dataset and OmniSegNet model (Zheng et al., 7 Dec 2025).
