
ConverSeg: Benchmark for Conversational Segmentation

Updated 18 February 2026
  • ConverSeg Benchmark is a comprehensive standard for conversational image segmentation that uses natural language prompts to generate pixel-accurate masks.
  • It leverages both human-annotated and AI-generated data across five concept families—Entities, Spatial, Relations, Affordances, and Physics—to challenge models with abstract reasoning tasks.
  • Evaluation metrics such as IoU, gIoU, and mIoU demonstrate that ConverSeg-Net achieves significant improvements, particularly in abstract reasoning categories.

ConverSeg is a benchmark for Conversational Image Segmentation (CIS), a task that requires predicting pixel-accurate binary masks from images given natural-language prompts that can refer to abstract, functional, or relational concepts. In contrast to prior referring image segmentation (RIS) benchmarks, which focus primarily on categorical or spatial queries, ConverSeg evaluates a model’s capacity for grounding higher-order reasoning, including spatial layout, transient interactions, affordances, physical safety, and intent. Developed by Sahoo & Gkioxari, ConverSeg leverages both human-annotated and AI-generated data, aiming to set a new standard for evaluation in vision-language segmentation beyond simple entity localization (Sahoo et al., 13 Feb 2026).

1. Conversational Image Segmentation: Task Definition and Scope

Conversational Image Segmentation (CIS) is formalized as follows: given an image $I$ and a natural-language prompt $p$, the objective is to predict a binary mask $M_p \in \{0,1\}^{H \times W}$ such that the "foreground" pixels $\{(x, y) \mid M_p(x, y) = 1\}$ correspond precisely to the regions in $I$ that satisfy the semantics of $p$. CIS is structurally broader than standard RIS (datasets such as RefCOCO, RefCOCO+, and RefCOCOg, which focus on categories or straightforward spatial relations) in that it extends coverage to abstract, intent-driven prompts involving function, safety, and physical reasoning (e.g., "objects likely to tip over", "surfaces stable enough to stack books").
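To make the task contract concrete, a minimal sketch of the input/output types follows. The interface names are assumptions for illustration, not the authors' API, and the placeholder body returns an all-background mask so the shape and value constraints are explicit:

```python
import numpy as np

def cis_predict(image: np.ndarray, prompt: str) -> np.ndarray:
    """Sketch of the CIS contract: map an (H, W, 3) image and a natural-language
    prompt to a binary mask M_p in {0, 1}^(H x W).

    Placeholder body (an assumption, not a real model): a CIS model would ground
    the prompt's semantics; here we return all background to show the contract.
    """
    h, w = image.shape[:2]
    return np.zeros((h, w), dtype=np.uint8)

def is_valid_cis_mask(mask: np.ndarray, image: np.ndarray) -> bool:
    """Verify M_p has the image's spatial resolution and only values in {0, 1}."""
    return mask.shape == image.shape[:2] and set(np.unique(mask)) <= {0, 1}
```

Note that an all-zero mask is itself a legal output: it is the expected answer for a negative prompt whose referent is absent from the image.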

2. Conceptual Families and Predicate-Grounded Segmentation

ConverSeg organizes its evaluation around five concept families, each probing a distinct axis of visual reasoning. For a prompt $p$ of category $c \in \mathcal{C} = \{\mathrm{Ent}, \mathrm{Spat}, \mathrm{Rel}, \mathrm{Aff}, \mathrm{Phys}\}$, the objective is the segmentation $S_c(I, p) = \{x \in I \mid \mathrm{predicate}_c(x, p)\}$, conditioning on:

  • Entities (Ent): open-vocabulary identification (e.g., “x is an instance of a ‘weathered wooden furniture’”)
  • Spatial Layout (Spat): complex geometric/occupancy (e.g., “x blocks the walkway”)
  • Relations/Events (Rel): transient interactions/states (e.g., “x participates in ‘player about to catch the ball’”)
  • Affordances/Functions (Aff): use-case, functional reasoning (e.g., “x could serve as a shovel”)
  • Physics/Safety (Phys): stability, support, hazard (e.g., “x poses a sharp-object hazard”)

This structure is designed to mirror the semantic variety and difficulty of natural human conversation about scenes.

| Family | Example Predicate | Example Prompt |
|---|---|---|
| Ent | is an instance of furniture | "Segment the orange plastic watering can." |
| Spat | blocks the walkway | "Which items are blocking the walkway?" |
| Rel | participates in catching | "The player about to catch the ball" |
| Aff | affords cutting | "Surfaces suitable for hot cookware" |
| Phys | likely to tip over | "Objects likely to tip over if nudged" |

3. Dataset Composition and AI-Powered Data Engine

The ConverSeg evaluation benchmark is composed of 1,687 prompt-mask pairs drawn from approximately 600 COCO validation images. The dataset includes two splits:

  • Human-annotated split: 493 examples (using COCO panoptic/instance masks)
  • SAM-seeded split: 1,194 examples (leveraging SAM2 and a detector)

Prompts average 7.6 words with a standard deviation of approximately 1.2. Each image typically yields about 2.8 prompts, reflecting multiple valid queries per image. Examples are distributed roughly uniformly across the five concept families (approximately 20% each).

Training data is synthesized via an AI-powered “data engine”, generating 106,000 positive prompt-mask pairs and an equal number of negatives (prompts with empty masks). The pipeline consists of five stages:

  1. Scene understanding for region descriptions via Gemini-2.5-Flash
  2. Mask generation through Moondream3 detector and SAM2
  3. Mask verification using text-mask consistency checks and SAM2-based refinement
  4. Concept-driven prompt generation using meta-prompts for each family and mask selection
  5. Prompt-mask alignment verification with VLM rejection/acceptance

Negatives are created via concept-specific adversarial prompts (absent or wrong attributes), and human verification on the ConverSeg benchmark is performed with one-click validation for each triplet $(I, p, \hat{m})$.
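Under stated assumptions, the five-stage engine can be sketched as a composable pipeline. Every stage function below is a placeholder standing in for the components named above (a scene-description VLM, a detector plus promptable segmenter, consistency checks, meta-prompted generation, and a VLM accept/reject gate); none of these names come from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Sample:
    image_id: str
    prompt: str
    mask: Optional[object]  # None would encode an empty (negative) mask

def run_engine(image_id: str,
               describe: Callable,       # stage 1: region descriptions (VLM)
               propose_masks: Callable,  # stage 2: detector + segmenter
               verify_mask: Callable,    # stage 3: text-mask consistency check
               write_prompt: Callable,   # stage 4: concept-driven prompt generation
               accept: Callable          # stage 5: prompt-mask alignment gate
               ) -> List[Sample]:
    """Hypothetical skeleton of the five-stage data engine (not the authors' code)."""
    samples: List[Sample] = []
    for region in describe(image_id):
        for mask in propose_masks(image_id, region):
            if not verify_mask(region, mask):
                continue  # stage 3 rejects inconsistent masks
            prompt = write_prompt(region)
            if accept(prompt, mask):
                samples.append(Sample(image_id, prompt, mask))
    return samples
```

The value of this shape is that each stage is independently swappable: any of the underlying models can be replaced without changing the pipeline contract.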

4. Evaluation Metrics and Protocol

Benchmark evaluation separates the SAM-seeded and human-annotated splits. The principal metric is Intersection over Union (IoU), defined as

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the predicted mask and $G$ is the ground-truth mask. The mean IoU (mIoU) across $N$ queries is

$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i$$

Generalized IoU (gIoU) and cumulative IoU (cIoU) are also reported; the headline results below are given as gIoU.
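The two formulas above translate directly into code for binary masks. One detail the text does not specify is the score when both masks are empty (e.g., a correctly rejected negative prompt); scoring that case as 1.0 is a convention assumed here:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU(P, G) = |P ∩ G| / |P ∪ G| for binary (H, W) masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: score 1.0 by convention (an assumption,
        # not specified in the source).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def miou(preds, gts) -> float:
    """Mean IoU over N (prediction, ground-truth) query pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```

For example, a prediction covering half of an overlapping ground-truth region with no spurious pixels outside it scores IoU = intersection/union as expected.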

5. Example Queries and Groundings

Representative queries demonstrate the diversity of reasoning challenges inherent to the benchmark:

  • Ent: “Segment the orange plastic watering can” yields a mask on the watering can object.
  • Spat: “Which items are blocking the walkway?” segments scattered shoes/boxes that obstruct the path.
  • Rel: “The player about to catch the ball” localizes both the player’s hands and the ball.
  • Aff: “Surfaces suitable for hot cookware” isolates granite countertops, excluding wooden tables.
  • Phys: “Objects likely to tip over if nudged” selects tall, narrow bottles and unbalanced vases.

These examples illustrate the non-trivial mapping between natural-language intent and visual mask, particularly for abstract or functional queries with no clear category correlation.

6. Baselines and ConverSeg-Net Results

Performance of the baselines and ConverSeg-Net on the SAM-seeded split is as follows:

| Model | Overall gIoU (%) |
|---|---|
| LISA (Llama2-13B) | 55.2 |
| UniLSeg-20 (CLIP ViT-B) | 32.6 |
| EVF-SAM (BEIT-3) | 47.7 |
| Seg-Zero (Qwen2.5-VL-7B) | 69.2 |
| ConverSeg-Net (3B) | 70.8 |
| ConverSeg-Net (7B) | 72.4 |

Per-family gIoU on SAM-seeded split:

| Family | gIoU (%) |
|---|---|
| Entities | 74.0 |
| Spatial | 70.9 |
| Relations | 74.1 |
| Affordances | 68.7 |
| Physics | 64.2 |

Key observations: baselines perform strongly on Entities and Spatial prompts but decline markedly on Physics and Affordances. ConverSeg-Net's two-phase curriculum, which introduces conversational data in Phase 2, delivers its largest gains on the most abstract concept families, reducing the Ent–Phys gap from approximately 24 percentage points to 9. Scaling the vision-LLM from 3B to 7B yields a further 1.6-percentage-point improvement overall.

7. Significance and Implications

ConverSeg advances the state of vision-language segmentation by enforcing comprehensive concept coverage, rigorously testing physical and functional reasoning, and providing scalable, largely automated supervision through its AI-powered data engine. The benchmark's balanced design and high-quality prompt-mask alignments allow detailed study of model generalization not just to new entities but to novel forms of reasoning, mirroring conversational understanding and scene interpretation. This shifts the focus toward models' internal reasoning about function, interaction, and intent, moving beyond static reference-based segmentation.

ConverSeg, CIS, and ConverSeg-Net collectively delineate new evaluation and training standards for conversational grounding of abstract visual concepts, providing a foundation for subsequent research in functional, physical, and interactive visual reasoning (Sahoo et al., 13 Feb 2026).
