
ConverSeg: Benchmark for Conversational Segmentation

Updated 18 February 2026
  • ConverSeg Benchmark is a comprehensive standard for conversational image segmentation that uses natural language prompts to generate pixel-accurate masks.
  • It leverages both human-annotated and AI-generated data across five concept families—Entities, Spatial, Relations, Affordances, and Physics—to challenge models with abstract reasoning tasks.
  • Evaluation metrics such as IoU, gIoU, and mIoU demonstrate that ConverSeg-Net achieves significant improvements, particularly in abstract reasoning categories.

ConverSeg is a benchmark for Conversational Image Segmentation (CIS), a task that requires predicting pixel-accurate binary masks from images given natural-language prompts that can refer to abstract, functional, or relational concepts. In contrast to prior referring image segmentation (RIS) benchmarks, which focus primarily on categorical or spatial queries, ConverSeg evaluates a model’s capacity for grounding higher-order reasoning, including spatial layout, transient interactions, affordances, physical safety, and intent. Developed by Sahoo & Gkioxari, ConverSeg leverages both human-annotated and AI-generated data, aiming to set a new standard for evaluation in vision-language segmentation beyond simple entity localization (Sahoo et al., 13 Feb 2026).

1. Conversational Image Segmentation: Task Definition and Scope

Conversational Image Segmentation (CIS) is formalized as follows: given an image $I$ and a natural-language prompt $p$, the objective is to predict a binary mask $M_p \in \{0,1\}^{H \times W}$ such that the "foreground" pixels $\{(x, y) \mid M_p(x, y) = 1\}$ correspond precisely to the regions in $I$ that satisfy the semantics of $p$. CIS is structurally broader than standard RIS (datasets such as RefCOCO, RefCOCO+, and RefCOCOg, which focus on categories or straightforward spatial relations) in that it extends coverage to abstract, intent-driven prompts involving function, safety, and physical reasoning (e.g., "objects likely to tip over", "surfaces stable enough to stack books").
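To make the task contract concrete, a minimal sketch of the input/output types follows. The interface names are assumptions for illustration, not the authors' API, and the placeholder body returns an all-background mask so the shape and value constraints are explicit:

```python
import numpy as np

def cis_predict(image: np.ndarray, prompt: str) -> np.ndarray:
    """Sketch of the CIS contract: map an (H, W, 3) image and a natural-language
    prompt to a binary mask M_p in {0, 1}^(H x W).

    Placeholder body (an assumption, not a real model): a CIS model would ground
    the prompt's semantics; here we return all background to show the contract.
    """
    h, w = image.shape[:2]
    return np.zeros((h, w), dtype=np.uint8)

def is_valid_cis_mask(mask: np.ndarray, image: np.ndarray) -> bool:
    """Verify M_p has the image's spatial resolution and only values in {0, 1}."""
    return mask.shape == image.shape[:2] and set(np.unique(mask)) <= {0, 1}
```

Note that an all-zero mask is itself a legal output: it is the expected answer for a negative prompt whose referent is absent from the image.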

2. Conceptual Families and Predicate-Grounded Segmentation

ConverSeg organizes its evaluation around five concept families, each probing a distinct axis of visual reasoning. For a prompt $p$ of category $c \in \mathcal{C} = \{\mathrm{Ent}, \mathrm{Spat}, \mathrm{Rel}, \mathrm{Aff}, \mathrm{Phys}\}$, the objective is the segmentation $S_c(I, p) = \{x \in I \mid \mathrm{predicate}_c(x, p)\}$, conditioning on:

  • Entities (Ent): open-vocabulary identification (e.g., “x is an instance of a ‘weathered wooden furniture’”)
  • Spatial Layout (Spat): complex geometric/occupancy (e.g., “x blocks the walkway”)
  • Relations/Events (Rel): transient interactions/states (e.g., “x participates in ‘player about to catch the ball’”)
  • Affordances/Functions (Aff): use-case, functional reasoning (e.g., “x could serve as a shovel”)
  • Physics/Safety (Phys): stability, support, hazard (e.g., “x poses a sharp-object hazard”)

This structure is designed to mirror the semantic variety and difficulty of natural human conversation about scenes.

| Family | Example Predicate | Example Prompt |
|---|---|---|
| Ent | is an instance of furniture | "Segment the orange plastic watering can." |
| Spat | blocks the walkway | "Which items are blocking the walkway?" |
| Rel | participates in catching | "The player about to catch the ball" |
| Aff | affords cutting | "Surfaces suitable for hot cookware" |
| Phys | likely to tip over | "Objects likely to tip over if nudged" |

3. Dataset Composition and AI-Powered Data Engine

The ConverSeg evaluation benchmark is composed of 1,687 prompt-mask pairs drawn from approximately 600 COCO validation images. The dataset includes two splits:

  • Human-annotated split: 493 examples (using COCO panoptic/instance masks)
  • SAM-seeded split: 1,194 examples (leveraging SAM2 and a detector)

Prompts average 7.6 words with a standard deviation of approximately 1.2. Each image typically yields about 2.8 prompts, reflecting multiple valid queries per image. Examples are distributed roughly uniformly across the five concept families (approximately 20% each).

Training data is synthesized via an AI-powered “data engine”, generating 106,000 positive prompt-mask pairs and an equal number of negatives (prompts with empty masks). The pipeline consists of five stages:

  1. Scene understanding for region descriptions via Gemini-2.5-Flash
  2. Mask generation through Moondream3 detector and SAM2
  3. Mask verification using text-mask consistency checks and SAM2-based refinement
  4. Concept-driven prompt generation using meta-prompts for each family and mask selection
  5. Prompt-mask alignment verification with VLM rejection/acceptance

Negatives are created via concept-specific adversarial prompts (absent or wrong attributes), and human verification on the ConverSeg benchmark is performed with one-click validation for each triplet $(I, p, \hat{m})$.
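Under stated assumptions, the five-stage engine can be sketched as a composable pipeline. Every stage function below is a placeholder standing in for the components named above (a scene-description VLM, a detector plus promptable segmenter, consistency checks, meta-prompted generation, and a VLM accept/reject gate); none of these names come from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Sample:
    image_id: str
    prompt: str
    mask: Optional[object]  # None would encode an empty (negative) mask

def run_engine(image_id: str,
               describe: Callable,       # stage 1: region descriptions (VLM)
               propose_masks: Callable,  # stage 2: detector + segmenter
               verify_mask: Callable,    # stage 3: text-mask consistency check
               write_prompt: Callable,   # stage 4: concept-driven prompt generation
               accept: Callable          # stage 5: prompt-mask alignment gate
               ) -> List[Sample]:
    """Hypothetical skeleton of the five-stage data engine (not the authors' code)."""
    samples: List[Sample] = []
    for region in describe(image_id):
        for mask in propose_masks(image_id, region):
            if not verify_mask(region, mask):
                continue  # stage 3 rejects inconsistent masks
            prompt = write_prompt(region)
            if accept(prompt, mask):
                samples.append(Sample(image_id, prompt, mask))
    return samples
```

The value of this shape is that each stage is independently swappable: any of the underlying models can be replaced without changing the pipeline contract.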

4. Evaluation Metrics and Protocol

Benchmark evaluation separates the SAM-seeded and human-annotated splits. The principal metric is Intersection over Union (IoU), defined as

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the predicted mask and $G$ is the ground-truth mask. The mean IoU (mIoU) across $N$ queries is

$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i$$

Generalized IoU (gIoU) and cumulative IoU (cIoU) are also reported; the headline results below are given as gIoU.
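The two formulas above translate directly into code for binary masks. One detail the text does not specify is the score when both masks are empty (e.g., a correctly rejected negative prompt); scoring that case as 1.0 is a convention assumed here:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU(P, G) = |P ∩ G| / |P ∪ G| for binary (H, W) masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: score 1.0 by convention (an assumption,
        # not specified in the source).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def miou(preds, gts) -> float:
    """Mean IoU over N (prediction, ground-truth) query pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```

For example, a prediction covering half of an overlapping ground-truth region with no spurious pixels outside it scores IoU = intersection/union as expected.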

5. Example Queries and Groundings

Representative queries demonstrate the diversity of reasoning challenges inherent to the benchmark:

  • Ent: “Segment the orange plastic watering can” yields a mask on the watering can object.
  • Spat: “Which items are blocking the walkway?” segments scattered shoes/boxes that obstruct the path.
  • Rel: “The player about to catch the ball” localizes both the player’s hands and the ball.
  • Aff: “Surfaces suitable for hot cookware” isolates granite countertops, excluding wooden tables.
  • Phys: “Objects likely to tip over if nudged” selects tall, narrow bottles and unbalanced vases.

These examples illustrate the non-trivial mapping between natural-language intent and visual mask, particularly for abstract or functional queries with no clear category correlation.

6. Baselines and ConverSeg-Net Results

Performance of the baselines and ConverSeg-Net on the SAM-seeded split is as follows:

| Model | Overall gIoU (%) |
|---|---|
| LISA (Llama2-13B) | 55.2 |
| UniLSeg-20 (CLIP ViT-B) | 32.6 |
| EVF-SAM (BEIT-3) | 47.7 |
| Seg-Zero (Qwen2.5-VL-7B) | 69.2 |
| ConverSeg-Net (3B) | 70.8 |
| ConverSeg-Net (7B) | 72.4 |

Per-family gIoU on SAM-seeded split:

| Family | gIoU (%) |
|---|---|
| Entities | 74.0 |
| Spatial | 70.9 |
| Relations | 74.1 |
| Affordances | 68.7 |
| Physics | 64.2 |

Key observations: baselines perform strongly on Entities and Spatial prompts but decline markedly on Physics and Affordances. ConverSeg-Net's two-phase curriculum, which introduces conversational data in Phase 2, delivers its largest gains on the most abstract concept families, reducing the Ent–Phys gap from approximately 24 percentage points to 9. Scaling the vision-LLM from 3B to 7B yields a further 1.6-percentage-point improvement overall.

7. Significance and Implications

ConverSeg advances the state of vision-language segmentation by enforcing comprehensive concept coverage, rigorously testing physical and functional reasoning, and providing scalable, largely automated supervision through its AI-powered data engine. The benchmark's balanced design and high-quality prompt-mask alignments allow detailed study of model generalization not just to new entities but to novel forms of reasoning, mirroring conversational understanding and scene interpretation. This shifts the focus toward models' internal reasoning about function, interaction, and intent, moving beyond static reference-based segmentation.

ConverSeg, CIS, and ConverSeg-Net collectively delineate new evaluation and training standards for conversational grounding of abstract visual concepts, providing a foundation for subsequent research in functional, physical, and interactive visual reasoning (Sahoo et al., 13 Feb 2026).
