
Segmentation Hallucination with Counterfactual Edits

Updated 6 February 2026
  • The paper introduces HalluSegBench, a benchmark that uses controlled counterfactual edits to rigorously evaluate segmentation hallucination in vision-language models.
  • It defines and employs metrics like delta-IoU and the Confusion Mask Score (CMS) to differentiate between language-driven and vision-driven hallucinations.
  • The evaluation reveals that while models can suppress label-induced hallucinations, mitigating visually plausible yet incorrect masks remains a significant challenge.

Segmentation hallucination in vision–LLMs refers to the phenomenon in which a model produces a segmentation mask for an object that is absent from the scene or assigns a plausibly shaped mask to an irrelevant or mislabeled region. This failure undermines grounded visual understanding and complicates applications requiring reliable spatial reference resolution. There are two dominant forms: (1) object-level hallucination, where a model segments a class not present in the scene, and (2) pixel-grounding hallucination, involving spatially plausible but semantically incorrect masks. The challenge of diagnosing and mitigating such hallucinations has driven the development of counterfactual visual reasoning benchmarks, most prominently exemplified by HalluSegBench, which introduces controlled, instance-level counterfactual edits as the foundation for segmentation hallucination evaluation (Li et al., 26 Jun 2025).

1. Motivation and Conceptual Foundations

Prior evaluation protocols for segmentation hallucination have predominantly perturbed only the text prompt—such as label-swapping—which models may trivially reject by aligning prediction to the prompt text, without evidence of true visual grounding. Such approaches insufficiently probe whether a model perceptually attends to the image content or merely exploits language–vision priors. Counterfactual visual reasoning, introduced by HalluSegBench, instead constructs visually coherent scene edits by replacing a target object with a similar but semantically distinct one. This setup forces models to differentiate between absence due to label change and absence due to true visual removal, operationalizing the distinction between language-driven and vision-driven hallucination. Grounded segmentation fidelity is thus rigorously assessed by evaluating model predictions when an object is visually and contextually removed or replaced (Li et al., 26 Jun 2025).

2. Dataset Construction: Factual–Counterfactual Pairs

The HalluSegBench benchmark comprises 1,340 factual–counterfactual image pairs spanning 281 unique object classes. Base images and original segmentation masks are sourced from the RefCOCO validation and test splits, amounting to 2,342 images and 2,680 masks. For each benchmark instance, the construction protocol proceeds as follows:

  1. For each image–mask annotation, a change-instruction (e.g., “Change the blue bus to a yellow taxi”) is generated via GPT-4o.
  2. A constrained mask-guided edit, performed with a GPT-4o-guided diffusion model, replaces the target object to yield a visually coherent counterfactual image $I'$.
  3. The original mask $M_c$ is retained for the factual instance. A new mask $M'_{c'}$ for the replacement object is generated using Grounded SAM, with manual filtering to ensure quality.

Instance-level substitutions focus on visually and semantically similar class pairs (e.g., bus ↔ taxi, elephant ↔ rhinoceros), maintaining global scene layout and object scale. Object masks typically occupy 5–10% of the image area, reflecting real-world scene statistics and increasing segmentation difficulty. Only single-object replacement is considered, with all other scene elements preserved, providing controlled conditions for isolating hallucination drivers.
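For concreteness, one benchmark instance produced by the three-step pipeline above can be modeled as a small record type. The field names here are illustrative, not the official dataset schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CounterfactualPair:
    """One factual-counterfactual instance (illustrative schema, not official)."""
    image_factual: np.ndarray         # I:  original RefCOCO image
    image_counterfactual: np.ndarray  # I': mask-guided edit of I
    class_factual: str                # c,  e.g. "blue bus"
    class_counterfactual: str         # c', e.g. "yellow taxi"
    mask_factual: np.ndarray          # M_c: original annotated mask (boolean)
    mask_counterfactual: np.ndarray   # M'_{c'}: Grounded-SAM mask of the replacement
```

Because only a single object changes between `image_factual` and `image_counterfactual`, any difference in model behavior across the pair can be attributed to that one edit.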

3. Hallucination Metrics

HalluSegBench introduces two primary families of hallucination metrics: consistency-based delta-IoU and the Confusion Mask Score (CMS).

Consistency-Based Delta-IoU Metrics

For each model $f$, four predicted masks are obtained per instance:

  • $\hat M_c = f(I, c)$: factual image, correct class
  • $\hat M_{c'} = f(I, c')$: factual image, wrong class
  • $\hat M'_c = f(I', c)$: counterfactual image, original (removed) class
  • $\hat M'_{c'} = f(I', c')$: counterfactual image, correct replacement class

Three IoU scores are defined:

$$\mathrm{IoU}_{\mathrm{fact}} \equiv \mathrm{IoU}(M_c, \hat M_c), \quad \mathrm{IoU}_{\mathrm{textual}} \equiv \mathrm{IoU}(M_c, \hat M_{c'}), \quad \mathrm{IoU}_{\mathrm{visual}} \equiv \mathrm{IoU}(M'_{c'}, \hat M'_c)$$

From these, delta-IoU metrics quantify hallucination sensitivity:

$$\Delta\mathrm{IoU}_{\mathrm{textual}} = \mathrm{IoU}_{\mathrm{fact}} - \mathrm{IoU}_{\mathrm{textual}}$$

$$\Delta\mathrm{IoU}_{\mathrm{visual}} = \mathrm{IoU}_{\mathrm{fact}} - \mathrm{IoU}_{\mathrm{visual}}$$

A larger $\Delta\mathrm{IoU}_{\mathrm{textual}}$ indicates successful mask suppression when the input label is wrong (lower language-driven hallucination); a larger $\Delta\mathrm{IoU}_{\mathrm{visual}}$ indicates suppression when the object is visually absent (lower vision-driven hallucination).
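The three IoU scores and the two deltas translate directly into boolean-mask arithmetic. The sketch below is a minimal NumPy illustration of the definitions above, not the official evaluation code:

```python
import numpy as np

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(gt, pred).sum()) / float(union)

def delta_ious(m_c, m_cc, pred_fact, pred_textual, pred_visual):
    """Consistency-based delta-IoU hallucination metrics.

    m_c:          ground-truth factual mask M_c
    m_cc:         ground-truth counterfactual mask M'_{c'}
    pred_fact:    f(I, c)   -- factual image, correct class
    pred_textual: f(I, c')  -- factual image, wrong class
    pred_visual:  f(I', c)  -- counterfactual image, removed class
    """
    iou_fact = iou(m_c, pred_fact)
    delta_textual = iou_fact - iou(m_c, pred_textual)   # language sensitivity
    delta_visual = iou_fact - iou(m_cc, pred_visual)    # vision sensitivity
    return delta_textual, delta_visual
```

An ideal model predicts an accurate mask on the factual query and an empty mask on both perturbed queries, driving both deltas toward $\mathrm{IoU}_{\mathrm{fact}}$.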

Confusion Mask Score (CMS) and Contrastive CMS (CCMS)

In hallucination settings (where the queried class does not exist in the image or was removed), traditional overlap metrics are degenerate. The Confusion Mask Score is defined as:

$$\mathrm{CMS} = \frac{\alpha\,|C| + |N|}{\alpha\,|M_{\text{present}}|}$$

where $C = \hat M \cap M_{\text{present}}$ is the predicted area overlapping any distractor's mask, $N = \hat M \setminus M_{\text{present}}$ is the non-overlapping (stray) area, and $\alpha > 1$ (chosen as $\alpha = 3$) penalizes overlap with distractors.

CMS is computed in two settings:

  • $\mathrm{CMS}_{\mathrm{fact}}$: factual image with wrong class, $(I, c')$
  • $\mathrm{CMS}_{\mathrm{counterfact}}$: counterfactual image with removed class, $(I', c)$

The Contrastive CMS (CCMS) is their ratio:

$$\mathrm{CCMS} = \frac{\mathrm{CMS}_{\mathrm{fact}}}{\mathrm{CMS}_{\mathrm{counterfact}}}$$

CCMS $> 1$ indicates higher language-driven hallucination; CCMS $< 1$ indicates greater susceptibility to vision-driven hallucination.
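The CMS formula above also reduces to boolean-mask arithmetic. In this sketch (again illustrative, not the official code), `distractor_masks` are the ground-truth masks of objects actually present in the image:

```python
import numpy as np

def confusion_mask_score(pred, distractor_masks, alpha=3.0):
    """CMS = (alpha*|C| + |N|) / (alpha*|M_present|).

    M_present: union of ground-truth masks of objects actually present.
    C: predicted area overlapping a distractor; N: stray predicted area.
    """
    m_present = np.logical_or.reduce(np.stack(distractor_masks), axis=0)
    c = np.logical_and(pred, m_present).sum()
    n = np.logical_and(pred, ~m_present).sum()
    denom = alpha * m_present.sum()
    return float(alpha * c + n) / denom if denom > 0 else 0.0

def contrastive_cms(cms_fact, cms_counterfact):
    """CCMS > 1: language-driven hallucination dominates; < 1: vision-driven."""
    return cms_fact / cms_counterfact
```

Since the queried class is absent in both CMS settings, any nonzero prediction is hallucinated; weighting distractor overlap by $\alpha$ makes "latching onto the wrong object" costlier than diffuse stray area.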

4. Model Evaluation Protocol

HalluSegBench evaluates six contemporary vision–language segmentation models: LISA (7B/13B), PixelLM (7B/13B), GLaMM-7B, and SESAME-7B (which incorporates explicit hallucination mitigation). Each model is tested on all 1,340 factual–counterfactual pairs using the following protocol:

  1. Query the model with $(I, c)$, $(I, c')$, $(I', c)$, and $(I', c')$.
  2. Compute all consistency and confusion metrics.
  3. Aggregate results across the dataset, and further stratify by object size (small, medium, large) and by object class.

This comprehensive approach enables comparative analysis of language-driven versus vision-driven hallucinations and facilitates detailed breakdowns of model behavior under controlled perturbations.
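Step 1 of the protocol can be sketched as a single query loop. Here `model` is a placeholder callable mapping an image and a class label to a boolean mask, standing in for any referring-segmentation model under test:

```python
def query_protocol(model, img_fact, img_cf, cls_fact, cls_cf):
    """Collect the four predicted masks for one factual-counterfactual pair.

    model(image, label) -> boolean mask is a stand-in interface for the
    evaluated systems (LISA, PixelLM, GLaMM, SESAME); the real models are
    queried through their own segmentation APIs.
    """
    return {
        "fact":           model(img_fact, cls_fact),  # f(I, c)
        "textual":        model(img_fact, cls_cf),    # f(I, c')
        "visual":         model(img_cf, cls_fact),    # f(I', c)
        "counterfactual": model(img_cf, cls_cf),      # f(I', c')
    }
```

The four predictions then feed the delta-IoU and CMS computations of Section 3 before aggregation and stratification.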

5. Experimental Findings

Key quantitative results include:

| Metric | Model | Value | Interpretation |
| --- | --- | --- | --- |
| $\Delta\mathrm{IoU}_{\mathrm{textual}}$ | LISA-13B | 0.4591 | Best label sensitivity; suppresses mask given a wrong label |
| $\Delta\mathrm{IoU}_{\mathrm{visual}}$ | PixelLM-13B | 0.4273 | Best vision sensitivity; suppresses mask when the object is absent |
| Factual CMS | SESAME-7B | 0.1983 | Lowest language-driven hallucination |
| Counterfactual CMS | SESAME-7B | 0.4304 | Lowest vision-driven hallucination |
| mIoU | PixelLM-13B | 0.7240 | State-of-the-art segmentation quality |
| mIoU | SESAME-7B | 0.5773 | Trades segmentation quality for hallucination suppression via abstention |

All models consistently show $\Delta\mathrm{IoU}_{\mathrm{visual}} < \Delta\mathrm{IoU}_{\mathrm{textual}}$, indicating that vision-driven hallucinations are more challenging to eliminate than language-driven ones. SESAME-7B minimizes hallucination metrics, but at the cost of lower overall segmentation quality due to frequent abstention. LISA and PixelLM occasionally confuse visually similar classes (e.g., sheep and cow), producing partial masks aligned to the wrong objects. Qualitatively, most models hallucinate masks in the replaced region under $(I', c)$ (object absent), and sometimes latch onto distractors under label mismatch.

6. Strengths, Limitations, and Prospects

Strengths

  • HalluSegBench is the first pixel-level benchmark for counterfactual visual reasoning in segmentation.
  • The controlled design with per-instance single-object substitutions sharply disentangles vision- from language-driven hallucination.
  • Comprehensive metrics (ΔIoU, CMS, CCMS) provide orthogonal diagnostic views.
  • The benchmark spans 281 classes and realistic object scales.

Limitations

  • Restricted to single-object replacements; compositional or multi-object counterfactuals (e.g., color/size/attribute changes) are not addressed.
  • Visual fidelity of edits is dependent on the absence of subtle artifacts, with any mis-edits potentially confounding evaluation.
  • Complex referring expressions and compositional prompts (e.g., spatial relationships) are not systematically tested.
  • Current metrics presume a single removed object; generalization requires more complex formulations.

Future Directions

Potential directions include (1) extending the benchmark to support multi-object and attribute-level counterfactuals, (2) data augmentation with counterfactual supervision at training time to mitigate model hallucination, (3) development of artifact-aware editing pipelines, and (4) integration of CCMS with causal inference methods to formally disentangle vision and language reliance in segmentation masking (Li et al., 26 Jun 2025). A plausible implication is that explicit counterfactual supervision or evaluation may be required for robust grounding in future vision–language architectures.
