Segmentation Hallucination with Counterfactual Edits
- The paper introduces HalluSegBench, a benchmark that uses controlled counterfactual edits to rigorously evaluate segmentation hallucination in vision-language models.
- It defines and employs metrics like delta-IoU and the Confusion Mask Score (CMS) to differentiate between language-driven and vision-driven hallucinations.
- The evaluation reveals that while models can suppress label-induced hallucinations, mitigating visually plausible yet incorrect masks remains a significant challenge.
Segmentation hallucination in vision–LLMs refers to the phenomenon in which a model produces a segmentation mask for an object that is absent from the scene or assigns a plausibly shaped mask to an irrelevant or mislabeled region. This failure undermines grounded visual understanding and complicates applications requiring reliable spatial reference resolution. There are two dominant forms: (1) object-level hallucination, where a model segments a class not present in the scene, and (2) pixel-grounding hallucination, involving spatially plausible but semantically incorrect masks. The challenge of diagnosing and mitigating such hallucinations has driven the development of counterfactual visual reasoning benchmarks, most prominently exemplified by HalluSegBench, which introduces controlled, instance-level counterfactual edits as the foundation for segmentation hallucination evaluation (Li et al., 26 Jun 2025).
1. Motivation and Conceptual Foundations
Prior evaluation protocols for segmentation hallucination have predominantly perturbed only the text prompt—such as label-swapping—which models may trivially reject by aligning prediction to the prompt text, without evidence of true visual grounding. Such approaches insufficiently probe whether a model perceptually attends to the image content or merely exploits language–vision priors. Counterfactual visual reasoning, introduced by HalluSegBench, instead constructs visually coherent scene edits by replacing a target object with a similar but semantically distinct one. This setup forces models to differentiate between absence due to label change and absence due to true visual removal, operationalizing the distinction between language-driven and vision-driven hallucination. Grounded segmentation fidelity is thus rigorously assessed by evaluating model predictions when an object is visually and contextually removed or replaced (Li et al., 26 Jun 2025).
2. Dataset Construction: Factual–Counterfactual Pairs
The HalluSegBench benchmark comprises 1,340 factual–counterfactual image pairs spanning 281 unique object classes. Base images and original segmentation masks are sourced from the RefCOCO validation and test splits, amounting to 2,342 images and 2,680 masks. For each benchmark instance, the construction protocol proceeds as follows:
- For each image–mask annotation, a change-instruction (e.g., “Change the blue bus to a yellow taxi”) is generated via GPT-4o.
- A constrained, mask-guided diffusion edit replaces the target object, yielding a visually coherent counterfactual image.
- The original mask is retained for the factual instance. A new mask for the replacement object is generated using Grounded SAM, with manual filtering to ensure quality.
Instance-level substitutions focus on visually and semantically similar class pairs (e.g., bus ↔ taxi, elephant ↔ rhinoceros), maintaining global scene layout and object scale. Object masks typically occupy 5–10% of the image area, reflecting real-world scene statistics and increasing segmentation difficulty. Only single-object replacement is considered, with all other scene elements preserved, providing controlled conditions for isolating hallucination drivers.
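The construction above pairs each factual instance with its edited counterpart and four evaluation queries. A minimal sketch of one such record, with illustrative field names that are not the benchmark's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one HalluSegBench instance; field names
# are illustrative, not the benchmark's released format.
@dataclass
class CounterfactualPair:
    factual_image: str         # path to the original RefCOCO image
    counterfactual_image: str  # path to the diffusion-edited image
    factual_class: str         # e.g. "bus"
    replacement_class: str     # e.g. "taxi" (visually similar, semantically distinct)
    factual_mask: str          # original RefCOCO mask
    counterfactual_mask: str   # Grounded SAM mask for the replacement object

    def queries(self):
        """The four (image, class) probes issued per instance."""
        return [
            (self.factual_image, self.factual_class),            # correct grounding
            (self.factual_image, self.replacement_class),        # wrong label
            (self.counterfactual_image, self.factual_class),     # object removed
            (self.counterfactual_image, self.replacement_class), # new object
        ]
```

A single instance thus drives both the factual and counterfactual sides of every metric, which is what keeps the comparison controlled.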
3. Hallucination Metrics
HalluSegBench introduces two primary families of hallucination metrics: consistency-based delta-IoU and the Confusion Mask Score (CMS).
Consistency-Based Delta-IoU Metrics
For each model, four predicted masks are obtained per instance (writing I for the factual image, I′ for the counterfactual image, ℓ for the original class label, and ℓ′ for the replacement label):
- M_ff: factual image I, correct class ℓ
- M_fc: factual image I, wrong class ℓ′
- M_cf: counterfactual image I′, original class ℓ
- M_cc: counterfactual image I′, correct replacement class ℓ′
Three IoU scores are defined against the ground-truth mask G of the queried region:
- IoU_ff = IoU(M_ff, G): factual image, correct class
- IoU_fc = IoU(M_fc, G): factual image, wrong class
- IoU_cf = IoU(M_cf, G): counterfactual image, original class
From these, ΔIoU metrics quantify hallucination sensitivity:
- ΔIoU_label = IoU_ff − IoU_fc
- ΔIoU_vision = IoU_ff − IoU_cf
A larger ΔIoU_label indicates successful suppression when the input label is wrong (lower language-driven hallucination); a larger ΔIoU_vision indicates suppression when the object is absent (lower vision-driven hallucination).
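The ΔIoU quantities above can be sketched directly from boolean masks. This is a minimal illustration, not the benchmark's reference implementation:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks; an empty union scores 0."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def delta_iou(iou_ff: float, iou_fc: float, iou_cf: float) -> dict:
    """Sketch of the two ΔIoU sensitivities: how much the score drops
    when the label is wrong vs. when the object is absent."""
    return {
        "delta_label": iou_ff - iou_fc,   # language-driven sensitivity
        "delta_vision": iou_ff - iou_cf,  # vision-driven sensitivity
    }
```

A well-grounded model keeps `iou_ff` high while `iou_fc` and `iou_cf` collapse toward zero, making both deltas large.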
Confusion Mask Score (CMS) and Contrastive CMS (CCMS)
In hallucination settings (where the queried class does not exist in the image or was removed), traditional overlap metrics are degenerate. The Confusion Mask Score therefore decomposes the predicted mask into A_ov, the area overlapping any distractor object's mask, and A_nov, the non-overlapping area, and combines them with a weight λ (fixed in the benchmark) that penalizes overlap with distractors.
CMS is computed in two settings:
- CMS_f: factual image queried with the wrong class
- CMS_cf: counterfactual image queried with the removed class
The Contrastive CMS (CCMS) is their ratio, CCMS = CMS_f / CMS_cf.
CCMS > 1 indicates higher language-driven hallucination; CCMS < 1 indicates greater susceptibility to vision-driven hallucination.
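The CMS family can be sketched as follows. The paper's exact weighting and normalization are not reproduced here; this version assumes the score is the λ-weighted fraction of the predicted mask that lands on distractors, with λ left as a user parameter:

```python
import numpy as np

def confusion_mask_score(pred, distractors, lam=2.0):
    """Hedged CMS sketch: split the predicted mask into distractor-overlapping
    area (a_ov) and the remainder (a_nov), with lam penalizing overlap.
    The benchmark's actual formula and lambda value may differ."""
    pred = pred.astype(bool)
    if distractors:
        any_distractor = np.logical_or.reduce([d.astype(bool) for d in distractors])
    else:
        any_distractor = np.zeros_like(pred)
    a_ov = np.logical_and(pred, any_distractor).sum()
    a_nov = np.logical_and(pred, ~any_distractor).sum()
    if a_ov + a_nov == 0:
        return 0.0  # abstention: no predicted mask, no confusion
    return float(lam * a_ov) / float(lam * a_ov + a_nov)

def ccms(cms_factual: float, cms_counterfactual: float) -> float:
    """Contrastive CMS: >1 suggests language-driven hallucination dominates,
    <1 suggests vision-driven hallucination dominates."""
    return cms_factual / cms_counterfactual if cms_counterfactual else float("inf")
```

Note that abstaining (predicting no mask when the object is absent) yields a zero score, which is why abstention-heavy models like SESAME score low on CMS.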
4. Model Evaluation Protocol
HalluSegBench evaluates six contemporary vision–language segmentation models: LISA (7B/13B), PixelLM (7B/13B), GLaMM-7B, and SESAME-7B (which incorporates explicit hallucination mitigation). Each model is tested on all 1,340 factual–counterfactual pairs using the following protocol:
- Query the model with all four image–label combinations: (factual image, correct class), (factual image, wrong class), (counterfactual image, original class), and (counterfactual image, replacement class).
- Compute all consistency and confusion metrics.
- Aggregate results across the dataset, and further stratify by object size (small, medium, large) and by object class.
This comprehensive approach enables comparative analysis of language-driven versus vision-driven hallucinations and facilitates detailed breakdowns of model behavior under controlled perturbations.
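The aggregation and stratification step can be sketched as a simple grouped mean over per-instance metric records. The size thresholds below are illustrative placeholders, not the benchmark's actual strata boundaries:

```python
from collections import defaultdict

def size_bucket(mask_area_frac: float) -> str:
    """Illustrative size strata; the benchmark's exact thresholds are assumed."""
    if mask_area_frac < 0.05:
        return "small"
    if mask_area_frac < 0.10:
        return "medium"
    return "large"

def aggregate(records):
    """records: dicts holding metric values plus 'area_frac' and 'class'.
    Returns the mean of each metric per object-size bucket."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        bucket = size_bucket(r["area_frac"])
        for key, value in r.items():
            if key in ("area_frac", "class"):
                continue  # grouping fields, not metrics
            sums[(bucket, key)] += value
            counts[(bucket, key)] += 1
    return {k: sums[k] / counts[k] for k in sums}
```

Stratifying the same records by `class` instead of size follows the same pattern, enabling the per-class breakdowns described above.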
5. Experimental Findings
Key quantitative results include:
| Metric | Model | Value | Interpretation |
|---|---|---|---|
| ΔIoU_label | LISA-13B | 0.4591 | Best label sensitivity; suppresses mask with wrong label |
| ΔIoU_vision | PixelLM-13B | 0.4273 | Best vision sensitivity; suppresses mask when object absent |
| Factual CMS | SESAME-7B | 0.1983 | Lowest language-driven hallucination |
| Counterfactual CMS | SESAME-7B | 0.4304 | Lowest vision-driven hallucination |
| mIoU | PixelLM-13B | 0.7240 | State-of-the-art segmentation quality |
| mIoU | SESAME-7B | 0.5773 | Trades off hallucination suppression for abstention |
All models consistently show ΔIoU_vision < ΔIoU_label, indicating that vision-driven hallucinations are more challenging to eliminate than language-driven ones. SESAME-7B minimizes hallucination metrics but at the cost of lower overall segmentation quality due to frequent abstention. LISA and PixelLM occasionally confuse visually similar classes—e.g., sheep and cow—resulting in partial masks aligned to wrong objects. Qualitatively, most models hallucinate masks in the replaced region when the counterfactual image is queried with the original class (object absent), and sometimes latch onto distractors under label mismatch.
6. Strengths, Limitations, and Prospects
Strengths
- HalluSegBench is the first pixel-level benchmark for counterfactual visual reasoning in segmentation.
- The controlled design with per-instance single-object substitutions sharply disentangles vision- from language-driven hallucination.
- Comprehensive metrics (ΔIoU, CMS, CCMS) provide orthogonal diagnostic views.
- The benchmark spans 281 classes and realistic object scales.
Limitations
- Restricted to single-object replacements; compositional or multi-object counterfactuals (e.g., color/size/attribute changes) are not addressed.
- Evaluation depends on the visual fidelity of the edits; subtle diffusion artifacts or mis-edits could confound results.
- Complex referring expressions and compositional prompts (e.g., spatial relationships) are not systematically tested.
- Current metrics presume a single removed object; generalization requires more complex formulations.
Future Directions
Potential directions include (1) extending the benchmark to support multi-object and attribute-level counterfactuals, (2) data augmentation with counterfactual supervision at training time to mitigate model hallucination, (3) development of artifact-aware editing pipelines, and (4) integration of CCMS with causal inference methods to formally disentangle vision and language reliance in segmentation masking (Li et al., 26 Jun 2025). A plausible implication is that explicit counterfactual supervision or evaluation may be required for robust grounding in future vision–language architectures.