HalluSegBench: Benchmarking Segmentation Hallucinations
- HalluSegBench is a benchmark designed to systematically evaluate segmentation hallucinations in vision-language models (VLMs) by employing counterfactual visual reasoning through pixel-level scene edits.
- The benchmark introduces paired factual and counterfactual images with controlled object presence or identity changes, enabling the isolation and quantification of vision-driven hallucination phenomena.
- HalluSegBench exposes that current state-of-the-art VLMs are significantly more susceptible to vision-driven hallucinations than textual ones and establishes a new standard for assessing grounding fidelity and segmentation reliability.
HalluSegBench is a benchmark designed to systematically evaluate segmentation hallucinations in vision-language models (VLMs) through counterfactual visual reasoning. Traditional hallucination evaluation protocols for segmentation focus primarily on label-level or textual hallucinations and rarely manipulate the underlying visual content. HalluSegBench shifts this paradigm by directly probing a model's grounding fidelity via controlled, pixel-level scene edits and a set of interpretable metrics, thereby isolating and quantifying vision-driven hallucination phenomena that remain underexplored in the field.
1. Foundations and Motivation
Segmentation hallucination is characterized as the tendency of a model to predict a segmentation mask for objects that are not grounded in the visual content—either by segmenting absent objects or by incorrectly labeling irrelevant regions. Prior approaches in this domain typically employ textual manipulations (for example, altering prompts or swapping labels for absent objects) while keeping the image static, thus underdiagnosing deficiencies in the models' visual grounding.
HalluSegBench addresses this limitation by evaluating models under visual scenario manipulations that explicitly challenge the correspondence between visual evidence and model predictions. The key innovation is the use of counterfactual visual reasoning, which enables the attribution of hallucinations to either language/model priors or failures in processing visual information.
2. Counterfactual Visual Reasoning Methodology
The central methodological advance of HalluSegBench is the creation of paired factual and counterfactual images that differ only by the presence or identity of a specific foreground object. For each instance in the dataset:
- The factual image $I^f$ contains an object of class $c^f$.
- A counterfactual image $I^{cf}$ is generated by replacing $c^f$ with another semantically meaningful object $c^{cf}$, while maintaining visual plausibility.
This setup supports four segmentation queries per pair:
- Segment $c^f$ in $I^f$ (object present; baseline segmentation).
- Segment $c^{cf}$ in $I^f$ (object absent; textual hallucination check).
- Segment $c^f$ in $I^{cf}$ (object absent; vision-driven hallucination check).
- Segment $c^{cf}$ in $I^{cf}$ (object present; counterfactual segmentation).
The dataset is constructed using the RefCOCO validation/test splits, resulting in 1340 factual-counterfactual image pairs covering 281 unique object classes. Edits are performed via generative image editing pipelines and followed by manual quality control to ensure semantic and visual coherence. Ground truth segmentation masks are available for both original and counterfactual objects.
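To make the protocol concrete, the sketch below models one benchmark instance and enumerates its four queries. The record layout, field names, and types are illustrative assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

import numpy as np

@dataclass
class CounterfactualPair:
    """One factual/counterfactual instance (illustrative schema, not the released format)."""
    image_f: str          # path to the factual image I^f
    image_cf: str         # path to the counterfactual image I^cf
    class_f: str          # original object class c^f
    class_cf: str         # replacement object class c^cf
    mask_f: np.ndarray    # ground-truth boolean mask of c^f in I^f
    mask_cf: np.ndarray   # ground-truth boolean mask of c^cf in I^cf

def queries(pair: CounterfactualPair) -> Iterator[Tuple[str, str, Optional[np.ndarray]]]:
    """Yield (image, queried_class, expected_mask) for the four queries per pair.

    expected_mask is None where the queried object is absent, i.e. the
    ideal prediction is an empty mask.
    """
    yield pair.image_f, pair.class_f, pair.mask_f     # baseline segmentation
    yield pair.image_f, pair.class_cf, None           # textual hallucination check
    yield pair.image_cf, pair.class_f, None           # vision-driven hallucination check
    yield pair.image_cf, pair.class_cf, pair.mask_cf  # counterfactual segmentation
```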
3. Metric Suite for Hallucination Assessment
HalluSegBench introduces a comprehensive set of metrics to quantify model behavior under factual and counterfactual evaluation. These metrics provide insight into both consistency of segmentation and the severity/location of hallucinated predictions.
Consistency-Based Performance Metrics
These measure prediction quality under different conditions:
- Intersection over Union (IoU): for each query, IoU is measured between the predicted mask and the ground truth.
  - $\text{IoU}(c^f, I^f)$: segmentation of object $c^f$ in $I^f$.
  - $\text{IoU}(c^{cf}, I^f)$: segmentation of object $c^{cf}$ in $I^f$ (should ideally be zero).
  - $\text{IoU}(c^f, I^{cf})$: segmentation of $c^f$ in $I^{cf}$ (should ideally be zero).
- Delta IoU: measures robustness to hallucination:

  $$\Delta\text{IoU} = \text{IoU}(c^f, I^f) - \text{IoU}(c^f, I^{cf})$$

  A high $\Delta\text{IoU}$ signals that the model correctly suppresses predictions when visual evidence is absent; low values indicate persistent hallucination.
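A minimal Python sketch of both quantities on boolean masks follows; it assumes, consistent with the definition above, that the prediction on the counterfactual image is scored against the original object's ground-truth region.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks (defined as 0.0 when the union is empty)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def delta_iou(pred_on_f: np.ndarray, pred_on_cf: np.ndarray, gt_f: np.ndarray) -> float:
    """Delta IoU = IoU(c^f in I^f) - IoU(c^f in I^cf).

    Both terms score against the original object's ground-truth region;
    a model that correctly predicts nothing on I^cf gets IoU = 0 there,
    keeping Delta IoU high.
    """
    return iou(pred_on_f, gt_f) - iou(pred_on_cf, gt_f)
```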
Direct Hallucination Metrics
- Confusion Mask Score (CMS):

  $$\text{CMS} = \frac{\lambda \cdot A_{\text{ov}} + A_{\text{nov}}}{|M|}$$

  where $A_{\text{ov}}$ is the overlap between the hallucinated mask and the ground truth, $A_{\text{nov}}$ is the non-overlapping hallucinated area, $|M|$ is the mask size, and $\lambda$ prioritizes overlap errors (typically $\lambda > 1$).
- Contrastive Confusion Mask Score (CCMS):

  $$\text{CCMS} = \frac{\text{CMS}\big(c^{cf},\, I^{f}\big)}{\text{CMS}\big(c^{f},\, I^{cf}\big)}$$

  CCMS > 1 indicates that hallucination is predominantly language-driven; CCMS < 1 points to prevalent vision-driven errors.
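The sketch below implements CMS and CCMS from the definitions above. The default λ value and the normalization by the ground-truth region size are illustrative assumptions, not the benchmark's exact configuration.

```python
import numpy as np

def cms(pred: np.ndarray, gt_region: np.ndarray, lam: float = 2.0) -> float:
    """Confusion Mask Score: weighted hallucinated area, normalized by mask size.

    lam > 1 prioritizes overlap errors; 2.0 is a placeholder, not the
    benchmark's actual setting. Both masks are boolean arrays.
    """
    a_ov = np.logical_and(pred, gt_region).sum()    # hallucinated pixels overlapping ground truth
    a_nov = np.logical_and(pred, ~gt_region).sum()  # stray hallucinated pixels
    size = gt_region.sum()
    if size == 0:
        return 0.0
    return float(lam * a_ov + a_nov) / float(size)

def ccms(cms_textual: float, cms_visual: float) -> float:
    """CCMS = CMS(c^cf in I^f) / CMS(c^f in I^cf).

    > 1: language-driven hallucination dominates; < 1: vision-driven.
    """
    if cms_visual == 0:
        return float("inf") if cms_textual > 0 else 1.0
    return cms_textual / cms_visual
```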
Metric Behavior Table
| Metric | High Value Interpretation | Low Value Interpretation |
|---|---|---|
| $\Delta\text{IoU}$ | Strong suppression of hallucination | Persistent hallucination |
| CMS | Severe, focused hallucination | Minimal or no hallucination |
| CCMS | Language-driven hallucination | Vision-driven hallucination |
4. Empirical Evaluation of Vision-Language Segmentation Models
HalluSegBench has been used to benchmark state-of-the-art VLM-based segmentation models, including LISA (7B, 13B), PixelLM (7B, 13B), GLaMM (7B), and SESAME-7B (which employs explicit hallucination mitigation strategies).
Principal Observations
- Vision-driven hallucinations are substantially more frequent and persistent than label-driven ones. All evaluated models hallucinate more severely when the queried object has been visually removed (segmenting $c^f$ in $I^{cf}$) than when prompted for an absent object on the unaltered image (segmenting $c^{cf}$ in $I^f$), demonstrating a stronger tendency to segment visually removed objects than to respond incorrectly to misleading prompts.
- LISA-13B, for instance, suppresses hallucinations far less effectively under counterfactual visual edits than under misleading textual queries.
- Mitigation approaches entail trade-offs between hallucination suppression and segmentation coverage. SESAME-7B achieves low CMS values largely by abstaining from prediction, at the cost of reduced mean IoU. In contrast, models such as LISA and PixelLM retain high segmentation coverage but are more prone to hallucination, revealing a coverage-versus-suppression trade-off in model performance.
- Qualitative and per-example breakdowns indicate that small objects and subtle scene manipulations are particularly challenging and result in higher CMS and failure rates, even for top models.
5. Addressing Shortcomings of Previous Protocols
Most prior evaluations employ label or textual perturbations without manipulating the visual scene, providing an incomplete picture of segmentation hallucination. HalluSegBench, by incorporating counterfactual image editing in combination with textual changes, isolates the model’s reliance on visual versus language priors. This methodology enables instance-level diagnostic granularity and exposes emergent hallucination patterns that would be overlooked in label-only benchmarks.
6. Significance and Field-Wide Implications
HalluSegBench establishes a new standard for reliable, interpretable, and instance-level benchmarking of segmentation hallucinations in vision-language systems. By demonstrating that current state-of-the-art models are far more susceptible to vision-driven hallucinations than previously recognized, the benchmark emphasizes the necessity of counterfactual visual reasoning to genuinely assess and advance grounding fidelity. The suite of metrics provides actionable feedback for both method development and deployment audits, particularly in safety-critical or robust real-world vision applications (such as robotics and medical imaging).
7. Resources and Further Information
HalluSegBench, including dataset and code, is publicly accessible for the research community (HalluSegBench project page). This benchmark offers a rigorously controlled and semantically meaningful foundation for ongoing innovation in grounded visual understanding, segmentation reliability, and hallucination mitigation in multimodal AI systems.