HalluSegBench: Benchmarking Segmentation Hallucinations
- HalluSegBench is a benchmark designed to systematically evaluate segmentation hallucinations in vision-language models (VLMs) by employing counterfactual visual reasoning through pixel-level scene edits.
- The benchmark introduces paired factual and counterfactual images with controlled object presence or identity changes, enabling the isolation and quantification of vision-driven hallucination phenomena.
- HalluSegBench exposes that current state-of-the-art VLMs are significantly more susceptible to vision-driven hallucinations than textual ones and establishes a new standard for assessing grounding fidelity and segmentation reliability.
HalluSegBench is a benchmark designed to systematically evaluate segmentation hallucinations in vision-language models (VLMs) through counterfactual visual reasoning. Traditional hallucination evaluation protocols for segmentation focus primarily on label-level or textual hallucinations and rarely manipulate the underlying visual content. HalluSegBench shifts this paradigm by directly probing a model's grounding fidelity via controlled, pixel-level scene edits and a set of interpretable metrics, thereby isolating and quantifying vision-driven hallucination phenomena that remain underexplored in the field.
1. Foundations and Motivation
Segmentation hallucination is characterized as the tendency of a model to predict a segmentation mask for objects that are not grounded in the visual content—either by segmenting absent objects or by incorrectly labeling irrelevant regions. Prior approaches in this domain typically employ textual manipulations (for example, altering prompts or swapping labels for absent objects) while keeping the image static, thus underdiagnosing deficiencies in the models' visual grounding.
HalluSegBench addresses this limitation by evaluating models under visual scenario manipulations that explicitly challenge the correspondence between visual evidence and model predictions. The key innovation is the use of counterfactual visual reasoning, which enables the attribution of hallucinations to either language/model priors or failures in processing visual information.
2. Counterfactual Visual Reasoning Methodology
The central methodological advance of HalluSegBench is the creation of paired factual and counterfactual images that differ only by the presence or identity of a specific foreground object. For each instance in the dataset:
- The factual image $I^f$ contains an object of class $c^f$.
- A counterfactual image $I^{cf}$ is generated by replacing $c^f$ with another semantically meaningful object $c^{cf}$, while maintaining visual plausibility.
This setup supports four segmentation queries per pair:
- Segment $c^f$ in $I^f$ (object present; baseline segmentation).
- Segment $c^{cf}$ in $I^f$ (object absent; textual hallucination check).
- Segment $c^f$ in $I^{cf}$ (object absent; vision-driven hallucination check).
- Segment $c^{cf}$ in $I^{cf}$ (object present; counterfactual segmentation).
The dataset is constructed using the RefCOCO validation/test splits, resulting in 1340 factual-counterfactual image pairs covering 281 unique object classes. Edits are performed via generative image editing pipelines and followed by manual quality control to ensure semantic and visual coherence. Ground truth segmentation masks are available for both original and counterfactual objects.
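To make the protocol concrete, the sketch below models one benchmark instance and enumerates its four queries. The record layout, field names, and types are illustrative assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

import numpy as np

@dataclass
class CounterfactualPair:
    """One factual/counterfactual instance (illustrative schema, not the released format)."""
    image_f: str          # path to the factual image I^f
    image_cf: str         # path to the counterfactual image I^cf
    class_f: str          # original object class c^f
    class_cf: str         # replacement object class c^cf
    mask_f: np.ndarray    # ground-truth boolean mask of c^f in I^f
    mask_cf: np.ndarray   # ground-truth boolean mask of c^cf in I^cf

def queries(pair: CounterfactualPair) -> Iterator[Tuple[str, str, Optional[np.ndarray]]]:
    """Yield (image, queried_class, expected_mask) for the four queries per pair.

    expected_mask is None where the queried object is absent, i.e. the
    ideal prediction is an empty mask.
    """
    yield pair.image_f, pair.class_f, pair.mask_f     # baseline segmentation
    yield pair.image_f, pair.class_cf, None           # textual hallucination check
    yield pair.image_cf, pair.class_f, None           # vision-driven hallucination check
    yield pair.image_cf, pair.class_cf, pair.mask_cf  # counterfactual segmentation
```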
3. Metric Suite for Hallucination Assessment
HalluSegBench introduces a comprehensive set of metrics to quantify model behavior under factual and counterfactual evaluation. These metrics provide insight into both consistency of segmentation and the severity/location of hallucinated predictions.
Consistency-Based Performance Metrics
These measure prediction quality under different conditions:
- Intersection over Union (IoU): for each query, IoU is measured between the predicted mask and the ground truth.
  - $\text{IoU}(c^f, I^f)$: segmentation of object $c^f$ in $I^f$.
  - $\text{IoU}(c^{cf}, I^f)$: segmentation of object $c^{cf}$ in $I^f$ (should ideally be zero).
  - $\text{IoU}(c^f, I^{cf})$: segmentation of $c^f$ in $I^{cf}$ (should ideally be zero).
- Delta IoU: measures robustness to hallucination:

  $$\Delta\text{IoU} = \text{IoU}(c^f, I^f) - \text{IoU}(c^f, I^{cf})$$

  A high $\Delta\text{IoU}$ signals that the model correctly suppresses predictions when visual evidence is absent; low values indicate persistent hallucination.
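A minimal Python sketch of both quantities on boolean masks follows; it assumes, consistent with the definition above, that the prediction on the counterfactual image is scored against the original object's ground-truth region.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks (defined as 0.0 when the union is empty)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def delta_iou(pred_on_f: np.ndarray, pred_on_cf: np.ndarray, gt_f: np.ndarray) -> float:
    """Delta IoU = IoU(c^f in I^f) - IoU(c^f in I^cf).

    Both terms score against the original object's ground-truth region;
    a model that correctly predicts nothing on I^cf gets IoU = 0 there,
    keeping Delta IoU high.
    """
    return iou(pred_on_f, gt_f) - iou(pred_on_cf, gt_f)
```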
Direct Hallucination Metrics
- Confusion Mask Score (CMS):

  $$\text{CMS} = \frac{\lambda \cdot A_{\text{ov}} + A_{\text{nov}}}{|M|}$$

  where $A_{\text{ov}}$ is the overlap between the hallucinated mask and the ground truth, $A_{\text{nov}}$ is the non-overlapping hallucinated area, $|M|$ is the mask size, and $\lambda$ prioritizes overlap errors (typically $\lambda > 1$).
- Contrastive Confusion Mask Score (CCMS):

  $$\text{CCMS} = \frac{\text{CMS}\big(c^{cf},\, I^{f}\big)}{\text{CMS}\big(c^{f},\, I^{cf}\big)}$$

  CCMS > 1 indicates that hallucination is predominantly language-driven; CCMS < 1 points to prevalent vision-driven errors.
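The sketch below implements CMS and CCMS from the definitions above. The default λ value and the normalization by the ground-truth region size are illustrative assumptions, not the benchmark's exact configuration.

```python
import numpy as np

def cms(pred: np.ndarray, gt_region: np.ndarray, lam: float = 2.0) -> float:
    """Confusion Mask Score: weighted hallucinated area, normalized by mask size.

    lam > 1 prioritizes overlap errors; 2.0 is a placeholder, not the
    benchmark's actual setting. Both masks are boolean arrays.
    """
    a_ov = np.logical_and(pred, gt_region).sum()    # hallucinated pixels overlapping ground truth
    a_nov = np.logical_and(pred, ~gt_region).sum()  # stray hallucinated pixels
    size = gt_region.sum()
    if size == 0:
        return 0.0
    return float(lam * a_ov + a_nov) / float(size)

def ccms(cms_textual: float, cms_visual: float) -> float:
    """CCMS = CMS(c^cf in I^f) / CMS(c^f in I^cf).

    > 1: language-driven hallucination dominates; < 1: vision-driven.
    """
    if cms_visual == 0:
        return float("inf") if cms_textual > 0 else 1.0
    return cms_textual / cms_visual
```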
Metric Behavior Table
| Metric | High Value Interpretation | Low Value Interpretation |
|---|---|---|
| $\Delta\text{IoU}$ | Strong suppression of hallucination | Persistent hallucination |
| CMS | Severe, focused hallucination | Minimal or no hallucination |
| CCMS | Language-driven hallucination | Vision-driven hallucination |
4. Empirical Evaluation of Vision-Language Segmentation Models
HalluSegBench has been used to benchmark state-of-the-art VLM-based segmentation models, including LISA (7B, 13B), PixelLM (7B, 13B), GLaMM (7B), and SESAME-7B (which employs explicit hallucination mitigation strategies).
Principal Observations
- Vision-driven hallucinations are substantially more frequent and persistent than label-driven ones. All evaluated models hallucinate more severely when the queried object has been visually removed (segmenting $c^f$ in $I^{cf}$) than when prompted for an absent object on the unaltered image (segmenting $c^{cf}$ in $I^f$), demonstrating a stronger tendency to segment visually removed objects than to respond incorrectly to misleading prompts.
- LISA-13B, for instance, suppresses hallucinations far less effectively under counterfactual visual edits than under misleading textual queries.
- Mitigation approaches entail trade-offs between hallucination suppression and segmentation coverage. SESAME-7B achieves low CMS values largely by abstaining from prediction, at the cost of reduced mean IoU. In contrast, models such as LISA and PixelLM retain high segmentation coverage but are more prone to hallucination, revealing a coverage-versus-suppression trade-off in model performance.
- Qualitative and per-example breakdowns indicate that small objects and subtle scene manipulations are particularly challenging and result in higher CMS and failure rates, even for top models.
5. Addressing Shortcomings of Previous Protocols
Most prior evaluations employ label or textual perturbations without manipulating the visual scene, providing an incomplete picture of segmentation hallucination. HalluSegBench, by incorporating counterfactual image editing in combination with textual changes, isolates the model’s reliance on visual versus language priors. This methodology enables instance-level diagnostic granularity and exposes emergent hallucination patterns that would be overlooked in label-only benchmarks.
6. Significance and Field-Wide Implications
HalluSegBench establishes a new standard for reliable, interpretable, and instance-level benchmarking of segmentation hallucinations in vision-language systems. By demonstrating that current state-of-the-art models are far more susceptible to vision-driven hallucinations than previously recognized, the benchmark emphasizes the necessity of counterfactual visual reasoning to genuinely assess and advance grounding fidelity. The suite of metrics provides actionable feedback for both method development and deployment audits, particularly in safety-critical or robust real-world vision applications (such as robotics and medical imaging).
7. Resources and Further Information
HalluSegBench, including dataset and code, is publicly accessible for the research community (HalluSegBench project page). This benchmark offers a rigorously controlled and semantically meaningful foundation for ongoing innovation in grounded visual understanding, segmentation reliability, and hallucination mitigation in multimodal AI systems.