Causal-HalBench: LVLM Causal Benchmark

Updated 20 November 2025

Causal-HalBench is a causal benchmarking framework that quantifies spurious co-occurrence biases leading to object hallucinations in LVLMs.
It employs a structural causal model with counterfactual image interventions and metrics like CAC, AAC, and CHR to rigorously assess model robustness.
The framework’s precise intervention pipeline and causal analysis inform the design of debiasing strategies, enhancing the reliability of vision-language models.

Causal-HalBench is a causal benchmarking framework for evaluating and quantifying spurious co-occurrence biases—specifically object hallucinations—in Large Vision-LLMs (LVLMs). Unlike previous hallucination-focused benchmarks, Causal-HalBench supplies a formal causal characterization, counterfactual-based image interventions, and causal metrics that rigorously assess the robustness of object recognition pipelines to context-induced bias. Through interventions constructed with proprietary LVLMs and advanced text-to-image models, Causal-HalBench exposes the susceptibility of mainstream models to spurious correlations and informs the design and evaluation of causal debiasing strategies (Xu et al., 13 Nov 2025).

1. Structural Causal Model and Spurious Pathways

Causal-HalBench is grounded in a structural causal modeling (SCM) approach to characterize object hallucinations. The SCM operates on the following variables:

$I$ : true visual features (pixel content, shape descriptors)
$O$ : ground-truth indicator of the queried object’s presence
$C$ : latent context bias, encoding statistical co-occurrence patterns learned from training data
$Y$ : the LVLM’s binary prediction on object presence (“yes”/“no”)

The causal relationships are depicted by the graph: $C \rightarrow I \rightarrow Y$

$\quad \searrow\ \ \uparrow$

$\phantom{C} \ \ \ \ Y$

The paths $C \rightarrow I$ and $C \rightarrow Y$ constitute a back-door route through which $C$ can confound the $I \rightarrow Y$ relationship, violating the desired direct dependence on image content alone. In formal equations (omitting exogenous noise terms for clarity):

$I = f_I(U_I),\quad C = f_C(I,U_C)$

$O = f_O(I, C, U_O),\quad Y = f_Y(I, O, C, U_Y)$

The presence of the $C \rightarrow Y$ edge operationalizes the hypothesis that LVLMs may shortcut to context-based cues rather than actual visual features, resulting in object hallucinations.

2. Formal Causal Definition of Spurious Correlation

Spurious correlation is rigorously defined as the change in the model output when the confounding context is altered, keeping object presence constant. Using Pearl's do-calculus notation, and denoting $c$ and $c'$ as high- and low-co-occurrence contexts, the causal influence of $C$ on $Y$ (even when $O$ is fixed) is:

$\Delta_{spur}(o) = P(Y=1\,|\,do(C=c)) - P(Y=1\,|\,do(C=c'), do(O=o))$

A nonzero $\Delta_{spur}$ quantifies the presence and magnitude of spurious co-occurrence bias. The benchmark further operationalizes this by image-level interventions:

$ACE = \mathbb{E}[Y\,|\,do(X=X_{cf}),Q] - \mathbb{E}[Y\,|\,X,Q]$

where $X_{cf}$ is a counterfactual image with context cues removed, and $Q$ is the object-prompt query.

The Direct Causal Strength ( $DCS$ ) is measured post-intervention as:

$DCS = \mathbb{E}[Y\,|\,do(X=X_{cf}),Q]$

3. Counterfactual Sample Construction Pipeline

Causal-HalBench approximates interventions via a scalable pipeline for high-quality counterfactual image generation:

Selection of Intervention Objects:
- For an image $X$ with annotated target $o_t$ , sample a contextual object $o_c$ (high co-occurrence with $o_t$ ).
- From minimally co-occurring objects with $o_c$ , select $o_{cf}$ candidates.
- Use a proprietary LVLM (Gemini) to rank these by inpainting suitability.
Counterfactual Description Generation:
- Query Gemini for annotated natural-language descriptions, then replace $o_t$ with $o_{cf}$ to create a counterfactual prompt.
Counterfactual Inpainting:
- Use SAM to generate a mask of $o_t$ , dilate as needed.
- Employ FLUX-controlnet for inpainting, guided by the counterfactual prompt.
- Filter results using human evaluation and CLIP scores to assure fidelity.
Dataset Statistics:
- 1,387 counterfactuals from 757 MSCOCO images, covering three question types per image (contextual, counterfactual, absent) and resulting in 9,709 unique question-answer pairs.

This pipeline ensures interventions that disrupt context cues without altering the underlying object or question type, thus supporting precise causal estimation.

4. Causal Metrics: CAC, AAC, and CHR

Evaluation relies on metrics derived from the average causal effect ( $ACE$ ) and direct causal strength ( $DCS$ ) applied to distinct QA scenarios:

Contextual object ( $Q_c$ ): actual context object present (ground-truth “yes”)
Counterfactual object ( $Q_{cf}$ ): newly inserted low-co-occurrence object (“yes”)
Absent object ( $Q_a$ ): object not present (“no”)

The benchmark defines:

Contextual object Accuracy Change (CAC):

$CAC = Acc(X, Q_c) - Acc(X_{cf}, Q_c) = ACE(Q=Q_c)$

Drop in accuracy for $Q_c$ after intervention.

Absent object Accuracy Change (AAC):

$AAC = Acc(X_{cf}, Q_a) - Acc(X, Q_a) = ACE(Q=Q_a)$

Increase in false positives for absent objects post-intervention.

Counterfactual object Hallucination Rate (CHR):

$CHR = 1 - Acc(X_{cf}, Q_{cf}) = 1 - DCS(Q=Q_{cf})$

Failure rate in detecting the newly inserted object.

Higher CAC or AAC signals greater reliance on spurious correlations, while lower CHR reflects retrained attention to authentic visual features.

5. Experimental Evaluation and Findings

A comparative analysis across nine mainstream LVLMs (including LLaVA-NEXT-8B, GPT-4o, Qwen2.5-VL-7B, Gemini1.5-pro) reveals:

Model	CAC (↓)	AAC (↑)	CHR (↓)
LLaVA-NEXT-8B	4.5	1.1	6.8
LLaVA-onevision	3.6	0.2	14.4
Kimi-VL-A3B	5.4	10.2	7.4
MiniCPM-o-2_6	3.2	2.8	14.9
InternVL2.5-8B	2.8	7.1	29.3
mPLUG-Owl3-7B	3.8	0.3	14.4
Qwen2.5-VL-7B	1.8	0.5	27.3
GPT-4o	3.6	1.5	12.4
Gemini1.5-pro	8.1	0.3	21.4

Key observations:

All models are affected by co-occurrence bias (CAC $>$ 1.8 pp).
Kimi-VL-A3B yields the highest hallucination on absent objects (AAC 10.2).
InternVL2.5 and Qwen2.5 show pronounced deficits in recognizing true counterfactual objects (CHR 27–29).
LLaVA-NEXT-8B demonstrates the strongest reliance on visual features (lowest CHR).

Comparison with benchmarks like POPE and CHAIR indicates Causal-HalBench counterfactuals expose latent vulnerabilities not revealed by factual-only metrics.

6. Comparative Assessment with Other Causal Benchmarks

Unlike CausalBench frameworks in single-cell perturbation (Chevalley et al., 2022) or causal inference (Kapkiç et al., 2024), Causal-HalBench is uniquely positioned for vision-language hallucination on complex image datasets and is architected for end-to-end causal measurement, not only discovery or treatment estimation. Where prior benchmarks emphasize network structure, inference performance, or treatment-outcome estimates, Causal-HalBench operationalizes the interventionist paradigm directly at the level of image semantics, enabling the quantification of context-induced model failures.

A plausible implication is that causal metrics like CAC, AAC, and CHR could be adapted for other domains where spurious correlations via high-order context may trigger systematic model failure.

7. Implications and Future Directions

The empirical finding that all leading LVLMs remain susceptible to context-induced hallucination suggests that current vision-language architectures insufficiently disentangle semantic content from context priors. Causal-HalBench’s formalisms and counterfactual engineering pipeline provide a rigorous foundation for causal debiasing techniques in future model development.

Continued progress may leverage:

End-to-end causal regularization for reducing $C\rightarrow Y$ dependence.
Extension of counterfactual protocols to broader visual reasoning domains and multi-object settings.
Integration with open causal-learning benchmarks for unified cross-domain evaluation of debiasing effectiveness and spurious-correlation resilience.

Causal-HalBench thus establishes a new standard for evaluating and understanding object hallucination through the lens of formal causal intervention (Xu et al., 13 Nov 2025).