
Hallucination Disentangled Decoding

Updated 29 December 2025
  • HDD is a zero-training mitigation method that disentangles visual and language-prior hallucinations in LVLMs.
  • It employs adaptive segmentation and contrastive decoding to enhance visual signals and suppress language-induced errors.
  • HDD delivers state-of-the-art improvements in accuracy and hallucination reduction without requiring model retraining.

Hallucination Disentangled Decoding (HDD) is a zero-training mitigation method for object hallucinations in large vision-language models (LVLMs). Designed for object recognition and captioning, HDD operates exclusively at inference time, leveraging adaptive segmentation and contrastive decoding to address both visual and language-prior hallucinations. The principal innovation lies in disentangling and explicitly suppressing language-driven and visually-driven erroneous generations, substantially reducing both types of hallucination without any model fine-tuning (Ma et al., 22 Dec 2025).

1. Hallucinations in Large Vision-Language Models

LVLM hallucinations manifest from two primary sources. Visual hallucinations arise from the vision encoder’s failure to accurately represent small or low-contrast entities, leading to spurious or missed object recognitions due to impoverished local detail in the shared vision–language embedding. Language-prior hallucinations stem from the LLM’s reliance on textual co-occurrence priors—termed “Decoding Inertia”—wherein object predictions are made based on frequent linguistic patterns (e.g., inferring “car” from “road”) rather than the provided visual evidence. These two error sources are entangled: vision mistakes can trigger further linguistic fabrications, while strong language priors can override weak visual signals.

2. HDD Pipeline and Algorithmic Components

HDD applies a two-pronged pipeline, explicitly decomposing and attenuating both classes of hallucination. Crucially, this pipeline does not require any gradient updates or parameter modifications.

2.1 Image Segmentation and Augmentation

The procedure commences by segmenting the input image $V$ using a model such as the Segment Anything Model (SAM), with Mask2Former and Mask R-CNN as viable alternatives. From the set of $n$ instance-level masks $\{\text{Mask}_1, \ldots, \text{Mask}_n\}$, the largest $N = 0.05 \cdot n$ masks are aggregated into the segmented image $v_1 = \sum_{i=1}^{N} (\text{Mask}_i \odot V)$. The complement $v_2 = V - v_1$ forms the second augmented image. An all-zero blank image $v_n$ is appended to expose language priors without visual input.
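For concreteness, a minimal NumPy sketch of this augmentation is given below; it assumes the segmenter (SAM or an alternative) has already returned boolean instance masks, and the function name, the area-based notion of "largest," and the union-based aggregation are illustrative choices rather than details from the paper.

import numpy as np

def build_hdd_views(image, masks, ratio=0.05):
    """Build the three HDD image views from instance masks.

    image: H x W x 3 array; masks: list of H x W boolean arrays
    (e.g. from SAM, Mask2Former, or Mask R-CNN).
    """
    n = len(masks)
    top_n = max(1, int(round(ratio * n)))              # N = 0.05 * n largest masks (assumed: largest by area)
    order = sorted(range(n), key=lambda i: masks[i].sum(), reverse=True)
    keep = np.zeros(image.shape[:2], dtype=bool)
    for i in order[:top_n]:
        keep |= masks[i]                               # aggregate the N largest masks
    v1 = image * keep[..., None]                       # segmented image (selected objects)
    v2 = image * (~keep)[..., None]                    # complement image
    vn = np.zeros_like(image)                          # blank image exposing language priors
    return v1, v2, vn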

2.2 Disentangled Decoding

At each decoding step $t$, for prompt $x$ and generated prefix $y_{<t}$:

  1. Logits $\operatorname{logit}(V)$, $\operatorname{logit}(v_1)$, $\operatorname{logit}(v_2)$, and $\operatorname{logit}(v_n)$ are computed via forward passes through the LVLM.
  2. The Jensen-Shannon divergence $D_i = \operatorname{JSD}\big[p(y_t \mid v_i, x, y_{<t}) \,\Vert\, p(y_t \mid v_n, x, y_{<t})\big]$ is evaluated for $i \in \{1, 2\}$ to quantify the visual informativeness of each segment.
  3. The segment with the largest $D_i$ is selected as $i^* = \arg\max_i D_i$.
  4. The detail-difference weight $\delta = |D_1 - D_2|$ is computed: $\delta \approx 0$ when both segments are similarly uninformative, and $\delta \to 1$ when one segment is highly informative and the other is not.
  5. Visual enhancement: for $v_{\text{in}} \in \{V, v_1, v_2\}$, compute $\operatorname{logit}_{\text{enh}}(v_{\text{in}}) = (1 - \delta)\,\operatorname{logit}(v_{\text{in}}) + \delta\,\operatorname{logit}(v_{i^*})$.
  6. Contrastive language-prior removal: $\operatorname{logit}^*(v_{\text{in}}) = (1 + \alpha)\,\operatorname{logit}_{\text{enh}}(v_{\text{in}}) - \alpha\,\operatorname{logit}_{\text{enh}}(v_n)$, with $\alpha > 0$ tuned per architecture.
  7. Blended logits: $\operatorname{logit}_{\text{HDD}} = (1 - \delta)\,\operatorname{logit}^*(V) + \delta\,\operatorname{logit}^*(v_{i^*})$.
  8. The final token $y_t$ is sampled from $\operatorname{SoftMax}\big[\operatorname{logit}_{\text{HDD}}\big]$.

The process is formalized as follows:

for t in range(1, T + 1):
    # four forward passes: full image V, segments v1/v2, blank image vn
    for v_in in (V, v1, v2, vn):
        L[v_in] = LVLM(v_in, x, y_prefix)                  # raw next-token logits
    Div1 = JSD(softmax(L[v1]), softmax(L[vn]))
    Div2 = JSD(softmax(L[v2]), softmax(L[vn]))
    v_star = v1 if Div1 >= Div2 else v2                    # most informative segment v_{i*}
    delta = abs(Div1 - Div2)                               # detail-difference weight
    L_enh[vn] = L[vn]                                      # blank-image logits are used unenhanced (enhancement applies only to V, v1, v2)
    for v_in in (V, v1, v2):
        L_enh[v_in] = (1 - delta) * L[v_in] + delta * L[v_star]
        L_star[v_in] = (1 + alpha) * L_enh[v_in] - alpha * L_enh[vn]
    logit_HDD = (1 - delta) * L_star[V] + delta * L_star[v_star]
    y_t = sample_from(softmax(logit_HDD))                  # sample the next token
    y_prefix = y_prefix + [y_t]
$\alpha$ is grid-searched in $[0.1, 0.6]$ for LLaVA-1.5 and InstructBLIP, and in $[1.0, 1.6]$ for LLaVA-NeXT; $\delta$ is dynamically determined per step.
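The divergences Div1 and Div2 above can be computed directly from the next-token distributions. The following PyTorch sketch is illustrative rather than the paper's implementation, with logits_seg denoting the logits under a segment v_i and logits_blank those under the blank image v_n.

import torch
import torch.nn.functional as F

def jsd(logits_seg, logits_blank, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions."""
    p = F.softmax(logits_seg, dim=-1)
    q = F.softmax(logits_blank, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)))
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)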

3. Computational Characteristics and Implementation

HDD exclusively alters decoding and requires no model retraining or exposure to additional data. The method is implemented with PyTorch, utilizing HuggingFace LVLM wrappers and SAM or its alternatives for segmentation. Each decoding step entails four LVLM forward passes, yielding approximately 8 ms/token latency on an RTX-4090 GPU, in comparison to 7 ms/token for simple visual contrastive decoding (VCD). This overhead remains compatible with practical real-time constraints.
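Whether the four per-step passes are executed sequentially or batched is not stated; one practical way to contain the overhead is to stack the views into a single batched call, as in the hypothetical sketch below, which assumes a LLaVA-style HuggingFace model whose forward accepts input_ids and pixel_values.

import torch

@torch.no_grad()
def four_view_logits(model, input_ids, pixel_values_views):
    """Next-token logits for the four HDD views (V, v1, v2, vn) in one batched pass.

    input_ids: (1, seq) prompt plus generated prefix; pixel_values_views:
    (4, 3, H, W) preprocessed full image, two segment views, and blank image.
    """
    ids = input_ids.expand(pixel_values_views.size(0), -1)   # same text for every view
    out = model(input_ids=ids, pixel_values=pixel_values_views)
    return out.logits[:, -1, :]                              # (4, vocab) per-view next-token logits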

4. Experimental Results and Evaluation

HDD demonstrates consistent performance improvements across multiple models and benchmarks. Experiments utilize LLaVA-v1.5-7b, InstructBLIP-7b, and LLaVA-NeXT on MS-COCO (captioning), A-OKVQA, GQA (VQA), and Visual Genome (detailed descriptions).

Quantitative evaluation employs:

  • POPE: binary object existence accuracy and F1 under Random/Popular/Adversarial queries.
  • CHAIR: entity-level (CHAIR_I) and sentence-level (CHAIR_S) hallucination rates, with lower values preferred (a toy computation is sketched after this list).
  • GPT-4 Assisted VG Benchmark: SPI/WPI (descriptive richness), HSPI/HWPI (hallucinated sentences/words), and HSR/HWR (hallucination ratios).
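Once object mentions have been extracted from each caption and matched (with synonym mapping) against the image's annotated objects, the CHAIR rates reduce to simple counting, as in the toy sketch below; this is not the benchmark's official implementation.

def chair_scores(samples):
    """Toy CHAIR computation.

    samples: list of (mentioned, ground_truth) pairs, one per caption, where
    both entries are sets of object names already extracted and synonym-mapped.
    """
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for mentioned, ground_truth in samples:
        wrong = mentioned - ground_truth                      # objects named but absent from the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        hallucinated_captions += int(bool(wrong))
    chair_i = hallucinated_mentions / max(total_mentions, 1)  # entity-level rate
    chair_s = hallucinated_captions / max(len(samples), 1)    # sentence-level rate
    return chair_i, chair_s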

Relative to strong baselines (VCD, OPERA, HALC, SID, RITUAL) and decoding strategies (greedy, beam, multinomial), HDD achieves:

  • POPE: accuracy improvements up to +9.9 percentage points (pp), F1 increases up to +10.2 pp over greedy/beam search.
  • CHAIR: 29.1% (CHAIR_I) and 30.3% (CHAIR_S) reductions in hallucination rates versus vanilla beam search.
  • GPT-4 Benchmark: 12.9% lower HSR compared to VCD, 18.3% lower HWR relative to OPERA, without sacrificing descriptive richness (SPI/WPI).

5. Qualitative Analysis and Benchmarks

Qualitative inspection reveals HDD’s efficacy at mitigating language-prior and visual hallucinations. On MS-COCO, standard LVLMs may hallucinate “a dog beside the bench” where no dog is present. HDD instead produces accurate captions such as “a cat on top of the bench” and adds correct details (“sunlight streaming through the window”). For adversarial A-OKVQA queries (e.g., “Is there a zebra?” for images containing none), HDD reliably responds “No,” suppressing spurious associations driven by language priors.

6. Limitations and Extensions

The principal drawback of HDD is the four-fold per-token computation relative to standard decoding. Nonetheless, this cost remains manageable in real-time applications on modern hardware. The framework is currently limited to 2D inputs; extension to video streams or 3D point clouds requires further development and is not addressed.

Potential extensions include dynamic per-query selection of $N$, learnable or uncertainty-driven $\alpha$, and integration with attention-based hallucination detectors for enhanced granularity.

7. Application Domains and Impact

HDD’s architecture-independent, training-free design renders it applicable to a range of domains where false positive recognition is critical. Notable applications include safety-critical medical imaging reports, autonomous driving scene text, e-commerce product image captioning, and any vision–language interface where erroneous object attributions are unacceptable.

HDD’s core contribution is the explicit disentanglement and joint mitigation of visual and language-prior hallucinations by (1) amplifying genuine local visual signals via segmentation and adaptive blending; and (2) subtracting LLM-driven language biases using blank-image contrast. This approach yields state-of-the-art hallucination reduction in LVLMs across diverse tasks, datasets, and evaluation metrics, without any model retraining requirement (Ma et al., 22 Dec 2025).
