Hallucination-Aware Direct Preference Optimization

Updated 31 December 2025
  • HA-DPO is a fine-tuning algorithm family that mitigates hallucinations by aligning model outputs with non-hallucinatory preferences using specialized heuristics.
  • It extends Direct Preference Optimization by incorporating targeted data selection, customized loss weighting, and domain-specific metrics across vision, audio, and text models.
  • Empirical benchmarks demonstrate that HA-DPO significantly reduces hallucination rates, enhancing model grounding and overall output reliability.

Hallucination-Aware Direct Preference Optimization (HA-DPO) is a family of fine-tuning algorithms that reduces hallucinations, i.e., outputs from large language models (LLMs) or multimodal models that are not accurately grounded in the provided input, via preference-based policy alignment. These frameworks extend Direct Preference Optimization (DPO) by explicitly constructing, selecting, or weighting preference data and losses according to hallucination-specific heuristics, metrics, or classifiers. HA-DPO has been applied effectively to vision-language models (VLMs/MLLMs), audio generation, and text-only models, with active research covering algorithmic variants, data construction pipelines, and robustness guarantees.

1. Problem Motivation and Conceptual Framework

Hallucinations in generative models, particularly multimodal LLMs (MLLMs) and related multimodal architectures, manifest as spurious object references, fabricated relationships, or unsupported claims relative to the conditioning data (e.g., images, text, audio) (Zhao et al., 2023, Compagnoni et al., 27 Aug 2025, Fu et al., 2024). Conventional supervised fine-tuning (SFT) or reward modeling penalizes unhelpfulness without directly targeting grounding errors. DPO provides a scalable, reward-model-free preference optimization approach, but it is only "hallucination-aware" if the preference dataset or objective is designed to distinguish grounded from hallucinated outputs.

HA-DPO reframes the hallucination mitigation task as explicit preference learning: given two outputs for the same input, select the non-hallucinatory option as the "winner" and increase the model's likelihood for that sequence over the "loser" (hallucinatory) option. This paradigm can incorporate detailed heuristics, domain-specific metrics, or on-policy data curation to anchor hallucination awareness in the fine-tuning objective. The key insight is that by centering hallucination detection in the preference construction, the alignment step directly improves grounding fidelity (Yu et al., 30 Nov 2025, Yang et al., 16 Jan 2025).

2. Mathematical Formulations and Objectives

The core HA-DPO objective adapts the DPO loss to penalize hallucinations, typically via a margin-based likelihood ratio between positive (non-hallucinated) and negative (hallucinated) samples. Let $\pi_\theta(y|x)$ denote the fine-tuned model and $\pi_{\rm ref}(y|x)$ the frozen reference model. For a preference tuple $(x, y^{+}, y^{-})$ where $y^+$ is less hallucinatory than $y^-$, the standard loss is:

$$\mathcal{L}_{\rm DPO}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y^+|x)}{\pi_{\rm ref}(y^+|x)} - \log \frac{\pi_\theta(y^-|x)}{\pi_{\rm ref}(y^-|x)} \right] \right) \right]$$

where $\sigma$ is the logistic sigmoid and $\beta > 0$ controls the strength of the KL-type constraint to $\pi_{\rm ref}$ (Compagnoni et al., 27 Aug 2025, Zhao et al., 2023, Ouali et al., 2024, Xie et al., 2024, Fu et al., 2024, Yu et al., 30 Nov 2025, Yang et al., 16 Jan 2025).
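As a concrete reference point, the sketch below computes this loss in PyTorch from per-sequence log-probabilities; the tensor names and toy inputs are illustrative rather than taken from any cited implementation.

```python
# Minimal PyTorch sketch of the (HA-)DPO loss above, assuming summed per-sequence
# log-probabilities are already available for each preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective over (less-hallucinated, more-hallucinated) pairs."""
    # Log-likelihood ratios of the policy vs. the frozen reference model.
    ratio_pos = policy_logp_pos - ref_logp_pos
    ratio_neg = policy_logp_neg - ref_logp_neg
    # -log sigmoid(beta * margin); logsigmoid is the numerically stable form.
    return -F.logsigmoid(beta * (ratio_pos - ratio_neg)).mean()

# Toy usage with random per-sequence log-probs for a batch of 4 pairs.
torch.manual_seed(0)
pol_pos, pol_neg = torch.randn(4), torch.randn(4)
ref_pos, ref_neg = torch.randn(4), torch.randn(4)
print(dpo_loss(pol_pos, pol_neg, ref_pos, ref_neg, beta=0.2))
```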

HA-DPO extensions modify the preference data (targeted selection of hallucination-contrastive pairs), the loss (customized weighting or additional regularizers), or the hallucination signal itself (domain-specific metrics or classifiers). Some frameworks introduce multi-term objectives, e.g., V-DPO's joint loss penalizing a lack of vision-text divergence, or CHAIR-DPO's pairing of the DPO loss with CHAIR-based hallucination detection (Compagnoni et al., 27 Aug 2025, Xie et al., 2024).
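To illustrate the loss-weighting direction, here is a minimal sketch that scales each pair's contribution by a hallucination-severity gap (e.g., the difference between the two responses' CHAIR or classifier scores); the specific weighting scheme is an assumption for exposition, not a published variant.

```python
# Illustrative loss-weighted HA-DPO variant: each pair is scaled by a
# hallucination margin supplied by some external scorer (assumed, not cited).
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_logp_pos, policy_logp_neg,
                      ref_logp_pos, ref_logp_neg,
                      hallucination_margin, beta=0.1):
    """DPO loss with per-pair weights derived from a hallucination-severity gap."""
    ratio_pos = policy_logp_pos - ref_logp_pos
    ratio_neg = policy_logp_neg - ref_logp_neg
    per_pair = -F.logsigmoid(beta * (ratio_pos - ratio_neg))
    # Pairs with a larger hallucination gap contribute more to the update.
    w = hallucination_margin.clamp_min(0.0)
    w = w / w.sum().clamp_min(1e-8)
    return (w * per_pair).sum()
```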

3. Preference Data Construction and Hallucination Detection

All effective HA-DPO approaches require robust construction of preference pairs that label one response as less hallucinated than the other. The main strategies are:

| Approach | Hallucination Signal | Pair Generation Mechanism |
|---|---|---|
| CHAIR-DPO (Compagnoni et al., 27 Aug 2025) | CHAIR metric | Object detector labels in vision |
| HA-DPO (Zhao et al., 2023) | GPT-4 correction | GPT-4 rewriters and Visual Genome |
| HDPO (Fu et al., 2024) | Targeted negatives | Visual, long-context, multimodal |
| CLIP-DPO (Ouali et al., 2024) | CLIP similarity | Ranking CLIP scores on captions |
| HA-DPO-Music (Zhang et al., 7 Aug 2025) | PER (phoneme error rate) | ASR evaluation on audio |
| Robust On-Policy (Yu et al., 30 Nov 2025), OPA-DPO (Yang et al., 16 Jan 2025) | Hallucination classifier | On-policy sampling + classifier |

Key techniques involve using object detectors (CHAIR), contrastive vision-language scorers (CLIP), phoneme error rates (audio), external LLMs for hallucination identification and rewriting (GPT-4/GPT-4V), and domain-specific rules for pair assembly (Compagnoni et al., 27 Aug 2025, Zhao et al., 2023, Ouali et al., 2024, Zhang et al., 7 Aug 2025). On-policy sampling and classifier filtering are critical for avoiding off-policy collapse and ensuring the KL constraint is respected (Yu et al., 30 Nov 2025, Yang et al., 16 Jan 2025).
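The table above can be read as instances of one generic recipe: sample candidates, score them with a hallucination signal, and keep only pairs with a clear margin. The sketch below illustrates this, assuming a scalar scorer `hallucination_score(prompt, response)` where lower means better grounded; all names are hypothetical placeholders, not a specific published pipeline.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(prompt: str,
                           candidates: List[str],
                           hallucination_score: Callable[[str, str], float],
                           min_margin: float = 0.1) -> List[Tuple[str, str, str]]:
    """Return (prompt, winner, loser) tuples with a sufficient hallucination margin.

    `hallucination_score` stands in for any signal from the table above: a
    CHAIR-style object check, a (negated) CLIP similarity, PER, or a classifier.
    """
    scored = sorted((hallucination_score(prompt, c), c) for c in candidates)
    best_score, winner = scored[0]     # least hallucinated candidate
    worst_score, loser = scored[-1]    # most hallucinated candidate
    if worst_score - best_score < min_margin:
        return []                      # near-tie: no useful supervision signal
    return [(prompt, winner, loser)]
```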

4. Algorithmic Recipes, Training Pipelines, and Hyperparameters

Standard HA-DPO implementations:

  1. Pair Sampling: For each prompt (possibly with vision or modality data), generate multiple candidate outputs (via sampling) from the current or reference model.
  2. Hallucination Rating: Score or classify candidates using automated metrics (CHAIR, CLIP, PER), expert-model ranking, or a trained hallucination classifier.
  3. Preference Pair Filtering: Select pairs with sufficient hallucination margin, avoiding ties. Filter to maximize supervision signal (e.g., discard no-difference pairs (Compagnoni et al., 27 Aug 2025)).
  4. Fine-Tuning: Apply LoRA adapters or partial/fine-grain parameter updates, often with AdamW optimizer and cosine learning rate schedules.
  5. Batch Processing: Batch size, learning rate, and $\beta$ are tuned by benchmarking on hallucination-specific validation sets. Example: batch size 64, learning rate $2\times 10^{-6}$, LoRA (rank 128, $\alpha=256$) (Compagnoni et al., 27 Aug 2025, Fu et al., 2024, Yu et al., 30 Nov 2025); a minimal setup along these lines is sketched after this list.
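The following sketch covers steps 4 and 5 with the example hyperparameters above. It uses a small HuggingFace causal LM as a stand-in for the actual MLLM policy and the peft library for LoRA adapters; the target modules, dropout, and step count are illustrative assumptions, and the per-batch loss computation is left to the DPO loss from Section 2.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in model; in practice this is the MLLM policy (e.g., a LLaVA variant).
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=128, lora_alpha=256, lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2 attention projection; model-specific assumption
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base_model, lora_cfg)   # only adapter weights are trainable

num_steps = 1000                                # illustrative
optimizer = torch.optim.AdamW(policy.parameters(), lr=2e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# Each step: compute the DPO loss (Section 2) on a batch of ~64 filtered preference
# pairs, then loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad().
```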

Specialized strategies involve (a) warm-up SFT phases (e.g., with rejection sampling), (b) iterative on-policy data curation with classifier-based positive/negative filtering (Yu et al., 30 Nov 2025), and (c) multi-term objectives with spectral or KL regularizers (Zhang et al., 29 Jul 2025, Xie et al., 2024). In all cases, model selection is based on minimizing hallucination metrics on held-out preference pairs.

5. Quantitative Results, Empirical Benchmarks, and Ablations

Across benchmarks (AMBER, Object HalBench, CHAIR-MSCOCO, MMHalBench, POPE), HA-DPO and close variants yield substantial reductions in both sentence-level (CHAIR$_s$) and instance-level (CHAIR$_i$) hallucination rates:

| Model/Method | AMBER CHAIR$_i$ | HalRate | Gain vs Baseline |
|---|---|---|---|
| LLaVA-1.5-7B Baseline | 7.6% | 35.0% | -- |
| CHAIR-DPO ($\beta=0.2$) | 3.0% | 14.7% | –4.6 / –20.3 pp |
| HDPO (LLaVA-7B) (Fu et al., 2024) | 16.6% (CHAIR$_s$) | 15.8% | Best overall |
| Robust HA-DPO (Yu et al., 30 Nov 2025) | 2.4% (CHAIR$_i$) | 13.6% | 50% HalRate ↓ |
| POPE Acc. (MiniGPT-4) (Zhao et al., 2023) | 86.13% | -- | +35 pp absolute |
| OPA-DPO (Yang et al., 16 Jan 2025) | 4.25% (CHAIR$_i$) | -- | 5.55 pp ↓ vs SOTA |

Ablation studies confirm that removing targeted preference data, filtering, sample reweighting, or adversarial perturbations (as in TARS (Zhang et al., 29 Jul 2025)) degrades hallucination mitigation performance (Compagnoni et al., 27 Aug 2025, Fu et al., 2024, Yu et al., 30 Nov 2025, Zhang et al., 29 Jul 2025). On-policy preference sampling is necessary for stable improvements due to the limitations (KL-barriers) of optimizing over strictly off-policy data (Yang et al., 16 Jan 2025, Yu et al., 30 Nov 2025).

6. Limitations, Extensions, and Generalization

Limitations:

  • Metric reliance: Object-centric metrics (CHAIR, object detectors) do not penalize attribute or relationship hallucinations (Compagnoni et al., 27 Aug 2025).
  • Detector coverage: Quality is bound by the object detector/classifier’s vocabulary and false negative rate (Compagnoni et al., 27 Aug 2025, Ouali et al., 2024).
  • Positive example quality: For fully automatic pipelines, non-human–filtered positive samples may introduce subtle artifacts (Fu et al., 2024, Yu et al., 30 Nov 2025).
  • Domain restriction: Some approaches are tailored for captioning or vision-language QA rather than dialog or reasoning (Fu et al., 2024).
  • Data scale: Empirical gains sometimes plateau with small datasets; scaling laws suggest more data improves scores further (Fu et al., 2024).

Extensions:

The general HA-DPO paradigm requires only that a hallucination-quantifying function $M(y|x)$ be available to induce preference pairs. Any scalar metric (e.g., SQL consistency in VQA, CLIP score in vision-language, PER in audio) can instantiate HA-DPO by taking the place of CHAIR or an equivalent signal in pair construction and loss definition (Compagnoni et al., 27 Aug 2025).

7. Theoretical Insights and Strategic Considerations

Several works emphasize the theoretical necessity of on-policy alignment to avoid KL-induced barriers to effective learning. If the reference policy $\pi_{\rm ref}$ assigns near-zero probability to on-policy positives (e.g., expert-written hallucination-free outputs), the KL divergence becomes infinite and DPO does not redistribute probability mass as intended (Yang et al., 16 Jan 2025). On-policy data collection, where preference pairs are generated by the current policy or its immediate refinement, circumvents this obstacle, enabling stable, progressive suppression of hallucinations with each iteration (Yu et al., 30 Nov 2025, Yang et al., 16 Jan 2025).
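One way to make this concrete is via the standard closed-form optimum of the KL-regularized objective that DPO implicitly solves: the optimal policy is a reweighting of the reference, so a response with (near-)zero reference probability can never gain appreciable mass, whatever reward it deserves.

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\rm ref}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big), \qquad \pi_{\rm ref}(y^{+} \mid x)=0 \;\Longrightarrow\; \pi^{*}(y^{+} \mid x)=0 .$$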

Dynamic reweighting of training samples can further accelerate convergence by focusing updates on near-ties (Rao–Kupper) or leveraging classifier confidence regions, ensuring that the gradient signal is not dominated by trivial or noisy preferences (Yu et al., 30 Nov 2025).
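As a purely illustrative sketch of such reweighting (not the exact Rao-Kupper or classifier-confidence scheme of the cited works), one can upweight pairs whose current preference margin is near a tie so that confidently separated pairs stop dominating the gradient:

```python
import torch

def near_tie_weights(policy_margin: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """policy_margin: beta * (log-ratio of y+ minus log-ratio of y-) for each pair."""
    p_win = torch.sigmoid(policy_margin / temperature)
    # 4 * p * (1 - p) peaks at a near-tie (p = 0.5) and vanishes for confident pairs.
    return 4.0 * p_win * (1.0 - p_win)
```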

The HA-DPO family provides a general, modular, empirically validated methodology for hallucination mitigation in grounded generation, suitable for integration with open-source, commercial, and specialized base models. Its scalability, alignment fidelity, and extensibility to new metrics and modalities make it a central technique in model alignment and reliability research.
