
Fact-RLHF: Factual Alignment in RLHF

Updated 1 November 2025
  • Fact-RLHF is a framework that augments traditional RLHF with explicit factual signals to align models closer to ground-truth, reducing hallucinations.
  • It combines supervised fine-tuning, human feedback, and augmented reward modeling to counteract reward hacking and unfaithful outputs in multimodal settings.
  • Empirical evaluations show that Fact-RLHF achieves near-GPT-4 performance on benchmarks like LLaVA-Bench, setting new standards for factual consistency.

Factually Augmented RLHF (Fact-RLHF) is a framework that extends reinforcement learning from human feedback (RLHF) with explicit factual signals, aiming to align large models, especially vision-language models and LLMs, more robustly with ground-truth information and to minimize hallucinated outputs. Fact-RLHF emerged in response to demonstrated failures of preference-only RLHF to enforce factual correctness, and enables scalable, data-efficient, and robust training of trustworthy multimodal models and LLMs.

1. Background and Motivation

RLHF enables agent alignment without hand-crafted reward functions by leveraging human feedback, typically by training a reward model to mimic human preferences between outputs. However, standard RLHF—especially in large multimodal models (LMMs) and LLMs—is susceptible to several critical problems:

  • Multimodal Misalignment: In LMMs, generated textual outputs may not be properly grounded in visual input, frequently resulting in hallucinations, i.e., text not supported by image content (Sun et al., 2023).
  • Reward Hacking: Models may exploit weaknesses in the reward model to generate superficially high-scoring, yet misleading or verbose outputs, deviating further from truthful or factually correct content (Sun et al., 2023).
  • Sparse Factual Supervision: RLHF typically relies on holistic, end-to-end response preference data, which may conflate fluency with factuality, further entrenching hallucinations or ungrounded statements (Sun, 2023, Kaufmann et al., 2023).

Fact-RLHF directly addresses these deficits by augmenting the training pipeline with explicit factual information—either in the reward modeling process or the optimization objective—to enhance the detection and penalization of hallucinated or unfaithful outputs, especially in multi-modal contexts.

2. Core Methodology and Technical Constructs

2.1. Fact-RLHF Training Pipeline

Fact-RLHF extends the RLHF process in multimodal models (vision-language) as follows (Sun et al., 2023):

  1. Supervised Fine-Tuning (SFT): Pretrain a model on large-scale vision-language instruction datasets, including synthetic (e.g., GPT-4-generated) and human-annotated image-text pairs.
  2. Human Feedback Collection: Gather human labels on prompt–response pairs, explicitly instructing annotators to identify the more hallucinated response.
  3. Reward Model Training: Train the reward model using augmented inputs that include not just the prompt and the candidate response, but additional ground-truth information such as image captions or multi-choice answers.
  4. RL Fine-Tuning (PPO): Optimize the policy to maximize the factually calibrated reward, with KL-divergence regularization toward the SFT model to mitigate mode collapse (see the reward-shaping sketch below).
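
To make step 4 concrete, the following is a minimal sketch of how the factually calibrated reward can be combined with a KL penalty toward the SFT model before being passed to PPO. The function name, tensor shapes, and the kl_coef value are illustrative assumptions, not details from the paper.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  sft_logprobs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """Combine the factually calibrated reward with a KL penalty toward the SFT policy.

    rm_score:        per-sample reward from the (augmented) reward model, shape [batch]
    policy_logprobs: log-probs of the generated tokens under the current policy, shape [batch, seq]
    sft_logprobs:    log-probs of the same tokens under the frozen SFT model, shape [batch, seq]
    """
    # Per-token log-probability ratio; its sum is a sample-based estimate of the
    # KL divergence between the policy and the SFT reference on this response.
    log_ratio = policy_logprobs - sft_logprobs           # [batch, seq]
    kl_penalty = kl_coef * log_ratio.sum(dim=-1)         # [batch]
    # Scalar reward handed to the PPO optimizer: RM score minus the KL regularizer.
    return rm_score - kl_penalty
```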

2.2. Factually Augmented Reward Model

Unlike standard RLHF reward models, which score a response given (image, prompt), Fact-RLHF's reward model operates on (image, prompt, response, factual information):

  • For each example, the additional factual input may consist of human-generated image captions, ground-truth VQA answers, rationales, or multi-choice candidates—for instance, all five COCO captions for image questions, or detailed rationales from A-OKVQA (Sun et al., 2023).
  • The reward model is trained with preference data and a binary cross-entropy loss over preferred and non-preferred responses:

\mathcal{L}(r_{\boldsymbol{\theta}}) = -\,\mathbb{E}_{(\mathcal{I},\, x,\, y_0,\, y_1,\, i)}\left[ \log \sigma\left( r_{\boldsymbol{\theta}}(\mathcal{I}, x, y_i) - r_{\boldsymbol{\theta}}(\mathcal{I}, x, y_{1-i}) \right) \right]

This factual conditioning allows the RM to more reliably penalize unfaithful or hallucinated responses by referencing ground truth.
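
A minimal sketch of this pairwise objective is given below, assuming a generic reward_model callable that scores an (image, prompt, response) triple; appending the factual context to the prompt is one plausible realization of the augmentation and is not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def fact_augmented_rm_loss(reward_model,
                           image: torch.Tensor,
                           prompt: str,
                           facts: str,
                           preferred: str,
                           rejected: str) -> torch.Tensor:
    """Pairwise preference loss with factual augmentation.

    `facts` holds ground-truth context (e.g. COCO captions or VQA answers) that is
    appended to the prompt so the reward model can check responses against it.
    """
    augmented_prompt = prompt + "\n[Facts] " + facts                 # hypothetical augmentation format
    score_pref = reward_model(image, augmented_prompt, preferred)    # shape [1]
    score_rej = reward_model(image, augmented_prompt, rejected)      # shape [1]
    # -log sigma(r(y_preferred) - r(y_rejected)): the standard Bradley-Terry preference loss.
    return -F.logsigmoid(score_pref - score_rej).mean()
```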

2.3. Symbolic and Auxiliary Rewards

Fact-RLHF introduces symbolic rewards when ground truth answers are available (e.g., binary VQA), providing unambiguous supervision:

  • Rewards are boosted for correctly grounded answers, penalized for mismatches.
  • Length penalties are enforced to counteract verbosity, which commonly correlates with elevated hallucination rates (Sun et al., 2023); a combined sketch follows this list.
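
The sketch below illustrates how a symbolic correctness bonus and a length penalty might be folded into the learned reward; the bonus magnitude, length coefficient, and substring check are assumptions for illustration only.

```python
from typing import Optional

def symbolic_reward(rm_score: float,
                    response: str,
                    ground_truth: Optional[str],
                    correct_bonus: float = 1.0,
                    length_coef: float = 0.01,
                    target_len: int = 60) -> float:
    """Adjust the learned reward with symbolic and length-based terms.

    rm_score:     score from the factually augmented reward model
    ground_truth: known answer (e.g. a binary VQA label), or None if unavailable
    """
    reward = rm_score
    if ground_truth is not None:
        # Symbolic reward: boost correctly grounded answers, penalize mismatches.
        if ground_truth.lower() in response.lower():
            reward += correct_bonus
        else:
            reward -= correct_bonus
    # Length penalty discourages verbosity, which correlates with hallucination rates.
    excess_tokens = max(0, len(response.split()) - target_len)
    reward -= length_coef * excess_tokens
    return reward
```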

2.4. Reward Hacking Defense

By explicitly conditioning on factual content, Fact-RLHF prevents superficial alignment. The model's responses are checked directly against known ground truth, making spurious but superficially plausible alternatives less likely to score well.

3. Empirical Evaluation and Comparative Performance

Several benchmarks and metrics are used to validate the efficacy of Fact-RLHF (Sun et al., 2023):

3.1. Benchmarks

  • LLaVA-Bench: Rates LMM responses by comparing to a text-only GPT-4 baseline.
  • MMHal-Bench: Specially developed for hallucination detection across multiple question types (attributes, adversarial, comparison).
  • POPE & MMBench: Assess general capabilities and object hallucination.

3.2. Results

Fact-RLHF achieves strong empirical results, surpassing previous methods:

Model         | LLaVA-Bench (%) | MMHal-Bench score (higher is better)
SFT-7B        | 86.3            | 1.8
RLHF-7B       | 94.1            | 1.8
Fact-RLHF-7B  | 94.1            | 2.1
SFT-13B       | 84.9            | 2.0
RLHF-13B      | 95.6            | 2.0
  • Fact-RLHF achieves 94–95.6% of GPT-4 (text-only) performance on LLaVA-Bench, far exceeding previous SFT and vanilla RLHF models.
  • On MMHal-Bench, Fact-RLHF yields top scores for both overall and hallucination rate, outperforming Kosmos-2, IDEFICS, InstructBLIP, and others.
  • Qualitative analysis reveals Fact-RLHF-trained models generate more truthful and less hallucinated responses, and are robust against verbosity-based reward hacking.

This suggests that explicit factual augmentation yields a quantifiable boost in both factual alignment and holistic capability when compared to preference-only RLHF and SFT.

4. Relation to Broader RLHF Research

Fact-RLHF is part of a broader effort to imbue RLHF with factual supervision, avoiding exclusive reliance on subjective human preference (Kaufmann et al., 2023). Several approaches contextualize Fact-RLHF:

  • Fact-Augmented Feedback: Direct integration of fact-verification signals (from humans, AI, or external resources) to supplement preference data.
  • Multi-Objective Reward Modeling: Jointly optimizing for helpfulness and factuality, formally R = R_{helpful} + \lambda R_{fact}, or using multi-objective RL policies (see the sketch after this list).
  • Rule-Based Constitutional Feedback: Adding explicit rules for factual correctness during learning.
  • AI-Assisted Fact Checking: Using model-based or ensemble fact checkers to fortify human judgments.
  • Mitigating Reward Model Overoptimization: Factual augmentation regularizes the reward, mitigating both reward hacking and the “reward hypothesis” mismatch.
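
As a hedged illustration of the multi-objective formulation above, the sketch combines a helpfulness score and a factuality score with a fixed weight; the scorer outputs and the default lambda are placeholders rather than values from the cited works.

```python
import torch

def combined_reward(helpful_scores: torch.Tensor,
                    fact_scores: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    """Multi-objective reward R = R_helpful + lambda * R_fact.

    helpful_scores, fact_scores: per-sample scores, shape [batch]
    lam: weight trading off helpfulness against factual correctness
    """
    return helpful_scores + lam * fact_scores
```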

Fact-RLHF thus operationalizes best-practices identified in RLHF research for increasing factual alignment, specifically in the multi-modal (vision-language) domain (Kaufmann et al., 2023, Sun, 2023).

5. Significance, Limitations, and Insights

Fact-RLHF constitutes the first demonstration of RLHF with explicit factual conditioning as a regularization and alignment signal for LMMs (Sun et al., 2023). Key outcomes are:

  • Novel Alignment Mechanism: By providing reward models with additional ground-truth information during training, Fact-RLHF creates a more robust, less gameable alignment signal.
  • Robust Hallucination Mitigation: Systematic reduction in hallucinated or ungrounded outputs—not adequately addressed by SFT or standard RLHF—has been demonstrated in domain-specific benchmarks.
  • Blueprint for Broader Adoption: The architecture and training recipes—including mathematical objective functions and use of augmented factual data—are provided for application to other multimodal and conversational AI systems.
  • Scalability and Efficiency: Fact-RLHF achieves improved results without incurring additional annotation cost at inference or online RL time; factual inputs are leveraged only during reward modeling and RL training.
  • Generalization: Factual augmentation not only improves hallucination metrics but also sample efficiency and cross-task performance, facilitating trustworthy deployment in realistic AI applications.

A plausible implication is that the methods instantiated in Fact-RLHF may serve as template mechanisms for future RLHF research that targets not only subjective alignment (helpfulness, harmlessness) but also factual alignment, especially in risk-sensitive or high-stakes domains.

6. Future Research and Open Directions

Fact-RLHF opens several research avenues and aligns with emerging priorities in RLHF and LLM alignment literature (Sun et al., 2023, Kaufmann et al., 2023):

  • Reward Model Design: Exploring different sources and representations of factual information, increasing granularity (e.g., segment or token-level reward densification).
  • Automated Factuality Signals: Integrating automated fact-checkers or knowledge-base derived ground truths at scale, further reducing human annotation bottlenecks.
  • Multi-Objective and Pareto Alignment: Formally treating factuality as one of several objectives, possibly with dynamic weighting or Pareto-optimization strategies.
  • Generalization Across Modalities: Transferring factual augmentation principles to additional modalities, such as audio-language or video-LLMs.
  • Online and Continual Augmentation: Adapting factual augmentation for online or continual learning settings, where ground-truth knowledge may evolve.

These directions are aligned with the critical observation that factual correctness and helpfulness often conflict, and additional research is required to appropriately balance these axes in model alignment pipelines (Kaufmann et al., 2023, Sun, 2023).


Summary Table: Standard RLHF vs. Factually Augmented RLHF

Aspect           | Standard RLHF                              | Factually Augmented RLHF (Fact-RLHF)
Feedback         | Human preference/response ranking          | Human preference + factual information
Reward Model     | Trained on response and context            | Trained on response, context, and factual signals
Alignment Target | Helpfulness, general alignment             | Helpfulness + factuality, hallucination reduction
Hallucination    | Partially mitigated; reward model hackable | Robust mitigation, less reward hacking

Factually Augmented RLHF provides a principled, empirically validated, and methodologically transparent approach to training multimodal models and LLMs that are not just helpful but reliably factual and grounded. The framework has established new baselines for factuality and hallucination rates in vision-language models and serves as a touchstone for future research in scalable, robust model alignment.
