
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning (2509.25760v1)

Published 30 Sep 2025 in cs.CL, cs.AI, and cs.LG

Abstract: While LLMs have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. In-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.

Summary

  • The paper introduces a reinforcement learning framework that directly optimizes a truthfulness score by balancing accuracy, abstention, and hallucination penalties.
  • It employs a novel ternary reward system, assigning +1 for correct answers, 0 for abstentions, and -1 for hallucinations to improve model calibration.
  • Empirical results on benchmarks show up to 28.9% reduction in hallucinations and a 21.1% boost in truthfulness compared to traditional methods.

TruthRL: Directly Optimizing Truthfulness in LLMs via Reinforcement Learning

Motivation and Problem Formulation

LLMs exhibit strong performance on factoid question answering but remain susceptible to hallucinations—producing plausible yet factually incorrect statements, especially when queried outside their parametric knowledge. This issue is particularly acute in high-stakes domains, where incorrect outputs can have severe consequences. The core challenge is that conventional training paradigms, such as supervised fine-tuning (SFT) and reinforcement learning (RL) with accuracy-based rewards, inherently incentivize models to always provide an answer, even when uncertain, thus amplifying hallucinations. In contrast, a truly truthful LLM should maximize correct answers, abstain when uncertain, and minimize hallucinations.

The paper introduces a formal definition of truthfulness as a weighted combination of accuracy, uncertainty (abstention), and hallucination rates. The objective is to design training methods that maximize this truthfulness score, rather than accuracy alone.
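For concreteness, a minimal rendering of this objective (the exact notation here is an assumption; the three rates are written as Acc, Unc, and Hall with non-negative weights $w_1, w_2, w_3$):

```latex
% Truthfulness score as a weighted combination of accuracy, abstention, and hallucination rates.
% A larger w_3 penalizes hallucinations more heavily; w_2 controls how much credit abstention receives.
T_{\text{truth}} = w_1 \cdot \mathrm{Acc} + w_2 \cdot \mathrm{Unc} - w_3 \cdot \mathrm{Hall}
```

With $w_1 = 1$, $w_2 = 0$, $w_3 = 1$ (the setting used in the paper's evaluation), the score reduces to accuracy minus hallucination rate, so abstention is never penalized but only correctness is rewarded.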

TruthRL Framework and Reward Design

TruthRL is a general RL framework instantiated with Group Relative Policy Optimization (GRPO), designed to directly optimize for truthfulness. The key innovation is a ternary reward structure that distinguishes between correct answers, hallucinations, and abstentions (a minimal sketch follows the list):

  • Correct answer: +1
  • Abstention ("I don't know"): 0
  • Hallucination: −1
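As a minimal sketch (the outcome label is assumed to come from the paper's LLM-based verifier, abstracted here as a plain string):

```python
def ternary_reward(label: str) -> float:
    """Map a verifier verdict to TruthRL's ternary reward.

    `label` is assumed to be one of "correct", "abstain", or "hallucination",
    as judged by an LLM-based verifier (not shown here).
    """
    if label == "correct":
        return 1.0   # correct answer
    if label == "abstain":
        return 0.0   # "I don't know" is treated neutrally
    return -1.0      # any other (incorrect) answer is penalized as a hallucination
```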

This reward structure explicitly penalizes hallucinations and treats abstentions neutrally, encouraging the model to abstain when uncertain rather than guessing.

Figure 1: TruthRL’s ternary reward design incentivizes correct answers, penalizes hallucinations, and treats abstentions neutrally, in contrast to SFT/RL which reward hallucinations over abstentions.

The GRPO objective leverages group-based advantage estimation, ensuring that the policy update differentiates between abstention and hallucination. The framework also explores knowledge-enhanced and reasoning-enhanced reward variants, but empirical results indicate that the simple ternary reward is most effective.
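The following sketch illustrates how the ternary reward feeds into group-relative advantages, assuming the standard GRPO normalization by group mean and standard deviation; the paper's full objective additionally uses importance-ratio clipping and a KL penalty against a reference policy.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize each rollout's reward by its group's mean and std (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example group of 4 rollouts for one question:
# one correct answer (+1), two hallucinations (-1), one abstention (0).
print(group_relative_advantages([1.0, -1.0, -1.0, 0.0]))
# ~[ 1.51, -0.90, -0.90, 0.30]: the correct answer gets the largest advantage,
# the abstention a small positive one, and the hallucinations negative advantages,
# so the policy is pushed toward answering correctly or abstaining rather than guessing.
```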

Empirical Evaluation and Main Results

TruthRL is evaluated on four knowledge-intensive benchmarks (CRAG, NaturalQuestions, HotpotQA, MuSiQue) under both retrieval-augmented and non-retrieval settings, using Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct as backbone models. The evaluation metrics are truthfulness score, hallucination rate, and accuracy.
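As a hypothetical illustration of how per-question verdicts could be aggregated into these metrics (using the $w_1{=}1$, $w_2{=}0$, $w_3{=}1$ weighting noted in the paper's evaluation; the verdicts themselves come from LLM-based judges):

```python
from collections import Counter

def truthfulness_metrics(labels, w1=1.0, w2=0.0, w3=1.0):
    """Compute Acc/Unc/Hall rates and the weighted truthfulness score.

    `labels` is assumed to be a list of per-question verdicts:
    "correct", "abstain", or "hallucination".
    """
    counts, n = Counter(labels), len(labels)
    acc = counts["correct"] / n
    unc = counts["abstain"] / n
    hall = counts["hallucination"] / n
    return {"Acc": acc, "Unc": unc, "Hall": hall,
            "Truthfulness": w1 * acc + w2 * unc - w3 * hall}
```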

Key findings include:

  • TruthRL reduces hallucinations by up to 28.9% and improves truthfulness by up to 21.1% compared to vanilla RL.
  • TruthRL achieves the lowest hallucination rates and highest truthfulness scores across all tested configurations.
  • Knowledge-enhanced SFT baselines (e.g., R-Tuning) reduce hallucinations but at the cost of coverage or accuracy, whereas TruthRL maintains a better balance.
  • Binary reward RL variants achieve higher accuracy but at the expense of increased hallucinations and near-zero abstention rates, limiting overall truthfulness.

Figure 2: TruthRL achieves the highest accuracy and lowest hallucination rate on CRAG, outperforming all baselines.

Figure 3: TruthRL consistently achieves the lowest hallucination rate across training epochs compared to other reward designs.

Analysis: Knowledge Boundary Recognition and Robustness

TruthRL enables LLMs to recognize their knowledge boundaries, abstaining primarily when genuinely uncertain. On challenging subsets where correct answers are rare, TruthRL produces minimal hallucinations and high abstention rates, while other methods hallucinate heavily. This demonstrates improved calibration and uncertainty awareness.

TruthRL is also robust to hallucination-baiting questions (e.g., multiple-choice with explicit candidates), where baseline methods exhibit high hallucination rates. TruthRL maintains the lowest hallucination and highest truthfulness in these settings.

The framework is robust across different LLM-based verifiers and judges, and its improvements generalize across model scales, with particularly pronounced gains for smaller models that are more prone to hallucination.

Ablation and Reward Structure

Ablation studies confirm that the ternary reward structure is critical for balancing accuracy, abstention, and hallucination. Binary rewards drive models to eliminate abstentions, conflating uncertainty with error, while knowledge-enhanced and reasoning-enhanced variants offer marginal or inconsistent improvements.

Figure 4: Scaling curve for prompting and vanilla SFT/RL on CRAG; vanilla SFT/RL suppress abstention and increase hallucination, while prompting maintains a better balance.

Comparison with Offline and Semi-Online RL

TruthRL’s online RL paradigm (GRPO) outperforms both offline (DPO) and semi-online (iterative DPO) RL methods. Offline RL is limited by static datasets and cannot adaptively refine model behavior, while iterative DPO shows inconsistent improvements and can regress with excessive iterations. Online RL enables timely policy updates and better exploration-exploitation balance.

Reasoning Quality and Multi-Objective Reward

The paper explores incorporating reasoning quality into the reward function. While outcome-only rewards implicitly improve reasoning, explicit multi-objective reward designs (e.g., additive or multiplicative reasoning rewards) require careful balancing to avoid compromising outcome quality. This area remains open for further research.

Practical and Theoretical Implications

TruthRL demonstrates that reward structure is a fundamental lever for aligning LLMs with truthfulness objectives. By explicitly modeling abstention and penalizing hallucination, LLMs can be trained to be more reliable and trustworthy, especially in high-stakes or open-domain settings. The approach is scalable, robust to different evaluation protocols, and effective across model sizes.

Practically, TruthRL can be integrated into LLM post-training pipelines to produce models that are less likely to hallucinate and more likely to abstain when uncertain, which is critical for deployment in safety-critical applications. Theoretically, the work highlights the importance of multi-dimensional objectives in RL for LLMs and opens avenues for further research into richer reward structures, including those that account for reasoning quality and calibration.

Conclusion

TruthRL provides a principled and empirically validated framework for directly optimizing the truthfulness of LLMs via reinforcement learning with a ternary reward. The method achieves significant reductions in hallucination and improvements in truthfulness across diverse benchmarks and model architectures. The findings underscore the necessity of moving beyond accuracy-driven objectives and adopting reward structures that explicitly balance correctness, abstention, and hallucination minimization. Future work may extend this framework to incorporate more nuanced reasoning and calibration objectives, further enhancing the reliability of LLMs in real-world applications.


Explain it Like I'm 14

Overview

This paper is about making LLMs more truthful. Truthful means not just getting answers right, but also saying “I don’t know” when they’re unsure instead of guessing and making things up. The authors created a training method called TruthRL that teaches LLMs to reduce “hallucinations” (confident but wrong answers) and to abstain when they don’t know.

Key Objectives

The paper aims to answer simple, practical questions:

  • How can we train LLMs to be truthful, not just accurate?
  • Can we encourage models to avoid guessing when they’re unsure?
  • What kind of training rewards help models balance giving correct answers and admitting uncertainty?

Methods and Approach

Think of training an LLM like coaching a student for a quiz:

  • Old methods focused only on accuracy. This is like telling the student, “Always answer—guess if you must,” which leads to more wrong answers.
  • The new method, TruthRL, uses reinforcement learning (RL). RL is like giving feedback after each attempt: reward good behavior, discourage bad behavior, and repeat until the student improves.

Here’s how TruthRL works, using everyday ideas:

  • Group practice: The model tries several possible answers to the same question.
  • A “ternary reward” system:
    • Correct answer: reward (+1)
    • “I don’t know” (abstain): neutral (0)
    • Wrong answer (hallucination): penalty (−1)
  • Why this matters: If the model truly doesn’t know, it’s better to abstain than to invent something wrong. This reward design teaches that behavior.

The training algorithm they use is called GRPO (Group Relative Policy Optimization). Simple explanation:

  • The model makes a small group of answers.
  • The training compares answers in the group and pushes the model toward the better ones.
  • With the ternary reward, telling the truth (“I don’t know”) is treated better than guessing and being wrong.

They also tried extra reward ideas:

  • Knowledge-enhanced rewards: give a bonus for abstaining when a question is genuinely beyond the model’s knowledge.
  • Reasoning-enhanced rewards: score how good the model’s thinking process is, not just the final answer.

Surprisingly, the simple ternary reward usually worked best.

To find which questions the model truly doesn’t know, they probed its “knowledge boundary” by sampling many answers. If none were correct, they labeled the question as “out-of-knowledge” and trained the model to say “I don’t know” for those.
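A hypothetical sketch of this probing procedure (the `generate` and `is_correct` callables are placeholders for the model sampler and the LLM-based verifier; the paper samples 256 responses per question):

```python
def is_out_of_knowledge(question, generate, is_correct, n_samples: int = 256) -> bool:
    """Label a question out-of-knowledge (OOK) if none of the sampled answers is correct."""
    for _ in range(n_samples):
        answer = generate(question)          # sample one answer from the model
        if is_correct(question, answer):     # verifier checks against the gold answer
            return False                     # at least one correct sample: within knowledge
    return True                              # no correct sample found: treat as OOK
```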

Main Findings and Why They Matter

In tests on four tough question-answering benchmarks (CRAG, NaturalQuestions, HotpotQA, MuSiQue), and with different LLMs (like Llama and Qwen), TruthRL:

  • Cut hallucinations by up to 28.9%. Fewer wrong, made-up answers.
  • Improved overall truthfulness by 21.1%. Better balance of being right and admitting uncertainty.
  • Worked both with and without retrieval. Retrieval means the model can look up documents. Even when those documents are noisy or misleading, TruthRL helped the model resist being tricked.
  • Helped models recognize their “knowledge boundary.” On very hard questions where almost no method gets the right answer, TruthRL’s models mostly said “I don’t know” instead of guessing.
  • Was robust to “hallucination-baiting” questions (like multiple-choice comparisons that tempt guessing). TruthRL kept hallucinations low.
  • Beat other training styles:
    • Supervised fine-tuning (SFT): boosted accuracy but often killed abstaining and increased hallucinations.
    • Vanilla RL with a binary reward (just right vs. wrong): gave high accuracy but also high hallucinations and almost no abstentions.
    • Offline/semi-online RL (like DPO): helped a bit but was inconsistent. TruthRL’s online training did better.

In short, focusing on truthfulness—not just accuracy—produced models that are safer and more reliable.

Implications and Impact

This work suggests a simple but powerful change in how we train LLMs:

  • Rewarding correct answers, tolerating honest “I don’t know,” and penalizing wrong answers leads to more trustworthy models.
  • This is especially important in high-stakes areas like medicine or law, where a wrong confident answer can be harmful.
  • The approach scales across different model sizes and datasets.
  • It shows that good training goals (the “learning objective”) matter as much as big models or more data.

If widely adopted, TruthRL could help build AI systems that people can rely on—not because they always answer, but because they know when not to.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of unresolved issues the paper leaves open. Each item is framed to be concrete and actionable for future research.

  • Truthfulness metric design: The evaluation sets w1=1, w2=0, w3=1 (favoring accuracy minus hallucination, without rewarding abstention). Explore principled weighting schemes, task-dependent weights, and adaptive/dynamic weighting that explicitly values appropriate abstention.
  • Abstention detection: “Uncertain” is operationalized via “I don’t know.” Develop robust detectors for abstention that handle paraphrases, hedged language, mixed responses (partial answer + uncertainty), and multi-lingual cases to prevent gaming and misclassification.
  • Verifier reliability: Training and evaluation depend on LLM-based judges; quantify judge error, bias, and susceptibility to reward hacking. Establish standardized, cross-judge agreement metrics, human validation, and robust aggregation to ensure reliable reward signals.
  • Rule-based vs. LLM-based verification: Rule-based verification failed, but the paper does not analyze hybrid or improved rule-based schemes (e.g., semantic matching, fuzzy normalization, passage-grounded string checks). Investigate combined verifiers for stability and cost-efficiency.
  • Reward hacking risks: Assess whether models learn to exploit the judge (e.g., phrasing that persuades the judge, superficial “uncertainty” tokens). Design adversarial tests and safeguards against reward hacking (e.g., randomized judge prompts, multi-judge ensembles).
  • Partial correctness and evidence-grounding: Ternary reward collapses nuanced outcomes into correct/uncertain/incorrect. Study graded rewards for partial correctness, citation-based support, and evidence-grounded truthfulness (e.g., rewarding supported claims, penalizing unsupported ones).
  • Knowledge boundary identification: OOK labeling via sampling 256 responses is approximate and expensive. Evaluate sensitivity to sample size, error rates, distribution shift, and propose principled, model-agnostic OOK detection (e.g., calibration-based thresholds, uncertainty estimators).
  • Calibration and confidence: Confidence grouping is used but the confidence measure is not formally defined. Establish standardized confidence metrics (e.g., token-level entropy/log-prob aggregates), proper calibration procedures, and threshold selection strategies tied to abstention.
  • Online RL design choices: GRPO group size, advantage normalization, and KL regularization are not ablated. Systematically study hyperparameters’ effects on stability, sample efficiency, abstention behavior, and truthfulness outcomes.
  • Compute and cost analysis: Online RL with LLM judges and large sampling is computationally heavy. Provide detailed cost-per-gain comparisons (tokens, wall-clock time, GPU hours) vs. SFT/RFT/DPO to inform practical deployment.
  • Robustness to adversarial inputs: Only one type of hallucination-baiting (comparison-style multiple choice) is tested. Broaden adversarial evaluations (jailbreaks, misleading premises, contradictory contexts, long-form reasoning traps) and measure abstention vs. correction.
  • Retrieval reliability: The paper notes RAG can be noisy but does not analyze retrieval quality’s impact. Quantify model behavior under controlled retrieval corruption (wrong/contradictory documents), reliability estimation of retrieved evidence, and abstention triggers when evidence is suspect.
  • Evidence attribution: TruthRL does not enforce citing sources. Explore reward designs that require citations, verify citation correctness, and penalize claims with unsupported or mismatched evidence.
  • Generalization beyond CRAG training: Models are trained only on CRAG; evaluate generalization to unseen domains and high-stakes areas (medical, legal), multi-lingual QA, and non-encyclopedic knowledge to validate cross-domain robustness.
  • Long-form and generative tasks: TruthRL is evaluated on factoid QA. Extend to summarization, multi-paragraph answers, code generation, and instruction-following where “truthfulness” and abstention are less clear; define outcome/reasoning/evidence rewards for open-ended outputs.
  • Multi-hop reasoning: HotpotQA/MuSiQue are included, but the reward does not explicitly model multi-hop consistency or intermediate correctness. Investigate step-level rewards, proof consistency checks, and penalize reasoning drift across hops.
  • Reasoning rewards: Heuristic reasoning rewards yielded mixed outcomes. Develop principled, scalable reasoning-quality metrics (e.g., structured proofs, entailment checks), and study how to balance outcome vs. reasoning rewards without degrading accuracy or truthfulness.
  • Abstention-user experience: The paper optimizes abstention without exploring user experience. Design policies that accompany abstention with next actions (e.g., “what info is missing,” “how to retrieve it”), escalation strategies, and interactive uncertainty communication.
  • Hallucination vs. coverage trade-off: Binary reward maximizes accuracy but suppresses abstention; ternary improves truthfulness but can reduce coverage. Formalize the trade-off and define application-specific targets (e.g., acceptable coverage given risk profile).
  • Evaluation with humans: All correctness and reasoning evaluations rely on LLM judges. Include human annotation studies, inter-annotator agreement, and error analyses to calibrate LLM judgments.
  • Statistical robustness: Provide variance across seeds, confidence intervals, and significance testing for reported improvements; quantify run-to-run stability of GRPO vs. DPO/SFT baselines.
  • Safety interactions: Truthfulness reductions do not guarantee safety. Analyze interactions with harmful content policies, refusal behavior, and alignment trade-offs (e.g., truthful yet unsafe suggestions).
  • Continuous learning and knowledge updates: Study how TruthRL adapts to evolving knowledge (temporal drift), data freshness, and continual training without catastrophic forgetting or truthfulness degradation.
  • Integration with tool use: Explore coupling TruthRL with tools (web search, calculators, databases) and tool reliability checks, rewarding the model’s choice to use tools or abstain when tools are unreliable.
  • Cross-lingual and cultural contexts: Truthfulness and abstention detection may be language- and culture-dependent. Evaluate cross-lingual generalization and design language-agnostic abstention/verification mechanisms.
  • Metric alignment with stakes: Truthfulness is treated uniformly across tasks. Develop risk-sensitive metrics and rewards that weight hallucinations more in high-stakes contexts (e.g., medicine), guiding abstention thresholds accordingly.
  • Dataset biases and coverage: CRAG/NQ/HotpotQA/MuSiQue emphasize specific knowledge types. Audit dataset biases (topics, answer styles), and evaluate on domain-specific benchmarks with differing uncertainty profiles.
  • Policy interpretability: Provide mechanistic or behavioral analyses of how TruthRL changes token-level distributions, uncertainty expressions, and internal representations to better understand learned abstention behavior.
  • Deployment guidance: Offer best-practice recipes (e.g., judge selection, reward scaling, KL coefficients, group size) and failure modes to facilitate reproducible, reliable deployment in production settings.

Practical Applications

Immediate Applications

Below is a concise set of actionable, sector-linked use cases that can be deployed now using the paper’s TruthRL framework (GRPO with a ternary reward) and its demonstrated robustness across LLM backbones and retrieval/non-retrieval setups.

  • Enterprise knowledge assistants (Software, IT, HR, Customer Support)
    • What to deploy: Fine-tune existing RAG chatbots with TruthRL to reduce hallucinations and enable principled abstention (“I don’t know”) when content is outside model knowledge.
    • Workflow: Add an LLM-based verifier to the RL loop; enable abstention-aware routing to a human or follow-up retrieval; track Acc/Unc/Hall in a truthfulness KPI dashboard.
    • Dependencies/assumptions: Reliable LLM judge; sufficient compute for online RL; quality retrieval corpora; product acceptance of abstentions.
  • Clinical documentation and triage assistants (Healthcare)
    • What to deploy: TruthRL-tuned assistants that consult vetted medical guidelines via RAG and abstain when uncertain, escalating cases to clinicians.
    • Workflow: Ternary reward incentivizes correct advice and discourages risky guesses; confidence thresholds trigger human review.
    • Dependencies/assumptions: Curated medical sources; regulatory and liability alignment; robust verifier; clear escalation policies.
  • Legal research and contract review (Law)
    • What to deploy: TruthRL-trained legal QA that cites authoritative sources and abstains on ambiguities or hallucination-baiting prompts.
    • Workflow: Integrate case-law/contract databases; add “safe-citation” checks; abstention triggers clarification or a human lawyer review.
    • Dependencies/assumptions: High-quality legal corpora; strict source attribution; acceptance of non-answers in workflows.
  • Financial research and compliance QA (Finance)
    • What to deploy: Assistants that query filings, regulations, or policies, returning accurate responses or abstaining for uncertain cases.
    • Workflow: Truthfulness gate (LLM judge) before user delivery; route uncertain items to compliance analysts; maintain audit logs.
    • Dependencies/assumptions: Up-to-date, curated corpora; governance of escalations; compute for online RL updates.
  • Uncertainty-aware tutoring systems (Education)
    • What to deploy: Tutors trained with TruthRL that provide correct explanations when confident and abstain otherwise, guiding learners to references or asking clarifying questions.
    • Workflow: Confidence-calibrated outputs; abstention prompts retrieval of examples or follow-up questions.
    • Dependencies/assumptions: Acceptance of “I don’t know” in learning UX; domain-specific content for retrieval; verifier reliability.
  • Moderation triage and review routing (Software/Safety)
    • What to deploy: Moderation assistants that abstain on borderline classifications and escalate to human moderators.
    • Workflow: Ternary reward discourages confident misclassifications; abstention-aware routing reduces false positives/negatives.
    • Dependencies/assumptions: Defined thresholds; human review availability; measured latency overhead.
  • Customer support decision support (Software/Customer Support)
    • What to deploy: TruthRL-trained agents that provide answers when confident and escalate tickets when uncertain.
    • Workflow: Confidence gating; abstention triggers more retrieval or handoff; monitor truthfulness over time.
    • Dependencies/assumptions: Ticket semantics are captured well; acceptable increase in abstentions; reliable verifier.
  • Truthfulness monitoring and model evaluation (Industry/Academia)
    • What to deploy: A Truthfulness KPI dashboard tracking Accuracy, Uncertainty, and Hallucination; periodic audits with multiple LLM judges.
    • Workflow: Continuous evaluation under real prompts; majority@k analysis; judge-robustness checks.
    • Dependencies/assumptions: Stable evaluator(s); evaluation budget; consensus on weighting (w1, w2, w3) for truthfulness score.
  • Developer toolkit for GRPO with ternary reward (Software/ML Ops)
    • What to deploy: A fine-tuning library that wraps GRPO + LLM-based verifier + abstention-aware reward schema for open-source LLMs (e.g., Llama, Qwen).
    • Workflow: Templates for non-retrieval and RAG; hooks for confidence calibration; CI/CD integration.
    • Dependencies/assumptions: GPUs; verified datasets; reproducible verifier prompts.
  • Confidence-calibrated APIs and UX components (Software/Product)
    • What to deploy: APIs that return an answer, confidence, and an abstention flag; UX patterns that normalize “I don’t know.”
    • Workflow: Confidence thresholds; user-facing rationales; optional follow-up retrieval.
    • Dependencies/assumptions: Product acceptance of abstentions; calibration validated in-domain.
  • Search and assistance for ambiguous queries (Software/Consumer)
    • What to deploy: Assistants that abstain on ambiguous or insufficiently grounded queries and proactively ask for clarification.
    • Workflow: Detect ambiguity; trigger clarifying questions or targeted retrieval; measure reduction in hallucination-bait failures.
    • Dependencies/assumptions: Conversation design; retrieval responsiveness; user tolerance for extra steps.

Long-Term Applications

Below are forward-looking uses that require further research, scale-up, or ecosystem development, informed by the paper’s findings on reward design, verifiers, online RL, and reasoning signals.

  • Sector-wide standards for truthfulness metrics (Policy/Standards)
    • What to build: Procurement and compliance guidelines mandating truthfulness KPIs (Acc/Unc/Hall) and judge-robust evaluation protocols.
    • Dependencies/assumptions: Cross-stakeholder consensus; standardized benchmarks; multi-judge evaluation services.
  • Verifier-as-a-Service (Software/Platform)
    • What to build: Robust, ensemble LLM-based verifiers with semantic equivalence, partial credit, and adversarial robustness.
    • Dependencies/assumptions: Ongoing verifier R&D; cost-effective inference; mitigation against “judge hacking.”
  • Continuous online RL with privacy and safety (Software/ML Ops)
    • What to build: Production pipelines for online GRPO fine-tuning using safe user feedback streams, with strong privacy guarantees.
    • Dependencies/assumptions: Feedback governance; drift detection; secure data handling; cost controls.
  • Truthfulness-aware robotics and autonomous systems (Robotics/Safety)
    • What to build: Language-driven planners that abstain or seek human input when uncertain about instructions or environment, reducing risky actions.
    • Dependencies/assumptions: Grounded perception; integration with control stacks; adaptation beyond QA to planning and execution.
  • Multi-objective RL for truthfulness and reasoning (Software/Research)
    • What to build: Stable reward designs that jointly optimize outcome correctness, abstention behavior, and reasoning quality.
    • Dependencies/assumptions: Better reasoning evaluators; principled reward shaping; training stability at scale.
  • Joint retriever–generator optimization for truthfulness (Software/IR)
    • What to build: RL systems that co-train retrieval and generation to maximize truthfulness under noisy corpora, including abstention-aware retrieval fallback.
    • Dependencies/assumptions: Access to large, evolving corpora; retriever metrics aligned with truthfulness; pipeline complexity.
  • Hallucination-bait detection and defense (Software/Safety)
    • What to build: Meta-classifiers and policies that identify prompts prone to inducing overconfident errors and raise abstention thresholds or demand citations.
    • Dependencies/assumptions: Robust bait detectors; minimal UX friction; domain tuning.
  • Cross-model truthfulness alignment (Software/Platform)
    • What to build: Methods that transfer truthfulness behavior across LLM families and sizes, ensuring consistent performance under different judges.
    • Dependencies/assumptions: Interoperability; cross-model evaluation suites; licensing constraints.
  • Sector-specific certification and audit trails (Policy/Compliance)
    • What to build: Auditable logs of abstentions, escalations, and citations; certification programs for “truthful assistants” in regulated industries.
    • Dependencies/assumptions: Regulatory buy-in; standardized audit schemas; privacy compliance.
  • Consumer-grade assistants with truthfulness gating (Daily Life/Consumer Tech)
    • What to build: Personal assistants that clearly signal uncertainty, abstain without bluffing, and propose safe next steps (retrieve, ask human, defer).
    • Dependencies/assumptions: UX patterns that maintain user trust; localized content; device resource constraints.
  • Emergency and disaster information assistants (Public Safety)
    • What to build: Systems that avoid hallucinated guidance, abstain when not grounded in official sources, and escalate to reliable channels.
    • Dependencies/assumptions: Real-time access to trusted feeds; stringent verification; public-sector coordination.

Cross-cutting assumptions and dependencies

  • High-quality LLM-based verifiers materially improve training signals; rule-based verifiers can misjudge semantic correctness and cause over-abstention.
  • Compute and engineering resources are needed for online RL (GRPO), multi-judge evaluations, and continuous monitoring.
  • Retrieval quality and corpus curation strongly influence truthfulness; abstention should trigger better retrieval or escalation.
  • Product and policy acceptance of abstentions is critical; workflows must normalize “I don’t know” and define clear escalation paths.
  • Ternary reward design is generally superior to binary for truthfulness; adding reasoning rewards requires careful balancing to avoid outcome degradation.

Glossary

  • Ablation study: A controlled analysis where components of a method are systematically removed or varied to assess their impact. "In-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty."
  • Abstention: The model choosing not to answer when uncertain (e.g., responding “I don’t know”). "and treats abstentions neutrally, thereby leading to greater truthfulness."
  • Advantage (estimated advantage): In policy optimization, a measure of how much better a specific action (response) is compared to the average in a group or baseline. "$\hat{A}_{i}$ is the estimated advantage for response $y_{i}$, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:"
  • Binary reward: A reward scheme assigning positive reward for correct outputs and negative for incorrect ones, without distinguishing abstention. "Binary is a variant of our method that uses a binary reward, achieving the highest accuracy but also exhibiting a high hallucination rate"
  • Calibrated decoding: A decoding strategy aimed at aligning model confidence with accuracy. "calibrated decoding~\citep{kadavath2022self}"
  • Contrastive decoding: A decoding method that contrasts model outputs to reduce errors such as hallucinations. "contrastive decoding~\citep{chuangdola}"
  • CRAG: A knowledge-intensive benchmark used for evaluating retrieval and truthfulness in LLMs. "Scaling curve of prompting and vanilla SFT/RL methods on the CRAG benchmark~\citep{yang2024crag}"
  • Direct Preference Optimization (DPO): An offline RL-style preference optimization method for aligning models using preference data. "Comparison between Offline RL (DPO), Semi-Online RL (Iterative DPO), and Online RL (TruthRL)"
  • Group Relative Policy Optimization (GRPO): An online RL method that optimizes a policy using group-wise relative advantages and KL regularization. "We implement TruthRL using GRPO~\citep{shao2024deepseekmath}, an online RL method that optimizes the following objective:"
  • Hallucination: When a model generates fluent but factually incorrect content. "compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9\%"
  • Hallucination-baiting questions: Inputs designed or prone to induce model hallucinations, such as tricky multiple-choice comparisons. "TruthRL is robust to hallucination-baiting questions"
  • Importance ratio: The ratio used in off-policy/on-policy corrections, weighting actions by current vs. reference policy likelihood. "$w_{i,t}(\theta)$ denotes the importance ratio,"
  • Iterative DPO: A semi-online extension of DPO where preference data and fine-tuning are iteratively refreshed. "Semi-Online RL (Iterative DPO)"
  • KL divergence: A measure of divergence between two probability distributions used as a regularizer in RL fine-tuning. "$-\beta\, \mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{\text{ref}}\right)$"
  • Knowledge boundary: The implicit limit of what a model knows; recognizing it helps avoid hallucinations. "enhancing the capability of LLMs to recognize their knowledge boundary"
  • Knowledge boundary probing: A method to detect questions outside a model’s knowledge by sampling many outputs to see if any are correct. "using a knowledge boundary probing mechanism"
  • Knowledge-enhanced reward: A reward scheme that conditions rewards on whether a question is out-of-knowledge, positively reinforcing appropriate abstention. "a knowledge-enhanced variant treats abstention as positive when the model genuinely lacks knowledge."
  • LLM-based verifier: Using an LLM as a judge to evaluate correctness and provide reward signals. "LLM-based verifier provides more reliable training signals than rule-based verifier."
  • Majority@k: An evaluation/aggregation approach where the most frequent answer among k sampled responses is taken as the final prediction. "strong potential in majority@k scaling"
  • MuSiQue: A multi-hop reasoning dataset designed to assess compositional question answering. "MuSiQue~\citep{trivedi2022musique}"
  • NaturalQuestions (NQ): A large-scale question answering dataset sourced from real user queries. "NaturalQuestions (NQ)~\citep{kwiatkowski2019natural}"
  • Offline RL: RL that learns from fixed datasets without online interaction during training. "Offline RL (DPO)"
  • Online RL: RL that updates the policy through on-policy interactions during training. "an online RL method"
  • Out-of-knowledge (OOK) questions: Queries that lie beyond the model’s accessible knowledge at training/inference time. "out-of-knowledge (OOK) questions"
  • Parametric knowledge: Information stored within the model parameters rather than retrieved externally. "outside their parametric knowledge."
  • Reasoning-enhanced variant: A reward design that includes signals assessing the quality of the reasoning process, not just the final answer. "a reasoning-enhanced variant builds on the above outcome-based reward by incorporating additional reward signals that evaluate the quality of the model’s reasoning process."
  • Reasoning reward: A reward component that explicitly evaluates the reasoning steps or chain-of-thought quality. "incorporating the reasoning reward $r_{\text{reason}}$"
  • Rejection sampling fine-tuning (RFT): A fine-tuning strategy selecting model-generated traces (including abstentions for OOK) based on quality or correctness. "rejection sampling fine-tuning (RFT)"
  • Retrieval-augmented generation (RAG): A technique where external documents are retrieved and used to condition the model’s generation. "retrieval-augmented generation (RAG)"
  • Rule-based verifier: A deterministic evaluator (e.g., string matching) used to judge correctness, often brittle compared to LLM judges. "rule-based verifier"
  • Scaling curve: A plot showing performance changes as a function of scale factors (e.g., number of samples k). "Scaling curve of prompting and vanilla SFT/RL methods"
  • Semi-Online RL: A hybrid training setup that alternates between data collection and offline preference optimization. "Semi-Online RL (Iterative DPO)"
  • Self-consistency sampling: A decoding strategy that samples multiple reasoning paths and aggregates to improve reliability. "self-consistency sampling~\citep{wang2022selfconsistency}"
  • Supervised fine-tuning (SFT): Training a model to maximize likelihood of ground-truth outputs given inputs. "Supervised fine-tuning (SFT)."
  • Ternary reward: A reward scheme distinguishing correct (+1), uncertain/abstain (0), and incorrect (−1) outputs. "a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions."
  • Truthfulness score: A composite metric combining accuracy, abstention/uncertainty, and hallucination rates with weights. "we define the truthfulness score as a weighted combination:"
  • Uncertainty rate: The fraction of cases where the model abstains (e.g., responds “I don’t know”). "uncertainty rate $({\rm Unc})$"
  • Vanilla RL: Standard accuracy- or reward-only optimized RL without explicit uncertainty/abstention handling. "compared to vanilla RL, TruthRL consistently reduces hallucination"
  • Vanilla SFT: Conventional SFT focused on maximizing accuracy, often suppressing abstention and increasing hallucinations. "In vanilla SFT/RL, the model is optimized solely for accuracy,"