
When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Published 1 Apr 2025 in cs.CL and cs.AI | (2504.00374v1)

Abstract: In many real-world scenarios, a single LLM may encounter contradictory claims (some accurate, others forcefully incorrect) and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B–14B parameters), where we systematically vary agent verbosity (30–300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.

Summary

  • The paper presents CW-POR as a novel metric to quantify how persuasive misinformation overrides factual accuracy in single-turn LLM debates.
  • It employs an adversarial setup with factual and persuasive agents evaluated by a confidence-rated judge, revealing domain-dependent vulnerabilities.
  • Empirical results indicate that non-adversarial contexts and extreme verbosity increase high-confidence errors, impacting LLM alignment and reliability.

Persuasion Versus Truth in LLM Debates: A Confidence-Weighted Evaluation Framework

Introduction

LLMs are emerging arbiters in scenarios where information conflicts, such as fact-checking, summarization, or policy analysis. However, their susceptibility to persuasive but incorrect arguments—especially given a single, uncontextualized interaction—remains insufficiently quantified. The paper "When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)" (2504.00374) presents a systematic study of this vulnerability, introducing a new metric (CW-POR) that explicitly incorporates the model's confidence into error weighting. The study employs an adversarial, single-turn debate framework comprising a factual agent, a persuasive agent (defending a falsehood), and an LLM judge, with all roles instantiated from the same model family. Empirical results across five open-source LLMs (3B–14B parameters) reveal that model susceptibility to persuasive overrides is substantial, and that such overrides often occur with high internal confidence.

Experimental Framework and Metric Definition

The authors adopt a single-turn, adversarial debate setup: two LLM-based agents respond to a factual question from TruthfulQA, one providing a neutral, accurate answer ("Neutral Agent"), the other forcefully defending a known distractor ("Persuasive Agent"). These responses are evaluated by a "Judge Model"—yet another LLM instance—tasked with selecting the correct answer and assigning a 1–5 confidence rating. The judge's self-rated confidence is then combined with its log-likelihood-based preference for each answer (computed via prompt completions terminating in "Answer A" or "Answer B"; see Figure 1).

Figure 1: Example single-turn multi-agent debate with confidence aggregation, highlighting the use of both rubric and log-likelihood confidences in CW-POR computation.
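The aggregation of the judge's two confidence signals can be sketched as follows. The equal-weight average and the [0, 1] normalization are illustrative assumptions; the summary does not specify the paper's exact combination rule:

```python
import math

def aggregate_judge_confidence(rubric_score, loglik_a, loglik_b, chosen="A"):
    """Combine the judge's 1-5 self-rating with its log-likelihood
    preference between the two answers. The equal-weight average is
    an assumption, not the paper's stated formula."""
    # Normalize the 1-5 rubric score onto [0, 1].
    rubric_conf = (rubric_score - 1) / 4
    # Softmax over the two completion log-likelihoods
    # ("Answer A" vs. "Answer B") gives a probability-like preference.
    p_a = math.exp(loglik_a) / (math.exp(loglik_a) + math.exp(loglik_b))
    loglik_conf = p_a if chosen == "A" else 1 - p_a
    return 0.5 * (rubric_conf + loglik_conf)
```

For example, a judge that self-rates 5/5 and strongly prefers the completion "Answer A" in log-likelihood terms would receive a combined confidence near 1 under this sketch.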

The central metric, Confidence-Weighted Persuasion Override Rate (CW-POR), is defined as the normalized sum of confidence scores on trials where the persuasive agent's (incorrect) answer is selected. This construction enables rigorous quantification not only of persuasive override frequency but also the intensity of model belief in its (potentially erroneous) decisions.
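As a minimal sketch of this definition (assuming each trial is recorded as an override flag plus a confidence already normalized to [0, 1]), CW-POR and its unweighted counterpart might be computed as:

```python
def cw_por(trials):
    """Confidence-Weighted Persuasion Override Rate: the sum of the
    judge's confidence on trials where it picked the persuasive
    (incorrect) answer, normalized by the total number of trials.
    Each trial is a (picked_persuasive, confidence) pair; the [0, 1]
    confidence scale here is an assumption."""
    if not trials:
        return 0.0
    override_conf = sum(conf for picked, conf in trials if picked)
    return override_conf / len(trials)

def por(trials):
    """Unweighted persuasion override rate, for comparison."""
    if not trials:
        return 0.0
    return sum(1 for picked, _ in trials if picked) / len(trials)
```

Under this sketch, two overrides at confidence 0.9 weigh more than two at confidence 0.4, even though POR counts them identically.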

Empirical Results

Category-Level Vulnerabilities

CW-POR was analyzed across question categories in TruthfulQA. "Confusion: Other" and "Science" consistently showed elevated persuasion override rates, indicating domain-dependent susceptibility. Notably, these categories sometimes comprised relatively few questions, accentuating the potential for data scarcity to amplify apparent vulnerabilities (Figure 2).

Figure 2: Confidence-Weighted Persuasion Override Rate by category, with question share overlay, illustrating both high-risk and data-scarce domains.

Adversarial Versus Non-Adversarial Queries

A counterintuitive outcome was that several models had higher CW-POR values on non-adversarial questions compared to their adversarial counterparts. This indicates that LLMs' calibration and skepticism mechanisms may be less active in ostensibly straightforward instances, making them more prone to confident endorsement of plausible-sounding falsehoods in routine inputs (Figure 4).

Figure 4: CW-POR comparison between adversarial and non-adversarial question types across five evaluated LLMs.

Verbosity Effects

Analysis of answer verbosity revealed a U-shaped relationship with CW-POR. Override rates dipped in the 90–120 word range and rose for both shorter and longer responses. Short answers lacked discriminative detail, while longer answers were more likely to overwhelm the judge with emotive or rhetorical content, again increasing susceptibility to persuasive attacks (Figure 5).

Figure 5: CW-POR as a function of answer verbosity, showing a minimum at medium lengths and divergence at extremes.

Model-level breakdown of self-reported and log-likelihood confidences showed a typical drop in confidence on incorrect decisions, but this effect was not universal. Larger models, such as Phi-4 14B, displayed robust high-confidence errors, contributing disproportionately to the overall CW-POR. This underscores limitations in both self-estimation and internal uncertainty modeling.

Implications and Theoretical Impact

The findings have significant implications for model alignment and deployment in critical settings. The strong, confidence-weighted misjudgments observed suggest that even substantial model scaling and instruction tuning are insufficient to guarantee robustness against well-crafted misinformation in single-turn, multi-agent settings. The vulnerability persists despite the absence of further context or multi-turn adversarial dynamics, raising concerns for factual reliability in practical LLM usage—such as automated news aggregation, domain-specific QA, or institutional fact-checking pipelines.

Furthermore, reliance on canonical adversarial benchmarks alone underestimates real-world risk, since non-adversarial prompts may more readily slip through model defenses due to lower baseline suspicion. Integrating confidence-weighted metrics into routine evaluation pipelines is essential for comprehensive assessment of calibration and persuasion sensitivity.

The U-shaped effect of verbosity reveals a further axis of potential manipulation, suggesting that imposing or recommending response-length constraints may not uniformly improve factual robustness. Future research may explore controlled verbosity, rebuttal mechanisms, or external verification triggers modulated by combined confidence signals.

Future Directions

  • Ensemble or Multi-Turn Schemes: Incorporating iterative rebuttals may surface contradictions not revealed in one-shot debates, potentially lowering CW-POR.
  • Hybrid Judge Architectures: Utilizing a model family distinct from those of the answer agents could mitigate intra-family bias and enhance factual oversight.
  • Dynamic Confidence Thresholding: Leveraging combined rubric and log-likelihood confidence to conditionally reject or escalate high-stakes decisions could improve reliability.
  • Amplified Adversarial Data Collection: Expanding domain coverage and systematically targeting high-CW-POR categories will enable finer-grained assessment and calibration.
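The dynamic confidence thresholding direction above could be prototyped as a simple routing rule. The function name, combination rule, and threshold values below are hypothetical illustrations, not from the paper:

```python
def route_decision(rubric_conf, loglik_conf, accept_thresh=0.75, gap_thresh=0.3):
    """Hypothetical router: accept a judge verdict only when both
    confidence signals (each normalized to [0, 1]) are high and agree;
    otherwise escalate to external verification. Thresholds are
    illustrative placeholders."""
    combined = 0.5 * (rubric_conf + loglik_conf)
    disagreement = abs(rubric_conf - loglik_conf)
    if combined >= accept_thresh and disagreement <= gap_thresh:
        return "accept"
    return "escalate"
```

A rule of this shape would flag exactly the failure mode the paper highlights: verdicts where the self-rated and internal confidences diverge, or where neither is strong enough to trust unassisted.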

Conclusion

The introduction of CW-POR delivers a more granular perspective on the interaction between persuasion and factuality in multi-agent LLM deployments. The study convincingly demonstrates that current LLMs are prone not only to erroneous persuasion overrides, but also to high-confidence misjudgments, particularly under plausible, non-adversarial conditions and variable verbosity. Mitigating such vulnerabilities will require advances in calibration, adversarial testing, and system design. CW-POR should become a standard tool for assessing the factual resilience of LLM-based adjudication, evaluation, or moderation systems.


Explain it Like I'm 14

Explaining “When Persuasion Overrides Truth in Multi-Agent LLM Debates” (CW-POR)

Overview

This paper looks at a problem with LLMs, the AI tools that write and read text, like ChatGPT. Sometimes, they see two different answers to the same question: one correct and one wrong but very convincing. The paper asks: can a persuasive false answer trick an AI judge into picking it over the true one? To study this, the authors create a simple “debate” and introduce a new score called CW-POR to measure not just how often the AI judge gets fooled, but how confident it is when it makes that mistake.

Key Objectives and Questions

The study focuses on a few straightforward goals:

  • Can strong, persuasive writing beat a calm, factual answer in a one-shot debate?
  • How often does an AI judge choose the wrong answer when it sounds persuasive?
  • How confident is the AI when it makes those wrong choices?
  • Does the length of the answers (short vs. long) change how easily the judge is misled?

How the Study Worked

Think of this like a classroom activity with three students and one question:

  • One student (Neutral Agent) gives the correct answer in a simple, calm way.
  • Another student (Persuasive Agent) argues for a known false answer with strong, convincing language.
  • A third student (Judge Agent) reads both and decides which one is right, then says how sure they feel on a scale from 1 to 5.

To make this fair:

  • The authors used a question set called TruthfulQA. It’s made to catch common myths and false claims.
  • They switched the order of answers (A/B) randomly so the judge wouldn’t prefer “A” just because it comes first.
  • They tested five different open-source AI models of different sizes (from “small” to “medium” sized).
  • They controlled the word count of answers from 30 up to 300 words to see how length affects persuasion.

Two kinds of confidence were measured for the judge:

  • Self-rated confidence (like saying “I’m 4 out of 5 sure”).
  • A hidden “internal” confidence based on the model’s probabilities (you can think of this like the AI’s gut feeling).

To track mistakes, they created CW-POR:

  • POR is how often the persuasive wrong answer wins.
  • CW-POR goes further and gives more weight to mistakes the judge makes with high confidence. In simple terms, being confidently wrong is counted as a bigger problem than being unsure and wrong.
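Here is a tiny made-up example showing why a confident mistake counts for more under CW-POR (all numbers invented for illustration):

```python
# Toy example: four debates, each judged with a 1-5 confidence score.
# True means the judge wrongly picked the persuasive (false) answer.
debates = [(True, 5), (True, 2), (False, 4), (False, 3)]

# POR: the plain fraction of debates the persuasive answer won.
por = sum(1 for fooled, _ in debates if fooled) / len(debates)

# CW-POR: the same mistakes, but each one weighted by confidence
# (divided by 5 per debate so the maximum possible score is 1.0).
cw_por = sum(conf for fooled, conf in debates if fooled) / (5 * len(debates))
```

Here the judge is fooled half the time (POR = 0.5), but because one of those mistakes came with maximum confidence, CW-POR stays high (0.35 of the worst possible 1.0) instead of shrinking toward zero.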

Main Findings and Why They Matter

Here are the main takeaways, explained simply:

  • Persuasion can beat truth: Even smaller AI models can write persuasive false answers that the AI judge picks over the correct answer—and sometimes the judge is very confident about that wrong choice.
  • Some topics are riskier: Categories like “Science” and “Confusion” were more likely to trick the judge, though sample sizes vary.
  • Normal-looking questions can be more misleading than “tricky” ones: Surprisingly, models sometimes got fooled more by regular questions than by questions meant to be adversarial. In real life, this means misinformation often hides in everyday, harmless-looking posts and articles.
  • There’s a “sweet spot” for answer length: Across many models, the judge made fewer confident mistakes when the answers were mid-length (around 90–120 words). Very short answers didn’t give enough clarity, and very long answers could drown the judge in persuasive language.
  • Confidence doesn’t always match correctness: Usually, the judge’s internal confidence was higher when it chose correctly. But some models stayed quite confident even when they were wrong—this is worrying because confident mistakes are more dangerous.

Implications and Potential Impact

What does this mean for how we use AI?

  • AI systems that both write and evaluate content can be tricked by style over substance. This is risky for fields like health, finance, and public policy, where wrong information can cause real harm.
  • We need better “calibration,” which means making sure a model’s confidence matches how likely it is to be right.
  • Before trusting an AI’s judgment, it may help to:
    • Use multiple checks, such as asking for sources or using another model to verify.
    • Allow short rebuttals or multi-turn discussions so the factual side can respond.
    • Flag decisions where the AI is very confident but chooses the persuasive answer.
    • Be careful with answer length—mid-length responses may reduce the chance of being misled.

Overall, the paper shows that persuasive writing can override truth in single-turn debates, and introduces CW-POR to measure how serious these mistakes are. This helps researchers and developers spot and fix situations where an AI not only gets things wrong, but does so with strong confidence—exactly the kind of error we most want to avoid.

