
MAD-Fact: Multi-Agent Factuality Check

Updated 26 February 2026
  • MAD-Fact is a multi-agent, debate-based framework that evaluates long-form LLM outputs by decomposing responses into atomic claims.
  • It integrates structured claim extraction, diverse evaluator debates, and a weighted fact importance hierarchy to enhance factual reliability.
  • Benchmarking on LongHalluQA demonstrates that MAD-Fact improves precision and recall over traditional fact-checking methods in complex texts.

MAD-Fact refers to a debate-based, multi-agent factuality evaluation framework designed for long-form outputs from LLMs. It integrates structured claim decomposition, heterogeneous agent deliberation, and a fact importance hierarchy, supported by a new benchmark, LongHalluQA. MAD-Fact advances the assessment of factual reliability in complex, multi-claim generative texts, targeting safe LLM deployment in high-risk or regulated domains (Ning et al., 27 Oct 2025).

1. Motivation and Problem Setting

LLMs routinely generate long-form answers whose factual accuracy is crucial in domains such as medicine, law, and education. Existing evaluation tools for short texts (e.g., FactScore, FacTool, Q2) struggle with reasoning chains, cumulative errors, and the non-uniform importance of individual claims found in long responses. Moreover, subjective, single-model annotation may miss subtle inconsistencies and be vulnerable to systematic bias.

MAD-Fact addresses these limitations by:

  • Decomposing long-form answers into atomic claims.
  • Engaging multiple agents with diverse expertise and retrieval strategies in a debate over each claim.
  • Aggregating consensus verdicts using a weighted scheme that emphasizes higher-importance facts (Ning et al., 27 Oct 2025).
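
The three steps above can be wired together as a minimal pipeline sketch. This is an illustrative skeleton, not the paper's implementation: `clerk` and `debate` are hypothetical stand-ins (a real system would prompt LLMs and perform retrieval), and only the control flow mirrors MAD-Fact.

```python
# Illustrative sketch of the MAD-Fact control flow (all function bodies are
# placeholders; the real system backs each stage with LLM agents).

def clerk(question: str, answer: str) -> list[str]:
    # Stand-in for LLM-based claim extraction: one claim per sentence.
    return [s.strip() for s in answer.split(".") if s.strip()]

def debate(claim: str, roles: list[str]) -> list[bool]:
    # Stand-in for the multi-agent debate: one final verdict per role.
    return [True for _ in roles]

def judge(votes: list[bool]) -> bool:
    # Majority rule; a tie defers to the last speaker.
    if 2 * sum(votes) == len(votes):
        return votes[-1]
    return 2 * sum(votes) > len(votes)

def mad_fact(question: str, answer: str, roles: list[str]) -> dict[str, bool]:
    return {c: judge(debate(c, roles)) for c in clerk(question, answer)}

roles = ["General Public", "Critic", "Scientist",
         "Data Analyst", "Psychologist", "News Author"]
verdicts = mad_fact("Who wrote Hamlet?",
                    "Hamlet was written by Shakespeare. It premiered around 1600.",
                    roles)
```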

2. System Architecture and Protocol

MAD-Fact implements a three-stage architecture:

(i) Clerk (Claim Extraction)

Receives a pair $(q_i, a_i)$, where $q_i$ is a question and $a_i$ is a long-form LLM-generated answer, and outputs a set of $T$ atomic factual claims $\{c_{i,1}, \dots, c_{i,T}\}$. Formally,

$$\{c_{i,j}\}_{j=1}^{T} = \mathrm{Clerk}(\mathrm{conf}_{\mathrm{Clerk}},\, q_i,\, a_i)$$

(ii) Jury of Evaluators (Multi-Agent Debate)

A group of $N$ role-specialized evaluator agents $\mathrm{Evaluator}^n$, each simulating a different perspective (General Public, Critic, Scientist, Data Analyst, Psychologist, News Author), debates each extracted claim $c_{i,j}$. Debate proceeds for up to two rounds per claim, with each agent studying previous statements and optionally retrieving fresh evidence. For each round and claim, the acting agent returns:

  • A direct judgement $p_{i,j,t} \in \{\mathrm{TRUE}, \mathrm{FALSE}\}$
  • An explanation $e_{i,j,t}$
  • Optionally, a retrieval query together with the externally sourced knowledge $k_{i,j,t}$

At the end, the full message pool $M_{i,j,t}$ (round-by-round debate log) and knowledge pool $K_{i,j,t}$ (retrieved evidence) form the factuality context for each claim.
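
The debate loop for a single claim can be sketched as follows. This is a hedged illustration: `ask_agent` is a hypothetical placeholder for prompting one evaluator, and the pools correspond to $M_{i,j,t}$ and $K_{i,j,t}$ above.

```python
# Sketch of the two-round debate over one claim (stage ii). `ask_agent` is a
# hypothetical stand-in; a real evaluator would call an LLM and may issue a
# retrieval query for external evidence.

def ask_agent(role, claim, message_pool, knowledge_pool):
    # Placeholder verdict/explanation/evidence structure.
    return {"verdict": True,
            "explanation": f"{role}: no conflict found",
            "evidence": None}

def debate_claim(claim, roles, rounds=2):
    message_pool, knowledge_pool = [], []   # M_{i,j,t} and K_{i,j,t}
    votes = {}
    for t in range(rounds):
        for role in roles:
            out = ask_agent(role, claim, message_pool, knowledge_pool)
            message_pool.append((t, role, out["verdict"], out["explanation"]))
            if out["evidence"] is not None:
                knowledge_pool.append(out["evidence"])
            votes[role] = out["verdict"]    # keep each agent's latest vote
    return votes, message_pool, knowledge_pool
```

Each agent sees the accumulated message pool from earlier turns, which is what allows later speakers to revise or rebut earlier verdicts.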

(iii) Judge (Aggregation and Verdict)

A judge agent aggregates the $N$ final votes for each claim, $p^n_{i,j}$, via majority rule ($\mathrm{Judge}(\cdot)$), breaking ties by giving priority to the last speaker. This produces binary verdicts $\{p_{i,j}\}_{j=1}^{T}$ for all claims. An overall factuality score is computed over all claims, weighted by their hierarchical importance (see below).
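
The aggregation rule itself is small enough to state exactly; the sketch below implements majority voting with the last-speaker tie-break described above (function name is illustrative).

```python
def judge(votes: list[bool]) -> bool:
    """Majority rule over the N final votes; a tie defers to the last speaker."""
    true_count = sum(votes)
    if 2 * true_count == len(votes):    # exact tie
        return votes[-1]                # last speaker's verdict wins
    return 2 * true_count > len(votes)
```

With an even number of evaluators the tie-break matters: `[True, False]` yields FALSE while `[False, True]` yields TRUE, since the final speaker has seen the entire debate log.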

3. Fact Importance Hierarchy and Weighted Metrics

Recognizing that some claims in a long answer are more critical than others, MAD-Fact introduces a fact importance hierarchy:

  • Reference answers for a given question are generated by $G$ expert models (e.g., GPT-4, Claude-3, DeepSeek).
  • Each reference is decomposed into atomic claims.
  • The cross-expert merger aligns semantically equivalent claims, forming a "golden set" $\{g_{i,k}\}_{k=1}^{K_{\mathrm{gold}}}$.
  • Claim frequency across experts, $f(g_{i,k})$, defines its importance; the fewer experts mentioning a fact, the less critical its content.
  • Each claim receives a level $\ell(g_{i,k}) = G - f(g_{i,k}) + 1$; pre-set monotonic weights $\omega_1 > \omega_2 > \dots > \omega_G$ are assigned, e.g., $\omega_\ell = 5 - \ell$.

The answer-level precision, recall, and F1 are then reweighted accordingly:

$$\mathrm{Prec}_w(a_i) = \frac{\sum_{j:\, p_{i,j}=\mathrm{TRUE}} \omega_{i,j}}{\sum_{j=1}^{T} \omega_{i,j}}, \qquad R_w@\gamma(a_i) = \min\Bigl(1,\ \frac{1}{\gamma}\,\frac{\sum_{j:\, p_{i,j}=\mathrm{TRUE}} \omega_{i,j}}{\sum_{k=1}^{K_{\mathrm{gold}}} \omega_{i,k}}\Bigr)$$

with $F_1@\gamma$ given as the harmonic mean of the two, and a recall "slack" parameter $\gamma \le 1$ to tolerate length/coverage mismatches (Ning et al., 27 Oct 2025).
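
A minimal sketch of these weighted metrics, assuming the per-claim weights $\omega$ have already been assigned via the importance hierarchy (function and variable names are illustrative):

```python
def weighted_scores(verdicts, claim_weights, gold_weights, gamma=1.0):
    # Prec_w: weight mass of TRUE-judged claims over total extracted-claim weight.
    true_mass = sum(w for v, w in zip(verdicts, claim_weights) if v)
    prec = true_mass / sum(claim_weights)
    # R_w@gamma: same TRUE mass over golden-set weight, loosened by 1/gamma,
    # capped at 1 so slack cannot push recall past its maximum.
    recall = min(1.0, (1.0 / gamma) * true_mass / sum(gold_weights))
    f1 = 0.0 if prec + recall == 0 else 2 * prec * recall / (prec + recall)
    return prec, recall, f1

# Example: three extracted claims with weights 4, 3, 1; claims 1 and 3 judged
# TRUE; golden set of total weight 10; slack gamma = 0.8.
p, r, f = weighted_scores([True, False, True], [4, 3, 1], [4, 3, 2, 1], gamma=0.8)
```

Here a heavily weighted false claim drags precision down far more than a minor one, which is exactly the behavior the importance hierarchy is meant to induce.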

4. The LongHalluQA Dataset

LongHalluQA is a large-scale Chinese long-form factuality benchmark, constructed to evaluate multi-agent debate frameworks:

  • Contains 2,746 samples across seven domains (Chinese culture, natural sciences, social sciences, ...).
  • Average answer length is over nine times that of the original HalluQA.
  • Each sample includes reference knowledge, sub-question expansion, and manual verification of factual consistency and clarity.

This dataset enables robust quantitative and qualitative evaluation of factual verification protocols under realistic, high-information-load conditions.

5. Experimental Results

MAD-Fact was evaluated on both standard fact-checking tasks and long-form outputs. The core pipeline employs GPT-4o-mini for the Clerk and Evaluators, with Llama3.3-70B-instr as Judge.

Fact-Checking Benchmark Performance

Across tasks such as Factcheck-Bench, FacToolQA, BingCheck, FELM-WK, and FactEval-CN:

Method     TRUE F1 (range)   FALSE F1 (range)
SAFE       0.68–0.87         0.51–0.74
FIRE       0.71–0.88         0.53–0.70
MAD-Fact   0.74–0.88         0.64–0.73

MAD-Fact outperforms strong baselines in 8/10 F1 comparisons and yields improvements in both factual error sensitivity and correct claim recall.

Long-Form Scoring

Weighted $F_1@\gamma$ (with $\gamma = 1.0$ and $0.8$) on held-out LongFact and LongHalluQA splits:

  • On LongFact, GPT-4-Turbo achieves $F_1@0.8 = 0.681$, Doubao-1.5-Pro 0.669, and Qwen2.5-72B 0.648.
  • On LongHalluQA, Chinese models (QwQ-32B, Doubao-1.5-Pro) approach or exceed GPT-4 by up to 10 points, likely due to in-domain pretraining advantages.

6. Design Insights and Case Analysis

  • Agent diversity: Multiple conversational roles, retrieval strategies (autonomous, mandatory, adaptive), and external evidence sources contribute to more stable and error-corrective debate outcomes.
  • Importance reweighting: Emphasizing pivotal claims leads to higher-quality, actionable factuality scores, addressing the unequal significance of content.
  • Consensus stabilization: Peer agents collectively correct judgment errors and prevent overconfident misclassification by promoting deliberative discussion.
  • Robustness: The combination of single-model, dual-model, and jury voting enables performance that generalizes well even on unseen, long-form user queries.

7. Limitations and Future Directions

  • Dataset scope: LongHalluQA is built from trusted knowledge bases, not live user logs. Further research should extend to dynamic, domain-specific, or real-world user-generated queries.
  • Scalability: MAD-Fact increases inference cost due to agent multiplicity and external retrieval. Dynamic early termination, adaptive agent allocation, and parameter-efficient agent architectures are key optimization areas.
  • Residual bias and groupthink: Multi-agent debates can still become overconfident in converging on spurious claims. Future work includes confidence-weighted voting, formal cross-agent logical checks, and adversarial stress-testing with planted falsehoods.

MAD-Fact provides a transparent, interpretable, and scalable framework for nuanced long-form factuality assessment, addressing gaps in both claim prioritization and adversarial model collaboration (Ning et al., 27 Oct 2025). Its dataset and architecture serve as baselines for advancing multi-perspective, retrieval-augmented LLM evaluation protocols in critical safety contexts.
