MAD-Fact: Multi-Agent Factuality Check
- MAD-Fact is a multi-agent, debate-based framework that evaluates long-form LLM outputs by decomposing responses into atomic claims.
- It integrates structured claim extraction, diverse evaluator debates, and a weighted fact importance hierarchy to enhance factual reliability.
- Benchmarking on LongHalluQA demonstrates that MAD-Fact improves precision and recall over traditional fact-checking methods in complex texts.
MAD-Fact refers to a debate-based, multi-agent factuality evaluation framework designed for long-form outputs from LLMs. It integrates structured claim decomposition, heterogeneous agent deliberation, and a fact importance hierarchy, supported by a new benchmark, LongHalluQA. MAD-Fact advances the assessment of factual reliability in complex, multi-claim generative texts, targeting safe LLM deployment in high-risk or regulated domains (Ning et al., 27 Oct 2025).
1. Motivation and Problem Setting
LLMs routinely generate long-form answers whose factual accuracy is crucial in domains such as medicine, law, and education. Existing evaluation tools for short texts (e.g., FactScore, FacTool, Q2) struggle with reasoning chains, cumulative errors, and the non-uniform importance of individual claims found in long responses. Moreover, subjective, single-model annotation may miss subtle inconsistencies and be vulnerable to systematic bias.
MAD-Fact addresses these limitations by:
- Decomposing long-form answers into atomic claims.
- Engaging multiple agents with diverse expertise and retrieval strategies in a debate over each claim.
- Aggregating consensus verdicts using a weighted scheme that emphasizes higher-importance facts (Ning et al., 27 Oct 2025).
2. System Architecture and Protocol
MAD-Fact implements a three-stage architecture:
(i) Clerk (Claim Extraction)
Receives a pair $(q, a)$, where $q$ is a question and $a$ is a long-form LLM-generated answer, and outputs a set of atomic factual claims $C = \{c_1, \dots, c_n\}$. Formally, $\mathrm{Clerk}(q, a) \to C$.
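The paper does not ship reference code, so the Clerk stage is sketched below under stated assumptions: `call_llm` is a hypothetical stand-in for the Clerk model (GPT-4o-mini in the experiments) that naively treats each sentence as one atomic claim; a real deployment would prompt an actual LLM.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source_span: str  # portion of the answer the claim was extracted from


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the Clerk's LLM. For illustration only:
    it splits the answer into sentences, one per line."""
    answer = prompt.split("ANSWER:", 1)[1]
    return "\n".join(s.strip() for s in answer.split(".") if s.strip())


def extract_claims(question: str, answer: str) -> list[Claim]:
    """Clerk stage: decompose a long-form answer into atomic factual claims."""
    prompt = (
        "Decompose the answer into atomic, self-contained factual claims.\n"
        f"QUESTION: {question}\nANSWER: {answer}"
    )
    return [Claim(text=line, source_span=line)
            for line in call_llm(prompt).splitlines()]
```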
(ii) Jury of Evaluators (Multi-Agent Debate)
A group of six role-specialized evaluator agents $\{A_1, \dots, A_6\}$, each simulating a different perspective (General Public, Critic, Scientist, Data Analyst, Psychologist, News Author), debates each extracted claim $c_i$. Debate proceeds for up to two rounds per claim, with each agent studying previous statements and optionally retrieving fresh evidence. For each round and claim, the acting agent returns:
- A direct judgement $j_k \in \{\text{true}, \text{false}\}$ on the claim
- An explanation supporting that judgement
- Optionally, a retrieval query together with the externally sourced evidence it returns
At the end, the full message pool (round-by-round debate log) and knowledge pool (retrieved evidences) form the factuality context for each claim.
(iii) Judge (Aggregation and Verdict)
A judge agent aggregates the final votes for each claim $c_i$ via majority rule, breaking ties in favor of the last speaker. This produces a binary verdict for every claim. An overall factuality score is then computed over all claims, weighted by their hierarchical importance (see below).
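The Judge's aggregation rule can be sketched directly. Votes are assumed to arrive in speaking order, so the tie-break falls to the last speaker; the role names in the usage example are illustrative.

```python
def judge_verdict(votes: list[tuple[str, bool]]) -> bool:
    """Aggregate per-claim agent votes (in speaking order) by majority rule.

    Each vote is a (role, judgement) pair; True marks the claim as factual.
    Ties go to the judgement of the last speaker, per the protocol above.
    """
    yes = sum(1 for _, v in votes if v)
    no = len(votes) - yes
    if yes != no:
        return yes > no
    return votes[-1][1]  # tie-break: last speaker's judgement
```

For example, with two agents split evenly, `judge_verdict([("Critic", True), ("Scientist", False)])` returns the second (last) agent's judgement.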
3. Fact Importance Hierarchy and Weighted Metrics
Recognizing that some claims in a long answer are more critical than others, MAD-Fact introduces a fact importance hierarchy:
- Reference answers for a given question are generated by expert models (e.g., GPT-4, Claude-3, DeepSeek).
- Each reference is decomposed into atomic claims.
- A cross-expert merging step aligns semantically equivalent claims, forming a "golden set" of reference claims.
- A claim's frequency across expert references defines its importance: the more experts that mention a fact, the more critical it is considered.
- Each claim is assigned an importance level, and pre-set weights that increase monotonically with level are attached.
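A minimal sketch of golden-set construction, assuming normalized exact string match as a stand-in for the paper's semantic alignment of equivalent claims (which would use an LLM or embedding similarity in practice):

```python
from collections import Counter


def build_golden_set(expert_claims: dict[str, list[str]],
                     canon=lambda s: s.lower().strip()) -> dict[str, int]:
    """Merge atomic claims across expert models and count how many experts
    mention each one.  `canon` is an illustrative stand-in for semantic
    alignment.  Returns claim -> importance level (expert mention count)."""
    counts: Counter = Counter()
    for claims in expert_claims.values():
        # Dedupe within one expert so each expert contributes at most one vote.
        for claim in {canon(c) for c in claims}:
            counts[claim] += 1
    return dict(counts)
```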
The answer-level precision, recall, and F1 are then reweighted by claim importance, with F1 given as the harmonic mean of weighted precision and recall, and a recall "slack" parameter tolerating length/coverage mismatches (Ning et al., 27 Oct 2025).
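Since the exact reweighted formulas were lost in extraction, the following is a plausible sketch consistent with the description above: importance-weighted precision over the answer's claims, weighted recall over the golden set with a slack discount, combined by harmonic mean. The `slack` parameter's exact form is an assumption.

```python
def weighted_f1(claim_weights: list[float], claim_verdicts: list[bool],
                golden_weights: list[float], golden_covered: list[bool],
                slack: float = 0.0) -> float:
    """Importance-weighted factuality score (sketch, not the paper's exact
    formula).  Precision: weighted share of the answer's claims judged true.
    Recall: weighted share of the golden set the answer covers, with `slack`
    in [0, 1) discounting the denominator to tolerate coverage mismatches.
    F1 is the harmonic mean of the two."""
    total = sum(claim_weights)
    precision = sum(w for w, ok in zip(claim_weights, claim_verdicts) if ok) / total
    g_total = sum(golden_weights) * (1.0 - slack)
    covered = sum(w for w, ok in zip(golden_weights, golden_covered) if ok)
    recall = min(1.0, covered / g_total) if g_total > 0 else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```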
4. The LongHalluQA Dataset
LongHalluQA is a large-scale Chinese long-form factuality benchmark, constructed to evaluate multi-agent debate frameworks:
- Contains 2,746 samples from seven domains (Chinese culture, natural sciences, social sciences, among others).
- Average answer length is over nine times that of the original HalluQA.
- Each sample includes reference knowledge, sub-question expansion, and manual verification of factual consistency and clarity.
This dataset enables robust quantitative and qualitative evaluation of factual verification protocols under realistic, high-information-load conditions.
5. Experimental Results
MAD-Fact was evaluated on both standard fact-checking tasks and long-form outputs. The core pipeline employs GPT-4o-mini for the Clerk and Evaluators, with Llama3.3-70B-instr as Judge.
Fact-Checking Benchmark Performance
Across tasks such as Factcheck-Bench, FacToolQA, BingCheck, FELM-WK, and FactEval-CN:
| Method | TRUE F1 (range) | FALSE F1 (range) |
|---|---|---|
| SAFE | 0.68–0.87 | 0.51–0.74 |
| FIRE | 0.71–0.88 | 0.53–0.70 |
| MAD-Fact | 0.74–0.88 | 0.64–0.73 |
MAD-Fact outperforms strong baselines in 8 of 10 F1 comparisons, improving both sensitivity to factual errors and recall of correct claims.
Long-Form Scoring
Weighted F1 (with the recall slack parameter applied) on held-out LongFact and LongHalluQA splits:
- On LongFact, GPT-4-Turbo leads, followed by Doubao-1.5-Pro (0.669) and Qwen2.5-72B (0.648).
- On LongHalluQA, Chinese models (QwQ-32B, Doubao-1.5-Pro) approach or exceed GPT-4 by up to 10 points, likely due to in-domain pretraining advantages.
6. Design Insights and Case Analysis
- Agent diversity: Multiple conversational roles, retrieval strategies (autonomous, mandatory, adaptive), and external evidence sources contribute to more stable and error-corrective debate outcomes.
- Importance reweighting: Emphasizing pivotal claims leads to higher-quality, actionable factuality scores, addressing the unequal significance of content.
- Consensus stabilization: Peer agents collectively correct judgment errors and prevent overconfident misclassification by promoting deliberative discussion.
- Robustness: The combination of single-model, dual-model, and jury voting enables performance that generalizes well even on unseen, long-form user queries.
7. Limitations and Future Directions
- Dataset scope: LongHalluQA is built from trusted knowledge bases, not live user logs. Further research should extend to dynamic, domain-specific, or real-world user-generated queries.
- Scalability: MAD-Fact increases inference cost due to agent multiplicity and external retrieval. Dynamic early termination, adaptive agent allocation, and parameter-efficient agent architectures are key optimization areas.
- Residual bias and groupthink: Multi-agent debates can still become overconfident in converging on spurious claims. Future work includes confidence-weighted voting, formal cross-agent logical checks, and adversarial stress-testing with planted falsehoods.
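As an illustration of the dynamic early termination named under scalability, one simple rule (an assumption, not the paper's design) is to skip the second debate round whenever the first round is already unanimous:

```python
def debate_rounds(round_one_votes: list[bool], max_rounds: int = 2) -> int:
    """Illustrative early-termination rule for cutting inference cost:
    run only one round when all round-one votes already agree, otherwise
    proceed to the protocol's maximum of two rounds."""
    if len(set(round_one_votes)) <= 1:
        return 1
    return max_rounds
```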
MAD-Fact provides a transparent, interpretable, and scalable framework for nuanced long-form factuality assessment, addressing gaps in both claim prioritization and adversarial model collaboration (Ning et al., 27 Oct 2025). Its dataset and architecture serve as baselines for advancing multi-perspective, retrieval-augmented LLM evaluation protocols in critical safety contexts.