MAD-Fact: Multi-Agent Factuality Check
- MAD-Fact is a multi-agent, debate-based framework that evaluates long-form LLM outputs by decomposing responses into atomic claims.
- It integrates structured claim extraction, diverse evaluator debates, and a weighted fact importance hierarchy to enhance factual reliability.
- Benchmarking on LongHalluQA demonstrates that MAD-Fact improves precision and recall over traditional fact-checking methods in complex texts.
MAD-Fact refers to a debate-based, multi-agent factuality evaluation framework designed for long-form outputs from LLMs. It integrates structured claim decomposition, heterogeneous agent deliberation, and a fact importance hierarchy, supported by a new benchmark, LongHalluQA. MAD-Fact advances the assessment of factual reliability in complex, multi-claim generative texts, targeting safe LLM deployment in high-risk or regulated domains (Ning et al., 27 Oct 2025).
1. Motivation and Problem Setting
LLMs routinely generate long-form answers whose factual accuracy is crucial in domains such as medicine, law, and education. Existing evaluation tools for short texts (e.g., FactScore, FacTool, Q2) struggle with reasoning chains, cumulative errors, and the non-uniform importance of individual claims found in long responses. Moreover, subjective, single-model annotation may miss subtle inconsistencies and be vulnerable to systematic bias.
MAD-Fact addresses these limitations by:
- Decomposing long-form answers into atomic claims.
- Engaging multiple agents with diverse expertise and retrieval strategies in a debate over each claim.
- Aggregating consensus verdicts using a weighted scheme that emphasizes higher-importance facts (Ning et al., 27 Oct 2025).
2. System Architecture and Protocol
MAD-Fact implements a three-stage architecture:
(i) Clerk (Claim Extraction)
Receives a pair $(q, a)$, where $q$ is a question and $a$ is a long-form LLM-generated answer, and outputs a set of atomic factual claims $C = \{c_1, \dots, c_n\}$. Formally, $\mathrm{Clerk}(q, a) \to C$.
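The paper does not ship reference code, so the Clerk stage is sketched below under stated assumptions: `call_llm` is a hypothetical stand-in for the Clerk model (GPT-4o-mini in the experiments) that naively treats each sentence as one atomic claim; a real deployment would prompt an actual LLM.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source_span: str  # portion of the answer the claim was extracted from


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the Clerk's LLM. For illustration only:
    it splits the answer into sentences, one per line."""
    answer = prompt.split("ANSWER:", 1)[1]
    return "\n".join(s.strip() for s in answer.split(".") if s.strip())


def extract_claims(question: str, answer: str) -> list[Claim]:
    """Clerk stage: decompose a long-form answer into atomic factual claims."""
    prompt = (
        "Decompose the answer into atomic, self-contained factual claims.\n"
        f"QUESTION: {question}\nANSWER: {answer}"
    )
    return [Claim(text=line, source_span=line)
            for line in call_llm(prompt).splitlines()]
```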
(ii) Jury of Evaluators (Multi-Agent Debate)
A group of six role-specialized evaluator agents $\{A_1, \dots, A_6\}$, each simulating a different perspective (General Public, Critic, Scientist, Data Analyst, Psychologist, News Author), debates each extracted claim $c_i$. Debate proceeds for up to two rounds per claim, with each agent studying previous statements and optionally retrieving fresh evidence. For each round and claim, the acting agent returns:
- A direct judgement $j_k \in \{\text{true}, \text{false}\}$ on the claim
- An explanation supporting that judgement
- Optionally, a retrieval query together with the externally sourced evidence it returns
At the end, the full message pool (round-by-round debate log) and knowledge pool (retrieved evidences) form the factuality context for each claim.
(iii) Judge (Aggregation and Verdict)
A judge agent aggregates the final votes for each claim $c_i$ via majority rule, breaking ties in favor of the last speaker. This produces a binary verdict for every claim. An overall factuality score is then computed over all claims, weighted by their hierarchical importance (see below).
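The Judge's aggregation rule can be sketched directly. Votes are assumed to arrive in speaking order, so the tie-break falls to the last speaker; the role names in the usage example are illustrative.

```python
def judge_verdict(votes: list[tuple[str, bool]]) -> bool:
    """Aggregate per-claim agent votes (in speaking order) by majority rule.

    Each vote is a (role, judgement) pair; True marks the claim as factual.
    Ties go to the judgement of the last speaker, per the protocol above.
    """
    yes = sum(1 for _, v in votes if v)
    no = len(votes) - yes
    if yes != no:
        return yes > no
    return votes[-1][1]  # tie-break: last speaker's judgement
```

For example, with two agents split evenly, `judge_verdict([("Critic", True), ("Scientist", False)])` returns the second (last) agent's judgement.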
3. Fact Importance Hierarchy and Weighted Metrics
Recognizing that some claims in a long answer are more critical than others, MAD-Fact introduces a fact importance hierarchy:
- Reference answers for a given question are generated by expert models (e.g., GPT-4, Claude-3, DeepSeek).
- Each reference is decomposed into atomic claims.
- A cross-expert merging step aligns semantically equivalent claims, forming a "golden set" of reference claims.
- A claim's frequency across expert references defines its importance: the more experts that mention a fact, the more critical it is considered.
- Each claim is assigned an importance level, and pre-set weights that increase monotonically with level are attached.
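A minimal sketch of golden-set construction, assuming normalized exact string match as a stand-in for the paper's semantic alignment of equivalent claims (which would use an LLM or embedding similarity in practice):

```python
from collections import Counter


def build_golden_set(expert_claims: dict[str, list[str]],
                     canon=lambda s: s.lower().strip()) -> dict[str, int]:
    """Merge atomic claims across expert models and count how many experts
    mention each one.  `canon` is an illustrative stand-in for semantic
    alignment.  Returns claim -> importance level (expert mention count)."""
    counts: Counter = Counter()
    for claims in expert_claims.values():
        # Dedupe within one expert so each expert contributes at most one vote.
        for claim in {canon(c) for c in claims}:
            counts[claim] += 1
    return dict(counts)
```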
The answer-level precision, recall, and F1 are then reweighted by claim importance, with F1 given as the harmonic mean of weighted precision and recall, and a recall "slack" parameter tolerating length/coverage mismatches (Ning et al., 27 Oct 2025).
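Since the exact reweighted formulas were lost in extraction, the following is a plausible sketch consistent with the description above: importance-weighted precision over the answer's claims, weighted recall over the golden set with a slack discount, combined by harmonic mean. The `slack` parameter's exact form is an assumption.

```python
def weighted_f1(claim_weights: list[float], claim_verdicts: list[bool],
                golden_weights: list[float], golden_covered: list[bool],
                slack: float = 0.0) -> float:
    """Importance-weighted factuality score (sketch, not the paper's exact
    formula).  Precision: weighted share of the answer's claims judged true.
    Recall: weighted share of the golden set the answer covers, with `slack`
    in [0, 1) discounting the denominator to tolerate coverage mismatches.
    F1 is the harmonic mean of the two."""
    total = sum(claim_weights)
    precision = sum(w for w, ok in zip(claim_weights, claim_verdicts) if ok) / total
    g_total = sum(golden_weights) * (1.0 - slack)
    covered = sum(w for w, ok in zip(golden_weights, golden_covered) if ok)
    recall = min(1.0, covered / g_total) if g_total > 0 else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```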
4. The LongHalluQA Dataset
LongHalluQA is a large-scale Chinese long-form factuality benchmark, constructed to evaluate multi-agent debate frameworks:
- Contains 2,746 samples from seven domains (Chinese culture, natural sciences, social sciences, among others).
- Average answer length is over nine times that of the original HalluQA.
- Each sample includes reference knowledge, sub-question expansion, and manual verification of factual consistency and clarity.
This dataset enables robust quantitative and qualitative evaluation of factual verification protocols under realistic, high-information-load conditions.
5. Experimental Results
MAD-Fact was evaluated on both standard fact-checking tasks and long-form outputs. The core pipeline employs GPT-4o-mini for the Clerk and Evaluators, with Llama3.3-70B-instr as Judge.
Fact-Checking Benchmark Performance
Across tasks such as Factcheck-Bench, FacToolQA, BingCheck, FELM-WK, and FactEval-CN:
| Method | TRUE F1 (range) | FALSE F1 (range) |
|---|---|---|
| SAFE | 0.68–0.87 | 0.51–0.74 |
| FIRE | 0.71–0.88 | 0.53–0.70 |
| MAD-Fact | 0.74–0.88 | 0.64–0.73 |
MAD-Fact outperforms strong baselines in 8 of 10 F1 comparisons, improving both sensitivity to factual errors and recall of correct claims.
Long-Form Scoring
Weighted F1 (with the recall slack parameter applied) on held-out LongFact and LongHalluQA splits:
- On LongFact, GPT-4-Turbo leads, followed by Doubao-1.5-Pro (0.669) and Qwen2.5-72B (0.648).
- On LongHalluQA, Chinese models (QwQ-32B, Doubao-1.5-Pro) approach or exceed GPT-4 by up to 10 points, likely due to in-domain pretraining advantages.
6. Design Insights and Case Analysis
- Agent diversity: Multiple conversational roles, retrieval strategies (autonomous, mandatory, adaptive), and external evidence sources contribute to more stable and error-corrective debate outcomes.
- Importance reweighting: Emphasizing pivotal claims leads to higher-quality, actionable factuality scores, addressing the unequal significance of content.
- Consensus stabilization: Peer agents collectively correct judgment errors and prevent overconfident misclassification by promoting deliberative discussion.
- Robustness: The combination of single-model, dual-model, and jury voting enables performance that generalizes well even on unseen, long-form user queries.
7. Limitations and Future Directions
- Dataset scope: LongHalluQA is built from trusted knowledge bases, not live user logs. Further research should extend to dynamic, domain-specific, or real-world user-generated queries.
- Scalability: MAD-Fact increases inference cost due to agent multiplicity and external retrieval. Dynamic early termination, adaptive agent allocation, and parameter-efficient agent architectures are key optimization areas.
- Residual bias and groupthink: Multi-agent debates can still become overconfident in converging on spurious claims. Future work includes confidence-weighted voting, formal cross-agent logical checks, and adversarial stress-testing with planted falsehoods.
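As an illustration of the dynamic early termination named under scalability, one simple rule (an assumption, not the paper's design) is to skip the second debate round whenever the first round is already unanimous:

```python
def debate_rounds(round_one_votes: list[bool], max_rounds: int = 2) -> int:
    """Illustrative early-termination rule for cutting inference cost:
    run only one round when all round-one votes already agree, otherwise
    proceed to the protocol's maximum of two rounds."""
    if len(set(round_one_votes)) <= 1:
        return 1
    return max_rounds
```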
MAD-Fact provides a transparent, interpretable, and scalable framework for nuanced long-form factuality assessment, addressing gaps in both claim prioritization and adversarial model collaboration (Ning et al., 27 Oct 2025). Its dataset and architecture serve as baselines for advancing multi-perspective, retrieval-augmented LLM evaluation protocols in critical safety contexts.