FactArena: Automated Fact-Checking Benchmark
- FactArena is an automated arena-style evaluation framework that benchmarks large language models using stage-wise fact-checking workflows.
- It employs a multi-stage pipeline—including claim extraction, evidence retrieval, and justification-based verdict prediction—with adaptive claim evolution to increase reasoning difficulty.
- The framework uses iterative feedback and difficulty metrics, such as accuracy drop and Elo shift, to effectively uncover weaknesses in LLM factual competence.
FactArena is a fully automated arena-style evaluation framework designed to comprehensively benchmark LLMs across the entire fact-checking workflow. It is distinguished by its stage-wise pipeline—which includes claim extraction, evidence retrieval, and justification-based verdict prediction—and by its use of arena-driven claim evolution to adaptively increase reasoning difficulty. FactArena enables unbiased, scalable comparison of model factual competence and robustness, particularly in safety-critical fact-checking scenarios (Lin et al., 6 Jan 2026).
1. Objectives and Motivation
FactArena was developed to address the limitations of existing LLM evaluation methodologies, which typically concentrate on static claim verification while neglecting the broader fact-checking pipeline and model blind spots. Its principal goals include:
- Adaptivity: Continuously generating new "hard" claims rather than evaluating models on a fixed seed set, thereby challenging the evolving frontier of model capability.
- Semantically controlled hardness: Producing claim variations that retain the factual pattern and entities of originals but embed subtle reasoning challenges, exposing gaps in model reasoning.
- Automated feedback loop: Employing arena-style LLM judges to identify claims that are universally recognized as "easy," and directing the evolution agent to inject further difficulty.
This approach overcomes the stagnation inherent in fixed benchmarking datasets—where test leakage and saturation at high accuracy obscure the genuine comparative robustness of advanced LLMs.
2. Architecture and Workflow
The FactArena claim-evolution module operates in a staged fashion:
- Seed Evaluation: A pool of seed claims is individually routed through claim extraction, retrieval, and verification across all target models.
- Reverse Step (Round 0): If all models predict correctly, the Evolver agent (an LLM, e.g., GPT-o4-mini) generates a contrastive claim $c^{(0)}$, which flips the truth value of the seed claim but introduces no new entities.
- Iterative Evolution: If $c^{(0)}$ is still universally answered correctly, the Evolver ingests model outputs, judge comments, and instructions for targeting the weaknesses these reveal. It then generates further evolved claims $c^{(t)}$ for rounds $t = 1, 2, \dots$, with each round aiming to break model unanimity.
- Feedback and Stopping Criterion: After each evolution round, LLM judge agents evaluate model responses; if at least one model fails, the evolved claim is retained and the process terminates for that seed.
The authors summarize this core pipeline as pseudocode that follows the staged procedure above.
Inter-agent interactions are defined by the use of model verdicts, judge rationales, and explicit instruction templates guiding the Evolver to systematically surface weaknesses.
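The staged loop described in this section can be sketched in Python as follows. This is an illustrative sketch, not the authors' pseudocode: the function name `evolve_seed`, the callable interfaces for models, the reverse step, and the evolution step, and the `max_rounds` cap are all assumptions introduced here. The LLM judges are simplified to a direct comparison of each verdict against a gold label.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the paper):
#   a model maps a claim to a True/False verdict;
#   reverse/evolve return a new claim together with its gold label.
Model = Callable[[str], bool]
ReverseFn = Callable[[str, bool], Tuple[str, bool]]
EvolveFn = Callable[[str, bool, List[bool]], Tuple[str, bool]]


def evolve_seed(
    seed: str,
    gold: bool,
    models: List[Model],
    reverse: ReverseFn,
    evolve: EvolveFn,
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> Optional[Tuple[str, bool, int]]:
    """Return (claim, gold label, round) for the first claim that breaks
    model unanimity, or None if no such claim is found."""
    # Seed evaluation: evolution is only triggered when every model is correct.
    if any(m(seed) != gold for m in models):
        return None

    # Round 0: the reverse step flips the truth value of the seed.
    claim, label = reverse(seed, gold)
    for rnd in range(max_rounds + 1):
        verdicts = [m(claim) for m in models]
        # Stopping criterion: at least one model fails, so keep this claim.
        if any(v != label for v in verdicts):
            return claim, label, rnd
        if rnd < max_rounds:
            # Feed the verdicts back so the Evolver can target weaknesses.
            claim, label = evolve(claim, label, verdicts)
    return None
```

In a real run, each callable would wrap an LLM call, and the feedback passed to `evolve` would also carry judge rationales rather than bare verdicts.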
3. Semantically Controlled Claim Generation
The framework enforces semantic fidelity and challenge increase through constrained prompting and optional similarity regularizers:
- Reverse Step Constraint: "Produce a logically valid claim that flips the truth-value of the seed claim $c$, without introducing new entities."
- Evolution Step Constraint: "Analyze where models erred or gave shallow reasoning, then rephrase to target those weaknesses. Preserve core entities and relations."
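Optional measures for semantic control and regularization include embedding-based cosine similarity thresholds $\cos(\phi(c), \phi(c')) \geq \tau$, where $c$ is the source claim, $c'$ is the candidate claim, $\phi$ denotes a sentence embedding (such as SBERT-WK), and $\tau$ is a similarity threshold; and adversarial objectives based on Levenshtein edit distance:
$$d_{\mathrm{Lev}}(c, c') \geq \delta$$
where $\delta$ typically ranges from 5 to 10 characters to penalize trivial paraphrases.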
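A minimal sketch of how the two optional checks might be combined is given below. It assumes that sentence embeddings are already available as plain float vectors (no embedding library is called), that the similarity threshold `tau` defaults to 0.8 purely as a placeholder, and that the edit-distance floor `delta` is taken from the 5 to 10 character range mentioned above. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def passes_semantic_control(
    source: str,
    candidate: str,
    emb_source: Sequence[float],
    emb_candidate: Sequence[float],
    tau: float = 0.8,  # assumed placeholder; the source value is not known
    delta: int = 5,    # lower end of the 5-10 character range
) -> bool:
    """Accept a candidate that stays semantically close to the source
    yet differs from it by at least `delta` character edits."""
    close_enough = cosine_similarity(emb_source, emb_candidate) >= tau
    not_trivial = levenshtein(source, candidate) >= delta
    return close_enough and not_trivial
```

The similarity threshold keeps the candidate on topic, while the edit-distance floor rejects rewrites that only change a character or two.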
4. Adaptive Difficulty and Challenge Scoring
FactArena operationalizes difficulty and novelty in claim evaluation as follows:
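- Difficulty Metric: $D(c) = 1 - \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}[\hat{y}_m(c) = y(c)]$, the fraction of the $M$ models that answer claim $c$ incorrectly; $D(c) = 0$ triggers evolution, $D(c) = 1$ means all models err.
- Novelty Metric: To prevent near-duplicate claims, keep a candidate $c'$ only if
$$\max_{c'' \in \mathcal{C}} \cos(\phi(c'), \phi(c'')) < \tau_{\text{nov}}$$
with $\mathcal{C}$ the set of already accepted claims, $\phi$ a sentence embedding, and $\tau_{\text{nov}}$ a novelty threshold.
- Round-wise Update: For each successful evolution, the round number is recorded and the progression of $D$ is tracked. Empirically, average model accuracy drops as evolution progresses, for example, from 100% in seed claims to 68% (reverse), 52% (first evolution), and 45% (second evolution).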
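The difficulty and novelty checks above can be illustrated with a short Python sketch. It assumes difficulty is the fraction of models that answer a claim incorrectly, and that novelty is tested by comparing a candidate's embedding against those of already accepted claims. The function names and the threshold argument `tau_nov` are hypothetical; no specific threshold value is taken from the source.

```python
import math
from typing import List, Sequence


def difficulty(verdicts: Sequence[bool], gold: bool) -> float:
    """Fraction of models whose verdict disagrees with the gold label.
    0.0 means every model is correct; 1.0 means every model errs."""
    return sum(v != gold for v in verdicts) / len(verdicts)


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_novel(
    candidate_emb: Sequence[float],
    accepted_embs: List[Sequence[float]],
    tau_nov: float,  # assumed novelty threshold, supplied by the caller
) -> bool:
    """Keep a candidate only if its highest similarity to any accepted
    claim stays below the novelty threshold."""
    max_sim = max((_cosine(candidate_emb, e) for e in accepted_embs), default=-1.0)
    return max_sim < tau_nov
```

With this definition, a claim that three of four models answer correctly has difficulty 0.25, so it would be retained rather than evolved further.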
5. Evaluation, Integration, and Effectiveness
Accepted evolved claims are incorporated into the FactArena claim set, substantially increasing the breadth and complexity of evaluation. Each model subsequently participates in original and evolved claim battles, with all fact-checking pipeline stages applied.
Effectiveness is assessed by metrics including:
- Accuracy Drop: The difference in mean model accuracy on the claim set before and after evolution.
- Elo Shift: The change in Elo-style ranking scores for each model due to increased claim difficulty.
- Ablation Studies: Removing evolution or reverse steps yields higher average accuracy and less separation in model rankings, confirming the function and necessity of each component.
On average, each model is tested in approximately 400 evolved-claim battles, accounting for 20–30% of its total arena judgments. The result is a more stratified model ranking, with top-tier LLMs distinguished by their resilience to difficult claims and weaker models more distinctly penalized.
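To make the accuracy-drop and Elo-shift metrics concrete, the sketch below uses the standard Elo update with a 400-point logistic scale and a K-factor of 32. These constants, and the function names, are assumptions for illustration; the source only describes the ranking as "Elo-style" and does not specify its parameters.

```python
from typing import Sequence, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one battle.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def accuracy_drop(pre: Sequence[bool], post: Sequence[bool]) -> float:
    """Mean accuracy before evolution minus mean accuracy after it."""
    return sum(pre) / len(pre) - sum(post) / len(post)


def elo_shift(before: float, after: float) -> float:
    """Change in a model's rating once evolved-claim battles are included."""
    return after - before
```

For two models starting at 1500, a single win moves the winner to 1516 and the loser to 1484 under these assumed constants, so a model that keeps winning on evolved claims accumulates a positive Elo shift.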
6. Examples and Operational Flow
Illustrative examples of arena-driven claim evolution elucidate the system's semantic rigor and adaptive difficulty:
- Reverse Step:
Original: "The Eiffel Tower was built before the Statue of Liberty." (False)
Reverse: "The Statue of Liberty was constructed before the Eiffel Tower." (True)
- First Evolution (Multi-hop Reasoning):
The prompt directs the Evolver to require reasoning over chronology, designers, and inauguration dates. Generated Claim: "Although the Eiffel Tower opened in 1889 and the Statue of Liberty was inaugurated in 1886, the latter’s pedestal design began in 1879 under Bartholdi, predating Eiffel’s actual construction contract in 1886."
- Second Evolution (Complex Entity Relations):
A targeted prompt integrates the French-American finance dispute: Output: "The French-funded pedestal of the Statue of Liberty was completed in 1884, four years before Eiffel finalized his contract for the Tower’s structural steel in 1888, reversing the common construction chronology."
The operational flow is summarized by the following sequence:
| Step | Evaluation Outcome | Next Action |
|---|---|---|
| Evaluate seed claim | All models correct | Reverse step |
| Evaluate reverse | All models correct | Iterative evolution |
| Each evolution round | At least one model fails | Record claim and terminate |
| Each evolution round | All models still correct | Continue evolution rounds |
This sequence ensures claims only evolve when unanimity persists, maximizing the focus on model robustness and subtle reasoning failures (Lin et al., 6 Jan 2026).
A plausible implication is that FactArena’s adaptive, LLM-driven workflow offers a reproducible paradigm for future multi-agent, adversarial model testing scenarios, particularly where current evaluation datasets prove insufficiently discriminative.