LLM-Driven Fact-Checking Process
- The paper introduces an automated LLM-driven fact-checking pipeline using standardized zero-shot prompts and a rigorously curated claim corpus.
- LLM-driven fact-checking is a process that applies generative AI and retrieval pipelines to evaluate the veracity of political, scientific, and social claims.
- Empirical results reveal strengths in FALSE detection with moderate TRUE label performance, underscoring the need for continuous fine-tuning and domain-specific adjustments.
The LLM-driven fact-checking process is an automated workflow that leverages generative AI and retrieval pipelines to assess the veracity of natural-language statements, particularly political, scientific, and social claims. Recent frameworks combine standardized zero-shot prompting with topic-sensitive benchmarking, regression analysis, and out-of-the-box performance evaluation across multiple LLM families. The process aims to measure and augment the factual reliability of large neural models in realistic settings, uncovering both strengths (topic-sensitive guardrails, comparatively strong FALSE detection) and persistent deficits (low recall on TRUE, topic and data bias, and the speed of model evolution). The following sections provide a comprehensive overview of pipeline design, prompt formulation, auditing methodology, empirical findings, topic-level analysis, and key limitations as reported in "Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information" (Kuznetsova et al., 11 Mar 2025).
1. Dataset Curation and Statement Filtering
A rigorously annotated claim corpus is foundational for LLM-based fact-checking. ClaimsKG v2.0, a professionally curated dataset of 29,803 English-language statements (2000–2023) rated by 13 fact-checking organizations, serves as the gold standard. Statements are systematically filtered using criteria to eliminate confounding noise:
- Exclude items containing “video”, “photo”, “image”, “clip” to avoid multimodal ambiguity.
- Discard non-English statements for consistent linguistic modeling.
- Remove direct speech forms (“I”, “we”, “say”) and interrogatives (ending with "?").
- Exclude those rated “OTHER” by fact-checkers.
The resulting corpus contains 16,513 statements, stratified as follows:
- TRUE: 1,724
- FALSE: 7,989
- MIXTURE: 6,800
This enables precise ground-truth evaluation and permits topic-level stratification.
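The exclusion criteria above can be sketched as a single predicate applied to each ClaimsKG record. This is an illustrative implementation, not the paper's code; the function name, arguments, and keyword patterns are assumptions.

```python
import re

# Illustrative patterns for the paper's exclusion criteria.
MEDIA_TERMS = re.compile(r"\b(video|photo|image|clip)\b", re.IGNORECASE)
DIRECT_SPEECH = re.compile(r"\b(I|we|say)\b", re.IGNORECASE)
KEPT_LABELS = {"TRUE", "FALSE", "MIXTURE"}  # statements rated "OTHER" are dropped

def keep_statement(text: str, label: str, is_english: bool) -> bool:
    """Return True if a statement survives all filtering criteria."""
    if not is_english:                     # non-English statements excluded
        return False
    if label not in KEPT_LABELS:           # "OTHER" ratings excluded
        return False
    if MEDIA_TERMS.search(text):           # multimodal ambiguity
        return False
    if DIRECT_SPEECH.search(text):         # direct speech forms
        return False
    if text.strip().endswith("?"):         # interrogatives
        return False
    return True
```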
2. Prompt Templates and Task Formulation
LLMs under audit (ChatGPT-4, Llama 3 70B, Llama 3.1 405B, Claude 3.5 Sonnet, Google Gemini) are queried in a strict zero-shot context. The standardized prompt template enforces brevity and explicitness in label prediction:
```text
Please label the statement [TEXT] (published [YEAR]) as TRUE, FALSE,
MIXTURE or N/A using these definitions:
• TRUE = factually accurate without significant omissions
• FALSE = factually inaccurate
• MIXTURE = holds both truth and falsehood
• N/A = insufficient information
Present only the label, a comma, and a justification (≤20 words).
```
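Instantiating the template per statement amounts to simple string substitution; a minimal sketch (the function and template variable names are illustrative):

```python
# The published template, with [TEXT] and [YEAR] as substitution slots.
PROMPT_TEMPLATE = (
    "Please label the statement [{text}] (published {year}) as TRUE, FALSE, "
    "MIXTURE or N/A using these definitions: "
    "TRUE = factually accurate without significant omissions; "
    "FALSE = factually inaccurate; "
    "MIXTURE = holds both truth and falsehood; "
    "N/A = insufficient information. "
    "Present only the label, a comma, and a justification (<=20 words)."
)

def format_prompt(text: str, year: int) -> str:
    """Fill the standardized zero-shot template for one statement."""
    return PROMPT_TEMPLATE.format(text=text, year=year)
```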
3. Auditing Workflow and Evaluation Metrics
The AI auditing workflow randomizes statement order on each run to avoid position and batch effects, and submits batches of 100–200 statements to stay within API rate limits, with a fixed temperature of 0.1. For each model, raw outputs are parsed to extract the initial all-caps label and the concise justification.
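The label-extraction step can be sketched with a regular expression that captures the leading all-caps label and the justification after the comma; the function name and exact pattern are assumptions, not the paper's code.

```python
import re

# Match the initial label (TRUE/FALSE/MIXTURE/N/A), a comma, then the justification.
LABEL_RE = re.compile(r"^\s*(TRUE|FALSE|MIXTURE|N/A)\s*,\s*(.*)$", re.DOTALL)

def parse_label(raw: str):
    """Split a raw model response into (label, justification); None if malformed."""
    m = LABEL_RE.match(raw)
    if not m:
        return None
    return m.group(1), m.group(2).strip()
```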
Veracity evaluation is operationalized via standard confusion-matrix–derived metrics, computed for each label category:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F₁ = 2 · Precision · Recall / (Precision + Recall)

Here, TP/FP/FN/TN are defined per class (e.g., FALSE) against the ClaimsKG ground truth.
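A minimal one-vs-rest implementation of these per-class metrics (a sketch; in practice a library such as scikit-learn would typically be used):

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one label class, one-vs-rest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```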
4. Topic Modeling and Regression Analysis
Statement topics are partitioned using BERTopic, producing high-density clusters followed by hierarchical merging:
- Embedding: Sentence-transformers (MiniLM-L6-v2) yield 384-dimensional text vectors.
- Dimensionality reduction: UMAP projects embeddings into lower space.
- Clustering: HDBSCAN generates 123 clusters; K-means splits outliers into 20 further clusters.
- Final aggregation: Hierarchical agglomerative clustering (Ward’s method, distance=2.0) merges to 10 macro-topics.
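The final aggregation step can be sketched with SciPy on toy centroids; in the pipeline the inputs would be the fine-grained topic embeddings, and everything here except Ward's method and the distance threshold of 2.0 is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy centroids for 6 fine-grained topics: two well-separated groups in 2-D.
rng = np.random.default_rng(0)
centroids = np.vstack([
    rng.normal(0.0, 0.05, size=(3, 2)),   # group A
    rng.normal(5.0, 0.05, size=(3, 2)),   # group B
])

# Ward's method merges clusters that minimize within-cluster variance;
# cutting the dendrogram at distance 2.0 yields the macro-topics.
Z = linkage(centroids, method="ward")
macro = fcluster(Z, t=2.0, criterion="distance")
```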
Top-level category labels (ChatGPT-4 generated, human-reviewed) include:
- COVID-19 Controversies
- U.S. Economic Analysis
- U.S. Fiscal Impact
- U.S. Immigration Debate
- U.S. Policy Issues (reference)
- U.S. Social Issues
- Societal Debates
- Global Politics
- North American Politicians’ Controversies
- American Elections & Economy
Regression analysis quantifies factors influencing model–ground-truth agreement. The logistic specification (reconstructed here from the reported covariates) models per-statement agreement, fitted separately for each model:

logit P(agreementᵢ = 1) = β₀ + β_topic + β_veracity + β_topic×veracity

This enables odds-ratio (e^β) analysis for topic, model, and veracity effects and their interactions.
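A sketch of such a per-model agreement regression using statsmodels' formula API; the column names and the tiny balanced dataset are illustrative, with `agree` equal to 1 when the model's label matches ClaimsKG.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy, perfectly balanced data: every topic x veracity cell has one hit and one miss.
df = pd.DataFrame({
    "agree":    [1, 0, 1, 0, 1, 0, 1, 0],
    "topic":    ["covid", "covid", "covid", "covid",
                 "policy", "policy", "policy", "policy"],
    "veracity": ["FALSE", "FALSE", "TRUE", "TRUE",
                 "FALSE", "FALSE", "TRUE", "TRUE"],
})

# Logistic regression with a full topic x veracity interaction.
fit = smf.logit("agree ~ C(topic) * C(veracity)", data=df).fit(disp=0)
odds_ratios = np.exp(fit.params)   # exponentiated coefficients are odds ratios
```

With balanced toy data every coefficient is 0, so all odds ratios equal 1; on real agreement data the exponentiated topic coefficients would reproduce the topic-level odds ratios reported below.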
5. Key Empirical Results and Guardrails
Empirical findings reveal consistent strengths and weaknesses:
F₁ for FALSE detection:
- ChatGPT-4: ≈ 0.77
- Google Gemini: ≈ 0.69
- Claude 3.5: ≈ 0.71
- Llama 3.1: ≈ 0.62
- Llama 3.0: ≈ 0.58
- TRUE detection:
- F₁ ranges 0.25–0.36 (across models)
- MIXTURE detection:
- F₁ 0.43–0.67
- Recall of FALSE:
- ChatGPT-4 ≈ 0.80
- FALSE precision:
- Llama 3.0 ≈ 0.88
- Topic sensitivity:
- Odds ratios for detecting FALSE are >2.0 for: COVID-19, American Elections/Economy, Social Issues, Politicians’ Controversies vs. baseline U.S. Policy Issues.
This suggests models have operational “guardrails” that heighten accuracy on topics with richer training coverage or safety-focused fine-tuning.
6. Limitations and Practitioner Recommendations
Major limitations include:
- US-centric, English-only corpus may inflate accuracy estimates; generalization to multilingual domains remains unvalidated.
- Ground-truth label heterogeneity across fact-checkers undermines inter-rater reliability.
- Lack of source context in statements reduces model grounding.
- Rapid LLM version turnover impedes reproducible benchmarking.
Recommendations for practitioners:
- Enhance prompt design with few-shot balanced exemplars or chain-of-thought instructions for improved consistency.
- Fine-tune models on domain-specific corpora to address coverage gaps.
- Expand datasets to encompass non-English and low-resource topics to mitigate bias.
- Adopt continuous audit cycles as models evolve.
- Monitor topic-specific weaknesses and deploy targeted domain augmentation.
7. Reproducibility Pipeline Sketch
A minimal reproducible pipeline (see pseudocode) is:
```python
# Pseudocode: helper names mirror the pipeline stages described above.
statements = load_claimskg()
filtered = [
    s for s in statements
    if not contains_media(s)
    and is_english(s)
    and not is_direct_speech(s)
    and s.label in {"TRUE", "FALSE", "MIXTURE"}
    and not s.text.endswith("?")
]

topics = fit_bertopic([s.text for s in filtered])
labelled_topics = label_and_merge(topics, method="hierarchical")

metrics, reg_results = {}, {}
for model in [GPT4, Llama3, Llama3_1, Claude35, Gemini]:
    responses = []
    for batch in chunk(shuffle(filtered), size=100):
        prompts = [format_prompt(s) for s in batch]
        outputs = call_api(model, prompts, temp=0.1)
        responses += parse_labels(outputs)
    metrics[model] = compute_precision_recall_f1(
        [s.label for s in filtered], responses)
    agreement_df = build_agreement_df(filtered, responses, labelled_topics)
    reg_results[model] = logistic_regression(
        agreement_df.y, agreement_df.topics, agreement_df.veracity,
        interactions=True)
```
This pipeline architecture enables transparent assessment and extension of LLM-based fact-checking systems in research and applied settings (Kuznetsova et al., 11 Mar 2025).