
LLM-Driven Fact-Checking Process

Updated 13 January 2026
  • The paper introduces an automated LLM-driven fact-checking pipeline using standardized zero-shot prompts and a rigorously curated claim corpus.
  • LLM-driven fact-checking is a process that applies generative AI and retrieval pipelines to evaluate the veracity of political, scientific, and social claims.
  • Empirical results reveal strengths in FALSE detection with moderate TRUE label performance, underscoring the need for continuous fine-tuning and domain-specific adjustments.

The LLM-driven fact-checking process is an automated workflow that leverages generative AI and retrieval pipelines to assess the veracity of natural-language statements, particularly political, scientific, and social claims. Recent frameworks combine standardized zero-shot prompting with topic-sensitive benchmarking, regression analysis, and out-of-the-box performance evaluation across multiple LLM families. The process aims to measure and augment the factual reliability of large neural models in realistic settings, uncovering both strengths (topic-sensitive guardrails, relatively strong FALSE detection) and persistent deficits (low recall on TRUE, topic and data bias, rapid model evolution). The following sections provide an overview of pipeline design, prompt formulation, auditing methodology, empirical findings, topic-level analysis, and key limitations as reported in "Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information" (Kuznetsova et al., 11 Mar 2025).

1. Dataset Curation and Statement Filtering

A rigorously annotated claim corpus is foundational for LLM-based fact-checking. ClaimsKG v2.0, a professionally curated dataset of 29,803 English-language statements (2000–2023) rated by 13 fact-checking organizations, serves as the gold standard. Statements are systematically filtered using criteria to eliminate confounding noise:

  • Exclude items containing “video”, “photo”, “image”, “clip” to avoid multimodal ambiguity.
  • Discard non-English statements for consistent linguistic modeling.
  • Remove direct speech forms (“I”, “we”, “say”) and interrogatives (ending with "?").
  • Exclude those rated “OTHER” by fact-checkers.

The resulting corpus contains 16,513 statements, stratified as follows:

  • TRUE: 1,724
  • FALSE: 7,989
  • MIXTURE: 6,800

This enables precise ground-truth evaluation and permits topic-level stratification.
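The filtering criteria above can be sketched as a simple predicate. This is a minimal illustration, not the paper's actual code: the word lists, field handling, and omission of language detection (which would require a language-ID library) are all assumptions.

```python
MEDIA_TERMS = {"video", "photo", "image", "clip"}
DIRECT_SPEECH = {"i", "we", "say", "says", "said"}
KEPT_LABELS = {"TRUE", "FALSE", "MIXTURE"}

def keep_statement(text: str, label: str) -> bool:
    """Apply the corpus-filtering criteria to a single claim."""
    words = text.lower().split()
    if any(term in words for term in MEDIA_TERMS):
        return False                      # multimodal ambiguity
    if any(w in DIRECT_SPEECH for w in words):
        return False                      # direct speech forms
    if text.rstrip().endswith("?"):
        return False                      # interrogatives
    return label in KEPT_LABELS           # drop "OTHER"-rated items
```

A language check (discarding non-English statements) would be applied alongside these conditions.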

2. Prompt Templates and Task Formulation

LLMs under audit (ChatGPT-4, Llama 3 70B, Llama 3.1 405B, Claude 3.5 Sonnet, Google Gemini) are queried in a strict zero-shot context. The standardized prompt template enforces brevity and explicitness in label prediction:

```
Please label the statement
[TEXT] (published [YEAR])
as TRUE, FALSE, MIXTURE or N/A using these definitions:
• TRUE = factually accurate without significant omissions
• FALSE = factually inaccurate
• MIXTURE = holds both truth and falsehood
• N/A = insufficient information
Present only the label, a comma, and a justification (≤20 words).
```

No few-shot exemplars or chain-of-thought instructions are used at inference, isolating base-model veracity judgments without context augmentation or rationale scaffolding.
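Prompt construction and label extraction can be sketched as follows. The helper names (`format_prompt`, `parse_label`) and the fallback value `"UNPARSED"` are illustrative assumptions, not taken from the paper's code.

```python
import re

PROMPT_TEMPLATE = (
    "Please label the statement\n"
    "{text} (published {year})\n"
    "as TRUE, FALSE, MIXTURE or N/A using these definitions:\n"
    "- TRUE = factually accurate without significant omissions\n"
    "- FALSE = factually inaccurate\n"
    "- MIXTURE = holds both truth and falsehood\n"
    "- N/A = insufficient information\n"
    "Present only the label, a comma, and a justification (<=20 words)."
)

def format_prompt(text: str, year: int) -> str:
    """Fill the standardized zero-shot template for one statement."""
    return PROMPT_TEMPLATE.format(text=text, year=year)

def parse_label(output: str) -> str:
    """Extract the leading all-caps label from a raw model response."""
    m = re.match(r"\s*(TRUE|FALSE|MIXTURE|N/A)\b", output)
    return m.group(1) if m else "UNPARSED"
```

Parsing only the leading label keeps evaluation robust to variation in the justification text that follows the comma.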

3. Auditing Workflow and Evaluation Metrics

The AI auditing workflow randomizes statement order on each run to avoid position and batch effects, submitting batches of 100–200 statements to respect API rate limits, with a fixed temperature of 0.1. For each model, raw outputs are parsed to extract the initial all-caps label and the concise justification.
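The randomize-and-batch step can be sketched as below; the helper name and seeding behavior are illustrative assumptions, not details from the paper.

```python
import random

def batched_runs(statements, batch_size=100, seed=None):
    """Shuffle statements for one audit run, then yield fixed-size batches."""
    rng = random.Random(seed)
    order = list(statements)
    rng.shuffle(order)                 # fresh order per run: no position effects
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]
```

Keeping the shuffled order in a list also lets the gold labels be aligned with model responses after all batches return.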

Veracity evaluation is operationalized via standard confusion-matrix–derived metrics, computed for each label category:

$$\begin{aligned} \text{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN} \\ \text{Precision} &= \frac{TP}{TP+FP} \\ \text{Recall} &= \frac{TP}{TP+FN} \\ F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned}$$

Here, TP/FP/FN/TN are defined for each class (e.g., FALSE) against ClaimsKG ground-truth.
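A minimal sketch of the per-class (one-vs-rest) computation, assuming gold and predicted labels are given as parallel lists of strings:

```python
def per_class_metrics(gold, pred, positive):
    """One-vs-rest precision, recall, and F1 for a single label class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Running this once per label (TRUE, FALSE, MIXTURE) yields the class-stratified scores reported in the results section.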

4. Topic Modeling and Regression Analysis

Statement topics are partitioned using BERTopic, producing high-density clusters followed by hierarchical merging:

  • Embedding: Sentence-transformers (MiniLM-L6-v2) yield 384-dimensional text vectors.
  • Dimensionality reduction: UMAP projects embeddings into lower space.
  • Clustering: HDBSCAN generates 123 clusters; K-means splits outliers into 20 further clusters.
  • Final aggregation: Hierarchical agglomerative clustering (Ward’s method, distance=2.0) merges to 10 macro-topics.

Top-level category labels (ChatGPT-4 generated, human-reviewed) include:

  1. COVID-19 Controversies
  2. U.S. Economic Analysis
  3. U.S. Fiscal Impact
  4. U.S. Immigration Debate
  5. U.S. Policy Issues (reference)
  6. U.S. Social Issues
  7. Societal Debates
  8. Global Politics
  9. North American Politicians’ Controversies
  10. American Elections & Economy

Regression analysis quantifies factors influencing model–ground-truth agreement. The logistic specification:

$$\text{logit}(P(y_i=1)) = \beta_0 + \sum_t \beta_{1t} \, \text{Topic}_{i,t} + \sum_v \beta_{2v} \, \text{Veracity}_{i,v} + \sum_{t,v} \beta_{3tv} \, (\text{Topic}_{i,t} \times \text{Veracity}_{i,v}) + \varepsilon_i$$

This enables odds-ratio analysis for topic, model, and veracity interactions.
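For a single binary topic indicator, the fitted coefficient's exponential equals the familiar 2×2 odds ratio, which can be illustrated directly (fitting the full interaction model would use a statistics package; the counts below are invented for the example):

```python
def odds_ratio(agree_topic, n_topic, agree_base, n_base):
    """Odds of model-ground-truth agreement in a topic vs. the baseline topic."""
    odds_topic = agree_topic / (n_topic - agree_topic)
    odds_base = agree_base / (n_base - agree_base)
    return odds_topic / odds_base
```

An odds ratio above 2.0, as reported for COVID-19 and election-related topics, means agreement odds in that topic are more than double those in the baseline U.S. Policy Issues category.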

5. Key Empirical Results and Guardrails

Empirical findings reveal consistent strengths and weaknesses:

  • Macro F₁ for FALSE detection:
    • ChatGPT-4: ≈ 0.77
    • Google Gemini: ≈ 0.69
    • Claude 3.5: ≈ 0.71
    • Llama 3.1: ≈ 0.62
    • Llama 3.0: ≈ 0.58
  • TRUE detection:
    • F₁ ranges 0.25–0.36 (across models)
  • MIXTURE detection:
    • F₁ 0.43–0.67
  • Recall of FALSE:
    • ChatGPT-4 ≈ 0.80
  • FALSE precision:
    • Llama 3.0 ≈ 0.88
  • Topic sensitivity:
    • Odds ratios for detecting FALSE are >2.0 for: COVID-19, American Elections/Economy, Social Issues, Politicians’ Controversies vs. baseline U.S. Policy Issues.

This suggests models have operational “guardrails” that heighten accuracy on topics with richer training coverage or safety-focused fine-tuning.

6. Limitations and Practitioner Recommendations

Major limitations include:

  • US-centric, English-only corpus may inflate accuracy estimates; generalization to multilingual domains remains unvalidated.
  • Ground-truth label heterogeneity across fact-checkers undermines inter-rater reliability.
  • Lack of source context in statements reduces model grounding.
  • Rapid LLM version turnover impedes reproducible benchmarking.

Recommendations for practitioners:

  • Enhance prompt design with few-shot balanced exemplars or chain-of-thought instructions for improved consistency.
  • Fine-tune models on domain-specific corpora to address coverage gaps.
  • Expand datasets to encompass non-English and low-resource topics to mitigate bias.
  • Adopt continuous audit cycles as models evolve.
  • Monitor topic-specific weaknesses and deploy targeted domain augmentation.

7. Reproducibility Pipeline Sketch

A minimal reproducible pipeline, in Python-style pseudocode:

```python
statements = load_claimskg()
filtered = [s for s in statements
            if not contains_media(s)
            and is_english(s)
            and not is_direct_speech(s)
            and s.label in {"TRUE", "FALSE", "MIXTURE"}
            and not s.text.endswith("?")]
topics = fit_bertopic([s.text for s in filtered])
labelled_topics = label_and_merge(topics, method="hierarchical")

for model in [GPT4, Llama3, Llama3_1, Claude35, Gemini]:
    responses = []
    order = shuffle(filtered)              # randomize per run
    for batch in chunk(order, size=100):
        prompts = [format_prompt(s) for s in batch]
        outputs = call_api(model, prompts, temp=0.1)
        responses += parse_labels(outputs)
    # align gold labels with the shuffled response order
    metrics[model] = compute_precision_recall_f1(
        [s.label for s in order], responses)
    agreement_df = build_agreement_df(order, responses, labelled_topics)
    reg_results[model] = logistic_regression(
        agreement_df.y,
        agreement_df.topics,
        agreement_df.veracity,
        interactions=True)
```

This pipeline architecture enables transparent assessment and extension of LLM-based fact-checking systems in research and applied settings (Kuznetsova et al., 11 Mar 2025).
