Low-Resource Fact-Checking: Methods & Challenges
- Low-resource fact-checking is the process of verifying claims in settings with limited data and computational resources using multilingual datasets and efficient model tuning.
- The methodologies employ synthetic data generation, cross-lingual transfer, modular architectures, and crowdsourced annotation to improve evidence retrieval and label accuracy.
- Practical challenges such as domain shift, evidence scarcity, and model overconfidence are addressed through calibration techniques and human-in-the-loop oversight.
Low-resource fact-checking refers to the development and deployment of automated or semi-automated fact verification systems in contexts where annotated data, computational resources, or language technology infrastructure are severely limited. This includes the vast majority of languages and regions outside English and a few other high-resource environments. The central objective is to ensure scalable, accurate, and equitable verification of factual claims in settings with minimal labeled corpora, little language-specific pretraining, and limited infrastructure.
1. Core Challenges in Low-Resource Fact-Checking
Low-resource fact-checking is hampered simultaneously by the scarcity of annotated data and limited computational or language modeling capacity. Key difficulties include:
- Annotated Data Scarcity: Manual annotation for fact-checking (e.g., assigning SUPPORTS, REFUTES, NOT ENOUGH INFORMATION) is prohibitively expensive and logistically taxing, especially where native expert annotators are rare (Chung et al., 21 Feb 2025).
- Linguistic Coverage and Domain Shift: Most existing resources, models, and benchmarks (FEVER, VitaminC, etc.) are English-centric; naive translation to low-resource languages fails to account for linguistic and cultural context, introducing domain mismatch, translation noise, and degraded model performance (Chung et al., 21 Feb 2025, Cekinel et al., 2024).
- Evidence Retrieval Complexity: Claims in low-resource languages often reference local knowledge or use linguistic structures with little lexical overlap with available evidence corpora. Retrieval and verification of such claims require systems robust to paraphrase, cross-domain variation, and high “novelty” in n-grams or dependencies (Le et al., 2024, Hoa et al., 2024).
- Model Confidence and Calibration: The small, computationally cheap models typical of low-resource contexts display overconfidence combined with low accuracy, which risks amplifying misinformation, a phenomenon documented as the “Dunning–Kruger” confidence paradox (Qazi et al., 10 Sep 2025).
- Cross-lingual and Multimodal Complexity: Evidence to support or refute claims may be available only in other languages (cross-lingual retrieval) or other modalities (images, tables), demanding sophisticated retrieval and fusion strategies (Huang et al., 2022, Singhal et al., 2021).
2. Dataset Design and Construction
Sophisticated dataset design is fundamental to advancing fact-checking in low-resource settings. Notable strategies include:
- Domain Diversification and Structured Annotation: ViFactCheck for Vietnamese crawled nine government-licensed news websites across 12 domains, ensuring claims are sampled across a broad topical range (Hoa et al., 2024). Annotation procedures require pilot phases, rigorous guidelines, and explicit multi-evidence labeling. The resultant dataset (7,232 claim–evidence pairs) exhibited high reliability (Fleiss’ κ = 0.83).
- Synthetic Multilingual Data Generation: MultiSynFact introduced a scalable LLM-driven pipeline that extracts knowledge sentences from Wikipedia, prompts LLMs to generate three claims per sentence (SUPPORTS, REFUTES, NOT-INFO), and combines LLM self-evaluation with MNLI filtering for quality control, producing 2.2M multilingual claim-source pairs (Chung et al., 21 Feb 2025); a minimal sketch of this generate-and-filter pattern appears after this list.
- Evidence Diversity and Novelty: ViWikiFC constructed a 20K+ claim–evidence corpus from Vietnamese Wikipedia, explicitly measuring new-word, new-dependency, and new n-gram rates between claims and evidence, revealing retrieval difficulties for NOT ENOUGH INFORMATION claims (NEI new word rate 50.44%) (Le et al., 2024).
- Crowd-Driven and Distant Supervision: Systems such as CrowdChecked automatically mined hundreds of thousands of tweet–fact-check pairs by matching links shared in social media, with noisy labels refined by self-adaptive training and weak supervision protocols (Hardalov et al., 2022).
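The synthetic-data recipe above can be approximated with a short generate-and-filter loop. The sketch below is a minimal illustration under stated assumptions, not the MultiSynFact implementation: the NLI checkpoint, the label mapping, and the `generate_claims` stub (to be filled with whatever LLM prompting interface is available) are all placeholders, and the LLM self-evaluation step described in the paper is omitted for brevity.

```python
# Minimal generate-and-filter sketch for synthetic claim generation with NLI-based
# quality control. Model names, prompts, and label mapping are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"  # any multilingual NLI checkpoint works
tok = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

# NLI class we expect for each fact-checking label if the generated claim is well-formed.
EXPECTED_NLI = {"SUPPORTS": "entailment", "REFUTES": "contradiction", "NOT-INFO": "neutral"}

def generate_claims(source_sentence: str) -> dict[str, str]:
    """Hypothetical LLM call: prompt a generator for one claim per label."""
    raise NotImplementedError("plug in an LLM prompting call here")

def nli_label(premise: str, hypothesis: str) -> str:
    """Predicted NLI class for a (source sentence, generated claim) pair."""
    with torch.no_grad():
        inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        pred = nli(**inputs).logits.argmax(dim=-1).item()
    return nli.config.id2label[pred].lower()

def filter_synthetic(source_sentence: str) -> list[tuple[str, str, str]]:
    """Keep only (source, claim, label) triples whose NLI prediction matches the intended label."""
    kept = []
    for label, claim in generate_claims(source_sentence).items():
        if nli_label(source_sentence, claim) == EXPECTED_NLI[label]:
            kept.append((source_sentence, claim, label))
    return kept
```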
| Dataset | Language(s) | Claim-Evidence Pairs | Domains | IAA (κ or equivalent) |
|---|---|---|---|---|
| ViFactCheck | Vietnamese | 7,232 | 12 news topics | 0.83 (Fleiss’ κ) |
| ViWikiFC | Vietnamese | 20,916 | Wikipedia | 0.9587 (Fleiss’ κ) |
| MultiSynFact | en, es, de (+ext) | 2.2M | Wikipedia | LLM+NLI/spot-checks |
| CrowdChecked | English (+tweets) | 332,660 | Social media | - |
| FactDRIL | 13 Indian langs | 22,435 | Multi-domain | 0.76–1.00 |
| FCTR | Turkish | 3,238 | Multi-domain | - |
3. Model Architectures and Transfer Learning
Model selection and adaptation in low-resource settings leverage a mixture of pretrained language models (PLMs), large language models (LLMs), and parameter-efficient fine-tuning:
- Multilingual Pretrained Models: Systems such as ViFactCheck and ViWikiFC employ PhoBERT, XLM-R, and InfoXLM, leveraging pretrained multilingual representations and fine-tuning on language-specific supervision (Hoa et al., 2024, Le et al., 2024).
- LLMs and LoRA Fine-Tuning: Large open-source models (Llama2/3, Mistral-7B, Gemma-7B) are fine-tuned with LoRA adapters (r=16, α=16) over 5 epochs, achieving competitive macro-F1 on modest hardware (e.g., Gemma-7B: 89.90% macro-F1 for Vietnamese) (Hoa et al., 2024); QLoRA parameter-efficient tuning is similarly applied to Llama-2 for Turkish (Cekinel et al., 2024). A configuration sketch follows this list.
- Prompt Engineering vs. Fine-Tuning: Zero-shot and few-shot prompting of LLMs underperforms task-specific fine-tuning; for example, Gemma-7B scored 39.47% F1 zero-shot versus 89.90% after fine-tuning (Hoa et al., 2024). In Turkish, Llama-2-13B fine-tuned on 500 local examples reached a macro-F1 of 0.890 (Cekinel et al., 2024).
- Cross-Lingual Transfer and Self-Supervised Objectives: CONCRETE introduces a cross-lingual bi-encoder trained on the Inverse Cloze Task (X-ICT), optimizing dot-product similarity of claim and passage embeddings across languages and showing +2.23 pp macro-F1 improvement over previous systems in zero-shot settings (Huang et al., 2022).
- Modular and Plug-and-Play Frameworks: Self-Checker assembles LLM-driven modules (claim decomposition, query generation, evidence selection, verdict prediction) as a fully prompt-driven pipeline, demanding no in-domain training but with substantial trade-offs in accuracy and latency (2305.14623).
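For concreteness, the following is a minimal LoRA configuration mirroring the reported r=16, α=16 setup, written against the HuggingFace PEFT library. The base checkpoint, target modules, dropout, and classification-head formulation are illustrative assumptions rather than the cited papers' exact recipes; QLoRA differs mainly in loading the base weights in 4-bit before attaching the same adapters.

```python
# Parameter-efficient fine-tuning sketch (LoRA, r=16, alpha=16) for three-way claim
# verification. The base model and target modules below are assumptions for illustration.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE = "google/gemma-7b"  # assumption: any open decoder LM with a classification head
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFORMATION"]

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=len(LABELS))

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor, as reported for the Vietnamese setup
    lora_dropout=0.05,                    # assumption; not specified in the text
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights

# Inputs are claim-evidence pairs; training proceeds with a standard Trainer loop over ~5 epochs.
enc = tokenizer("claim text", "retrieved evidence text", truncation=True, return_tensors="pt")
```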
4. Retrieval, Claim Matching, and Evidence Aggregation
Claim verification is often bottlenecked by the retrieval of relevant evidence, especially when lexical overlap is low or evidence is available only cross-lingually:
- Sparse vs. Dense Retrieval: BM25 achieves high accuracy for SUPPORTS and REFUTES (88.30%/86.93%) in Vietnamese Wikipedia, but only 56.67% for NEI due to low lexical overlap. Hybrid pipelines (BM25 + SBERT, dense dual-encoders) are recommended for improved performance (Le et al., 2024); a sketch of such a two-stage pipeline follows this list.
- Claim Matching in Messaging Platforms: Cross-lingual claim-matching models (student XLM-R distilled from English SBERT) outperform LASER and LaBSE on WhatsApp data in Bengali, Malayalam, Tamil (MRR=0.528 Bengali for I-XLM-R) (Kazemi et al., 2021).
- Noisy Distant Supervision: In CrowdChecked, bi-encoder SBERT models are trained with modified Multiple Negatives Ranking (MNR) loss and self-adaptive label weighting to address large-scale noisy tweet–fact-check matches, with MAP@5 gains >11 points over NLytics (Hardalov et al., 2022).
- NER-Based Query Expansion: WikiCheck demonstrates that entity extraction from claims and issuing separate Wikipedia queries per entity increases average recall from 0.628 (raw query) to 0.879 (NER-flair-fast, N=3), critical for supporting retrieval in CPU or low-memory deployments (Trokhymovych et al., 2021).
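A two-stage sparse-dense pipeline of the kind recommended above can be sketched as follows. The libraries (rank_bm25, sentence-transformers), the encoder checkpoint, and the whitespace tokenization are placeholder assumptions, not the cited systems' configurations; for Vietnamese or other languages where whitespace splitting is inadequate, a proper tokenizer should replace `str.split`.

```python
# Hybrid retrieval sketch: BM25 proposes candidates, a multilingual encoder re-ranks them.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["...evidence sentence 1...", "...evidence sentence 2..."]  # e.g., Wikipedia sentences
bm25 = BM25Okapi([doc.split() for doc in corpus])  # whitespace tokens for brevity only

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative checkpoint
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(claim: str, k_sparse: int = 100, k_final: int = 5) -> list[str]:
    # Stage 1: cheap lexical recall with BM25.
    sparse_scores = bm25.get_scores(claim.split())
    candidates = np.argsort(sparse_scores)[::-1][:k_sparse].tolist()
    # Stage 2: dense semantic re-ranking of the candidate pool.
    claim_emb = encoder.encode(claim, convert_to_tensor=True)
    dense_scores = util.cos_sim(claim_emb, corpus_emb[candidates])[0]
    order = dense_scores.argsort(descending=True)[:k_final].tolist()
    return [corpus[candidates[i]] for i in order]
```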
5. Evaluation Protocols and Error Analysis
Systematic assessment of low-resource fact-checking systems employs macro-F1, precision, recall, retrieval accuracy (Accuracy@k), and stricter pipeline metrics such as strict accuracy, which requires both the correct evidence and the correct label (see the code sketch after the error analysis below):
| Model/System | Language/Benchmark | Macro-F1 (absolute or gain) | Method |
|---|---|---|---|
| Gemma-7B (LoRA) | Vietnamese | 89.90% (gold evidence) | Fine-tuned on ViFactCheck with gold evidence (Hoa et al., 2024) |
| InfoXLM (Large) | Vietnamese | 86.51% (verdict prediction) | ViWikiFC, monolingual (Le et al., 2024) |
| mDeBERTa-v3-base | Multilingual | up to +0.203 | MultiSynFact augmentation (Chung et al., 21 Feb 2025) |
| CONCRETE (mBERT + X-ICT) | X-Fact | +2.2 pp (zero-shot) | Cross-lingual claim-style retrieval (Huang et al., 2022) |
Dominant error modes include evidence retrieval failure, semantic ambiguity, multi-step inferential chains, and hallucinated inference despite correct evidence (Hoa et al., 2024). For NEI claims, high novelty at the lexical and syntactic level severely degrades both retrieval and label prediction (Le et al., 2024).
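The metrics referenced above can be pinned down in a few lines. The helpers below assume string verdict labels and count evidence as correct only when every gold evidence item appears among the retrieved items; benchmarks differ in the exact matching rule, so this is a simplifying sketch rather than any single dataset's official scorer.

```python
# Macro-F1 over verdict labels, plus a "strict accuracy" that requires both the
# correct label and full gold-evidence retrieval for a claim to count as correct.
from sklearn.metrics import f1_score

def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    return f1_score(y_true, y_pred, average="macro")

def strict_accuracy(y_true, y_pred, gold_evidence, retrieved_evidence) -> float:
    hits = 0
    for label, pred, gold, retrieved in zip(y_true, y_pred, gold_evidence, retrieved_evidence):
        if pred == label and set(gold) <= set(retrieved):
            hits += 1
    return hits / len(y_true)
```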
6. Human-in-the-Loop, Crowdsourcing, and Policy Considerations
Human input, crowdsourcing, and governance are crucial in low-resource and high-risk settings:
- Crowdsourcing Fact-Verification: Twitter Birdwatch demonstrates that a volunteer-driven note-and-rating mechanism achieves 83.2% agreement with experts, with verification latency an order of magnitude lower and near-zero direct cost (Saeed et al., 2022).
- Hybrid Human/Model Oversight: Overconfident small LLMs (e.g., Llama-7B, Mistral-7B) risk error amplification; policies must ensure human review of all verdicts that fall below critical confidence thresholds, along with transparent disclosure of known model limitations and biases (Qazi et al., 10 Sep 2025). A sketch of such a thresholding policy follows this list.
- Scalability via Plug-and-Play and Modular Design: Frameworks such as Self-Checker enable rapid adaptation in new languages or domains without large-scale annotation but can only supplement, not yet replace, more deeply fine-tuned models (2305.14623).
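The human-review policy described above amounts to selective classification with a confidence threshold. The sketch below combines post-hoc temperature scaling with an abstention rule; the temperature and threshold values are placeholders that would be tuned on held-out data rather than prescribed by the cited work.

```python
# Route low-confidence verdicts to human reviewers instead of publishing them automatically.
import torch

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFORMATION"]

def route_verdict(logits: torch.Tensor, temperature: float = 2.0, threshold: float = 0.85):
    """Return (verdict, confidence, needs_human_review) for one claim's logits."""
    probs = torch.softmax(logits / temperature, dim=-1)  # post-hoc temperature scaling
    confidence, idx = probs.max(dim=-1)
    needs_review = confidence.item() < threshold         # abstain below the threshold
    return LABELS[idx.item()], confidence.item(), needs_review

# Usage: logits come from any verification model; flagged cases go into a review queue.
verdict, conf, review = route_verdict(torch.tensor([2.1, 0.3, 1.8]))
```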
7. Future Directions and Best Practices
Leading efforts identify several priorities for further advances in low-resource fact-checking:
- Extending Data Coverage and Diversity: Expand annotation to social media, speech, and multimodal evidence; scale NEI (NOT ENOUGH INFORMATION) cases; inject contrastive/ambiguous examples (Hoa et al., 2024, Chung et al., 21 Feb 2025).
- Richer Retrieval and Reasoning: Integrate neural rerankers, knowledge graphs, program-guided reasoning, and cross-modal fusion (e.g., tables, images) (Chung et al., 21 Feb 2025, Le et al., 2024).
- Instruction and Domain Adaptive Tuning: Instruction-tune LLMs on synthetic high-quality corpora (MultiSynFact) and explore domain-adaptive pretraining to counteract English/global bias (Chung et al., 21 Feb 2025, Cekinel et al., 2024).
- Confidence Calibration and Equitable Access: Post-hoc calibration, tunable selective classification, and mandated human oversight per emerging EU AI Act guidelines are essential to manage risk (Qazi et al., 10 Sep 2025).
- Open-Source Tooling and Infrastructure: The release of entire pipelines (code, data, checkpoints; e.g., github.com/Genaios/MultiSynFact, github.com/trokhymovych/WikiCheck) is central to enabling local adaptation and benchmarking (Hoa et al., 2024, Chung et al., 21 Feb 2025, Trokhymovych et al., 2021).
In summary, low-resource fact-checking research establishes a multi-pronged strategy: high-quality, domain-diverse data annotation; scalable synthetic generation; cross-lingual and parameter-efficient model adaptation; robust retrieval pipelines; and hybrid human–AI oversight—all as conditions for accurate and equitable information verification in global low-resource settings (Hoa et al., 2024, Chung et al., 21 Feb 2025, Le et al., 2024, Huang et al., 2022, Qazi et al., 10 Sep 2025, Saeed et al., 2022).