FactNet-Bench: Multilingual KG & Fact Checking
- FactNet-Bench is a unified evaluation suite that standardizes multilingual KG Completion, QA, and fact checking using a deterministic, reproducible pipeline.
- It integrates over 1.7 billion Wikidata statements and 3 billion evidence pointers from 316 Wikipedia editions, ensuring rigorous auditability and strict leakage controls.
- Baseline results demonstrate that techniques like predicate masking, grammar-guided decoding, and advanced retrieval strategies significantly enhance performance across tasks.
FactNet-Bench is a unified evaluation suite derived from the FactNet resource, designed to facilitate rigorous and reproducible multilingual research in Knowledge Graph Completion, Multilingual Knowledge-Based Question Answering, and Closed-Context Fact Checking at scale. Built atop a billion-scale deterministic knowledge graph encompassing 1.7 billion atomic Wikidata statements and over 3 billion span-grounded evidence pointers across 316 Wikipedia editions, FactNet-Bench operationalizes three task-specific benchmarks with strict leakage controls and baseline implementations, establishing new standards in dataset auditability, provenance, and linguistic breadth (Shen et al., 3 Feb 2026).
1. Definition and Scope
FactNet-Bench serves as a comprehensive evaluation environment integrating three key tasks derived from the FactNet knowledge graph: Knowledge Graph Completion (FactNet-KGC), Multilingual Knowledge-Based Question Answering (FactNet-MKQA), and Multilingual Closed-Context Fact Checking (FactNet-MFC). Each task is instantiated with fixed dataset splits, rigorous leakage controls, and canonical baseline models. FactNet-Bench leverages FactNet’s uniquely deterministic, byte-precise evidence pointer system and broad multilingual coverage, enabling robust benchmarking across entity-centric, text-grounded, and evidence-intensive scenarios.
2. Dataset Construction and Key Statistics
FactNet-Bench is constructed using a fully deterministic pipeline, eschewing stochastic components to guarantee byte-level auditability and split reproducibility. The pipeline ingests two primary data sources: Wikidata JSON dumps (12.1K properties, 1.7B statements) and 316 Wikipedia XML editions with SQL sitelink tables (as of 2025-11-01). Canonical page views are extracted per Wikipedia page—Sentence (markup-stripped, segmented via Stanza or rule-based splitter), Template (AST-extracted infobox fields), and Table (cellular content). Statements are grouped via versioned normalization into FactSynsets, and fact-evidence alignments are performed using prioritized, datatype-aware matchers (infobox key-value, wikilink entity, lexical).
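The prioritized matcher cascade can be sketched as follows. The field names and match rules here are illustrative assumptions, not the released implementation, but they mirror the stated priority order (infobox key-value, then wikilink entity, then lexical):

```python
def align(fact, evidence):
    """Prioritized fact-evidence alignment (illustrative sketch).

    Tries matchers in priority order and returns the name of the first
    one that fires, or None if no matcher applies.
    fact: dict with 'property_label', 'value', and optionally 'value_qid'.
    evidence: dict describing one evidence unit (infobox field, sentence, ...).
    """
    matchers = [
        # Infobox key-value: property label matches the infobox key,
        # and the value string appears in the field value.
        ("infobox_kv", lambda f, e: e.get("kind") == "infobox"
            and f["property_label"].lower() == e.get("key", "").lower()
            and f["value"] in e.get("value", "")),
        # Wikilink entity: the object entity is linked from the evidence unit.
        ("wikilink", lambda f, e: f.get("value_qid") in e.get("links", [])),
        # Lexical value: fall back to a surface-string match in the text.
        ("lexical", lambda f, e: f["value"] in e.get("text", "")),
    ]
    for name, fires in matchers:
        if fires(fact, evidence):
            return name
    return None
```

The cascade returns the highest-priority match type, which is consistent with the per-matcher precision ordering reported below (infobox and wikilink matches being more reliable than lexical ones).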
FactNet-Bench statistics after benchmark filtering are summarized below:
| Benchmark | #Train | #Dev | #Test | #Languages | Notes |
|---|---|---|---|---|---|
| KGC triples | 4,180,000 | 520,000 | 520,000 | — | 248K entities, 320 relations, avg degree 33.7 |
| MKQA questions | 54,000 | 6,800 | 6,800 | 18 | 62% 1-hop/38% 2-hop, avg answer size 2.6 |
| MFC claims | 72,000 | 9,000 | 9,000 | 18 | S/R/NEI=.34/.33/.33, avg evidence 1.4, text length 210 chars |
For MKQA and MFC, the supported languages are the 18 largest Wikipedias. Across FactNet, evidence units break down as Sentence 57.5%, Infobox 28.4%, and Table 14.1%. Audited grounding precision is 0.921 (95% CI [0.913, 0.929]), with per-matcher precisions: wikilink entity 0.973, infobox field 0.944, lexical value 0.889, lead weak 0.808.
3. Task Formalizations
3.1 Knowledge Graph Completion (FactNet-KGC)
The KG Completion task evaluates models on predicting missing entities for incomplete triples. Given a triple (h, r, t), the input is either (h, r, ?) or (?, r, t), and the output is a ranked list of entity candidates. Metrics are filtered Mean Reciprocal Rank (MRR) and Hits@K, computed from the rank of the gold entity after removing all other known true entities from the candidate list:

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}, \qquad \mathrm{Hits@}K = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\mathbb{1}[\mathrm{rank}_i \le K].$$
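Given 1-based filtered ranks for a batch of queries, both metrics are straightforward to compute; a minimal sketch:

```python
def filtered_metrics(ranks, k=10):
    """Compute filtered MRR and Hits@K from 1-based filtered ranks,
    i.e. ranks of the gold entity after removing all other known
    true entities from the candidate list."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits_at_k = sum(1 for r in ranks if r <= k) / n
    return mrr, hits_at_k
```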
3.2 Multilingual Knowledge-Based Question Answering (FactNet-MKQA)
FactNet-MKQA maps a natural-language question q in language ℓ to an executable logical form over FactNet identifiers, conforming to the "hop1", "hop2", or "hop2c" grammar:
- hop1: a single-triple query (e, r, ?) returning the objects of relation r for entity e.
- hop2: a two-relation chain (e, r1, ?x), (?x, r2, ?) whose answers are reached through an intermediate variable.
- hop2c: a two-hop query with an additional constraint triple restricting the answer variable.
The output is the answer set obtained by executing the logical form against the frozen KG. Metrics are macro F1 (after answer normalization) and Valid% (the fraction of outputs that parse and execute under the grammar).
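As a sketch of what execution against the frozen KG involves, the following interprets hypothetical tuple encodings of the three grammars over a toy triple set (the encodings are assumptions for illustration, not the released grammar):

```python
from collections import defaultdict

def execute(form, kg):
    """Execute a hop1/hop2/hop2c logical form over a KG given as a set
    of (head, relation, tail) triples.

    Hypothetical encodings used here:
      ("hop1", e, r)                  -> objects of (e, r, ?)
      ("hop2", e, r1, r2)             -> objects reached via (e, r1, ?x), (?x, r2, ?)
      ("hop2c", e, r1, r2, (rc, vc))  -> hop2 answers y also satisfying (y, rc, vc)
    """
    out = defaultdict(set)  # (head, relation) -> set of tails
    for h, r, t in kg:
        out[(h, r)].add(t)

    kind = form[0]
    if kind == "hop1":
        _, e, r = form
        return out[(e, r)]
    if kind == "hop2":
        _, e, r1, r2 = form
        return {y for x in out[(e, r1)] for y in out[(x, r2)]}
    if kind == "hop2c":
        _, e, r1, r2, (rc, vc) = form
        mid = {y for x in out[(e, r1)] for y in out[(x, r2)]}
        return {y for y in mid if vc in out[(y, rc)]}
    raise ValueError(f"unknown form: {kind}")
```

Because the KG is frozen, execution is deterministic, which is what makes exact-match answer sets (and hence macro F1 over them) well defined.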
3.3 Multilingual Closed-Context Fact Checking (FactNet-MFC)
Given a claim c in one of the supported languages, models classify it as Supported, Refuted, or NEI (Not Enough Information) and retrieve supporting FactSense evidence units, optionally with token spans. Retrieval is performed over the top-K evidence units. Metrics: label accuracy, macro F1, Evidence Recall@K (at least one correct evidence unit in the top K), and Span Evidence F1 (token-level).
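Evidence Recall@K reduces to checking the top-K retrieved units of each claim for any gold unit; a minimal sketch:

```python
def evidence_recall_at_k(retrieved, gold, k=5):
    """Fraction of claims with at least one gold evidence unit in the top K.

    retrieved: list (per claim) of ranked evidence-unit ids.
    gold: list (per claim) of gold evidence-unit ids.
    """
    hits = sum(1 for ret, g in zip(retrieved, gold) if set(ret[:k]) & set(g))
    return hits / len(gold)
```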
4. Evaluation Protocol and Leakage Controls
Global dataset splits are generated by stable hashing of each synset_id into 100 buckets, with buckets 0–79 mapped to Train, 80–89 to Dev, and 90–99 to Test. Critical leakage controls are implemented:
- Training text drawn strictly from Train-aligned FactSenses.
- Predicate masking in text-aware KGC, obscuring property values in entity descriptions to block trivial extraction.
- RelationEdge restriction: only edges between Train synsets in KGC Graph Neural Networks.
- Distinct “Train-only”/“Full” retrieval indexes for MFC.
- Uniform negative sampling: 256 negatives for KGE, 128 for GNN baselines.
- Random seeds fixed at {13,21,42}; mean±std reported for learned models.
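The bucketed split assignment can be sketched as follows; the concrete hash function (SHA-1 here) is an assumption for illustration, while the 80/10/10 bucket mapping follows the protocol above:

```python
import hashlib

def split_of(synset_id: str) -> str:
    """Deterministically assign a synset to a split by stable hashing
    into 100 buckets (hash choice illustrative): 0-79 train,
    80-89 dev, 90-99 test."""
    bucket = int(hashlib.sha1(synset_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "dev"
    return "test"
```

Because the assignment depends only on the identifier, any party can reconstruct identical splits byte-for-byte without shipping split files.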
5. Baseline Models and Benchmark Results
5.1 Knowledge Graph Completion
Structural baselines: TransE, RotatE, CompGCN. Text-aware baselines: SimKGC (contrastive PLM), KG-S2S (seq2seq PLM). Predicate masking proves crucial: without it, KG-S2S MRR inflates from 0.298 to 0.351, confirming that unmasked entity descriptions leak answers and that masking suppresses this trivial extraction.
| Model | MRR | Hits@10 |
|---|---|---|
| TransE | 0.243 | 0.410 |
| RotatE | 0.261 | 0.453 |
| CompGCN | 0.284 | 0.478 |
| SimKGC | 0.293 | 0.486 |
| KG-S2S | 0.298 | 0.492 |
5.2 Multilingual Question Answering
mT5 baseline: standard and grammar-guided decoding. LLMs: Qwen-2.5-72B, LLaMA-3.3-70B (5-shot, grammar-constrained).
| Model | Macro F1 | Valid % |
|---|---|---|
| mT5 (no grammar) | 30.9 | 88.5 |
| mT5 (grammar guided) | 34.1 | 95.2 |
| Qwen-2.5-72B (5-shot) | 41.4 | 93.8 |
| LLaMA-3.3-70B (5-shot) | 36.7 | 92.5 |
5.3 Fact Checking
Baselines: hypothesis-only classifier; retrieval + XLM-R NLI verifier (BM25, E5-large dense, translation-assisted retrieval). Top-5 retrieval aggregation improves accuracy and evidence F1.
| System | Acc | Macro F1 | R@5 | Span F1 |
|---|---|---|---|---|
| Hyp-only | 0.381 | 0.375 | — | — |
| BM25 + XLM-R | 0.654 | 0.641 | 0.76 | 0.41 |
| E5-large + XLM-R | 0.701 | 0.692 | 0.83 | 0.49 |
| E5-large + XLM-R (Top-5) | 0.731 | 0.724 | 0.91 | 0.54 |
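Top-K aggregation can be realized by scoring each retrieved evidence unit with the NLI verifier and combining the per-evidence label distributions; the max-confidence rule below is one illustrative choice, not the paper's exact aggregation:

```python
def aggregate_verdict(per_evidence):
    """Aggregate per-evidence NLI distributions into a claim verdict.

    per_evidence: list of dicts mapping 'SUPPORTED' / 'REFUTED' / 'NEI'
    to probabilities, one dict per retrieved evidence unit.
    Rule (illustrative): pick the non-NEI label with the highest
    single-evidence probability, provided it beats NEI on that unit;
    otherwise fall back to NEI.
    """
    best_label, best_p = "NEI", 0.0
    for probs in per_evidence:
        for label in ("SUPPORTED", "REFUTED"):
            if probs[label] > best_p and probs[label] > probs["NEI"]:
                best_label, best_p = label, probs[label]
    return best_label
```

Aggregating over five units rather than one gives each claim more chances for a decisive evidence match, consistent with the Top-5 row improving both accuracy and evidence metrics.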
6. Analysis and Insights
Structural KGC baselines yield the expected performance hierarchy (TransE < RotatE < CompGCN), with text-aware models adding 0.01–0.02 MRR and GNNs gaining further from train-only RelationEdges. Predicate masking is essential for suppressing trivial solutions in text-rich settings. In MKQA, grammar-guided decoding improves Valid% by 6.7 points and macro F1 by 3.2 points; LLM few-shot prompting achieves the highest semantic accuracy (Qwen-2.5-72B, 41.4 macro F1), while grammar-constrained mT5 achieves the best Valid%. Performance drops in low-resource languages; analysis identifies template mapping and alias gaps as the bottlenecks in non-high-resource tiers.
In MFC, evidence retrieval dominates verifier performance: dense retrieval (E5) exceeds BM25 by 0.05–0.07 R@5, and top-5 aggregation further boosts accuracy and F1. The near-chance hypothesis-only baseline (0.381 accuracy against a 0.333 chance rate) indicates minimal annotation artifacts. While FactNet covers 316 languages, the top 5 comprise ~63% of senses; long-tail languages nevertheless retain high grounding precision (Tier 3: 0.885). The primary bottlenecks are template extraction and alias resolution in non-majority languages.
7. Access, Environment, and Reproducibility
FactNet-Bench is available under permissive (CC0, Evidence-Text Pack CC BY-SA) licenses at HuggingFace (https://hf.co/collections/openbmb/factnet). The deterministic construction pipeline is hosted at https://github.com/yl-shen/factnet, supporting full dataset and split reconstruction. Data formats include JSONL and Parquet shards with explicit schema versioning; loading is supported by index scripts.
Dependencies: Python ≥3.8, Apache Parquet (pyarrow), mwparserfromhell, Stanza (with pinned checksums), and PyTorch/Transformers for baselines. Container manifests or Conda environments with exhaustive version control ensure execution reproducibility.
Best practices: employ released build manifests and language pack hashes, freeze random seeds (13, 21, 42), enforce split-aware filtering and predicate masking, and validate evidence pointers using the provided utility. Following the protocol enables full, byte-exact reproduction of dataset, splits, and baseline metrics for all FactNet-Bench tasks (Shen et al., 3 Feb 2026).