FactNet-Bench: Multilingual KG & Fact Checking
- FactNet-Bench is a unified evaluation suite that standardizes multilingual KG Completion, QA, and fact checking using a deterministic, reproducible pipeline.
- It integrates over 1.7 billion Wikidata statements and 3 billion evidence pointers from 316 Wikipedia editions, ensuring rigorous auditability and strict leakage controls.
- Baseline results demonstrate that techniques like predicate masking, grammar-guided decoding, and advanced retrieval strategies significantly enhance performance across tasks.
FactNet-Bench is a unified evaluation suite derived from the FactNet resource, designed to facilitate rigorous and reproducible multilingual research in Knowledge Graph Completion, Multilingual Knowledge-Based Question Answering, and Closed-Context Fact Checking at scale. Built atop a billion-scale deterministic knowledge graph encompassing 1.7 billion atomic Wikidata statements and over 3 billion span-grounded evidence pointers across 316 Wikipedia editions, FactNet-Bench operationalizes three task-specific benchmarks with strict leakage controls and baseline implementations, establishing new standards in dataset auditability, provenance, and linguistic breadth (Shen et al., 3 Feb 2026).
1. Definition and Scope
FactNet-Bench serves as a comprehensive evaluation environment integrating three key tasks derived from the FactNet knowledge graph: Knowledge Graph Completion (FactNet-KGC), Multilingual Knowledge-Based Question Answering (FactNet-MKQA), and Multilingual Closed-Context Fact Checking (FactNet-MFC). Each task is instantiated with fixed dataset splits, rigorous leakage controls, and canonical baseline models. FactNet-Bench leverages FactNet’s uniquely deterministic, byte-precise evidence pointer system and broad multilingual coverage, enabling robust benchmarking across entity-centric, text-grounded, and evidence-intensive scenarios.
2. Dataset Construction and Key Statistics
FactNet-Bench is constructed using a fully deterministic pipeline, eschewing stochastic components to guarantee byte-level auditability and split reproducibility. The pipeline ingests two primary data sources: Wikidata JSON dumps (12.1K properties, 1.7B statements) and 316 Wikipedia XML editions with SQL sitelink tables (as of 2025-11-01). Canonical page views are extracted per Wikipedia page—Sentence (markup-stripped, segmented via Stanza or rule-based splitter), Template (AST-extracted infobox fields), and Table (cellular content). Statements are grouped via versioned normalization into FactSynsets, and fact-evidence alignments are performed using prioritized, datatype-aware matchers (infobox key-value, wikilink entity, lexical).
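The prioritized matcher cascade can be sketched as follows. The field names and match rules here are illustrative assumptions, not the released implementation, but they mirror the stated priority order (infobox key-value, then wikilink entity, then lexical):

```python
def align(fact, evidence):
    """Prioritized fact-evidence alignment (illustrative sketch).

    Tries matchers in priority order and returns the name of the first
    one that fires, or None if no matcher applies.
    fact: dict with 'property_label', 'value', and optionally 'value_qid'.
    evidence: dict describing one evidence unit (infobox field, sentence, ...).
    """
    matchers = [
        # Infobox key-value: property label matches the infobox key,
        # and the value string appears in the field value.
        ("infobox_kv", lambda f, e: e.get("kind") == "infobox"
            and f["property_label"].lower() == e.get("key", "").lower()
            and f["value"] in e.get("value", "")),
        # Wikilink entity: the object entity is linked from the evidence unit.
        ("wikilink", lambda f, e: f.get("value_qid") in e.get("links", [])),
        # Lexical value: fall back to a surface-string match in the text.
        ("lexical", lambda f, e: f["value"] in e.get("text", "")),
    ]
    for name, fires in matchers:
        if fires(fact, evidence):
            return name
    return None
```

The cascade returns the highest-priority match type, which is consistent with the per-matcher precision ordering reported below (infobox and wikilink matches being more reliable than lexical ones).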
FactNet-Bench statistics after benchmark filtering are summarized below:
| Benchmark | #Train | #Dev | #Test | #Languages | Notes |
|---|---|---|---|---|---|
| KGC triples | 4,180,000 | 520,000 | 520,000 | — | 248K entities, 320 relations, avg degree 33.7 |
| MKQA questions | 54,000 | 6,800 | 6,800 | 18 | 62% 1-hop/38% 2-hop, avg answer size 2.6 |
| MFC claims | 72,000 | 9,000 | 9,000 | 18 | S/R/NEI=.34/.33/.33, avg evidence 1.4, text length 210 chars |
For MKQA and MFC, the supported languages are the 18 largest Wikipedias. Across FactNet, evidence units break down as Sentence 57.5%, Infobox 28.4%, and Table 14.1%. Audited grounding precision is 0.921 (95% CI [0.913, 0.929]), with per-matcher precisions: wikilink entity 0.973, infobox field 0.944, lexical value 0.889, lead weak 0.808.
3. Task Formalizations
3.1 Knowledge Graph Completion (FactNet-KGC)
The KG Completion task evaluates models on predicting missing entities for incomplete triples. Given a triple (h, r, t), the input is either (h, r, ?) or (?, r, t), and the output is a ranked list of entity candidates. Metrics are filtered Mean Reciprocal Rank (MRR) and Hits@K, computed from the rank of the gold entity after removing all other known true entities from the candidate list:

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}, \qquad \mathrm{Hits@}K = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\mathbb{1}[\mathrm{rank}_i \le K].$$
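Given 1-based filtered ranks for a batch of queries, both metrics are straightforward to compute; a minimal sketch:

```python
def filtered_metrics(ranks, k=10):
    """Compute filtered MRR and Hits@K from 1-based filtered ranks,
    i.e. ranks of the gold entity after removing all other known
    true entities from the candidate list."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits_at_k = sum(1 for r in ranks if r <= k) / n
    return mrr, hits_at_k
```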
3.2 Multilingual Knowledge-Based Question Answering (FactNet-MKQA)
FactNet-MKQA maps a natural-language question q in language ℓ to an executable logical form over FactNet identifiers, conforming to the "hop1", "hop2", or "hop2c" grammar:
- hop1: a single-triple query (e, r, ?) returning the objects of relation r for entity e.
- hop2: a two-relation chain (e, r1, ?x), (?x, r2, ?) whose answers are reached through an intermediate variable.
- hop2c: a two-hop query with an additional constraint triple restricting the answer variable.
The output is the answer set obtained by executing the logical form against the frozen KG. Metrics are macro F1 (after answer normalization) and Valid% (the fraction of outputs that parse and execute under the grammar).
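As a sketch of what execution against the frozen KG involves, the following interprets hypothetical tuple encodings of the three grammars over a toy triple set (the encodings are assumptions for illustration, not the released grammar):

```python
from collections import defaultdict

def execute(form, kg):
    """Execute a hop1/hop2/hop2c logical form over a KG given as a set
    of (head, relation, tail) triples.

    Hypothetical encodings used here:
      ("hop1", e, r)                  -> objects of (e, r, ?)
      ("hop2", e, r1, r2)             -> objects reached via (e, r1, ?x), (?x, r2, ?)
      ("hop2c", e, r1, r2, (rc, vc))  -> hop2 answers y also satisfying (y, rc, vc)
    """
    out = defaultdict(set)  # (head, relation) -> set of tails
    for h, r, t in kg:
        out[(h, r)].add(t)

    kind = form[0]
    if kind == "hop1":
        _, e, r = form
        return out[(e, r)]
    if kind == "hop2":
        _, e, r1, r2 = form
        return {y for x in out[(e, r1)] for y in out[(x, r2)]}
    if kind == "hop2c":
        _, e, r1, r2, (rc, vc) = form
        mid = {y for x in out[(e, r1)] for y in out[(x, r2)]}
        return {y for y in mid if vc in out[(y, rc)]}
    raise ValueError(f"unknown form: {kind}")
```

Because the KG is frozen, execution is deterministic, which is what makes exact-match answer sets (and hence macro F1 over them) well defined.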
3.3 Multilingual Closed-Context Fact Checking (FactNet-MFC)
Given a claim c in one of the supported languages, models classify it as Supported, Refuted, or NEI (Not Enough Information) and retrieve supporting FactSense evidence units, optionally with token spans. Retrieval is performed over the top-K evidence units. Metrics: label accuracy, macro F1, Evidence Recall@K (at least one correct evidence unit in the top K), and Span Evidence F1 (token-level).
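Evidence Recall@K reduces to checking the top-K retrieved units of each claim for any gold unit; a minimal sketch:

```python
def evidence_recall_at_k(retrieved, gold, k=5):
    """Fraction of claims with at least one gold evidence unit in the top K.

    retrieved: list (per claim) of ranked evidence-unit ids.
    gold: list (per claim) of gold evidence-unit ids.
    """
    hits = sum(1 for ret, g in zip(retrieved, gold) if set(ret[:k]) & set(g))
    return hits / len(gold)
```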
4. Evaluation Protocol and Leakage Controls
Global dataset splits are generated by stable hashing of each synset_id into 100 buckets, with buckets 0–79 mapped to Train, 80–89 to Dev, and 90–99 to Test. Critical leakage controls are implemented:
- Training text drawn strictly from Train-aligned FactSenses.
- Predicate masking in text-aware KGC, obscuring property values in entity descriptions to block trivial extraction.
- RelationEdge restriction: only edges between Train synsets in KGC Graph Neural Networks.
- Distinct “Train-only”/“Full” retrieval indexes for MFC.
- Uniform negative sampling: 256 negatives for KGE, 128 for GNN baselines.
- Random seeds fixed at {13,21,42}; mean±std reported for learned models.
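The bucketed split assignment can be sketched as follows; the concrete hash function (SHA-1 here) is an assumption for illustration, while the 80/10/10 bucket mapping follows the protocol above:

```python
import hashlib

def split_of(synset_id: str) -> str:
    """Deterministically assign a synset to a split by stable hashing
    into 100 buckets (hash choice illustrative): 0-79 train,
    80-89 dev, 90-99 test."""
    bucket = int(hashlib.sha1(synset_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "dev"
    return "test"
```

Because the assignment depends only on the identifier, any party can reconstruct identical splits byte-for-byte without shipping split files.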
5. Baseline Models and Benchmark Results
5.1 Knowledge Graph Completion
Structural baselines: TransE, RotatE, CompGCN. Text-aware baselines: SimKGC (contrastive PLM), KG-S2S (seq2seq PLM). Predicate masking proves crucial: without it, KG-S2S MRR inflates from 0.298 to 0.351, confirming that unmasked entity descriptions leak answers and that masking suppresses this trivial extraction.
| Model | MRR | Hits@10 |
|---|---|---|
| TransE | 0.243 | 0.410 |
| RotatE | 0.261 | 0.453 |
| CompGCN | 0.284 | 0.478 |
| SimKGC | 0.293 | 0.486 |
| KG-S2S | 0.298 | 0.492 |
5.2 Multilingual Question Answering
mT5 baseline: standard and grammar-guided decoding. LLMs: Qwen-2.5-72B, LLaMA-3.3-70B (5-shot, grammar-constrained).
| Model | Macro F1 | Valid % |
|---|---|---|
| mT5 (no grammar) | 30.9 | 88.5 |
| mT5 (grammar guided) | 34.1 | 95.2 |
| Qwen-2.5-72B (5-shot) | 41.4 | 93.8 |
| LLaMA-3.3-70B (5-shot) | 36.7 | 92.5 |
5.3 Fact Checking
Baselines: hypothesis-only classifier; retrieval + XLM-R NLI verifier (BM25, E5-large dense, translation-assisted retrieval). Top-5 retrieval aggregation improves accuracy and evidence F1.
| System | Acc | Macro F1 | R@5 | Span F1 |
|---|---|---|---|---|
| Hyp-only | 0.381 | 0.375 | — | — |
| BM25 + XLM-R | 0.654 | 0.641 | 0.76 | 0.41 |
| E5-large + XLM-R | 0.701 | 0.692 | 0.83 | 0.49 |
| E5-large + XLM-R (Top-5) | 0.731 | 0.724 | 0.91 | 0.54 |
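Top-K aggregation can be realized by scoring each retrieved evidence unit with the NLI verifier and combining the per-evidence label distributions; the max-confidence rule below is one illustrative choice, not the paper's exact aggregation:

```python
def aggregate_verdict(per_evidence):
    """Aggregate per-evidence NLI distributions into a claim verdict.

    per_evidence: list of dicts mapping 'SUPPORTED' / 'REFUTED' / 'NEI'
    to probabilities, one dict per retrieved evidence unit.
    Rule (illustrative): pick the non-NEI label with the highest
    single-evidence probability, provided it beats NEI on that unit;
    otherwise fall back to NEI.
    """
    best_label, best_p = "NEI", 0.0
    for probs in per_evidence:
        for label in ("SUPPORTED", "REFUTED"):
            if probs[label] > best_p and probs[label] > probs["NEI"]:
                best_label, best_p = label, probs[label]
    return best_label
```

Aggregating over five units rather than one gives each claim more chances for a decisive evidence match, consistent with the Top-5 row improving both accuracy and evidence metrics.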
6. Analysis and Insights
Structural KGC baselines yield the expected performance hierarchy (TransE < RotatE < CompGCN), with text-aware models adding 0.01–0.02 MRR and GNNs gaining further from train-only RelationEdges. Predicate masking is essential for suppressing trivial solutions in text-rich settings. In MKQA, grammar-guided decoding improves Valid% by 6.7 points and macro F1 by 3.2 points; LLM few-shot prompting achieves the highest semantic accuracy (Qwen-2.5-72B, 41.4 macro F1), while grammar-constrained mT5 achieves the best Valid%. Performance drops in low-resource languages; analysis identifies template mapping and alias gaps as the bottlenecks in non-high-resource tiers.
In MFC, evidence retrieval dominates verifier performance: dense retrieval (E5) exceeds BM25 by 0.05–0.07 R@5, and top-5 aggregation further boosts accuracy and F1. The near-chance hypothesis-only baseline (0.381 accuracy against a 0.333 chance rate) indicates minimal annotation artifacts. While FactNet covers 316 languages, the top 5 comprise ~63% of senses; long-tail languages nevertheless retain high grounding precision (Tier 3: 0.885). The primary bottlenecks are template extraction and alias resolution in non-majority languages.
7. Access, Environment, and Reproducibility
FactNet-Bench is available under permissive (CC0, Evidence-Text Pack CC BY-SA) licenses at HuggingFace (https://hf.co/collections/openbmb/factnet). The deterministic construction pipeline is hosted at https://github.com/yl-shen/factnet, supporting full dataset and split reconstruction. Data formats include JSONL and Parquet shards with explicit schema versioning; loading is supported by index scripts.
Dependencies: Python ≥3.8, Apache Parquet (pyarrow), mwparserfromhell, Stanza (with pinned checksums), and PyTorch/Transformers for baselines. Container manifests or Conda environments with exhaustive version control ensure execution reproducibility.
Best practices: employ released build manifests and language pack hashes, freeze random seeds (13, 21, 42), enforce split-aware filtering and predicate masking, and validate evidence pointers using the provided utility. Following the protocol enables full, byte-exact reproduction of dataset, splits, and baseline metrics for all FactNet-Bench tasks (Shen et al., 3 Feb 2026).