
FactNet-Bench: Multilingual KG & Fact Checking

Updated 4 February 2026
  • FactNet-Bench is a unified evaluation suite that standardizes multilingual KG Completion, QA, and fact checking using a deterministic, reproducible pipeline.
  • It integrates over 1.7 billion Wikidata statements and 3 billion evidence pointers from 316 Wikipedia editions, ensuring rigorous auditability and strict leakage controls.
  • Baseline results demonstrate that techniques like predicate masking, grammar-guided decoding, and advanced retrieval strategies significantly enhance performance across tasks.

FactNet-Bench is a unified evaluation suite derived from the FactNet resource, designed to facilitate rigorous and reproducible multilingual research in Knowledge Graph Completion, Multilingual Knowledge-Based Question Answering, and Closed-Context Fact Checking at scale. Built atop a billion-scale deterministic knowledge graph encompassing 1.7 billion atomic Wikidata statements and over 3 billion span-grounded evidence pointers across 316 Wikipedia editions, FactNet-Bench operationalizes three task-specific benchmarks with strict leakage controls and baseline implementations, establishing new standards in dataset auditability, provenance, and linguistic breadth (Shen et al., 3 Feb 2026).

1. Definition and Scope

FactNet-Bench serves as a comprehensive evaluation environment integrating three key tasks derived from the FactNet knowledge graph: Knowledge Graph Completion (FactNet-KGC), Multilingual Knowledge-Based Question Answering (FactNet-MKQA), and Multilingual Closed-Context Fact Checking (FactNet-MFC). Each task is instantiated with fixed dataset splits, rigorous leakage controls, and canonical baseline models. FactNet-Bench leverages FactNet’s uniquely deterministic, byte-precise evidence pointer system and broad multilingual coverage, enabling robust benchmarking across entity-centric, text-grounded, and evidence-intensive scenarios.

2. Dataset Construction and Key Statistics

FactNet-Bench is constructed using a fully deterministic pipeline, eschewing stochastic components to guarantee byte-level auditability and split reproducibility. The pipeline ingests two primary data sources: Wikidata JSON dumps (12.1K properties, 1.7B statements) and 316 Wikipedia XML editions with SQL sitelink tables (as of 2025-11-01). Canonical page views are extracted per Wikipedia page—Sentence (markup-stripped, segmented via Stanza or rule-based splitter), Template (AST-extracted infobox fields), and Table (cellular content). Statements are grouped via versioned normalization into FactSynsets, and fact-evidence alignments are performed using prioritized, datatype-aware matchers (infobox key-value, wikilink entity, lexical).
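The prioritized, datatype-aware matching described above can be sketched as follows. This is a minimal illustration under simplified, hypothetical data structures (the `fact` and `page` dictionaries and field names are assumptions, not the pipeline's actual schema); only the matcher priority order mirrors the description.

```python
# Sketch of prioritized fact-evidence alignment. Matcher priority follows the
# pipeline description (infobox key-value > wikilink entity > lexical); the
# data structures are hypothetical simplifications.

def match_infobox(fact, page):
    """Highest priority: exact infobox key-value match."""
    for key, value in page.get("infobox", {}).items():
        if key == fact["property_label"] and value == fact["value"]:
            return {"type": "infobox_field", "key": key}
    return None

def match_wikilink(fact, page):
    """Next priority: the object entity appears as a wikilink in a sentence."""
    for i, sent in enumerate(page.get("sentences", [])):
        if fact["object_qid"] in sent["wikilinks"]:
            return {"type": "wikilink_entity", "sentence": i}
    return None

def match_lexical(fact, page):
    """Lowest priority: the literal value appears in sentence text."""
    for i, sent in enumerate(page.get("sentences", [])):
        if str(fact["value"]) in sent["text"]:
            return {"type": "lexical_value", "sentence": i}
    return None

MATCHERS = [match_infobox, match_wikilink, match_lexical]  # priority order

def align(fact, page):
    """Return the first (highest-priority) evidence match, or None."""
    for matcher in MATCHERS:
        hit = matcher(fact, page)
        if hit is not None:
            return hit
    return None
```

A fact matched by an infobox field is never demoted to a lexical match, which is what makes the resulting evidence-match type a meaningful precision stratum in the audit.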

FactNet-Bench statistics after benchmark filtering are summarized below:

| Benchmark | #Train | #Dev | #Test | #Languages | Notes |
|---|---|---|---|---|---|
| KGC (triples) | 4,180,000 | 520,000 | 520,000 | — | 248K entities, 320 relations, avg degree 33.7 |
| MKQA (questions) | 54,000 | 6,800 | 6,800 | 18 | 62% 1-hop / 38% 2-hop, avg answer size 2.6 |
| MFC (claims) | 72,000 | 9,000 | 9,000 | 18 | S/R/NEI = .34/.33/.33, avg evidence 1.4, avg claim length 210 chars |

For MKQA and MFC, the supported languages are the 18 largest Wikipedias. Across FactNet, the evidence-unit breakdown is: Sentence 57.5%, Infobox 28.4%, Table 14.1%. Audited grounding precision is 0.921 (95% CI [0.913, 0.929]), with per-match-type precisions: wikilink entity 0.973, infobox field 0.944, lexical value 0.889, lead weak 0.808.

3. Task Formalizations

3.1 Knowledge Graph Completion (FactNet-KGC)

The KG Completion task evaluates models on predicting missing entities for incomplete triples. Given $G_{train} = \{(s,p,o)\}$, the input is either $(s,p,?)$ or $(?,p,o)$, and the output is a ranked list of candidate entities $o \in E$. Metrics are filtered Mean Reciprocal Rank (MRR) and Hits@K:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}, \qquad \mathrm{Hits@}K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}(\mathrm{rank}_i \leq K)$$
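The filtered variants of these metrics can be sketched as follows. This is a minimal illustration, assuming each query carries a dictionary of model scores over entities, a gold entity, and the set of other known true answers to exclude before ranking (the standard filtered-ranking protocol); the data layout is an assumption, not FactNet-Bench's actual format.

```python
# Sketch of filtered MRR and Hits@K over a list of queries. Each query is a
# dict with "scores" (entity -> score), "gold" (gold entity id), and
# "known_true" (other true answers, removed before ranking).

def filtered_rank(scores, gold, known_true):
    """1-based rank of `gold` after filtering out competing true answers."""
    gold_score = scores[gold]
    better = sum(
        1 for ent, s in scores.items()
        if ent != gold and ent not in known_true and s > gold_score
    )
    return better + 1

def mrr_and_hits(queries, k=10):
    """Return (filtered MRR, filtered Hits@k) averaged over all queries."""
    ranks = [filtered_rank(q["scores"], q["gold"], q["known_true"])
             for q in queries]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits
```

Filtering matters because an incomplete triple often has several valid completions; ranking the gold answer below another true answer should not be penalized.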

3.2 Multilingual Knowledge-Based Question Answering (FactNet-MKQA)

FactNet-MKQA maps a natural-language question $q_\ell$ in language $\ell$ to an executable logical form $z$ over FactNet identifiers, conforming to the "hop1", "hop2", or "hop2c" grammar:

  • hop1: (hop1 SUBJ PID)
  • hop2: (hop2 SUBJ PID PID)
  • hop2c: (hop2c SUBJ PID PID CONSTRAINT)

The output is $A = \varphi(z)$, the answer set obtained by executing $z$ on the frozen KG. Metrics are macro F1 (after answer normalization) and Valid% (the fraction of outputs that parse and execute under the grammar).
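Execution of these logical forms amounts to iterated edge traversal from the subject. A toy sketch, assuming s-expression strings over a triple set (the identifiers below follow Wikidata style but are illustrative, and hop2c constraint filtering is omitted):

```python
# Toy sketch of executing hop1/hop2 logical forms against a frozen KG,
# represented as a set of (subject, predicate, object) triples.
# hop2c constraint handling is omitted in this sketch.

def execute(form, kg):
    """Return the answer set for an s-expression like "(hop1 Q1 P1)"."""
    tokens = form.strip("()").split()
    op, subj, pids = tokens[0], tokens[1], tokens[2:]
    arity = {"hop1": 1, "hop2": 2}
    if op not in arity or len(pids) != arity[op]:
        raise ValueError(f"invalid form: {form}")
    frontier = {subj}
    for pid in pids:  # one traversal step per predicate id
        frontier = {o for (s, p, o) in kg if s in frontier and p == pid}
    return frontier
```

The arity check is what a grammar-guided decoder enforces at generation time: ill-formed outputs are rejected rather than scored zero at execution.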

3.3 Multilingual Closed-Context Fact Checking (FactNet-MFC)

Given a claim $c_\ell$, models classify it as Supported, Refuted, or NEI, and retrieve supporting FactSense evidence units, optionally with token spans. Retrieval is performed over the top-K evidence units. Metrics: label accuracy, macro F1, Evidence Recall@K (at least one correct evidence unit in the top K), and Span Evidence F1 (token-level).
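Label accuracy and Evidence Recall@K can be sketched as follows, assuming each example carries a gold label, a set of gold evidence ids, and a system output with a predicted label and a ranked evidence list (the field names are illustrative assumptions):

```python
# Sketch of MFC label accuracy and Evidence Recall@K. Recall@K counts an
# example as a hit if at least one gold evidence unit appears in the top K
# retrieved units.

def mfc_metrics(examples, k=5):
    correct = hits = 0
    for ex in examples:
        correct += ex["pred_label"] == ex["gold_label"]
        topk = ex["ranked_evidence"][:k]
        hits += any(e in ex["gold_evidence"] for e in topk)
    n = len(examples)
    return {"accuracy": correct / n, f"recall@{k}": hits / n}
```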

4. Evaluation Protocol and Leakage Controls

Global dataset splits are generated by stable hashing of synset_id: $h = \mathrm{u32}(\mathrm{SHA1}(\mathrm{build\_id} \parallel \mathrm{synset\_id})[0{:}4]) \bmod 100$, with Train mapped to $h < 80$, Dev to $80 \leq h < 90$, and Test to $90 \leq h < 100$. Critical leakage controls are implemented:

  • Training text drawn strictly from Train-aligned FactSenses.
  • Predicate masking in text-aware KGC, obscuring property values in entity descriptions to block trivial extraction.
  • RelationEdge restriction: only edges between Train synsets in KGC Graph Neural Networks.
  • Distinct “Train-only”/“Full” retrieval indexes for MFC.
  • Uniform negative sampling: 256 negatives for KGE, 128 for GNN baselines.
  • Random seeds fixed at {13,21,42}; mean±std reported for learned models.
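The stable-hash split assignment above can be sketched as follows. Plain string concatenation for ∥ and big-endian byte order for the u32 are assumptions; only the SHA-1-of-first-4-bytes-mod-100 scheme and the 80/10/10 bucket boundaries come from the protocol.

```python
# Sketch of the deterministic split assignment: SHA-1 of build_id ∥ synset_id,
# first 4 bytes interpreted as a u32, taken mod 100, then bucketed
# Train < 80, Dev 80-89, Test 90-99. Byte order is an assumption.

import hashlib

def split_of(build_id: str, synset_id: str) -> str:
    digest = hashlib.sha1((build_id + synset_id).encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big") % 100
    if h < 80:
        return "train"
    if h < 90:
        return "dev"
    return "test"
```

Because the bucket depends only on the identifiers, any party holding the same build manifest reconstructs byte-identical splits with no stored split files.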

5. Baseline Models and Benchmark Results

5.1 Knowledge Graph Completion

Structural baselines: TransE, RotatE, CompGCN. Text-aware baselines: SimKGC (contrastive PLM) and KG-S2S (seq2seq PLM). Predicate masking proves crucial: without it, KG-S2S MRR inflates from 0.298 to 0.351, confirming that unmasked entity descriptions permit trivial answer extraction and that masking suppresses this leakage.

| Model | MRR | Hits@10 |
|---|---|---|
| TransE | 0.243 | 0.410 |
| RotatE | 0.261 | 0.453 |
| CompGCN | 0.284 | 0.478 |
| SimKGC | 0.293 | 0.486 |
| KG-S2S | 0.298 | 0.492 |

5.2 Multilingual Question Answering

mT5 baseline: standard and grammar-guided decoding. LLMs: Qwen-2.5-72B, LLaMA-3.3-70B (5-shot, grammar-constrained).

| Model | Macro F1 | Valid% |
|---|---|---|
| mT5 (no grammar) | 30.9 | 88.5 |
| mT5 (grammar-guided) | 34.1 | 95.2 |
| Qwen-2.5-72B (5-shot) | 41.4 | 93.8 |
| LLaMA-3.3-70B (5-shot) | 36.7 | 92.5 |

5.3 Fact Checking

Baselines: hypothesis-only classifier; retrieval + XLM-R NLI verifier (BM25, E5-large dense, translation-assisted retrieval). Top-5 retrieval aggregation improves accuracy and evidence F1.

| System | Acc | Macro F1 | R@5 | Span F1 |
|---|---|---|---|---|
| Hypothesis-only | 0.381 | 0.375 | — | — |
| BM25 + XLM-R | 0.654 | 0.641 | 0.76 | 0.41 |
| E5-large + XLM-R | 0.701 | 0.692 | 0.83 | 0.49 |
| E5-large + XLM-R (Top-5) | 0.731 | 0.724 | 0.91 | 0.54 |
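The top-K aggregation step can be sketched as follows. This is an illustration only: running the NLI verifier on each of the top-K retrieved evidence units and max-pooling per-class probabilities is an assumed aggregation rule, not necessarily the paper's exact scheme.

```python
# Sketch of top-K evidence aggregation for claim verification: the verifier
# scores each retrieved evidence unit separately, and per-class probabilities
# are max-pooled across units. Max-pooling is an assumption.

def aggregate_verdict(per_evidence_probs,
                      labels=("Supported", "Refuted", "NEI")):
    """per_evidence_probs: list of dicts mapping label -> probability,
    one dict per retrieved evidence unit (already truncated to top K)."""
    if not per_evidence_probs:
        return "NEI"  # no evidence retrieved -> not enough info
    pooled = {lab: max(p[lab] for p in per_evidence_probs) for lab in labels}
    return max(pooled, key=pooled.get)
```

Aggregating over several units lets one strongly supporting sentence outvote uninformative neighbors, which is consistent with the accuracy and evidence-F1 gains reported for the Top-5 configuration.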

6. Analysis and Insights

Structural KGC baselines follow the expected performance hierarchy (TransE < RotatE < CompGCN), with text-aware models adding 0.01–0.02 MRR. GNNs gain further from train-only RelationEdges. Predicate masking is essential for suppressing trivial solutions in text-rich settings. In MKQA, grammar-guided decoding yields substantial gains in validity (+6.7 points) and macro F1 (+3.2 points); the highest semantic accuracy comes from LLM few-shot prompting (Qwen-2.5-72B, 41.4 macro F1), while grammar-constrained mT5 achieves the best Valid%. Performance drops in low-resource languages; analysis identifies template mapping and alias gaps as the main bottlenecks outside the high-resource tiers.

In MFC, evidence retrieval dominates verifier performance: dense retrieval (E5) exceeds BM25 by 0.05–0.07 in R@5, and top-5 aggregation further boosts accuracy and F1. The weak hypothesis-only baseline (0.381 accuracy) indicates minimal annotation artifacts. While FactNet covers 316 languages, the top 5 account for roughly 63% of senses; long-tail languages nonetheless retain high grounding precision (Tier 3: 0.885). The primary bottlenecks are template extraction and alias resolution in non-majority languages.

7. Access, Environment, and Reproducibility

FactNet-Bench is available under permissive (CC0, Evidence-Text Pack CC BY-SA) licenses at HuggingFace (https://hf.co/collections/openbmb/factnet). The deterministic construction pipeline is hosted at https://github.com/yl-shen/factnet, supporting full dataset and split reconstruction. Data formats include JSONL and Parquet shards with explicit schema versioning; loading is supported by index scripts.

Dependencies: Python ≥3.8, Apache Parquet (pyarrow), mwparserfromhell, Stanza (with pinned checksums), and PyTorch/Transformers for the baselines. Container manifests and Conda environment files with fully pinned versions ensure execution reproducibility.

Best practices: employ released build manifests and language pack hashes, freeze random seeds (13, 21, 42), enforce split-aware filtering and predicate masking, and validate evidence pointers using the provided utility. Following the protocol enables full, byte-exact reproduction of dataset, splits, and baseline metrics for all FactNet-Bench tasks (Shen et al., 3 Feb 2026).
