- The paper introduces KG20C and KG20C-QA, benchmarks for evaluating link prediction and question answering on scholarly knowledge graphs.
- It details rigorous data cleaning, defined train/validation/test splits, and dual formats (graph and natural language) to ensure high-quality evaluations.
- Baseline experiments using multi-relational embedding models reveal varying performance across relation types, highlighting challenges in scholarly reasoning.
Scholarly Knowledge Graph Benchmarks: KG20C & KG20C-QA
Introduction
The paper introduces KG20C and KG20C-QA, two standardized benchmarks for evaluating link prediction and question answering (QA) on scholarly knowledge graphs. Constructed from the Microsoft Academic Graph (MAG), KG20C emphasizes high-quality coverage of scholarly entities and relations, rigorous data cleaning, and meticulously defined training, validation, and test splits. KG20C-QA builds upon KG20C, transforming graph triples into both entity-relation and natural language QA formats, covering all relation types via carefully designed templates. The benchmarks address shortcomings of previous datasets, such as noise and uncurated splits, establishing a reproducible and extensible foundation for both knowledge graph and LLM research in the scholarly domain.
Dataset Construction and Properties
KG20C Scholarly Knowledge Graph
KG20C originates from MAG, focusing on 20 CORE-ranked conferences in computer science between 1990 and 2010. The construction protocol entails venue selection, filtering papers with substantial citation counts, and thorough pruning to remove incomplete or redundant information. The resulting KG comprises five entity types (Paper, Author, Affiliation, Venue, Domain) and five relation types—author_in_affiliation, author_write_paper, paper_in_domain, paper_cite_paper, paper_in_venue—each intrinsic and non-redundant.
KG20C enforces strict standards for benchmark datasets, including:
- Uniform triple splitting into train/validation/test sets, avoiding entity leakage.
- No duplicate or inverse-relation leakage, matching best practices in benchmarks like WN18RR and FB15k-237.
- Simple TSV file organization compatible with major KG embedding libraries.
KG20C contains 16,362 entities and 48,213 training triples, with validation and test sets each approaching 3,700 triples, aligning its scale and rigor with established benchmarks.
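The TSV organization and leakage-free splitting described above can be sketched in a few lines. The file layout (one tab-separated head/relation/tail triple per line) matches the paper's description; the sample entities and the inverse-leakage check below are illustrative assumptions, not taken from the actual dataset files.

```python
import os
import tempfile

def load_triples(path):
    """Read (head, relation, tail) triples from a tab-separated file."""
    triples = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                head, relation, tail = line.split("\t")
                triples.append((head, relation, tail))
    return triples

# Round-trip a tiny TSV sample (layout assumed: one triple per line).
sample = "A1\tauthor_write_paper\tP1\nP1\tpaper_in_venue\tV1\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write(sample)
    tmp_path = f.name
train = set(load_triples(tmp_path))
os.unlink(tmp_path)

# Leakage check in the spirit of the benchmark's splitting standards:
# no test triple (or its trivial inverse) may also appear in training.
test = [("A2", "author_write_paper", "P2")]
leaked = [t for t in test if t in train or (t[2], t[1], t[0]) in train]
```

The same loader works for any of the split files, which is what makes the format directly compatible with common KG embedding toolkits.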
KG20C-QA Question Answering Benchmark
KG20C-QA reinterprets every triple in KG20C as two question-answer pairs (forward/reverse queries), generating both entity-relation incomplete triples and natural language variants. Question templates are provided for each relation type and direction, e.g., "What papers did this author write?" or "Who wrote this paper?" Entity names are injected into templates to yield readable questions.
KG20C-QA is distributed in graph (TSV) and text (natural language QA) formats, facilitating research across graph-based and language-based QA communities. The benchmark contains 96,426 QA pairs for training and balanced validation and test sets (7,340 and 7,448, respectively).
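The forward/reverse template mechanism can be illustrated as follows. The exact template wording and the sample entity names below are hypothetical; the paper defines one template per relation type and direction, of which only a subset is shown.

```python
# Hypothetical templates keyed by (relation, queried slot). The real
# benchmark covers all five relation types in both directions.
TEMPLATES = {
    ("author_write_paper", "tail"): "What papers did {head} write?",
    ("author_write_paper", "head"): "Who wrote the paper '{tail}'?",
    ("paper_in_venue", "tail"): "Which venue published the paper '{head}'?",
    ("paper_in_venue", "head"): "What papers were published in {tail}?",
}

def triple_to_qa(head, relation, tail):
    """Turn one triple into its forward and reverse QA pairs."""
    forward = (TEMPLATES[(relation, "tail")].format(head=head), tail)
    reverse = (TEMPLATES[(relation, "head")].format(tail=tail), head)
    return [forward, reverse]

# One triple yields two QA pairs, doubling the triple count into QA pairs.
qa = triple_to_qa("Alice Smith", "author_write_paper", "Deep Graph Models")
```

Injecting entity names into the templates is what produces the readable natural language variant alongside the entity-relation incomplete-triple format.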
Baseline Benchmarking and Evaluation
To demonstrate utility and establish baseline performance, the authors benchmark KG20C and KG20C-QA with several baselines: a random predictor, Word2vec embeddings, and the multi-relational embedding models CPh and MEI.
Training protocols are consistent with canonical practices: Adam optimizer, softmax cross-entropy loss, negative sampling, and hyperparameter tuning by random search. Evaluations utilize Mean Reciprocal Rank (MRR) and Hits@k metrics, both standard and type-filtered.
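The two evaluation metrics reduce to simple aggregations over the 1-based ranks that a model assigns to the correct answers; a minimal sketch (the example ranks are invented for illustration):

```python
def mrr(ranks):
    """Mean Reciprocal Rank over 1-based ranks of the correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Four hypothetical queries whose gold answers ranked 1st, 2nd, 10th, 50th.
ranks = [1, 2, 10, 50]
score_mrr = mrr(ranks)        # (1 + 0.5 + 0.1 + 0.02) / 4 = 0.405
score_h10 = hits_at_k(ranks, 10)  # 3 of 4 ranks are <= 10, so 0.75
```

MRR rewards placing the answer near the very top, while Hits@k only asks whether it lands anywhere in the top k, which is why the two metrics are typically reported together.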
Link Prediction Results
The results on KG20C demonstrate the superiority of multi-relational embeddings:
- Random: Negligible MRR (<0.001).
- Word2vec: MRR $0.068$.
- CPh: MRR $0.215$.
- MEI: Highest MRR $0.230$ and Hits@10 $0.368$.
These metrics underscore the benchmark’s difficulty and the expressiveness required of multi-relational approaches.
Question Answering Results
On KG20C-QA (type-filtered QA evaluation with MEI), performance varies substantially by relation type:
- Author-affiliation and paper-conference queries reach high accuracy (e.g., "Which conferences may this paper publish in?": MRR $0.693$, Hits@10 $0.976$).
- Domain and citation-related queries are challenging ("Which papers may belong to this domain?": MRR $0.052$, Hits@10 $0.100$; "Which papers may cite this paper?": MRR $0.116$, Hits@10 $0.290$).
These detailed results highlight structural differences in query complexity and signal directions for further modeling improvements.
Implications and Future Directions
The formal release of KG20C and KG20C-QA fills a critical void in scholarly graph benchmarks, enabling reproducible, standardized evaluation for link prediction, QA, and reasoning tasks. Their meticulous construction and extensible format position them as superior alternatives to prior scholarly QA datasets, which often struggle with noise, lack rigor, or provide insufficient diversity in queries and entity types.
Practically, these resources catalyze cross-disciplinary research between knowledge graph and NLP communities by providing dual-format QA settings and leveraging curated scholarly metadata. Theoretically, the far-from-saturated baseline results, especially on multi-relational and domain/citation queries, identify unsolved challenges in embedding methods, multi-hop reasoning, and integrating structured graph information with LLMs.
Prospective developments include:
- Augmentation to support multi-hop QA and complex reasoning.
- Expansion to broader scientific domains and venues.
- Systematic evaluation of LLMs on the natural language QA subset.
- Application for benchmarking hybrid systems that combine graph-based reasoning and text-based generation.
Conclusion
KG20C and KG20C-QA establish curated, reproducible benchmarks for reasoning over scholarly knowledge graphs, supplying both link prediction and one-hop QA with rigorous standards, standardized splits, and compatibility with graph and text models. Baseline experiments confirm their non-trivial difficulty and value in revealing weaknesses of current methods, especially in complex relation types. These benchmarks provide an extensible foundation for future work on multi-hop reasoning, integrative modeling, and robust evaluation of LLMs and embedding techniques in scientific and scholarly contexts (2512.21799).