- The paper introduces KG20C and KG20C-QA, benchmarks for evaluating link prediction and question answering on scholarly knowledge graphs.
- It details rigorous data cleaning, defined train/validation/test splits, and dual formats (graph and natural language) to ensure high-quality evaluations.
- Baseline experiments using multi-relational embedding models reveal varying performance across relation types, highlighting challenges in scholarly reasoning.
Scholarly Knowledge Graph Benchmarks: KG20C & KG20C-QA
Introduction
The paper introduces KG20C and KG20C-QA, two standardized benchmarks for evaluating link prediction and question answering (QA) on scholarly knowledge graphs. Constructed from the Microsoft Academic Graph (MAG), KG20C emphasizes high-quality coverage of scholarly entities and relations, rigorous data cleaning, and meticulously defined training, validation, and test splits. KG20C-QA builds upon KG20C, transforming graph triples into both entity-relation and natural language QA formats, covering all relation types via carefully designed templates. The benchmarks address shortcomings of previous datasets, such as noise and uncurated splits, establishing a reproducible and extensible foundation for both knowledge graph and LLM research in the scholarly domain.
Dataset Construction and Properties
KG20C Scholarly Knowledge Graph
KG20C originates from MAG, focusing on 20 CORE-ranked conferences in computer science between 1990 and 2010. The construction protocol entails venue selection, filtering papers with substantial citation counts, and thorough pruning to remove incomplete or redundant information. The resulting KG comprises five entity types (Paper, Author, Affiliation, Venue, Domain) and five relation types—author_in_affiliation, author_write_paper, paper_in_domain, paper_cite_paper, paper_in_venue—each intrinsic and non-redundant.
KG20C enforces strict standards for benchmark datasets, including:
- Uniform triple splitting into train/validation/test sets, avoiding entity leakage.
- No duplicate or inverse-relation leakage, matching best practices in benchmarks like WN18RR and FB15k-237.
- Simple TSV file organization compatible with major KG embedding libraries.
KG20C contains 16,362 entities and 48,213 training triples, with validation and test sets each approaching 3,700 triples, aligning its scale and rigor with established benchmarks.
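The TSV organization and leakage-free splitting described above can be sketched in a few lines. The file layout (one tab-separated head/relation/tail triple per line) matches the paper's description; the sample entities and the inverse-leakage check below are illustrative assumptions, not taken from the actual dataset files.

```python
import os
import tempfile

def load_triples(path):
    """Read (head, relation, tail) triples from a tab-separated file."""
    triples = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                head, relation, tail = line.split("\t")
                triples.append((head, relation, tail))
    return triples

# Round-trip a tiny TSV sample (layout assumed: one triple per line).
sample = "A1\tauthor_write_paper\tP1\nP1\tpaper_in_venue\tV1\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write(sample)
    tmp_path = f.name
train = set(load_triples(tmp_path))
os.unlink(tmp_path)

# Leakage check in the spirit of the benchmark's splitting standards:
# no test triple (or its trivial inverse) may also appear in training.
test = [("A2", "author_write_paper", "P2")]
leaked = [t for t in test if t in train or (t[2], t[1], t[0]) in train]
```

The same loader works for any of the split files, which is what makes the format directly compatible with common KG embedding toolkits.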
KG20C-QA Question Answering Benchmark
KG20C-QA reinterprets every triple in KG20C as two question-answer pairs (forward/reverse queries), generating both entity-relation incomplete triples and natural language variants. Question templates are provided for each relation type and direction, e.g., "What papers did this author write?" or "Who wrote this paper?" Entity names are injected into templates to yield readable questions.
KG20C-QA is distributed in graph (TSV) and text (natural language QA) formats, facilitating research across graph-based and language-based QA communities. The benchmark contains 96,426 QA pairs for training and balanced validation and test sets (7,340 and 7,448, respectively).
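The forward/reverse template mechanism can be illustrated as follows. The exact template wording and the sample entity names below are hypothetical; the paper defines one template per relation type and direction, of which only a subset is shown.

```python
# Hypothetical templates keyed by (relation, queried slot). The real
# benchmark covers all five relation types in both directions.
TEMPLATES = {
    ("author_write_paper", "tail"): "What papers did {head} write?",
    ("author_write_paper", "head"): "Who wrote the paper '{tail}'?",
    ("paper_in_venue", "tail"): "Which venue published the paper '{head}'?",
    ("paper_in_venue", "head"): "What papers were published in {tail}?",
}

def triple_to_qa(head, relation, tail):
    """Turn one triple into its forward and reverse QA pairs."""
    forward = (TEMPLATES[(relation, "tail")].format(head=head), tail)
    reverse = (TEMPLATES[(relation, "head")].format(tail=tail), head)
    return [forward, reverse]

# One triple yields two QA pairs, doubling the triple count into QA pairs.
qa = triple_to_qa("Alice Smith", "author_write_paper", "Deep Graph Models")
```

Injecting entity names into the templates is what produces the readable natural language variant alongside the entity-relation incomplete-triple format.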
Baseline Benchmarking and Evaluation
To demonstrate utility and establish baseline performance, the authors benchmark KG20C and KG20C-QA with several baselines: a random predictor, Word2vec embeddings, and the multi-relational embedding models CPh and MEI.
Training protocols are consistent with canonical practices: Adam optimizer, softmax cross-entropy loss, negative sampling, and hyperparameter tuning by random search. Evaluations utilize Mean Reciprocal Rank (MRR) and Hits@k metrics, both standard and type-filtered.
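The two evaluation metrics reduce to simple aggregations over the 1-based ranks that a model assigns to the correct answers; a minimal sketch (the example ranks are invented for illustration):

```python
def mrr(ranks):
    """Mean Reciprocal Rank over 1-based ranks of the correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Four hypothetical queries whose gold answers ranked 1st, 2nd, 10th, 50th.
ranks = [1, 2, 10, 50]
score_mrr = mrr(ranks)        # (1 + 0.5 + 0.1 + 0.02) / 4 = 0.405
score_h10 = hits_at_k(ranks, 10)  # 3 of 4 ranks are <= 10, so 0.75
```

MRR rewards placing the answer near the very top, while Hits@k only asks whether it lands anywhere in the top k, which is why the two metrics are typically reported together.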
Link Prediction Results
The results on KG20C demonstrate the superiority of multi-relational embeddings:
- Random: Negligible MRR (<0.001).
- Word2vec: MRR $0.068$.
- CPh: MRR $0.215$.
- MEI: Highest MRR $0.230$ and Hits@10 $0.368$.
These metrics underscore the benchmark’s difficulty and the expressiveness required of multi-relational approaches.
Question Answering Results
On KG20C-QA (type-filtered QA evaluation with MEI), performance varies substantially by relation type:
- Author-affiliation and paper-conference queries reach high accuracy (e.g., "Which conferences may this paper publish in?": MRR $0.693$, Hits@10 $0.976$).
- Domain and citation-related queries are challenging ("Which papers may belong to this domain?": MRR $0.052$, Hits@10 $0.100$; "Which papers may cite this paper?": MRR $0.116$, Hits@10 $0.290$).
These detailed results highlight structural differences in query complexity and signal directions for further modeling improvements.
Implications and Future Directions
The formal release of KG20C and KG20C-QA fills a critical void in scholarly graph benchmarks, enabling reproducible, standardized evaluation for link prediction, QA, and reasoning tasks. Their meticulous construction and extensible format position them as superior alternatives to prior scholarly QA datasets, which often struggle with noise, lack rigor, or provide insufficient diversity in queries and entity types.
Practically, these resources catalyze cross-disciplinary research between knowledge graph and NLP communities by providing dual-format QA settings and leveraging curated scholarly metadata. Theoretically, the far-from-saturated baseline results, especially on multi-relational and domain/citation queries, identify unsolved challenges in embedding methods, multi-hop reasoning, and integrating structured graph information with LLMs.
Prospective developments include:
- Augmentation to support multi-hop QA and complex reasoning.
- Expansion to broader scientific domains and venues.
- Systematic evaluation of LLMs on the natural language QA subset.
- Application for benchmarking hybrid systems that combine graph-based reasoning and text-based generation.
Conclusion
KG20C and KG20C-QA establish curated, reproducible benchmarks for reasoning over scholarly knowledge graphs, supplying both link prediction and one-hop QA with rigorous standards, standardized splits, and compatibility with graph and text models. Baseline experiments confirm their non-trivial difficulty and value in revealing weaknesses of current methods, especially in complex relation types. These benchmarks provide an extensible foundation for future work on multi-hop reasoning, integrative modeling, and robust evaluation of LLMs and embedding techniques in scientific and scholarly contexts (2512.21799).