AceKG: A Large-scale Knowledge Graph for Academic Data Mining (1807.08484v2)

Published 23 Jul 2018 in cs.IR, cs.AI, and cs.CL

Abstract: Most existing knowledge graphs (KGs) in academic domains suffer from problems of insufficient multi-relational information, name ambiguity and improper data format for large-scale machine processing. In this paper, we present AceKG, a new large-scale KG in academic domain. AceKG not only provides clean academic information, but also offers a large-scale benchmark dataset for researchers to conduct challenging data mining projects including link prediction, community detection and scholar classification. Specifically, AceKG describes 3.13 billion triples of academic facts based on a consistent ontology, including necessary properties of papers, authors, fields of study, venues and institutes, as well as the relations among them. To enrich the proposed knowledge graph, we also perform entity alignment with existing databases and rule-based inference. Based on AceKG, we conduct experiments of three typical academic data mining tasks and evaluate several state-of-the-art knowledge embedding and network representation learning approaches on the benchmark datasets built from AceKG. Finally, we discuss several promising research directions that benefit from AceKG.

Citations (100)

Summary

The paper presents AceKG as a comprehensive academic knowledge graph consolidating diverse entities and relationships for reliable data mining.
It employs rule-based inference and cross-database entity alignment to improve data quality and connectivity while mitigating name ambiguity.
Experimental results show that translational and compositional embedding models can be benchmarked on AceKG for link prediction, while network representation learning methods support scholar classification and clustering.

This paper introduces AceKG, a large-scale knowledge graph (KG) specifically designed for the academic domain. It addresses limitations in existing academic datasets, such as insufficient multi-relational information, name ambiguity, and formats unsuitable for large-scale machine processing. AceKG aims to provide a clean, heterogeneous academic KG and serve as a benchmark dataset for various data mining tasks.

Core Contribution: AceKG is presented as a solution to the lack of comprehensive, structured, and machine-readable academic knowledge graphs. It consolidates information about academic entities and their relationships into a single, large-scale KG, making it readily usable for downstream AI and data mining tasks.

AceKG Structure and Content:

AceKG models academic information using a defined ontology with 5 main entity classes: Papers, Authors, Fields of study, Venues (Journals and Conferences), and Institutes. Facts are represented as subject-predicate-object triples. Each entity is assigned a unique URI to mitigate the synonymy and ambiguity issues prevalent in academic data. The graph contains 3.13 billion triples with over 114 million entities, occupying nearly 100GB of disk space. The data is released in Turtle format, a standard RDF serialization that can be loaded into tools like Apache Jena and queried with SPARQL. The ontology defines key relationships like paper_is_written_by, paper_is_in_venue, author_is_affiliated_to, field_is_part_of, etc. (Figure 1 shows the detailed schema).
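
As a rough illustration of the subject-predicate-object model described above, the following minimal sketch stores a few triples as Python tuples and answers a simple pattern query, similar in spirit to a single SPARQL triple pattern. The identifiers (`paper:P1`, `author:A1`, and so on) and the `match` helper are hypothetical and are not taken from AceKG; only the predicate names come from the ontology listed above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers for illustration; real AceKG URIs are not reproduced here.
TRIPLES: List[Triple] = [
    ("paper:P1", "paper_is_written_by", "author:A1"),
    ("paper:P1", "paper_is_in_venue", "venue:V1"),
    ("paper:P2", "paper_is_written_by", "author:A1"),
    ("author:A1", "author_is_affiliated_to", "institute:I1"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# Which papers were written by author:A1?
papers_by_a1 = sorted(t[0] for t in match(TRIPLES, p="paper_is_written_by", o="author:A1"))
print(papers_by_a1)
```

A linear scan like this only works for toy data; at the scale of AceKG a triple store with indexes would be needed.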

Data Acquisition and Enrichment:

The data for AceKG is collected from Acemap. To enhance its completeness and connectivity, AceKG undergoes two key enrichment processes:

Entity Alignment: Papers within computer science in AceKG are mapped to corresponding entities in external databases like IEEE, ACM, and DBLP. This allows researchers to integrate data from multiple sources and link AceKG with the broader Linked Open Data cloud. Table 2 provides statistics on the number of mapped entities.
Rule-Based Inference: Simple inference rules are applied to deduce new relationships from existing ones, further enriching the graph and providing more comprehensive ground truth for tasks. An example rule shown in Figure 2 infers that if a paper is written by an author and the author is affiliated with an institute, then the paper is related to that institute.
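
The example rule can be sketched as a join over two predicates. This is only an illustration under stated assumptions: the inferred predicate name `paper_is_related_to_institute`, the function name, and the sample identifiers are hypothetical, and the paper's actual inference machinery may differ.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def infer_paper_institute(triples: Iterable[Triple]) -> Set[Triple]:
    """Apply: (paper written_by author) AND (author affiliated_to institute)
    => (paper related_to institute). Returns only triples not already present."""
    facts = set(triples)

    # Index author -> institutes so the join is a dictionary lookup.
    affiliations = defaultdict(set)
    for s, p, o in facts:
        if p == "author_is_affiliated_to":
            affiliations[s].add(o)

    inferred = set()
    for s, p, o in facts:
        if p == "paper_is_written_by":
            for institute in affiliations.get(o, ()):
                # Hypothetical predicate name for the inferred relation.
                inferred.add((s, "paper_is_related_to_institute", institute))
    return inferred - facts


facts = [
    ("paper:P1", "paper_is_written_by", "author:A1"),
    ("paper:P2", "paper_is_written_by", "author:A2"),
    ("author:A1", "author_is_affiliated_to", "institute:I1"),
]
new_facts = infer_paper_institute(facts)
print(sorted(new_facts))
```

Here `paper:P2` yields no new fact because `author:A2` has no known affiliation, which shows that the rule only adds what the existing triples support.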

Practical Applications and Demonstrations:

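The paper demonstrates the practical utility of AceKG by evaluating state-of-the-art methods in two experimental settings that together cover three academic data mining tasks (link prediction, scholar classification, and scholar clustering):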

Knowledge Embedding (Link Prediction):
- Task: Predict a missing entity in a triple $(h, r, t)$ .
- Dataset: A subset called AK18K is extracted from AceKG, containing 18,464 entities, 7 relation types, and over 140,000 triples (Table 1). AK18K is characterized by simple relation structure but a large number of potential entities per relation, making it different from standard benchmarks like FB15K and WN18.
- Methods Evaluated: Translational models (TransE, TransH) and compositional models (DistMult, ComplEx, HolE) using the OpenKE framework.
- Implementation Notes: The experiments highlight how different KG embedding models perform on a dataset with a relatively small number of relation types but potentially high branching factors for certain predicates (e.g., a field having many papers). Results (Table 4) show that while compositional models like HolE and ComplEx perform well on metrics like MRR and Hits@1, indicating their strength in modeling antisymmetric relations, even simpler translational models like TransE achieve competitive Hits@10 scores. The performance differences compared to standard benchmarks suggest that the specific structure and properties of academic data in AceKG pose unique challenges.
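
To make the link prediction metrics above concrete, here is a minimal sketch of TransE-style scoring together with MRR and Hits@k. It is not the OpenKE implementation: the 2-dimensional vectors, entity names, and helper functions are made up for illustration, and ranking is done in the simple unfiltered setting, where every other entity counts as a competitor.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def transe_score(h: Vector, r: Vector, t: Vector) -> float:
    """TransE plausibility: negative L2 distance ||h + r - t|| (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_of_true_tail(h: Vector, r: Vector, true_tail: str, entities: Dict[str, Vector]) -> int:
    """Rank of the correct tail among all candidate entities (1 = best)."""
    true_score = transe_score(h, r, entities[true_tail])
    better = sum(
        1
        for name, vec in entities.items()
        if name != true_tail and transe_score(h, r, vec) > true_score
    )
    return 1 + better


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy embeddings (hypothetical values, not learned from AceKG).
entities = {"A1": (1.0, 0.0), "A2": (0.0, 1.0), "A3": (3.0, 3.0)}
head = (0.0, 0.0)      # embedding of some paper
relation = (1.0, 0.0)  # embedding of paper_is_written_by

ranks = [
    rank_of_true_tail(head, relation, "A1", entities),  # exact translation, rank 1
    rank_of_true_tail(head, relation, "A2", entities),  # A1 is closer, rank 2
]
print(ranks, mrr(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 10))
```

The toy case shows why Hits@10 can look strong while MRR and Hits@1 lag: a model that places the right answer near the top but not first still earns full credit under Hits@10.
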
Network Representation Learning (Scholar Classification and Clustering):
- Task: Learn low-dimensional vector representations for entities (specifically authors/scholars) in the academic network to support downstream tasks like classifying authors into research fields or clustering them based on their collaborations and publications.
- Dataset: Seven heterogeneous academic networks are constructed (Table 3): five single-field networks based on fields of study (Biology, CS, Economics, Medicine, Physics), one combined 5-Fields network, and one network built from papers in top Google Scholar venues. These networks include Papers, Authors, and Venues as nodes, with various types of edges (e.g., paper-author, paper-venue). Scholar labels (fields of study) are derived from the fields of their publications.
- Methods Evaluated: DeepWalk, LINE, PTE, and metapath2vec.
- Implementation Notes: This section demonstrates how standard network embedding techniques can be applied to the heterogeneous graph structure of AceKG. The results (Table 5 for classification, Table 6 for clustering) show that metapath2vec, designed for heterogeneous networks, generally outperforms methods designed primarily for homogeneous graphs. However, even homogeneous methods like DeepWalk and LINE achieve reasonable performance, suggesting they can still capture useful information in a heterogeneous network with limited node/edge types. The performance difference between FOS-labeled datasets and the Google-labeled dataset highlights the impact of data distribution and the degree of interdisciplinarity on task performance.
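
The difference between heterogeneous and homogeneous methods can be illustrated with the walk-generation step that metapath2vec relies on: random walks constrained to follow a sequence of node types, which are later fed to a skip-gram model. The sketch below covers only the walk step, on a hypothetical toy network with an Author-Paper-Venue-Paper-Author ("APVPA") metapath; the graph, node names, and function are assumptions for illustration, not the paper's setup.

```python
import random
from typing import Dict, List


def metapath_walk(
    adj: Dict[str, List[str]],
    node_type: Dict[str, str],
    start: str,
    metapath: str,
    length: int,
    rng: random.Random,
) -> List[str]:
    """Random walk whose node types follow the metapath, repeated cyclically."""
    if metapath[0] != metapath[-1]:
        raise ValueError("metapath must start and end with the same node type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first metapath type")

    pattern = metapath[:-1]  # "APVPA" repeats as A, P, V, P, A, P, V, P, ...
    walk = [start]
    while len(walk) < length:
        wanted = pattern[len(walk) % len(pattern)]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
    return walk


# Toy heterogeneous network (hypothetical): two authors, two papers, one venue.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
adj = {
    "a1": ["p1"],
    "a2": ["p2"],
    "p1": ["a1", "v1"],
    "p2": ["a2", "v1"],
    "v1": ["p1", "p2"],
}

walk = metapath_walk(adj, node_type, "a1", "APVPA", length=9, rng=random.Random(0))
print(walk)
```

A homogeneous method such as DeepWalk would instead pick any neighbour at each step, ignoring node types. With only three node types, the two kinds of walk often visit similar contexts, which is consistent with the reasonable DeepWalk and LINE results noted above.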

Future Directions:

The paper suggests several promising research directions that can leverage AceKG:

Cooperation Prediction: Using KG embeddings or NRL results to capture richer features for predicting future collaborations between researchers.
Author Disambiguation: Utilizing the network structure and node attributes in AceKG to improve the accuracy of distinguishing authors with similar names. This could also feed back into improving the quality of AceKG itself.
Finding Rising Stars: Applying clustering or classification algorithms on learned embeddings from AceKG to identify promising young researchers based on comprehensive academic features.

Practical Considerations:

AceKG is a large dataset (3.13 billion triples, nearly 100GB on disk). Working with it requires infrastructure capable of handling this scale, such as graph databases or distributed processing frameworks. The release in Turtle format facilitates parsing and integration with existing KG tools. The entity alignment provides valuable links to other academic databases, enabling data fusion. The paper also notes the ongoing maintenance and updating of AceKG.
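
At this scale, one practical pattern is to stream the data rather than load it into memory. The sketch below counts triples per predicate in a single pass. It assumes, purely for illustration, a simplified layout with one whitespace-separated triple per line terminated by ` .`; real Turtle allows multi-line statements, abbreviations, and literals containing spaces, so a proper RDF parser would be needed for the actual files. The `ace:` prefix and sample lines are hypothetical.

```python
import io
from collections import Counter
from typing import Iterable


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count triples per predicate in one pass with constant memory per line.

    Assumes one simple 'subject predicate object .' statement per line,
    which is a simplification of the Turtle syntax.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@prefix"):
            continue  # skip blanks, comments, and prefix declarations
        parts = line.rstrip(" .").split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


# In-memory stand-in for a file handle; a real run would iterate over open(path).
sample = io.StringIO(
    "@prefix ace: <http://example.org/ace/> .\n"
    "ace:P1 ace:paper_is_written_by ace:A1 .\n"
    "ace:P1 ace:paper_is_in_venue ace:V1 .\n"
    "ace:P2 ace:paper_is_written_by ace:A1 .\n"
)
predicate_counts = count_predicates(sample)
print(predicate_counts.most_common())
```

Because the function accepts any iterable of lines, the same code works on a file object, a decompression stream, or a partition of a distributed job.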

In summary, AceKG provides a valuable, large-scale, and structured resource for academic data mining. The paper details its construction, ontology, and demonstrates its practical utility through experiments on link prediction, scholar classification, and clustering, providing insights into the performance of various machine learning models on this specific type of heterogeneous academic data. It also opens up several new research avenues benefiting from the rich connections within the graph.
