AceKG: Academic Knowledge Graph
- AceKG is a large-scale academic knowledge graph featuring 3.13B RDF triples covering papers, authors, fields, venues, and institutes.
- It employs a consistent ontology with over 100 relation types and unique URIs, ensuring precise entity disambiguation and comprehensive multi-relational coverage.
- AceKG serves as a benchmark environment for tasks such as link prediction, community detection, and scholar profiling using advanced embedding and inference methods.
AceKG is a large-scale academic knowledge graph (KG) comprising 3.13 billion RDF triples derived from a consistent ontology designed for high-fidelity academic data mining. Its schema represents essential entities and their relations—spanning papers, authors, fields of study, venues, and institutes—with fine-grained resolution achieved via unique URIs, addressing long-standing issues of name ambiguity and insufficient multi-relational coverage. AceKG serves as both a foundational academic resource and a rigorous benchmark environment for algorithmic evaluation in knowledge representation, link prediction, scholarly community detection, and author profiling (Wang et al., 2018).
1. Ontology: Entity Classes, Relations, and Hierarchies
AceKG instantiates a directed, labeled multi-relational graph $G = (E, R, T)$, with entity set $E$, relation types $R$, and factual triples $T \subseteq E \times R \times E$. The five disjoint top-level entity classes are papers, authors, fields of study, venues, and institutes.
Each entity is assigned a unique URI for unambiguous identification, enabling, for example, distinct representations for similarly-named authors (such as ace:7E7A3A69 vs. ace:7E0D6766 for two different "Jiawei Han" entities).
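The URI scheme can be sketched as a minimal triple store in Python. The two `ace:` URIs below come from the text; the `affiliated_with` relation name and the institute identifiers are illustrative placeholders, not AceKG's actual vocabulary:

```python
# Minimal sketch (not AceKG's actual storage): RDF-style triples keyed by
# unique URIs, so two authors who share the name "Jiawei Han" remain distinct.
triples = [
    ("ace:7E7A3A69", "name", "Jiawei Han"),
    ("ace:7E0D6766", "name", "Jiawei Han"),
    ("ace:7E7A3A69", "affiliated_with", "ace:InstituteA"),  # illustrative
    ("ace:7E0D6766", "affiliated_with", "ace:InstituteB"),  # illustrative
]

def objects(subject, relation):
    """All objects o with (subject, relation, o) in the graph."""
    return [o for s, r, o in triples if s == subject and r == relation]

# Same surface name, different URIs -> distinct affiliations are recoverable.
print(objects("ace:7E7A3A69", "affiliated_with"))  # ['ace:InstituteA']
print(objects("ace:7E0D6766", "affiliated_with"))  # ['ace:InstituteB']
```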
The relation vocabulary is extensive (over 100 types) and includes relations encoding authorship, citation, publication venue, and institutional affiliation, together with a hierarchical relation that defines a DAG over the fields of study, among further relations.
Hierarchical, compositional, and numeric data are attached via taxonomy relations (for the field-of-study hierarchy) and literal-valued predicates (for attributes including citation counts and publication dates).
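Traversing such a field-of-study taxonomy can be sketched as an upward walk over a DAG. The field names and `part_of` edges below are hypothetical, not taken from AceKG:

```python
# Illustrative field-of-study DAG (field names and 'part_of' edges are
# hypothetical); each field maps to its parent fields.
part_of = {
    "Knowledge Graphs": ["Data Mining", "Semantic Web"],
    "Data Mining": ["Computer Science"],
    "Semantic Web": ["Computer Science"],
    "Computer Science": [],
}

def ancestors(field):
    """All fields reachable upward from `field` in the DAG."""
    seen = set()
    stack = list(part_of.get(field, []))
    while stack:
        f = stack.pop()
        if f not in seen:
            seen.add(f)
            stack.extend(part_of.get(f, []))
    return seen

print(sorted(ancestors("Knowledge Graphs")))
# ['Computer Science', 'Data Mining', 'Semantic Web']
```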
2. Scale, Statistics, and Dataset Partitions
AceKG encodes its 3.13 billion triples over five explicit entity classes: papers, authors, fields of study, venues (split into journals and conferences), and institutes.
AceKG-derived benchmark datasets include:
- AK18K for link prediction, comprising seven relation types with train/validation/test splits of 130,265/7,429/7,336 triples,
- Six field-specific heterogeneous collaboration graphs (Biology, Computer Science, Economics, Medicine, Physics, and a union of the five fields),
- A "Google-Scholar-venue" graph of 600K papers, 635K authors, 151 venues, and 2.37M edges.
All primary data originates from the in-house Acemap repository, normalized and deduplicated; disambiguation is ensured via systematic URI assignment. Preprocessing also normalizes literals and dates to facilitate large-scale, machine-compliant operations.
3. Entity Alignment and Rule-based Inference
Entity alignment integrates AceKG with prominent external bibliographic databases (IEEE, ACM, DBLP). Mappings are computed by matching paper titles and author lists, principally using cosine similarity of TF-IDF representations; titles may also be matched by edit distance. At a fixed similarity threshold, the mapping covers 2.33M (IEEE), 1.91M (ACM), and 2.27M (DBLP) papers.
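A minimal, dependency-free sketch of the TF-IDF/cosine matching step follows; the tokenization and IDF smoothing choices here are illustrative, not AceKG's exact pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (smoothed IDF) for whitespace-tokenized titles."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return [
        {t: c * math.log((1 + n) / (1 + df[t])) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

titles = ["AceKG a large scale knowledge graph",
          "A large scale academic knowledge graph",
          "Deep learning for image classification"]
vecs = tfidf_vectors(titles)
# The two KG titles score far higher than the unrelated pair.
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```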
Rule-based inference employs Horn-style rules, implications in which a head triple follows from a conjunction of body triples, to further augment the graph with facts not explicitly stated.
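An illustrative Horn rule of this form (relation names hypothetical, not AceKG's actual rules) infers co-authorship from shared papers via simple forward chaining:

```python
from itertools import combinations

# Illustrative rule: coauthor(x, y) <- author_of(x, p) AND author_of(y, p)
triples = {
    ("alice", "author_of", "paper1"),
    ("bob", "author_of", "paper1"),
    ("carol", "author_of", "paper2"),
}

def infer_coauthors(kb):
    """Forward-chain the co-author rule over an author_of triple set."""
    by_paper = {}
    for s, r, o in kb:
        if r == "author_of":
            by_paper.setdefault(o, set()).add(s)
    inferred = set()
    for authors in by_paper.values():
        for x, y in combinations(sorted(authors), 2):
            inferred.add((x, "coauthor", y))
    return inferred

print(infer_coauthors(triples))  # {('alice', 'coauthor', 'bob')}
```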
AceKG thus provides both explicit and inferred relational knowledge, improving its utility for downstream analytics.
4. Benchmark Tasks: Link Prediction, Community Detection, and Scholar Profiling
AceKG supports systematic evaluation via canonical academic data mining tasks:
4.1 Link Prediction (Knowledge Embedding)
Given a triple set $T$, entities are embedded as vectors $\mathbf{e} \in \mathbb{R}^d$ and relations as vectors or relation-specific operators, with scoring functions $f(h, r, t)$ used to predict missing links. Evaluated methods and representative scoring functions include:
- TransE: $f(h, r, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$,
- TransH: translation after projection onto relation-specific hyperplanes,
- DistMult: $f(h, r, t) = \langle \mathbf{h}, \mathbf{r}, \mathbf{t} \rangle = \sum_i h_i r_i t_i$,
- HolE: $f(h, r, t) = \mathbf{r}^\top (\mathbf{h} \star \mathbf{t})$, where $\star$ denotes circular correlation,
- ComplEx: $f(h, r, t) = \operatorname{Re}(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}} \rangle)$ with complex-valued embeddings.
All models share a common embedding dimension, margin, and learning rate, and are trained with SGD and early stopping on validation MRR; the loss is a margin ranking loss with one corrupted-entity negative per triple.
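The scoring functions and the margin ranking loss can be sketched on toy 3-dimensional embeddings; all vectors below are illustrative, and real training would update them by SGD:

```python
def trans_e(h, r, t):
    """TransE score -||h + r - t||_1 (higher is better)."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def dist_mult(h, r, t):
    """DistMult trilinear product <h, r, t>."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: penalize negatives scored within `margin` of positives."""
    return max(0.0, margin + neg_score - pos_score)

h, r, t = [0.2, 0.1, 0.0], [0.3, -0.1, 0.5], [0.5, 0.0, 0.5]
corrupt_t = [0.6, 0.1, 0.5]   # tail replaced by a nearby incorrect entity
pos = trans_e(h, r, t)        # 0.0: h + r lands exactly on t
neg = trans_e(h, r, corrupt_t)
print(margin_ranking_loss(pos, neg))  # ~0.8: negative still too close
```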
4.2 Community Detection and Scholar Classification
On extracted collaboration graphs, network representation learning is performed via:
- DeepWalk: homogeneous random walks with Skip-Gram,
- LINE: first- and second-order proximity preservation,
- PTE: joint text-network embeddings,
- metapath2vec: heterogeneous walks along user-defined metapaths.
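A DeepWalk-style walk generator can be sketched as follows, on a toy collaboration graph (edges illustrative); a Skip-Gram model would then consume the walks as sentences:

```python
import random

# Toy undirected collaboration graph as an adjacency list (illustrative).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(start, length, rng):
    """Uniform random walk of `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # seeded for reproducibility
walks = [random_walk(node, 5, rng) for node in graph for _ in range(2)]
# Every step of every walk follows a real edge.
assert all(w[i + 1] in graph[w[i]] for w in walks for i in range(len(w) - 1))
```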
Resulting node embeddings feed into logistic regression (5-fold CV) for classification (metrics: Micro-F1, Macro-F1) and k-means clustering (NMI metric) for community detection.
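The NMI metric used for the clustering evaluation can be computed directly from the two label assignments; a minimal implementation (natural-log convention, geometric-mean normalization):

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two clusterings."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(c / n * math.log(n * c / (ct[a] * cp[b]))
             for (a, b), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(h_t * h_p) if h_t and h_p else 1.0

# Identical clusterings (up to label renaming) score 1.0.
print(nmi([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0
```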
5. Experimental Results and Analytical Insights
5.1 Link Prediction Performance on AK18K
Filtered MRR and Hits@$k$ (for $k \in \{1, 10\}$) for several knowledge embedding models:
| Model | MRR | Hits@1 | Hits@10 |
|---|---|---|---|
| TransE | 0.719 | 62.7% | 89.2% |
| TransH | 0.701 | 61.0% | 84.6% |
| DistMult | 0.749 | 68.7% | 86.1% |
| HolE | 0.864 | 83.8% | 88.2% |
| ComplEx | 0.817 | 75.4% | 89.0% |
HolE and ComplEx exhibit superior capability for modeling antisymmetric relationships, which characterize all seven AK18K relation types. TransE outperforms TransH in Hits@10, plausibly due to a high proportion of many-to-many relations (94%) and a limited relation schema, where the benefit from hyperplane projection (TransH) is minimal. AK18K results fall between FB15K and WN18 benchmarks; this is attributed to moderate graph sparsity and the relatively simple relation schema.
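The symmetry argument can be checked numerically: DistMult's trilinear score is invariant under swapping head and tail, while ComplEx's real-part score generally is not (toy embeddings below are arbitrary):

```python
def dist_mult(h, r, t):
    """DistMult: sum_i h_i * r_i * t_i, symmetric in h and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """ComplEx: Re(<h, r, conj(t)>) with complex-valued embeddings."""
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))

h, r, t = [1.0, 2.0], [0.5, -1.0], [3.0, 0.5]
assert dist_mult(h, r, t) == dist_mult(t, r, h)  # symmetric by construction

hc, rc, tc = [1 + 2j, 0.5j], [0.3 + 0.7j, 1 - 1j], [2 - 1j, 1 + 1j]
# Swapping head and tail changes the ComplEx score: antisymmetry is expressible.
assert complex_score(hc, rc, tc) != complex_score(tc, rc, hc)
```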
5.2 Scholar Classification and Clustering
Micro-F1 and Macro-F1 scores for scholar classification (each cell reports Micro-F1/Macro-F1):
| Dataset | DeepWalk | LINE(1+2) | PTE | metapath2vec |
|---|---|---|---|---|
| biology | .792/.547 | .722/.445 | .759/.495 | .828/.637 |
| computer sci. | .545/.454 | .633/.542 | .574/.454 | .678/.570 |
| economics | .692/.277 | .717/.385 | .654/.276 | .753/.485 |
| medicine | .663/.496 | .701/.577 | .694/.555 | .770/.659 |
| physics | .774/.592 | .779/.640 | .723/.571 | .794/.635 |
| 5-Fields union | .731/.589 | .755/.655 | .664/.528 | .831/.682 |
| Google-venues | .948/.942 | .955/.949 | .966/.961 | .971/.968 |
Clustering (NMI):
| Method | FOS | Venue |
|---|---|---|
| DeepWalk | 0.277 | 0.394 |
| LINE(1+2) | 0.305 | 0.459 |
| PTE | 0.153 | 0.602 |
| metapath2vec | 0.427 | 0.836 |
Metapath2vec, by exploiting heterogeneous context, consistently achieves superior classification and clustering performance. The Google-venue graph's higher data quality (top venues, reduced interdisciplinarity) yields increased performance across methods. A pronounced Micro-F1 to Macro-F1 drop indicates major label imbalance and cross-field scholarly activity.
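The Micro-to-Macro drop is easy to reproduce on a toy imbalanced sample: a majority-class predictor keeps Micro-F1 high while Macro-F1 collapses (labels below are illustrative):

```python
def f1_scores(y_true, y_pred):
    """Per-class F1, aggregated as micro and macro averages (single-label)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s, tp_all = [], 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        tp_all += tp
    micro = tp_all / len(y_true)  # single-label micro-F1 equals accuracy
    macro = sum(f1s) / len(f1s)
    return micro, macro

# 9 'bio' scholars + 1 'cs' scholar; predictor always picks the majority class.
y_true = ["bio"] * 9 + ["cs"]
y_pred = ["bio"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)  # 0.9 vs ~0.47: macro exposes the ignored minority class
```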
6. Applications and Prospective Research Avenues
AceKG underpins multiple academic data mining tasks:
- Cooperation Prediction: Embeddings encode institutional, field, and citation proximity, supporting co-authorship link prediction.
- Author Disambiguation: Heterogeneous relational context enables distinguishing same-name authors; iterative URI refinement is supported via embedding-space clustering.
- Rising-Star Detection: Temporal AceKG snapshots, combined with feature trajectories and embedding novelty, support early-career impact forecasting.
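Cooperation prediction via embedding proximity can be sketched as ranking candidate collaborators by cosine similarity; the author names and hand-set vectors below are hypothetical stand-ins for trained embeddings:

```python
import math

# Toy, hand-set embedding space (a trained model would supply these vectors).
emb = {
    "author_a": [0.9, 0.1, 0.0],
    "author_b": [0.8, 0.2, 0.1],   # same field/institution neighborhood
    "author_c": [0.0, 0.1, 0.95],  # distant in embedding space
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def rank_collaborators(author):
    """Other authors sorted by decreasing embedding similarity."""
    others = [a for a in emb if a != author]
    return sorted(others, key=lambda a: cosine(emb[author], emb[a]), reverse=True)

print(rank_collaborators("author_a"))  # ['author_b', 'author_c']
```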
Future research directions include dynamic KG updates (streaming ingestion), fine-grained event extraction, graph neural networks over heterogeneous schemas, and cross-modal knowledge integration (e.g., PDF content, code repositories). The scale, ontological clarity, and connection to external bibliographic resources position AceKG as a foundational asset for scalable, multi-facet academic knowledge discovery and benchmarking (Wang et al., 2018).