AceKG: Academic Knowledge Graph
- AceKG is a large-scale academic knowledge graph featuring 3.13B RDF triples covering papers, authors, fields, venues, and institutes.
- It employs a consistent ontology with over 100 relation types and unique URIs, ensuring precise entity disambiguation and comprehensive multi-relational coverage.
- AceKG serves as a benchmark environment for tasks such as link prediction, community detection, and scholar profiling using advanced embedding and inference methods.
AceKG is a large-scale academic knowledge graph (KG) comprising 3.13 billion RDF triples derived from a consistent ontology designed for high-fidelity academic data mining. Its schema represents essential entities and their relations—spanning papers, authors, fields of study, venues, and institutes—with fine-grained resolution achieved via unique URIs, addressing long-standing issues of name ambiguity and insufficient multi-relational coverage. AceKG serves as both a foundational academic resource and a rigorous benchmark environment for algorithmic evaluation in knowledge representation, link prediction, scholarly community detection, and author profiling (Wang et al., 2018).
1. Ontology: Entity Classes, Relations, and Hierarchies
AceKG instantiates a directed, labeled multi-relational graph $G = (E, R, T)$, with entity set $E$, relation types $R$, and factual triples $T \subseteq E \times R \times E$. The five disjoint top-level entity classes are papers, authors, fields of study, venues, and institutes.
Each entity is assigned a unique URI for unambiguous identification, enabling, for example, distinct representations for similarly-named authors (such as ace:7E7A3A69 vs. ace:7E0D6766 for two different "Jiawei Han" entities).
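The URI scheme can be sketched as a minimal triple store in Python. The two `ace:` URIs below come from the text; the `affiliated_with` relation name and the institute identifiers are illustrative placeholders, not AceKG's actual vocabulary:

```python
# Minimal sketch (not AceKG's actual storage): RDF-style triples keyed by
# unique URIs, so two authors who share the name "Jiawei Han" remain distinct.
triples = [
    ("ace:7E7A3A69", "name", "Jiawei Han"),
    ("ace:7E0D6766", "name", "Jiawei Han"),
    ("ace:7E7A3A69", "affiliated_with", "ace:InstituteA"),  # illustrative
    ("ace:7E0D6766", "affiliated_with", "ace:InstituteB"),  # illustrative
]

def objects(subject, relation):
    """All objects o with (subject, relation, o) in the graph."""
    return [o for s, r, o in triples if s == subject and r == relation]

# Same surface name, different URIs -> distinct affiliations are recoverable.
print(objects("ace:7E7A3A69", "affiliated_with"))  # ['ace:InstituteA']
print(objects("ace:7E0D6766", "affiliated_with"))  # ['ace:InstituteB']
```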
The relation vocabulary is extensive (over 100 types) and includes relations encoding authorship, citation, publication venue, and institutional affiliation, together with a hierarchical relation that defines a DAG over the fields of study, among further relations.
Hierarchical, compositional, and numeric data are attached via taxonomy relations (for the field-of-study hierarchy) and literal-valued predicates (for attributes including citation counts and publication dates).
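Traversing such a field-of-study taxonomy can be sketched as an upward walk over a DAG. The field names and `part_of` edges below are hypothetical, not taken from AceKG:

```python
# Illustrative field-of-study DAG (field names and 'part_of' edges are
# hypothetical); each field maps to its parent fields.
part_of = {
    "Knowledge Graphs": ["Data Mining", "Semantic Web"],
    "Data Mining": ["Computer Science"],
    "Semantic Web": ["Computer Science"],
    "Computer Science": [],
}

def ancestors(field):
    """All fields reachable upward from `field` in the DAG."""
    seen = set()
    stack = list(part_of.get(field, []))
    while stack:
        f = stack.pop()
        if f not in seen:
            seen.add(f)
            stack.extend(part_of.get(f, []))
    return seen

print(sorted(ancestors("Knowledge Graphs")))
# ['Computer Science', 'Data Mining', 'Semantic Web']
```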
2. Scale, Statistics, and Dataset Partitions
AceKG encodes its 3.13 billion triples over five explicit entity classes: papers, authors, fields of study, venues (split into journals and conferences), and institutes.
AceKG-derived benchmark datasets include:
- AK18K for link prediction, comprising seven relation types with train/validation/test splits of 130,265/7,429/7,336 triples,
- Six field-specific heterogeneous collaboration graphs (Biology, Computer Science, Economics, Medicine, Physics, and a union of the five fields),
- A "Google-Scholar-venue" graph of 600K papers, 635K authors, 151 venues, and 2.37M edges.
All primary data originates from the in-house Acemap repository, normalized and deduplicated; disambiguation is ensured via systematic URI assignment. Preprocessing also normalizes literals and dates to facilitate large-scale, machine-compliant operations.
3. Entity Alignment and Rule-based Inference
Entity alignment integrates AceKG with prominent external bibliographic databases (IEEE, ACM, DBLP). Mappings are computed by matching paper titles and author lists, principally using cosine similarity of TF-IDF representations; titles may also be matched by edit distance. At a fixed similarity threshold, the mapping covers 2.33M (IEEE), 1.91M (ACM), and 2.27M (DBLP) papers.
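A minimal, dependency-free sketch of the TF-IDF/cosine matching step follows; the tokenization and IDF smoothing choices here are illustrative, not AceKG's exact pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (smoothed IDF) for whitespace-tokenized titles."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return [
        {t: c * math.log((1 + n) / (1 + df[t])) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

titles = ["AceKG a large scale knowledge graph",
          "A large scale academic knowledge graph",
          "Deep learning for image classification"]
vecs = tfidf_vectors(titles)
# The two KG titles score far higher than the unrelated pair.
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```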
Rule-based inference employs Horn-style rules, implications in which a head triple follows from a conjunction of body triples, to further augment the graph with facts not explicitly stated.
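An illustrative Horn rule of this form (relation names hypothetical, not AceKG's actual rules) infers co-authorship from shared papers via simple forward chaining:

```python
from itertools import combinations

# Illustrative rule: coauthor(x, y) <- author_of(x, p) AND author_of(y, p)
triples = {
    ("alice", "author_of", "paper1"),
    ("bob", "author_of", "paper1"),
    ("carol", "author_of", "paper2"),
}

def infer_coauthors(kb):
    """Forward-chain the co-author rule over an author_of triple set."""
    by_paper = {}
    for s, r, o in kb:
        if r == "author_of":
            by_paper.setdefault(o, set()).add(s)
    inferred = set()
    for authors in by_paper.values():
        for x, y in combinations(sorted(authors), 2):
            inferred.add((x, "coauthor", y))
    return inferred

print(infer_coauthors(triples))  # {('alice', 'coauthor', 'bob')}
```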
AceKG thus provides both explicit and inferred relational knowledge, improving its utility for downstream analytics.
4. Benchmark Tasks: Link Prediction, Community Detection, and Scholar Profiling
AceKG supports systematic evaluation via canonical academic data mining tasks:
4.1 Link Prediction (Knowledge Embedding)
Given a triple set $T$, entities are embedded as vectors $\mathbf{e} \in \mathbb{R}^d$ and relations as vectors or relation-specific operators, with scoring functions $f(h, r, t)$ used to predict missing links. Evaluated methods and representative scoring functions include:
- TransE: $f(h, r, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$,
- TransH: translation after projection onto relation-specific hyperplanes,
- DistMult: $f(h, r, t) = \langle \mathbf{h}, \mathbf{r}, \mathbf{t} \rangle = \sum_i h_i r_i t_i$,
- HolE: $f(h, r, t) = \mathbf{r}^\top (\mathbf{h} \star \mathbf{t})$, where $\star$ denotes circular correlation,
- ComplEx: $f(h, r, t) = \operatorname{Re}(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}} \rangle)$ with complex-valued embeddings.
All models share a common embedding dimension, margin, and learning rate, and are trained with SGD and early stopping on validation MRR; the loss is a margin ranking loss with one corrupted-entity negative per triple.
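The scoring functions and the margin ranking loss can be sketched on toy 3-dimensional embeddings; all vectors below are illustrative, and real training would update them by SGD:

```python
def trans_e(h, r, t):
    """TransE score -||h + r - t||_1 (higher is better)."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def dist_mult(h, r, t):
    """DistMult trilinear product <h, r, t>."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: penalize negatives scored within `margin` of positives."""
    return max(0.0, margin + neg_score - pos_score)

h, r, t = [0.2, 0.1, 0.0], [0.3, -0.1, 0.5], [0.5, 0.0, 0.5]
corrupt_t = [0.6, 0.1, 0.5]   # tail replaced by a nearby incorrect entity
pos = trans_e(h, r, t)        # 0.0: h + r lands exactly on t
neg = trans_e(h, r, corrupt_t)
print(margin_ranking_loss(pos, neg))  # ~0.8: negative still too close
```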
4.2 Community Detection and Scholar Classification
On extracted collaboration graphs, network representation learning is performed via:
- DeepWalk: homogeneous random walks with Skip-Gram,
- LINE: first- and second-order proximity preservation,
- PTE: joint text-network embeddings,
- metapath2vec: heterogeneous walks along user-defined metapaths.
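A DeepWalk-style walk generator can be sketched as follows, on a toy collaboration graph (edges illustrative); a Skip-Gram model would then consume the walks as sentences:

```python
import random

# Toy undirected collaboration graph as an adjacency list (illustrative).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(start, length, rng):
    """Uniform random walk of `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # seeded for reproducibility
walks = [random_walk(node, 5, rng) for node in graph for _ in range(2)]
# Every step of every walk follows a real edge.
assert all(w[i + 1] in graph[w[i]] for w in walks for i in range(len(w) - 1))
```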
Resulting node embeddings feed into logistic regression (5-fold CV) for classification (metrics: Micro-F1, Macro-F1) and k-means clustering (NMI metric) for community detection.
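The NMI metric used for the clustering evaluation can be computed directly from the two label assignments; a minimal implementation (natural-log convention, geometric-mean normalization):

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two clusterings."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(c / n * math.log(n * c / (ct[a] * cp[b]))
             for (a, b), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(h_t * h_p) if h_t and h_p else 1.0

# Identical clusterings (up to label renaming) score 1.0.
print(nmi([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0
```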
5. Experimental Results and Analytical Insights
5.1 Link Prediction Performance on AK18K
Filtered MRR and Hits@$k$ (for $k \in \{1, 10\}$) for several knowledge embedding models:
| Model | MRR | Hits@1 | Hits@10 |
|---|---|---|---|
| TransE | 0.719 | 62.7% | 89.2% |
| TransH | 0.701 | 61.0% | 84.6% |
| DistMult | 0.749 | 68.7% | 86.1% |
| HolE | 0.864 | 83.8% | 88.2% |
| ComplEx | 0.817 | 75.4% | 89.0% |
HolE and ComplEx exhibit superior capability for modeling antisymmetric relationships, which characterize all seven AK18K relation types. TransE outperforms TransH in Hits@10, plausibly due to a high proportion of many-to-many relations (94%) and a limited relation schema, where the benefit from hyperplane projection (TransH) is minimal. AK18K results fall between FB15K and WN18 benchmarks; this is attributed to moderate graph sparsity and the relatively simple relation schema.
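The symmetry argument can be checked numerically: DistMult's trilinear score is invariant under swapping head and tail, while ComplEx's real-part score generally is not (toy embeddings below are arbitrary):

```python
def dist_mult(h, r, t):
    """DistMult: sum_i h_i * r_i * t_i, symmetric in h and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """ComplEx: Re(<h, r, conj(t)>) with complex-valued embeddings."""
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))

h, r, t = [1.0, 2.0], [0.5, -1.0], [3.0, 0.5]
assert dist_mult(h, r, t) == dist_mult(t, r, h)  # symmetric by construction

hc, rc, tc = [1 + 2j, 0.5j], [0.3 + 0.7j, 1 - 1j], [2 - 1j, 1 + 1j]
# Swapping head and tail changes the ComplEx score: antisymmetry is expressible.
assert complex_score(hc, rc, tc) != complex_score(tc, rc, hc)
```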
5.2 Scholar Classification and Clustering
Micro-F1 and Macro-F1 scores for scholar classification (each cell reports Micro-F1/Macro-F1):
| Dataset | DeepWalk | LINE(1+2) | PTE | metapath2vec |
|---|---|---|---|---|
| biology | .792/.547 | .722/.445 | .759/.495 | .828/.637 |
| computer sci. | .545/.454 | .633/.542 | .574/.454 | .678/.570 |
| economics | .692/.277 | .717/.385 | .654/.276 | .753/.485 |
| medicine | .663/.496 | .701/.577 | .694/.555 | .770/.659 |
| physics | .774/.592 | .779/.640 | .723/.571 | .794/.635 |
| 5-Fields union | .731/.589 | .755/.655 | .664/.528 | .831/.682 |
| Google-venues | .948/.942 | .955/.949 | .966/.961 | .971/.968 |
Clustering (NMI):
| Method | FOS | Venue |
|---|---|---|
| DeepWalk | 0.277 | 0.394 |
| LINE(1+2) | 0.305 | 0.459 |
| PTE | 0.153 | 0.602 |
| metapath2vec | 0.427 | 0.836 |
Metapath2vec, by exploiting heterogeneous context, consistently achieves superior classification and clustering performance. The Google-venue graph's higher data quality (top venues, reduced interdisciplinarity) yields increased performance across methods. A pronounced Micro-F1 to Macro-F1 drop indicates major label imbalance and cross-field scholarly activity.
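The Micro-to-Macro drop is easy to reproduce on a toy imbalanced sample: a majority-class predictor keeps Micro-F1 high while Macro-F1 collapses (labels below are illustrative):

```python
def f1_scores(y_true, y_pred):
    """Per-class F1, aggregated as micro and macro averages (single-label)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s, tp_all = [], 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        tp_all += tp
    micro = tp_all / len(y_true)  # single-label micro-F1 equals accuracy
    macro = sum(f1s) / len(f1s)
    return micro, macro

# 9 'bio' scholars + 1 'cs' scholar; predictor always picks the majority class.
y_true = ["bio"] * 9 + ["cs"]
y_pred = ["bio"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)  # 0.9 vs ~0.47: macro exposes the ignored minority class
```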
6. Applications and Prospective Research Avenues
AceKG underpins multiple academic data mining tasks:
- Cooperation Prediction: Embeddings encode institutional, field, and citation proximity, supporting co-authorship link prediction.
- Author Disambiguation: Heterogeneous relational context enables distinguishing same-name authors; iterative URI refinement is supported via embedding-space clustering.
- Rising-Star Detection: Temporal AceKG snapshots, combined with feature trajectories and embedding novelty, support early-career impact forecasting.
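Cooperation prediction via embedding proximity can be sketched as ranking candidate collaborators by cosine similarity; the author names and hand-set vectors below are hypothetical stand-ins for trained embeddings:

```python
import math

# Toy, hand-set embedding space (a trained model would supply these vectors).
emb = {
    "author_a": [0.9, 0.1, 0.0],
    "author_b": [0.8, 0.2, 0.1],   # same field/institution neighborhood
    "author_c": [0.0, 0.1, 0.95],  # distant in embedding space
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def rank_collaborators(author):
    """Other authors sorted by decreasing embedding similarity."""
    others = [a for a in emb if a != author]
    return sorted(others, key=lambda a: cosine(emb[author], emb[a]), reverse=True)

print(rank_collaborators("author_a"))  # ['author_b', 'author_c']
```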
Future research directions include dynamic KG updates (streaming ingestion), fine-grained event extraction, graph neural networks over heterogeneous schemas, and cross-modal knowledge integration (e.g., PDF content, code repositories). The scale, ontological clarity, and connection to external bibliographic resources position AceKG as a foundational asset for scalable, multi-facet academic knowledge discovery and benchmarking (Wang et al., 2018).