AceKG: Academic Knowledge Graph

Updated 22 February 2026
  • AceKG is a large-scale academic knowledge graph featuring 3.13B RDF triples covering papers, authors, fields, venues, and institutes.
  • It employs a consistent ontology with over 100 relation types and unique URIs, ensuring precise entity disambiguation and comprehensive multi-relational coverage.
  • AceKG serves as a benchmark environment for tasks such as link prediction, community detection, and scholar profiling using advanced embedding and inference methods.

AceKG is a large-scale academic knowledge graph (KG) comprising 3.13 billion RDF triples derived from a consistent ontology designed for high-fidelity academic data mining. Its schema represents essential entities and their relations—spanning papers, authors, fields of study, venues, and institutes—with fine-grained resolution achieved via unique URIs, addressing long-standing issues of name ambiguity and insufficient multi-relational coverage. AceKG serves as both a foundational academic resource and a rigorous benchmark environment for algorithmic evaluation in knowledge representation, link prediction, scholarly community detection, and author profiling (Wang et al., 2018).

1. Ontology: Entity Classes, Relations, and Hierarchies

AceKG instantiates a directed, labeled multi-relational graph G = (E, R, T), with E the set of entities, R the relation types, and T ⊆ E × R × E the factual triples. The five disjoint top-level entity classes are:

C = {Paper, Author, Field, Venue, Institute}

Each entity e ∈ E is assigned a unique URI for unambiguous identification, enabling, for example, distinct representations for similarly named authors (such as ace:7E7A3A69 vs. ace:7E0D6766 for two different "Jiawei Han" entities).

The relation vocabulary R is extensive (~100 types) and includes:

  • paper_is_written_by: encodes authorship,
  • paper_cites_paper: denotes citation,
  • paper_published_in_venue,
  • author_affiliated_with_institute,
  • field_is_part_of: defines a DAG over fields of study,
  • and further relations such as venue_located_in_institute and institute_has_research_field.

Hierarchical, compositional, and numeric data is attached via relations such as field_is_part_of (for the taxonomy) and unary predicates (for literals including citation counts and publication dates).
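As a concrete sketch, such a graph can be held as (head, relation, tail) tuples keyed by URI, so that two authors sharing a display name remain distinct entities. The two author URIs below come from the text; the paper URIs and the author_has_name literal relation are illustrative assumptions:

```python
# RDF-style triples as (head, relation, tail) tuples. The author URIs are the
# two disambiguated "Jiawei Han" entities mentioned in the text; the paper
# URIs (ace:paperA, ace:paperB) and author_has_name are hypothetical.
triples = [
    ("ace:7E7A3A69", "author_has_name", "Jiawei Han"),
    ("ace:7E0D6766", "author_has_name", "Jiawei Han"),
    ("ace:paperA", "paper_is_written_by", "ace:7E7A3A69"),
    ("ace:paperB", "paper_is_written_by", "ace:7E0D6766"),
]

def papers_by(author_uri, kg):
    """Papers whose authorship edge points at this exact URI."""
    return [h for h, r, t in kg if r == "paper_is_written_by" and t == author_uri]

# Same display name, different entities: the query stays unambiguous.
print(papers_by("ace:7E7A3A69", triples))  # ['ace:paperA']
```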

2. Scale, Statistics, and Dataset Partitions

AceKG encodes |T| = 3.13 × 10^9 triples and |E| = 1.14 × 10^8 entities, with explicit class counts:

Entity Class      Count
Papers            6.17 × 10^7
Authors           5.25 × 10^7
Fields of Study   5.02 × 10^4
Venues            2.17 × 10^4 journals + 1.28 × 10^3 conferences
Institutes        1.98 × 10^4

AceKG-derived benchmark datasets include:

  • AK18K for link prediction (7 relation types, 18,464 entities; ~130,265 / 7,429 / 7,336 train/validation/test triples),
  • Six field-specific heterogeneous collaboration graphs (e.g., Biology, Computer Science, Economics, Medicine, Physics, and a union of five fields),
  • A "Google-Scholar-venue" graph of 600K papers, 635K authors, 151 venues, and 2.37M edges.

All primary data originates from the in-house Acemap repository and is normalized and deduplicated; disambiguation is ensured via systematic URI assignment. Preprocessing also normalizes literals and dates to support large-scale machine processing.

3. Entity Alignment and Rule-based Inference

Entity alignment integrates AceKG with prominent external bibliographic databases (IEEE, ACM, DBLP). Mappings are computed by matching paper titles and author lists, principally using cosine similarity of TF-IDF representations:

sim(pm, qm) = cos(tfidf(pm), tfidf(qm))

Titles may also be matched using edit distance. At threshold τ ≈ 0.9, the mapping achieves coverage of 2.33M (IEEE), 1.91M (ACM), and 2.27M (DBLP) papers.
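A minimal sketch of the matching step, using raw term-frequency vectors in place of full TF-IDF weighting (IDF requires the whole corpus) and the τ ≈ 0.9 threshold from above; the titles and helper names are illustrative:

```python
import math
from collections import Counter

def tf_vector(title):
    """Bag-of-words term-frequency vector for a paper title."""
    return Counter(title.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(p_title, q_title, tau=0.9):
    """Accept a cross-database mapping only when similarity clears tau."""
    return cosine(tf_vector(p_title), tf_vector(q_title)) >= tau

print(align("AceKG a large academic knowledge graph", "graph coloring heuristics"))  # False
```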

Rule-based inference employs Horn-style rules to further augment the graph. Representative rules include:

  • (a paper_is_written_by b) ∧ (b affiliated_with I) → (a paper_has_institute I),
  • (p1 paper_cites_paper p2) ∧ (p2 paper_cites_paper p3) → (p1 paper_co_cite p3).
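A single forward-chaining pass over these two rules can be sketched as follows; the relation names follow the vocabulary in Section 1, while the entity URIs in the usage example are illustrative:

```python
def apply_rules(triples):
    """One forward-chaining pass over the two Horn rules from the text."""
    authored, affil, cites = {}, {}, {}
    for h, r, t in triples:
        if r == "paper_is_written_by":
            authored.setdefault(h, []).append(t)
        elif r == "author_affiliated_with_institute":
            affil.setdefault(h, []).append(t)
        elif r == "paper_cites_paper":
            cites.setdefault(h, []).append(t)
    inferred = set()
    # Rule 1: authorship + affiliation -> paper_has_institute
    for p, authors in authored.items():
        for a in authors:
            for inst in affil.get(a, []):
                inferred.add((p, "paper_has_institute", inst))
    # Rule 2: two-hop citation -> paper_co_cite
    for p1, mids in cites.items():
        for p2 in mids:
            for p3 in cites.get(p2, []):
                inferred.add((p1, "paper_co_cite", p3))
    return inferred

kg = [
    ("ace:paperA", "paper_is_written_by", "ace:7E7A3A69"),
    ("ace:7E7A3A69", "author_affiliated_with_institute", "ace:instX"),
    ("ace:paperA", "paper_cites_paper", "ace:paperB"),
    ("ace:paperB", "paper_cites_paper", "ace:paperC"),
]
print(sorted(apply_rules(kg)))  # both inferred triples
```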

AceKG thus provides both explicit and inferred relational knowledge, which improves its utility for downstream analytics.

4. Benchmark Tasks

AceKG supports systematic evaluation via canonical academic data mining tasks.

4.1 Link Prediction

Given a training triple set T_train ⊆ T, entities e ∈ E are embedded as vectors e ∈ ℝ^d and relations r ∈ R as vectors r, with scoring functions f_r(h, t) used to predict missing links. Evaluated methods and representative scoring functions include:

  • TransE: f_r(h, t) = -‖h + r - t‖₂,
  • TransH: projection onto relation-specific hyperplanes before a TransE-style score,
  • DistMult: f_r(h, t) = Σᵢ hᵢ rᵢ tᵢ,
  • HolE: f_r(h, t) = rᵀ(h ⋆ t), where ⋆ is circular correlation,
  • ComplEx: f_r(h, t) = Re⟨h, r, t̄⟩ with complex-valued embeddings.
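Under these definitions, the scoring functions can be sketched in a few lines of NumPy. This follows the standard published formulations (FFT-based circular correlation for HolE, hyperplane projection for TransH) and assumes the vectors are already trained:

```python
import numpy as np

def transe(h, r, t):
    """TransE: f_r(h, t) = -||h + r - t||_2 (higher is better)."""
    return -float(np.linalg.norm(h + r - t))

def transh(h, r, t, w):
    """TransH: project h and t onto the hyperplane with unit normal w,
    then score the projections TransE-style."""
    hp = h - np.dot(w, h) * w
    tp = t - np.dot(w, t) * w
    return -float(np.linalg.norm(hp + r - tp))

def distmult(h, r, t):
    """DistMult: f_r(h, t) = sum_i h_i r_i t_i."""
    return float(np.sum(h * r * t))

def hole(h, r, t):
    """HolE: f_r(h, t) = r . (h * t), circular correlation via FFT."""
    corr = np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)).real
    return float(np.dot(r, corr))

def complex_score(h, r, t):
    """ComplEx: f_r(h, t) = Re<h, r, conj(t)> with complex embeddings."""
    return float(np.real(np.sum(h * r * np.conj(t))))

# A triple that TransE models perfectly (h + r == t) attains the maximum score, 0.
h, r, t = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(transe(h, r, t) == 0.0)  # True
```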

All models use embedding dimension d = 100, margin γ = 1.0, learning rate α = 0.01, and SGD with early stopping on validation MRR; the loss L is a margin ranking loss with one corrupted-entity negative per triple.
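The objective can be sketched as a margin ranking loss (with the γ = 1.0 from above) plus uniform head-or-tail corruption for the one negative per triple; the function names are illustrative:

```python
import random
import numpy as np

def corrupt(triple, entities, rng):
    """One negative per positive: replace head or tail with a random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))

def margin_ranking_loss(pos_scores, neg_scores, gamma=1.0):
    """L = sum over triples of max(0, gamma - f(pos) + f(neg))."""
    return float(np.maximum(0.0, gamma - pos_scores + neg_scores).sum())

# A well-separated pair contributes nothing; a violating pair is penalized.
print(margin_ranking_loss(np.array([2.0]), np.array([0.0])))  # 0.0
print(margin_ranking_loss(np.array([0.0]), np.array([0.5])))  # 1.5
```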

4.2 Community Detection and Scholar Classification

On extracted collaboration graphs, network representation learning is performed via:

  • DeepWalk: homogeneous random walks with Skip-Gram,
  • LINE: first- and second-order proximity preservation,
  • PTE: joint text-network embeddings,
  • metapath2vec: heterogeneous walks along user-defined metapaths.
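A DeepWalk-style walk generator can be sketched as below; the Skip-Gram model that consumes the walks (and the metapath constraints metapath2vec adds for heterogeneous graphs) is omitted, and the toy graph is illustrative:

```python
import random

def random_walks(adj, num_walks=2, walk_len=5, seed=0):
    """DeepWalk-style uniform random walks over a homogeneous graph.
    adj maps node -> list of neighbors; each walk acts as a 'sentence'
    for a downstream Skip-Gram embedding model (training omitted)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: truncate the walk
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy co-authorship graph: authors a - b - c along a path.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(random_walks(adj, num_walks=1, walk_len=4))
```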

Resulting node embeddings feed into logistic regression (5-fold CV) for classification (metrics: Micro-F1, Macro-F1) and k-means clustering (NMI metric) for community detection.
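The Micro-F1 and Macro-F1 metrics used here can be sketched for the single-label case as follows (the logistic-regression classifier and k-means/NMI steps are omitted; the labels are illustrative):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Micro- and Macro-F1 for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            tp[yt] += 1
        else:
            fp[yp] += 1
            fn[yt] += 1

    def f1(t, p, n):
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0

    # Macro: unweighted mean of per-class F1; Micro: F1 over pooled counts.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

mi, ma = f1_scores(["bio", "bio", "cs", "cs"], ["bio", "cs", "cs", "cs"])
print(round(mi, 3), round(ma, 3))  # 0.75 0.733
```

A large Micro-to-Macro gap, as seen in the economics rows below, signals that rare classes are classified much worse than frequent ones.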

5. Experimental Results and Analytical Insights

5.1 Link Prediction Results

Filtered MRR, Hits@1, and Hits@10 for several knowledge embedding models:

Model      MRR     Hits@1   Hits@10
TransE     0.719   62.7%    89.2%
TransH     0.701   61.0%    84.6%
DistMult   0.749   68.7%    86.1%
HolE       0.864   83.8%    88.2%
ComplEx    0.817   75.4%    89.0%

HolE and ComplEx exhibit superior capability for modeling antisymmetric relationships, which characterize all seven AK18K relation types. TransE outperforms TransH in Hits@10, plausibly due to a high proportion of many-to-many relations (94%) and a limited relation schema, where the benefit from hyperplane projection (TransH) is minimal. AK18K results fall between FB15K and WN18 benchmarks; this is attributed to moderate graph sparsity and the relatively simple relation schema.
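The antisymmetry point can be made concrete with tiny hand-built embeddings: DistMult's bilinear score is invariant under swapping head and tail, while ComplEx's conjugated tail breaks that symmetry. The vectors below are illustrative:

```python
import numpy as np

def distmult(h, r, t):
    """Bilinear score sum_i h_i r_i t_i: symmetric under h <-> t."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """ComplEx score Re<h, r, conj(t)>: the conjugate breaks the symmetry."""
    return float(np.real(np.sum(h * r * np.conj(t))))

# Real embeddings: swapping head and tail cannot change a DistMult score,
# so antisymmetric relations like paper_cites_paper are out of reach.
h, r, t = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
print(distmult(h, r, t), distmult(t, r, h))  # 63.0 63.0

# Complex embeddings: a purely imaginary relation vector scores (h, t)
# and (t, h) with opposite signs.
hc, rc, tc = np.array([1 + 0j]), np.array([0 + 1j]), np.array([0 + 1j])
print(complex_score(hc, rc, tc), complex_score(tc, rc, hc))  # 1.0 -1.0
```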

5.2 Scholar Classification and Clustering

Classification results, with each cell reporting Micro-F1/Macro-F1:

Dataset          DeepWalk    LINE(1+2)   PTE         metapath2vec
biology          .792/.547   .722/.445   .759/.495   .828/.637
computer sci.    .545/.454   .633/.542   .574/.454   .678/.570
economics        .692/.277   .717/.385   .654/.276   .753/.485
medicine         .663/.496   .701/.577   .694/.555   .770/.659
physics          .774/.592   .779/.640   .723/.571   .794/.635
5-fields union   .731/.589   .755/.655   .664/.528   .831/.682
Google-venues    .948/.942   .955/.949   .966/.961   .971/.968

Clustering (NMI):

Method         FOS     Google
DeepWalk       0.277   0.394
LINE(1+2)      0.305   0.459
PTE            0.153   0.602
metapath2vec   0.427   0.836

Metapath2vec, by exploiting heterogeneous context, consistently achieves superior classification and clustering performance. The Google-venue graph's higher data quality (top venues, reduced interdisciplinarity) yields increased performance across methods. A pronounced Micro-F1 to Macro-F1 drop indicates major label imbalance and cross-field scholarly activity.

6. Applications and Prospective Research Avenues

AceKG underpins multiple academic data mining tasks:

  • Cooperation Prediction: Embeddings encode institutional, field, and citation proximity, supporting co-authorship link prediction.
  • Author Disambiguation: Hetero-typed relational context enables distinction of same-name authors; iterative URI refinement is supported via embedding-space clustering.
  • Rising-Star Detection: Temporal AceKG snapshots, combined with feature trajectories and embedding novelty, support early-career impact forecasting.

Future research directions include dynamic KG updates (streaming ingestion), fine-grained event extraction, graph neural networks over heterogeneous schemas, and cross-modal knowledge integration (e.g., PDF content, code repositories). The scale, ontological clarity, and connection to external bibliographic resources position AceKG as a foundational asset for scalable, multi-facet academic knowledge discovery and benchmarking (Wang et al., 2018).
