Entity alignment (EA) is a crucial task in integrating disparate knowledge graphs (KGs), aiming to identify entities that refer to the same real-world object across different sources. Traditionally, EA relied on symbolic methods using features like names, annotations, and relational structures. However, the heterogeneity across KGs poses significant challenges for these methods. The emergence of KG embedding techniques, which represent entities and relations as low-dimensional vectors, has led to the development of embedding-based entity alignment approaches. These methods encode entities in continuous vector spaces, allowing similarity to be measured via vector distance, potentially mitigating heterogeneity issues. Despite recent advancements, a comprehensive understanding of embedding-based EA's status quo, realistic benchmarks for evaluation, and readily available implementations have been lacking. This paper presents a detailed benchmarking study to address these gaps.
The key contributions of this research are:
- A comprehensive survey and categorization of 23 embedding-based EA approaches, analyzing their core techniques in embedding and alignment modules and their interaction modes.
- The development of new benchmark datasets designed to reflect the characteristics of real-world KGs, particularly regarding entity degree distributions, through an Iterative Degree-based Sampling (IDS) algorithm. Datasets are provided at 15K and 100K entity scales across different KG pairs (DBpedia-DBpedia cross-lingual, DBpedia-Wikidata, DBpedia-YAGO).
- An open-source library called OpenEA, which integrates 12 representative embedding-based EA approaches and allows for flexible combination of different KG embedding models and alignment strategies.
- An extensive experimental evaluation and analysis of the implemented approaches on the new datasets, providing insights into their strengths and limitations.
- Exploratory experiments investigating the geometric properties of entity embeddings, evaluating unexplored KG embedding models for EA, and comparing embedding-based approaches with conventional methods.
- Identification of promising future research directions based on the experimental findings.
The paper describes the typical framework of embedding-based entity alignment, which involves an embedding module to encode KGs into vector spaces and an alignment module that uses seed alignments to learn the correspondence between spaces or entities. These modules interact either by mapping embeddings between independent spaces or by learning embeddings directly in a unified space (Figure 1).
Techniques and Practical Implementation
Embedding modules often use relation embedding (triple, path, or neighborhood-based) and attribute embedding.
- Triple-based: Models like TransE capture local triple semantics. For a triple (h,r,t), TransE interprets r as a translation from h to t and minimizes a scoring function such as ϕ(h,r,t)=∥h+r−t∥. Practical implementations train with a margin-based ranking loss over positive triples and corrupted (negative) samples.
- Path-based: Exploit relation paths, e.g., IPTransE composes relation embeddings along a path. RSN4EA uses RNNs. Implementing these requires handling sequence or graph structures beyond simple triples.
- Neighborhood-based: Use Graph Convolutional Networks (GCNs) to capture local graph structure (the Figure 1 framework categorizes GCNs as neighborhood-based). GCN layers propagate and aggregate information from neighbors. Implementation involves defining the graph adjacency, handling different relation types (e.g., R-GCN), and applying activation functions.
- Attribute embedding: Enhances entity representation using attribute triples (entity, attribute, value). Attribute correlation embedding (JAPE) models attribute co-occurrence. Literal embedding (AttrE) encodes literal values, often using character-level models or pre-trained word embeddings for textual values. Implementing literal embedding requires encoders capable of handling diverse value types and potentially cross-lingual text.
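The triple-based and neighborhood-based modules above can be sketched in a few lines of NumPy. This is an illustrative toy, not OpenEA's API (all function names and toy data are ours): a TransE score with its margin-based ranking loss, and a single GCN propagation step with symmetric normalization.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: ||h + r - t|| (lower means more plausible)."""
    return np.linalg.norm(h + r - t)

def margin_ranking_loss(pos, neg, margin=1.0):
    """Hinge loss pushing each positive triple below its corrupted negative."""
    return sum(max(0.0, margin + transe_score(*p) - transe_score(*n))
               for p, n in zip(pos, neg)) / len(pos)

def gcn_layer(A, H, W):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)

# Toy data: a triple where h + r ≈ t, plus a corrupted tail entity.
h, r, t = np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.4, 0.3])
t_bad = np.array([-0.9, 0.8])
loss = margin_ranking_loss([(h, r, t)], [(h, r, t_bad)])

# Toy graph: 3-node chain with identity features and weights.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H1 = gcn_layer(A, np.eye(3), np.eye(3))
```

Because the corrupted triple already scores far worse than the positive one, the margin loss on this toy pair is zero; training only updates embeddings for pairs that violate the margin.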
Alignment modules combine a distance metric (cosine, Euclidean, or Manhattan) with an alignment inference strategy. Greedy search (taking each entity's nearest neighbor) is common, but the paper also explores collective inference methods such as stable matching to find optimal 1-to-1 alignments.
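The difference between greedy search and collective 1-to-1 inference can be shown with a small sketch (a generic Gale-Shapley stable matching over a distance matrix; the function names and toy data are ours):

```python
import numpy as np

def greedy_align(dist):
    """Independent nearest-neighbor search: each source takes its closest
    target, so several sources may claim the same target."""
    return {i: int(np.argmin(dist[i])) for i in range(dist.shape[0])}

def stable_align(dist):
    """Gale-Shapley stable matching on a square distance matrix (1-to-1)."""
    n = dist.shape[0]
    prefs = {i: list(np.argsort(dist[i])) for i in range(n)}
    match = {}                       # target -> source
    free = list(range(n))
    while free:
        i = free.pop()
        j = int(prefs[i].pop(0))     # i's best remaining target
        if j not in match:
            match[j] = i
        elif dist[i, j] < dist[match[j], j]:
            free.append(match[j])    # displace the weaker claimant
            match[j] = i
        else:
            free.append(i)
    return {s: t for t, s in match.items()}

# Both sources are closest to target 0, so greedy search collides.
dist = np.array([[0.1, 0.9],
                 [0.2, 0.8]])
```

On this toy matrix, greedy search maps both sources to target 0, while stable matching resolves the conflict into a 1-to-1 alignment.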
Interaction modes include space transformation (learning a matrix to map one space to another), space calibration (minimizing distance for aligned entities), parameter sharing (using same parameters for common entities/relations), and parameter swapping (augmenting triples using seed alignment). Learning strategies can be supervised (using only seed alignment) or semi-supervised (iteratively augmenting seed alignment, like BootEA).
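Space transformation can be illustrated with a least-squares sketch: given seed alignment pairs, fit a matrix mapping source-space embeddings onto their target-space counterparts (MTransE-style in spirit; the synthetic data and variable names are ours):

```python
import numpy as np

# Synthetic setup: a ground-truth linear map between two embedding spaces.
rng = np.random.default_rng(0)
M_true = rng.normal(size=(4, 4))
X = rng.normal(size=(20, 4))   # seed entities in the source space
Y = X @ M_true                 # their aligned counterparts in the target space

# Learn the transformation from the seed alignment by least squares.
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Map an unseen source entity into the target space for matching.
x_new = rng.normal(size=4)
y_pred = x_new @ M
```

Real systems learn this mapping jointly with the embeddings via gradient descent, but the principle is the same: seed pairs supervise a correspondence between otherwise independent vector spaces.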
Dataset Generation (Practical Consideration)
Existing datasets like DBP15K and WK3L are shown to have entity degree distributions significantly different from real-world KGs, potentially biasing evaluation towards high-degree entities (Figure 2). To address this, the paper proposes the Iterative Degree-based Sampling (IDS) algorithm (Algorithm 1). IDS iteratively samples entities while attempting to preserve the original KG's degree distribution, measured by Jensen-Shannon divergence. It uses PageRank to prioritize deleting less influential entities. This process ensures the generated datasets (Table 2) are more representative of real-world KG structures, providing a more realistic benchmark (Figure 3). Compared to simple random or PageRank-based sampling, IDS yields datasets with better-preserved structural properties like average degree, percentage of isolated entities, and clustering coefficient (Table 3).
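The divergence measure used by IDS can be sketched as a standard Jensen-Shannon divergence over normalized degree histograms (this is an assumed formulation with base-2 logs so values fall in [0, 1], not the paper's exact code):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two degree histograms,
    normalized to probability distributions (base-2 logs: range [0, 1])."""
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                 # treat 0 * log(0) as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A sampler in the IDS spirit would keep deleting low-PageRank entities until the divergence between the sample's degree histogram and the original KG's drops below a chosen threshold.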
Open-source Library (OpenEA)
The OpenEA library (Figure 4) is designed for flexible implementation and evaluation. Its architecture promotes loose coupling between embedding and alignment modules. It provides core components such as loss functions, negative sampling, distance metrics, and learning strategies; it implements 12 representative EA approaches and integrates 8 additional KG embedding models. Implemented in Python on TensorFlow, the library significantly lowers the barrier to entry for researchers and practitioners wanting to experiment with or apply these methods.
Experimental Evaluation and Analysis
The evaluation uses 5-fold cross-validation on the generated datasets (Table 4). Results (Table 5) show that RDGCN, BootEA, and MultiKE are top performers, suggesting the benefit of combining relational and attribute information (RDGCN, MultiKE) and effective semi-supervised learning (BootEA).
Key findings include:
- Data Density: Most relation-based approaches perform better on denser KG versions (V2), as more relational triples provide richer context for embedding.
- Long-tail Entities: Relation-based methods struggle with long-tail entities (those with few connections), which are prevalent in real KGs (Figure 5). Incorporating attribute information helps alleviate this.
- Scale: Performance generally decreases on larger (100K) datasets due to increased complexity and search space.
- Features (Relations vs. Attributes): Literal embedding improves performance significantly, while attribute correlation embeddings are less effective, especially across heterogeneous KGs (Figure 6). Attribute heterogeneity (e.g., different schemas or value formats) challenges attribute-based methods. Negative sampling is shown to be crucial for some methods (MTransE analysis).
- Semi-supervised Learning: The effectiveness of self-training depends heavily on the quality of the iteratively added alignment. BootEA's heuristic editing method helps maintain high precision in augmented data, leading to better performance (Figure 7).
- Efficiency: Approaches incorporating auxiliary information or complex techniques (like truncated negative sampling or path-based embedding) generally have higher training times (Figure 8). MultiKE offers a good balance of effectiveness and efficiency.
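The quality-control idea behind semi-supervised self-training can be illustrated with a minimal sketch (this is not BootEA's exact editing heuristic; here, one round proposes only mutual nearest neighbors above a similarity threshold, which keeps the augmented alignment high-precision):

```python
import numpy as np

def bootstrap_pairs(sim, threshold=0.8):
    """One self-training round: propose (source, target) pairs that are
    mutual nearest neighbors AND exceed a similarity threshold."""
    src_best = sim.argmax(axis=1)
    tgt_best = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(src_best)
            if tgt_best[j] == i and sim[i, j] >= threshold]

# Toy round: (0, 0) is a confident mutual match; (1, 1) is mutual but
# below the threshold, so it is held back rather than added as noise.
sim = np.array([[0.9, 0.1],
                [0.5, 0.6]])
new_pairs = bootstrap_pairs(sim)
```

Pairs accepted in one round would be added to the seed alignment before the next training round; the precision of these additions is exactly what the paper finds to be decisive.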
Exploratory Experiments
- Geometric Analysis: Analysis of entity embedding spaces reveals issues like hubness (a few entities appearing as the nearest neighbor of many others) and isolation (entities that are the nearest neighbor of none) (Figure 10). These phenomena negatively impact nearest-neighbor-based alignment inference. Using metrics like Cross-domain Similarity Local Scaling (CSLS) and collective inference methods like Stable Matching improves performance by mitigating hubness and accounting for isolated entities (Table 6, Figure 9), suggesting that improving the alignment inference strategy is important.
- Unexplored KG Embedding Models: Testing various KG embedding models (TransH, TransD, ProjE, ConvE, HolE, SimplE, RotatE) shows that not all are equally suitable for EA (Figure 11). Models robust to multi-mapping relations (TransH, TransD) and non-Euclidean models like RotatE perform better than simpler models like TransE or those sensitive to structural sparsity (ConvE, ProjE on sparse data). RotatE's strong performance suggests non-Euclidean spaces are promising.
- Comparison to Conventional Approaches: Comparing with LogMap (semantic web) and PARIS (database) reveals that conventional methods, often relying heavily on attribute/literal similarity or logical reasoning, can outperform current embedding-based methods on some datasets (Table 7). Notably, LogMap and PARIS primarily rely on attribute information, while embedding methods heavily use relational information or both (Table 8). Analysis shows that embedding-based and conventional methods find complementary sets of correct alignments (Figure 12), suggesting potential for hybrid approaches.
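The CSLS correction from the geometric analysis above can be sketched directly (a standard formulation; the toy similarity matrix is ours): each score is rescaled by the average similarity of its source and target to their k nearest neighbors, which penalizes hub targets.

```python
import numpy as np

def csls(sim, k=2):
    """CSLS-adjusted scores: 2*sim(x, y) - r(x) - r(y), where r(.) is the
    mean similarity to the k nearest neighbors in the other domain."""
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_src - r_tgt

# Toy similarity matrix in which target 0 is a hub: it is every source's
# nearest neighbor under plain greedy search.
sim = np.array([[0.9, 0.8, 0.1],
                [0.9, 0.2, 0.85],
                [0.9, 0.1, 0.2]])
greedy = sim.argmax(axis=1)          # every source picks the hub
adjusted = csls(sim).argmax(axis=1)  # hub penalty spreads the matches
```

On the toy matrix, plain nearest-neighbor search maps every source to the hub target 0, while the CSLS-adjusted scores assign the sources to three distinct targets.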
Future Research Directions
Based on the paper, several critical directions for future work are highlighted:
- Unsupervised EA: Developing methods that do not require seed alignment, potentially leveraging auxiliary features, pre-trained models, or techniques from unsupervised cross-lingual word alignment or active learning.
- Long-tail Entity Alignment: Improving alignment performance for low-degree entities by using advanced graph neural networks, incorporating multi-modal or taxonomic features, or joint training with link prediction.
- Large-scale Entity Alignment: Addressing scalability challenges in training and inference for very large KGs, possibly using blocking techniques like LSH.
- Entity Alignment in Non-Euclidean Spaces: Exploring KG embedding models in hyperbolic or other non-Euclidean spaces, motivated by the promising results with RotatE.
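As one concrete blocking strategy for the large-scale direction, random-hyperplane LSH can bucket entities by the sign pattern of their embedding projections, so candidate comparison happens only within a bucket (a generic sketch under our own naming, not an OpenEA feature):

```python
import numpy as np

def lsh_buckets(emb, n_planes=8, seed=0):
    """Random-hyperplane LSH blocking: entities whose embeddings fall on the
    same side of every hyperplane share a bucket, shrinking the set of
    candidate alignments that must be scored."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(emb.shape[1], n_planes))
    keys = [tuple(row) for row in (emb @ planes) > 0]  # sign pattern per entity
    buckets = {}
    for i, key in enumerate(keys):
        buckets.setdefault(key, []).append(i)
    return buckets

# Identical embeddings always collide; an opposite vector lands elsewhere.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
buckets = lsh_buckets(emb)
```

In an EA pipeline, source and target entities would be hashed with the same hyperplanes, and nearest-neighbor search restricted to entities sharing a bucket key.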
In conclusion, this benchmarking paper provides a valuable overview and empirical analysis of embedding-based entity alignment. It highlights the strengths of methods combining relation and attribute information with effective semi-supervised strategies, identifies limitations like handling long-tail entities and attribute heterogeneity, and underscores the importance of robust alignment inference. The publicly available datasets and OpenEA library provide crucial resources for advancing research in this field. The findings also suggest that embedding-based and conventional methods have complementary strengths, pointing towards potential hybrid solutions for real-world applications.