- The paper proposes LargeGNN, a scalable GNN method that merges two knowledge graphs and partitions the merged graph using METIS for efficient entity alignment.
- The approach minimizes structure and alignment information loss by integrating centrality-based landmark selection during partitioning, along with cross-subgraph negative sampling and entity reconstruction during training.
- Experimental results on DBpedia1M and OpenEA benchmarks show that LargeGNN achieves higher precision and efficiency compared to existing scalable alignment methods.
Entity alignment, the task of finding equivalent entities across different knowledge graphs (KGs), is crucial for integrating heterogeneous data sources. However, traditional embedding-based entity alignment methods, particularly those leveraging Graph Neural Networks (GNNs), face significant scalability challenges on large-scale KGs due to high memory requirements for graph convolution over the entire structure. Existing scalable approaches attempt to address this by partitioning KGs into smaller blocks, but this often leads to considerable structure and alignment information loss, and may rely on auxiliary information like entity names, which can be unreliable.
This paper proposes LargeGNN, a scalable GNN-based approach for large-scale entity alignment that minimizes structure and alignment loss from three perspectives: merging and partitioning, enhanced GNN training, and graph-level inference.
The core idea is to train a GNN on mini-batches composed of subgraphs rather than the entire KG, while incorporating mechanisms to retain global information and connectivity. The pipeline begins by merging the two KGs to be aligned into a single graph. This is done by collapsing pre-aligned entities from the seed alignment set into single nodes, ensuring that training pairs are contained within the same subgraph after partitioning. The merged graph is then partitioned using the METIS algorithm to create subgraphs small enough to fit in GPU memory.
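A minimal sketch of this merge-and-partition step is shown below. The data containers, the relation-dropping simplification, and the pymetis binding are assumptions for illustration; the paper only specifies collapsing seed-aligned pairs into single nodes and partitioning the merged graph with METIS.

```python
# Sketch only: merge two KGs by collapsing seed-aligned pairs, then METIS-partition.
import pymetis  # assumed METIS binding; any wrapper exposing part_graph works similarly


def merge_and_partition(edges_kg1, edges_kg2, seed_alignment, num_parts):
    """edges_kg*: iterables of (head, tail) entity pairs (relation labels dropped for brevity).
    seed_alignment: iterable of (e1, e2) pre-aligned entity pairs.
    Returns a merged-node id map and one partition label per merged node."""
    # Collapse each seed-aligned pair into a single merged node so training
    # pairs always land in the same subgraph after partitioning.
    canonical = {e2: e1 for e1, e2 in seed_alignment}  # represent a pair by its KG1 entity

    def node_of(e):
        return canonical.get(e, e)

    node_ids, adjacency = {}, []

    def idx(e):
        e = node_of(e)
        if e not in node_ids:
            node_ids[e] = len(adjacency)
            adjacency.append(set())
        return node_ids[e]

    # Build an undirected adjacency list over the merged node set.
    for h, t in list(edges_kg1) + list(edges_kg2):
        hi, ti = idx(h), idx(t)
        if hi != ti:
            adjacency[hi].add(ti)
            adjacency[ti].add(hi)

    # METIS partitions the merged graph into roughly balanced, GPU-sized blocks.
    _, parts = pymetis.part_graph(num_parts, adjacency=[sorted(a) for a in adjacency])
    return node_ids, parts
```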
To mitigate the structure loss inherent in partitioning, a centrality-based subgraph generation algorithm is introduced. After the initial partitioning, this algorithm identifies "landmark" entities outside each subgraph that are important based on their proximity to seed entities (importance ϕ(e)) and their connections to other important entities (influence Φ(e)). A benefit score Ω(e,S) is then computed for each candidate landmark from its influence and its distance to the subgraph S. A connectivity-aware procedure selects up to a budget of k beneficial entities, which are recalled and added to (potentially multiple) subgraphs, where they act as bridges.
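A possible rendering of the landmark recall step for one subgraph is sketched below. The concrete forms of ϕ, Φ, and Ω are placeholders, and the paper's connectivity-aware selection is simplified here to a greedy top-k by benefit score, so this illustrates the scoring-and-recall pattern rather than the exact method.

```python
# Sketch only: score out-of-subgraph entities and recall the top-k as landmarks.
import networkx as nx


def recall_landmarks(graph, subgraph_nodes, seed_entities, budget_k, eta=0.5, lam=0.5):
    """graph: undirected networkx graph of the merged KG; returns budget_k landmark entities."""
    sub = set(subgraph_nodes)

    # phi(e): importance, decaying with the hop distance to the nearest seed entity
    # (placeholder formula; eta and lam stand in for the paper's trade-off parameters).
    dist_to_seed = nx.multi_source_dijkstra_path_length(graph, set(seed_entities))
    phi = {e: 1.0 / (1.0 + d) for e, d in dist_to_seed.items()}

    # Phi(e): influence, mixing an entity's own importance with that of its neighbors.
    big_phi = {
        e: eta * phi.get(e, 0.0)
        + (1.0 - eta) * sum(phi.get(n, 0.0) for n in graph.neighbors(e))
        for e in graph.nodes
    }

    # Omega(e, S): benefit of recalling e into subgraph S, discounted by distance to S.
    dist_to_s = nx.multi_source_dijkstra_path_length(graph, sub)
    candidates = [e for e in graph.nodes if e not in sub]
    omega = {e: big_phi[e] / (1.0 + lam * dist_to_s.get(e, float("inf"))) for e in candidates}

    # Greedy selection: keep the budget_k highest-benefit candidates as bridge entities.
    return sorted(candidates, key=lambda e: omega[e], reverse=True)[:budget_k]
```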
The GNN encoder, based on Dual-AMN (2008.09792), is trained on these subgraphs. While it performs message passing within each subgraph, the training process is designed to capture cross-subgraph information. The primary alignment loss Lalign (2008.09792) is computed within each subgraph using in-batch negative sampling (considering all other entities in the subgraph as negative candidates). To incorporate global context, two additional strategies are employed:
- Cross-subgraph Negative Sampling (CNS): During training on a subgraph, a random sample of entities from other subgraphs is included as additional negative alignment candidates. To save memory and computation, the similarity for these cross-subgraph negatives is calculated from the entities' input representations (Lcross) rather than their GNN output embeddings, since their subgraphs are not being processed in the current batch.
- Entity Reconstruction (ER): A self-supervised task Lreconstruct is added to help the GNN learn robust entity representations even from incomplete neighborhoods within a subgraph. This loss encourages an entity's embedding to be close to the embeddings of its neighbors within the current subgraph.
The overall training loss is a combination of Lalign, Lcross, and Lreconstruct.
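To make the combination concrete, here is a minimal sketch of a per-subgraph batch loss. It replaces Dual-AMN's margin/temperature-based alignment loss with a plain cross-entropy over cosine similarities, and the weights alpha and beta are assumptions; only the overall structure (in-batch negatives, input-representation cross-subgraph negatives, and neighbor reconstruction) follows the description above.

```python
# Sketch only: Lalign + Lcross + Lreconstruct for one subgraph batch.
import torch
import torch.nn.functional as F


def batch_loss(z, x_all, pairs, cross_neg_ids, neighbors, alpha=0.1, beta=0.1):
    """z: GNN output embeddings of the subgraph's entities, shape (n, d).
    x_all: input representations of all entities (used only for cross-subgraph negatives).
    pairs: LongTensor (m, 2) of in-subgraph training alignment pairs (indices into z).
    cross_neg_ids: LongTensor of entities sampled from other subgraphs (indices into x_all).
    neighbors: list of LongTensors; neighbors[i] holds the in-subgraph neighbors of entity i."""
    z = F.normalize(z, dim=-1)
    src, tgt = z[pairs[:, 0]], z[pairs[:, 1]]

    # Lalign: in-batch negatives -- every other subgraph entity is a negative candidate.
    l_align = F.cross_entropy(src @ z.t(), pairs[:, 1])

    # Lcross: cross-subgraph negatives scored with *input* representations, so entities
    # from subgraphs outside the current batch never pass through the GNN here.
    cross_negs = F.normalize(x_all[cross_neg_ids], dim=-1)
    logits = torch.cat([(src * tgt).sum(-1, keepdim=True),   # positive pair scores
                        src @ cross_negs.t()], dim=1)         # cross-subgraph negative scores
    labels = torch.zeros(src.size(0), dtype=torch.long, device=src.device)
    l_cross = F.cross_entropy(logits, labels)

    # Lreconstruct: pull each entity toward the mean of its in-subgraph neighbors.
    recon = torch.stack([z[nb].mean(0) if len(nb) else z[i] for i, nb in enumerate(neighbors)])
    l_recon = (1.0 - (z * recon).sum(-1)).mean()

    return l_align + alpha * l_cross + beta * l_recon
```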
After training, the entity embeddings obtained from processing each subgraph are integrated into a single, unified embedding space. For landmark entities that appeared in multiple subgraphs, their final representation is the average of their embeddings from all subgraphs they belonged to. Alignment inference is then performed on this unified space. Recognizing that real-world KGs contain many non-matchable entities, the paper proposes a bidirectional kNN search using efficient similarity search libraries like Faiss (1702.08734). An entity e1 from KG1 and an entity e2 from KG2 are considered an alignment pair only if e2 is among the top-k nearest neighbors of e1 in KG2, and e1 is among the top-k nearest neighbors of e2 in KG1. This helps improve precision by filtering out low-confidence or unidirectional matches.
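The bidirectional kNN filter maps directly onto standard Faiss calls. The sketch below assumes cosine similarity over L2-normalized embeddings in the unified space, which is an assumption rather than a stated detail.

```python
# Sketch only: mutual top-k filtering of alignment candidates with Faiss.
import faiss
import numpy as np


def bidirectional_knn_alignment(emb1, emb2, k=5):
    """emb1, emb2: unified-space embeddings of KG1 and KG2 entities (one row per entity).
    Returns (i, j) pairs that appear in each other's top-k nearest-neighbor lists."""
    emb1 = np.ascontiguousarray(emb1, dtype=np.float32)
    emb2 = np.ascontiguousarray(emb2, dtype=np.float32)
    faiss.normalize_L2(emb1)  # inner product == cosine similarity after normalization
    faiss.normalize_L2(emb2)
    d = emb1.shape[1]

    index2 = faiss.IndexFlatIP(d)
    index2.add(emb2)
    _, nn_1to2 = index2.search(emb1, k)  # top-k KG2 neighbors of each KG1 entity

    index1 = faiss.IndexFlatIP(d)
    index1.add(emb1)
    _, nn_2to1 = index1.search(emb2, k)  # top-k KG1 neighbors of each KG2 entity

    back = [set(row) for row in nn_2to1]
    # Keep only mutual matches: e2 in top-k of e1 AND e1 in top-k of e2.
    return [(i, int(j)) for i, row in enumerate(nn_1to2) for j in row if i in back[j]]
```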
The paper introduces a new million-scale dataset, DBpedia1M, derived from multilingual DBpedia, which includes a significant number of non-matchable entities to simulate real-world scenarios more closely. Experiments on DBpedia1M and benchmark OpenEA 15K/100K datasets demonstrate LargeGNN's effectiveness and efficiency. Compared to existing scalable methods like LargeEA (2108.05211) and LIME (2204.05592), LargeGNN achieves superior entity alignment performance, especially on the large-scale dataset with non-matchable entities (higher F1, Precision, Recall, and Hits@1). It also shows competitive or better efficiency (time and memory usage) depending on the partition strategy. The ablation studies confirm the benefits of the centrality-based subgraph generation, cross-subgraph negative sampling, and entity reconstruction strategies, highlighting their role in compensating for the limitations of subgraph-based processing.
Implementation details include hyperparameters inherited from Dual-AMN (2008.09792), specific values for η and λ in the importance and benefit calculations, and a landmark budget k that scales with dataset size. The number of subgraphs was set to 5, 10, and 40 for the 15K, 100K, and 1M datasets, respectively. Bidirectional kNN search uses k=5. The source code is publicly available.
The practical implications are significant: LargeGNN provides a method to apply expressive GNN models to entity alignment on KGs that are orders of magnitude larger than previously feasible, making GNNs a viable option for real-world KG integration tasks. The proposed techniques effectively balance the need for partitioning to manage memory with the requirement to capture global graph structures and alignment signals.