Entity Alignment (EA) aims to identify equivalent entities across different knowledge graphs (KGs). While embedding-based approaches using Graph Neural Networks (GNNs) have shown strong performance, they face significant challenges on large-scale KGs. These challenges stem from two sources: the quadratic complexity of the similarity normalization methods (e.g., Sinkhorn iteration or CSLS) commonly used to mitigate geometric problems such as hubness and isolation in high-dimensional embedding spaces, and the difficulty of training GNNs on massive graphs. Existing scalable methods often trade accuracy for speed by discarding structural information.
The paper proposes ClusterEA, a general framework that addresses these limitations: it scales EA to large KGs while improving accuracy by applying normalization efficiently within mini-batches that contain a high rate of equivalent entities. The framework consists of three main components:
- Stochastic Training of GNNs for EA: ClusterEA first trains a Siamese GNN model on the two large-scale KGs in a stochastic fashion using neighborhood sampling (similar to GraphSAGE). Processing mini-batches of sampled neighbors makes it feasible to train GNNs on massive graphs while losing far less structural information than partitioning-based methods. The goal is to produce entity embeddings that capture structural and relational information. Training uses a Normalized Hard Sample Mining (NHSM) loss, a variant of triplet loss, for efficiency. Any GNN-based EA model can be integrated into this component.
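As a concrete illustration of this component, here is a minimal sketch of stochastic training with neighborhood sampling, assuming a single GraphSAGE-style mean-aggregation layer shared by both KGs, a merged entity ID space, and a plain margin-based triplet loss standing in for NHSM; the names `adj_list`, `seed_pairs`, and `fanout`, as well as all constants, are illustrative rather than the paper's identifiers or settings.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGEEncoder(nn.Module):
    """One mean-aggregation layer, shared by both KGs (Siamese weight sharing)."""
    def __init__(self, num_entities, dim):
        super().__init__()
        self.emb = nn.Embedding(num_entities, dim)
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, nodes, neighbors):
        # neighbors: LongTensor [batch, fanout] of sampled neighbor ids per node
        self_h = self.emb(nodes)                       # [batch, dim]
        neigh_h = self.emb(neighbors).mean(dim=1)      # [batch, dim]
        return F.normalize(self.lin(torch.cat([self_h, neigh_h], dim=-1)), dim=-1)

def sample_neighbors(adj_list, nodes, fanout):
    """Uniformly sample `fanout` neighbors per node, falling back to a self-loop."""
    rows = []
    for n in nodes.tolist():
        neigh = adj_list.get(n, [n])
        rows.append([random.choice(neigh) for _ in range(fanout)])
    return torch.tensor(rows)

def train_step(model, optimizer, adj_list, seed_pairs, fanout=5, margin=1.0):
    """One stochastic step on a random mini-batch of seed alignment pairs.

    seed_pairs: LongTensor [m, 2] of aligned (kg1_entity, kg2_entity) ids.
    """
    batch = seed_pairs[torch.randint(len(seed_pairs), (128,))]
    src, tgt = batch[:, 0], batch[:, 1]
    h_src = model(src, sample_neighbors(adj_list, src, fanout))
    h_tgt = model(tgt, sample_neighbors(adj_list, tgt, fanout))
    h_neg = h_tgt.roll(shifts=1, dims=0)           # cheap in-batch negatives
    pos = (h_src - h_tgt).norm(dim=-1)
    neg = (h_src - h_neg).norm(dim=-1)
    loss = F.relu(pos - neg + margin).mean()       # margin-based triplet loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

In the full framework, any GNN-based EA encoder (e.g., GCNAlign, RREA, or Dual-AMN) would take the place of this toy layer, and neighbor sampling would typically be provided by a graph library rather than a hand-rolled sampler.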
- Learning-based Mini-batch Samplers (ClusterSampler): After obtaining the entity embeddings, ClusterEA employs a novel strategy called ClusterSampler to generate mini-batches with a high concentration of potentially alignable entity pairs. This is critical because normalization methods like Sinkhorn iteration work best when applied to matrices representing potential one-to-one mappings. ClusterSampler is a learning-based approach: it first clusters the known training alignment pairs using either intra-KG structure or inter-KG mapping information to assign 'batch labels'. Then, it trains classifiers (e.g., GCN or XGBoost) on all entities using these labels to predict which batch each entity belongs to. Two specific samplers are introduced:
- Intra-KG Structure-based ClusterSampler (ISCS): Leverages intra-KG structural (neighborhood) information by partitioning one KG with METIS and training a GCN on the other KG to classify its entities into the corresponding batches, using the training alignments as labels.
- Cross-KG Mapping-based ClusterSampler (CMCS): Utilizes the inter-KG mapping information captured in the learned embeddings. It clusters the concatenated embeddings of aligned training pairs using K-Means and then trains XGBoost classifiers on the embeddings of all entities to assign batch IDs (a minimal sketch appears after this list).
These methods produce sets of mini-batches where the entities within each batch have a higher probability of being alignable, making the subsequent normalization more effective. The process supports parallelization on GPUs for efficiency.
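Here is a minimal sketch of the CMCS sampler, assuming the entity embeddings are available as NumPy arrays; the function name `cmcs_batches` and the K-Means/XGBoost settings are illustrative, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBClassifier

def cmcs_batches(emb1, emb2, seed_pairs, num_batches):
    """
    emb1: [n1, d] embeddings of KG1 entities; emb2: [n2, d] embeddings of KG2.
    seed_pairs: [m, 2] array of aligned (kg1_id, kg2_id) training pairs.
    Returns a batch id for every entity of each KG.
    """
    # 1. Cluster the seed pairs on their concatenated embeddings, so both sides
    #    of an aligned pair receive the same batch label by construction.
    pair_feat = np.concatenate(
        [emb1[seed_pairs[:, 0]], emb2[seed_pairs[:, 1]]], axis=1)
    labels = KMeans(n_clusters=num_batches, n_init=10).fit_predict(pair_feat)

    # 2. Train one classifier per KG on the labelled seed entities, then
    #    predict a batch id for every entity of that KG.
    clf1 = XGBClassifier(n_estimators=100, tree_method="hist")
    clf1.fit(emb1[seed_pairs[:, 0]], labels)
    clf2 = XGBClassifier(n_estimators=100, tree_method="hist")
    clf2.fit(emb2[seed_pairs[:, 1]], labels)
    return clf1.predict(emb1), clf2.predict(emb2)
```

ISCS follows the same label-then-classify pattern, but derives the batch labels from a METIS partition of one KG and uses a GCN as the classifier on the other.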
- Fusing Local and Global Similarities (SparseFusion): This component normalizes similarity matrices and combines information from different sources.
- Local Normalization: For each mini-batch generated by ClusterSampler, a local similarity matrix is computed from the entity embeddings (e.g., cosine similarity). Sinkhorn iteration is then applied independently to each local matrix, pushing it toward a permutation matrix and thereby modeling the 1-to-1 mapping assumption within the batch (see the sketch after this list). The normalized local matrices from all batches of a sampler are summed to form a consolidated local similarity matrix.
- Multi-aspect Fusion: To reduce bias from a single sampler, ClusterEA can fuse local similarity matrices obtained from different samplers (e.g., CMCS and ISCS). This is typically done by simply summing the normalized local similarity matrices.
- Global Normalization and Fusion: A global similarity matrix is computed using a sparse approach like K-Nearest Neighbors (K-NN) search via FAISS across all entities in both KGs. A sparse version of CSLS, called Sp-CSLS, is proposed to partially normalize this global matrix efficiently. Sp-CSLS subtracts the average neighborhood similarity from only the non-zero entries and applies min-max normalization. Finally, the fused local similarity matrix and the partially normalized global similarity matrix are combined (e.g., by summing) and subjected to another Sp-CSLS normalization step to produce the final sparse similarity matrix used for predicting alignments (typically using greedy top-1 neighbor matching).
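Returning to the local-normalization step above, the following is a minimal dense sketch of per-batch Sinkhorn normalization and the subsequent summation into a consolidated matrix; it assumes L2-normalized embeddings, and the temperature `tau`, the iteration count, and the dense output matrix are simplifications of the sparse, GPU-parallel implementation.

```python
import torch

def sinkhorn(sim, num_iters=10, tau=0.05):
    """Push a similarity matrix toward a doubly-stochastic (near-permutation) matrix."""
    log_p = sim / tau
    for _ in range(num_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalization
    return log_p.exp()

def local_similarity(emb1, emb2, batches1, batches2, num_batches):
    """Sum Sinkhorn-normalized per-batch similarities into one (n1 x n2) matrix."""
    n1, n2 = emb1.size(0), emb2.size(0)
    fused = torch.zeros(n1, n2)                  # dense here only for illustration
    for b in range(num_batches):
        idx1 = (batches1 == b).nonzero(as_tuple=True)[0]
        idx2 = (batches2 == b).nonzero(as_tuple=True)[0]
        if len(idx1) == 0 or len(idx2) == 0:
            continue
        sim = emb1[idx1] @ emb2[idx2].T          # cosine if embeddings are L2-normalized
        fused[idx1.unsqueeze(1), idx2] = sinkhorn(sim)
    return fused
```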
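Similarly, here is a sketch of an Sp-CSLS-style correction on a sparse K-NN similarity matrix, assuming a SciPy CSR input; the penalty uses both row- and column-wise neighborhood averages in the spirit of CSLS, which may differ in detail from the paper's exact formulation, and the constants are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def sp_csls(sim, eps=1e-12):
    """Subtract average neighborhood similarity from the non-zero entries only,
    then min-max normalize, keeping the matrix sparse throughout."""
    sim = sim.tocsr()
    # Average similarity of each entity to its stored (top-k) neighbors.
    row_mean = np.asarray(sim.sum(axis=1)).ravel() / np.maximum(sim.getnnz(axis=1), 1)
    col_mean = np.asarray(sim.sum(axis=0)).ravel() / np.maximum(sim.getnnz(axis=0), 1)

    out = sim.tocoo()
    data = 2 * out.data - row_mean[out.row] - col_mean[out.col]   # CSLS-style penalty
    data = (data - data.min()) / (data.max() - data.min() + eps)  # min-max normalization
    return sp.csr_matrix((data, (out.row, out.col)), shape=sim.shape)
```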
Practical Implementation:
- Scalability: Stochastic GNN training and the learning-based samplers (CMCS, ISCS) can be implemented with GPU acceleration. Applying Sinkhorn iteration on small mini-batches drastically reduces the memory and computational cost compared to a full global matrix. Sparse operations in SparseFusion and FAISS for K-NN search further enhance scalability.
- Flexibility: ClusterEA is a general framework; different GNN models (like GCNAlign, RREA, Dual-AMN) can be plugged into the stochastic training component, and different clustering/classification models could be explored for ClusterSampler, provided they meet scalability and distinguishability requirements.
- Hyperparameters: Key hyperparameters include the GNN architecture parameters, stochastic training parameters (e.g., neighborhood fanout F, batch sizes N_p and N_n, learning rate, number of epochs), ClusterSampler parameters (number of batches K, clustering/classification model parameters), and SparseFusion parameters (number of Sinkhorn iterations K_s, K-NN size K_r for the global similarity, neighborhood size K_n for Sp-CSLS). The choice of K in ClusterSampler is a trade-off between memory usage during normalization and the complexity of sampling (an illustrative configuration sketch follows this list).
- Data: The framework relies on KG structure and a set of seed alignments for training. It avoids using potentially biased side information like entity names or attributes, focusing on structural alignment.
- Performance: Experiments on large datasets like DBP1M show that ClusterEA significantly outperforms prior scalable methods such as LargeEA in accuracy (e.g., H@1), and achieves results comparable to or better than non-scalable state-of-the-art methods on smaller datasets, while remaining competitive in running time and memory usage. The ablation study confirms the importance of each component, particularly the Sinkhorn normalization and the multi-aspect batch sampling.
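For orientation, here is an illustrative grouping of the hyperparameters listed above as a configuration dictionary; every value below is a placeholder to show structure, not a setting reported in the paper.

```python
# Hypothetical configuration sketch; values are placeholders, not the paper's settings.
config = {
    "gnn": {"layers": 2, "hidden_dim": 128},
    "stochastic_training": {"fanout_F": 5, "batch_size_Np": 2000,
                            "batch_size_Nn": 2000, "lr": 5e-3, "epochs": 50},
    "cluster_sampler": {"num_batches_K": 10, "kmeans_n_init": 10,
                        "xgb_n_estimators": 100},
    "sparse_fusion": {"sinkhorn_iters_Ks": 10, "knn_size_Kr": 50,
                      "csls_neighborhood_Kn": 10},
}
```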
ClusterEA offers a practical approach to tackle the challenge of scaling embedding-based entity alignment to real-world, large-scale knowledge graphs by combining stochastic training, intelligent mini-batch sampling, and efficient similarity normalization/fusion.