- The paper introduces an active learning framework that strategically selects informative instances for efficient entity alignment.
- It formalizes a unique active learning setup for 1:n matching tasks and demonstrates the advantages of centrality and embedding-based heuristics.
- Empirical results on WK3l-15k show that simple heuristics like node centrality markedly improve Hits@1 performance with fewer labels.
This paper (2001.08943) introduces active learning for entity alignment (EA) in knowledge graphs (KGs). Entity alignment, the task of identifying matching entities across different KGs, is crucial for integrating information, but acquiring sufficient labeled training data (seed alignments) is often expensive and requires human annotators. The paper proposes using active learning to strategically select the most informative instances for human labeling, aiming to achieve higher alignment performance with fewer labels.
The paper formalizes the active learning problem for EA, highlighting key differences from classical classification active learning. A significant distinction is the nature of the potential labels: in EA, a query about an entity from one KG can yield multiple matching entities in the other KG (1:n mapping), or reveal that the entity has no match in the other KG (it's "exclusive"). This differs from the typical single class label in classification.
Two potential labeling scenarios are discussed:
- Presenting a pair of entities and asking if they match (True/False). This is simple per query but requires |E_L| × |E_R| potential queries for full coverage and has a low probability of finding positive matches in a random query.
- Presenting an entity from one KG and asking the annotator to find all matching entities in the other KG. This is more complex for the annotator but requires fewer queries overall (|E_L| + |E_R|), has a higher chance of finding positive matches initially, and inherently identifies exclusive nodes.
The authors focus on the second scenario due to its practical advantages, especially for cold-start performance and its ability to identify exclusive nodes.
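The second scenario can be sketched as a simple oracle interface: given one entity from the left KG, the annotator returns every matching entity in the right KG (a 1:n answer), or an empty set when the entity is exclusive. The function names and toy alignment below are illustrative, not the paper's code:

```python
# Sketch of the single-entity query scenario: one query per left-KG entity,
# answered with all matches in the right KG, or marked exclusive if none.
from collections import defaultdict

def build_oracle(gold_alignment):
    """gold_alignment: iterable of (left_id, right_id) ground-truth pairs."""
    left_to_right = defaultdict(set)
    for l, r in gold_alignment:
        left_to_right[l].add(r)

    def query(left_entity):
        matches = left_to_right.get(left_entity, set())
        is_exclusive = len(matches) == 0   # no match anywhere in the right KG
        return matches, is_exclusive

    return query

oracle = build_oracle([(0, 10), (0, 11), (1, 12)])  # entity 0 has a 1:2 mapping
assert oracle(0) == ({10, 11}, False)
assert oracle(2) == (set(), True)   # entity 2 is exclusive
```

One query can thus yield several positive pairs at once, which is why this scenario helps cold-start performance.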
A practical implementation contribution is the "Dataset Adjustment" strategy. Since the labeling process identifies exclusive nodes (entities with no match in the other KG), these nodes can be removed from the KG representations used for the matching model during training. This can make the KGs more similar from the perspective of the matching entities and improve performance, particularly at later stages of training when more exclusive nodes have been identified. This is a practical technique enabled by controlling the data acquisition process.
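Dataset Adjustment amounts to filtering the triple set before each retraining round. A minimal sketch, assuming triples are `(head, relation, tail)` ID tuples (the helper name is hypothetical):

```python
# Drop every triple that touches a known-exclusive entity, so the matching
# model trains on KG views that are more structurally similar to each other.
def adjust_dataset(triples, exclusive_nodes):
    """triples: list of (head, relation, tail); exclusive_nodes: set of entity ids."""
    return [(h, r, t) for (h, r, t) in triples
            if h not in exclusive_nodes and t not in exclusive_nodes]

triples = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]
assert adjust_dataset(triples, {2}) == [(0, 0, 1)]  # both triples touching 2 removed
```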
The paper proposes and evaluates several active learning heuristics for selecting entities to query:
- Node Centrality (Degree, Betweenness): Queries entities with high centrality, hypothesizing that central nodes are more likely to have matches and their labels will provide more information to the graph-based model due to their connections. These can be precomputed, simplifying deployment and parallelization.
- Graph Coverage (AVC): Selects nodes to distribute labeled instances across the graph structure, using an approximate vertex cover approach.
- Embedding Space Coverage (Coreset, ESCCN): Aims to select nodes that represent the embedding space of the KGs well. Coreset selects nodes maximally distant from already selected ones. ESCCN improves upon this by first clustering embeddings and then selecting central nodes (e.g., high degree) within each cluster, balancing representation and structural importance.
- Uncertainty Matching (BALD): Adapts Bayesian uncertainty estimation (using Monte-Carlo Dropout) from classification AL. Queries nodes where the model is most uncertain about the matching prediction. This requires running the matching model multiple times per query iteration.
- Certainty Matching (PREXP): A heuristic specifically designed for EA, which prefers querying nodes likely to have matches. It uses historical maximum similarity scores between entities and their potential matches in the other KG to build probability distributions for matching vs. exclusive nodes and queries entities whose similarity profile is more indicative of having a match.
The evaluation is conducted on the WK3l-15k dataset subsets (en-de, en-fr), using a GCN-Align model. The evaluation framework is batch-wise and pool-based, simulating an oracle (human annotator). It incrementally adds discovered alignments and exclusive nodes to the training data and retrains the model. Performance is measured by Hits@1 on a held-out test set of alignments.
The experimental results highlight several practical findings:
- Removing exclusive nodes identified during labeling significantly improves performance, especially as more nodes are labeled.
- Simple node centrality-based heuristics (Degree, Betweenness) and the ESCCN heuristic perform very well, often matching or exceeding the performance of more complex, model-dependent heuristics.
- Adaptations of state-of-the-art classification AL heuristics like Coreset and Uncertainty Matching perform poorly compared to simpler methods for this specific EA task.
- Finding positive matching pairs is particularly crucial in the early stages of active learning.
The paper concludes that simple, precomputable heuristics like node centrality are highly effective for active learning in EA. Their performance is comparable to or better than model-based approaches, and they offer advantages in terms of computational cost (no need to run the model for selection) and deployment (labeling order can be fixed and parallelized). This suggests that for practical EA labeling pipelines, straightforward graph-based heuristics are a strong starting point.
For implementing these concepts:
- Set up KG representations: Load your KGs and represent them (e.g., adjacency lists, entity/relation dictionaries).
- Select Query Scenario: Choose the single-node query scenario as recommended.
- Implement Exclusive Node Tracking: Maintain sets of identified exclusive nodes for each KG.
- Implement Dataset Adjustment: Modify your training data loading pipeline to exclude triples involving identified exclusive nodes when training the matching model.
- Implement Heuristics:
- Centrality: Calculate degree and betweenness centrality for all nodes in the KGs. Store these scores. For querying, select nodes based on these scores (e.g., highest degree/betweenness).
- Graph Coverage (AVC): Implement the greedy vertex cover approximation algorithm.
- Embedding Space Coverage (ESCCN): Train an initial EA model or use pre-trained embeddings. Implement clustering (e.g., k-means) on the combined entity embeddings. Calculate centrality within clusters. Implement the proposed sampling strategy based on labeled node counts per cluster and centrality.
- Certainty Matching (PREXP): Requires training an EA model. Implement logic to calculate max similarity scores for all entities, fit distributions for labeled matching/exclusive nodes, and score unlabeled nodes based on these distributions.
- Integrate Oracle Simulation: Write a function that takes a list of queried entity IDs and returns the ground-truth matches found in the predefined A_train and any entities that are in X^L ∪ X^R.
- Build Active Learning Loop:
- Initialize pool of query candidates (P_0).
- Loop for a fixed number of queries (budget) or steps:
- Select a batch of entities Q_i from the current pool P_i using the chosen heuristic.
- Query the oracle O(Q_i) to get new alignments A_i and exclusive nodes X_i^L, X_i^R.
- Add new alignments to the training set A_{≤i}.
- Add new exclusive nodes to the sets X_{≤i}^L, X_{≤i}^R.
- Update the pool P_{i+1}.
- Retrain the EA model using A_{≤i} and excluding X_{≤i}^L, X_{≤i}^R. Warm-start training from the previous iteration's model parameters.
- Evaluate the model on the test set A_test.
- Monitor and Compare: Track performance (Hits@1) vs. the number of queries for different heuristics. Calculate metrics like AUC to compare overall efficiency.
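The loop above can be sketched end to end. This is a schematic, not the paper's implementation: `train_model`, `select_batch`, and `oracle` are hypothetical stand-ins for the GCN-Align trainer, the chosen heuristic, and the annotator:

```python
# Pool-based active learning loop: select batch, query oracle, accumulate
# alignments and exclusive nodes, retrain with warm start each iteration.
def active_learning_loop(pool, oracle, select_batch, train_model,
                         batch_size, num_steps):
    alignments, excl_left, excl_right = set(), set(), set()
    model = None
    for _ in range(num_steps):
        batch = select_batch(pool, batch_size)   # Q_i from current pool P_i
        pool -= set(batch)                       # pool update P_{i+1}
        for entity in batch:
            matches = oracle(entity)             # 1:n answer from the annotator
            if matches:
                alignments |= {(entity, m) for m in matches}
            else:
                excl_left.add(entity)            # exclusive node discovered
        # warm-start retraining with Dataset Adjustment applied
        model = train_model(alignments, excl_left, excl_right, warm_start=model)
    return model, alignments, excl_left

# Toy run: four candidate entities, entity 3 is exclusive.
gold = {0: {10}, 1: {11}, 2: {12}}
model, aligned, excl = active_learning_loop(
    pool={0, 1, 2, 3},
    oracle=lambda e: gold.get(e, set()),
    select_batch=lambda pool, k: sorted(pool)[:k],   # placeholder heuristic
    train_model=lambda a, xl, xr, warm_start=None: len(a),  # stub trainer
    batch_size=2, num_steps=2)
assert aligned == {(0, 10), (1, 11), (2, 12)}
assert excl == {3}
```

Swapping in a real heuristic only changes `select_batch`; the bookkeeping for alignments and exclusive nodes stays the same.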
The choice of heuristic involves trade-offs. Centrality-based methods are computationally inexpensive at query-selection time but rely solely on graph structure. Embedding-based methods (ESCCN, PREXP) require a trained model to select queries, adding computational overhead per active learning step but potentially leveraging more nuanced information. Uncertainty methods require multiple model runs per step, increasing cost significantly, and did not perform well in this study. For practical deployment, where labeling can be done in parallel and query-generation cost matters less than annotator cost, precomputable heuristics like Degree or Betweenness offer a good balance of performance and simplicity. If model predictions are readily available after each training step, PREXP or ESCCN are also worth considering.