- The paper introduces AutoAlign, a fully automatic KG alignment framework that leverages LLMs for entity-type alignment, eliminating the need for manual seed alignments.
- It employs three joint embedding modules—predicate, structure, and attribute—to generate unified entity representations for effective cross-graph similarity computation.
- Evaluations on DWY-NB benchmarks show that AutoAlign significantly outperforms seed-dependent methods, especially with attention-based pseudo-type embeddings.
The paper "AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by LLMs" (AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by Large Language Models, 2023) introduces a novel, fully automatic method for aligning entities and predicates between two knowledge graphs (KGs) without requiring any manually crafted seed alignments. Existing methods for KG alignment typically rely on a set of known entity pairs (seed alignments) to train models that learn mappings or align embedding spaces. Obtaining these seeds is expensive, time-consuming, and often requires domain expertise, which limits the scalability and portability of current approaches.
AutoAlign addresses this limitation by proposing a method that leverages both the structural and attribute information within KGs, enhanced by LLMs, to achieve alignment automatically. The core idea is to learn unified embedding spaces for entities and predicates from two source KGs directly, allowing similarity computation across graphs without explicit seed mappings.
The method consists of three core embedding modules: Predicate Embedding, Structure Embedding, and Attribute Embedding, which are learned jointly.
- Predicate Embedding Module: To align predicates automatically (e.g., `lgd:is_in` and `dbp:located_in`), AutoAlign constructs a predicate-proximity-graph, which replaces the head and tail entities of relation triples with their respective entity types (obtained via `rdfs:type` predicates). Since entity types may have different surface forms across KGs (e.g., "people" vs. "person"), the paper proposes using LLMs (specifically, Claude is mentioned) to automatically identify and align synonymous types: the LLM is prompted with the lists of types from both KGs and asked to identify synonymous pairs, after which similar types are represented consistently. To handle entities with multiple types and emphasize more distinctive ones, the module computes "pseudo-type embeddings" using either a weighted sum or an attention-based mechanism over the embeddings of an entity's associated types (see the sketch after this list). These pseudo-type embeddings are then used in a TransE-like objective function $J_{PE}$ to learn predicate embeddings that capture similarity across KGs based on the types of entities they connect.
- Attribute Embedding Module: This module leverages attribute triples (entities linked to literal values) to find entity similarities, since attribute values (names, dates, coordinates) often share similar surface forms or structures for corresponding entities across KGs. AutoAlign proposes attribute character embeddings, representing attribute values as sequences of characters. It explores three compositional functions (SUM, LSTM, N-gram) to encode character sequences into single vectors, finding the N-gram approach most effective at capturing string similarity (a sketch follows this list). An objective function $J_{CE}$ is minimized to learn entity, attribute-predicate, and attribute-value embeddings such that $h + p \approx f_a(v)$, where $f_a(v)$ is the compositional embedding of the attribute value. Because character-level similarity in attribute values naturally transfers across graphs, this module learns attribute embeddings in a unified space across the two KGs.
- Structure Embedding Module: This module focuses on relation triples (entities linked to other entities). It builds upon the TransE model but modifies the objective function $J_{SE}$ by introducing a weighting factor $\alpha$ for each triple based on the frequency of its predicate. This gives higher importance to triples involving more frequent (and thus more likely alignable) predicates, helping to filter out noise from non-aligned predicates (see the weighting sketch after this list).
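To make the pseudo-type computation concrete, here is a minimal NumPy sketch. The attention parameterization (a single learnable query vector scoring each type embedding) is an assumption for illustration, as are the names `pseudo_type_embedding`, `type_embs`, and `query`; the paper's exact scoring function may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pseudo_type_embedding(type_embs, query, use_attention=True):
    """Combine an entity's type embeddings into one pseudo-type vector.

    type_embs: (k, d) array, one row per type the entity carries.
    query:     (d,) attention query vector (hypothetical parameterization;
               in training it would be learned jointly with the embeddings).
    """
    if use_attention:
        scores = type_embs @ query          # one relevance score per type
        weights = softmax(scores)           # distinctive types get more mass
    else:
        # weighted-sum variant, shown here as a uniform average
        weights = np.full(len(type_embs), 1.0 / len(type_embs))
    return weights @ type_embs              # (d,) pseudo-type embedding
```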
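The N-gram compositional function can be sketched as follows, assuming $f_a(v)$ averages character n-gram embeddings for n = 1..3. The lookup table `char_emb` is a hypothetical stand-in for embeddings that would be learned jointly during training.

```python
import numpy as np

def ngram_compose(value, char_emb, max_n=3):
    """N-gram compositional function f_a(v): embed an attribute string as
    the average of its character n-gram embeddings for n = 1..max_n.

    char_emb: dict mapping an n-gram string to its (d,) embedding.
    """
    d = len(next(iter(char_emb.values())))
    vec = np.zeros(d)
    count = 0
    for n in range(1, max_n + 1):
        for i in range(len(value) - n + 1):
            gram = value[i:i + n]
            if gram in char_emb:
                vec += char_emb[gram]
                count += 1
    return vec / max(count, 1)
```

The attraction of n-grams over SUM or LSTM composition is that overlapping substrings (e.g., shared name fragments across KGs) contribute the same embedding mass, which directly rewards string similarity.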
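The frequency-based weighting in the structure module might look like the sketch below, taking $\alpha$ proportional to a predicate's relative frequency; the paper's exact normalization and negative-sampling scheme are not reproduced here.

```python
import numpy as np
from collections import Counter

def predicate_weights(triples):
    """alpha per predicate, proportional to its frequency: frequent
    predicates are more likely to be alignable, so their triples count
    more (a sketch; the paper's normalization may differ)."""
    counts = Counter(p for _, p, _ in triples)
    total = len(triples)
    return {p: c / total for p, c in counts.items()}

def weighted_transe_loss(h, p, t, h_neg, t_neg, alpha, margin=1.0):
    """One triple's contribution to J_SE: alpha-weighted TransE margin loss
    over a positive triple (h, p, t) and a corrupted one (h_neg, p, t_neg)."""
    pos = np.linalg.norm(h + p - t)
    neg = np.linalg.norm(h_neg + p - t_neg)
    return alpha * max(0.0, margin + pos - neg)
```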
Joint Learning and Entity Alignment:
The three modules are trained jointly. The attribute embedding module naturally aligns entities across KGs based on similar attribute values. To integrate this with the structure embeddings (which might initially lie in different spaces), a similarity objective $J_{SIM}$ is added to minimize the difference (measured by cosine similarity) between an entity's structure embedding and its attribute embedding. The overall objective function $J = J_{PE} + J_{SE} + J_{CE} + J_{SIM}$ is minimized to learn the entity, predicate, and attribute embeddings in a shared vector space. After training, entity alignment is performed by finding, for each entity $h_1$ in the first KG, the entity $h_2$ in the second KG with the highest cosine similarity to $h_1$, provided it exceeds a threshold $\beta$ (a minimal inference sketch follows).
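Once the embeddings share one space, the alignment step reduces to thresholded nearest-neighbor search under cosine similarity. A minimal sketch, assuming dense NumPy embedding matrices and hypothetical id lists:

```python
import numpy as np

def align_entities(emb1, emb2, ids1, ids2, beta=0.9):
    """For each entity in KG1, pick the KG2 entity with the highest cosine
    similarity, keeping the pair only if it clears the threshold beta.

    emb1: (n1, d) and emb2: (n2, d) entity embeddings in the shared space.
    """
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = a @ b.T                     # (n1, n2) cosine similarity matrix
    best = sim.argmax(axis=1)         # best KG2 candidate per KG1 entity
    return [(ids1[i], ids2[j])
            for i, j in enumerate(best) if sim[i, j] >= beta]
```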
Triple Enrichment: The method also employs a triple enrichment step based on the transitivity rule, inferring $\langle h_1, p_1.p_2, t_2 \rangle$ from $\langle h_1, p_1, t_1 \rangle$ and $\langle t_1, p_2, t_2 \rangle$, to add one-hop transitive relations (sketched below). This explicitly includes multi-hop relationships, potentially providing more attributes and related entities for each entity, which can further strengthen the attribute embedding and entity similarity identification.
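A small sketch of the one-hop enrichment, assuming triples are (head, predicate, tail) string tuples and writing the inferred predicate as the concatenation `p1.p2`, following the notation above:

```python
from collections import defaultdict

def enrich_one_hop(triples):
    """Add one-hop transitive triples: from <h, p1, t1> and <t1, p2, t2>
    infer <h, 'p1.p2', t2>."""
    by_head = defaultdict(list)
    for h, p, t in triples:
        by_head[h].append((p, t))
    enriched = list(triples)
    for h, p1, t1 in triples:
        for p2, t2 in by_head.get(t1, []):   # triples whose head is t1
            enriched.append((h, f"{p1}.{p2}", t2))
    return enriched
```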
Implementation and Evaluation: AutoAlign is evaluated on the DWY-NB benchmark datasets (DW-NB and DY-NB), comparing its performance (measured by Hits@k; see the helper below) against various state-of-the-art translation-based and GNN-based methods. Crucially, AutoAlign is also evaluated in a 0% seed-alignment setting, in which seed-dependent methods cannot operate at all. The results demonstrate that AutoAlign, particularly the version using the attention-based pseudo-type embedding (AutoAlign-A), significantly outperforms all baseline methods across different amounts of seed alignments (even when baselines use up to 50% seeds), and achieves high accuracy even with no seeds. Ablation studies confirm the effectiveness of both the attribute embedding module and the proposed automatic predicate embedding module, highlighting the N-gram compositional function's superiority and the attention mechanism's benefit for predicate embedding. The learned embeddings are also shown to be effective for the downstream task of knowledge graph completion. The paper also discusses scalability, noting that AutoAlign has a time complexity of O(M) (where M is the total number of triples), comparable to other embedding-based methods, and is more memory-efficient than GNN-based approaches.
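For reference, Hits@k as reported in the evaluation can be computed as in the following sketch (the standard metric definition, not code from the paper); `ranked_candidates` maps each test entity to its similarity-ranked KG2 candidates and `gold` to its true counterpart:

```python
def hits_at_k(ranked_candidates, gold, k=10):
    """Hits@k: fraction of test entities whose true counterpart appears
    among the top-k ranked candidates."""
    hits = sum(1 for ent, cands in ranked_candidates.items()
               if gold[ent] in cands[:k])
    return hits / len(ranked_candidates)
```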
In summary, AutoAlign is the first method to achieve fully automatic KG alignment by eliminating the need for manual seed alignments. It accomplishes this through a novel combination of attribute character embeddings, a predicate-proximity-graph enabled by LLM-based type alignment, and a joint learning framework, demonstrating superior effectiveness and practicality compared to existing seed-dependent methods.