- The paper introduces AutoAlign, a fully automatic KG alignment framework that leverages LLMs for entity-type alignment, eliminating the need for manual seed alignments.
- It employs three joint embedding modules—predicate, structure, and attribute—to generate unified entity representations for effective cross-graph similarity computation.
- Evaluations on DWY-NB benchmarks show that AutoAlign significantly outperforms seed-dependent methods, especially with attention-based pseudo-type embeddings.
The paper "AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by LLMs" (AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by Large Language Models, 2023) introduces a novel, fully automatic method for aligning entities and predicates between two knowledge graphs (KGs) without requiring any manually crafted seed alignments. Existing methods for KG alignment typically rely on a set of known entity pairs (seed alignments) to train models that learn mappings or align embedding spaces. Obtaining these seeds is expensive, time-consuming, and often requires domain expertise, which limits the scalability and portability of current approaches.
AutoAlign addresses this limitation by proposing a method that leverages both the structural and attribute information within KGs, enhanced by LLMs, to achieve alignment automatically. The core idea is to learn unified embedding spaces for entities and predicates from two source KGs directly, allowing similarity computation across graphs without explicit seed mappings.
The method consists of three core embedding modules: Predicate Embedding, Structure Embedding, and Attribute Embedding, which are learned jointly.
- Predicate Embedding Module: To align predicates automatically (e.g., `lgd:is_in` and `dbp:located_in`), AutoAlign constructs a predicate-proximity-graph, which replaces the head and tail entities of relation triples with their respective entity types (obtained via `rdfs:type` predicates). Since entity types may have different surface forms across KGs (e.g., "people" vs. "person"), the paper proposes using LLMs (specifically, Claude is mentioned) to automatically identify and align synonymous types: the LLM is prompted with the lists of types from both KGs and asked to identify synonymous pairs, after which similar types are represented consistently. To handle entities with multiple types and emphasize more distinctive ones, the module computes "pseudo-type embeddings" using either a weighted sum or an attention-based mechanism over the embeddings of an entity's associated types (see the sketch after this list). These pseudo-type embeddings are then used in a TransE-like objective function $J_{PE}$ to learn predicate embeddings that capture similarity across KGs based on the types of entities they connect.
- Attribute Embedding Module: This module leverages attribute triples (entities linked to literal values) to find entity similarities, since attribute values (names, dates, coordinates) often share similar surface forms or structures for corresponding entities across KGs. AutoAlign proposes attribute character embeddings, representing attribute values as sequences of characters. It explores three compositional functions (SUM, LSTM, N-gram) to encode character sequences into single vectors, finding the N-gram approach most effective at capturing string similarity (a sketch follows this list). An objective function $J_{CE}$ is minimized to learn entity, attribute-predicate, and attribute-value embeddings such that $h + p \approx f_a(v)$, where $f_a(v)$ is the compositional embedding of the attribute value. Because character-level similarity in attribute values naturally transfers across graphs, this module learns attribute embeddings in a unified space across the two KGs.
- Structure Embedding Module: This module focuses on relation triples (entities linked to other entities). It builds upon the TransE model but modifies the objective function $J_{SE}$ by introducing a weighting factor $\alpha$ for each triple based on the frequency of its predicate. This gives higher importance to triples involving more frequent (and thus more likely alignable) predicates, helping to filter out noise from non-aligned predicates (see the weighting sketch after this list).
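To make the pseudo-type computation concrete, here is a minimal NumPy sketch. The attention parameterization (a single learnable query vector scoring each type embedding) is an assumption for illustration, as are the names `pseudo_type_embedding`, `type_embs`, and `query`; the paper's exact scoring function may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pseudo_type_embedding(type_embs, query, use_attention=True):
    """Combine an entity's type embeddings into one pseudo-type vector.

    type_embs: (k, d) array, one row per type the entity carries.
    query:     (d,) attention query vector (hypothetical parameterization;
               in training it would be learned jointly with the embeddings).
    """
    if use_attention:
        scores = type_embs @ query          # one relevance score per type
        weights = softmax(scores)           # distinctive types get more mass
    else:
        # weighted-sum variant, shown here as a uniform average
        weights = np.full(len(type_embs), 1.0 / len(type_embs))
    return weights @ type_embs              # (d,) pseudo-type embedding
```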
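The N-gram compositional function can be sketched as follows, assuming $f_a(v)$ averages character n-gram embeddings for n = 1..3. The lookup table `char_emb` is a hypothetical stand-in for embeddings that would be learned jointly during training.

```python
import numpy as np

def ngram_compose(value, char_emb, max_n=3):
    """N-gram compositional function f_a(v): embed an attribute string as
    the average of its character n-gram embeddings for n = 1..max_n.

    char_emb: dict mapping an n-gram string to its (d,) embedding.
    """
    d = len(next(iter(char_emb.values())))
    vec = np.zeros(d)
    count = 0
    for n in range(1, max_n + 1):
        for i in range(len(value) - n + 1):
            gram = value[i:i + n]
            if gram in char_emb:
                vec += char_emb[gram]
                count += 1
    return vec / max(count, 1)
```

The attraction of n-grams over SUM or LSTM composition is that overlapping substrings (e.g., shared name fragments across KGs) contribute the same embedding mass, which directly rewards string similarity.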
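The frequency-based weighting in the structure module might look like the sketch below, taking $\alpha$ proportional to a predicate's relative frequency; the paper's exact normalization and negative-sampling scheme are not reproduced here.

```python
import numpy as np
from collections import Counter

def predicate_weights(triples):
    """alpha per predicate, proportional to its frequency: frequent
    predicates are more likely to be alignable, so their triples count
    more (a sketch; the paper's normalization may differ)."""
    counts = Counter(p for _, p, _ in triples)
    total = len(triples)
    return {p: c / total for p, c in counts.items()}

def weighted_transe_loss(h, p, t, h_neg, t_neg, alpha, margin=1.0):
    """One triple's contribution to J_SE: alpha-weighted TransE margin loss
    over a positive triple (h, p, t) and a corrupted one (h_neg, p, t_neg)."""
    pos = np.linalg.norm(h + p - t)
    neg = np.linalg.norm(h_neg + p - t_neg)
    return alpha * max(0.0, margin + pos - neg)
```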
Joint Learning and Entity Alignment:
The three modules are trained jointly. The attribute embedding module naturally aligns entities across KGs based on similar attribute values. To integrate this with the structure embeddings (which might initially lie in different spaces), a similarity objective $J_{SIM}$ is added to minimize the difference (measured by cosine similarity) between an entity's structure embedding and its attribute embedding. The overall objective function $J = J_{PE} + J_{SE} + J_{CE} + J_{SIM}$ is minimized to learn the entity, predicate, and attribute embeddings in a shared vector space. After training, entity alignment is performed by finding, for each entity $h_1$ in the first KG, the entity $h_2$ in the second KG with the highest cosine similarity to $h_1$, provided it exceeds a threshold $\beta$ (a minimal inference sketch follows).
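Once the embeddings share one space, the alignment step reduces to thresholded nearest-neighbor search under cosine similarity. A minimal sketch, assuming dense NumPy embedding matrices and hypothetical id lists:

```python
import numpy as np

def align_entities(emb1, emb2, ids1, ids2, beta=0.9):
    """For each entity in KG1, pick the KG2 entity with the highest cosine
    similarity, keeping the pair only if it clears the threshold beta.

    emb1: (n1, d) and emb2: (n2, d) entity embeddings in the shared space.
    """
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = a @ b.T                     # (n1, n2) cosine similarity matrix
    best = sim.argmax(axis=1)         # best KG2 candidate per KG1 entity
    return [(ids1[i], ids2[j])
            for i, j in enumerate(best) if sim[i, j] >= beta]
```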
Triple Enrichment: The method also employs a triple enrichment step based on the transitivity rule, inferring $\langle h_1, p_1.p_2, t_2 \rangle$ from $\langle h_1, p_1, t_1 \rangle$ and $\langle t_1, p_2, t_2 \rangle$, to add one-hop transitive relations (sketched below). This explicitly includes multi-hop relationships, potentially providing more attributes and related entities for each entity, which can further strengthen the attribute embedding and entity similarity identification.
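A small sketch of the one-hop enrichment, assuming triples are (head, predicate, tail) string tuples and writing the inferred predicate as the concatenation `p1.p2`, following the notation above:

```python
from collections import defaultdict

def enrich_one_hop(triples):
    """Add one-hop transitive triples: from <h, p1, t1> and <t1, p2, t2>
    infer <h, 'p1.p2', t2>."""
    by_head = defaultdict(list)
    for h, p, t in triples:
        by_head[h].append((p, t))
    enriched = list(triples)
    for h, p1, t1 in triples:
        for p2, t2 in by_head.get(t1, []):   # triples whose head is t1
            enriched.append((h, f"{p1}.{p2}", t2))
    return enriched
```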
Implementation and Evaluation: AutoAlign is evaluated on the DWY-NB benchmark datasets (DW-NB and DY-NB), comparing its performance (measured by Hits@k; see the helper below) against various state-of-the-art translation-based and GNN-based methods. Crucially, AutoAlign is also evaluated in a 0% seed-alignment setting, in which seed-dependent methods cannot operate at all. The results demonstrate that AutoAlign, particularly the version using the attention-based pseudo-type embedding (AutoAlign-A), significantly outperforms all baseline methods across different amounts of seed alignments (even when baselines use up to 50% seeds), and achieves high accuracy even with no seeds. Ablation studies confirm the effectiveness of both the attribute embedding module and the proposed automatic predicate embedding module, highlighting the N-gram compositional function's superiority and the attention mechanism's benefit for predicate embedding. The learned embeddings are also shown to be effective for the downstream task of knowledge graph completion. The paper also discusses scalability, noting that AutoAlign has a time complexity of O(M) (where M is the total number of triples), comparable to other embedding-based methods, and is more memory-efficient than GNN-based approaches.
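For reference, Hits@k as reported in the evaluation can be computed as in the following sketch (the standard metric definition, not code from the paper); `ranked_candidates` maps each test entity to its similarity-ranked KG2 candidates and `gold` to its true counterpart:

```python
def hits_at_k(ranked_candidates, gold, k=10):
    """Hits@k: fraction of test entities whose true counterpart appears
    among the top-k ranked candidates."""
    hits = sum(1 for ent, cands in ranked_candidates.items()
               if gold[ent] in cands[:k])
    return hits / len(ranked_candidates)
```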
In summary, AutoAlign is the first method to achieve fully automatic KG alignment by eliminating the need for manual seed alignments. It accomplishes this through a novel combination of attribute character embeddings, a predicate-proximity-graph enabled by LLM-based type alignment, and a joint learning framework, demonstrating superior effectiveness and practicality compared to existing seed-dependent methods.