
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

Published 21 Apr 2025 in cs.IR, cs.AI, and cs.CL | arXiv:2504.15135v1

Abstract: Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs LLMs to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.

Summary

  • The paper proposes a three-stage KGMEL framework (generate, retrieve, rerank) that integrates knowledge graph triples with multimodal data for enhanced entity linking.
  • The method employs a pretrained CLIP model and dual cross-attention to fuse textual, visual, and triple information into unified embeddings.
  • Experiments on benchmark datasets demonstrate up to 19.13% HITS@1 improvement, validating the framework’s robust performance and practical impact.

This paper introduces KGMEL, a novel framework for Multimodal Entity Linking (MEL) that enhances performance by incorporating structural information from Knowledge Graph (KG) triples. Traditional MEL methods link textual mentions accompanied by images to entities in a knowledge base using text and visual features but often ignore the rich context provided by KG triples associated with entities. The authors observe that KG triples are abundant and can serve as semantic bridges, linking mentions and entities that might appear dissimilar based solely on text descriptions.

KGMEL addresses two challenges in using KG triples (mentions have no associated triples, while entities often carry many irrelevant ones) through a three-stage generate-retrieve-rerank process:

  1. Generation: Since mentions don't inherently have triples, KGMEL uses vision-language models (VLMs) to generate a set of relevant triples $\mathcal{T}_m$ for each mention $m$ based on its textual ($t_m$) and visual ($v_m$) context. A structured prompt guides the VLM to identify the mention's type, describe it, and generate triples, optionally using provided relation examples.
  2. Retrieval: This stage learns joint representations for mentions and entities by integrating text, image, and triple information.
    • Encoding: Text ($t_m$, $t_e$) and image ($v_m$, $v_e$) features are encoded using a pretrained CLIP model. For triples ($\mathcal{T}_m$, $\mathcal{T}_e$), relation and tail embeddings (also via CLIP) are combined using an MLP with a residual connection.
    • Triple Aggregation: The triple embeddings of a mention/entity are aggregated into a single vector ($\mathbf{Z}_m$, $\mathbf{Z}_e$) using a dual cross-attention mechanism weighted by the text and image embeddings, followed by top-$p$ filtering to retain only the most relevant triples.
    • Fusion: Text ($\mathbf{T}$), image ($\mathbf{V}$), and aggregated triple ($\mathbf{Z}$) embeddings are fused using a gated mechanism to produce the final embeddings ($\mathbf{X}_m$, $\mathbf{X}_e$).
    • Learning & Retrieval: The model is trained using contrastive losses (mention-entity, mention-mention, entity-entity) to align mention embeddings with their corresponding entity embeddings. The learned embeddings are then used to retrieve the top-$K$ candidate entities $\mathcal{C}(m)$ for a given mention based on dot-product similarity.
  3. Reranking: This stage refines the candidate list to identify the best match.
    • Triple Filtering: For each candidate entity $e \in \mathcal{C}(m)$, its potentially large set of KG triples $\mathcal{T}_e$ is filtered to keep only those whose relations and tails are most similar (top-$n$) to the relations and tails in the mention's generated triples $\mathcal{T}_m$. This results in a filtered set $\mathcal{T}_e^{(\mathrm{filt})}$.
    • Zero-Shot Reranking: An LLM is prompted with the mention's text ($t_m$) and generated triples ($\mathcal{T}_m$), along with the text ($t_e$) and filtered triples ($\mathcal{T}_e^{(\mathrm{filt})}$) of each candidate entity in $\mathcal{C}(m)$. The LLM identifies supporting triples and selects the final best-matching entity $e_m^*$.
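The triple-aggregation step in the retrieval stage can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the exact attention parameterization and the precise top-$p$ rule are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_triples(triples, text, image, p=0.8):
    """Dual cross-attention over triple embeddings, queried by the text and
    image embeddings, with top-p filtering (sketch; parameterization assumed).
    triples: (n, d) array; text, image: (d,) arrays."""
    a_t = softmax(triples @ text)    # attention weights from the text query
    a_v = softmax(triples @ image)   # attention weights from the image query
    w = (a_t + a_v) / 2.0            # combined dual-attention weights
    order = np.argsort(-w)           # triples sorted by weight, descending
    mass = np.cumsum(w[order])
    k = int(np.searchsorted(mass, p)) + 1  # smallest prefix with mass >= p
    keep = order[:k]
    w_kept = w[keep] / w[keep].sum()       # renormalize the kept weights
    return w_kept @ triples[keep]          # weighted sum -> (d,) vector Z
```

With `p=1.0` no triple is dropped and the result reduces to a plain attention-weighted sum.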
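The gated fusion step also admits a minimal sketch. The gate parameterization below (elementwise gates computed from the text embedding via hypothetical weight matrices `W_v` and `W_z`) is an assumption; the paper's exact gating form may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(t, v, z, W_v, W_z):
    """Fuse text t, image v, and aggregated-triple z embeddings (all (d,)).
    W_v, W_z: (d, d) hypothetical gate weights (assumed parameterization)."""
    g_v = sigmoid(t @ W_v)        # elementwise gate for the image signal
    g_z = sigmoid(t @ W_z)        # elementwise gate for the triple signal
    x = t + g_v * v + g_z * z     # gated residual combination
    return x / np.linalg.norm(x)  # unit-normalize for similarity retrieval
```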
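The mention-entity contrastive objective is, in spirit, an in-batch InfoNCE loss; here is a minimal NumPy sketch. The temperature value and the use of in-batch negatives are assumptions, and the paper's full objective also includes mention-mention and entity-entity terms not shown here.

```python
import numpy as np

def contrastive_loss(M, E, tau=0.07):
    """In-batch mention-to-entity InfoNCE loss (sketch).
    M, E: (B, d) embedding matrices; row i of M matches row i of E."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    logits = (M @ E.T) / tau                     # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))              # -log P(gold entity | mention)
```

Aligned pairs drive the loss toward zero; mismatched pairs are penalized, which is what pulls each mention embedding toward its gold entity.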
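Finally, the reranking-stage triple filter can be sketched as a cosine top-$n$ selection over triple embeddings; the exact similarity and scoring used by the authors (who match relations and tails separately) may differ from this simplified version.

```python
import numpy as np

def filter_entity_triples(E_trip, M_trip, n=2):
    """Keep the n entity-triple embeddings most similar (cosine) to any of
    the mention's generated-triple embeddings (simplified sketch).
    E_trip: (num_entity_triples, d); M_trip: (num_mention_triples, d)."""
    E = E_trip / np.linalg.norm(E_trip, axis=1, keepdims=True)
    M = M_trip / np.linalg.norm(M_trip, axis=1, keepdims=True)
    best = (E @ M.T).max(axis=1)           # best mention match per entity triple
    keep = np.sort(np.argsort(-best)[:n])  # indices of the top-n entity triples
    return keep
```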

Experiments conducted on three benchmark datasets (WikiDiverse, RichpediaMEL, WikiMEL) show that KGMEL significantly outperforms existing state-of-the-art retrieval-based and generative MEL methods. The retrieval stage alone shows strong performance, and the reranking stage provides further substantial improvements, achieving up to 19.13% higher HITS@1 than the best competitor. Ablation studies confirm the positive contributions of incorporating visual information, triple information, and the gated fusion mechanism. The framework demonstrates robustness with different VLMs used for triple generation. Case studies illustrate how generated triples capture key information from both text and images, aiding disambiguation.
