- The paper introduces MKGformer, a hybrid transformer that fuses text and visual data to enhance multimodal knowledge graph completion.
- It employs Prefix-guided Interaction and Correlation-aware Fusion modules to reduce modality heterogeneity and mitigate noise.
- Experimental results demonstrate significant improvements in link prediction, relation extraction, and named entity recognition across benchmark datasets.
The paper by Xiang Chen and colleagues, presented at SIGIR 2022, addresses the challenge of completing Multimodal Knowledge Graphs (MKGs), which organize both visual and textual knowledge but often suffer from incompleteness. This incompleteness limits their utility in applications such as multimodal information retrieval, question answering, and recommendation. The paper proposes a hybrid transformer architecture with multi-level fusion for multimodal knowledge graph completion (MKGC), covering multimodal link prediction, relation extraction (RE), and named entity recognition (NER).
Methodology Overview
The primary contribution of the paper is MKGformer, a hybrid transformer that unifies the processing of textual and visual inputs across multiple MKG tasks. It uses a Vision Transformer (ViT) to encode visual data and BERT to encode text, integrating the two through a stack of layers called the M-Encoder. The M-Encoder performs multi-level fusion, which is crucial for addressing modality heterogeneity and the noise introduced by irrelevant visual content.
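As a rough picture of how these pieces fit together, the sketch below wires HuggingFace-style ViT and BERT encoders to a stack of fusion layers. The `encode` function and the `fusion_layers` argument are illustrative names introduced here; note that in the paper the fusion takes place within the last several ViT/BERT layers, whereas this simplified sketch applies it on top of the unimodal outputs.

```python
import torch
from transformers import BertModel, BertTokenizerFast, ViTModel

# Unimodal backbones (HuggingFace-style); the fusion layers are sketched after the next list.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

def encode(texts, pixel_values, fusion_layers):
    """Run both unimodal encoders, then let the fusion layers update both streams."""
    tokens = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    text_hidden = text_encoder(**tokens).last_hidden_state                        # (B, L_t, 768)
    visual_hidden = vision_encoder(pixel_values=pixel_values).last_hidden_state   # (B, 197, 768)
    for layer in fusion_layers:   # M-Encoder-style fusion layers (illustrative)
        text_hidden, visual_hidden = layer(text_hidden, visual_hidden)
    return text_hidden            # fused textual representations for downstream task heads

# Example call with dummy pixel values and the FusionLayer sketched further below:
# fused = encode(["Steve Jobs founded [MASK]."], torch.randn(1, 3, 224, 224), [FusionLayer()])
```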
Multi-level Fusion
- Prefix-Guided Interaction (PGI) Module: placed in the transformer's multi-head attention layer, this module provides coarse-grained interaction between the text and visual modalities, pre-reducing modality heterogeneity by letting textual representations influence the visual attention weights.
- Correlation-Aware Fusion (CAF) Module: integrated into the feed-forward layer, this module performs fine-grained alignment between text tokens and image patches, mitigating the effect of irrelevant visual content by strengthening the correspondence between salient textual entities and related visual features. A minimal sketch of both modules follows this list.
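To make the two modules concrete, here is a minimal PyTorch sketch of a single fusion layer in the spirit of the descriptions above. The class name `FusionLayer`, the linear prefix projection, and the softmax token-patch affinity are simplifying choices made for readability, not the authors' released implementation, which should be consulted for the exact formulation.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Illustrative fusion layer: prefix-guided interaction in the attention step,
    correlation-aware fusion in the feed-forward step (a sketch, not the paper's code)."""

    def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        # PGI: visual self-attention whose keys/values are extended with projected
        # textual states, so the text can steer the visual attention weights.
        self.text_to_prefix = nn.Linear(dim, dim)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_norm = nn.LayerNorm(dim)

        # Standard textual self-attention sub-layer.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_norm1 = nn.LayerNorm(dim)

        # CAF: token-patch affinity gates how much visual evidence reaches each token's FFN update.
        self.visual_proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.text_norm2 = nn.LayerNorm(dim)

    def forward(self, text_hidden, visual_hidden):
        # ---- Prefix-guided interaction (coarse-grained) ----
        prefix = self.text_to_prefix(text_hidden)              # (B, L_t, D)
        kv = torch.cat([prefix, visual_hidden], dim=1)         # text prefix + image patches
        v_out, _ = self.visual_attn(visual_hidden, kv, kv)
        visual_hidden = self.visual_norm(visual_hidden + v_out)

        # ---- Textual self-attention ----
        t_out, _ = self.text_attn(text_hidden, text_hidden, text_hidden)
        text_hidden = self.text_norm1(text_hidden + t_out)

        # ---- Correlation-aware fusion (fine-grained) ----
        # Per-token affinity over patches selects relevant visual content,
        # damping the influence of unrelated patches.
        affinity = torch.softmax(
            text_hidden @ visual_hidden.transpose(1, 2) / text_hidden.size(-1) ** 0.5,
            dim=-1,
        )                                                       # (B, L_t, L_v)
        visual_context = affinity @ self.visual_proj(visual_hidden)   # (B, L_t, D)
        text_hidden = self.text_norm2(text_hidden + self.ffn(text_hidden + visual_context))
        return text_hidden, visual_hidden

# Usage with dummy inputs (batch of 2, 16 text tokens, 197 ViT patch tokens):
# layer = FusionLayer()
# text, visual = layer(torch.randn(2, 16, 768), torch.randn(2, 197, 768))
```

The key design point illustrated here is that the text-derived prefix shapes the visual attention map (coarse level), while the per-token affinity gates how much visual evidence reaches each token's feed-forward update (fine level); a stack of such layers can be dropped into the `encode` sketch shown earlier.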
Experimental Results
MKGformer was evaluated on several benchmarks. On multimodal link prediction (FB15K-237-IMG and WN18-IMG), it achieved state-of-the-art results, with substantial gains in Hits@10 and Mean Rank over previous approaches such as RSME. On multimodal RE and NER (the MNRE and Twitter-2017 datasets), it surpassed recent baselines such as MEGA and UMGF, with particularly strong improvements under low-resource conditions. These results indicate that the model robustly and effectively integrates multimodal information for knowledge graph tasks.
Implications and Speculations
The implications of this research extend beyond MKGC to broader vision-and-language applications. The success of a single unified transformer across several multimodal tasks suggests a path toward more versatile systems that learn generalized representations from diverse forms of data. The hybrid design also opens opportunities to refine pre-training strategies for tasks that must balance visual and textual signals.
Future Directions
The paper opens avenues for future research, including other multimodal domains such as sentiment analysis and event extraction, where visual context can meaningfully supplement textual analysis. Extending the framework to pre-training for MKGC could further improve generalizability and performance across applications.
In conclusion, this paper contributes a significant methodological advance in multimodal knowledge graph completion and offers a promising direction for integrating visual and textual modalities through unified transformer architectures.