Multimodal Relation Extraction

Updated 6 December 2025
  • Multimodal relation extraction is a field that integrates textual and visual data to extract semantic relationships and build rich knowledge graphs.
  • Key methodologies include joint extraction, retrieval-augmented models, and graph-based fusion to effectively merge and align data across modalities.
  • Recent advances leverage cross-modal attention and variational techniques, yielding notable F₁ improvements on benchmark evaluations.

Multimodal relation extraction (MRE) is the task of identifying semantic relationships between entities by leveraging both textual and visual information, often in support of knowledge graph construction. MRE systems aim to resolve relations not only between text-derived entity pairs but also across modalities, such as linking textual mentions to detected objects in paired images. Developments in this field have produced a variety of approaches encompassing joint extraction, retrieval-augmented models, cross-modal alignment, graph-based fusion, and robust evaluation protocols.

1. Task Formalization and Scope

MRE extends traditional, text-only relation extraction by grounding the task in a multimodal context, typically involving paired text (e.g., captions, sentences, headlines) and images (or more generally, visual content). The relation arguments may both occur in text, both in image, or one in each modality (e.g., a text entity and an image object) (He et al., 2023, Hei et al., 16 Aug 2024, Lin et al., 5 Sep 2025). The challenge requires systems to (a) extract possible entities/objects from each modality and (b) classify or retrieve the relation between any admissible pair.

Formally, given a text $T$ and image $I$, the goal is to predict sets of relational triples or quintuples, e.g., for object-entity extraction: $\{(e_o, r, e_t)\} \subseteq \mathcal{O} \times \mathcal{R} \times \mathcal{E}$, where $\mathcal{O}$ is the set of detected objects, $\mathcal{E}$ the set of text entities, and $\mathcal{R}$ the set of relation types with $r \in \mathcal{R}$ (He et al., 2023). More complex joint settings formulate the output as quintuples $(e_1, t_1, e_2, t_2, r)$, with $e_1, e_2$ entity spans and $t_1, t_2$ their types, requiring simultaneous entity and relation extraction (Yuan et al., 2022).
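For concreteness, these output structures can be sketched in Python as plain records; the field names, spans, and the example relation below are illustrative rather than drawn from any particular dataset or codebase.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ObjectEntityTriple:
    """Inter-modal triple (e_o, r, e_t): image object, relation, text entity."""
    object_box: Tuple[float, float, float, float]  # detected object (x1, y1, x2, y2)
    relation: str                                  # relation type r in R
    text_entity: Tuple[int, int]                   # entity span (start, end) in the text

@dataclass(frozen=True)
class JointQuintuple:
    """Joint-extraction quintuple (e1, t1, e2, t2, r)."""
    head_span: Tuple[int, int]   # e1: head entity span
    head_type: str               # t1: head entity type
    tail_span: Tuple[int, int]   # e2: tail entity span
    tail_type: str               # t2: tail entity type
    relation: str                # r: relation type

# Illustrative example: "Steve Jobs founded Apple" with a paired image of Jobs
triple = ObjectEntityTriple(object_box=(12.0, 8.0, 180.0, 240.0),
                            relation="member_of", text_entity=(19, 24))
quintuple = JointQuintuple(head_span=(0, 10), head_type="PER",
                           tail_span=(19, 24), tail_type="ORG",
                           relation="member_of")
```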

2. Model Architectures and Fusion Mechanisms

A spectrum of architectures underpins modern MRE approaches, each optimizing for the multimodal nature of the task:

  • Graph-based Fusion: Graphs encode structural dependencies within and across modalities. Edge-enhanced graph alignment networks (EEGA) construct textual dependency graphs and visual object relationship graphs, then use optimal-transport based matching (Wasserstein and Gromov-Wasserstein distances) to align both nodes and edges, leveraging cross-modal semantics for joint prediction (Yuan et al., 2022).
  • Hierarchical Visual Prefix Networks: HVPNeT employs a BERT backbone with plug-in visual prefixes extracted at multiple scales via ResNet, fused hierarchically via dynamic gating and used as attention biases at each layer (Chen et al., 2022).
  • Implicit Cross-modal Transformers: IFAformer realizes fine-grained, layer-wise text-image alignment by allowing both text and visual streams to attend to each other at each Transformer layer, capturing object-level and global-contextual cues (Li et al., 2022).
  • Query-based Joint Extraction: QEOT reframes the task via a query-based architecture: independent text/image encoders are cross-fused via selective attention, after which a transformer decoder (with $Q$ learnable queries) jointly predicts entity spans, relations, and object bounding boxes without requiring pre-annotated entities or objects (Hei et al., 16 Aug 2024).
  • Mixture-of-Experts and Optimal Transport: REMOTE's multilevel optimal transport mechanism fuses token features from multiple encoder layers to preserve both low- and high-level attributes, routed dynamically via a mixture-of-experts gating scheme to handle intra- and inter-modal triplet extraction (Lin et al., 5 Sep 2025).
  • Variational Hypergraph Attention: VM-HAN builds multimodal hypergraphs per entity-pair, leveraging Gaussian node embeddings and hyperedges to model complex intra/inter-modal correlations, updating representations via variational graph attention for each pair (Li et al., 18 Apr 2024).
  • Retrieval-augmented Fusion: MoRe retrieves external textual/visual knowledge for both input text and images (e.g., Wikipedia paragraphs, images), then ensembles predictions from separate text and image experts using a learned gating strategy (Wang et al., 2022).
  • Retrieval-over-classification Paradigm: ROC replaces one-hot relation classification with relation retrieval by expanding each label to a natural language description, encoding both entity pairs and relation semantics and applying contrastive alignment in the joint embedding space (Hei et al., 25 Sep 2025).

These architectures employ various fusion mechanisms (cross-modal attention, prefix-tuning, mixture-of-experts, topic modeling, graph alignment, and knowledge retrieval) to model complex interactions and mitigate cross-modal differences.
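As a generic illustration of the cross-modal attention that most of these fusion mechanisms build on, the following PyTorch sketch lets text token states attend to projected visual region features. The dimensions, module names, and residual design are assumptions made for the example; this is not the implementation of any specific model above.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend to visual region features (generic cross-attention fusion)."""
    def __init__(self, text_dim: int = 768, vis_dim: int = 2048, n_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)   # map visual features into text space
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_feats: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L_t, text_dim) token states from a text encoder
        # vis_feats:  (B, L_v, vis_dim) region/patch features from a visual backbone
        vis = self.vis_proj(vis_feats)
        attended, _ = self.cross_attn(query=text_feats, key=vis, value=vis)
        return self.norm(text_feats + attended)        # residual fusion

# Usage with random tensors standing in for encoder outputs
fusion = CrossModalFusion()
text = torch.randn(2, 32, 768)     # e.g., BERT token states
vision = torch.randn(2, 49, 2048)  # e.g., ResNet grid features
fused = fusion(text, vision)       # (2, 32, 768): text enriched with visual context
```

The same pattern can be stacked layer-wise (as in implicit cross-modal Transformers) or restricted to prefix positions (as in visual-prefix approaches).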

3. Representation Engineering and Alignment

A central challenge in MRE is effective representation and alignment of information across modalities:

  • Scene Graphs and Entity-Object Alignment: Many systems (e.g., EEGA, MEGA, VM-HAN, MRE-ISE) rely on scene graphs to represent visual objects and their predicates, with dependency graphs for text. Node and edge features are embedded and cross-modal edges constructed via similarity metrics, enabling alignment at both the object/entity and relation/predicate levels (Yuan et al., 2022, Wu et al., 2023, Li et al., 18 Apr 2024).
  • Information Bottleneck and Regularization: MMIB adopts a variational information bottleneck, regularizing the modality-specific representations for both denoising (removing task-irrelevant visual noise) and enforcing tighter text-image alignment using a mutual information–based contrastive loss (Cui et al., 2023).
  • Pre-training via Soft Alignment Objectives: Prompt Me Up leverages self-supervised pretraining on vast pools of image-caption pairs, using pseudo-labels generated by object/entity and image/relation prompt matching. These signals, applied as contrastive and matching objectives, enhance downstream multimodal extraction (Hu et al., 2023).

Rigorous alignment and denoising are key to alleviating the intrinsic modality gap and noise commonly encountered in multimodal datasets.
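The contrastive text-image alignment objectives mentioned above are commonly instantiated as an InfoNCE-style loss over paired representations. The sketch below shows the generic pattern only; the temperature and embedding sizes are illustrative, and this is not MMIB's or Prompt Me Up's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce_alignment(text_emb: torch.Tensor, image_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matched text-image pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)           # (B, d)
    image_emb = F.normalize(image_emb, dim=-1)         # (B, d)
    logits = text_emb @ image_emb.t() / temperature    # (B, B) cosine similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2i = F.cross_entropy(logits, targets)        # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)    # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)

# Example with pooled sentence/image embeddings
loss = info_nce_alignment(torch.randn(8, 256), torch.randn(8, 256))
```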

4. Datasets and Evaluation Protocols

Standardized and challenging datasets are fundamental to reproducible MRE research:

  • MNRE: The widely adopted Multimodal Neural Relation Extraction dataset consists of social-media posts, each accompanied by images and entity–relation annotations (15.5K posts, 23 relations) (Li et al., 2022, Wu et al., 2023).
  • MORE/UMRE: The MORE benchmark provides >20K object-entity triples from paired news headlines and images, focusing on inter-modal relations (text entity ↔ image object). The newer UMRE dataset spans intra-text, intra-image, and inter-modal triplets, with fine-grained annotations and a more diverse relation set (He et al., 2023, Lin et al., 5 Sep 2025).
  • JMERE: Targets joint extraction of entities and relations by providing quintuples per example, requiring simultaneous entity span/type and relation predictions (Yuan et al., 2022, Yuan et al., 18 Oct 2024).
  • Few-Shot Benchmarks: Dedicated splits for few-shot learning (e.g., FS-JMERE, FewRel-small) allow evaluation under low-resource constraints (Gong et al., 1 Mar 2024, Yuan et al., 18 Oct 2024).

Evaluation is typically via micro-averaged F₁, precision, recall, and, for multiclass or head–tail analysis, macro-F₁.
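At the triple level, a prediction is typically counted as correct only if its arguments and relation exactly match a gold triple. The following sketch computes micro- and macro-F₁ under that assumed exact-match criterion; the helper function and toy example are hypothetical.

```python
from collections import defaultdict

def micro_macro_f1(gold, pred):
    """gold, pred: lists of sets of (head, relation, tail) triples, one set per example."""
    tp = fp = fn = 0
    per_rel = defaultdict(lambda: [0, 0, 0])  # relation -> [tp, fp, fn]
    for g, p in zip(gold, pred):
        for t in p:                            # count predictions
            if t in g:
                tp += 1; per_rel[t[1]][0] += 1
            else:
                fp += 1; per_rel[t[1]][1] += 1
        for t in g - p:                        # missed gold triples
            fn += 1; per_rel[t[1]][2] += 1

    def f1(tp_, fp_, fn_):
        prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_rel.values()) / len(per_rel) if per_rel else 0.0
    return micro, macro

# Toy example: one correct and one spurious prediction
gold = [{("Steve Jobs", "member_of", "Apple")}]
pred = [{("Steve Jobs", "member_of", "Apple"), ("Steve Jobs", "peer", "Apple")}]
print(micro_macro_f1(gold, pred))  # micro-F1 ~ 0.67, macro-F1 = 0.5
```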

5. Key Results, Ablation Analyses, and Limitations

Table: Representative SOTA Results on MNRE and MORE Test Sets

| Model | MNRE F₁ | MORE F₁ | Macro-F₁ | Principal Architecture |
|---|---|---|---|---|
| MOREformer | 62.8 | 50.0 | | object-attribute-depth fusion |
| REMOTE | 87.77 | 63.91 | | multi-expert, OT fusion |
| ROC | 91.22 | 71.97 | | retrieval-based, contrastive |
| VM-HAN | 85.22 | 65.71 | | variational hypergraph attention |
| MMIB + Prompt Me Up | 84.86 | | | VIB, pretraining alignment |
| DGF-PT | 84.47 | | | prefix-tuning, dual-gated |
| QEOT | 49.65* | | | DETR-style query triple decoder |

* QEOT reports triple-level F₁; baselines evaluated in the same setting score lower.

Ablation analyses across these works consistently show that removing cross-modal fusion or alignment components degrades F₁, confirming that visual information contributes beyond a text-only baseline. Limitations include sensitivity to annotation noise, computational overhead (notably the iterative Sinkhorn solver used for optimal transport), and persistent difficulty with multi-object or rare-relation disambiguation, particularly in natural images with ambiguous context (He et al., 2023, Lin et al., 5 Sep 2025).
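The optimal-transport overhead comes from the iterative Sinkhorn normalization used to solve the entropic-regularized alignment between, for example, text entities and visual objects. A minimal NumPy sketch of the routine is shown below; it is a generic version, not any specific paper's variant.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.1, n_iters: int = 100) -> np.ndarray:
    """Entropic-regularized OT between uniform marginals over rows/columns of `cost`.
    Returns a soft transport plan aligning, e.g., text entities with visual objects."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                         # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)               # transport plan, shape (n, m)

# Example: align 3 text entities with 4 detected objects via a random cost matrix
plan = sinkhorn(np.random.rand(3, 4))
print(plan.sum())  # ~1.0: a valid joint coupling
```

Each Sinkhorn call costs O(n·m) per iteration, and graph-alignment models typically run it per entity-object pair or per layer, which explains the reported overhead.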

6. Specialized and Emerging Directions

Several emerging sub-areas demonstrate the expanding scope of MRE research:

  • Few-Shot and Low-Resource MRE: Dedicated models integrate LLM-based knowledge prompting or meta-learning fusion to compensate for limited annotation, reporting improved macro-F₁ and robustness to label sparsity (Yuan et al., 18 Oct 2024, Gong et al., 1 Mar 2024).
  • Synthetic Data Generation: MI2RAGE presents an approach where multimodal classifiers are trained solely on synthetic modalities generated by chained cross-modal generators, filtered by a mutual-information–aware teacher. This yields significant gains over both standard synthetic and real-data only models, especially under missing-modality scenarios (Du et al., 2023).
  • Retrieval-Augmented and Knowledge-Based MRE: Systems like MoRe integrate large-scale external textual and visual knowledge corpora via retrieval and mixture-of-experts gating, empirically benefiting entity disambiguation in challenging situations (Wang et al., 2022, Hu et al., 2023).
  • Medical and Document MRE: REMAP fuses graph and text representations for disease relation extraction, leveraging HAN for graph encoding and SciBERT for text, with output probability-level fusion (Lin et al., 2022). Multimodal approaches for visually-rich document understanding highlight that layout–text integration is typically more informative than vision alone (Cooney et al., 2022).

7. Open Problems and Future Perspectives

The field continues to be shaped by several open directions:

  • Scalability: Extending models to handle higher-resolution images, a larger entity/object vocabulary, and efficient inference remains a challenge, especially for application in real-world pipelines (Hei et al., 16 Aug 2024, Lin et al., 5 Sep 2025).
  • Multi-label and Overlapping Relations: Most joint models to date cannot fully accommodate overlapping predictions or instances where multiple relations hold between argument pairs. Structured prediction strategies and sequence-to-sequence paradigms offer potential remedies (Yuan et al., 2022, Yuan et al., 18 Oct 2024).
  • Semantic and Geometry Fusion: Going beyond simple visual grounding to employ depth, spatial context, and geometry-aware cues for disambiguation, especially in crowded or abstract scenes (He et al., 2023, Lin et al., 5 Sep 2025).
  • Robustness and Generalization: The use of pre-training, meta-learning, and synthetic data generation is gaining traction to bolster robustness in few-shot or out-of-domain setups (Du et al., 2023, Yuan et al., 18 Oct 2024).
  • Evaluation Standards: More informative macro-F₁ and error analysis across relation types, modalities, and dataset construction strategies are needed to reliably gauge progress.

In synthesis, multimodal relation extraction has evolved into a mature research direction with diverse, sophisticated methods and evaluation benchmarks that drive toward robust, fine-grained, and data-efficient semantic extraction across text and vision. Continued advances will hinge on improved cross-modal fusion, self-supervised learning, benchmark development, and principled integration of external knowledge.
