Multimodal Graph Extraction

Updated 18 April 2026

Multimodal graph extraction is the automated process of recovering graph-structured data from text and visual inputs, capturing nodes, edges, and their attributes.
It integrates textual and visual cues using methods like graph neural networks, optimal transport, and mixture-of-experts to align diverse features.
This approach enhances document understanding and scene analysis by resolving fine-grained ambiguities through joint reasoning across multiple modalities.

Multimodal graph extraction refers to the automated process of recovering graph-structured representations (nodes, edges, and their attributes) from data that spans multiple modalities, typically text and vision. The extracted graphs encode entities and their relations in settings where contextual meaning arises from the composition of complementary features such as linguistic content, geometric/layout cues, semantic categories, and visual context. This paradigm is foundational to visually rich document understanding, information fusion in knowledge graphs, scene graph generation, relation extraction across vision and language, and structured reasoning in multimodal LLMs. Numerous approaches address challenges in both domain-specific settings (e.g., business forms, cyber threat intelligence) and general-purpose settings, drawing on graph neural networks, contrastive and optimal transport-based fusion, information bottleneck denoising, and mixture-of-experts modeling.

1. Core Principles and Challenges of Multimodal Graph Extraction

Multimodal graph extraction departs fundamentally from unimodal approaches by integrating complementary evidence streams—textual sequences and visual input—into a unifying graph abstraction. The two core challenges are: (1) capturing relations and entities whose semantics are not encoded exclusively in either input modality; (2) resolving fine-grained correspondences or disambiguation cues that require joint reasoning, such as associating visually distinct but textually similar entities, or aligning layout with semantics.

In visually rich documents, textual serialization loses structural cues such as spatial proximity, font style, or hierarchical arrangement; similarly, vision-only extractions lack linguistic precision. Multimodal graph extraction frameworks therefore encode both node-level (textual and visual region embedding) and edge-level (spatial/geometric, semantic, or relational) features into the constructed graph, which is then refined and decoded to yield the desired output: entity segments, entity chains/coreference, relation triples, event graphs, or more specialized structures such as attack graphs or scene graphs (Liu et al., 2019, Lin et al., 5 Sep 2025, Yuan et al., 2022, Lee et al., 2023, Cao et al., 2024, Liu et al., 21 Mar 2026).

A plausible implication is that when multimodal cues are inconsistent (e.g., visual evidence contradicts text), robust graph extraction frameworks must detect or resolve such contradictions, for example through gating networks, mixture-of-experts, or information-theoretic regularization.

2. Graph Construction and Multimodal Feature Encoding

Document and Scene Graphs

The foundational step in multimodal graph extraction is the construction of a node/edge graph from multi-source input. For documents, typical practice relies on OCR segmentation for nodes (text blocks or word tokens), with edge features encoding relative position, bounding box geometry, or visual linkages (Liu et al., 2019, Cao et al., 2024, Lee et al., 2023). Scene graph extraction in vision leverages object detectors and relationship classifiers (e.g., Faster R-CNN, Mask R-CNN, MOTIFS) to identify objects, attributes, and their pairwise/spatial relations, providing a structural visual backbone (Wu et al., 2023, Liu et al., 21 Mar 2026).

Table: Representative Node and Edge Features

Node Features	Edge Features (Document)	Edge Features (Scene Graph)
Token/segment embedding (BERT, BiLSTM)	Offset vectors, window ratios	Predicate types (action, spatial)
Visual region/crop (ResNet, ViT)	Center distances, k-NN	CLIP-based similarity
Character or object type	Geometric transformation vectors	Co-occurrence or scene context
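
As an illustration of the document-side features listed in the table, the following minimal sketch builds a k-NN graph over OCR bounding boxes and attaches offset vectors, center distances, and size ratios to each edge. The function names, the `(x0, y0, x1, y1)` box format, and the sample boxes are assumptions made for this example and are not taken from any cited system; boxes are assumed to have non-zero width and height.

```python
import math

def box_center(box):
    """Return the (x, y) center of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def build_knn_document_graph(boxes, k=2):
    """Connect each OCR segment to its k nearest segments by center distance.

    Returns a dict mapping directed edges (i, j) to a feature dict holding
    the offset vector, center distance, and width/height ratios.
    """
    centers = [box_center(b) for b in boxes]
    edges = {}
    for i, (cx, cy) in enumerate(centers):
        # Rank all other segments by Euclidean distance between centers.
        others = sorted(
            (j for j in range(len(boxes)) if j != i),
            key=lambda j: math.hypot(centers[j][0] - cx, centers[j][1] - cy),
        )
        for j in others[:k]:
            dx, dy = centers[j][0] - cx, centers[j][1] - cy
            wi, hi = boxes[i][2] - boxes[i][0], boxes[i][3] - boxes[i][1]
            wj, hj = boxes[j][2] - boxes[j][0], boxes[j][3] - boxes[j][1]
            edges[(i, j)] = {
                "offset": (dx, dy),
                "distance": math.hypot(dx, dy),
                "width_ratio": wj / wi,
                "height_ratio": hj / hi,
            }
    return edges

# Hypothetical OCR output: three segments on one line and one far below.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 60, 10), (0, 100, 10, 110)]
graph = build_knn_document_graph(boxes, k=1)
```

In a full pipeline, these edge features would typically be concatenated with textual and visual node embeddings before message passing; the sketch covers only the geometric part.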

Graph construction in text leverages syntactic or semantic parsing (dependency graphs, coreference chains), whose adjacency structures are extended with cross-modal edges informed by similarity, positional proximity, or external knowledge. Some frameworks consolidate textual and visual graphs into a single Cross-Modal Graph (CMG) via fusion mechanisms such as similarity-based edge addition or optimal transport (Wu et al., 2023, Yuan et al., 2022). Domain-specific settings (e.g., attack graph extraction) may augment this process with specialized routines for visual flowchart analysis or event-entity typing (Zhang et al., 20 Jun 2025).
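
Similarity-based edge addition can be sketched as follows, assuming that textual and visual node embeddings already live in a shared space. The node identifiers, the toy vectors, and the 0.8 threshold are hypothetical choices for illustration, not values reported by the cited works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def add_cross_modal_edges(text_nodes, visual_nodes, threshold=0.8):
    """Return (text_id, visual_id, similarity) for pairs above the threshold."""
    edges = []
    for t_id, t_vec in text_nodes.items():
        for v_id, v_vec in visual_nodes.items():
            sim = cosine(t_vec, v_vec)
            if sim >= threshold:
                edges.append((t_id, v_id, sim))
    return edges

# Toy embeddings standing in for encoder outputs (assumed, not real features).
text_nodes = {"t:dog": [1.0, 0.0, 0.0], "t:park": [0.0, 1.0, 0.0]}
visual_nodes = {"v:region0": [0.9, 0.1, 0.0], "v:region1": [0.0, 0.2, 0.9]}
cross_edges = add_cross_modal_edges(text_nodes, visual_nodes)
```

Here only the pair `t:dog` / `v:region0` clears the threshold, so the merged graph gains a single cross-modal edge; the intra-modal edges of each graph are left untouched.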

3. Multimodal Graph Representation and Fusion Mechanisms

Multimodal representation learning for graph extraction hinges on (A) constructing feature-rich multimodal node/edge embeddings, (B) aligning or aggregating these features, and (C) propagating context via message-passing or global attention.

Frameworks can be categorized into several design archetypes:

Graph Convolution with Self- and Attention-Based Aggregation: Fully-connected or kNN-based document graphs use learnable attention to aggregate messages from every node, combining textual node states, visual edge features, and neighboring context into a multimodal embedding per segment (Liu et al., 2019, Cao et al., 2024, Lee et al., 2023).
Optimal Transport-Based Alignment: Alignment of graph features, either at node or edge level (Wasserstein/Gromov-Wasserstein distances), is solved as an optimal transport problem, yielding a transport plan used for cross-modal fusion or to enforce alignment consistency (Yuan et al., 2022, Lin et al., 5 Sep 2025).
Mixture-of-Experts (MoE) and Gating: Mixture-of-experts architectures dynamically select and combine interaction features from multiple expert streams (textual, visual, cross-modal), with softmax gating selecting the optimal composition for each candidate relation (Zhou et al., 21 Feb 2025, Lin et al., 5 Sep 2025).
Information Bottleneck Denoising: To counter over-utilization of internal information, learnable stochastic pruning via the Graph Information Bottleneck retains only the most informative nodes/edges for each relation extraction instance (Wu et al., 2023).
Contrastive and Adversarial Objectives: Centralized graph contrastive loss, multimodal contrastive learning, and virtual adversarial training (VAT) enforce consistency of representations across graph views and smooth the semantic clustering of multimodal embeddings (Lee et al., 2023, Zhou et al., 21 Feb 2025).
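
To make the optimal transport archetype concrete, the sketch below computes an entropy-regularized transport plan between textual and visual node sets with Sinkhorn iterations. This is a generic Sinkhorn solver used for illustration; it is not the specific Wasserstein or Gromov-Wasserstein formulation of the cited papers, and the cost matrix, uniform marginals, and regularization strength are assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between marginals a (rows) and b (columns)."""
    K = np.exp(-cost / reg)          # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy unit-norm features: visual nodes are a permutation of the textual ones.
text_feats = np.eye(3)
visual_feats = np.eye(3)[[1, 2, 0]]
cost = 1.0 - text_feats @ visual_feats.T   # cosine distance
a = np.full(3, 1.0 / 3.0)                  # uniform mass over textual nodes
b = np.full(3, 1.0 / 3.0)                  # uniform mass over visual nodes
plan = sinkhorn_plan(cost, a, b)
```

Each row of `plan` indicates how a textual node's mass is distributed over visual nodes; a fusion layer could use it as a soft alignment matrix, or a loss term could penalize the transport cost `(plan * cost).sum()`.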

Fusion can occur at token/patch level (concatenation/fine-grained matching), or globally via gating, attention, or decoders tuned for specific output templates (e.g., code-style, natural language) (Liu et al., 21 Mar 2026).
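
Softmax gating over expert streams can be illustrated with a minimal sketch. In a trained system the gate logits would come from a learned network conditioned on the candidate relation; here they are fixed by hand, and the three expert outputs are toy vectors, all of which are assumptions for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(expert_outputs, gate_logits):
    """Combine expert output vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
    return fused, weights

# Hypothetical expert outputs: textual, visual, and cross-modal streams.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_logits = [2.0, 0.0, 0.0]   # the gate favors the textual expert here
fused, weights = moe_fuse(expert_outputs, gate_logits)
```

The fused vector is a convex combination of the expert outputs, so a gate that strongly favors one stream reduces the influence of the others without discarding them entirely.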

4. Decoding, Training Objectives, and Unified Extraction Paradigms

Output decoding strategies reflect task structure. BiLSTM-CRF and variants are standard for sequence and BIO tagging, especially for entity extraction and segmentation in visually rich documents (Liu et al., 2019, Cao et al., 2024). For joint entity and relation extraction, word-pair grid tagging is used, with decoding rules that extract both single-token and multi-token entities and all valid relations in one pass, avoiding error propagation from pipelined NER→RE models (Yuan et al., 2022).
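
For the sequence-tagging case, turning BIO tags into entity spans can be sketched as below. This covers only the generic BIO convention, not the word-pair grid decoding rules of Yuan et al.; the tokens, tag names, and the policy of closing a span on an inconsistent `I-` tag are assumptions for illustration.

```python
def decode_bio(tokens, tags):
    """Collect (entity_type, text) spans from a BIO tag sequence."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any span that was still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Hypothetical receipt tokens with predicted tags.
tokens = ["Total", ":", "42.00", "ACME", "Corp"]
tags = ["O", "O", "B-TOTAL", "B-COMPANY", "I-COMPANY"]
entities = decode_bio(tokens, tags)
```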

Modern unified paradigms conceptualize the extraction task as code generation (via code-style templates), where multimodal features and scene graphs are encoded as function arguments and entities/relations are predicted as structured Python dicts. This approach leverages decoder-only LLMs with parameter-efficient tuning to flexibly handle all subtasks (entities, coreference, relations, grounding) within a single generation framework (Liu et al., 21 Mar 2026).
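
A rough sketch of the code-style idea follows: the input text and scene graph are rendered as arguments of a function call, and the model's completion is parsed as a Python dict literal. The template, the `extract` function name, the dict keys, and the sample completion are all hypothetical and do not reproduce the actual Code-MIE prompt or output schema.

```python
import ast

def build_code_prompt(text, scene_graph):
    """Render the inputs as a code-style prompt for an LLM to complete."""
    return (
        "def extract(text, scene_graph):\n"
        '    """Return {"entities": [...], "relations": [...]}."""\n\n'
        f"extract(text={text!r}, scene_graph={scene_graph!r})\n"
        "# returns: "
    )

def parse_completion(completion):
    """Parse a dict literal safely, without executing model output."""
    parsed = ast.literal_eval(completion.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    return parsed

prompt = build_code_prompt(
    "Messi lifts the trophy", [("man", "holding", "trophy")]
)
# Stand-in for a model completion; this is not real model output.
completion = '{"entities": [("Messi", "PER")], "relations": []}'
result = parse_completion(completion)
```

Using `ast.literal_eval` rather than `eval` restricts parsing to literals, which is a reasonable precaution when the string originates from a generative model.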

Training objectives are typically combined into a single loss: cross-entropy terms for the main tasks (entity/relation classification), auxiliary losses for segment or node classification, optimal transport penalties, information bottleneck regularizers, and a contrastive objective for multimodal agreement (Lin et al., 5 Sep 2025, Lee et al., 2023, Wu et al., 2023).
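
One common way to combine such terms is a weighted sum, written here as a sketch. The additive form and the weights λ are assumptions for illustration; individual papers use different subsets of these terms and different weighting schemes.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}
  + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}
  + \lambda_{\text{OT}}\,\mathcal{L}_{\text{OT}}
  + \lambda_{\text{IB}}\,\mathcal{L}_{\text{IB}}
  + \lambda_{\text{con}}\,\mathcal{L}_{\text{con}}
```

Here the terms denote, in order, the main cross-entropy loss, the auxiliary classification loss, the optimal transport penalty, the information bottleneck regularizer, and the contrastive loss.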

5. Empirical Performance and Benchmarks

Key datasets span visually rich documents (SROIE, FUNSD, CORD, M³D), social media (Twitter-15/17, MNRE), joint entity-relation sets (JMERE, UMRE), and domain-specific corpora (CTI attack graphs, disease KGs). Extraction performance is measured by strict entity, relation, and grounding F1, graph-level exact match, and, in knowledge-graph settings, micro-accuracy, precision, and recall per relation type (Lin et al., 5 Sep 2025, Yuan et al., 2022, Lee et al., 2023, Liu et al., 21 Mar 2026, Lin et al., 2022).
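
Strict relation F1 is usually computed by exact matching of predicted triples against gold triples. The sketch below shows that computation under the assumption that a triple counts as correct only if head, relation, and tail all match; the example triples are invented.

```python
def strict_prf(gold, predicted):
    """Precision, recall, and F1 under exact-match scoring of triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Invented gold and predicted (head, relation, tail) triples.
gold = [("Alice", "works_for", "ACME"), ("ACME", "located_in", "Berlin")]
predicted = [("Alice", "works_for", "ACME"), ("ACME", "located_in", "Paris")]
precision, recall, f1 = strict_prf(gold, predicted)
```

In this example one of two predictions is correct and one of two gold triples is recovered, so precision, recall, and F1 are all 0.5.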

Recent results demonstrate that multimodal graph conditioning, code-style output, and graph-revised GCNs set new state of the art on diverse tasks. For instance, REMOTE reports 78.47% Accuracy / 69.17% F1 on UMRE (↑5.3 F1 over previous best) and up to +12 F1 improvement over text-only baselines on key entity/relation extraction in graph-rich settings (Lin et al., 5 Sep 2025, Cao et al., 2024, Liu et al., 21 Mar 2026). Ablations consistently show that removal of graph-based alignment or scene-graph modules leads to 1–6 point degradations in F1.

Notably, domain-specific extensions such as multimodal attack graph construction with iterative QA pipelines (MM-AttacKG) show up to +14 pp F1 gains over text-only systems by incorporating threat image parsing and fusion (Zhang et al., 20 Jun 2025). Automated extraction from scientific plots with vision-based chain-of-thought LLMs now achieves >90% precision and recall and <5% normalized error in 1–2D point extraction settings (Polak et al., 16 Mar 2025).

6. Advances in Multimodal Graph Reasoning and Comprehension

Recent studies characterize the graph-structure comprehension of multimodal LLMs on generic graph QA tasks (node degree, cycle detection, triangle counting). Three findings stand out: text-based adjacency or incident list encodings yield high accuracy; vision-mediated prompts add the most value on global or structural queries; and multimodal (text+vision) approaches can slightly exceed text-only accuracy on dense, complex graphs (Zhong et al., 2024). However, vision alone, without textual encoding, remains substantially weaker for local graph property extraction (e.g., ≤75% accuracy on edge existence).
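
A text-based adjacency encoding of the kind used in such probes, together with ground-truth answers for two of the QA tasks, can be sketched as follows. The exact prompt format used by Zhong et al. is not reproduced here; the `node: neighbors` layout and the toy graph are assumptions.

```python
from itertools import combinations

def to_adjacency_text(edges):
    """Serialize an undirected graph as one 'node: neighbors' line per node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    lines = [
        f"{n}: {', '.join(map(str, sorted(adj[n])))}" for n in sorted(adj)
    ]
    return "\n".join(lines), adj

def count_triangles(adj):
    """Count unordered node triples that are pairwise connected."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
text, adj = to_adjacency_text(edges)
degree_of_2 = len(adj[2])
triangles = count_triangles(adj)
```

The string `text` can be placed in a prompt, while `degree_of_2` and `triangles` serve as reference answers against which a model's responses are scored.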

A plausible implication is that optimal use of both modalities—text for precision, vision for holistic structure—yields the most robust graph extraction and reasoning, particularly in complex or visual-noise-prone scenarios.

7. Open Problems and Future Directions

Despite substantial progress, multimodal graph extraction still faces limitations: (1) reliance on external scene graph or OCR parsers whose failures degrade performance (Wu et al., 2023, Liu et al., 21 Mar 2026); (2) difficulty handling partially aligned or ambiguous cross-modal cues; (3) lack of explainable error detection or interactive graph-editing loops, especially in safety-critical domains (e.g., threat intelligence, clinical KGs) (Lin et al., 2022, Zhang et al., 20 Jun 2025).

There is active research into end-to-end, self-supervised graph parsing; fully code-style extraction with dynamic template induction; scalable hierarchical multimodal graph learning; and explainability methods for cross-modal evidence tracing in the extracted graph. Extending current frameworks to richer modalities (audio, video), higher-order graphs (hypergraphs), or continuously evolving knowledge graphs also remains an open frontier.

Key References

(Liu et al., 2019) Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
(Lee et al., 2023) FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
(Yuan et al., 2022) Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging
(Wu et al., 2023) Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling
(Lin et al., 5 Sep 2025) REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts
(Cao et al., 2024) GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network
(Zhou et al., 21 Feb 2025) Multimodal Graph-Based Variational Mixture of Experts Network for Zero-Shot Multimodal Information Extraction
(Liu et al., 21 Mar 2026) Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement
(Zhang et al., 20 Jun 2025) MM-AttacKG: A Multimodal Approach to Attack Graph Construction with LLMs
(Polak et al., 16 Mar 2025) Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots
(Lin et al., 2022) Multimodal Learning on Graphs for Disease Relation Extraction
(Zhong et al., 2024) Exploring Graph Structure Comprehension Ability of Multimodal LLMs: Case Studies