
Entity-Centric Multimodal KB

Updated 29 November 2025
  • An entity-centric multimodal knowledge base is a structured repository that organizes data around entities using text, visual content, and attribute graphs.
  • It integrates modality-specific neural encoders and fusion techniques to enable efficient retrieval, entity linking, and cross-modal reasoning.
  • The approach leverages rigorous schema design, diverse data aggregation, and advanced retrieval pipelines to enhance AI-driven applications.

An entity-centric multimodal knowledge base (EMKB) is a structured repository that organizes knowledge around entities and integrates multiple modalities—most commonly structured triples, textual descriptions, and visual data. EMKBs have emerged as essential resources for machine intelligence tasks that require grounded, cross-modal understanding and reasoning, including entity linking, information retrieval, question answering, content generation, and knowledge base completion. EMKBs are characterized by explicit per-entity aggregation of heterogeneous signals, modular retrieval-supporting architectures, and rigorous schema definitions supporting extensibility.

1. Formal Definitions and Schema Design

Entity-centric multimodal knowledge bases generalize traditional knowledge bases by associating each entity with multiple forms of modality-specific information. A canonical EMKB schema, as adopted in recent systems, comprises:

  • A finite set of entities $\mathcal{E} = \{e_1, \dots, e_N\}$.
  • Modal content per entity, e.g.,
    • Textual descriptions $t_{e_j}$
    • Visual data $v_{e_j}$ (images or video)
    • Fine-grained attribute–value pairs $\{(a^s_{e_j}, v^s_{e_j})\}$ (when available)
    • Structured subgraphs $\bm{G}_{\text{sub}}^j$ (entity-specific KG slices) (You et al., 26 Nov 2025, 2305.14725).

A typical entry is thus:

$$e_j = \bigl(t_{e_j},\ v_{e_j},\ \{(a^1_{e_j}, v^1_{e_j}), \dots, (a^{S_j}_{e_j}, v^{S_j}_{e_j})\},\ \bm{G}_{\text{sub}}^j \bigr)$$
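
For concreteness, this schema can be mirrored in a lightweight data structure. The following is a minimal sketch; the class name, field names, and the example entry are illustrative and not drawn from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EMKBEntry:
    """One entity e_j with its modality-specific content (illustrative field names)."""
    entity_id: str
    text: str                                                  # textual description t_{e_j}
    images: List[str] = field(default_factory=list)            # paths/URLs to visual data v_{e_j}
    attributes: Dict[str, str] = field(default_factory=dict)   # attribute-value pairs (a, v)
    subgraph: List[Tuple[str, str, str]] = field(default_factory=list)  # (head, relation, tail) triples in G_sub^j

# A hypothetical entry
entry = EMKBEntry(
    entity_id="Q243",
    text="The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    images=["eiffel_tower_01.jpg", "eiffel_tower_02.jpg"],
    attributes={"height": "330 m", "architect": "Stephen Sauvestre"},
    subgraph=[("Eiffel Tower", "located_in", "Paris")],
)
```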

The ontology design pattern in (Apriceno et al., 17 Oct 2024) further specifies three metaclasses:

  • MultiModalEntity: The semantic entity itself.
  • ModalDescriptor: The digital artefact realizing an entity in a specific modality.
  • Modality: The type of information (e.g., image, text, audio, video).

Key formal axioms enforce a clean separation between semantics (the entity) and realization (modality-specific artifacts), e.g.,

$$\mathrm{MultiModalEntity} \equiv \mathrm{InformationObject} \sqcap \exists\,\mathrm{hasModalDescriptor}.\top$$

Disjointness constraints and modality hierarchies enable complex and extensible modeling.
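
The pattern can be instantiated directly as RDF. The sketch below uses rdflib; only the three metaclasses and hasModalDescriptor come from the cited pattern, while the namespace, instance names, and the hasModality property linking descriptors to their Modality are assumptions for illustration.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, OWL

EMKB = Namespace("http://example.org/emkb#")   # illustrative namespace, not from the cited pattern
g = Graph()
g.bind("emkb", EMKB)

# The three metaclasses of the pattern (Apriceno et al., 17 Oct 2024).
for cls in (EMKB.MultiModalEntity, EMKB.ModalDescriptor, EMKB.Modality):
    g.add((cls, RDF.type, OWL.Class))

# One entity, realized by two modality-specific descriptors (instance names are made up).
g.add((EMKB.EiffelTower, RDF.type, EMKB.MultiModalEntity))
g.add((EMKB.EiffelTower_photo, RDF.type, EMKB.ModalDescriptor))
g.add((EMKB.EiffelTower_photo, EMKB.hasModality, EMKB.Image))          # hasModality is an assumed property name
g.add((EMKB.EiffelTower, EMKB.hasModalDescriptor, EMKB.EiffelTower_photo))
g.add((EMKB.EiffelTower_gloss, RDF.type, EMKB.ModalDescriptor))
g.add((EMKB.EiffelTower_gloss, EMKB.hasModality, EMKB.Text))
g.add((EMKB.EiffelTower, EMKB.hasModalDescriptor, EMKB.EiffelTower_gloss))

print(g.serialize(format="turtle"))
```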

2. Data Population and Multimodal Annotation

The construction of an EMKB proceeds by entity inventory selection, per-entity data aggregation, and annotation:

  • Entity harvesting: Methods include Wikipedia/WordNet category crawling (Qiu et al., 7 Mar 2024, Huang et al., 2022), named entity recognition from domain corpora (e.g., news) (You et al., 26 Nov 2025), and domain-specific crawls (e.g., from e-commerce catalogs (2305.14725)).
  • Textual content: Aggregation of Wikipedia leads, infoboxes, synonyms, and glosses in multiple languages (Huang et al., 2022). On average, each entity carries at least three textual variants.
  • Visual content: Multiple representative images per entity, sourced from image repositories, automated web search, or datasets like ImageNet, VGGFace2, DBpedia (Qiu et al., 7 Mar 2024, You et al., 26 Nov 2025, Peng et al., 2021).
  • Attribute extraction: Structured attribute–value pairs are included when fine-grained entity properties are key (e.g., product color, memory) (2305.14725).
  • External subgraphs: Entity-local KGs or automatically generated subgraphs capturing relational context (You et al., 26 Nov 2025).

Quality control relies on manual curation, automated filtering, and cross-view consistency checks (e.g., via crowdworker verification or zero-shot detectors).
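
As one possible instantiation of the zero-shot consistency check, the sketch below scores image–text agreement with an off-the-shelf CLIP model and flags low-similarity pairs; the model choice and threshold are illustrative, not those of the cited pipelines.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_matches_entity(image_path: str, entity_text: str, threshold: float = 0.2) -> bool:
    """Return True if the CLIP image-text cosine similarity clears the (illustrative) threshold."""
    inputs = processor(text=[entity_text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() >= threshold
```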

3. Representation Learning and Multimodal Fusion

EMKBs employ modality-specific neural encoders to obtain embeddings suitable for unified modeling: transformer-based text encoders (e.g., BERT, LASER, DeBERTa) for descriptions and glosses, vision backbones (e.g., CLIP ViT, ResNet, VGG) for images, and graph encoders (e.g., GAT) for entity subgraphs (see the modal-coverage table in Section 5).

The resulting embeddings are combined, via projection or gating, into a shared latent space. These representations feed into downstream scoring architectures (e.g., DistMult, ConvE, or cross-attention fusion for generative models).
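
A minimal sketch of such projection-and-gating fusion follows. The embedding dimensions and the sigmoid gate are illustrative choices, not the architecture of any cited system.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Project modality-specific embeddings into a shared space and combine them with a learned gate."""
    def __init__(self, text_dim: int = 768, image_dim: int = 512, shared_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.gate = nn.Sequential(nn.Linear(2 * shared_dim, shared_dim), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_emb)                 # (batch, shared_dim)
        v = self.image_proj(image_emb)               # (batch, shared_dim)
        g = self.gate(torch.cat([t, v], dim=-1))     # per-dimension gate in [0, 1]
        return g * t + (1 - g) * v                   # gated combination in the shared space

fusion = GatedFusion()
entity_repr = fusion(torch.randn(4, 768), torch.randn(4, 512))  # -> shape (4, 256)
```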

4. Retrieval, Indexing, and Query Processing

Efficient retrieval from EMKBs is essential for scaling entity matching and downstream tasks:

  • Dense indexing: Visual embeddings (CLIP/VGG) and textual representations (SBERT/LASER) indexed via approximate nearest-neighbor methods such as Faiss (Qiu et al., 7 Mar 2024, Huang et al., 2022); a retrieval-and-fusion sketch follows this list.
  • Retrieval paradigm: Multistage pipeline—initial top-K retrieval using text and image separately, followed by multimodal matching and reranking (Peng et al., 2021, 2305.14725).
  • Score fusion: Linear models or MLPs over modal similarity scores, with weights optimized for each modality (text encoders typically dominate performance) (Peng et al., 2021).
  • Path fusion: For knowledge completion, 1-step and 2-step multimodal paths are scored, and answers ranked via path-weighted combinations or learned logistic regression models (Peng et al., 2022).
  • Query-driven optimization: Early pruning (confidence thresholds), top-K limiting, and fast parallelization maintain end-to-end latency (≈6 s on large graphs) (Peng et al., 2022).
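
A minimal sketch of dense per-modality indexing with late score fusion, using Faiss on placeholder embeddings and an illustrative text weight; a production system would swap in an approximate index (e.g., HNSW or IVF) and learned fusion weights.

```python
import numpy as np
import faiss  # similarity-search library used here for dense entity indexing

d_txt, d_img, n_entities = 384, 512, 10_000
rng = np.random.default_rng(0)

# Placeholder entity embeddings standing in for SBERT / CLIP outputs.
txt_emb = rng.standard_normal((n_entities, d_txt)).astype("float32")
img_emb = rng.standard_normal((n_entities, d_img)).astype("float32")
faiss.normalize_L2(txt_emb)
faiss.normalize_L2(img_emb)

txt_index = faiss.IndexFlatIP(d_txt)  # inner product == cosine after L2 normalization
img_index = faiss.IndexFlatIP(d_img)
txt_index.add(txt_emb)
img_index.add(img_emb)

def retrieve(q_txt: np.ndarray, q_img: np.ndarray, k: int = 10, w_text: float = 0.7):
    """Top-K per modality, then rerank the union with weighted score fusion (w_text is illustrative)."""
    faiss.normalize_L2(q_txt)
    faiss.normalize_L2(q_img)
    s_t, ids_t = txt_index.search(q_txt, k)
    s_i, ids_i = img_index.search(q_img, k)
    scores = {}
    for sim, eid in zip(s_t[0], ids_t[0]):
        scores[eid] = scores.get(eid, 0.0) + w_text * sim
    for sim, eid in zip(s_i[0], ids_i[0]):
        scores[eid] = scores.get(eid, 0.0) + (1 - w_text) * sim
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

top_entities = retrieve(rng.standard_normal((1, d_txt)).astype("float32"),
                        rng.standard_normal((1, d_img)).astype("float32"))
```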

5. Downstream Applications and Model Architectures

EMKBs enable a broad suite of knowledge-intensive, cross-modal tasks:

  • Entity Linking/Tagging: Mapping text-image (or text-only, image-only) input pairs to the correct entity. Attribute-aware models leveraging NLI and fine-grained properties show +10.7 F1 over baseline (2305.14725).
  • Visual Question Answering: Retrieval-augmented visual LLMs combine EMKB-sourced textual snippets and image evidence to answer entity-specific questions, yielding +22% BLEURT and up to +85.3% accuracy on tail entities (Qiu et al., 7 Mar 2024).
  • Image Captioning/Generation: Retrieval-augmented generative models (e.g., MERGE) draw on textual content, visual content, and structured subgraphs from EMKBs, achieving CIDEr improvements of up to +20.17 on out-of-domain datasets (You et al., 26 Nov 2025).
  • Knowledge Base Completion: EMKBs augmented with Web-extracted textual facts boost mean average precision in KB completion by up to 48% (Peng et al., 2022).
  • Sense Disambiguation: Multilingual, vision-enhanced models (VisualSem) achieve up to +2.5% accuracy gains on visual verb sense tasks (Huang et al., 2022).

Table: Typical EMKB Modal Coverage (examples):

| Modality | Example Encoding | Key Datasets/Uses |
|---|---|---|
| Textual glosses | BERT, LASER, DeBERTa | VisualSem, AMELI, SnapNTell |
| Visual images | CLIP ViT, ResNet152, VGG | VisualSem, SnapNTell, MERGE |
| Structured KGs | GAT, subgraphs, edge types | MERGE, VisualSem, generic EMKBs |
| Attributes | Key–value NLI, SBERT similarity | AMELI |

6. Evaluation and Empirical Findings

Performance of EMKB systems is benchmarked via standard metrics:

  • Retrieval accuracy: Hits@K, MRR, BLEU-n, METEOR, ROUGE-L, BLEURT (a minimal metric sketch appears at the end of this section).
  • Attribute-aware entity linking: AMELI achieves 33.5% F1 (top-10) on products, with ablation of fine-grained attributes reducing F1 by 10.7 points (2305.14725).
  • Entity tagging: Full-model Hits@1 is 61.2% versus 41.4% (text only) and 7.8% (image only) (Peng et al., 2021).
  • VQA factuality: Retrieval-augmented SnapNTell outperforms baselines by +22% BLEURT overall and up to +85.3% accuracy on long-tail entities (Qiu et al., 7 Mar 2024).
  • News image captioning: MERGE exceeds prior SOTA by +6.84 CIDEr and +4.14 NER-F1 (GoodNews), generalizing to unseen datasets (You et al., 26 Nov 2025).
  • Efficiency: Query-driven KB completion achieves ≈6 s latency for YAGO-scale graphs (Peng et al., 2022).

Ablation studies consistently show that text cross-encoders and knowledge-grounded attributes provide the largest marginal gains, while pure vision-only models are least robust (due to ambiguity and visual diversity).
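
For reference, the ranking metrics cited throughout (Hits@K, MRR) reduce to a few lines. This generic sketch assumes 1-based gold-entity ranks and is not tied to any particular benchmark harness.

```python
from typing import Sequence

def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold entity appears in the top K (ranks are 1-based)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank of the gold entity over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 15, 1]               # hypothetical gold-entity ranks from a retrieval run
print(hits_at_k(ranks, k=10))          # 0.8
print(mean_reciprocal_rank(ranks))     # 0.58
```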

7. Ontological Foundations, Integration, and Open Challenges

The recent ontology pattern (Apriceno et al., 17 Oct 2024) provides a principled framework for EMKB design:

  • Core tenet: Decouple entity semantics (MultiModalEntity) from specific digital realizations (ModalDescriptor) and their modality labels (Modality), enforcing extensibility and domain-agnosticism.
  • Integration: Alignment strategies declare media objects and data formats as subclass relations, supporting federated queries and light-touch harmonization across medical, culinary, cultural, and technical domains.
  • Use cases: Successful alignment in projects such as FuS-KG, MUSCO, and DataSpaces demonstrates the abstraction’s utility.
  • Limitations: Community adoption, complex modality modeling (e.g., 3D, AR), and explicit relation mapping between modalities remain active research challenges.

Future directions include scalable multimodal retrieval, entity/context expansion to address sparsity, combining LLMs with EMKB schemas, and extending coverage to open-domain and fine-grained multi-aspect entity modeling.


Key references: (Peng et al., 2021, Huang et al., 2022, Peng et al., 2022, 2305.14725, Qiu et al., 7 Mar 2024, Apriceno et al., 17 Oct 2024, You et al., 26 Nov 2025, Pezeshkpour et al., 2018).
