ToFu for Multi-modal KG Reasoning

Updated 16 April 2026

ToFu is a multi-modal knowledge graph reasoning model that tokenizes structure, image, and text data, enabling efficient cross-graph generalization.
It employs a hierarchical fusion framework combining GNNs and transformers to integrate discrete modalities with minimal parameters.
Empirical results demonstrate significant gains in MRR and Hit@10 over baselines, underscoring its robustness and efficiency in various inductive settings.

ToFu (“TOken-based Foundation model for multi-modal Knowledge Graph Reasoning”) is a foundation model architecture for multi-modal knowledge graph reasoning (MMKGR). It targets strong cross-graph generalization by representing all structural and content modalities as discrete, transferable tokens. Unlike previous approaches that rely on entity- or relation-specific embeddings, ToFu uses a fully tokenized framework with hierarchical fusion and message mixing that is lightweight (~0.25M parameters) and robust to unseen heterogeneous graphs and modalities. The design, mathematical formulation, empirical findings, and limitations of ToFu are detailed below (Zhang et al., 11 Feb 2026).

1. Architecture and Tokenization Strategy

ToFu processes each MMKG query (triple) by sampling a local k-hop subgraph around the head and tail entities, constructing a comprehensive, multimodal representation at both the entity and neighborhood level. All available information is discretized as follows:

Structural tokens: For every node $e$ in the sampled subgraph, the tuple $[d(e, h), d(e, t)]$ (shortest-path distances to head $h$ and tail $t$ ) indexes a small, learnable structural codebook.
Textual tokens: Entity descriptions are tokenized with a fixed BERT WordPiece tokenizer, producing up to $N_{\mathrm{txt}}$ subword tokens, each mapped via BERT embeddings with an additional projection layer.
Visual tokens: Each entity image is encoded via a pre-trained VQ-VAE (e.g., BEiT v2), generating $N_{\mathrm{vis}}$ discrete patch tokens further projected into the shared model space.

This tokenization strategy eliminates explicit per-entity or per-relation embeddings, enabling the model to apply the same parameter set across diverse knowledge graph domains and inductive regimes.
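The structural tokens described above can be illustrated with a minimal Python sketch. It computes $[d(e, h), d(e, t)]$ by breadth-first search and flattens the pair into a single codebook index. The function names, the clipping distance `max_dist`, and the shared overflow bucket for distant or unreachable nodes are assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def structural_token_ids(adj, head, tail, max_dist=3):
    """Map each node to a codebook index derived from [d(e, h), d(e, t)].

    Assumption of this sketch: distances above max_dist, and unreachable
    nodes, share one overflow bucket so the codebook stays small.
    """
    d_h = bfs_distances(adj, head)
    d_t = bfs_distances(adj, tail)
    overflow = max_dist + 1
    width = max_dist + 2  # values 0..max_dist plus the overflow bucket
    ids = {}
    for node in adj:
        a = min(d_h.get(node, overflow), overflow)
        b = min(d_t.get(node, overflow), overflow)
        ids[node] = a * width + b  # row-major index into a width x width codebook
    return ids


# Toy undirected subgraph: h - a - t, with b attached to a.
adj = {"h": ["a"], "a": ["h", "t", "b"], "t": ["a"], "b": ["a"]}
token_ids = structural_token_ids(adj, head="h", tail="t")
```

Because the index depends only on distances to the query's head and tail, the same small codebook can be reused on any graph, which is the property the tokenization relies on.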

2. Hierarchical Fusion and Message-Passing

ToFu employs a two-stage fusion architecture:

Local fusion:
- Structural Encoder (SE): An $L_1$ -layer GNN propagates structural codebook embeddings, guided by a shared relation query vector. Updates are given by:
$H_{\mathrm{ent}}^{(i+1)} = \mathrm{AGG}_{n \in N(e)}\left[\mathrm{MSG}(H_{\mathrm{ent}}^{(i)}, H_{\mathrm{rel}}^{(i)}, n, q)\right]$

with $\mathrm{MSG}(\cdot)$ a learned MLP weighted by attention $\alpha_{r, q}$, and $\mathrm{AGG}$ as attention- or max-pooling.
- Multi-modal Encoder (ME): A transformer ingests the concatenation of the entity's visual and textual tokens and outputs a fused image+text representation via a special [ENT] read-out token.
- Gated Fusion: Learned gates combine the structural (SE) and multi-modal (ME) features into a single fused entity feature.

Global propagation (Mixture-of-Messages, MiM):
- Fused features propagate through the subgraph with additional GNN layers. Each edge passes a weighted sum of base message functions (TransE, DistMult, RotatE, etc.). The quantities entering this sum, including the mixing weights, are produced by small MLPs.

This modular mixture-of-message operator imparts universal inductive bias, crucial for out-of-domain generalization.
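As a rough illustration of the mixture-of-messages idea, the sketch below mixes two base message functions with softmax weights, so the result is a convex combination. It is a simplification under stated assumptions: only TransE- and DistMult-style messages are included (RotatE is omitted for brevity), the logits are passed in directly rather than produced by an MLP, and all function names are hypothetical.

```python
import math


def transe_msg(h, r):
    """TransE-style message: translate the neighbor state by the relation."""
    return [hi + ri for hi, ri in zip(h, r)]


def distmult_msg(h, r):
    """DistMult-style message: element-wise product with the relation."""
    return [hi * ri for hi, ri in zip(h, r)]


def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_message(h, r, logits, base_fns=(transe_msg, distmult_msg)):
    """Convex combination of base messages, one logit per base function.

    In the described model the weights come from small MLPs; here the
    logits are a plain input (an assumption of this sketch).
    """
    weights = softmax(logits)
    out = [0.0] * len(h)
    for w, fn in zip(weights, base_fns):
        msg = fn(h, r)
        out = [o + w * m for o, m in zip(out, msg)]
    return out, weights


h = [1.0, 2.0]   # neighbor entity state
r = [0.5, -1.0]  # relation state
msg, weights = mixed_message(h, r, logits=[0.0, 0.0])
```

With equal logits the message is the average of the two base messages; pushing one logit up recovers the corresponding single-function message passing as a limiting case.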

3. Mathematical Formulation

The key modeling components are:

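Structural encoding: the $L_1$-layer GNN update $H_{\mathrm{ent}}^{(i+1)} = \mathrm{AGG}_{n \in N(e)}\left[\mathrm{MSG}(H_{\mathrm{ent}}^{(i)}, H_{\mathrm{rel}}^{(i)}, n, q)\right]$ from Section 2.
Multi-modal transformer: a transformer over the concatenated visual and textual tokens of an entity, read out through the [ENT] token.
Gated fusion: learned gates combine the structural and multi-modal representations of each entity.
Message mixing: Each message to a node is a convex combination of baseline operators, facilitating adaptation to various KG semantics.
Scoring / loss: The final entity representation is scored by an MLP and trained with a multi-class cross-entropy loss over negative samples.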

The resulting feature pipeline is highly modular and avoids embedding lookup tables entirely.
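Two of these steps, gated fusion and the cross-entropy objective, can be sketched in a few lines of Python. The exact gate parameterization is not given in the text above, so the element-wise sigmoid gate used here is an assumption, as are the function names. The loss treats the true entity and its negative samples as one multi-class problem.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_struct, h_mm, gate_logits):
    """Element-wise gate g in (0, 1): g * h_struct + (1 - g) * h_mm.

    Assumed form; in a trained model the gate logits would be learned.
    """
    fused = []
    for s, m, z in zip(h_struct, h_mm, gate_logits):
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * m)
    return fused


def cross_entropy(scores, target_index):
    """Multi-class cross-entropy over candidate scores (true entity plus negatives)."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[target_index]


# Zero gate logits give an even blend of the two feature vectors.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_logits=[0.0, 0.0])

# One positive candidate (index 0) scored against two negatives.
loss = cross_entropy([2.0, 0.0, 0.0], target_index=0)
```

The loss shrinks as the true candidate's score rises relative to the negatives, which is what drives the ranking metrics reported later.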

4. Training Regimes and Generalization

ToFu is pretrained on four diverse multi-modal KGs (DB15K, MKG-W, MKG-Y, FB15K-237(v1)) using the outlined sampling and loss, with no entity/relation-specific parameters. Transfer learning is tested in three settings across 17 MMKGs:

Transductive: all entities and relations seen during pretraining.
Inductive: new test entities, known relations.
Fully-inductive: both new entities and new relations at inference.
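
The three settings differ only in which test-time entities and relations were seen during training. A small, hypothetical helper makes the distinction concrete. It is a simplification: any unseen entity is treated as enough to make a split inductive, and the case of new relations with only known entities, which the list above does not name, gets its own label.

```python
def transfer_setting(train_entities, train_relations, test_triples):
    """Classify a test split by whether it contains unseen entities or relations."""
    test_entities = {e for h, _, t in test_triples for e in (h, t)}
    test_relations = {r for _, r, _ in test_triples}
    new_entities = not (test_entities <= set(train_entities))
    new_relations = not (test_relations <= set(train_relations))
    if new_entities and new_relations:
        return "fully-inductive"
    if new_entities:
        return "inductive"
    if new_relations:
        return "unseen-relations-only"  # not one of the three settings above
    return "transductive"


train_entities = {"paris", "france", "berlin"}
train_relations = {"capital_of"}

setting_a = transfer_setting(train_entities, train_relations,
                             [("berlin", "capital_of", "france")])
setting_b = transfer_setting(train_entities, train_relations,
                             [("rome", "capital_of", "italy")])
setting_c = transfer_setting(train_entities, train_relations,
                             [("rome", "located_in", "italy")])
```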

Zero-shot and fine-tuned variants are evaluated; parameter updates are applied only to shared GNN and Transformer weights during fine-tuning, leaving all codebooks and gates intact (Zhang et al., 11 Feb 2026).

5. Empirical Performance and Ablation

ToFu achieves strong zero-shot and fine-tuned (FT) performance on MMKG link prediction:

Model (all 17 MMKGs)	MRR (Zero-shot)	Hit@10 (Zero-shot)	MRR (FT)	Hit@10 (FT)
ToFu	45.93%	61.67%	47.41%	63.02%
SOTA baselines	35–44%	~59%	41–45%	61–62%
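
For reference, MRR and Hit@10 are both computed from the rank of the correct entity among all scored candidates. The sketch below uses the standard definitions; the pessimistic tie-breaking rule and the toy ranks are assumptions made for illustration and are unrelated to the reported numbers.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the target; ties are counted against the target."""
    target = scores[target_index]
    better_or_equal = sum(
        1 for i, s in enumerate(scores) if i != target_index and s >= target
    )
    return 1 + better_or_equal


def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k=10):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy ranks for four queries (illustrative only).
ranks = [1, 2, 20, 4]
toy_mrr = mrr(ranks)
toy_hit10 = hit_at_k(ranks, k=10)
```

MRR rewards placing the correct entity near the very top, while Hit@10 only checks whether it lands in the top ten, so the two metrics can move differently for the same model.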

Key ablation outcomes:

Removal of structural encoder (SE) or global propagation (GP) drops MRR by 5–10 points, emphasizing the necessity of hierarchical fusion.
Omitting textual tokens degrades accuracy the most; visual tokens also contribute but are less critical.
Token budget experiments show that accuracy rises with the number of visual/text tokens and saturates at around 8–16 tokens, which is a practical operating point for efficiency.
Replacing the MiM strategy with a single message function reduces both stability and accuracy.

6. Practical Implications, Limitations, and Future Work

ToFu provides key advances for MMKGR:

Generalization: Entity-agnostic tokenization enables true zero-shot and few-shot cross-KG reasoning.
Efficiency: The parameter count (~0.25M) is two orders of magnitude smaller than that of typical multi-modal knowledge graph foundation models (KGFMs).
Robustness: Fine-grained codebook reuse for images and text strengthens robustness on content-rich and unseen graphs.

However, certain limitations are noted:

The absence of explicit entity embeddings modestly hurts performance on long-tail or coarse-grained ranking metrics, notably Hit@10.
Accurate k-hop subgraph extraction is required; very large graphs may require advanced sampling heuristics.
Only image and text modalities are currently supported; extension to audio, video, or dynamic KGs is considered future work.
Application to fully end-to-end tasks (e.g., entity alignment, QA) remains open (Zhang et al., 11 Feb 2026).

7. Broader Impact and Research Directions

ToFu’s design demonstrates that a modular, fully tokenized, multi-modal modeling pipeline can supersede embedding-centric architectures in both transferability and efficiency. Future research may extend to additional modalities, address subgraph sampling challenges in web-scale KGs, or integrate ToFu into larger multi-modal foundation models and downstream tasks.

Summary Table: ToFu’s Core Pipeline

Stage	Input Modalities	Operation Type
Tokenization	Structure / Image / Text	Discretize/codebook
Local fusion	Structural + MM tokens	GNN + Transformer
Gated feature mixing	Fused local features	Learned gating
Global propagation (MiM)	All fused entities	Mixture of GNN messages
Scoring & loss	Target entity representations	MLP, Cross-entropy

By systematically discretizing and hierarchically fusing structural, image, and text information, ToFu achieves broad, transferable MMKG reasoning with a minimal and fully shared model footprint (Zhang et al., 11 Feb 2026).

References (1)

Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens (2026)
