ToFu for Multi-modal KG Reasoning
- ToFu is a multi-modal knowledge graph reasoning model that tokenizes structure, image, and text data, enabling efficient cross-graph generalization.
- It employs a hierarchical fusion framework combining GNNs and transformers to integrate discrete modalities with minimal parameters.
- Empirical results demonstrate significant gains in MRR and Hit@10 over baselines, underscoring its robustness and efficiency in various inductive settings.
ToFu (“TOken-based Foundation model for multi-modal Knowledge Graph Reasoning”) is a foundation model architecture for multi-modal knowledge graph reasoning (MMKGR), designed to deliver strong cross-graph generalization by representing all structural and content modalities as discrete, transferable tokens. Unlike previous approaches that rely on entity- or relation-specific embeddings, ToFu uses a fully tokenized framework with hierarchical fusion and message mixing that is lightweight and parameter-efficient (~0.25M parameters) and robust to unseen heterogeneous graphs and modalities. The design, mathematical formulation, empirical findings, and limitations of ToFu are detailed below (Zhang et al., 11 Feb 2026).
1. Architecture and Tokenization Strategy
ToFu processes each MMKG query (triple) by sampling a local k-hop subgraph around the head and tail entities, constructing a comprehensive, multimodal representation at both the entity and neighborhood level. All available information is discretized as follows:
- Structural tokens: For every node in the sampled subgraph, the pair of shortest-path distances to the head entity and to the tail entity indexes a small, learnable structural codebook.
- Textual tokens: Entity descriptions are tokenized with a fixed BERT WordPiece tokenizer, producing up to a fixed maximum number of subword tokens, each mapped via BERT embeddings with an additional projection layer.
- Visual tokens: Each entity image is encoded via a pre-trained vector-quantized image tokenizer (VQ-VAE style, e.g., the tokenizer of BEiT v2), generating discrete patch tokens that are further projected into the shared model space.
This tokenization strategy eliminates explicit per-entity or per-relation embeddings, enabling the model to apply the same parameter set across diverse knowledge graph domains and inductive regimes.
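As a rough illustration of the structural tokens described above, the sketch below maps each node's pair of shortest-path distances to a single codebook index. It assumes an undirected adjacency dictionary, a hop limit `max_hops`, and one extra "far" bucket for nodes beyond that limit; the function names and the row-major index layout are hypothetical and not taken from the paper.

```python
from collections import deque


def bfs_distances(adj, source, max_hops):
    """Shortest-path distances from `source`, explored up to `max_hops` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def structural_token_ids(adj, head, tail, max_hops):
    """Map every node near head or tail to a codebook index.

    Assumption: distances larger than `max_hops` (or unreachable nodes)
    share one extra bucket, so the codebook has (max_hops + 2) ** 2 entries.
    """
    d_head = bfs_distances(adj, head, max_hops)
    d_tail = bfs_distances(adj, tail, max_hops)
    far = max_hops + 1
    width = max_hops + 2
    tokens = {}
    for node in set(d_head) | set(d_tail):
        dh = d_head.get(node, far)
        dt = d_tail.get(node, far)
        tokens[node] = dh * width + dt  # row-major index of (dh, dt)
    return tokens


# Path graph a - b - c - d, query with head "a" and tail "d".
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
tokens = structural_token_ids(adj, "a", "d", max_hops=2)
```

Because the index depends only on distances, the same codebook can be reused on any graph, which is the property the tokenization relies on.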
2. Hierarchical Fusion and Message-Passing
ToFu employs a two-stage fusion architecture:
- Local fusion:
- Structural Encoder (SE): A multi-layer GNN propagates structural codebook embeddings, guided by a shared relation query vector. In each layer, a node aggregates messages from its neighbors; each message is computed by a learned MLP and weighted by an attention score, and the aggregation is attention- or max-pooling.
- Multi-modal Encoder (ME): A transformer ingests the concatenated image and text tokens of an entity and outputs a fused image+text representation via a special [ENT] read-out token.
- Gated Fusion: Learned gates combine the structural feature and the multi-modal feature into a single entity representation.
- Global propagation (Mixture-of-Messages, MiM):
- Fused features propagate through the subgraph via additional GNN layers. Each edge passes a weighted sum of several base message functions (TransE, DistMult, RotatE, etc.); the mixing weights and related coefficients are produced by small MLPs.
This modular mixture-of-messages operator gives the model an inductive bias that is not tied to a single KG scoring family, which is crucial for out-of-domain generalization.
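The mixture-of-messages idea can be sketched as a softmax-weighted sum of per-edge messages. The following is a minimal sketch under stated assumptions, not the paper's implementation: the three message forms (additive for TransE, elementwise product for DistMult, complex rotation for RotatE), the softmax over logits, and all function names are illustrative choices. In the model the logits would come from small MLPs; here they are passed in directly.

```python
import cmath
import math


def transe_msg(h, r):
    """TransE-style message: translate the source feature by the relation."""
    return [hi + ri for hi, ri in zip(h, r)]


def distmult_msg(h, r):
    """DistMult-style message: elementwise product with the relation."""
    return [hi * ri for hi, ri in zip(h, r)]


def rotate_msg(h, r):
    """RotatE-style message (assumed layout): consecutive pairs of h are
    complex numbers, each rotated by the phase stored at r[i] for even i."""
    out = []
    for i in range(0, len(h), 2):
        z = complex(h[i], h[i + 1]) * cmath.exp(1j * r[i])
        out.extend([z.real, z.imag])
    return out


BASE_MESSAGES = (transe_msg, distmult_msg, rotate_msg)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_message(h, r, logits):
    """Convex combination of the base messages for one edge."""
    if len(h) % 2 != 0 or len(h) != len(r):
        raise ValueError("h and r must have the same even length")
    weights = softmax(logits)
    msgs = [f(h, r) for f in BASE_MESSAGES]
    return [sum(w * m[i] for w, m in zip(weights, msgs)) for i in range(len(h))]


h, r = [1.0, 0.0], [0.0, 2.0]
uniform = mixed_message(h, r, [0.0, 0.0, 0.0])     # equal weight on each operator
mostly_transe = mixed_message(h, r, [100.0, 0.0, 0.0])
```

Since the weights are non-negative and sum to one, a large logit on one operator recovers that operator alone, while intermediate values interpolate between the relational semantics.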
3. Mathematical Formulation
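The key modeling components are:
- Structural encoding: GNN layers update the structural codebook embedding of each node, conditioned on the shared relation query vector.
- Multi-modal transformer: The image and text tokens of an entity are encoded jointly, and the [ENT] read-out token yields the fused multi-modal representation.
- Gated fusion: Learned gates combine the structural and multi-modal representations into one entity feature.
- Message mixing: Each message passed to a node is a convex combination of baseline operators, facilitating adaptation to various KG semantics.
- Scoring / loss: The final entity representation is mapped to a score by an MLP, and training uses a multi-class cross-entropy loss over the true entity and sampled negatives.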
The resulting feature pipeline is highly modular and avoids per-entity and per-relation embedding lookup tables entirely.
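To make the gated fusion step concrete, here is a minimal sketch of one common gating form: a sigmoid gate computed from the concatenated features, used to interpolate elementwise between the structural and multi-modal vectors. The exact gate parameterization used by ToFu is not given in this text, so the formula, the names `gate_weights` and `gate_bias`, and the single-gate design are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_struct, h_mm, gate_weights, gate_bias):
    """Assumed form: g = sigmoid(W [h_struct; h_mm] + b),
    fused = g * h_struct + (1 - g) * h_mm (elementwise).

    gate_weights has one row per output dimension, each of length
    len(h_struct) + len(h_mm).
    """
    concat = h_struct + h_mm  # list concatenation
    fused = []
    for i, (s, m) in enumerate(zip(h_struct, h_mm)):
        pre = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(pre)
        fused.append(g * s + (1.0 - g) * m)
    return fused


h_struct, h_mm = [1.0, 2.0], [3.0, 6.0]
zero_w = [[0.0] * 4, [0.0] * 4]
balanced = gated_fusion(h_struct, h_mm, zero_w, [0.0, 0.0])    # gate = 0.5 everywhere
saturated = gated_fusion(h_struct, h_mm, zero_w, [50.0, -50.0])  # gate near 1, then near 0
```

With a zero gate input the two sources are averaged; a saturated gate selects the structural feature in one dimension and the multi-modal feature in the other.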
4. Training Regimes and Generalization
ToFu is pretrained on four diverse multi-modal KGs (DB15K, MKG-W, MKG-Y, FB15K-237(v1)) using the outlined sampling and loss, with no entity/relation-specific parameters. Transfer learning is tested in three settings across 17 MMKGs:
- Transductive: all entities and relations seen during pretraining.
- Inductive: new test entities, known relations.
- Fully-inductive: both new entities and new relations at inference.
Zero-shot and fine-tuned variants are evaluated. During fine-tuning, parameter updates are applied only to the shared GNN and Transformer weights; all codebooks and gates stay frozen (Zhang et al., 11 Feb 2026).
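The three settings differ only in which test-time entities and relations were seen during pretraining, which can be checked with simple set differences. The helper below is a hypothetical illustration; the function name and the "other" label for the case not named above (known entities, new relations) are assumptions.

```python
def transfer_setting(train_entities, train_relations, test_triples):
    """Classify a test split by overlap with the pretraining vocabulary.

    test_triples is an iterable of (head, relation, tail) tuples.
    """
    test_triples = list(test_triples)
    test_entities = {e for h, _, t in test_triples for e in (h, t)}
    test_relations = {r for _, r, _ in test_triples}
    new_entities = bool(test_entities - set(train_entities))
    new_relations = bool(test_relations - set(train_relations))
    if new_entities and new_relations:
        return "fully-inductive"
    if new_entities:
        return "inductive"
    if new_relations:
        return "other"  # known entities, new relations: not one of the three settings
    return "transductive"


train_e, train_r = {"a", "b", "c"}, {"r1", "r2"}
seen = transfer_setting(train_e, train_r, [("a", "r1", "b")])
new_ent = transfer_setting(train_e, train_r, [("a", "r1", "x")])
new_both = transfer_setting(train_e, train_r, [("x", "r9", "y")])
```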
5. Empirical Performance and Ablation
ToFu achieves strong transfer and overall performance on MMKG link prediction:
| Setting | MRR (Zero-shot) | Hit@10 (Zero-shot) | MRR (FT) | Hit@10 (FT) |
|---|---|---|---|---|
| All 17 MMKGs | 45.93% | 61.67% | 47.41% | 63.02% |
| SOTA Baseline | 35–44% | ~59% | 41–45% | 61–62% |
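For reference, the two metrics in the table are computed from the rank of the correct entity for each test query. The sketch below uses the standard definitions of mean reciprocal rank and Hit@k; the example ranks are made up for illustration and are not results from the paper.

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based positions of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k=10):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


example_ranks = [1, 2, 20, 4]  # illustrative ranks only
example_mrr = mrr(example_ranks)
example_hit10 = hit_at_k(example_ranks, k=10)
```

Both values lie in [0, 1]; the table reports them as percentages.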
Key ablation outcomes:
- Removal of structural encoder (SE) or global propagation (GP) drops MRR by 5–10 points, emphasizing the necessity of hierarchical fusion.
- Omitting textual tokens causes a particularly large drop in accuracy; visual tokens also contribute but are less critical.
- Token budget experiments show that accuracy increases with the number of visual/text tokens and saturates at around 8–16 tokens, which is a practical range for efficiency.
- Single-function message passing underperforms the MiM strategy, reducing overall stability and effectiveness.
6. Practical Implications, Limitations, and Future Work
ToFu provides key advances for MMKGR:
- Generalization: Entity-agnostic tokenization enables true zero-shot and few-shot cross-KG reasoning.
- Efficiency: The parameter count is two orders of magnitude smaller than typical multimodal KGFMs.
- Robustness: Fine-grained codebook reuse for images and text strengthens robustness on content-rich and unseen graphs.
However, certain limitations are noted:
- The lack of explicit entity embeddings modestly reduces performance on long-tail entities and on coarse-grained ranking metrics, notably Hit@10.
- Accurate k-hop subgraph extraction is required; very large graphs may require advanced sampling heuristics.
- Only image and text modalities are currently supported; extension to audio, video, or dynamic KGs is considered future work.
- Application to fully end-to-end tasks (e.g., entity alignment, QA) remains open (Zhang et al., 11 Feb 2026).
7. Broader Impact and Research Directions
ToFu’s design demonstrates that a modular, fully tokenized, multi-modal modeling pipeline can supersede embedding-centric architectures in both transferability and efficiency. Future research may extend to additional modalities, address subgraph sampling challenges in web-scale KGs, or integrate ToFu into larger multi-modal foundation models and downstream tasks.
Summary Table: ToFu’s Core Pipeline
| Stage | Input Modalities | Operation Type |
|---|---|---|
| Tokenization | Structure / Image / Text | Discretize/codebook |
| Local fusion | Structural + MM tokens | GNN + Transformer |
| Gated feature mixing | Fused local features | Learned gating |
| Global propagation (MiM) | All fused entities | Mixture of GNN messages |
| Scoring & loss | Target entity representations | MLP, Cross-entropy |
By systematically discretizing and hierarchically fusing structural, image, and text information, ToFu achieves broad, transferable MMKG reasoning with a minimal and fully shared model footprint (Zhang et al., 11 Feb 2026).