
ToFu for Multi-modal KG Reasoning

Updated 16 April 2026
  • ToFu is a multi-modal knowledge graph reasoning model that tokenizes structure, image, and text data, enabling efficient cross-graph generalization.
  • It employs a hierarchical fusion framework combining GNNs and transformers to integrate discrete modalities with minimal parameters.
  • Empirical results demonstrate significant gains in MRR and Hit@10 over baselines, underscoring its robustness and efficiency in various inductive settings.

ToFu (“TOken-based Foundation model for multi-modal Knowledge Graph Reasoning”) is a foundation model architecture for multi-modal knowledge graph reasoning (MMKGR). It targets cross-graph generalization by representing all structural and content modalities as discrete, transferable tokens. Unlike previous approaches that rely on entity- or relation-specific embeddings, ToFu uses a fully tokenized framework with hierarchical fusion and message mixing; it is parameter-efficient (~0.25M parameters) and designed to be robust to unseen heterogeneous graphs and modalities. The design, mathematical formulation, empirical findings, and limitations of ToFu are detailed below (Zhang et al., 11 Feb 2026).

1. Architecture and Tokenization Strategy

ToFu processes each MMKG query (triple) by sampling a local k-hop subgraph around the head and tail entities, constructing a comprehensive, multimodal representation at both the entity and neighborhood level. All available information is discretized as follows:

  • Structural tokens: For every node $e$ in the sampled subgraph, the tuple $[d(e, h), d(e, t)]$ (shortest-path distances to head $h$ and tail $t$) indexes a small, learnable structural codebook.
  • Textual tokens: Entity descriptions are tokenized with a fixed BERT WordPiece tokenizer, producing up to $N_{\mathrm{txt}}$ subword tokens, each mapped via BERT embeddings with an additional projection layer.
  • Visual tokens: Each entity image is encoded via a pre-trained VQ-VAE (e.g., BEiT v2), generating $N_{\mathrm{vis}}$ discrete patch tokens further projected into the shared model space.

This tokenization strategy eliminates explicit per-entity or per-relation embeddings, enabling the model to apply the same parameter set across diverse knowledge graph domains and inductive regimes.
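The structural tokenization step can be illustrated with a minimal sketch. It assumes undirected hop distances, clips distances beyond $k$ to $k + 1$, and flattens the pair $[d(e, h), d(e, t)]$ into a single codebook row index; the function names, the clipping rule, and the index layout are illustrative assumptions rather than details confirmed by the source.

```python
from collections import deque


def bfs_distances(adj, source, max_hops):
    """Hop distances from `source`, exploring at most `max_hops` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the k-hop boundary
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def structural_token_ids(edges, head, tail, k):
    """Map every node within k hops of head or tail to a codebook index.

    Assumption: unreachable-within-k distances are clipped to k + 1, so a
    hypothetical codebook needs (k + 2) ** 2 rows, one per distance pair.
    """
    adj = {}
    for u, v in edges:  # edges are treated as undirected for distances
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    d_head = bfs_distances(adj, head, k)
    d_tail = bfs_distances(adj, tail, k)
    far = k + 1
    tokens = {}
    for node in set(d_head) | set(d_tail):
        dh = d_head.get(node, far)
        dt = d_tail.get(node, far)
        tokens[node] = dh * (k + 2) + dt  # flatten the (dh, dt) pair
    return tokens


# Toy path graph a - b - c - d - e with query head "a" and tail "c".
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
tokens = structural_token_ids(edges, head="a", tail="c", k=2)
```

Because the token depends only on relative distances, the same small codebook can be reused on any graph, which is the property the tokenization strategy relies on.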

2. Hierarchical Fusion and Message-Passing

ToFu employs a two-stage fusion architecture:

  • Local fusion:

    • Structural Encoder (SE): An $L_1$-layer GNN propagates structural codebook embeddings, guided by a shared relation query vector. Updates are given by:

    $$H_{\mathrm{ent}}^{(i+1)} = \mathrm{AGG}_{n \in N(e)}\left[\mathrm{MSG}(H_{\mathrm{ent}}^{(i)}, H_{\mathrm{rel}}^{(i)}, n, q)\right]$$

    with $\mathrm{MSG}(\cdot)$ a learned MLP weighted by attention $\alpha_{r, q}$, and $\mathrm{AGG}$ as attention- or max-pooling.

    • Multi-modal Encoder (ME): A transformer ingests the concatenated textual and visual tokens and outputs a fused image+text representation via a special [ENT] read-out token.

    • Gated Fusion: Learned gates combine the structural and multi-modal features into a single fused representation.

  • Global propagation (Mixture-of-Messages, MiM):

    • Fused features propagate through the subgraph with additional GNN layers. Each edge passes a weighted sum of base message functions (TransE, DistMult, RotatE, etc.), with the mixture weights and the other message parameters produced by small MLPs.

Because this mixture-of-messages operator is not tied to a single scoring function, it supplies a general-purpose inductive bias, which is crucial for out-of-domain generalization.
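A minimal sketch of a mixture-of-messages step is given below. It assumes three base operators in simplified real-valued form (TransE as addition, DistMult as element-wise product, RotatE as a rotation of the two halves of the vector treated as real and imaginary parts) and softmax mixture weights. In ToFu the weights come from small MLPs; here the logits are passed in directly, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs are convex weights."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def transe_msg(h, r):
    """TransE-style message: translate the entity by the relation."""
    return [hi + ri for hi, ri in zip(h, r)]


def distmult_msg(h, r):
    """DistMult-style message: element-wise product."""
    return [hi * ri for hi, ri in zip(h, r)]


def rotate_msg(h, r):
    """RotatE-style message (simplified).

    Assumption: the first half of `h` holds real parts, the second half
    imaginary parts, and the first half of `r` holds rotation angles.
    """
    half = len(h) // 2
    out_re, out_im = [], []
    for i in range(half):
        re, im, theta = h[i], h[half + i], r[i]
        out_re.append(re * math.cos(theta) - im * math.sin(theta))
        out_im.append(re * math.sin(theta) + im * math.cos(theta))
    return out_re + out_im


BASE_MESSAGES = (transe_msg, distmult_msg, rotate_msg)


def mixed_message(h, r, logits):
    """Convex combination of the base messages for one edge."""
    weights = softmax(logits)
    msgs = [f(h, r) for f in BASE_MESSAGES]
    return [
        sum(w * msg[j] for w, msg in zip(weights, msgs))
        for j in range(len(h))
    ]


h = [1.0, 2.0, 3.0, 4.0]
uniform = mixed_message(h, [0.0] * 4, [0.0, 0.0, 0.0])
mostly_transe = mixed_message(h, [0.5] * 4, [100.0, 0.0, 0.0])
```

With equal logits every operator contributes one third of the message; a dominant logit makes the edge behave almost like a single-function GNN, which is the special case the ablation compares against.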

3. Mathematical Formulation

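The key modeling stages are:

  • Structural encoding: a GNN over the structural codebook embeddings produces a structural representation for each entity.
  • Multi-modal transformer: a transformer over the textual and visual tokens produces a fused image+text representation, read out from the [ENT] token.
  • Gated fusion: learned gates combine the structural and multi-modal representations.
  • Message mixing: Each message to an entity is a convex combination of baseline operators, facilitating adaptation to various KG semantics.
  • Scoring / loss: The final entity representation is scored by an MLP, with multi-class cross-entropy loss over negative samples.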

The resulting feature pipeline is highly modular and avoids embedding lookup tables entirely.
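The gated fusion and scoring stages can be sketched as follows. The element-wise form `g * h_struct + (1 - g) * h_mm` with a sigmoid gate is an assumed, commonly used gating scheme, not a formula confirmed by the source. The gate logits and candidate scores would come from learned layers in ToFu but are passed in directly here, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_struct, h_mm, gate_logits):
    """Element-wise gate between structural and multi-modal features.

    Assumed form: fused = g * h_struct + (1 - g) * h_mm, g = sigmoid(logit).
    """
    fused = []
    for s, m, z in zip(h_struct, h_mm, gate_logits):
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * m)
    return fused


def cross_entropy(scores, target_index):
    """Multi-class cross-entropy over one true candidate and its negatives."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target_index]


# A zero gate logit gives g = 0.5, i.e. a plain average of both features.
fused = gated_fusion([1.0, 1.0], [3.0, 3.0], [0.0, 0.0])

# Candidate 0 is the true tail; the remaining entries are negative samples.
loss_uninformed = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=0)
loss_confident = cross_entropy([10.0, 0.0, 0.0, 0.0], target_index=0)
```

The loss falls as the true candidate's score separates from the negatives, so training pushes the fused representation of the correct entity above the sampled alternatives.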

4. Training Regimes and Generalization

ToFu is pretrained on four diverse multi-modal KGs (DB15K, MKG-W, MKG-Y, FB15K-237(v1)) using the outlined sampling and loss, with no entity/relation-specific parameters. Transfer learning is tested in three settings across 17 MMKGs:

  • Transductive: all entities and relations seen during pretraining.
  • Inductive: new test entities, known relations.
  • Fully-inductive: both new entities and new relations at inference.
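
The three settings differ only in which test entities and relations were seen during pretraining. The sketch below, using a hypothetical helper `evaluation_setting`, classifies a train/test split on that basis. It is a simplification: any unseen entity counts as "new", whereas benchmark splits may require fully disjoint entity sets. The source does not name the case of new relations over known entities, so the sketch labels it "other".

```python
def evaluation_setting(train_triples, test_triples):
    """Classify a split as transductive, inductive, or fully-inductive."""
    train_entities = {e for h, _, t in train_triples for e in (h, t)}
    train_relations = {r for _, r, _ in train_triples}
    test_entities = {e for h, _, t in test_triples for e in (h, t)}
    test_relations = {r for _, r, _ in test_triples}

    new_entities = not test_entities <= train_entities
    new_relations = not test_relations <= train_relations

    if new_entities and new_relations:
        return "fully-inductive"
    if new_entities:
        return "inductive"
    if new_relations:
        return "other"  # not one of the three settings named in the text
    return "transductive"


train = [("a", "r1", "b"), ("b", "r2", "c")]
setting_seen = evaluation_setting(train, [("a", "r2", "c")])
setting_new_entities = evaluation_setting(train, [("x", "r1", "y")])
setting_all_new = evaluation_setting(train, [("x", "r9", "y")])
```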

Zero-shot and fine-tuned variants are evaluated; parameter updates are applied only to shared GNN and Transformer weights during fine-tuning, leaving all codebooks and gates intact (Zhang et al., 11 Feb 2026).

5. Empirical Performance and Ablation

ToFu achieves strong transfer and overall performance on MMKG link prediction:

| Setting | MRR (Zero-shot) | Hit@10 (Zero-shot) | MRR (FT) | Hit@10 (FT) |
|---|---|---|---|---|
| All 17 MMKGs | 45.93% | 61.67% | 47.41% | 63.02% |
| SOTA Baseline | 35–44% | ~59% | 41–45% | 61–62% |
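
For reference, the two metrics in the table are computed from the rank of the correct entity among all candidates. The sketch below uses the standard definitions; the tie-handling rule (rank is one plus the number of strictly higher-scoring candidates) is an assumption, and the ranks in the example are made up for illustration, not taken from the paper.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the target; ties are resolved optimistically."""
    target = scores[target_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(ranks):
    """MRR: average of 1 / rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Hit@k: fraction of queries whose target ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Illustrative ranks for four queries (made-up values).
ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4 + 1/20) / 4, about 0.45
hit10 = hits_at_k(ranks, k=10)     # three of four ranks are <= 10
```

MRR rewards placing the correct entity near the very top, while Hit@10 only checks whether it appears in the first ten candidates, so the two can move differently across models.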

Key ablation outcomes:

  • Removal of structural encoder (SE) or global propagation (GP) drops MRR by 5–10 points, emphasizing the necessity of hierarchical fusion.
  • Omitting textual tokens degrades accuracy the most; visual tokens also contribute but are less critical.
  • Token budget experiments show a monotonic accuracy increase as the number of visual/text tokens grows, saturating at 8–16 tokens for practical efficiency.
  • Single-function message passing underperforms the MiM strategy, reducing overall stability and effectiveness.

6. Practical Implications, Limitations, and Future Work

ToFu provides key advances for MMKGR:

  • Generalization: Entity-agnostic tokenization enables true zero-shot and few-shot cross-KG reasoning.
  • Efficiency: The parameter count is two orders of magnitude smaller than typical multimodal KGFMs.
  • Robustness: Fine-grained codebook reuse for images and text strengthens robustness on content-rich and unseen graphs.

However, certain limitations are noted:

  • The lack of explicit entity embeddings modestly hurts performance on long-tail entities and on coarse-grained ranking metrics, notably Hit@10.
  • Accurate k-hop subgraph extraction is required; very large graphs may require advanced sampling heuristics.
  • Only image and text modalities are currently supported; extension to audio, video, or dynamic KGs is considered future work.
  • Application to fully end-to-end tasks (e.g., entity alignment, QA) remains open (Zhang et al., 11 Feb 2026).

7. Broader Impact and Research Directions

ToFu’s design demonstrates that a modular, fully tokenized, multi-modal modeling pipeline can supersede embedding-centric architectures in both transferability and efficiency. Future research may extend to additional modalities, address subgraph sampling challenges in web-scale KGs, or integrate ToFu into larger multi-modal foundation models and downstream tasks.

Summary Table: ToFu’s Core Pipeline

| Stage | Input Modalities | Operation Type |
|---|---|---|
| Tokenization | Structure / Image / Text | Discretize/codebook |
| Local fusion | Structural + MM tokens | GNN + Transformer |
| Gated feature mixing | Fused local features | Learned gating |
| Global propagation (MiM) | All fused entities | Mixture of GNN messages |
| Scoring & loss | Target entity representations | MLP, Cross-entropy |

By systematically discretizing and hierarchically fusing structural, image, and text information, ToFu achieves broad, transferable MMKG reasoning with a minimal and fully shared model footprint (Zhang et al., 11 Feb 2026).

References (1)
