Multimodal Fusion: TI-JEPA & JEPA-T

Updated 2 December 2025
  • Multimodal fusion is defined as integrating heterogeneous modalities like text and images using joint embedding frameworks with cross-attention and energy-based losses.
  • TI-JEPA and JEPA-T pair modality-specific encoders with predictor networks, leveraging cross-attention fusion (late fusion in JEPA-T) and predictive objectives to improve data efficiency and open-vocabulary generation.
  • Empirical studies show these models achieve state-of-the-art results in tasks such as image generation and sentiment classification compared to traditional methods.

Multimodal fusion refers to the architectural and algorithmic mechanisms by which information from heterogeneous modalities—typically text and images—is integrated within joint embedding frameworks. Modern approaches such as TI-JEPA (“Text-Image Joint Embedding Predictive Architecture”) and JEPA-T (“Joint-Embedding Predictive Architecture with Text Fusion for Image Generation”) leverage cross-attention mechanisms, prediction-based objectives, and energy-based losses to establish robust and semantically meaningful bridges across modalities. These methods significantly advance both generative modeling (text-to-image, JEPA-T) and predictive alignment (text-image understanding, TI-JEPA), demonstrating strong improvements in data efficiency, open-vocabulary generalization, and downstream multimodal recognition and generation tasks (Wan et al., 1 Oct 2025, Vo et al., 9 Mar 2025).

1. Architectural Design of TI-JEPA and JEPA-T

TI-JEPA

TI-JEPA consists of three main modules: a Vision Transformer image encoder $f_I$, a transformer-based text encoder $f_T$, and a shallow predictor network $g_\phi$. Information from the text and image streams is fused through a pair of cross-attention transformer stacks (modules $X$ and $\tilde{X}$):

  • Image encoding: An image $I$ or a masked image $I_{\text{context}}$ is mapped to patch embeddings $s_I \in \mathbb{R}^{N \times d_I}$ via $f_I$.
  • Text encoding: A caption $T$ is mapped to embeddings $s_T \in \mathbb{R}^{L \times d_T}$ via $f_T$.
  • Fusion via cross-attention: $X(s_T, s_{I_{\text{context}}})$ (context) and $\tilde{X}(s_T, s_{I_{\text{full}}})$ (target) integrate textual and visual information through per-block cross-attention (queries: image tokens; keys/values: text tokens).
  • Prediction: $g_\phi$ receives the fused context representations plus learnable mask tokens corresponding to masked visual patches and predicts the target representations for these patch regions (a code sketch follows this list).
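
The following PyTorch sketch illustrates one plausible cross-attention fusion block of the kind used in modules $X$ and $\tilde{X}$: image patch tokens serve as queries and text tokens as keys/values. The class name, dimensions, and residual/MLP details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    """Illustrative fusion block: image patch tokens attend to text tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, s_img: torch.Tensor, s_txt: torch.Tensor) -> torch.Tensor:
        # Queries come from image patch tokens; keys/values come from text tokens.
        kv = self.norm_kv(s_txt)
        attended, _ = self.cross_attn(self.norm_q(s_img), kv, kv)
        x = s_img + attended                   # residual connection around cross-attention
        return x + self.mlp(self.norm_ff(x))   # feed-forward with a second residual

# Hypothetical usage: fuse masked-context image patches with caption token embeddings.
s_I_context = torch.randn(2, 196, 768)  # (batch, N image patches, d)
s_T = torch.randn(2, 32, 768)           # (batch, L text tokens, d)
s_x = CrossAttentionFusionBlock()(s_I_context, s_T)
```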

JEPA-T

JEPA-T employs a two-stage vision-language transformer architecture:

  • Tokenization: Images are encoded by a VAE and projected to the model dimension, with ~70% of the resulting tokens randomly masked. Text is tokenized with CLIP’s tokenizer and projected into the shared embedding space.
  • Encoder: Processes the unmasked visual tokens alongside a learnable buffer, yielding a “memory” $M$ through self-attention layers.
  • Predictor/Decoder:
    • Runs self-attention and cross-attention (to $M$), followed by an initial prediction of the masked tokens.
    • Applies late-stage cross-attention from the predicted visual tokens to the text condition embeddings ($c$), performing a “residual fuse” and a concatenation-projection to produce the “final token”.
  • Textual fusion: Occurs primarily at two points (late fusion): cross-attention after prediction (queries from the predicted visual tokens, keys/values from $c$) and direct injection into the flow-matching loss (see the sketch after this list).
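
Below is a minimal sketch of the late-fusion step described above, assuming a single cross-attention layer from the predicted visual tokens to the text condition $c$, followed by the residual fuse and concatenation-projection; the layer sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LateTextFusion(nn.Module):
    """Sketch of the late-fusion step: predicted visual tokens attend to the text condition c."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # concatenation-projection producing the "final token"

    def forward(self, z_pred: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # Queries: predicted visual tokens; keys/values: projected text condition embeddings.
        attended, _ = self.cross_attn(z_pred, c, c)
        z_fused = z_pred + attended                                 # "residual fuse"
        z_final = self.proj(torch.cat([z_pred, z_fused], dim=-1))   # concatenate and project
        return z_final

z_pred = torch.randn(2, 256, 768)  # predicted visual tokens from the decoder
c = torch.randn(2, 77, 768)        # CLIP-tokenized text, projected to the shared space
z_final = LateTextFusion()(z_pred, c)
```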

2. Multimodal Fusion Mechanisms

Cross-Attention Strategies

Central to both architectures is the use of cross-attention for modality fusion:

  • TI-JEPA: Each block in modules $X$ and $\tilde{X}$ incorporates cross-attention, where image patch tokens attend to text token embeddings. This grounds patch-level representations in language, yielding fused features $s_x$ and $s_y$ (Vo et al., 9 Mar 2025).
  • JEPA-T: Early cross-attention to encoder memory is supplemented by explicit, post-predictor cross-attention, binding textual cues to predicted visual tokens at high resolution. The core predictor remains vision-only, isolating fusion to precise architectural loci (“late-fusion”) (Wan et al., 1 Oct 2025).

Objective-Level Fusion

JEPA-T further injects pooled text embeddings $c$ directly into the loss computation (specifically into the flow-matching and alignment losses), ensuring cross-modal alignment not only at the representational level but also at the objective level. The alignment loss is

L_{\text{align}} = \|\text{Pool}(z_{\text{final}}) - \text{Pool}(c)\|_2^2,

enforcing congruity between image and text global representations.
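
As a worked example, the alignment term can be computed by mean-pooling both token streams and taking the squared L2 distance between the pooled vectors; the choice of mean pooling here is an assumption.

```python
import torch

def alignment_loss(z_final: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """L_align: squared L2 distance between pooled image and text representations."""
    pooled_img = z_final.mean(dim=1)   # mean-pool visual tokens -> (batch, dim)
    pooled_txt = c.mean(dim=1)         # mean-pool text tokens   -> (batch, dim)
    return ((pooled_img - pooled_txt) ** 2).sum(dim=-1).mean()  # average over the batch
```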

3. Learning Objectives

Predictive (Reconstruction) Losses

Both TI-JEPA and JEPA-T are fundamentally predictive, focusing on reconstructing masked or denoised targets:

  • TI-JEPA: The loss is the mean squared error between predicted and ground-truth patch embeddings:

L_P = \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in B_i} \|\hat{s}_{y_j} - s_{y_j}\|_2^2.

  • JEPA-T: The flow-matching loss regresses the clean visual latents $v_0$ from noised latents $v_t$, conditioned on the text embedding $c$ and computed only over masked token positions:

L_{\text{FM}} = \mathbb{E}_{t \sim U[0,1],\, v_0,\, \epsilon} \left[ \|\hat{v}_0(v_t, c) - v_0\|_2^2 \odot \text{mask} \right].

In JEPA-T, injecting the text embedding into the objective yields strong cross-modal gradients, directly coupling parameter updates across both modalities.
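
Both objectives can be sketched as masked regression losses over token embeddings; the function names, tensor shapes, and the convention that the mask equals 1 at masked positions are assumptions for illustration.

```python
import torch

def tijepa_prediction_loss(s_hat: torch.Tensor, s_target: torch.Tensor,
                           target_mask: torch.Tensor) -> torch.Tensor:
    """L_P: squared error between predicted and target patch embeddings over masked blocks."""
    sq_err = ((s_hat - s_target) ** 2).sum(dim=-1)   # (batch, N) per-patch squared error
    return (sq_err * target_mask).sum() / target_mask.sum()

def jepat_flow_matching_loss(v_hat_0: torch.Tensor, v_0: torch.Tensor,
                             token_mask: torch.Tensor) -> torch.Tensor:
    """L_FM: regression toward the clean latents v_0, restricted to masked visual tokens."""
    sq_err = ((v_hat_0 - v_0) ** 2).sum(dim=-1)      # (batch, N_tokens) per-token squared error
    return (sq_err * token_mask).sum() / token_mask.sum()
```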

Energy-Based Perspective

TI-JEPA can be interpreted as an energy-based model in which the joint energy $E_\theta(x, y)$ is low when predictions and targets are consistent within a paired sample and high otherwise. The energy corresponds to the cumulative prediction error. In margin-based extensions, explicit negatives can be introduced via:

L = \max\{0,\; E_\theta(x, x^+) - E_\theta(x, x^-) + \Delta\}.
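
A minimal sketch of this energy-based view, treating the energy of a pair as its cumulative prediction error and applying the hinge margin to positive and negative pairings; the negative-sampling scheme itself is left unspecified and is an assumption here.

```python
import torch

def energy(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Energy of a (context, target) pair: cumulative prediction error, one value per sample."""
    return ((pred - target) ** 2).sum(dim=(-2, -1))

def margin_loss(e_pos: torch.Tensor, e_neg: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing paired-sample energy below mismatched-sample energy by a margin delta."""
    return torch.clamp(e_pos - e_neg + delta, min=0.0).mean()
```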

4. Training and Inference Protocols

Both architectures employ self-supervised masked prediction with careful masking strategies:

| Model | Masking Strategy | Backpropagation Scope | Downstream Fine-tuning |
|---|---|---|---|
| TI-JEPA | Large image region masked for context; small blocks for targets | Only the cross-attention modules $X$, $\tilde{X}$ | Classification head on the [CLS] token for sentiment analysis |
| JEPA-T | ~70% of visual tokens masked; text always available | Full model except frozen encoders | Denoising-based generation; supports open-vocabulary prompts |
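
A sketch of the roughly 70% random visual-token masking used on the JEPA-T visual stream; the exact sampling procedure is an assumption.

```python
import torch

def random_token_mask(batch_size: int, num_tokens: int, mask_ratio: float = 0.7) -> torch.Tensor:
    """Return a boolean mask (True = masked) covering roughly `mask_ratio` of the visual tokens."""
    num_masked = int(num_tokens * mask_ratio)
    scores = torch.rand(batch_size, num_tokens)          # random score per token
    masked_idx = scores.argsort(dim=1)[:, :num_masked]   # lowest-scoring tokens get masked
    mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool)
    mask[torch.arange(batch_size).unsqueeze(1), masked_idx] = True
    return mask

mask = random_token_mask(batch_size=2, num_tokens=256)   # ~70% of 256 tokens masked
```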

In JEPA-T, at inference, generation is driven by iterative latent denoising:

def jepa_t_sample(z_vis, c, v, T, dt):
    # Iterative latent denoising: refine the noisy visual latents v conditioned on text c.
    for t in range(T, 0, -1):
        z_pred, z_final = Decoder(z_vis, c)       # re-predict masked tokens and fuse the text condition
        v_hat_0 = FlowHead(z_final, v)            # flow head estimates the clean latent v_0
        v = v + dt * (v_hat_0 - v) / (sigma(t) / alpha(t))  # flow-matching update step
    return VAE.decode(v)                          # decode the denoised latents into an image

Class-conditional and open-vocabulary synthesis result from specifying appropriate prompts.

TI-JEPA can be extended to an image-to-text denoising regime (“JEPA-T” in TI-JEPA notation) by masking the caption, cross-attending from image features, and applying an analogous predictive loss over the masked text tokens (Vo et al., 9 Mar 2025).

5. Empirical Results and Impact

JEPA-T

  • On ImageNet-1K (256×256), JEPA-T achieves:
    • FID = 1.42, IS = 298.3, Prec = 0.79, Rec = 0.63,
    • Exceeds non-fusion (FID 1.75) and non-text-injection (FID 1.48) baselines,
    • Maintains performance with half the training data,
    • Coherent, semantically accurate open-vocabulary generation (Wan et al., 1 Oct 2025).

TI-JEPA

  • On multimodal sentiment classification (MVSA datasets):

| Model | MVSA-Single Acc/F1 | MVSA-Multi Acc/F1 |
|---|---|---|
| TI-JEPA-Large | 76.75 / 74.62 | 77.55 / 75.02 |

This outperforms previous SoTA methods such as CLIP-CA-CG (75.25/73.62) and MVAN (70.15/68.75), establishing new benchmarks for fine-grained text-image alignment (Vo et al., 9 Mar 2025).

Ablation Studies (JEPA-T)

  • w/o Cross-Attn: FID=1.75
  • w/o Text-Inj: FID=1.48
  • w/o Flow-Match: FID=1.60

This suggests that both architectural late fusion and direct injection of text embeddings into the loss confer non-trivial gains in cross-modal conditioning strength while retaining a task-general backbone.

6. Analysis of Fusion Strategies

| Fusion Type | Description | Strengths/Limitations |
|---|---|---|
| Non-fusion | No text–vision cross-attention; text ignored at the core | Fails to modulate predictions by text |
| Early fusion | Text prepended to the backbone input | Can drown out vision; impairs generality |
| Late fusion (JEPA-T) | Text introduced only post-prediction via cross-attention | Strong, localized control; modularity |

JEPA-T’s late-fusion design maintains a generic, reusable self-supervised vision pipeline while leveraging localized, high-resolution text guidance at strategic junctures. Objective-level text alignment further sharpens the cross-modal coupling.

7. Extensions, Limitations, and Future Prospects

Extensions outlined in (Vo et al., 9 Mar 2025) include multimodal scaling (audio, video), hybrid predictive-contrastive objectives, and jointly learned encoders to prevent energy landscape collapse. Downstream applications include Visual Question Answering, cross-modal retrieval, and image captioning by reusing the joint embedding as a universal encoder. A plausible implication is that flexible masking, hybrid fusion points, and energy-based predictiveness will be central in future state-of-the-art multimodal fusion models.

No significant controversies are highlighted in these works, but a common challenge is tuning the locus and strength of fusion to avoid overwhelming one modality with the other or losing the transfer benefits of a general-purpose backbone. Both JEPA-T and TI-JEPA represent a shift toward modular, minimally intrusive fusion, balancing backbone extensibility with parameter-efficient strong conditioning.


References:

JEPA-T: "JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation" (Wan et al., 1 Oct 2025) TI-JEPA: "TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems" (Vo et al., 9 Mar 2025)
