Multimodal Fusion: TI-JEPA & JEPA-T
- Multimodal fusion is defined as integrating heterogeneous modalities like text and images using joint embedding frameworks with cross-attention and energy-based losses.
- TI-JEPA and JEPA-T employ specialized encoder-decoder architectures, leveraging late fusion and predictive objectives to improve data efficiency and open-vocabulary generation.
- Empirical studies show these models achieve state-of-the-art results in tasks such as image generation and sentiment classification compared to traditional methods.
Multimodal fusion refers to the architectural and algorithmic mechanisms by which information from heterogeneous modalities—typically text and images—is integrated within joint embedding frameworks. Modern approaches such as TI-JEPA (“Text-Image Joint Embedding Predictive Architecture”) and JEPA-T (“Joint-Embedding Predictive Architecture with Text Fusion for Image Generation”) leverage cross-attention mechanisms, prediction-based objectives, and energy-based losses to establish robust and semantically meaningful bridges across modalities. These methods significantly advance both generative modeling (text-to-image, JEPA-T) and predictive alignment (text-image understanding, TI-JEPA), demonstrating strong improvements in data efficiency, open-vocabulary generalization, and downstream multimodal recognition and generation tasks (Wan et al., 1 Oct 2025, Vo et al., 9 Mar 2025).
1. Architectural Design of TI-JEPA and JEPA-T
TI-JEPA
TI-JEPA consists of three main modules: a Vision Transformer image encoder, a transformer-based text encoder, and a shallow predictor network. Information from the text and image streams is fused through a pair of cross-attention transformer stacks, a context fusion module and a target fusion module:
- Image encoding: An image, or a masked view of it, is mapped to patch embeddings by the image encoder.
- Text encoding: The caption is mapped to token embeddings by the text encoder.
- Fusion via cross-attention: The context and target fusion modules integrate textual and visual information through per-block cross-attention (queries: image tokens; keys/values: text tokens).
- Prediction: The predictor receives the fused context representations plus learnable mask tokens corresponding to the masked visual patches, and predicts target embeddings for those patch regions (see the sketch after this list).
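A minimal PyTorch-style sketch of this masked-prediction flow; the module and tensor names (`image_encoder`, `text_encoder`, `fuse_ctx`, `fuse_tgt`, `predictor`, `mask_token`) and the stop-gradient on the target branch are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def ti_jepa_loss(image_encoder, text_encoder, fuse_ctx, fuse_tgt, predictor,
                 mask_token, ctx_image, tgt_image, caption_ids, masked_idx):
    """One TI-JEPA-style masked-prediction step (illustrative names and shapes).

    mask_token: (1, 1, D) learnable mask embedding; masked_idx: (B, M) masked patch indices.
    """
    text = text_encoder(caption_ids)                            # (B, T, D) caption token embeddings
    # Context branch: image patch tokens attend to text tokens inside the fusion stack.
    ctx = fuse_ctx(image_encoder(ctx_image), text)              # (B, N_ctx, D)
    # Target branch: fused features of the unmasked view, treated here as a stop-gradient target
    # (a common JEPA-style choice; the paper may handle gradients differently).
    with torch.no_grad():
        tgt = fuse_tgt(image_encoder(tgt_image), text)          # (B, N, D)
    B, M = masked_idx.shape
    mask_tokens = mask_token.expand(B, M, -1)                   # one learnable token per masked patch
    pred = predictor(torch.cat([ctx, mask_tokens], dim=1))[:, -M:]   # predictions for masked patches
    target = torch.gather(tgt, 1, masked_idx.unsqueeze(-1).expand(-1, -1, tgt.size(-1)))
    return ((pred - target) ** 2).mean()                        # MSE over masked patch embeddings
```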
JEPA-T
JEPA-T employs a two-stage vision-language transformer architecture:
- Tokenization: Images are encoded by a VAE and projected to the model dimension, with 70% of the resulting tokens randomly masked. Text is tokenized with CLIP’s tokenizer and projected into the shared embedding space.
- Encoder: Processes unmasked visual tokens alongside a learnable buffer, yielding “memory” through self-attention layers.
- Predictor/Decoder:
- Runs self-attention and cross-attention (to the encoder memory), followed by an initial masked-token prediction.
- Applies late-stage cross-attention from the predicted visual tokens to the text condition embeddings, performing a “residual fuse” and a concatenation-projection to produce the “final token”.
- Textual fusion: Occurs primarily at two points (late fusion): cross-attention after prediction (queries from the predicted visual tokens, keys/values from the text condition embeddings) and direct injection of text embeddings into the flow-matching loss (a sketch of the late-fusion step follows this list).
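A condensed sketch of the late text-fusion step, under one possible reading of the “residual fuse” plus concatenation-projection described above; the module itself and its layout are assumptions, with `z_pred` denoting the predicted visual tokens and `c` the text condition embeddings (as in the inference pseudocode later in this section):

```python
import torch
import torch.nn as nn

class LateTextFusion(nn.Module):
    """Late-stage text fusion: predicted visual tokens attend to text conditions (illustrative)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)            # concatenation-projection to the "final token"

    def forward(self, z_pred, c):
        # Queries come from the predicted visual tokens; keys/values from the text condition embeddings.
        attended, _ = self.cross_attn(query=z_pred, key=c, value=c)
        z_fused = z_pred + attended                    # "residual fuse"
        z_final = self.proj(torch.cat([z_pred, z_fused], dim=-1))
        return z_final
```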
2. Multimodal Fusion Mechanisms
Cross-Attention Strategies
Central to both architectures is the use of cross-attention for modality fusion:
- TI-JEPA: Each block in the context and target fusion modules incorporates cross-attention, where image patch tokens attend to text token embeddings. This grounds patch-level representations in language, yielding fused context and target features (Vo et al., 9 Mar 2025).
- JEPA-T: Early cross-attention to encoder memory is supplemented by explicit, post-predictor cross-attention, binding textual cues to predicted visual tokens at high resolution. The core predictor remains vision-only, isolating fusion to precise architectural loci (“late-fusion”) (Wan et al., 1 Oct 2025).
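A minimal pre-norm block of the kind that could realize this per-block cross-attention; the exact block layout, normalization, and head counts in the papers may differ, and stacking such blocks would correspond to the `fuse_ctx`/`fuse_tgt` modules in the earlier sketch:

```python
import torch.nn as nn

class CrossAttnFusionBlock(nn.Module):
    """One fusion block: image tokens self-attend, then cross-attend to text tokens (illustrative)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, img_tokens, text_tokens):
        x = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(x, x, x)[0]                       # intra-image mixing
        x = self.norm2(img_tokens)
        img_tokens = img_tokens + self.cross_attn(x, text_tokens, text_tokens)[0]  # Q: image, K/V: text
        return img_tokens + self.mlp(self.norm3(img_tokens))
```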
Objective-Level Fusion
JEPA-T further injects pooled text embeddings directly into the loss computation (specifically into the flow-matching and alignment losses), ensuring cross-modal alignment not only at the representational level but also at the objective level. The alignment loss penalizes the discrepancy between pooled (global) image and text representations, enforcing congruity between the two modalities (sketched below).
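A sketch of an objective-level alignment term of this kind, assuming mean pooling and a cosine-similarity penalty (the paper's exact form may differ):

```python
import torch.nn.functional as F

def alignment_loss(image_tokens, text_embeds):
    """Objective-level alignment: pull pooled image and text representations together (illustrative)."""
    z_img = image_tokens.mean(dim=1)     # (B, D) global image representation (mean-pooled)
    z_txt = text_embeds.mean(dim=1)      # (B, D) pooled text condition embedding
    return 1.0 - F.cosine_similarity(z_img, z_txt, dim=-1).mean()
```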
3. Learning Objectives
Predictive (Reconstruction) Losses
Both TI-JEPA and JEPA-T are fundamentally predictive, focusing on reconstructing masked or denoised targets:
- TI-JEPA: The loss is the mean squared error between predicted and ground-truth patch embeddings over the masked regions, $\mathcal{L}_{\mathrm{TI\text{-}JEPA}} = \tfrac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \lVert \hat{s}_i - s_i \rVert_2^2$, where $\mathcal{M}$ is the set of masked patches and $\hat{s}_i$, $s_i$ are the predicted and target embeddings of patch $i$.
- JEPA-T: A conditional flow-matching loss over the masked tokens; in the standard linear-path formulation this takes the form $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, v_0}\big[\lVert u_\theta(v_t, t, c) - (v_1 - v_0) \rVert_2^2\big]$, where $v_t$ interpolates between noise $v_0$ and the clean token latents $v_1$, and $c$ is the text condition (see the sketch below).
Text embedding injection ensures strong cross-modal gradients, directly coupling updates to both modalities.
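A sketch of the conditional flow-matching term over masked tokens, assuming the standard linear-path formulation and a hypothetical text-conditioned velocity predictor `flow_head`:

```python
import torch

def flow_matching_loss(flow_head, v1, c, mask):
    """Conditional flow matching on masked visual tokens (illustrative, linear-path formulation).

    v1:   (B, N, D) clean target token latents;  c: pooled text condition embeddings
    mask: (B, N) boolean tensor, True where tokens were masked (loss applied only there)
    """
    B = v1.size(0)
    t = torch.rand(B, 1, 1, device=v1.device)        # random time in [0, 1] per sample
    v0 = torch.randn_like(v1)                        # Gaussian noise endpoint
    vt = (1.0 - t) * v0 + t * v1                     # point on the linear interpolation path
    target_velocity = v1 - v0                        # constant velocity of the linear path
    pred_velocity = flow_head(vt, t, c)              # text-conditioned velocity prediction
    per_token = ((pred_velocity - target_velocity) ** 2).mean(dim=-1)   # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```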
Energy-Based Perspective
TI-JEPA can be interpreted as an energy-based model in which the joint energy is low when predictions and targets are consistent within a paired sample and high otherwise; the energy corresponds to the cumulative prediction error. In margin-based extensions, explicit negatives can be introduced via a hinge objective that requires mismatched text-image pairs to incur an energy at least a margin above that of matched pairs (see the sketch below).
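A minimal sketch of such a margin term, where the energies are per-sample prediction errors and negatives come from mismatched pairs (illustrative, not the paper's exact formulation):

```python
import torch.nn.functional as F

def margin_energy_loss(energy_pos, energy_neg, margin=1.0):
    """Hinge loss on energies: aligned (image, caption) pairs should score lower than
    mismatched pairs by at least `margin` (illustrative form).

    energy_pos: (B,) per-sample prediction error on aligned pairs
    energy_neg: (B,) per-sample prediction error on mismatched pairs (e.g. shuffled captions)
    """
    return F.relu(margin + energy_pos - energy_neg).mean()
```

In practice, `energy_neg` could be obtained by recomputing the per-sample prediction error with captions shuffled within the batch.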
4. Training and Inference Protocols
Both architectures employ self-supervised masked prediction with carefully chosen masking strategies, summarized below (a sketch of the JEPA-T masking step follows the table):
| Model | Masking Strategy | Backpropagation Scope | Downstream Fine-tuning |
|---|---|---|---|
| TI-JEPA | Large image region masked for context; small blocks for targets | Cross-attention fusion modules only | Classification head on [CLS] token for sentiment analysis |
| JEPA-T | 70% of visual tokens masked; text always available | Full model except frozen encoders | Denoising-based generation; supports open-vocabulary generation |
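As a concrete illustration of the JEPA-T masking row above, a generic sketch of random token masking at a 70% ratio (the sampling details here are assumptions):

```python
import torch

def random_mask_tokens(tokens, mask_ratio=0.7):
    """Randomly mask a fraction of visual tokens; returns the kept tokens and a boolean mask."""
    B, N, D = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    scores = torch.rand(B, N, device=tokens.device)                   # random score per token
    keep_idx = scores.argsort(dim=1)[:, :num_keep]                    # indices of unmasked tokens
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                                 # True where a token is masked
    return kept, mask
```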
In JEPA-T, at inference, generation is driven by iterative latent denoising:
```python
# Pseudocode: Decoder, FlowHead, VAE, the visible tokens z_vis, and the text condition c as above;
# v holds the noisy latents being denoised, T is the number of denoising steps.
for t in range(T, 0, -1):
    z_pred, z_final = Decoder(z_vis, c)                   # predictor/decoder pass with late text fusion
    v_hat_0 = FlowHead(z_final, v)                        # flow head estimates the clean latents
    v = v + dt * (v_hat_0 - v) / (sigma(t) / alpha(t))    # denoising update toward the estimate
return VAE.decode(v)                                      # decode the final latents to an image
```
TI-JEPA can be extended to an image-to-text denoising regime (referred to as “JEPA-T” in the TI-JEPA paper’s own notation, not to be confused with the JEPA-T model above) by masking the caption, cross-attending from image features, and applying an analogous predictive loss over the masked text tokens (Vo et al., 9 Mar 2025).
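A brief sketch of this reversed, image-to-text direction; `text_fuse` (text queries attending to image keys/values), `text_predictor`, and the mask-token replacement are hypothetical names and choices:

```python
import torch

def masked_caption_loss(text_fuse, text_predictor, mask_token,
                        caption_embeds, image_feats, masked_idx):
    """Predict masked caption token embeddings from image features (illustrative).

    caption_embeds: (B, T, D) clean caption token embeddings
    image_feats:    (B, N, D) image patch features
    mask_token:     (D,) learnable replacement embedding
    masked_idx:     (B, M) indices of the masked caption positions
    """
    B, M = masked_idx.shape
    batch = torch.arange(B, device=masked_idx.device).unsqueeze(1)     # (B, 1) row index
    corrupted = caption_embeds.clone()
    corrupted[batch, masked_idx] = mask_token                          # blank out masked positions
    fused = text_fuse(corrupted, image_feats)     # queries: text tokens, keys/values: image features
    pred = text_predictor(fused)[batch, masked_idx]                    # (B, M, D) predictions
    target = caption_embeds[batch, masked_idx]                         # (B, M, D) ground truth
    return ((pred - target) ** 2).mean()          # analogous predictive loss over masked text tokens
```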
5. Empirical Results and Impact
JEPA-T
- On ImageNet-1K (256×256), JEPA-T achieves:
  - FID = 1.42, IS = 298.3, Precision = 0.79, Recall = 0.63;
  - better results than the non-fusion (FID 1.75) and non-text-injection (FID 1.48) baselines;
  - comparable performance when trained on half the data;
  - coherent, semantically accurate open-vocabulary generation (Wan et al., 1 Oct 2025).
TI-JEPA
- On multimodal sentiment classification (MVSA datasets):
| Model | MVSA-Single Acc/F1 | MVSA-Multi Acc/F1 |
|---|---|---|
| TI-JEPA-Large | 76.75 / 74.62 | 77.55 / 75.02 |
This outperforms previous SoTA methods such as CLIP-CA-CG (75.25/73.62) and MVAN (70.15/68.75), establishing new benchmarks for fine-grained text-image alignment (Vo et al., 9 Mar 2025).
Ablation Studies (JEPA-T)
- w/o Cross-Attn: FID = 1.75
- w/o Text-Inj: FID = 1.48
- w/o Flow-Match: FID = 1.60
- Full JEPA-T: FID = 1.42
This suggests that both architectural late fusion and direct text-embedding injection into the loss confer non-trivial gains in cross-modal conditioning strength while preserving a task-general backbone.
6. Analysis of Fusion Strategies
| Fusion Type | Description | Strengths/Limitations |
|---|---|---|
| Non-fusion | No text–vision cross-attn; text ignored at core | Fails to modulate predictions by text |
| Early-fusion | Text prepended to backbone input | Can drown out vision; impairs generality |
| Late-fusion (JEPA-T) | Text introduced post-prediction/cross-attn only | Strong, localized control; modularity |
JEPA-T’s late-fusion design maintains a generic, reusable self-supervised vision pipeline while leveraging localized, high-resolution text guidance at strategic junctures. Objective-level text alignment further sharpens the cross-modal coupling.
7. Extensions, Limitations, and Future Prospects
Extensions outlined in (Vo et al., 9 Mar 2025) include multimodal scaling (audio, video), hybrid predictive-contrastive objectives, and jointly learned encoders to prevent energy landscape collapse. Downstream applications include Visual Question Answering, cross-modal retrieval, and image captioning by reusing the joint embedding as a universal encoder. A plausible implication is that flexible masking, hybrid fusion points, and energy-based predictiveness will be central in future state-of-the-art multimodal fusion models.
No significant controversies are highlighted in these works, but a common challenge is tuning the locus and strength of fusion to avoid overwhelming one modality with the other or losing the transfer benefits of a general-purpose backbone. Both JEPA-T and TI-JEPA represent a shift toward modular, minimally intrusive fusion, balancing backbone extensibility with parameter-efficient strong conditioning.
References:
JEPA-T: "JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation" (Wan et al., 1 Oct 2025) TI-JEPA: "TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems" (Vo et al., 9 Mar 2025)