Multimodal Fusion: TI-JEPA & JEPA-T
- Multimodal fusion is defined as integrating heterogeneous modalities like text and images using joint embedding frameworks with cross-attention and energy-based losses.
- TI-JEPA and JEPA-T employ specialized encoder-decoder architectures, leveraging late fusion and predictive objectives to improve data efficiency and open-vocabulary generation.
- Empirical studies show these models achieve state-of-the-art results in tasks such as image generation and sentiment classification compared to traditional methods.
Multimodal fusion refers to the architectural and algorithmic mechanisms by which information from heterogeneous modalities—typically text and images—is integrated within joint embedding frameworks. Modern approaches such as TI-JEPA (“Text-Image Joint Embedding Predictive Architecture”) and JEPA-T (“Joint-Embedding Predictive Architecture with Text Fusion for Image Generation”) leverage cross-attention mechanisms, prediction-based objectives, and energy-based losses to establish robust and semantically meaningful bridges across modalities. These methods significantly advance both generative modeling (text-to-image, JEPA-T) and predictive alignment (text-image understanding, TI-JEPA), demonstrating strong improvements in data efficiency, open-vocabulary generalization, and downstream multimodal recognition and generation tasks (Wan et al., 1 Oct 2025, Vo et al., 9 Mar 2025).
1. Architectural Design of TI-JEPA and JEPA-T
TI-JEPA
TI-JEPA consists of three main modules: a Vision Transformer image encoder, a transformer-based text encoder, and a shallow predictor network. Information from the text and image streams is fused through a pair of cross-attention transformer stacks, a context fusion module and a target fusion module:
- Image encoding: An image, or a masked view of it, is mapped to patch embeddings by the image encoder.
- Text encoding: The caption is mapped to token embeddings by the text encoder.
- Fusion via cross-attention: The context and target fusion modules integrate textual and visual information through per-block cross-attention (queries: image tokens; keys/values: text tokens).
- Prediction: The predictor receives the fused context representations plus learnable mask tokens corresponding to the masked visual patches, and predicts target embeddings for those patch regions (see the sketch after this list).
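A minimal PyTorch-style sketch of this masked-prediction flow; the module and tensor names (`image_encoder`, `text_encoder`, `fuse_ctx`, `fuse_tgt`, `predictor`, `mask_token`) and the stop-gradient on the target branch are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def ti_jepa_loss(image_encoder, text_encoder, fuse_ctx, fuse_tgt, predictor,
                 mask_token, ctx_image, tgt_image, caption_ids, masked_idx):
    """One TI-JEPA-style masked-prediction step (illustrative names and shapes).

    mask_token: (1, 1, D) learnable mask embedding; masked_idx: (B, M) masked patch indices.
    """
    text = text_encoder(caption_ids)                            # (B, T, D) caption token embeddings
    # Context branch: image patch tokens attend to text tokens inside the fusion stack.
    ctx = fuse_ctx(image_encoder(ctx_image), text)              # (B, N_ctx, D)
    # Target branch: fused features of the unmasked view, treated here as a stop-gradient target
    # (a common JEPA-style choice; the paper may handle gradients differently).
    with torch.no_grad():
        tgt = fuse_tgt(image_encoder(tgt_image), text)          # (B, N, D)
    B, M = masked_idx.shape
    mask_tokens = mask_token.expand(B, M, -1)                   # one learnable token per masked patch
    pred = predictor(torch.cat([ctx, mask_tokens], dim=1))[:, -M:]   # predictions for masked patches
    target = torch.gather(tgt, 1, masked_idx.unsqueeze(-1).expand(-1, -1, tgt.size(-1)))
    return ((pred - target) ** 2).mean()                        # MSE over masked patch embeddings
```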
JEPA-T
JEPA-T employs a two-stage vision-language transformer architecture:
- Tokenization: Images are encoded by a VAE and projected to the model dimension, with 70% of the resulting tokens randomly masked. Text is tokenized with CLIP’s tokenizer and projected into the shared embedding space.
- Encoder: Processes unmasked visual tokens alongside a learnable buffer, yielding “memory” through self-attention layers.
- Predictor/Decoder:
- Runs self-attention and cross-attention (to the encoder memory), followed by an initial masked-token prediction.
- Applies late-stage cross-attention from the predicted visual tokens to the text condition embeddings, performing a “residual fuse” and a concatenation-projection to produce the “final token”.
- Textual fusion: Occurs primarily at two points (late fusion): cross-attention after prediction (queries from the predicted visual tokens, keys/values from the text condition embeddings) and direct injection of text embeddings into the flow-matching loss (a sketch of the late-fusion step follows this list).
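A condensed sketch of the late text-fusion step, under one possible reading of the “residual fuse” plus concatenation-projection described above; the module itself and its layout are assumptions, with `z_pred` denoting the predicted visual tokens and `c` the text condition embeddings (as in the inference pseudocode later in this section):

```python
import torch
import torch.nn as nn

class LateTextFusion(nn.Module):
    """Late-stage text fusion: predicted visual tokens attend to text conditions (illustrative)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)            # concatenation-projection to the "final token"

    def forward(self, z_pred, c):
        # Queries come from the predicted visual tokens; keys/values from the text condition embeddings.
        attended, _ = self.cross_attn(query=z_pred, key=c, value=c)
        z_fused = z_pred + attended                    # "residual fuse"
        z_final = self.proj(torch.cat([z_pred, z_fused], dim=-1))
        return z_final
```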
2. Multimodal Fusion Mechanisms
Cross-Attention Strategies
Central to both architectures is the use of cross-attention for modality fusion:
- TI-JEPA: Each block in the context and target fusion modules incorporates cross-attention, where image patch tokens attend to text token embeddings. This grounds patch-level representations in language, yielding fused context and target features (Vo et al., 9 Mar 2025).
- JEPA-T: Early cross-attention to encoder memory is supplemented by explicit, post-predictor cross-attention, binding textual cues to predicted visual tokens at high resolution. The core predictor remains vision-only, isolating fusion to precise architectural loci (“late-fusion”) (Wan et al., 1 Oct 2025).
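A minimal pre-norm block of the kind that could realize this per-block cross-attention; the exact block layout, normalization, and head counts in the papers may differ, and stacking such blocks would correspond to the `fuse_ctx`/`fuse_tgt` modules in the earlier sketch:

```python
import torch.nn as nn

class CrossAttnFusionBlock(nn.Module):
    """One fusion block: image tokens self-attend, then cross-attend to text tokens (illustrative)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, img_tokens, text_tokens):
        x = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(x, x, x)[0]                       # intra-image mixing
        x = self.norm2(img_tokens)
        img_tokens = img_tokens + self.cross_attn(x, text_tokens, text_tokens)[0]  # Q: image, K/V: text
        return img_tokens + self.mlp(self.norm3(img_tokens))
```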
Objective-Level Fusion
JEPA-T further injects pooled text embeddings directly into the loss computation (specifically into the flow-matching and alignment losses), ensuring cross-modal alignment not only at the representational level but also at the objective level. The alignment loss penalizes the discrepancy between pooled (global) image and text representations, enforcing congruity between the two modalities (sketched below).
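A sketch of an objective-level alignment term of this kind, assuming mean pooling and a cosine-similarity penalty (the paper's exact form may differ):

```python
import torch.nn.functional as F

def alignment_loss(image_tokens, text_embeds):
    """Objective-level alignment: pull pooled image and text representations together (illustrative)."""
    z_img = image_tokens.mean(dim=1)     # (B, D) global image representation (mean-pooled)
    z_txt = text_embeds.mean(dim=1)      # (B, D) pooled text condition embedding
    return 1.0 - F.cosine_similarity(z_img, z_txt, dim=-1).mean()
```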
3. Learning Objectives
Predictive (Reconstruction) Losses
Both TI-JEPA and JEPA-T are fundamentally predictive, focusing on reconstructing masked or denoised targets:
- TI-JEPA: The loss is the mean squared error between predicted and ground-truth patch embeddings over the masked regions, $\mathcal{L}_{\mathrm{TI\text{-}JEPA}} = \tfrac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \lVert \hat{s}_i - s_i \rVert_2^2$, where $\mathcal{M}$ is the set of masked patches and $\hat{s}_i$, $s_i$ are the predicted and target embeddings of patch $i$.
- JEPA-T: A conditional flow-matching loss over the masked tokens; in the standard linear-path formulation this takes the form $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, v_0}\big[\lVert u_\theta(v_t, t, c) - (v_1 - v_0) \rVert_2^2\big]$, where $v_t$ interpolates between noise $v_0$ and the clean token latents $v_1$, and $c$ is the text condition (see the sketch below).
Text embedding injection ensures strong cross-modal gradients, directly coupling updates to both modalities.
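A sketch of the conditional flow-matching term over masked tokens, assuming the standard linear-path formulation and a hypothetical text-conditioned velocity predictor `flow_head`:

```python
import torch

def flow_matching_loss(flow_head, v1, c, mask):
    """Conditional flow matching on masked visual tokens (illustrative, linear-path formulation).

    v1:   (B, N, D) clean target token latents;  c: pooled text condition embeddings
    mask: (B, N) boolean tensor, True where tokens were masked (loss applied only there)
    """
    B = v1.size(0)
    t = torch.rand(B, 1, 1, device=v1.device)        # random time in [0, 1] per sample
    v0 = torch.randn_like(v1)                        # Gaussian noise endpoint
    vt = (1.0 - t) * v0 + t * v1                     # point on the linear interpolation path
    target_velocity = v1 - v0                        # constant velocity of the linear path
    pred_velocity = flow_head(vt, t, c)              # text-conditioned velocity prediction
    per_token = ((pred_velocity - target_velocity) ** 2).mean(dim=-1)   # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```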
Energy-Based Perspective
TI-JEPA can be interpreted as an energy-based model in which the joint energy is low when predictions and targets are consistent within a paired sample and high otherwise; the energy corresponds to the cumulative prediction error. In margin-based extensions, explicit negatives can be introduced via a hinge objective that requires mismatched text-image pairs to incur an energy at least a margin above that of matched pairs (see the sketch below).
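A minimal sketch of such a margin term, where the energies are per-sample prediction errors and negatives come from mismatched pairs (illustrative, not the paper's exact formulation):

```python
import torch.nn.functional as F

def margin_energy_loss(energy_pos, energy_neg, margin=1.0):
    """Hinge loss on energies: aligned (image, caption) pairs should score lower than
    mismatched pairs by at least `margin` (illustrative form).

    energy_pos: (B,) per-sample prediction error on aligned pairs
    energy_neg: (B,) per-sample prediction error on mismatched pairs (e.g. shuffled captions)
    """
    return F.relu(margin + energy_pos - energy_neg).mean()
```

In practice, `energy_neg` could be obtained by recomputing the per-sample prediction error with captions shuffled within the batch.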
4. Training and Inference Protocols
Both architectures employ self-supervised masked prediction with carefully chosen masking strategies, summarized below (a sketch of the JEPA-T masking step follows the table):
| Model | Masking Strategy | Backpropagation Scope | Downstream Fine-tuning |
|---|---|---|---|
| TI-JEPA | Large image region masked for context; small blocks for targets | Cross-attention fusion modules only | Classification head on [CLS] token for sentiment analysis |
| JEPA-T | 70% of visual tokens masked; text always available | Full model except frozen encoders | Denoising-based generation; supports open-vocabulary generation |
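As a concrete illustration of the JEPA-T masking row above, a generic sketch of random token masking at a 70% ratio (the sampling details here are assumptions):

```python
import torch

def random_mask_tokens(tokens, mask_ratio=0.7):
    """Randomly mask a fraction of visual tokens; returns the kept tokens and a boolean mask."""
    B, N, D = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    scores = torch.rand(B, N, device=tokens.device)                   # random score per token
    keep_idx = scores.argsort(dim=1)[:, :num_keep]                    # indices of unmasked tokens
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                                 # True where a token is masked
    return kept, mask
```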
In JEPA-T, at inference, generation is driven by iterative latent denoising:
```python
# Pseudocode: Decoder, FlowHead, VAE, the visible tokens z_vis, and the text condition c as above;
# v holds the noisy latents being denoised, T is the number of denoising steps.
for t in range(T, 0, -1):
    z_pred, z_final = Decoder(z_vis, c)                   # predictor/decoder pass with late text fusion
    v_hat_0 = FlowHead(z_final, v)                        # flow head estimates the clean latents
    v = v + dt * (v_hat_0 - v) / (sigma(t) / alpha(t))    # denoising update toward the estimate
return VAE.decode(v)                                      # decode the final latents to an image
```
TI-JEPA can be extended to an image-to-text denoising regime (referred to as “JEPA-T” in the TI-JEPA paper’s own notation, not to be confused with the JEPA-T model above) by masking the caption, cross-attending from image features, and applying an analogous predictive loss over the masked text tokens (Vo et al., 9 Mar 2025).
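A brief sketch of this reversed, image-to-text direction; `text_fuse` (text queries attending to image keys/values), `text_predictor`, and the mask-token replacement are hypothetical names and choices:

```python
import torch

def masked_caption_loss(text_fuse, text_predictor, mask_token,
                        caption_embeds, image_feats, masked_idx):
    """Predict masked caption token embeddings from image features (illustrative).

    caption_embeds: (B, T, D) clean caption token embeddings
    image_feats:    (B, N, D) image patch features
    mask_token:     (D,) learnable replacement embedding
    masked_idx:     (B, M) indices of the masked caption positions
    """
    B, M = masked_idx.shape
    batch = torch.arange(B, device=masked_idx.device).unsqueeze(1)     # (B, 1) row index
    corrupted = caption_embeds.clone()
    corrupted[batch, masked_idx] = mask_token                          # blank out masked positions
    fused = text_fuse(corrupted, image_feats)     # queries: text tokens, keys/values: image features
    pred = text_predictor(fused)[batch, masked_idx]                    # (B, M, D) predictions
    target = caption_embeds[batch, masked_idx]                         # (B, M, D) ground truth
    return ((pred - target) ** 2).mean()          # analogous predictive loss over masked text tokens
```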
5. Empirical Results and Impact
JEPA-T
- On ImageNet-1K (256×256), JEPA-T achieves:
  - FID = 1.42, IS = 298.3, Precision = 0.79, Recall = 0.63;
  - better results than the non-fusion (FID 1.75) and non-text-injection (FID 1.48) baselines;
  - comparable performance when trained on half the data;
  - coherent, semantically accurate open-vocabulary generation (Wan et al., 1 Oct 2025).
TI-JEPA
- On multimodal sentiment classification (MVSA datasets):
| Model | MVSA-Single Acc/F1 | MVSA-Multi Acc/F1 |
|---|---|---|
| TI-JEPA-Large | 76.75 / 74.62 | 77.55 / 75.02 |
This outperforms previous SoTA methods such as CLIP-CA-CG (75.25/73.62) and MVAN (70.15/68.75), establishing new benchmarks for fine-grained text-image alignment (Vo et al., 9 Mar 2025).
Ablation Studies (JEPA-T)
- w/o Cross-Attn: FID = 1.75
- w/o Text-Inj: FID = 1.48
- w/o Flow-Match: FID = 1.60
- Full JEPA-T: FID = 1.42
This suggests that both architectural late fusion and direct text-embedding injection into the loss confer non-trivial gains in cross-modal conditioning strength while preserving a task-general backbone.
6. Analysis of Fusion Strategies
| Fusion Type | Description | Strengths/Limitations |
|---|---|---|
| Non-fusion | No text–vision cross-attn; text ignored at core | Fails to modulate predictions by text |
| Early-fusion | Text prepended to backbone input | Can drown out vision; impairs generality |
| Late-fusion (JEPA-T) | Text introduced post-prediction/cross-attn only | Strong, localized control; modularity |
JEPA-T’s late-fusion design maintains a generic, reusable self-supervised vision pipeline while leveraging localized, high-resolution text guidance at strategic junctures. Objective-level text alignment further sharpens the cross-modal coupling.
7. Extensions, Limitations, and Future Prospects
Extensions outlined in (Vo et al., 9 Mar 2025) include multimodal scaling (audio, video), hybrid predictive-contrastive objectives, and jointly learned encoders to prevent energy landscape collapse. Downstream applications include Visual Question Answering, cross-modal retrieval, and image captioning by reusing the joint embedding as a universal encoder. A plausible implication is that flexible masking, hybrid fusion points, and energy-based predictiveness will be central in future state-of-the-art multimodal fusion models.
No significant controversies are highlighted in these works, but a common challenge is tuning the locus and strength of fusion to avoid overwhelming one modality with the other or losing the transfer benefits of a general-purpose backbone. Both JEPA-T and TI-JEPA represent a shift toward modular, minimally intrusive fusion, balancing backbone extensibility with parameter-efficient strong conditioning.
References:
JEPA-T: "JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation" (Wan et al., 1 Oct 2025) TI-JEPA: "TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems" (Vo et al., 9 Mar 2025)