
TI-JEPA: Joint Embedding for Text-Image Alignment

Updated 2 July 2026
  • TI-JEPA is a self-supervised framework that bridges the semantic gap between text and images by learning a shared embedding space using energy-based modeling.
  • The architecture employs frozen encoders, trainable cross-attention aligners, and a predictor to reconstruct masked patches for fine-grained multimodal alignment.
  • Empirical results show that TI-JEPA outperforms contrastive methods on multimodal sentiment analysis, and related JEPA-based methods carry the same predictive principle over to time-series and tabular data.

The Text-Image Joint Embedding Predictive Architecture (TI-JEPA) is a self-supervised framework for multimodal representation learning, emphasizing effective fusion and alignment of text and visual data by leveraging the Joint-Embedding Predictive Architecture paradigm. Designed to address the semantic gap between discrete textual and continuous visual modalities, TI-JEPA utilizes energy-based modeling (EBM) to induce a shared embedding space, in which compatibility of multimodal pairs is realized through masked-patch prediction, cross-attention alignment, and energy minimization. Derivative JEPA-based approaches have also achieved leading results for time-series and tabular domains, typically modifying the masking and self-supervised predictive losses to suit respective data structures.

1. Motivation: Bridging the Multimodal Semantic Gap

TI-JEPA directly addresses the challenge of aligning heterogeneous modalities, primarily text and images, by modeling their complex, nonlinear correspondences. In contrast to naive concatenation or early fusion—which often fail to capture cross-domain structure due to fundamental representational discrepancies—TI-JEPA learns an embedding within which text–image compatibility is encoded as a low-energy state. The semantic gap refers to the fact that equivalent concepts manifest as highly disparate patterns in raw modalities (e.g., “dog playing” as pixels or as a sequence of tokens), rendering direct association or global contrastive learning insufficiently expressive for dense reasoning and fine-grained compositionality (Vo et al., 9 Mar 2025).

2. Architecture and Core Components

TI-JEPA is architected from four principal modules:

  • Frozen Encoders: A pretrained ViT-H vision transformer produces patch-level image embeddings $\{\mathbf{s}_{I_k}\}_{k=1}^N$, while a Transformer-based text encoder (gte-base-en-v1.5) generates token embeddings $\{\mathbf{s}_{T_\ell}\}_{\ell=1}^L$. Both are fixed during downstream training for stability and efficiency.
  • Trainable Cross-Attention Aligners: Two cross-attention blocks (context and target) with variable capacity perform fine-grained alignment of text and image features. Each block employs multi-head self-attention, cross-modal attention, and residual MLPs.
  • Predictor: A shallow Vision Transformer $g_\phi$ receives cross-attended context outputs, synthesizes masked-patch embeddings augmented with learned mask tokens, and predicts the target patch embeddings.
  • Joint Embedding Space: All representations are aligned via cross-attention; compatibility is measured by an energy function over the resulting paired vectors.

A typical data flow is as follows: the image is encoded, patches are masked, and the text is embedded; the embedding pairs are then aligned using cross-attention, and the predictor reconstructs masked regions using these integrated features (Vo et al., 9 Mar 2025).
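The alignment step in this data flow can be illustrated with a minimal sketch, under simplifying assumptions: a single attention head, no learned projection matrices, and toy 2-dimensional embeddings. The function name `cross_attend`, the choice of image patches as queries over text tokens, and all values are hypothetical and not taken from the TI-JEPA implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(patches, tokens):
    """Single-head cross-attention with a residual connection.

    Each image-patch embedding (query) attends over the text-token
    embeddings (keys and values). A trainable aligner would add learned
    W_q, W_k, W_v projections, several heads, and residual MLPs.
    """
    d = len(tokens[0])
    aligned = []
    for q in patches:
        # Scaled dot-product score between this patch and every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Text summary for this patch: attention-weighted sum of token vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        # Residual connection keeps the original patch information.
        aligned.append([qi + ai for qi, ai in zip(q, attended)])
    return aligned


# Toy inputs (hypothetical): 3 patch embeddings, 2 token embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_attend(patches, tokens)
```

In this toy setup a patch that is equally similar to both tokens receives their average, while a patch that matches one token more closely is pulled toward that token.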

3. Energy-Based Modeling and Learning Objective

Central to TI-JEPA is its use of an energy-based model. The architecture learns a function $E_\theta(\mathbf{t},\mathbf{i})$ such that well-matched text–image pairs have low energy. The joint distribution is defined as

$$p_\theta(\mathbf{t},\mathbf{i}) = \frac{\exp(-E_\theta(\mathbf{t},\mathbf{i}))}{Z(\theta)}$$

where $Z(\theta)$ is the partition function integrating over both modalities. In practice, explicit negative sampling is bypassed; instead, masking in the image domain generates “challenging” examples by blocking out random image regions.
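To make the energy-to-probability relationship concrete, the sketch below evaluates the same Boltzmann form over a small finite set of candidate pairs, where the partition function reduces to a sum. The energies are invented for illustration, and the helper name `boltzmann` is hypothetical; TI-JEPA itself does not compute this normalization explicitly.

```python
import math


def boltzmann(energies):
    """Return p_k = exp(-E_k) / Z with Z = sum_k exp(-E_k).

    Energies are shifted by their minimum before exponentiating; the
    shift cancels in the ratio and only improves numerical stability.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical energies for three candidate text-image pairs.
energies = [0.2, 1.5, 3.0]
probs = boltzmann(energies)
```

Lower energy maps to higher probability, and the ratio of any two probabilities depends only on the energy difference between the two pairs.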

The principal loss is a patch-level predictive (reconstruction) objective:

$$\mathcal{L}_{\rm pred} = \frac{1}{M}\sum_{i=1}^M \sum_{j\in B_i} \|\hat{\mathbf{s}}_{y_j} - \mathbf{s}_{y_j}\|_2^2$$

where $\mathbf{s}_{y_j}$ is the actual embedding for a masked patch and $\hat{\mathbf{s}}_{y_j}$ is the prediction from the context and text encoding. The final training objective adds regularization, including weight decay and moving average parameter stabilization (Vo et al., 9 Mar 2025).
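A small worked instance of the predictive loss, assuming that M counts target blocks and that B_i indexes the masked patches in block i (the usual I-JEPA reading; the text above does not define these symbols). The function name `pred_loss` and the toy embeddings are hypothetical.

```python
def pred_loss(pred_blocks, target_blocks):
    """Sum of squared L2 errors over all masked patches, divided by the
    number of target blocks M."""
    m = len(target_blocks)
    total = 0.0
    for pred, tgt in zip(pred_blocks, target_blocks):
        for s_hat, s in zip(pred, tgt):
            # Squared Euclidean distance between predicted and true embedding.
            total += sum((a - b) ** 2 for a, b in zip(s_hat, s))
    return total / m


# M = 2 target blocks with 2-dimensional embeddings (toy values).
targets = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
preds = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0]]]
loss = pred_loss(preds, targets)
```

Here the first block contributes a squared error of 1 (one patch is exact, the other is off by one unit), the second block contributes 1, and dividing the total of 2 by M = 2 gives a loss of 1.0.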

4. Training Details, Hyperparameters, and Practical Considerations

TI-JEPA is pretrained on MS COCO 2017 (∼118K image–caption pairs). The two encoders are frozen throughout; only the cross-attention blocks and the predictor are trainable. The optimizer is typically AdamW, with a base learning rate of $1 \times 10^{-3}$ and an EMA decay schedule.
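The text does not spell out the EMA update, so the following is only a sketch of the conventional form used in JEPA-style training: a set of target parameters tracks the trained (online) parameters through an exponential moving average whose momentum rises over training. The function names, the linear schedule, and the start and end momentum values are illustrative assumptions, not settings reported for TI-JEPA.

```python
def ema_update(target, online, momentum):
    """target <- momentum * target + (1 - momentum) * online, elementwise."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


def linear_momentum(step, total_steps, start=0.996, end=1.0):
    """Linearly interpolate the momentum from `start` to `end`.

    The default values are placeholders chosen for illustration.
    """
    frac = step / max(1, total_steps - 1)
    return start + (end - start) * frac


# Toy parameters (hypothetical): the target copy drifts slowly toward the online copy.
target_params = [1.0, 1.0]
online_params = [0.0, 2.0]
total_steps = 10
for step in range(total_steps):
    m = linear_momentum(step, total_steps)
    target_params = ema_update(target_params, online_params, m)
```

A momentum close to 1 means the target parameters change slowly, which is the stabilizing effect the moving average is meant to provide.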

Masking procedures randomly sample context and target patches from fixed scale ranges in every training epoch (context: [0.85, 1.0], target: [0.15, 0.2]). The capacity of the cross-attention modules is varied across small, medium, and large variants, and larger modules yield better results. Training uses a batch size of 1024 and roughly 300 total epochs. All experimental configurations keep the encoders frozen to avoid mode collapse in the embedding space (Vo et al., 9 Mar 2025).
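The masking step can be sketched as follows, using the scale ranges quoted above. Several details are assumptions made for illustration only: a 14×14 patch grid, four target blocks, roughly square blocks, and removing target patches from the context (as in I-JEPA). The function names are hypothetical.

```python
import math
import random


def sample_block(grid, scale_range, rng):
    """Sample a roughly square block of patch indices whose area is a
    random fraction (drawn from scale_range) of the grid."""
    scale = rng.uniform(*scale_range)
    side = max(1, min(grid, round(math.sqrt(scale * grid * grid))))
    top = rng.randint(0, grid - side)
    left = rng.randint(0, grid - side)
    return {r * grid + c for r in range(top, top + side) for c in range(left, left + side)}


def sample_masks(grid=14, n_targets=4, seed=0):
    """Return (context, targets) as sets of flat patch indices."""
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), rng) for _ in range(n_targets)]
    context = sample_block(grid, (0.85, 1.0), rng)
    # Assumption: the context must not reveal any patch the predictor
    # is asked to reconstruct.
    for t in targets:
        context -= t
    return context, targets


context, targets = sample_masks()
```

Because the context block covers at least 85% of the grid and each target covers at most about 20%, some context patches always remain after the targets are removed in this configuration.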

5. Empirical Results and Comparative Analysis

TI-JEPA achieves state-of-the-art results on multimodal sentiment analysis, outperforming previous models including CLIP-CA-CG, SentiBank, MVAN, and others. On MVSA-Single, TI-JEPA-Large reaches 76.75% accuracy and 74.62% F1; on MVSA-Multi, the corresponding figures are 77.55% accuracy and 75.02% F1. Performance rises with cross-attention capacity from the Small to the Large variant, and only the Large variant exceeds CLIP-CA-CG on all four metrics.

Performance against other vision-language baselines shows that the predictive (JEPA-style) loss in combination with an EBM yields superior fine-grained multimodal alignment compared with contrastive-only (CLIP), global/local contrastive (SPARC), or naive cross-attention methods. This is attributed to the EBM’s global compatibility landscape and patch masking’s ability to enforce local compositional reasoning (Vo et al., 9 Mar 2025).

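| Model | MVSA-Single Acc (%) | MVSA-Single F1 (%) | MVSA-Multi Acc (%) | MVSA-Multi F1 (%) |
|---|---|---|---|---|
| CLIP-CA-CG | 75.25 | 73.62 | 76.05 | 74.02 |
| TI-JEPA-Small | 73.03 | 71.69 | 73.59 | 72.10 |
| TI-JEPA-Medium | 75.26 | 72.15 | 75.13 | 73.57 |
| TI-JEPA-Large | 76.75 | 74.62 | 77.55 | 75.02 |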
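For readers who want the margins rather than the absolute scores, the short calculation below derives the gains of TI-JEPA-Large over CLIP-CA-CG from the table values. Only the numbers come from the table; the variable names are arbitrary.

```python
metrics = ("MVSA-Single Acc", "MVSA-Single F1", "MVSA-Multi Acc", "MVSA-Multi F1")

# Scores copied from the table above, in the order given by `metrics`.
baseline = (75.25, 73.62, 76.05, 74.02)  # CLIP-CA-CG
large = (76.75, 74.62, 77.55, 75.02)     # TI-JEPA-Large

# Gain in percentage points, rounded to remove floating-point noise.
gains = {name: round(l - b, 2) for name, l, b in zip(metrics, large, baseline)}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The margin is 1.50 points on accuracy and 1.00 point on F1 for both datasets.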

6. Generalizations and Variants of JEPA Across Modalities

The underlying JEPA principle extends beyond image–text pairs. Key examples:

  • Astronomical Time Series: In "Domain-Informed Multi-View Self-Distillation for Astronomical Light-Curve Representation Learning with JEPA," the JEPA principle is adapted to irregular time series by leveraging three domain-informed "views" (raw, periodogram, phase-folded), a self-distillation (LeJEPA) loss, and specialized tokenization. On the StarEmbed benchmark, the model outperforms hand-crafted representations on 15 of 16 metrics, demonstrating the utility of JEPA-aligned self-supervision across domains (Rui, 26 Jun 2026).
  • Tabular Data: In "T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data," JEPA is instantiated for mask prediction on latent embeddings of random feature subsets. The method is augmentation-free—each view is simply a random subset. Regularizer tokens are employed to prevent degeneration. On multiple tabular benchmarks, T-JEPA representations enable downstream models to outperform or match XGBoost, with demonstrable alignment between unsupervised and supervised feature importances (Thimonier et al., 2024).
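The augmentation-free view construction described for tabular data can be illustrated with a minimal sketch: a view is a random subset of feature indices, with no perturbation of the values. Keeping the context and target subsets disjoint, the subset sizes, and the function name `feature_subset_views` are assumptions made for this example, not details confirmed by the T-JEPA description above.

```python
import random


def feature_subset_views(n_features, n_context, n_target, rng):
    """Split a shuffled list of feature indices into a context subset and
    a disjoint target subset (the disjointness is an assumption here)."""
    if n_context + n_target > n_features:
        raise ValueError("subsets do not fit in the feature set")
    idx = list(range(n_features))
    rng.shuffle(idx)
    context = sorted(idx[:n_context])
    target = sorted(idx[n_context:n_context + n_target])
    return context, target


# Toy row with 10 features (hypothetical values).
row = [0.3, 1.2, 5.0, 0.0, 2.2, 7.1, 0.9, 4.4, 3.3, 6.0]
rng = random.Random(0)
ctx_idx, tgt_idx = feature_subset_views(len(row), n_context=5, n_target=3, rng=rng)

# The context view is what an encoder would see; in a JEPA-style setup the
# embedding of the target view is predicted in latent space.
context_view = [row[i] for i in ctx_idx]
target_view = [row[i] for i in tgt_idx]
```

Since the feature values are left untouched, no domain-specific augmentation has to be designed, which is the property the method's name refers to.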

7. Limitations and Prospects

Current limitations of TI-JEPA include focus on text and image modalities—audio, video, and structured data are not yet addressed within the same framework. Explicit negative-sampling is not performed (the partition function is approximated via masking), and generalization to higher-order cross-modal tasks (VQA, multi-hop reasoning) is yet to be fully validated.

Prospective research includes extension to visual question answering (by adapting predictors to generate answer embeddings), energy normalization improvements (multi-hop sampling, learned negative proposals), and integration with additional modalities (e.g., audio, time series) to investigate scalability and universality of the joint-embedding EBM principle (Vo et al., 9 Mar 2025).


TI-JEPA and its derivatives constitute a unified modeling paradigm that leverages masked feature prediction and energy-based joint embedding for robust multimodal alignment. By eschewing global contrastive losses in favor of predictive objectives and explicit masking strategies, these models bridge semantic gaps in diverse domains—from vision-language pairing and astronomical time series to tabular data—while providing strong empirical performance and extensible architectural blueprints for future multimodal research.
