TC-JEPA: Text-Conditional Joint Embedding Predictive Architecture

Updated 9 May 2026

TC-JEPA is a self-supervised, text-conditioned framework that refines predictive feature learning for multimodal tasks.
It employs fine-grained cross-attention mechanisms for both visual patches and text spans, reducing prediction uncertainty and enhancing semantic grounding.
The architecture scales robustly across diverse datasets, driving performance improvements in classification, segmentation, and text-to-image generation.

Text-Conditional Joint Embedding Predictive Architecture (TC-JEPA) is a self-supervised, text-conditioned feature-prediction framework that integrates language information into predictive embedding objectives, producing semantically enriched representations for both vision and language modalities. It generalizes the Joint Embedding Predictive Architecture (JEPA) by conditioning the prediction of masked features, tokens, or spans not only on context from the same modality but also on associated text (for images) or on context spans (for text). This conditioning is reported to reduce prediction uncertainty and improve downstream performance across classification, dense prediction, generation, and multimodal reasoning tasks (Huang et al., 5 May 2026, Wan et al., 1 Oct 2025, Huang et al., 11 Sep 2025).

1. Architectural Foundations and Variants

TC-JEPA extends the I-JEPA and JEPA-T paradigms by introducing explicit text conditioning into the predictive architecture across multiple modalities:

Vision (Image–Text): TC-JEPA builds on masked-patch predictive learning. An input image is split into non-overlapping patches, which are partitioned into context ($X$) and target ($Y$) subsets by random multi-block masking. The image encoder $f_\theta$ (typically a ViT-B/16, L/16, or H/14) maps context patches $\{x_i\}$ to latent feature vectors. The predictor $g_\varphi$, a lightweight transformer, receives these context features plus learnable mask tokens and is explicitly conditioned at every layer on the caption embedding through patch-wise residual cross-attention (Huang et al., 5 May 2026).
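As a rough illustration of the context/target partition described above, the following Python sketch samples a few rectangular target blocks on a patch grid and treats every remaining patch as context. The 14×14 grid (a 224-pixel image with 16-pixel patches), the block size, the number of blocks, and the function name `multiblock_mask` are illustrative assumptions, not values taken from the paper.

```python
import random


def multiblock_mask(grid=14, num_targets=4, block=(3, 3), seed=0):
    """Split a grid x grid patch layout into context and target patch indices.

    Assumption: targets are the union of randomly placed rectangular blocks,
    and the context is simply every patch that is not a target.
    """
    rng = random.Random(seed)
    bh, bw = block
    target = set()
    for _ in range(num_targets):
        # Top-left corner of one rectangular target block.
        r = rng.randrange(grid - bh + 1)
        c = rng.randrange(grid - bw + 1)
        target.update((r + i) * grid + (c + j)
                      for i in range(bh) for j in range(bw))
    context = [p for p in range(grid * grid) if p not in target]
    return context, sorted(target)


context_idx, target_idx = multiblock_mask()
```

Only the context indices would be passed to the encoder; the target indices mark the positions the predictor has to fill in.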

Text (Language Modeling): In LLMs, TC-JEPA is realized by splitting token sequences into context ($x_{ctx}$) and masked target spans ($x_{tgt}$), processed through a pretrained Transformer. The context branch optionally appends learnable predictor tokens and outputs a prediction $z_{ctx}$ of the target span’s embedding. A JEPA loss, computed between $z_{ctx}$ and the (stop-gradient) target, regularizes the autoregressive cross-entropy objective (Huang et al., 11 Sep 2025).

Text-to-Image (T2I) Generation: JEPA-T, an instantiation of TC-JEPA, tokenizes both images (via a VAE) and captions (via CLIP/T5), projecting both into a joint space processed by a Transformer backbone. Text fusion is performed via cross-attention after the prediction head and by direct injection of text features at the loss computation stage. Iterative denoising generates images conditioned on text (Wan et al., 1 Oct 2025).

2. Text-Conditioning Mechanisms

The core innovation in TC-JEPA is fine-grained fusion of text into the predictive process:

Residual Patch-wise Cross-Attention (Vision):

At each layer $l$ of the predictor,

Text tokens $t_j$ (from a frozen T5/CLIP encoder) become keys ($k_j^{(l)}$) and values ($v_j^{(l)}$) in an attention calculation for each patch token $h_i^{(l)}$.
The output is a sparse, patch-specific combination of text features, yielding a residual update:

$$h_i^{(l)} \leftarrow h_i^{(l)} + \sum_j \alpha_{ij}^{(l)} \, v_j^{(l)}$$

where $\alpha_{ij}^{(l)}$ are attention weights computed using dot-product similarity (Huang et al., 5 May 2026).
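The residual update described above can be sketched in a few lines of dependency-free Python. This is a simplified illustration, not the authors' implementation: the learned query, key, and value projections are omitted, the $1/\sqrt{d}$ scaling is the standard scaled dot-product convention rather than something stated in the source, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def residual_cross_attention(patches, keys, values):
    """Add a text-derived residual to every patch token.

    patches: N x d patch tokens; keys, values: T x d text features.
    Assumption: no learned projections, single attention head.
    """
    d = len(keys[0])
    updated = []
    for h in patches:
        # Dot-product similarity between this patch and every text token.
        scores = [sum(a * b for a, b in zip(h, k)) / math.sqrt(d) for k in keys]
        alpha = softmax(scores)
        # Patch-specific weighted combination of the text values.
        attended = [sum(a * v[c] for a, v in zip(alpha, values)) for c in range(d)]
        # Residual connection: the patch keeps its own content.
        updated.append([x + y for x, y in zip(h, attended)])
    return updated
```

Because the attention weights are computed per patch, two patches in the same image can draw on different words of the same caption.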

Late-Fusion Cross-Attention (Text-to-Image):

Text features are fused into the visual token predictions after a coarse prediction stage via a dedicated cross-attention block, followed by residual addition and concatenation with raw text embeddings (Wan et al., 1 Oct 2025).

Span Conditioning (Text):

For LLMs, the context is defined as a prefix span, and the target as a randomly selected, masked contiguous span.
Conditioning tokens ([JEPA_CTX], [JEPA_TGT]) can be prefixed to explicitly differentiate context and target inputs (Huang et al., 11 Sep 2025).
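A minimal Python sketch of this span split follows. It assumes the target span starts at a random position and that the context is everything before it; the span length, the seed handling, and the function name `split_context_target` are illustrative choices rather than details from the paper.

```python
import random


def split_context_target(tokens, span_len=3, seed=0):
    """Return (context, target) token lists with conditioning tokens prefixed.

    Assumption: the context is the prefix preceding a randomly placed
    contiguous target span of fixed length.
    """
    rng = random.Random(seed)
    # Start at 1 or later so the prefix context is never empty.
    start = rng.randrange(1, len(tokens) - span_len + 1)
    context = ["[JEPA_CTX]"] + tokens[:start]
    target = ["[JEPA_TGT]"] + tokens[start:start + span_len]
    return context, target


ctx, tgt = split_context_target("select all rows where the price is high".split())
```

The two lists would be encoded separately, with the target branch treated as a stop-gradient reference for the prediction made from the context branch.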

3. Training Objectives and Loss Functions

TC-JEPA employs a feature-prediction objective complemented by regularizers that reinforce semantic grounding and information bottleneck effects:

Core Feature-Prediction Loss:

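Mean squared error between the predicted features $\hat{s}_i$ and the teacher (EMA) target features $\bar{s}_i$, averaged over the masked target patches $\mathcal{M}$:

$$\mathcal{L}_{\text{pred}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{s}_i - \bar{s}_i \right\|_2^2$$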
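The two ingredients named here, a mean squared error and an exponential-moving-average (EMA) teacher, can be sketched as follows. The momentum value 0.996 is a commonly used default and is an assumption, as are the function names; parameters are represented as flat lists of floats for simplicity.

```python
def mse_loss(pred, target):
    """Mean squared error over all entries of two equally shaped 2-D lists."""
    count = sum(len(row) for row in pred)
    total = sum((p - t) ** 2
                for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    return total / count


def ema_update(teacher, student, momentum=0.996):
    """Move teacher parameters a small step toward the student parameters.

    Assumption: momentum=0.996 is illustrative, not taken from the source.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]
```

Gradients flow only through the predicted features; the teacher is updated by `ema_update` after each optimizer step rather than by backpropagation.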

Text-Grounding Regularizers (Vision):

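Sparsity: a penalty $\mathcal{L}_{\text{sparse}}$ that encourages sparse patch-to-caption alignment.

Consistency (cross-layer): a penalty $\mathcal{L}_{\text{cons}}$ that encourages this alignment to agree across predictor layers.

Both are defined on $s_{ij}^{(l)}$, the rectified cosine similarities between patch tokens and caption features (Huang et al., 5 May 2026).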
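The similarity quantity these regularizers operate on can be sketched as below; the penalties themselves are not reproduced. Reading "rectified" as clamping negative cosine values to zero is the usual interpretation and is assumed here, and the function names are hypothetical.

```python
import math


def rectified_cosine(u, v, eps=1e-8):
    """Cosine similarity clamped at zero (negative alignment is discarded)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / (norm_u * norm_v + eps))


def similarity_map(patches, caption_features):
    """N x T matrix of rectified similarities between patches and caption features."""
    return [[rectified_cosine(p, t) for t in caption_features] for p in patches]
```

A row of this matrix with only a few non-zero entries corresponds to a patch that is grounded in a few words of the caption.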

Text-to-Image Flow-Matching and JEPA Loss:

The total loss integrates conditional flow-matching and masked JEPA reconstruction:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{FM}} + \mathcal{L}_{\text{JEPA}}$$

where $\mathcal{L}_{\text{FM}}$ is conditioned on the text embedding, $\mathcal{L}_{\text{JEPA}}$ is a masked-prediction loss, and text features are injected at the objective level for alignment (Wan et al., 1 Oct 2025).
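For readers unfamiliar with conditional flow matching, a single-sample version of the loss can be sketched as follows. The linear interpolation path between noise and data is one common choice and is an assumption here, since the source does not specify the path; `velocity_fn` is a hypothetical stand-in for the text-conditioned model.

```python
import random


def cfm_loss(x0, x1, text_emb, velocity_fn, rng):
    """Flow-matching loss for one (noise, data) pair.

    x0: noise sample, x1: data latent, both flat lists of floats.
    Assumption: straight-line path x_t = (1 - t) * x0 + t * x1,
    whose target velocity is x1 - x0.
    """
    t = rng.random()
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    predicted = velocity_fn(x_t, t, text_emb)
    return sum((p - v) ** 2 for p, v in zip(predicted, target_velocity)) / len(x0)
```

In training, this term would be added to the masked JEPA prediction loss computed on the same batch.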

Combined Language Loss (LLMs):

The training objective is the weighted sum:

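$$\mathcal{L} = \mathcal{L}_{\text{LLM}} + \lambda \, \mathcal{L}_{\text{JEPA}}$$

where $\mathcal{L}_{\text{LLM}}$ is the cross-entropy loss for next-token prediction, $\lambda$ is the JEPA weight, and $\mathcal{L}_{\text{JEPA}}$ is a distance (e.g., $\ell_2$, cosine, or NT-Xent contrastive) between predicted and ground-truth embeddings (Huang et al., 11 Sep 2025).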
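A small Python sketch of this weighted sum for a single token and a single span embedding is given below. It uses the cosine variant of the JEPA distance; the function names and the default weight `lam=1.0` are illustrative assumptions.

```python
import math


def cross_entropy(logits, target_id):
    """Next-token cross-entropy for one position, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def cosine_distance(u, v, eps=1e-8):
    """One minus cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def combined_loss(logits, target_id, z_pred, z_tgt, lam=1.0):
    """Cross-entropy plus lam times the JEPA embedding distance.

    z_tgt is treated as a constant (stop-gradient) reference.
    """
    return cross_entropy(logits, target_id) + lam * cosine_distance(z_pred, z_tgt)
```

Setting `lam=0.0` recovers ordinary next-token training, which makes the JEPA term easy to ablate.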

4. Training Protocols and Scalability

Vision: Models are trained on large-scale datasets (ImageNet-1k/21k, YFCC15M, CC12M) with multiple synthetic captions per image. Optimization uses AdamW, large batch sizes (on the order of 2048), and long schedules (600–1200 epochs) (Huang et al., 5 May 2026).

Text: LLMs are pretrained or fine-tuned for four epochs on datasets such as NL-RX-SYNTH, GSM8K, and Spider. Key hyperparameters include the batch size (32–64), the learning-rate range, and the JEPA weight $\lambda$ in $[0.5, 2.0]$.

Text-to-Image: JEPA-T employs a 12-layer ViT-Base, CLIP text backbone, and conditional flow matching, using 64 steps of latent denoising and explicit late-fusion cross-attention (Wan et al., 1 Oct 2025).
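The 64-step denoising loop can be pictured as numerically integrating a text-conditioned velocity field from noise toward an image latent. The sketch below uses a plain Euler integrator with uniform steps, which is an assumption; the source states only the step count. `velocity_fn` is a hypothetical stand-in for the trained model.

```python
def euler_denoise(x, text_emb, velocity_fn, steps=64):
    """Integrate a velocity field from t=0 (noise) to t=1 (sample).

    x: initial noise latent as a flat list of floats.
    Assumption: uniform Euler steps of size 1 / steps.
    """
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # The model sees the current latent, the time, and the caption embedding.
        v = velocity_fn(x, t, text_emb)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

The resulting latent would then be decoded to pixels by the VAE decoder.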

Scalability: TC-JEPA demonstrates robust, monotonic scaling with model and data size, in contrast to the instability and saturation observed in I-JEPA under heavy masking (Huang et al., 5 May 2026).

5. Empirical Results and Evaluation

Image Classification and Dense Prediction:

On ImageNet-1k, TC-JEPA L/16 reaches 79.6% top-1 accuracy (vs. 77.5% for I-JEPA L/16), with gains of 1–4% on CIFAR100, Places205, and iNat18.
On COCO object detection it gains +1.5–2.5 APᵇ, and on ADE20k/Pascal VOC segmentation +2–4 mIoU, over I-JEPA, DINO, and iBOT.
On large-scale datasets (IN-21k, CC27M), TC-JEPA outperforms I-JEPA and other patch-masked, contrastive, and multimodal approaches (Huang et al., 5 May 2026).

Language Modeling:

LLM-JEPA (and by extension TC-JEPA) increases fine-tuning accuracy significantly across architectures, with improvements such as 57.3%→71.5% (NL-RX-SYNTH), 33.7%→43.1% (gemma-2-2B) (Huang et al., 11 Sep 2025).

Text-to-Image Generation:

JEPA-T achieves state-of-the-art FID 1.42 and IS 298.3 on ImageNet-1K, with superior open-vocabulary generalization versus non-fusion and early-fusion baselines (Wan et al., 1 Oct 2025).

Ablation Studies:

Multi-layer patch-wise text conditioning, as opposed to holistic/global pooling, is crucial for maximal gains.
Removing sparsity or consistency regularizers degrades semantic alignment and task performance (Huang et al., 5 May 2026).
In T2I, late-fusion cross-attention is essential; removing it causes marked degradation in FID and IS (Wan et al., 1 Oct 2025).
In LLMs, the effect of predictor tokens, JEPA/LLM weighting, and contrastive vs. MSE/cosine loss functions is systematically studied (Huang et al., 11 Sep 2025).

6. Paradigmatic Shifts and Broader Impact

TC-JEPA introduces a new paradigm for vision–language pretraining by replacing global contrastive alignment (e.g., CLIP) with fine-grained, text-conditioned predictive feature learning. The architecture yields rich, semantically meaningful patch and token representations, facilitating transfer to dense, multimodal, and generative tasks. During inference, only the image or text encoder is used, preserving computational efficiency. The reliance on text annotations (human- or LLM-generated) exposes the model to annotation biases; high caption diversity (multiple captions per image) is essential for comprehensive visual semantic coverage (Huang et al., 5 May 2026).

A plausible implication is that TC-JEPA may become the preferred self-supervised backbone for settings where grounding to external language, increased interpretability, or generalization beyond contrastive paradigms is required. Potential extensions include scaling to higher spatial resolutions, enhanced multi-caption or dialog conditioning, and joint pretraining for vision, language, and multimodal reasoning (Wan et al., 1 Oct 2025, Huang et al., 11 Sep 2025).

7. Key Differences and Relationships with Adjacent Approaches

Approach	Text Conditioning	Objective	Masking/Prediction Target
I-JEPA	None	Patch feature-prediction ($\ell_2$)	Visual patches
CLIP	Global contrastive	Image-text embedding alignment	Global image/text embeddings
TC-JEPA	Fine-grained cross-attention	Patch/Span feature-prediction + grounding	Visual patches, text spans
JEPA-T	Late-fusion cross-attention	Masked prediction + flow matching	Visual tokens
LLM-JEPA / TC-JEPA (LM)	Context-target span split	Embedding prediction + cross-entropy	Text spans

TC-JEPA differentiates itself by conditioning each predicted local feature (image patch, text span) on textual description, enabling stronger semantic supervision than contrastive or generative methods alone. This text-conditioned predictive architecture unifies disparate self-supervised advances and enables new capabilities in representation learning for multimodal tasks (Huang et al., 5 May 2026, Wan et al., 1 Oct 2025, Huang et al., 11 Sep 2025).