RadJEPA: Self-Supervised Radiology Encoder

Updated 29 January 2026
  • The paper introduces RadJEPA, a self-supervised radiology encoder that uses a JEPA framework to predict latent representations of masked chest X-ray regions.
  • It delivers improved performance over contrastive and self-distillation methods, achieving gains such as +2.4 AUPRC and higher Dice scores in segmentation.
  • The approach demonstrates sample efficiency and potential for advancing computer-aided diagnosis through optimized anatomical feature extraction.

RadJEPA is a self-supervised radiology encoder for chest X-ray images, developed using a Joint Embedding Predictive Architecture (JEPA) paradigm and trained without language supervision. Unlike medical vision–language models that require paired image–report data, RadJEPA learns solely from unlabeled images by explicitly predicting the latent representation of masked regions based on visible context. This predictive modeling approach distinguishes RadJEPA from contrastive methods and self-distillation frameworks, enabling the acquisition of high-level anatomical and pathological features without bias toward clinically salient text annotations.

1. Motivation and Paradigm Shift

Medical vision–language models such as CLIP-style encoders, VirTex, and MRM typically leverage paired chest X-ray images and radiology reports for representation learning. However, radiology reports frequently omit subtle findings and selectively emphasize clinical priorities, introducing bias and diminishing fine-grained anatomical fidelity. Contrastive and self-distillation encoders (e.g., DINO, DINOv2, Rad-DINO) enforce invariance across augmented views but may prioritize pixel-level consistency over semantic nuance.

RadJEPA departs from these approaches by adopting JEPA, which models latent-space prediction rather than global representation alignment or pixel reconstruction. The encoder learns by predicting embeddings of masked regions from visible context, thereby distilling high-level semantics inherent to anatomical structures and disease patterns. This approach obviates the need for textual supervision and contrastive negatives, allowing for robust encoder development solely from image data (Khan et al., 22 Jan 2026).

2. Joint Embedding Predictive Architecture (JEPA) Formalism

RadJEPA employs JEPA to structure its self-supervised objective. For a given chest X-ray $x \in \mathbb{R}^{H \times W}$, two non-overlapping crops are sampled: context ($c$) and target ($t$), each covering approximately 25–50% of the image. Both are fed through a shared Vision Transformer (ViT-B/14) encoder $f$ (online) and a momentum encoder $f'$ (target) to produce $d$-dimensional embeddings:

  • $z_c = f(c; \theta)$
  • $z_t = f'(t; \theta')$

A three-layer MLP predictor $g_\phi$ transforms $z_c$ into $\hat{z}_t$, aiming to approximate $z_t$:

  • $\hat{z}_t = g_\phi(z_c)$

The loss function minimizes the $\ell_2$ distance between the stop-gradient target $z_t$ and the prediction $\hat{z}_t$:

$$\mathcal{L}_{\mathrm{JEPA}} = \mathbb{E}_{(c,t)}\left[\left\| \operatorname{stopgrad}(z_t) - \hat{z}_t \right\|_2^2\right]$$

The momentum encoder is updated via:

$$\theta' \leftarrow \tau \, \theta' + (1 - \tau) \, \theta, \qquad \tau \in [0, 1)$$

No learnable projection head is employed; predictions occur directly in embedding space.
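The objective above can be sketched with toy numpy stand-ins. Note this is illustrative only: the real model uses a ViT-B/14 encoder and a three-layer MLP predictor, whereas here random vectors stand in for encoder outputs, a single linear map stands in for the predictor, and the momentum value $\tau = 0.996$ is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768       # embedding dimension of ViT-B/14
tau = 0.996   # hypothetical momentum coefficient (not given in the source)

# Stand-ins for encoder outputs on the context and target crops.
z_c = rng.standard_normal(d)   # online encoder:   z_c = f(c; theta)
z_t = rng.standard_normal(d)   # momentum encoder: z_t = f'(t; theta')

# Toy linear predictor standing in for the 3-layer MLP g_phi.
W = rng.standard_normal((d, d)) / np.sqrt(d)
z_t_hat = W @ z_c

# JEPA loss: squared l2 distance to the (stop-gradient) target embedding.
loss = np.sum((z_t - z_t_hat) ** 2)

# EMA update of the momentum-encoder parameters:
# theta' <- tau * theta' + (1 - tau) * theta
theta_online = rng.standard_normal(10)
theta_target = rng.standard_normal(10)
theta_target = tau * theta_target + (1 - tau) * theta_online
```

In a real training loop the gradient flows only through $z_c$ and the predictor; the target branch is updated exclusively via the EMA rule, which is what the stop-gradient operator expresses.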

3. Training Protocol and Architectural Specifications

Pretraining of RadJEPA is conducted on 839,364 chest X-ray images pooled from BRAX, CheXpert, MIMIC-CXR, ChestX-ray14, and PadChest, maintaining an approximately 3:1 ratio of frontal to lateral views. Resolution is standardized to $224 \times 224$, with ViT-B/14 patch tokens ($16 \times 16$ grid; 86M parameters). The predictor MLP uses a hidden size matching the ViT embedding dimension (768).
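The patch-grid figures follow directly from the stated resolution and patch size, as this small sanity check shows:

```python
# Patch-grid arithmetic for ViT-B/14 at the stated input resolution.
patch_size = 14                  # ViT-B/14 uses 14x14-pixel patches
resolution = 224                 # standardized input resolution
grid = resolution // patch_size  # patches per side
num_tokens = grid * grid         # patch tokens per image
```

This confirms the $16 \times 16$ grid (256 patch tokens) quoted above.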

Optimization is performed with AdamW (base LR $1 \times 10^{-4}$, weight decay 0.05, batch size 1024 over 32 GPUs), cosine LR decay, and a 10-epoch warmup. Fifty epochs are run per pretraining cycle. The masking strategy restricts pretraining augmentations to random cropping of regions; no pixel-level or appearance alterations are introduced.
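A common warmup-plus-cosine schedule consistent with these hyperparameters can be sketched as follows; the decay floor of zero and per-epoch (rather than per-step) granularity are assumptions, as the source does not specify them:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=10, total_epochs=50):
    """Linear warmup to base_lr over warmup_epochs, then cosine decay
    to zero by total_epochs (a typical recipe; endpoints assumed)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at the base LR at the end of warmup (epoch 10) and decays smoothly through the remaining 40 epochs.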

After pretraining, the encoder ff is frozen, and downstream heads—classification (linear classifier), segmentation (UPerNet decoder), and report generation (MLP adapter + Vicuna-7B)—are trained atop the learned representations.

4. Downstream Task Evaluation and Comparative Analysis

4.1 Disease Classification

Linear probe evaluations are conducted on VinDr-CXR and RSNA Pneumonia. Embeddings (768-D) extracted from center-cropped, augmented images are classified using a frozen encoder and trained linear head.
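A linear probe of this kind, where the encoder is frozen and only a linear head is trained on its embeddings, can be sketched with synthetic stand-in features; the data, logistic-regression head, and training loop here are illustrative, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 768   # samples and frozen-embedding dimension (768-D)

# Stand-ins for frozen RadJEPA embeddings and binary disease labels.
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)

# Linear probe: logistic regression trained on top of frozen features.
w = np.zeros(d)
lr = 0.1
for _ in range(200):
    z = np.clip(X @ w, -30, 30)       # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    w -= lr * X.T @ (p - y) / n       # gradient of binary cross-entropy

train_acc = np.mean(((X @ w) > 0) == (y > 0.5))
```

Because the encoder stays frozen, probe accuracy directly measures how linearly separable the pretrained embeddings are for the downstream labels.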

| Model | VinDr-Agg. AUPRC | RSNA AUPRC | RSNA AUROC |
|---|---|---|---|
| Rad-DINO | 52.8 | 71.0 | 88.4 |
| I-JEPA | 50.0 | 70.2 | 87.4 |
| MRM | 51.3 | 71.4 | 89.0 |
| RadJEPA | 55.2 | 72.7 | 89.2 |

RadJEPA delivers +2.4 AUPRC gain over Rad-DINO on VinDr and +1.3 / +0.2 gain in AUPRC/AUROC on RSNA.

4.2 Semantic Segmentation

Segmentation heads (UPerNet, 39M parameters) are attached to the frozen encoder for targets including lung, lung-zone, and rib delineation. Metrics are reported as mean Dice.
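Mean Dice over binary masks, the metric reported below, can be computed as in this minimal sketch (the smoothing term `eps` is a standard convention, not taken from the source):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity for binary masks: 2*|A & B| / (|A| + |B|)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

A reported "mean Dice" averages this score over all evaluation images (and, for multi-structure targets, over the structure classes).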

| Target | Rad-DINO | I-JEPA | RadJEPA |
|---|---|---|---|
| Lungs | 95.9 | 97.9 | 98.3 |
| Lung-zones | 85.7 | 92.0 | 93.7 |
| Ribs | 73.4 | 85.2 | 89.6 |

RadJEPA raises Dice scores by roughly +2 to +16 points over Rad-DINO, with the largest gains on complex structures such as ribs.

4.3 Report Generation

Report generation applies a LLaVA-style adapter to a frozen Vicuna-7B LLM, producing radiology reports from image embeddings. Metrics include ROUGE-L, BLEU-4, RG_ER, and Macro-F1.

| Model | ROUGE-L | BLEU-4 | RG_ER | F1-14 |
|---|---|---|---|---|
| Rad-DINO | 24.6 | 9.3 | 22.8 | 31.9 |
| I-JEPA | 25.6 | 9.5 | 23.4 | 32.1 |
| RadJEPA | 26.1 | 10.1 | 23.8 | 32.6 |
| IU-Xray | 27.1→28.4 | 9.6→9.9 | 27.0→27.5 | 26.8→27.6 |

Absolute gains of +0.5–1.0 are noted across all metrics, with ~1.3 F1 improvement on IU-Xray.

5. Analytical Observations and Ablation Studies

Predictive modeling in latent space, as instantiated by JEPA, prioritizes semantic and anatomical feature learning over appearance-level or pixel-scale invariance. Sample efficiency is pronounced: RadJEPA’s ViT-B/14 outperforms larger-scale DINOv2 ViT-G/14 (1.1B params) with ~10× fewer parameters and ~7× fewer images.

Balancing the lateral view subset is found to improve multi-view generalization. Controlled-data ablations, restricting pretraining to MIMIC-CXR, reveal that performance advantages stem from the JEPA objective rather than dataset size alone. Single-tailed statistical tests confirm training improvements are significant ($p < 0.05$) in 16/19 metric–dataset comparisons.
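The source does not specify which single-tailed test was used; a one-tailed paired permutation test over per-sample scores is one common choice, and can be sketched as:

```python
import numpy as np

def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """One-tailed paired permutation test for H1: mean(a) > mean(b).
    Randomly flips the sign of each paired difference and counts how
    often a permuted mean meets or exceeds the observed mean."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a, float) - np.asarray(b, float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the p-value strictly positive.
    return (np.sum(permuted >= observed) + 1) / (n_perm + 1)
```

With per-image (or per-fold) scores for two models, a p-value below 0.05 under this test would support the kind of one-sided significance claim made above.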

Noted limitations include uniform region sampling (suggesting benefit for anatomy-centric masking), single-scale backbone, and report generation lagging radiologist quality. Future work may address multi-scale backbones and improved fusion with LLMs.

6. Impact, Extensions, and Availability

RadJEPA establishes that robust and general-purpose chest X-ray encoders can be developed entirely without paired report supervision or contrastive objectives, relying on prediction in latent space. Gains over image–text and image-only baselines are quantifiable: +2–4 AUPRC, +2–5 Dice, +0.5–1 BLEU/ROUGE improvements.

Clinically, RadJEPA’s representations preserve subtle anatomical detail and pathological nuance, holding promise as a backbone for computer-aided diagnosis (CAD), segmentation, and report generation. Proposed extensions include:

  • Cross-modality JEPA (prediction across CT/X-ray)
  • Anatomy-aware masking strategies
  • Integration with foundational LLMs for unified radiology assistants

Code, pretrained weights, and documentation are available on GitHub and Hugging Face (Khan et al., 22 Jan 2026).
