RadJEPA: Self-Supervised Radiology Encoder

Updated 29 January 2026
  • The paper introduces RadJEPA, a self-supervised radiology encoder that uses a JEPA framework to predict latent representations of masked chest X-ray regions.
  • It delivers improved performance over contrastive and self-distillation methods, achieving gains such as +2.4 AUPRC and higher Dice scores in segmentation.
  • The approach demonstrates sample efficiency and potential for advancing computer-aided diagnosis through optimized anatomical feature extraction.

RadJEPA is a self-supervised radiology encoder for chest X-ray images, developed using a Joint Embedding Predictive Architecture (JEPA) paradigm and trained without language supervision. Unlike medical vision–language models that require paired image–report data, RadJEPA learns solely from unlabeled images by explicitly predicting the latent representation of masked regions based on visible context. This predictive modeling approach distinguishes RadJEPA from contrastive methods and self-distillation frameworks, enabling the acquisition of high-level anatomical and pathological features without bias toward clinically salient text annotations.

1. Motivation and Paradigm Shift

Medical vision–language models such as CLIP-style encoders, VirTex, and MRM typically leverage paired chest X-ray images and radiology reports for representation learning. However, radiology reports frequently omit subtle findings and selectively emphasize clinical priorities, introducing bias and diminishing fine-grained anatomical fidelity. Contrastive and self-distillation encoders (e.g., DINO, DINOv2, Rad-DINO) enforce invariance across augmented views but may prioritize pixel-level consistency over semantic nuance.

RadJEPA departs from these approaches by adopting JEPA, which models latent-space prediction rather than global representation alignment or pixel reconstruction. The encoder learns by predicting embeddings of masked regions from visible context, thereby distilling high-level semantics inherent to anatomical structures and disease patterns. This approach obviates the need for textual supervision and contrastive negatives, allowing for robust encoder development solely from image data (Khan et al., 22 Jan 2026).

2. Joint Embedding Predictive Architecture (JEPA) Formalism

RadJEPA employs JEPA to structure its self-supervised objective. For a given chest X-ray $x \in \mathbb{R}^{H \times W}$, two non-overlapping crops are sampled: context ($c$) and target ($t$), each covering approximately 25–50% of the image. Both are fed through a shared Vision Transformer (ViT-B/14) encoder $f$ (online) and a momentum encoder $f'$ (target) to produce $d$-dimensional embeddings:

  • $z_c = f(c; \theta)$
  • $z_t = f'(t; \theta')$

A three-layer MLP predictor $g_\phi$ transforms $z_c$ into $\hat{z}_t$, aiming to approximate $z_t$:

  • $\hat{z}_t = g_\phi(z_c)$

The loss function minimizes the $\ell_2$ distance between the stop-gradient target $z_t$ and the prediction $\hat{z}_t$:

$$\mathcal{L}_{\mathrm{JEPA}} = \mathbb{E}_{(c,t)}\left[\left\| \operatorname{stopgrad}(z_t) - \hat{z}_t \right\|_2^2\right]$$

The momentum encoder is updated via:

$$\theta' \leftarrow \tau \, \theta' + (1 - \tau) \, \theta, \qquad \tau \in [0, 1)$$

No learnable projection head is employed; predictions occur directly in embedding space.
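The objective above can be sketched with toy numpy stand-ins. Note this is illustrative only: the real model uses a ViT-B/14 encoder and a three-layer MLP predictor, whereas here random vectors stand in for encoder outputs, a single linear map stands in for the predictor, and the momentum value $\tau = 0.996$ is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768       # embedding dimension of ViT-B/14
tau = 0.996   # hypothetical momentum coefficient (not given in the source)

# Stand-ins for encoder outputs on the context and target crops.
z_c = rng.standard_normal(d)   # online encoder:   z_c = f(c; theta)
z_t = rng.standard_normal(d)   # momentum encoder: z_t = f'(t; theta')

# Toy linear predictor standing in for the 3-layer MLP g_phi.
W = rng.standard_normal((d, d)) / np.sqrt(d)
z_t_hat = W @ z_c

# JEPA loss: squared l2 distance to the (stop-gradient) target embedding.
loss = np.sum((z_t - z_t_hat) ** 2)

# EMA update of the momentum-encoder parameters:
# theta' <- tau * theta' + (1 - tau) * theta
theta_online = rng.standard_normal(10)
theta_target = rng.standard_normal(10)
theta_target = tau * theta_target + (1 - tau) * theta_online
```

In a real training loop the gradient flows only through $z_c$ and the predictor; the target branch is updated exclusively via the EMA rule, which is what the stop-gradient operator expresses.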

3. Training Protocol and Architectural Specifications

Pretraining of RadJEPA is conducted on 839,364 chest X-ray images pooled from BRAX, CheXpert, MIMIC-CXR, ChestX-ray14, and PadChest, maintaining an approximately 3:1 ratio of frontal to lateral views. Resolution is standardized to $224 \times 224$, with ViT-B/14 patch tokens ($16 \times 16$ grid; 86M parameters). The predictor MLP uses a hidden size matching the ViT embedding dimension (768).
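The patch-grid figures follow directly from the stated resolution and patch size, as this small sanity check shows:

```python
# Patch-grid arithmetic for ViT-B/14 at the stated input resolution.
patch_size = 14                  # ViT-B/14 uses 14x14-pixel patches
resolution = 224                 # standardized input resolution
grid = resolution // patch_size  # patches per side
num_tokens = grid * grid         # patch tokens per image
```

This confirms the $16 \times 16$ grid (256 patch tokens) quoted above.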

Optimization is performed with AdamW (base LR $1 \times 10^{-4}$, weight decay 0.05, batch size 1024 over 32 GPUs), cosine LR decay, and a 10-epoch warmup. Fifty epochs are run per pretraining cycle. The masking strategy restricts pretraining augmentations to random cropping of regions; no pixel-level or appearance alterations are introduced.
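A common warmup-plus-cosine schedule consistent with these hyperparameters can be sketched as follows; the decay floor of zero and per-epoch (rather than per-step) granularity are assumptions, as the source does not specify them:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=10, total_epochs=50):
    """Linear warmup to base_lr over warmup_epochs, then cosine decay
    to zero by total_epochs (a typical recipe; endpoints assumed)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at the base LR at the end of warmup (epoch 10) and decays smoothly through the remaining 40 epochs.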

After pretraining, the encoder ff is frozen, and downstream heads—classification (linear classifier), segmentation (UPerNet decoder), and report generation (MLP adapter + Vicuna-7B)—are trained atop the learned representations.

4. Downstream Task Evaluation and Comparative Analysis

4.1 Disease Classification

Linear probe evaluations are conducted on VinDr-CXR and RSNA Pneumonia. Embeddings (768-D) extracted from center-cropped, augmented images are classified using a frozen encoder and trained linear head.
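A linear probe of this kind, where the encoder is frozen and only a linear head is trained on its embeddings, can be sketched with synthetic stand-in features; the data, logistic-regression head, and training loop here are illustrative, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 768   # samples and frozen-embedding dimension (768-D)

# Stand-ins for frozen RadJEPA embeddings and binary disease labels.
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)

# Linear probe: logistic regression trained on top of frozen features.
w = np.zeros(d)
lr = 0.1
for _ in range(200):
    z = np.clip(X @ w, -30, 30)       # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    w -= lr * X.T @ (p - y) / n       # gradient of binary cross-entropy

train_acc = np.mean(((X @ w) > 0) == (y > 0.5))
```

Because the encoder stays frozen, probe accuracy directly measures how linearly separable the pretrained embeddings are for the downstream labels.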

| Model | VinDr-Agg. AUPRC | RSNA AUPRC | RSNA AUROC |
|---|---|---|---|
| Rad-DINO | 52.8 | 71.0 | 88.4 |
| I-JEPA | 50.0 | 70.2 | 87.4 |
| MRM | 51.3 | 71.4 | 89.0 |
| RadJEPA | 55.2 | 72.7 | 89.2 |

RadJEPA delivers +2.4 AUPRC gain over Rad-DINO on VinDr and +1.3 / +0.2 gain in AUPRC/AUROC on RSNA.

4.2 Semantic Segmentation

Segmentation heads (UPerNet, 39M parameters) are attached to the frozen encoder for targets including lung, lung-zone, and rib delineation. Metrics are reported as mean Dice.
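Mean Dice over binary masks, the metric reported below, can be computed as in this minimal sketch (the smoothing term `eps` is a standard convention, not taken from the source):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity for binary masks: 2*|A & B| / (|A| + |B|)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

A reported "mean Dice" averages this score over all evaluation images (and, for multi-structure targets, over the structure classes).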

| Target | Rad-DINO | I-JEPA | RadJEPA |
|---|---|---|---|
| Lungs | 95.9 | 97.9 | 98.3 |
| Lung-zones | 85.7 | 92.0 | 93.7 |
| Ribs | 73.4 | 85.2 | 89.6 |

RadJEPA raises Dice scores by roughly +2 to +16 points over Rad-DINO, with the largest gains on complex structures such as ribs.

4.3 Report Generation

Report generation applies a LLaVA-style adapter to a frozen Vicuna-7B LLM, producing radiology reports from image embeddings. Metrics include ROUGE-L, BLEU-4, RG_ER, and Macro-F1.

| Model | ROUGE-L | BLEU-4 | RG_ER | F1-14 |
|---|---|---|---|---|
| Rad-DINO | 24.6 | 9.3 | 22.8 | 31.9 |
| I-JEPA | 25.6 | 9.5 | 23.4 | 32.1 |
| RadJEPA | 26.1 | 10.1 | 23.8 | 32.6 |
| IU-Xray | 27.1→28.4 | 9.6→9.9 | 27.0→27.5 | 26.8→27.6 |

Absolute gains of +0.5–1.0 are noted across all metrics, with ~1.3 F1 improvement on IU-Xray.

5. Analytical Observations and Ablation Studies

Predictive modeling in latent space, as instantiated by JEPA, prioritizes semantic and anatomical feature learning over appearance-level or pixel-scale invariance. Sample efficiency is pronounced: RadJEPA’s ViT-B/14 outperforms larger-scale DINOv2 ViT-G/14 (1.1B params) with ~10× fewer parameters and ~7× fewer images.

Balancing the lateral view subset is found to improve multi-view generalization. Controlled-data ablations, restricting pretraining to MIMIC-CXR, reveal that performance advantages stem from the JEPA objective rather than dataset size alone. Single-tailed statistical tests confirm training improvements are significant ($p < 0.05$) in 16/19 metric–dataset comparisons.
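The source does not specify which single-tailed test was used; a one-tailed paired permutation test over per-sample scores is one common choice, and can be sketched as:

```python
import numpy as np

def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """One-tailed paired permutation test for H1: mean(a) > mean(b).
    Randomly flips the sign of each paired difference and counts how
    often a permuted mean meets or exceeds the observed mean."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a, float) - np.asarray(b, float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the p-value strictly positive.
    return (np.sum(permuted >= observed) + 1) / (n_perm + 1)
```

With per-image (or per-fold) scores for two models, a p-value below 0.05 under this test would support the kind of one-sided significance claim made above.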

Noted limitations include uniform region sampling (suggesting benefit for anatomy-centric masking), single-scale backbone, and report generation lagging radiologist quality. Future work may address multi-scale backbones and improved fusion with LLMs.

6. Impact, Extensions, and Availability

RadJEPA establishes that robust and general-purpose chest X-ray encoders can be developed entirely without paired report supervision or contrastive objectives, relying on prediction in latent space. Gains over image–text and image-only baselines are quantifiable: +2–4 AUPRC, +2–5 Dice, +0.5–1 BLEU/ROUGE improvements.

Clinically, RadJEPA’s representations preserve subtle anatomical detail and pathological nuance, holding promise as a backbone for computer-aided diagnosis (CAD), segmentation, and report generation. Proposed extensions include:

  • Cross-modality JEPA (prediction across CT/X-ray)
  • Anatomy-aware masking strategies
  • Integration with foundational LLMs for unified radiology assistants

Code, pretrained weights, and documentation are available on GitHub and Hugging Face (Khan et al., 22 Jan 2026).
