
Rad-DINO: Self-Supervised CAC Detection

Updated 29 January 2026
  • Rad-DINO is a self-supervised, vision transformer–based pipeline designed for automated coronary artery calcium detection and scoring in CT imaging.
  • It integrates a student-teacher framework with label-guided cropping to steer attention toward calcified regions, significantly improving sensitivity and specificity over conventional methods.
  • The pipeline encompasses pretraining, feature extraction, binary slice classification, and UNET segmentation to achieve accurate Agatston scoring for clinical risk assessment.

Rad-DINO is a self-supervised, vision transformer–based pipeline designed for automated coronary artery calcium (CAC) detection and scoring in computed tomography (CT) imaging. It leverages the DINO self-distillation framework with a specifically engineered label-guided variant (DINO-LG), improving the localization and segmentation of calcified regions critical for coronary artery disease (CAD) risk assessment. The method substantially improves sensitivity and specificity over traditional UNET-based approaches by using targeted cropping to steer self-attention toward annotated calcifications, enabling downstream segmentation and accurate Agatston-style quantification (Gokmen et al., 2024).

1. DINO Self-Supervised Distillation for Medical Imaging

Rad-DINO adapts DINO ("self-distillation with no labels") to medical imaging tasks, specifically CT-based CAC scoring. The architecture employs two vision transformer (ViT) networks, a student (parameters $\theta_s$) and a teacher (parameters $\theta_t$), that process multiple augmented views of each input CT slice. The distillation loss is computed between the teacher's softmax outputs (as soft targets) and the student's predictions over these augmented crops:

L_{\mathrm{DINO}} = -\frac{1}{N_s N_t} \sum_{i=1}^{N_s} \sum_{j=1}^{N_t} p_t(v_j^t) \cdot \log p_s(v_i^s)

where $N_s$ and $N_t$ are the numbers of views for the student and teacher, and $p_s(\cdot)$ and $p_t(\cdot)$ are temperature-scaled softmax outputs. The teacher network is updated using an exponential moving average of the student's weights:

\theta_t \gets m\theta_t + (1-m)\theta_s

with momentum $m$ scheduled from 0.996 to 1.0 through training (Gokmen et al., 2024).
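The distillation loss and EMA update can be sketched in NumPy as below. The temperatures (0.1 for the student, 0.04 for the teacher) and the cosine momentum ramp are standard DINO defaults, not values stated in this summary, so treat them as assumptions:

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax along the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """L_DINO: cross-entropy of student predictions against the teacher's
    sharpened soft targets, averaged over all (student, teacher) view pairs."""
    p_t = softmax(teacher_logits, tau_t)              # (N_t, K) soft targets
    log_p_s = np.log(softmax(student_logits, tau_s))  # (N_s, K)
    # -1/(N_s N_t) * sum_{i,j} p_t(v_j) . log p_s(v_i)
    return float(-(p_t @ log_p_s.T).mean())

def momentum_schedule(epoch, total_epochs=150, m_base=0.996):
    """Cosine ramp of the EMA momentum from m_base (epoch 0) to 1.0."""
    return 1.0 - (1.0 - m_base) * (np.cos(np.pi * epoch / total_epochs) + 1) / 2

def ema_update(theta_t, theta_s, m):
    """theta_t <- m * theta_t + (1 - m) * theta_s."""
    return m * theta_t + (1 - m) * theta_s
```

Because the teacher receives no gradients, only this EMA update, it acts as a slowly-moving target whose outputs stabilize training.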

2. Label-Guided Cropping and Attention Focusing

A central innovation in Rad-DINO is the introduction of "label-guided" local cropping within the data augmentation pipeline (Editor's term: LG-crops). During training, for slices annotated with CAC, a subset of local crops is explicitly centered on the annotated calcifications, while the remaining crops stay random:

  • Standard DINO uses $N_l = 16$ random local crops.
  • DINO-LG replaces 4 of these with guided crops: $N_{l,\mathrm{random}} = 12$, $N_{l,\mathrm{guided}} = 4$.

This adjustment steers the ViT’s attention maps toward annotated regions, as revealed by attention visualizations—guided crops yield focus on calcium, whereas unguided crops distribute attention diffusely. No extra trainable layers are added; only the cropping pipeline is altered (Gokmen et al., 2024).
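A minimal sketch of the cropping change, in NumPy. The crop size (96 px) and the random-sampling details are assumptions for illustration; only the 12-random / 4-guided split comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def crop(img, cy, cx, size=96):
    """Square crop centered at (cy, cx), clamped to stay inside the image."""
    h, w = img.shape
    half = size // 2
    cy = int(np.clip(cy, half, h - half))
    cx = int(np.clip(cx, half, w - half))
    return img[cy - half:cy + half, cx - half:cx + half]

def lg_crops(img, cac_mask=None, n_guided=4, n_random=12, size=96):
    """DINO-LG local cropping: when a CAC annotation mask exists, center
    n_guided crops on annotated pixels; otherwise all 16 crops are random."""
    h, w = img.shape
    crops = []
    if cac_mask is not None and cac_mask.any():
        ys, xs = np.nonzero(cac_mask)
        for _ in range(n_guided):
            k = rng.integers(len(ys))          # pick an annotated pixel
            crops.append(crop(img, ys[k], xs[k], size))
        n_rand = n_random
    else:
        n_rand = n_random + n_guided           # unannotated slice: all random
    for _ in range(n_rand):
        crops.append(crop(img, rng.integers(h), rng.integers(w), size))
    return crops
```

Note that annotated and unannotated slices yield the same number of crops, so batch shapes are unchanged and no new trainable parameters are introduced.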

3. End-to-End Training and Inference Pipeline

The Rad-DINO workflow comprises several stages:

  1. Pretraining: ViT-B/8 teacher and student are jointly trained for 150 epochs using both random and label-guided crops. The cross-entropy distillation loss drives the student to mimic the teacher, with parameters progressively synchronized by EMA.
  2. Feature Extraction: After pretraining, the teacher network is frozen. Each CT slice (gated/non-gated) is encoded via its ViT [CLS] token embedding.
  3. Slice Classification: A linear + softmax binary head is trained for approximately 10 epochs to differentiate CAC-positive and CAC-negative slices using the extracted embeddings.
  4. Segmentation: Slices exceeding a threshold probability $(P > 0.5)$ for calcium are passed to a basic UNET architecture for pixel-wise segmentation.
  5. Agatston Scoring: Segmented masks are processed with connected-component analysis, Hounsfield Unit thresholding $(\mathrm{HU} > 130)$, and scoring. Aggregate scores produce volume-level CAC classification (Gokmen et al., 2024).
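The scoring stage can be sketched as follows, using the standard Agatston density weights (130–199 HU → 1, 200–299 → 2, 300–399 → 3, ≥400 → 4) and a simple BFS for connected components. The pixel area and minimum-lesion-area values are illustrative assumptions, not parameters reported in the text:

```python
import numpy as np
from collections import deque

def agatston_slice(hu, seg_mask, pixel_area_mm2=0.25, min_area_mm2=1.0):
    """Agatston score for one slice: threshold at 130 HU inside the predicted
    segmentation, find 4-connected lesions, and weight each lesion's area by
    its peak attenuation. pixel_area_mm2 / min_area_mm2 are assumed values."""
    calcium = (hu > 130) & seg_mask
    seen = np.zeros_like(calcium, dtype=bool)
    h, w = calcium.shape
    score = 0.0
    for y0, x0 in zip(*np.nonzero(calcium)):
        if seen[y0, x0]:
            continue
        comp, queue = [], deque([(y0, x0)])   # BFS over one lesion
        seen[y0, x0] = True
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and calcium[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        area = len(comp) * pixel_area_mm2
        if area < min_area_mm2:               # discard sub-millimetre specks
            continue
        peak = max(hu[y, x] for y, x in comp)
        weight = 1 if peak < 200 else 2 if peak < 300 else 3 if peak < 400 else 4
        score += area * weight
    return score
```

Per-slice scores are then summed over the volume before the clinical risk category is assigned.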

Pseudocode Summary

for epoch in range(150):
    for ct_slice in training_set:
        crops_global = random_resized_crops(ct_slice, N_global)
        if slice_has_CAC_annotation(ct_slice):
            # center N_local_guided crops on the annotated calcification
            crops_guided = local_crops_around_annotation(ct_slice, N_local_guided)
            crops_random = random_local_crops(ct_slice, N_local_rand)
            crops_local = crops_guided + crops_random
        else:
            # unannotated slices: all local crops are random
            crops_local = random_local_crops(ct_slice, N_local_rand + N_local_guided)
        S_out = student(theta_s, crops_global + crops_local)  # student sees every view
        T_out = teacher(theta_t, crops_global)                # teacher sees only global views
        loss = cross_entropy_distillation(S_out, T_out)
        theta_s = update_student(theta_s, loss)               # gradient step on the student
        m = momentum_schedule(epoch)
        theta_t = m * theta_t + (1 - m) * theta_s             # EMA update of the teacher

4. Quantitative Performance

Rad-DINO yields substantial performance improvements in CAC slice detection:

Model          | Sensitivity | Specificity | FNR  | FPR
Standard DINO  | 0.79        | 0.77        | 0.21 | 0.23
DINO-LG        | 0.89        | 0.90        | 0.11 | 0.10
  • Absolute sensitivity increase: +10%.
  • Specificity increase: +13%.
  • False-negative rate reduction: 49%.
  • False-positive rate reduction: 59%.

Downstream, the Rad-DINO filter yields higher per-artery precision/recall (LAD F1 0.85 vs. 0.76, LCX F1 0.80 vs. 0.75), CAC-score classification accuracy (0.90 vs. 0.76), mean sensitivity (0.86 vs. 0.69), and specificity (0.97 vs. 0.92) compared to standalone UNET segmentation (Gokmen et al., 2024).
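For concreteness, the four reported rates are related as follows; the raw counts below are hypothetical, chosen only to reproduce the DINO-LG row of the table:

```python
def slice_metrics(tp, fn, tn, fp):
    """Slice-level detection metrics from raw confusion counts."""
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    return {
        "sensitivity": round(sensitivity, 2),
        "specificity": round(specificity, 2),
        "fnr": round(1 - sensitivity, 2),   # miss rate = 1 - sensitivity
        "fpr": round(1 - specificity, 2),   # false-alarm rate = 1 - specificity
    }
```

Since FNR and FPR are complements of sensitivity and specificity, the table's four columns carry two independent numbers per model.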

5. System Integration and Clinical Implications

By limiting UNET segmentation to slices classified as CAC-positive, Rad-DINO reduces false alarms and radiologist workload, optimizing both compute and clinical accuracy. The improved specificity mitigates unnecessary follow-up, while high sensitivity preserves diagnostic reliability for CAD risk stratification.

Agatston scoring maps the results to established clinical risk categories: 0–10 (low), 11–100 (moderate), 101–400 (high), >400 (very high). Connected component analysis and Hounsfield Unit thresholding enforce diagnostic rigor (Gokmen et al., 2024).
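The band boundaries quoted above translate directly into a lookup; this helper is an illustration of that mapping, not code from the paper:

```python
def cac_risk_category(total_agatston: float) -> str:
    """Map a volume-level Agatston score to the risk bands quoted above:
    0-10 low, 11-100 moderate, 101-400 high, >400 very high."""
    if total_agatston <= 10:
        return "low"
    if total_agatston <= 100:
        return "moderate"
    if total_agatston <= 400:
        return "high"
    return "very high"
```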

6. Limitations and Future Directions

  • The current evaluation is restricted to a single-center dataset (Stanford COCA), with generalization to different populations and scanners unresolved.
  • Small calcified lesions (<5 mm) challenge segmentation accuracy, impacting clinical Agatston scores.
  • The slice-wise (2D) approach omits 3D context, which may miss subtle, small foci.

Future enhancements may include extension to other target organs (e.g., lung nodules, liver tumors), implementation of 3D ViTs or spatiotemporal SSL, joint multi-task fine-tuning, and uncertainty quantification methodologies such as Monte Carlo dropout for selective human review (Gokmen et al., 2024).

7. Context within Self-Supervised Medical Imaging

Rad-DINO exemplifies the trend of adapting general-purpose self-supervised transformers to targeted clinical tasks by minimal domain-specific modifications (here, label-guided cropping). The pipeline is aligned with broader efforts in radiological foundation models such as RayDINO (Moutakanni et al., 2024) and RAD-DINO (Pérez-García et al., 2024), which employ large-scale, unimodal self-supervision to derive holistic, robust visual features, subsequently enhanced by lightweight task adapters for classification, segmentation, and report generation.

A plausible implication is that such domain-adapted, label-guided self-supervision is broadly applicable for efficient, annotation-sparse training in other volumetric or anatomical tasks, especially where annotated data is scarce and interpretability is paramount.
