
Rad-DINO: Self-Supervised CAC Detection

Updated 29 January 2026
  • Rad-DINO is a self-supervised, vision transformer–based pipeline designed for automated coronary artery calcium detection and scoring in CT imaging.
  • It integrates a student-teacher framework with label-guided cropping to steer attention toward calcified regions, significantly improving sensitivity and specificity over conventional methods.
  • The pipeline encompasses pretraining, feature extraction, binary slice classification, and UNET segmentation to achieve accurate Agatston scoring for clinical risk assessment.

Rad-DINO is a self-supervised, vision transformer–based pipeline designed for automated coronary artery calcium (CAC) detection and scoring in computed tomography (CT) imaging. It leverages the DINO self-distillation framework with a specifically engineered label-guided variant (DINO-LG), improving the localization and segmentation of calcified regions critical for coronary artery disease (CAD) risk assessment. The method substantially improves sensitivity and specificity over traditional UNET-based approaches by using targeted cropping to steer self-attention toward annotated calcifications, enabling downstream segmentation and accurate Agatston-style quantification (Gokmen et al., 2024).

1. DINO Self-Supervised Distillation for Medical Imaging

Rad-DINO adapts DINO ("self-distillation with no labels") to medical imaging tasks, specifically CT-based CAC scoring. The architecture employs two vision transformer (ViT) networks, a student (parameters $\theta_s$) and a teacher (parameters $\theta_t$), that process multiple augmented views of each input CT slice. The distillation loss is computed between the teacher's softmax outputs (as soft targets) and the student's predictions over these augmented crops:

L_{\mathrm{DINO}} = -\frac{1}{N_s N_t} \sum_{i=1}^{N_s} \sum_{j=1}^{N_t} p_t(v_j^t) \cdot \log p_s(v_i^s)

where $N_s$ and $N_t$ are the numbers of views for the student and teacher, and $p_s(\cdot)$ and $p_t(\cdot)$ are temperature-scaled softmax outputs. The teacher network is updated using an exponential moving average of the student's weights:

\theta_t \gets m\theta_t + (1-m)\theta_s

with momentum $m$ scheduled from 0.996 to 1.0 through training (Gokmen et al., 2024).
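The distillation loss and EMA update can be sketched in NumPy as below. The temperatures (0.1 for the student, 0.04 for the teacher) and the cosine momentum ramp are standard DINO defaults, not values stated in this summary, so treat them as assumptions:

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax along the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """L_DINO: cross-entropy of student predictions against the teacher's
    sharpened soft targets, averaged over all (student, teacher) view pairs."""
    p_t = softmax(teacher_logits, tau_t)              # (N_t, K) soft targets
    log_p_s = np.log(softmax(student_logits, tau_s))  # (N_s, K)
    # -1/(N_s N_t) * sum_{i,j} p_t(v_j) . log p_s(v_i)
    return float(-(p_t @ log_p_s.T).mean())

def momentum_schedule(epoch, total_epochs=150, m_base=0.996):
    """Cosine ramp of the EMA momentum from m_base (epoch 0) to 1.0."""
    return 1.0 - (1.0 - m_base) * (np.cos(np.pi * epoch / total_epochs) + 1) / 2

def ema_update(theta_t, theta_s, m):
    """theta_t <- m * theta_t + (1 - m) * theta_s."""
    return m * theta_t + (1 - m) * theta_s
```

Because the teacher receives no gradients, only this EMA update, it acts as a slowly-moving target whose outputs stabilize training.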

2. Label-Guided Cropping and Attention Focusing

A central innovation in Rad-DINO is the introduction of "label-guided" local cropping within the data augmentation pipeline (Editor's term: LG-crops). During training, for slices annotated with CAC, a subset of local crops is explicitly centered on the annotated calcifications, while the remaining crops stay random:

  • Standard DINO uses $N_l = 16$ random local crops.
  • DINO-LG replaces 4 of these with guided crops: $N_{l,\mathrm{random}} = 12$, $N_{l,\mathrm{guided}} = 4$.

This adjustment steers the ViT’s attention maps toward annotated regions, as revealed by attention visualizations—guided crops yield focus on calcium, whereas unguided crops distribute attention diffusely. No extra trainable layers are added; only the cropping pipeline is altered (Gokmen et al., 2024).
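A minimal sketch of the cropping change, in NumPy. The crop size (96 px) and the random-sampling details are assumptions for illustration; only the 12-random / 4-guided split comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def crop(img, cy, cx, size=96):
    """Square crop centered at (cy, cx), clamped to stay inside the image."""
    h, w = img.shape
    half = size // 2
    cy = int(np.clip(cy, half, h - half))
    cx = int(np.clip(cx, half, w - half))
    return img[cy - half:cy + half, cx - half:cx + half]

def lg_crops(img, cac_mask=None, n_guided=4, n_random=12, size=96):
    """DINO-LG local cropping: when a CAC annotation mask exists, center
    n_guided crops on annotated pixels; otherwise all 16 crops are random."""
    h, w = img.shape
    crops = []
    if cac_mask is not None and cac_mask.any():
        ys, xs = np.nonzero(cac_mask)
        for _ in range(n_guided):
            k = rng.integers(len(ys))          # pick an annotated pixel
            crops.append(crop(img, ys[k], xs[k], size))
        n_rand = n_random
    else:
        n_rand = n_random + n_guided           # unannotated slice: all random
    for _ in range(n_rand):
        crops.append(crop(img, rng.integers(h), rng.integers(w), size))
    return crops
```

Note that annotated and unannotated slices yield the same number of crops, so batch shapes are unchanged and no new trainable parameters are introduced.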

3. End-to-End Training and Inference Pipeline

The Rad-DINO workflow comprises several stages:

  1. Pretraining: ViT-B/8 teacher and student are jointly trained for 150 epochs using both random and label-guided crops. The cross-entropy distillation loss drives the student to mimic the teacher, with parameters progressively synchronized by EMA.
  2. Feature Extraction: After pretraining, the teacher network is frozen. Each CT slice (gated/non-gated) is encoded via its ViT [CLS] token embedding.
  3. Slice Classification: A linear + softmax binary head is trained for approximately 10 epochs to differentiate CAC-positive and CAC-negative slices using the extracted embeddings.
  4. Segmentation: Slices exceeding a threshold probability $(P > 0.5)$ for calcium are passed to a basic UNET architecture for pixel-wise segmentation.
  5. Agatston Scoring: Segmented masks are processed with connected-component analysis, Hounsfield Unit thresholding $(\mathrm{HU} > 130)$, and scoring. Aggregate scores produce volume-level CAC classification (Gokmen et al., 2024).
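The scoring stage can be sketched as follows, using the standard Agatston density weights (130–199 HU → 1, 200–299 → 2, 300–399 → 3, ≥400 → 4) and a simple BFS for connected components. The pixel area and minimum-lesion-area values are illustrative assumptions, not parameters reported in the text:

```python
import numpy as np
from collections import deque

def agatston_slice(hu, seg_mask, pixel_area_mm2=0.25, min_area_mm2=1.0):
    """Agatston score for one slice: threshold at 130 HU inside the predicted
    segmentation, find 4-connected lesions, and weight each lesion's area by
    its peak attenuation. pixel_area_mm2 / min_area_mm2 are assumed values."""
    calcium = (hu > 130) & seg_mask
    seen = np.zeros_like(calcium, dtype=bool)
    h, w = calcium.shape
    score = 0.0
    for y0, x0 in zip(*np.nonzero(calcium)):
        if seen[y0, x0]:
            continue
        comp, queue = [], deque([(y0, x0)])   # BFS over one lesion
        seen[y0, x0] = True
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and calcium[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        area = len(comp) * pixel_area_mm2
        if area < min_area_mm2:               # discard sub-millimetre specks
            continue
        peak = max(hu[y, x] for y, x in comp)
        weight = 1 if peak < 200 else 2 if peak < 300 else 3 if peak < 400 else 4
        score += area * weight
    return score
```

Per-slice scores are then summed over the volume before the clinical risk category is assigned.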

Pseudocode Summary

for epoch in range(150):
    for ct_slice in training_set:
        crops_global = random_resized_crops(ct_slice, N_global)
        if slice_has_CAC_annotation(ct_slice):
            # center N_local_guided crops on the annotated calcification
            crops_guided = local_crops_around_annotation(ct_slice, N_local_guided)
            crops_random = random_local_crops(ct_slice, N_local_rand)
            crops_local = crops_guided + crops_random
        else:
            # unannotated slices: all local crops are random
            crops_local = random_local_crops(ct_slice, N_local_rand + N_local_guided)
        S_out = student(theta_s, crops_global + crops_local)  # student sees every view
        T_out = teacher(theta_t, crops_global)                # teacher sees only global views
        loss = cross_entropy_distillation(S_out, T_out)
        theta_s = update_student(theta_s, loss)               # gradient step on the student
        m = momentum_schedule(epoch)
        theta_t = m * theta_t + (1 - m) * theta_s             # EMA update of the teacher

4. Quantitative Performance

Rad-DINO yields substantial performance improvements in CAC slice detection:

Model          | Sensitivity | Specificity | FNR  | FPR
Standard DINO  | 0.79        | 0.77        | 0.21 | 0.23
DINO-LG        | 0.89        | 0.90        | 0.11 | 0.10
  • Absolute sensitivity increase: +10%.
  • Specificity increase: +13%.
  • False-negative rate reduction: 49%.
  • False-positive rate reduction: 59%.

Downstream, the Rad-DINO filter yields higher per-artery precision/recall (LAD F1 0.85 vs. 0.76, LCX F1 0.80 vs. 0.75), CAC-score classification accuracy (0.90 vs. 0.76), mean sensitivity (0.86 vs. 0.69), and specificity (0.97 vs. 0.92) compared to standalone UNET segmentation (Gokmen et al., 2024).
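For concreteness, the four reported rates are related as follows; the raw counts below are hypothetical, chosen only to reproduce the DINO-LG row of the table:

```python
def slice_metrics(tp, fn, tn, fp):
    """Slice-level detection metrics from raw confusion counts."""
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    return {
        "sensitivity": round(sensitivity, 2),
        "specificity": round(specificity, 2),
        "fnr": round(1 - sensitivity, 2),   # miss rate = 1 - sensitivity
        "fpr": round(1 - specificity, 2),   # false-alarm rate = 1 - specificity
    }
```

Since FNR and FPR are complements of sensitivity and specificity, the table's four columns carry two independent numbers per model.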

5. System Integration and Clinical Implications

By limiting UNET segmentation to slices classified as CAC-positive, Rad-DINO reduces false alarms and radiologist workload, optimizing both compute and clinical accuracy. The improved specificity mitigates unnecessary follow-up, while high sensitivity preserves diagnostic reliability for CAD risk stratification.

Agatston scoring maps the results to established clinical risk categories: 0–10 (low), 11–100 (moderate), 101–400 (high), >400 (very high). Connected component analysis and Hounsfield Unit thresholding enforce diagnostic rigor (Gokmen et al., 2024).
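The band boundaries quoted above translate directly into a lookup; this helper is an illustration of that mapping, not code from the paper:

```python
def cac_risk_category(total_agatston: float) -> str:
    """Map a volume-level Agatston score to the risk bands quoted above:
    0-10 low, 11-100 moderate, 101-400 high, >400 very high."""
    if total_agatston <= 10:
        return "low"
    if total_agatston <= 100:
        return "moderate"
    if total_agatston <= 400:
        return "high"
    return "very high"
```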

6. Limitations and Future Directions

  • The current evaluation is restricted to a single-center dataset (Stanford COCA), with generalization to different populations and scanners unresolved.
  • Small calcified lesions (<5 mm) challenge segmentation accuracy, impacting clinical Agatston scores.
  • The slice-wise (2D) approach omits 3D context, which may miss subtle, small foci.

Future enhancements may include extension to other target organs (e.g., lung nodules, liver tumors), implementation of 3D ViTs or spatiotemporal SSL, joint multi-task fine-tuning, and uncertainty quantification methodologies such as Monte Carlo dropout for selective human review (Gokmen et al., 2024).

7. Context within Self-Supervised Medical Imaging

Rad-DINO exemplifies the trend of adapting general-purpose self-supervised transformers to targeted clinical tasks by minimal domain-specific modifications (here, label-guided cropping). The pipeline is aligned with broader efforts in radiological foundation models such as RayDINO (Moutakanni et al., 2024) and RAD-DINO (Pérez-García et al., 2024), which employ large-scale, unimodal self-supervision to derive holistic, robust visual features, subsequently enhanced by lightweight task adapters for classification, segmentation, and report generation.

A plausible implication is that such domain-adapted, label-guided self-supervision is broadly applicable for efficient, annotation-sparse training in other volumetric or anatomical tasks, especially where annotated data is scarce and interpretability is paramount.
