
Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations (2509.21249v1)

Published 25 Sep 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.

Summary

  • The paper presents Decipher-MR, a model that integrates self-supervised 3D vision learning with report-guided text supervision to generate robust MRI representations.
  • It employs a two-stage pretraining strategy combining a 3D ViT-based encoder and a BERT-based text encoder, achieving improved performance in classification, retrieval, segmentation, and localization tasks.
  • Key results include a +2.9% AUROC gain in disease classification and significant improvements in low-data regimes and cross-modal retrieval, demonstrating practical clinical applicability.

Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Introduction and Motivation

The Decipher-MR model addresses the longstanding challenge of developing scalable, generalizable machine learning systems for 3D MRI analysis. MRI's inherent heterogeneity—arising from diverse anatomical coverage, protocol variations, and scanner differences—has historically limited the effectiveness of both traditional and deep learning approaches, especially in low-data or cross-domain settings. While foundation models have demonstrated strong transferability in natural language and 2D vision domains, their application to 3D MRI has been constrained by data scarcity, narrow anatomical focus, and insufficient integration of clinical context.

Decipher-MR is introduced as a 3D vision-language foundation model, pretrained on over 200,000 MRI series from more than 22,000 studies, encompassing a wide range of anatomical regions, imaging protocols, and patient demographics. The model leverages both self-supervised vision learning and report-guided text supervision, aiming to produce robust, generalizable representations that can be efficiently adapted to a broad spectrum of clinical and research tasks via lightweight, modular decoders.

Figure 1: (a) Distribution of the diverse pretraining dataset across age, sex, imaging sequences, body regions, and scanner manufacturers. (b) Overview of the two-stage pretraining framework of Decipher-MR. (c) Evaluation of Decipher-MR in a frozen encoder setup for diverse tasks.

Pretraining Strategy and Model Architecture

Decipher-MR employs a two-stage pretraining pipeline:

  1. Stage 1: Unimodal Self-Supervised Pretraining
    • Vision Encoder: A 3D ViT (ViT-Base, 86M parameters) is pretrained using a DINOv2-based student-teacher framework, combining masked image modeling and multi-crop contrastive objectives. The model is robust to variable input sizes via trilinear position embedding interpolation and is trained with extensive data augmentation to handle MRI-specific inconsistencies.
    • Text Encoder: A BERT-based model (initialized from PubMedBERT) is pretrained on radiology reports using masked language modeling, capturing domain-specific linguistic patterns.
  2. Stage 2: Vision-Language Contrastive Alignment
    • The pretrained vision and text encoders are further aligned using image-report contrastive learning, mapping both modalities into a shared 512-dimensional embedding space. Organ-based batch sampling and report section extraction (via LLMs and SNOMED mapping) are used to enhance fine-grained anatomical and pathological alignment.
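At its core, the Stage 2 alignment is a CLIP-style symmetric contrastive objective over matched image-report pairs. Below is a minimal sketch in PyTorch, assuming pooled features have already been projected into the shared 512-dimensional space; the function name, temperature value, and the assumption of pre-projected features are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """image_feats: (B, 512) pooled 3D ViT features, already projected.
    text_feats:  (B, 512) pooled BERT features, already projected.
    Symmetric InfoNCE over matched image-report pairs (a standard
    CLIP-style choice assumed here)."""
    # L2-normalize so dot products are cosine similarities.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    # (B, B) similarity matrix; diagonal entries are matched pairs.
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

The organ-based batch sampling described above would control which series appear together in a batch, and hence which off-diagonal entries serve as negatives, sharpening the fine-grained anatomical alignment.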

This modular design enables the pretrained encoder to remain frozen during downstream adaptation, with task-specific decoders (e.g., MLPs, ResNet-based segmentation heads, transformer-based localization modules) trained atop the fixed backbone.
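As a concrete illustration of this modular setup, the sketch below freezes a backbone and trains only a lightweight MLP decoder; the encoder interface, feature dimension, and hyperparameters are assumptions for illustration, not the paper's API:

```python
import torch
import torch.nn as nn

def build_probe(pretrained_encoder, feat_dim=768, num_classes=2):
    """Freeze the pretrained backbone and attach a small MLP decoder,
    mirroring the frozen-encoder adaptation pattern described above."""
    for p in pretrained_encoder.parameters():
        p.requires_grad = False          # backbone stays fixed
    pretrained_encoder.eval()            # disable dropout / norm updates
    decoder = nn.Sequential(
        nn.Linear(feat_dim, 256),
        nn.GELU(),
        nn.Linear(256, num_classes),
    )
    # Only the decoder's parameters are handed to the optimizer.
    optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-3)
    return decoder, optimizer
```

Because gradients never flow into the backbone, each new task costs only a small decoder's worth of training, which is what keeps downstream adaptation lightweight.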

Evaluation Across Downstream Tasks

Classification and Regression

Decipher-MR demonstrates consistent improvements over state-of-the-art baselines (DINOv2, BiomedCLIP, MedImageInsight, SAMMed3D) on a range of classification and regression tasks, including disease diagnosis, demographic prediction, body region identification, and imaging attribute detection. Notably, the model achieves:

  • +2.9% AUROC in disease classification
  • +3.0% in demographic prediction
  • Superior performance in low-data regimes, with the performance gap widening as labeled data decreases

Bias analysis reveals that Decipher-MR maintains higher robustness across sex-based splits, outperforming MedImageInsight by an average of 5.5% in cross-sex evaluation scenarios.

Figure 2: (a) Classification performance comparison using MLP probes. (b) Ablation of pretraining data diversity and text supervision. (c) Performance under low-data regimes.

Ablation studies confirm that both anatomical/protocol diversity and text supervision are critical for generalization. Models pretrained on restricted anatomical regions or protocol types underperform even on in-domain tasks, and the addition of report-guided contrastive learning yields significant gains, especially in cardiac and prostate lesion classification, brain age regression, and body region detection.

Cross-Modal Retrieval

Decipher-MR's vision-language alignment enables zero-shot cross-modal retrieval, supporting both text-to-image and image-to-text queries. On out-of-domain datasets:

  • Body region retrieval (Source1): Top-3 localization success rate of 91.4% (full reports) and 78.8% (short descriptions), outperforming MedImageInsight by 10–15%.
  • Tumor pathology retrieval (Source2): 39.3% accuracy in retrieving correct tumor subregion descriptions, surpassing BiomedCLIP and MedImageInsight by 12–20%.

The model achieves higher mean average precision (mAP) and greater retrieval diversity, particularly when using concise, clinically relevant queries.

Figure 3: Cross-modal retrieval performance for body region and tumor pathology tasks.
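
Mechanically, zero-shot retrieval reduces to nearest-neighbor ranking in the shared embedding space. A minimal sketch, assuming embeddings have already been computed with the aligned encoders (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """Rank gallery items (e.g., MRI series) against a query (e.g., a
    report snippet) by cosine similarity in the shared 512-d space."""
    q = F.normalize(query_emb, dim=-1)      # (512,)
    g = F.normalize(gallery_embs, dim=-1)   # (N, 512)
    scores = g @ q                          # (N,) cosine similarities
    return torch.topk(scores, k)            # top-k scores and indices
```

Swapping which side plays the query gives text-to-image versus image-to-text retrieval.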

Segmentation and Localization

For 3D anatomical segmentation, Decipher-MR (frozen encoder + ResNet decoder) matches or exceeds nnUNet (fully end-to-end) and outperforms MedSAM3D (frozen encoder) across abdominal, pelvic, and cardiac datasets. The model converges rapidly, achieving high Dice scores within a single epoch, and demonstrates strong performance even with minimal fine-tuning.

Figure 4: (a) Segmentation performance across datasets. (b) Convergence speed comparison. (c) Qualitative segmentation examples. (d) Anomaly localization and visual grounding. (e) Qualitative anomaly localization results.

For anomaly localization and visual grounding, Decipher-MR as a frozen encoder yields 14% higher mIoU in localization and 11.5% higher mIoU in visual grounding compared to end-to-end trained DETR3D and MedRPG models, respectively. Qualitative results show accurate detection of tumors and surgically removed organs across diverse anatomical regions.

Implementation Considerations

  • Data Processing: Foreground cropping (Otsu's method), intensity normalization, and resolution standardization are essential for robust pretraining; a minimal sketch follows this list. Report processing leverages LLMs for section extraction and SNOMED mapping for anatomical consistency.
  • Modular Adaptation: The frozen encoder paradigm enables rapid adaptation to new tasks with minimal computational overhead, supporting efficient development pipelines in clinical settings.
  • Preprocessing Trade-offs: Task-specific preprocessing (e.g., cropping, resizing, skull-stripping) can further boost performance but may reduce generalizability for tasks requiring broader context.
  • Resource Requirements: Pretraining is computationally intensive (100+ epochs, large batch sizes), but downstream adaptation is lightweight due to the frozen encoder approach.
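
As a rough illustration of the foreground-cropping step, the sketch below uses scikit-image's Otsu threshold to crop a 3D volume to its foreground bounding box and z-score the intensities; the paper's exact preprocessing recipe may differ:

```python
import numpy as np
from skimage.filters import threshold_otsu

def crop_and_normalize(volume):
    """volume: 3D numpy array (D, H, W). Crops to the Otsu foreground
    bounding box, then z-scores intensities over foreground voxels.
    An approximation of the preprocessing described above."""
    t = threshold_otsu(volume)
    mask = volume > t
    # Tight bounding box around foreground voxels.
    coords = np.argwhere(mask)
    (d0, h0, w0), (d1, h1, w1) = coords.min(0), coords.max(0) + 1
    cropped = volume[d0:d1, h0:h1, w0:w1]
    fg = cropped[cropped > t]
    return (cropped - fg.mean()) / (fg.std() + 1e-8)
```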

Limitations and Future Directions

  • Textual Supervision: Gains from report-guided pretraining are less pronounced for imaging attribute classification; richer DICOM metadata and more diverse report corpora may be needed.
  • Pathology-Focused Retrieval: Performance in fine-grained pathology retrieval is limited by the diversity and granularity of textual data.
  • Bias and Generalization: While Decipher-MR is more robust to demographic variation, performance drops persist in cross-sex and out-of-distribution scenarios, indicating the need for further bias mitigation and domain adaptation strategies.
  • Encoder Fine-Tuning: While the frozen encoder paradigm is efficient, certain applications may benefit from partial or full encoder fine-tuning, especially for highly specialized tasks.

Conclusion

Decipher-MR establishes a new standard for MRI-specific foundation models by integrating large-scale, diverse 3D MRI data with report-guided vision-language pretraining. The model demonstrates strong, generalizable performance across classification, retrieval, segmentation, and localization tasks, often matching or exceeding specialized end-to-end models with significantly reduced adaptation cost. Its modular, frozen-encoder design supports efficient, scalable development of AI solutions for MRI, with broad implications for clinical and research applications. Future work should focus on expanding textual supervision, enhancing pathology-specific retrieval, and developing advanced bias mitigation and fine-tuning strategies to further improve generalizability and fairness.
