Janus-Pro-CXR: Multimodal LLM in Radiology
- The paper presents Janus-Pro-CXR, an open-source 1B parameter multimodal LLM that automates chest X-ray interpretation and report generation with state-of-the-art accuracy.
- It employs a unified vision-language Transformer with cross-modal fusion and an expert analysis module, achieving superior performance on datasets like MIMIC-CXR, CheXpert Plus, and CXR-27.
- Clinically validated through multicenter trials, Janus-Pro-CXR enhances report quality, reduces reading time, and supports efficient deployment in both high- and low-resource environments.
Janus-Pro-CXR is an open-source, lightweight, 1-billion-parameter multimodal LLM framework specifically designed for the automated interpretation of chest radiographs (CXR). Built upon the DeepSeek Janus-Pro foundation, Janus-Pro-CXR integrates advanced vision and language components, robust domain-specific optimization, and full clinical validation, achieving state-of-the-art performance in both retrospective and prospective multicenter studies for CXR report generation and critical findings detection. The system is engineered for efficient deployment in high- and low-resource settings, with architecture, weights, and implementation framework released under a permissive license (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).
1. Architectural Overview
Janus-Pro-CXR utilizes a unified vision-language Transformer architecture, with tight coupling across visual and linguistic modalities to enable end-to-end CXR interpretation and report generation. The model comprises the following principal modules:
- Vision Encoder: A Siglip-based Vision Transformer (ViT) with a patch size of 16×16, 12 Transformer blocks, and a hidden dimension of 768. Input CXRs are patchified, linearly projected, and combined with positional encodings.
- Language Decoder: A 24-layer Transformer decoder with hidden dimension 1024 and 16 attention heads, performing autoregressive report generation.
- Cross-modal Fusion ("Aligner"): Consisting of a linear projection and cross-modal self-attention layers, this module integrates visual token embeddings into the language space.
- Expert Analysis Module: Binary classifiers consuming pooled vision features for classification of six clinically significant findings.
The vision encoder outputs a visual token matrix $V \in \mathbb{R}^{N \times 768}$, which is projected to $\tilde{V} \in \mathbb{R}^{N \times 1024}$ for fusion with the language stream. The decoder then generates text autoregressively. The following equations characterize key operations:
- Patch Embedding: $z_0 = [x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{\text{pos}}$, where $x_p^i$ is the $i$-th flattened 16×16 patch and $E$ the linear projection.
- Self-attention (single head): $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
- Multihead attention: $\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$
- Cross-modal Fusion: $\tilde{V} = V W_{\text{align}} + b$, followed by standard MHA over the fused token sequence (sketched in code below).
Figure Description: The architecture processes input CXRs via patchification, embedding, and vision encoding. Pooled visual features feed the critical-finding classifiers and, via the aligner, are fused into the language decoder for contextualized report generation.
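A minimal PyTorch sketch of the aligner's projection-plus-attention fusion, using the stated dimensions (768 → 1024, 16 heads); the class and argument names are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class Aligner(nn.Module):
    """Sketch of the cross-modal fusion ("Aligner"): a linear projection
    from the vision width (768) into the language width (1024), after
    which text hidden states attend over the projected visual tokens.
    Names and residual wiring are assumptions for illustration."""
    def __init__(self, d_vision: int = 768, d_lang: int = 1024, n_heads: int = 16):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_lang)   # V_tilde = V W_align + b
        self.mha = nn.MultiheadAttention(d_lang, n_heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor, text_hidden: torch.Tensor) -> torch.Tensor:
        v = self.proj(vision_tokens)              # (B, N, 1024): visual tokens in language space
        fused, _ = self.mha(query=text_hidden, key=v, value=v)
        return text_hidden + fused                # residual fusion into the language stream

# Shape check: 196 patches (14x14 from a 224x224 input with 16x16 patches).
aligner = Aligner()
out = aligner(torch.randn(2, 196, 768), torch.randn(2, 32, 1024))
assert out.shape == (2, 32, 1024)
```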
2. Training Protocol and Datasets
Janus-Pro-CXR employs a two-stage fine-tuning strategy leveraging diverse large-scale and multicenter datasets:
- Stage 1:
- MIMIC-CXR: 167,496 images (165,131 train, 2,365 test)
- CheXpert Plus: 222,103 images (no held-out test)
- Stage 2:
  - CXR-27: images from 27 Chinese hospitals (11,156 train, 1,240 test)
Training involves rescaling images to 224×224, min–max normalization, and augmentation (horizontal flip with p = 0.5, rotation within ±5°). Tokenization uses Byte-Pair Encoding with a 50k vocabulary. Auto-annotations for 14 findings are produced by CheXbert (MIMIC-CXR) and a DeepSeek labeler (CXR-27), with uncertain labels treated as positive during training.
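A minimal torchvision sketch of this preprocessing, assuming ToTensor's [0, 1] scaling stands in for the min–max normalization and that the stated transforms compose in this order:

```python
from torchvision import transforms

# Preprocessing as described in the text: resize to 224x224, horizontal
# flip with p = 0.5, rotation within +/-5 degrees, then scaling to [0, 1].
# Exact transform order beyond what the text states is an assumption.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=5),
    transforms.ToTensor(),  # scales 8-bit pixel values to [0, 1]
])
```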
The combined loss is $\mathcal{L} = \lambda_{\text{gen}} \mathcal{L}_{\text{gen}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$, where $\mathcal{L}_{\text{gen}}$ is the token-level cross-entropy for report generation, $\mathcal{L}_{\text{cls}}$ is the binary cross-entropy over the finding classifiers, and $\lambda_{\text{gen}}$, $\lambda_{\text{cls}}$ weight the two objectives. The AdamW optimizer is applied (β₁ = 0.9, β₂ = 0.999, ε = 1e−8), with batch size 64 and learning rate warmup (5%), followed by linear decay.
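The following PyTorch sketch illustrates this objective and optimizer configuration; the loss weights, total step count, and scheduler wiring are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def combined_loss(report_logits, report_targets, finding_logits, finding_targets,
                  lambda_gen=1.0, lambda_cls=1.0):
    """Combined objective: cross-entropy over report tokens plus binary
    cross-entropy over the finding classifiers. The lambda values here
    are placeholders, not the paper's coefficients."""
    l_gen = F.cross_entropy(report_logits.flatten(0, 1), report_targets.flatten())
    l_cls = F.binary_cross_entropy_with_logits(finding_logits, finding_targets.float())
    return lambda_gen * l_gen + lambda_cls * l_cls

# AdamW with the stated betas/eps; 5% linear warmup then linear decay.
model = torch.nn.Linear(8, 8)  # stand-in for the full model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8)
total_steps, warmup = 10_000, 500  # warmup = 5% of total (step count assumed)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda s: s / warmup if s < warmup else max(0.0, (total_steps - s) / (total_steps - warmup)),
)
```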
Clinical validation drew on both retrospective and prospective data; the prospective trial (NCT07117266) encompassed 296 patients, each with two independent junior-radiologist interpretations.
3. Benchmarking and Performance Evaluation
Janus-Pro-CXR demonstrates state-of-the-art results across established chest X-ray tasks, outperforming both its Janus-Pro base model and the far larger ChatGPT 4o (≈200B parameters) on every reported metric.
Automated Report Generation:
MIMIC-CXR test set (n=2,365):

| Metric | Janus-Pro-CXR | Janus-Pro | ChatGPT 4o |
|---|---|---|---|
| BLEU-1 | 0.312 | 0.221 | 0.236 |
| ROUGE-L | 0.378 | 0.291 | 0.307 |
| Micro-avg F1-5 | 63.4 | 49.1 | 51.6 |
| Macro-avg F1-5 | 55.1 | 38.7 | 40.3 |
| RadGraph F1 | 25.8 | 18.4 | 20.1 |

CXR-27 test set (n=1,240):

| Metric | Janus-Pro-CXR | Janus-Pro | ChatGPT 4o |
|---|---|---|---|
| BLEU-1 | 0.344 | 0.183 | 0.196 |
| ROUGE-L | 0.410 | 0.237 | 0.254 |
| Micro-avg F1-5 | 71.2 | 32.8 | 34.5 |
| Macro-avg F1-5 | 63.7 | 28.5 | 30.1 |
| RadGraph F1 | 58.6 | 21.9 | 24.3 |
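For reference, the lexical metrics in these tables can be computed roughly as follows; the choice of nltk and rouge-score as toolkits is an assumption, since the paper does not name its scoring libraries:

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# BLEU-1 (unigram precision with brevity penalty) and ROUGE-L (longest
# common subsequence F-score) on a toy reference/candidate pair.
reference = "no acute cardiopulmonary abnormality".split()
candidate = "no acute cardiopulmonary process".split()

bleu1 = sentence_bleu([reference], candidate, weights=(1.0, 0, 0, 0))
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    " ".join(reference), " ".join(candidate)
)["rougeL"].fmeasure
print(f"BLEU-1={bleu1:.3f}  ROUGE-L={rouge_l:.3f}")
```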
Detection of Six Clinically Critical Findings (CXR-27; n=1,026):
| Finding | AUC | Sensitivity | Specificity |
|---|---|---|---|
| Support Devices | 0.931 | 0.727 | 0.880 |
| Pleural Effusion | 0.931 | 0.667 | 0.897 |
| Pneumothorax | 0.921 | 0.526 | 0.912 |
| Atelectasis | 0.902 | 0.526 | 0.905 |
| Cardiomegaly | 0.800 | 0.474 | 0.864 |
| Consolidation | 0.888 | 0.278 | 0.957 |
Metrics are computed as $\text{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$, $\text{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$, and AUC as the area under the ROC curve. These results confirm that the model consistently exceeds clinically relevant performance thresholds across both lexical report generation and structured radiological finding detection (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).
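A short sketch of how these per-finding metrics are typically computed from classifier scores (the 0.5 operating threshold is an assumption):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# AUC from continuous scores; sensitivity/specificity from predictions
# binarized at a chosen threshold. Toy labels/scores for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6])

auc = roc_auc_score(y_true, y_score)
tn, fp, fn, tp = confusion_matrix(y_true, (y_score >= 0.5).astype(int)).ravel()
sensitivity = tp / (tp + fn)   # TP / (TP + FN)
specificity = tn / (tn + fp)   # TN / (TN + FP)
print(f"AUC={auc:.3f}  Se={sensitivity:.3f}  Sp={specificity:.3f}")
```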
4. Prospective Clinical Trial Outcomes
A multicenter, randomized prospective trial (NCT07117266) evaluated the impact of Janus-Pro-CXR on clinical CXR workflows:
- Design: 20 junior radiologists randomized 1:1 to AI-assisted vs. standard care across 296 patient cases at three centers, with independent review by senior radiologists.
- Outcomes:
| Outcome | Standard Care | AI-Assisted | Mean Diff (95% CI) | P |
|---|---|---|---|---|
| Report Quality (μ±σ) | 4.12 ± 0.80 | 4.36 ± 0.50 | +0.25 (0.216–0.283) | <0.001 |
| RADPEER Agreement | 4.14 ± 0.84 | 4.30 ± 0.57 | +0.16 (0.119–0.200) | <0.001 |
| Reading Time (s) | 147.6 ± 51.1 | 120.6 ± 45.6 | −27.0 (−34.8 to −19.2) | <0.001 |
| Expert Preference ≥3/5 | — | 54.3% | — | <0.001 |
AI assistance reduced average interpretation time by 18.3% and was preferred by experts in 54.3% of cases. All improvements were statistically significant (P < 0.001). This demonstrates both objective and subjective benefits of integrating Janus-Pro-CXR into clinical radiology workflows (Bai et al., 23 Dec 2025).
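As an illustration of the reading-time comparison, the sketch below simulates two arms with the reported means and standard deviations and computes a Welch's t-test with a 95% CI; the per-arm sizes and analysis model are assumptions, not the trial's actual statistical plan:

```python
import numpy as np
from scipy import stats

# Simulated reading times (seconds) using the reported means/SDs,
# assuming ~148 cases per arm. Requires SciPy >= 1.11 for
# TtestResult.confidence_interval().
rng = np.random.default_rng(0)
standard = rng.normal(147.6, 51.1, size=148)
assisted = rng.normal(120.6, 45.6, size=148)

res = stats.ttest_ind(assisted, standard, equal_var=False)  # Welch's t-test
ci = res.confidence_interval(confidence_level=0.95)
print(f"mean diff = {assisted.mean() - standard.mean():.1f} s, "
      f"95% CI [{ci.low:.1f}, {ci.high:.1f}], p = {res.pvalue:.2g}")
```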
5. Inference Pipeline and Clinical Integration
Janus-Pro-CXR is optimized for real-world deployment with emphasis on low-latency and minimal computational footprint:
- Architecture: 1B parameters, 1–2 s inference latency on consumer GPUs (RTX 4060, 8 GB); quantization to 8-bit enables CPU inference with 5 GB memory use.
- Pipelined Inference (a client-side sketch follows this list):
- DICOM to PNG conversion and preprocessing on local workstation.
- Image and basic patient data transmitted over LAN to Janus-Pro-CXR server.
- Model outputs draft report (∼3 s total latency).
- Radiologist inserts draft into RIS/PACS, edits as needed, signs off.
- Batching: Disabled by default for latency; pre-fetching and asynchronous I/O maximize throughput.
- Software Stack: PyTorch 2.0, Transformers 4.x, OpenCV, DICOM-RTK, FastAPI.
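A hypothetical client-side sketch of the first three pipeline steps above; the endpoint URL, route, and request fields are placeholders rather than the project's documented API:

```python
import io

import numpy as np
import pydicom
import requests
from PIL import Image

# Step 1: DICOM -> 8-bit PNG with min-max normalization.
ds = pydicom.dcmread("study.dcm")
px = ds.pixel_array.astype(np.float32)
px = (px - px.min()) / (px.max() - px.min() + 1e-8) * 255.0
buf = io.BytesIO()
Image.fromarray(px.astype(np.uint8)).save(buf, format="PNG")

# Steps 2-3: send image plus basic patient data over the LAN and
# receive the draft report. Endpoint and field names are hypothetical.
resp = requests.post(
    "http://janus-pro-cxr.local:8000/generate_report",
    files={"image": ("study.png", buf.getvalue(), "image/png")},
    data={"patient_age": "64", "patient_sex": "M"},
    timeout=30,
)
print(resp.json())
```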
Open-Source Implementation: The entire framework—vision encoder, language decoder, fusion modules, preprocessing tools, training and inference pipelines—is available at https://github.com/ZrH42/Janus-Pro-CXR under the Apache 2.0 license. Deployment requires standard Python environments and supports both CLI and RESTful interfaces for HIS/RIS integration (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).
6. Impact and Reproducibility
Janus-Pro-CXR bridges the gap between academic foundation models and clinical deployment of AI for chest radiograph interpretation. Rigorous multicentric validation demonstrates improved report quality, diagnostic reliability, and workflow efficiency in actual hospital settings, with substantive reduction in reporting time and expert-verified case preference. Its lightweight parameterization enables real-time inference on consumer hardware, facilitating adoption in under-resourced environments. The open-source release of code, weights, and evaluation protocols ensures reproducibility and encourages further research and clinical translation in AI-assisted radiology (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).