
Janus-Pro-CXR: Multimodal LLM in Radiology

Updated 27 December 2025
  • The paper presents Janus-Pro-CXR, an open-source 1B parameter multimodal LLM that automates chest X-ray interpretation and report generation with state-of-the-art accuracy.
  • It employs a unified vision-language Transformer with cross-modal fusion and an expert analysis module, achieving superior performance on datasets like MIMIC-CXR, CheXpert Plus, and CXR-27.
  • Clinically validated through multicenter trials, Janus-Pro-CXR enhances report quality, reduces reading time, and supports efficient deployment in both high- and low-resource environments.

Janus-Pro-CXR is an open-source, lightweight, 1-billion-parameter multimodal LLM framework specifically designed for the automated interpretation of chest radiographs (CXR). Built upon the DeepSeek Janus-Pro foundation, Janus-Pro-CXR integrates advanced vision and language components, robust domain-specific optimization, and full clinical validation, achieving state-of-the-art performance in both retrospective and prospective multicenter studies for CXR report generation and critical findings detection. The system is engineered for efficient deployment in high- and low-resource settings, with architecture, weights, and implementation framework released under a permissive license (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).

1. Architectural Overview

Janus-Pro-CXR utilizes a unified vision-language Transformer architecture, with tight coupling across visual and linguistic modalities to enable end-to-end CXR interpretation and report generation. The model comprises the following principal modules:

  • Vision Encoder: A SigLIP-based Vision Transformer (ViT) with a patch size of 16×16, 12 Transformer blocks, and a hidden dimension of 768. Input CXRs are patchified, linearly projected, and combined with positional encodings.
  • Language Decoder: A 24-layer Transformer decoder with hidden dimension 1024 and 16 attention heads, performing autoregressive report generation.
  • Cross-modal Fusion ("Aligner"): Consisting of a linear projection and cross-modal self-attention layers, this module integrates visual token embeddings into the language space.
  • Expert Analysis Module: Binary classifiers consuming pooled vision features for classification of six clinically significant findings.

The vision encoder outputs a visual token matrix $V \in \mathbb{R}^{N \times d}$, which is projected to $H \in \mathbb{R}^{M \times d}$ for fusion with the language stream. The decoder then generates text autoregressively. The following equations characterize the key operations (a code sketch follows the list):

  • Patch Embedding: $x_p = W_{\mathrm{patch}}\,\mathrm{patch}(I) + p_{\mathrm{pos}}, \quad x_p \in \mathbb{R}^{N \times d}$
  • Self-attention (single head): $\mathrm{head} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
  • Multi-head attention: $\mathrm{MHA}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W_O$
  • Cross-modal Fusion: $Q = E_{\mathrm{tok}} W^Q,\; K = H W^K,\; V = H W^V$, followed by standard MHA.
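
To make the composition concrete, the following PyTorch sketch wires together a projection-plus-cross-attention aligner and the six-finding expert heads around dummy tensors. The module names, the residual connection, the use of nn.MultiheadAttention, and the mean pooling are illustrative assumptions based on the description above, not the released implementation.

```python
# Minimal sketch of the Janus-Pro-CXR module composition (illustrative only).
import torch
import torch.nn as nn

class Aligner(nn.Module):
    """Projects visual tokens into language space, then lets text tokens
    attend to them: Q from text embeddings, K/V from projected vision tokens."""
    def __init__(self, d_vision=768, d_lang=1024, n_heads=16):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_lang)            # V (N x 768) -> H (M x 1024)
        self.xattn = nn.MultiheadAttention(d_lang, n_heads, batch_first=True)

    def forward(self, text_emb, vision_tokens):
        h = self.proj(vision_tokens)
        fused, _ = self.xattn(query=text_emb, key=h, value=h)
        return text_emb + fused                            # residual fusion (assumed)

class ExpertHeads(nn.Module):
    """Binary classifiers over pooled vision features for six critical findings."""
    def __init__(self, d_vision=768, n_findings=6):
        super().__init__()
        self.heads = nn.Linear(d_vision, n_findings)

    def forward(self, vision_tokens):
        pooled = vision_tokens.mean(dim=1)                 # mean pooling (assumed)
        return torch.sigmoid(self.heads(pooled))           # per-finding probabilities

# Shape check: a 224x224 image at patch size 16 yields 14 x 14 = 196 tokens.
vision_tokens = torch.randn(2, 196, 768)
text_emb = torch.randn(2, 50, 1024)
fused = Aligner()(text_emb, vision_tokens)                 # (2, 50, 1024)
probs = ExpertHeads()(vision_tokens)                       # (2, 6)
print(fused.shape, probs.shape)
```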

Figure Description: The architecture processes input CXRs via patchification, embedding, and vision encoding. Pooled visual features inform both critical finding classifiers and, through the aligner, are fused with the language decoder for contextualized report generation.

2. Training Protocol and Datasets

Janus-Pro-CXR employs a two-stage fine-tuning strategy leveraging diverse large-scale and multicenter datasets:

  • Stage 1:
    • MIMIC-CXR: 167,496 images (165,131 train, 2,365 test)
    • CheXpert Plus: 222,103 images (no held-out test)
  • Stage 2:
    • CXR-27: chest radiographs from 27 Chinese hospitals (11,156 train, 1,240 test)

Training involves rescaling images to 224×224, min–max normalization, and augmentation (p=0.5 horizontal flip, ±5° rotation). Byte-Pair Encoding is used for tokenization (50k vocab). Auto-annotation for 14 findings is produced by CheXbert (MIMIC) and DeepSeek labeler (CXR-27), treating uncertain labels as positive during training.
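
A sketch of this preprocessing using torchvision; the transform order and the per-image min–max implementation are assumptions, while the target size, flip probability, and rotation range come from the text above.

```python
# Preprocessing sketch: resize to 224x224, flip with p=0.5, rotate within
# +/-5 degrees, then min-max normalize to [0, 1].
import torch
from torchvision import transforms

def min_max_normalize(x: torch.Tensor) -> torch.Tensor:
    """Scales a tensor image to [0, 1] per image (epsilon avoids divide-by-zero)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-8)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=5),   # samples uniformly in [-5, +5] degrees
    transforms.ToTensor(),
    transforms.Lambda(min_max_normalize),
])
# Usage: train_transform(pil_image) -> tensor of shape (1, 224, 224) for grayscale CXRs.
```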

The combined loss function is

$$L = L_{\mathrm{lang}} + \lambda\, L_{\mathrm{cls}}$$

with

$$L_{\mathrm{lang}} = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t}, I\right)$$

$$L_{\mathrm{cls}} = -\sum_{i=1}^{6} \left[\, w_i\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$

where $w_i = 1/\mathrm{freq}_i$ and $\lambda = 0.3$. The AdamW optimizer is applied ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), with batch size 64 and learning-rate warmup (5%), followed by linear decay.
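
A sketch of this objective and optimizer setup, assuming probabilities (not logits) from the expert heads and a per-finding frequency vector; the helper names are illustrative, and the base learning rate is not stated above, so the value below is a placeholder.

```python
# Combined objective L = L_lang + lambda * L_cls, with w_i = 1 / freq_i
# weighting only the positive term, as in the formula above.
import torch
import torch.nn.functional as F

LAMBDA = 0.3

def combined_loss(lm_logits, target_ids, cls_probs, cls_labels, finding_freq):
    # Language loss: token-level cross-entropy over the report sequence.
    l_lang = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten())
    # Classification loss: weighted binary cross-entropy over six findings.
    w = 1.0 / finding_freq                                  # (6,)
    eps = 1e-8
    l_cls = -(w * cls_labels * torch.log(cls_probs + eps)
              + (1 - cls_labels) * torch.log(1 - cls_probs + eps)).sum(dim=1).mean()
    return l_lang + LAMBDA * l_cls

# Smoke test with dummy shapes: batch 2, 10 tokens, 50k vocab, 6 findings.
lm_logits = torch.randn(2, 10, 50000)
target_ids = torch.randint(0, 50000, (2, 10))
cls_probs = torch.rand(2, 6)
cls_labels = torch.randint(0, 2, (2, 6)).float()
freq = torch.tensor([0.10, 0.20, 0.05, 0.30, 0.15, 0.10])
print(combined_loss(lm_logits, target_ids, cls_probs, cls_labels, freq))

# AdamW with the stated hyperparameters; lr is a placeholder (not given above).
model = torch.nn.Linear(8, 8)  # stand-in for the full model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8)
```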

Retrospective and prospective clinical data encompassed 296 patients (trial NCT07117266), each case receiving two independent interpretations by junior radiologists.

3. Benchmarking and Performance Evaluation

Janus-Pro-CXR demonstrates state-of-the-art results across established chest X-ray tasks, outperforming both its base model and ChatGPT 4o (estimated at roughly 200B parameters) in all reported dimensions.

Automated Report Generation:

MIMIC-CXR test set (n = 2,365):

| Metric | Janus-Pro-CXR | Janus-Pro | ChatGPT 4o |
| --- | --- | --- | --- |
| BLEU-1 | 0.312 | 0.221 | 0.236 |
| ROUGE-L | 0.378 | 0.291 | 0.307 |
| Micro-avg F1-5 | 63.4 | 49.1 | 51.6 |
| Macro-avg F1-5 | 55.1 | 38.7 | 40.3 |
| RadGraph F1 | 25.8 | 18.4 | 20.1 |

CXR-27 test set (n = 1,240):

| Metric | Janus-Pro-CXR | Janus-Pro | ChatGPT 4o |
| --- | --- | --- | --- |
| BLEU-1 | 0.344 | 0.183 | 0.196 |
| ROUGE-L | 0.410 | 0.237 | 0.254 |
| Micro-avg F1-5 | 71.2 | 32.8 | 34.5 |
| Macro-avg F1-5 | 63.7 | 28.5 | 30.1 |
| RadGraph F1 | 58.6 | 21.9 | 24.3 |

Detection of Six Clinically Critical Findings (CXR-27; n=1,026):

| Finding | AUC | Sensitivity | Specificity |
| --- | --- | --- | --- |
| Support Devices | 0.931 | 0.727 | 0.880 |
| Pleural Effusion | 0.931 | 0.667 | 0.897 |
| Pneumothorax | 0.921 | 0.526 | 0.912 |
| Atelectasis | 0.902 | 0.526 | 0.905 |
| Cardiomegaly | 0.800 | 0.474 | 0.864 |
| Consolidation | 0.888 | 0.278 | 0.957 |

Metrics are computed as $\mathrm{AUC} = \int_0^1 \mathrm{TPR}\!\left(\mathrm{FPR}^{-1}(u)\right) du$, $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, and $\mathrm{Specificity} = \frac{TN}{TN + FP}$. These results confirm that the model consistently exceeds clinically relevant performance thresholds across both lexical report generation and structured radiological finding detection (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).
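
These quantities can be reproduced from per-finding labels and scores; below is a hedged sketch using scikit-learn, where the 0.5 decision threshold is an assumption (the paper's operating points are not stated here).

```python
# Per-finding AUC, sensitivity, and specificity from labels and model scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def finding_metrics(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return {
        "auc": roc_auc_score(y_true, y_score),  # area under the ROC curve
        "sensitivity": tp / (tp + fn),          # TP / (TP + FN)
        "specificity": tn / (tn + fp),          # TN / (TN + FP)
    }

print(finding_metrics([0, 1, 1, 0, 1], [0.2, 0.8, 0.4, 0.1, 0.9]))
```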

4. Prospective Clinical Trial Outcomes

A multicenter, randomized prospective trial (NCT07117266) evaluated the impact of Janus-Pro-CXR on clinical CXR workflows:

  • Design: 20 junior radiologists randomized 1:1 to AI-assisted vs. standard care across 296 patient cases at three centers, with independent review by senior radiologists.
  • Outcomes:
| Outcome | Standard Care | AI-Assisted | Mean Diff (95% CI) | P |
| --- | --- | --- | --- | --- |
| Report quality (μ ± σ) | 4.12 ± 0.80 | 4.36 ± 0.50 | +0.25 (0.216–0.283) | <0.001 |
| RADPEER agreement | 4.14 ± 0.84 | 4.30 ± 0.57 | +0.16 (0.119–0.200) | <0.001 |
| Reading time (s) | 147.6 ± 51.1 | 120.6 ± 45.6 | −27.0 (−34.8 to −19.2) | <0.001 |
| Expert preference for AI draft (≥3/5) | – | 54.3% | – | <0.001 |

AI assistance reduced average interpretation time by 18.3% and was preferred by experts in 54.3% of cases. All improvements were statistically significant ($\alpha = 0.05$). This demonstrates both objective and subjective benefits of Janus-Pro-CXR integration into clinical radiology workflows (Bai et al., 23 Dec 2025).

5. Inference Pipeline and Clinical Integration

Janus-Pro-CXR is optimized for real-world deployment with emphasis on low-latency and minimal computational footprint:

  • Architecture: 1B parameters, 1–2 s inference latency on consumer GPUs (RTX 4060, 8 GB); quantization to 8-bit enables CPU inference with <5 GB memory use.
  • Pipelined Inference (a client-side sketch follows this list):
  1. DICOM to PNG conversion and preprocessing on local workstation.
  2. Image and basic patient data transmitted over LAN to Janus-Pro-CXR server.
  3. Model outputs draft report (∼3 s total latency).
  4. Radiologist inserts draft into RIS/PACS, edits as needed, signs off.
  • Batching: Disabled by default for latency; pre-fetching and asynchronous I/O maximize throughput.
  • Software Stack: PyTorch 2.0, Transformers 4.x, OpenCV, DICOM-RTK, FastAPI.
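
A minimal client-side sketch of steps 1–3, assuming a hypothetical REST endpoint (/report), host name, and JSON response field exposed by the FastAPI server; only the overall flow (DICOM to PNG locally, then a LAN request returning a draft report) follows the text above.

```python
# Convert a DICOM study to PNG locally, then POST it to the Janus-Pro-CXR server.
import io
import numpy as np
import pydicom
import requests
from PIL import Image

def dicom_to_png_bytes(path: str) -> bytes:
    ds = pydicom.dcmread(path)
    pixels = ds.pixel_array.astype(np.float32)
    # Min-max scale to 8-bit for PNG export, matching the training preprocessing.
    pixels = (pixels - pixels.min()) / (pixels.max() - pixels.min() + 1e-8)
    img = Image.fromarray((pixels * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

png = dicom_to_png_bytes("study.dcm")                         # step 1: local preprocessing
resp = requests.post("http://cxr-server.local:8000/report",   # step 2: hypothetical endpoint
                     files={"image": ("study.png", png, "image/png")},
                     timeout=10)
print(resp.json()["report"])                                  # step 3: draft report for RIS/PACS
```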

Open-Source Implementation: The entire framework—vision encoder, language decoder, fusion modules, preprocessing tools, training and inference pipelines—is available at https://github.com/ZrH42/Janus-Pro-CXR under the Apache 2.0 license. Deployment requires standard Python environments and supports both CLI and RESTful interfaces for HIS/RIS integration (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).

6. Impact and Reproducibility

Janus-Pro-CXR bridges the gap between academic foundation models and clinical deployment of AI for chest radiograph interpretation. Rigorous multicentric validation demonstrates improved report quality, diagnostic reliability, and workflow efficiency in actual hospital settings, with substantive reduction in reporting time and expert-verified case preference. Its lightweight parameterization enables real-time inference on consumer hardware, facilitating adoption in under-resourced environments. The open-source release of code, weights, and evaluation protocols ensures reproducibility and encourages further research and clinical translation in AI-assisted radiology (Bai et al., 23 Dec 2025, Bai et al., 31 May 2025).
