NCT07117266 Trial: Janus-Pro-CXR Evaluation

Updated 27 December 2025
  • The paper demonstrates a significant enhancement in report quality, with a 0.25-point improvement on a five-point scale using AI-assisted chest X-ray interpretation.
  • The study employs a rigorous randomized design across three tertiary hospitals, comparing AI-assisted reporting with standard care through both subjective and quantitative metrics.
  • The trial highlights clinical impact by reducing reading times by 18.3% and demonstrating performance superior to ChatGPT 4o at a ≈200× lower parameter count.

A multicenter prospective trial (NCT07117266) validates Janus-Pro-CXR, a 1-billion-parameter chest X-ray (CXR)-specific multimodal LLM, for automated chest radiograph interpretation. The trial, conducted across three tertiary hospitals in China, assesses the real-world utility, diagnostic accuracy, and workflow impact of AI assistance in radiology, addressing the global challenges posed by radiologist shortages and rising CXR workload. The protocol incorporates randomized assignment, standardized reporting, and direct comparison against state-of-the-art models, including ChatGPT 4o, supporting robust statistical inference on efficacy and efficiency endpoints (Bai et al., 23 Dec 2025).

1. Study Design and Patient Cohorts

The trial’s prospective arm included 296 adult patients requiring CXR-assisted clinical diagnosis. Enrollment occurred at three major tertiary hospitals in China: The First Affiliated Hospital of Zhengzhou University, The First Affiliated Hospital of Henan University of Science and Technology, and Union Hospital, Tongji Medical College. Inclusion criteria mandated suspected thoracic disease necessitating a CXR, informed consent, complete clinical records, and posteroanterior (PA) imaging only. Exclusion criteria comprised substandard image quality and pregnancy or lactation.

Retrospective data underpinned development and fine-tuning: MIMIC-CXR (227,835 images) and CheXpert Plus (222,103 images), alongside the Chinese CXR-27 multicenter cohort (12,396 images, post–quality control). For the latter, 11,156 images informed final fine-tuning; 1,240 images were reserved for independent retrospective evaluation.

Twenty junior radiologists (1–3 years’ experience) were randomized (1:1, by computerized sealed envelopes) to AI-assisted or standard-care arms. Both generated independent reports per patient, subsequently adjudicated by senior radiologists. Report creation time was recorded via automated timestamping.

2. Endpoints and Statistical Hypotheses

Primary endpoints in the prospective phase encompassed:

  • Report Quality Score on a five-point Likert scale, quantifying completeness, clarity, and clinical relevance.
  • Reader Preference: fraction of cases favored by at least 3 of 5 blinded expert raters in head-to-head comparisons.
  • Agreement Score (RADPEER): a five-point system (1 = highly discrepant error, 5 = no discrepancy).
  • Reading Time per case, in seconds.

Secondary endpoints (retrospective) replicated subjective metrics in 300 test cases from CXR-27, comparing model-generated and original clinical reports.

Null hypotheses posited equivalence between the AI-assisted and standard-care groups across metrics, tested at a two-sided significance level of $\alpha = 0.05$.
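
As a minimal illustration of this hypothesis-testing framework (not the trial's analysis code), the sketch below applies a two-sided paired t-test to per-case report quality scores; the library choice, variable names, and toy simulated data are assumptions.

```python
# Minimal sketch: two-sided paired t-test on report quality scores at alpha = 0.05.
# Toy simulated data and variable names are illustrative, not trial data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
standard_care = rng.normal(4.12, 0.80, size=296)                 # per-case scores, standard-care arm
ai_assisted = standard_care + rng.normal(0.25, 0.30, size=296)   # paired scores, AI-assisted arm

t_stat, p_value = stats.ttest_rel(ai_assisted, standard_care)    # paired comparison
alpha = 0.05
print(f"t = {t_stat:.2f}, P = {p_value:.3g}, reject H0 (no difference): {p_value < alpha}")
```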

3. Model Architecture, Training Paradigm, and Comparators

Janus-Pro-CXR builds on DeepSeek's Janus-Pro unified multimodal LLM, with approximately $1 \times 10^9$ parameters. Core architectural components (a schematic forward-pass sketch follows the list):

  • SigLIP Vision Encoder: Extracts visual tokens from DICOM/PNG CXR inputs.
  • Vision Aligner: Cross-modal adapter mapping image features to the LLM’s embedding space.
  • Expert Classification Branch: Injects predicted binary labels (six critical findings) into text generation via learned prompts.
  • Autoregressive Decoder: Produces structured radiology reports.
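
A schematic PyTorch-style sketch of how these four components could compose at inference time is shown below; module names, dimensions, and the prompt-injection mechanism are assumptions inferred from the description above, not the released implementation.

```python
# Schematic inference-time sketch of the described pipeline (assumed names and dimensions;
# not the released Janus-Pro-CXR code).
import torch
import torch.nn as nn

class JanusProCXRSketch(nn.Module):
    def __init__(self, vision_encoder, llm_decoder, d_vision=1024, d_model=2048, n_findings=6):
        super().__init__()
        self.vision_encoder = vision_encoder             # SigLIP-style encoder: image -> visual tokens
        self.aligner = nn.Linear(d_vision, d_model)      # cross-modal adapter into the LLM embedding space
        self.cls_head = nn.Linear(d_vision, n_findings)  # expert branch: logits for six critical findings
        self.finding_prompts = nn.Embedding(n_findings, d_model)  # learned prompts injected per finding
        self.llm_decoder = llm_decoder                   # autoregressive decoder producing the report

    def forward(self, pixel_values, text_embeds):
        vis_tokens = self.vision_encoder(pixel_values)            # (B, N, d_vision)
        aligned = self.aligner(vis_tokens)                        # (B, N, d_model)
        finding_logits = self.cls_head(vis_tokens.mean(dim=1))    # (B, n_findings)
        gates = (finding_logits.sigmoid() > 0.5).float().unsqueeze(-1)   # hard gate (inference only)
        prompts = gates * self.finding_prompts.weight             # (B, n_findings, d_model)
        inputs = torch.cat([aligned, prompts, text_embeds], dim=1)
        return self.llm_decoder(inputs_embeds=inputs), finding_logits
```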

Training progressed through three supervised fine-tuning (SFT) stages: MIMIC-CXR (162,105 images after quality control), CheXpert Plus (222,103), and CXR-27 (11,156), effecting domain adaptation to the local clinical context. Standard loss functions included cross-entropy for autoregressive text and optional binary cross-entropy for classification outputs. Inference latency was 1–2 s per case on RTX 4060 hardware (8 GB VRAM).
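
A minimal sketch of the described objective, assuming a weighted sum of the autoregressive cross-entropy and the optional binary cross-entropy on the six finding labels (the weight `lambda_cls` and the padding convention are assumptions):

```python
# Hedged sketch of the combined SFT objective: token-level cross-entropy plus
# optional BCE on the six critical-finding labels. lambda_cls is an assumed weight.
import torch
import torch.nn.functional as F

def sft_loss(report_logits, report_tokens, finding_logits=None, finding_labels=None,
             lambda_cls=0.1, pad_token_id=0):
    # Autoregressive cross-entropy over report tokens, ignoring padding positions.
    lm_loss = F.cross_entropy(
        report_logits.reshape(-1, report_logits.size(-1)),
        report_tokens.reshape(-1),
        ignore_index=pad_token_id,
    )
    if finding_logits is None or finding_labels is None:
        return lm_loss
    # Optional binary cross-entropy on the six-finding classification branch.
    cls_loss = F.binary_cross_entropy_with_logits(finding_logits, finding_labels.float())
    return lm_loss + lambda_cls * cls_loss
```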

Comparators included:

  • Janus-Pro: The base model lacking CXR-specific fine-tuning.
  • ChatGPT 4o: OpenAI’s generalist 200B-parameter multimodal LLM.

4. Evaluation Metrics and Statistical Analysis

Key classification metrics for six critical findings (support devices, pleural effusion, pneumothorax, atelectasis, consolidation, cardiomegaly), with a computation sketch following the list:

  • Sensitivity: $\text{Sensitivity} = \frac{\text{TP}}{\text{TP}+\text{FN}}$
  • Specificity: $\text{Specificity} = \frac{\text{TN}}{\text{TN}+\text{FP}}$
  • Accuracy: $\text{Accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$
  • AUC: $\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR})\,d(\text{FPR})$
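
As a minimal illustration (toy labels and assumed variable names), these quantities could be computed from binary predictions and probability scores with scikit-learn:

```python
# Hedged sketch: confusion-matrix metrics and AUC for a single finding.
# y_true / y_score are toy placeholders, not trial data.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # ground truth for one finding
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])      # model probabilities
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_true, y_score)                               # area under the ROC curve
print(f"Se={sensitivity:.2f} Sp={specificity:.2f} Acc={accuracy:.2f} AUC={auc:.2f}")
```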

Report generation/NLG efficacy was quantified using micro and macro F1 over 14 and 5 findings (F1-14, F1-5) with open-source labelers (CheXbert, DeepSeek), and RadGraph F1 for entity and relation correctness. Subjective metrics included Likert-scaled report quality, RADPEER agreement, and expert preference rates.
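
A brief sketch of how the micro- and macro-averaged F1 scores could be computed over labeler-extracted multi-label findings (the label matrices below are toy placeholders standing in for CheXbert/DeepSeek outputs):

```python
# Hedged sketch: micro/macro F1 over multi-label finding annotations.
# Rows = reports, columns = findings (14 for F1-14, 5 for F1-5); toy values only.
import numpy as np
from sklearn.metrics import f1_score

ref_labels = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 1]])  # labeler output, reference reports
gen_labels = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1]])  # labeler output, generated reports

micro_f1 = f1_score(ref_labels, gen_labels, average="micro")
macro_f1 = f1_score(ref_labels, gen_labels, average="macro")
print(f"micro-F1 = {micro_f1:.3f}, macro-F1 = {macro_f1:.3f}")
```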

Statistical reporting followed standard conventions: all means with standard deviations, 95% CIs via $\bar{x} \pm t_{0.975}\,\mathrm{SE}$, and $P$-values from paired t-tests or repeated-measures ANOVA where appropriate. Additional agreement measures included Cohen's $\kappa$ (model–reader) and Kendall's $W$ (inter-rater).
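
A compact sketch of these conventions, computing a t-based 95% CI for the mean paired difference, Cohen's $\kappa$, and Kendall's $W$ from its rank-sum definition (all inputs are toy placeholders):

```python
# Hedged sketch: 95% CI for a mean paired difference, Cohen's kappa, Kendall's W.
# All inputs are toy placeholders, not trial data.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

diffs = np.array([0.3, 0.2, 0.1, 0.4, 0.25, 0.2])                 # paired differences (AI - standard)
se = diffs.std(ddof=1) / np.sqrt(len(diffs))
t_crit = stats.t.ppf(0.975, df=len(diffs) - 1)
ci = (diffs.mean() - t_crit * se, diffs.mean() + t_crit * se)

# Cohen's kappa: model labels vs. a reader's labels for one finding.
kappa = cohen_kappa_score([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0])

# Kendall's W: concordance of ranks across raters (rows = raters, columns = cases).
ranks = np.array([[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]], dtype=float)
m, n = ranks.shape
s = ((ranks.sum(axis=0) - ranks.sum() / n) ** 2).sum()
kendall_w = 12 * s / (m ** 2 * (n ** 3 - n))

print(f"95% CI [{ci[0]:.3f}, {ci[1]:.3f}], kappa = {kappa:.2f}, W = {kendall_w:.2f}")
```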

5. Principal Outcomes and Diagnostic Performance

In prospective deployment (n=296), AI-assisted reporting significantly outperformed standard care:

| Metric | Standard Care | AI-Assisted | Mean Difference (95% CI) | P-value |
|---|---|---|---|---|
| Report Quality Score | 4.12 ± 0.80 | 4.36 ± 0.50 | 0.25 [0.216, 0.283] | <0.001 |
| RADPEER Score | 4.14 ± 0.84 | 4.30 ± 0.57 | 0.16 [0.119, 0.200] | <0.001 |
| Reading Time (s) | 147.6 ± 51.1 | 120.6 ± 45.6 | –27.0 [–34.8, –19.2] | <0.001 |
| Expert Preference | — | 54.3% (≥3/5) | — | — |

Retrospective assessment (CXR-27, n=300) revealed similar superiority:

  • Janus-Pro-CXR: Quality = 3.22 ± 1.14, RADPEER = 3.10 ± 1.05, Preference = 15.3%
  • Janus-Pro: Quality = 1.57 ± 0.63, RADPEER = 1.66 ± 0.60, Preference = 2.8%
  • ChatGPT 4o: Quality = 1.70 ± 0.76, RADPEER = 1.74 ± 0.75, Preference = 5.2%

Automated metrics (MIMIC-CXR):

  • Micro-F1 (top-5 findings): Janus-Pro-CXR, 63.4; Macro-F1, 55.1; RadGraph F1, 25.8 (all highest among the compared models).

CXR-27 (test set):

  • Macro-F1 (14 findings): Janus-Pro-CXR, 42.3 (second-best); RadGraph F1, 58.6 (highest).

Detection of six critical findings yielded uniformly strong AUCs (support devices and pleural effusion, AUC = 0.931), with F1 ranging from 0.278 (consolidation) to 0.727 (support devices).

6. Workflow, Implementation, and Clinical Impact

AI assistance reduced junior radiologist report times by 18.3% (27.0 s/case), with potential aggregate daily savings of ~90 minutes per radiologist (assuming 200 cases/day). Structured AI impressions supported a higher pneumonia diagnosis rate in the AI-assisted arm (52.4% vs 36.1%, $P < 0.001$), indicating tangible augmentation of diagnostic confidence.
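
Under the stated assumption of 200 cases per radiologist per day, the aggregate saving follows directly from the per-case reduction:

$$27.0\ \text{s/case} \times 200\ \text{cases/day} = 5{,}400\ \text{s/day} = 90\ \text{min/day}.$$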

Janus-Pro-CXR's lightweight design (1 B parameters) operates on commodity GPUs (RTX 4060 8 GB) without cloud inference requirements, facilitating clinical deployment in resource-constrained environments. Open-source code and weights further support translation to low- and middle-income regions.

Performance comparisons underscore the model's domain specificity: Janus-Pro-CXR surpassed the generalist multimodal LLM ChatGPT 4o (200B parameters) in both generation and diagnostic accuracy at a ≈200× lower parameter count, and with full on-site operability.

7. Key Findings, Limitations, and Future Directions

Janus-Pro-CXR, validated in multicenter prospective evaluation, achieved:

  • State-of-the-art CXR report generation and key finding detection (AUC > 0.8 for all primary targets).
  • Statistically significant enhancements to report quality and workflow efficiency when deployed as an assistive tool.
  • Resource efficiency and open-sourcing, enabling broad and equitable adoption.

Limitations include suboptimal recognition of rare/subtle findings (e.g., pulmonary edema), lack of longitudinal multiseries validation, reference standard dependence on published reports, and the absence of a fully interactive radiologist–AI conversational interface. Currently, the model functions in an auxiliary capacity rather than as a stand-alone reporter; further studies may address real-time bidirectional AI–radiologist exchanges and integration of longitudinal imaging.

Trial NCT07117266 provides robust evidence for domain-specific, open-source LLMs in scalable, effective, and efficient CXR interpretation, particularly in under-resourced clinical contexts (Bai et al., 23 Dec 2025).
