CXPMRG-Bench: X-ray Report Generation Benchmark
- CXPMRG-Bench is a comprehensive framework for X-ray medical report generation that combines pre-training and benchmarking using the large-scale CheXpert Plus dataset.
- It employs a novel MambaXray-VL model with a three-stage pre-training paradigm—self-supervised autoregressive generation, image–text contrastive learning, and supervised fine-tuning—to boost both language and clinical performance.
- The framework enables reproducible comparisons across 21 MRG models and 16 LLMs, setting a solid baseline for future research in radiology report generation.
CXPMRG-Bench is a comprehensive pre-training and benchmarking framework for X-ray medical report generation (MRG), established on the CheXpert Plus dataset. This resource aims to address the evaluation bottleneck in X-ray MRG by supplying both a rigorous benchmark of state-of-the-art models and a systematic pre-training approach that advances performance on natural language generation and clinical efficacy metrics. CXPMRG-Bench incorporates a newly proposed model, MambaXray-VL, and provides large-scale, reproducible comparison results across multiple existing models and large language architectures (Wang et al., 2024).
1. Dataset Composition and Benchmark Protocols
CXPMRG-Bench is grounded in the CheXpert Plus dataset, which comprises 223,228 chest X-ray images (both DICOM and PNG formats) paired with 187,711 radiology reports, parsed into 11 semantically defined sections such as "Findings" and "Impression." The dataset represents 64,725 patients, inclusive of 14 chest pathology labels and RadGraph entity/relation annotations. For the MRG task, the "Findings" section is utilized as ground truth, with splits defined as follows: 40,463 training cases, 5,780 validation cases, and 11,562 test cases.
A diverse suite of 21 open-source MRG models (spanning from R2GenRL and XProNet to recent architectures such as R2GenGPT and Token-Mixer) is systematically benchmarked. Additionally, 16 LLMs—including Vicuna-7B, Llama2-7B/13B, InternLM-7B, and Qwen-1.5/2-7B—are plugged into the R2GenGPT framework, complementing evaluations of vision-LLMs (VLMs) such as InternVL-2 and MiniCPM-V2.5.
Evaluation leverages both natural language generation (NLG) and clinical efficacy (CE) metrics. NLG metrics include BLEU-1 to BLEU-4, ROUGE-L, METEOR, and CIDEr, while CE is assessed by CheXpert label extraction with precision, recall, and F1 scoring.
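The CE protocol above can be made concrete with a small sketch. Labels are assumed to be binary pathology vectors extracted from generated and ground-truth reports (e.g. by the CheXpert labeler); the micro-averaging scheme shown here is an illustrative assumption, and `ce_scores` is a hypothetical helper name.

```python
def ce_scores(pred_labels, true_labels):
    """Micro-averaged precision/recall/F1 over all (report, pathology) pairs.

    pred_labels / true_labels: lists of equal-length binary label vectors,
    one vector per report (14 pathologies in the CheXpert scheme).
    """
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        for p, t in zip(pred, true):
            if p == 1 and t == 1:
                tp += 1          # pathology correctly reported
            elif p == 1 and t == 0:
                fp += 1          # pathology hallucinated
            elif p == 0 and t == 1:
                fn += 1          # pathology missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, two 3-label reports with one hallucinated finding yield precision 2/3, recall 1.0, and F1 0.8.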
2. Multi-Stage Pre-training Paradigm
The MambaXray-VL model employs a three-stage pre-training paradigm:
Stage 1: Self-Supervised Autoregressive Generation (ARG)
- Utilizes a Vision Mamba ("Vim") encoder composed of multiple Mamba state-space blocks with linear complexity.
- Each X-ray image is partitioned into fixed-size patches, each projected to a 1,024-dimensional token.
- The stack of Mamba blocks features normalization, state-space (SSM) and scan branches, SwiGLU gating, residual links, and MLP layers.
- The model predicts the next image patch token in an autoregressive manner, optimizing the mean-squared error between predicted and observed patches, $\mathcal{L}_{\mathrm{ARG}} = \frac{1}{N}\sum_{i=1}^{N}\lVert \hat{x}_i - x_i \rVert_2^2$, where $\hat{x}_i$ is the prediction for patch $x_i$.
- No augmentation or masking is used—training is on fully observed patches.
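The Stage-1 objective can be sketched minimally. Here `predict_next` is a hypothetical stand-in for the Vim encoder stack, implemented as a trivial linear map so the example is self-contained; only the next-patch MSE structure mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1024, 1024))  # stand-in for the model

def predict_next(token):
    # Hypothetical predictor: maps the current patch token to a guess
    # for the next one (the real model is a stack of Mamba blocks).
    return token @ W

def arg_mse_loss(patch_tokens):
    """Mean-squared error of next-patch predictions over a token sequence."""
    preds = np.stack([predict_next(t) for t in patch_tokens[:-1]])
    targets = np.stack(patch_tokens[1:])  # shift by one: predict patch i+1
    return float(np.mean((preds - targets) ** 2))

tokens = [rng.normal(size=1024) for _ in range(8)]  # 8 patch tokens
loss = arg_mse_loss(tokens)
```

Because every patch is fully observed, the loss is computed over the whole shifted sequence rather than over a masked subset.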
Stage 2: Image–Text Contrastive Learning (CTL)
- Transfers the Vim backbone and introduces a medical text encoder (Bio_ClinicalBERT or, for ablation, Llama2).
- Both vision and language encoders are mapped into a shared embedding space; cosine similarity between visual ($v_i$) and text ($t_i$) embeddings drives an InfoNCE loss, $\mathcal{L}_{\mathrm{CTL}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_i,t_j)/\tau)}$, with temperature $\tau$.
- Training uses ~480 K aligned pairs from MIMIC-CXR, CheXpert Plus, and IU-Xray.
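A minimal sketch of the contrastive objective, assuming L2-normalized embeddings so dot products equal cosine similarities; the temperature value 0.07 is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def info_nce(v, t, tau=0.07):
    """Image-to-text InfoNCE over a batch: matched pairs sit on the diagonal."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau                        # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log p(matched text | image)
```

Correctly paired batches score a much lower loss than mismatched ones, which is what drives the vision and text encoders into alignment.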
Stage 3: Supervised Fine-tuning (SFT)
- The decoder is a LLM (typically Llama2-7B), prompted to "generate a comprehensive and detailed diagnosis report for this chest X-ray image."
- The decoder receives the concatenated output of Vim token embeddings and prompt tokens.
- Training freezes the LLM parameters, optimizing only the vision encoder and a light mapping layer via the negative log-likelihood over target report tokens, $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T}\log p_\theta\!\left(y_t \mid y_{<t}, V\right)$, where $V$ denotes the visual token embeddings.
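The Stage-3 objective reduces to token-level negative log-likelihood. In this sketch the per-step logits are stand-ins for the frozen LLM's output when conditioned on the visual tokens and prompt; only the shapes and the loss computation mirror the setup described above.

```python
import numpy as np

def report_nll(logits, target_ids):
    """Sum of -log p(y_t | y_<t, V) over the target report tokens.

    logits: (T, vocab_size) decoder outputs, one row per report token.
    target_ids: (T,) integer ids of the ground-truth report tokens.
    """
    z = logits - logits.max(axis=1, keepdims=True)           # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return float(-log_probs[np.arange(len(target_ids)), target_ids].sum())
```

As a sanity check, uniform logits over a vocabulary of size $V$ give exactly $T\log V$ for a $T$-token report.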
3. Model Architecture and Hyperparameterization
MambaXray-VL is available in Base and Large variants, distinguished by the backbone depth and hidden dimensionality of the Vim encoder. Text decoders include Llama2 (7B or 13B), with Qwen-1.5 used for certain ablation studies.
Pre-training for Stage 1 (ARG) runs for 100 epochs at a batch size of 256 (Base) or 128 (Large), with a base learning rate scaled to the batch size, the AdamW optimizer, and a weight decay of 0.05. A cosine learning-rate schedule with 5 warm-up epochs is used. Stage 2 (CTL) follows for 50 epochs at batch size 192, with the same optimizer and input images resized to a fixed resolution; both the vision and text encoders are updated.
Fine-tuning uses 30 epochs with batch size 20 and a maximum report length of 60 tokens for IU-Xray (decoder: Qwen-1.5), and 6 epochs with batch size 18 and a maximum length of 100 tokens for MIMIC-CXR and CheXpert Plus (decoder: Llama2-7B; vision encoder frozen; validation every 0.5 epoch).
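For reference, the recipe above can be consolidated into a plain config structure. The keys and layout are illustrative, and values the text does not state (exact base learning rate, input resolution) are deliberately omitted rather than guessed.

```python
# Hyperparameters as stated in the text; structure is an illustrative sketch.
PRETRAIN_CONFIG = {
    "stage1_arg": {
        "epochs": 100,
        "batch_size": {"base": 256, "large": 128},
        "optimizer": "AdamW",
        "weight_decay": 0.05,
        "lr_schedule": "cosine",
        "warmup_epochs": 5,
    },
    "stage2_ctl": {
        "epochs": 50,
        "batch_size": 192,
        "optimizer": "AdamW",
        "train_text_encoder": True,
    },
}

FINETUNE_CONFIG = {
    "iu_xray": {
        "epochs": 30, "batch_size": 20,
        "max_report_len": 60, "decoder": "Qwen-1.5",
    },
    "mimic_cxr_and_chexpert_plus": {
        "epochs": 6, "batch_size": 18,
        "max_report_len": 100, "decoder": "Llama2-7B",
        "vision_frozen": True,
    },
}
```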
4. Quantitative and Qualitative Evaluation
Benchmarking on the CheXpert Plus test split demonstrates that MambaXray-VL-Large achieves state-of-the-art performance. On NLG metrics, it achieves BLEU-4 = 0.112 (second best 0.101), ROUGE-L = 0.276, METEOR = 0.157, CIDEr = 0.139. Clinical efficacy F1 reaches 0.335, exceeding strong baselines such as Token-Mixer (F1 = 0.288) and PromptMRG (F1 = 0.281). The Base variant (BLEU-4 = 0.105, F1 = 0.273) also outperforms most previous methods.
Among LLM plug-in variants, Vicuna-7B yields the best BLEU-4 among LLMs (0.104), while InternLM-7B achieves the highest clinical F1 (0.284). VLMs pretrained on natural images, such as InternVL-2 and MiniCPM-V2.5, produce lower scores, indicating a domain adaptation gap for radiology images.
Ablation analyses reveal that adding ARG pre-training increases BLEU-4 by approximately 25% on MIMIC-CXR and conveys a smaller, consistent improvement on CheXpert Plus. Adding CTL contributes about 5% gain in ROUGE-L. Vim-PTD initialization (ARG pre-training on X-rays) results in superior downstream performance compared to ImageNet-1K initialization. Larger backbone configurations consistently provide higher downstream metrics across evaluation regimes. Furthermore, Bio_ClinicalBERT outperforms Llama2 as the text encoder during the contrastive stage, likely due to its biomedical pre-training.
Qualitative case studies demonstrate that MambaXray-VL matches more ground-truth sentences than R2GenGPT, particularly for images featuring medical devices (pacemaker leads, sternotomy wires) and for subtle findings such as “flattening of the diaphragms” or “no pleural effusion or pneumothorax.”
5. Comparative Model Landscape
The following table summarizes the core groups of methods benchmarked with CXPMRG-Bench on CheXpert Plus:
| Category | Examples | Key Features |
|---|---|---|
| X-ray MRG | R2GenRL, Token-Mixer, PromptMRG, etc. | Vision-only and vision-LLMs, various backbones |
| LLM Plug-ins | Vicuna-7B, Qwen-1.5, Llama-2/3, etc. | Plugged into R2GenGPT framework for report generation |
| VLMs | InternVL-2, MiniCPM-V2.5 | Pretrained on natural images, applied without X-ray-specific tuning |
A plausible implication is that the domain gap for VLMs pretrained on natural-image corpora remains a limiting factor in radiology-specific tasks.
6. Significance and Outlook
CXPMRG-Bench provides the first large-scale, reproducible comparative evaluation platform for X-ray MRG algorithms on CheXpert Plus. By including both established models and modern LLM architectures within a unified protocol, it establishes a solid baseline for future algorithmic advances in the field. MambaXray-VL, with its lightweight Mamba-based backbone and multi-stage pre-training (ARG → CTL → SFT), advances the state of the art in both natural language and clinical outcome metrics. The framework enables more rigorous benchmarking for future research, clarifies the impact of domain-specific pre-training over generic vision models or LLMs, and offers open-source resources for follow-on studies (Wang et al., 2024).