CXPMRG-Bench: X-ray Report Generation Benchmark
- CXPMRG-Bench is a comprehensive framework for X-ray medical report generation that combines pre-training and benchmarking using the large-scale CheXpert Plus dataset.
- It employs a novel MambaXray-VL model with a three-stage pre-training paradigm—self-supervised autoregressive generation, image–text contrastive learning, and supervised fine-tuning—to boost both language and clinical performance.
- The framework enables reproducible comparisons across 21 MRG models and 16 LLMs, setting a solid baseline for future research in radiology report generation.
CXPMRG-Bench is a comprehensive pre-training and benchmarking framework for X-ray medical report generation (MRG), established on the CheXpert Plus dataset. This resource aims to address the evaluation bottleneck in X-ray MRG by supplying both a rigorous benchmark of state-of-the-art models and a systematic pre-training approach that advances performance on natural language generation and clinical efficacy metrics. CXPMRG-Bench incorporates a newly proposed model, MambaXray-VL, and provides large-scale, reproducible comparison results across multiple existing models and large language architectures (Wang et al., 2024).
1. Dataset Composition and Benchmark Protocols
CXPMRG-Bench is grounded in the CheXpert Plus dataset, which comprises 223,228 chest X-ray images (both DICOM and PNG formats) paired with 187,711 radiology reports, parsed into 11 semantically defined sections such as "Findings" and "Impression." The dataset represents 64,725 patients, inclusive of 14 chest pathology labels and RadGraph entity/relation annotations. For the MRG task, the "Findings" section is utilized as ground truth, with splits defined as follows: 40,463 training cases, 5,780 validation cases, and 11,562 test cases.
A diverse suite of 21 open-source MRG models (spanning from R2GenRL and XProNet to recent architectures such as R2GenGPT and Token-Mixer) is systematically benchmarked. Additionally, 16 LLMs—including Vicuna-7B, Llama2-7B/13B, InternLM-7B, and Qwen-1.5/2-7B—are plugged into the R2GenGPT framework, complementing evaluations of vision-LLMs (VLMs) such as InternVL-2 and MiniCPM-V2.5.
Evaluation leverages both natural language generation (NLG) and clinical efficacy (CE) metrics. NLG metrics include BLEU-1 to BLEU-4, ROUGE-L, METEOR, and CIDEr, while CE is assessed by CheXpert label extraction with precision, recall, and F1 scoring.
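The CE protocol above can be made concrete with a small sketch. Labels are assumed to be binary pathology vectors extracted from generated and ground-truth reports (e.g. by the CheXpert labeler); the micro-averaging scheme shown here is an illustrative assumption, and `ce_scores` is a hypothetical helper name.

```python
def ce_scores(pred_labels, true_labels):
    """Micro-averaged precision/recall/F1 over all (report, pathology) pairs.

    pred_labels / true_labels: lists of equal-length binary label vectors,
    one vector per report (14 pathologies in the CheXpert scheme).
    """
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        for p, t in zip(pred, true):
            if p == 1 and t == 1:
                tp += 1          # pathology correctly reported
            elif p == 1 and t == 0:
                fp += 1          # pathology hallucinated
            elif p == 0 and t == 1:
                fn += 1          # pathology missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, two 3-label reports with one hallucinated finding yield precision 2/3, recall 1.0, and F1 0.8.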
2. Multi-Stage Pre-training Paradigm
The MambaXray-VL model employs a three-stage pre-training paradigm:
Stage 1: Self-Supervised Autoregressive Generation (ARG)
- Utilizes a Vision Mamba ("Vim") encoder composed of multiple Mamba state-space blocks with linear complexity.
- Each X-ray image is partitioned into fixed-size patches, each projected to a 1,024-dimensional token.
- The stack of Mamba blocks features normalization, state-space (SSM) and scan branches, SwiGLU gating, residual links, and MLP layers.
- The model predicts the next image patch token in an autoregressive manner, optimizing the mean-squared error between predicted and observed patches, $\mathcal{L}_{\mathrm{ARG}} = \frac{1}{N}\sum_{i=1}^{N}\lVert \hat{x}_i - x_i \rVert_2^2$, where $\hat{x}_i$ is the prediction for patch $x_i$.
- No augmentation or masking is used—training is on fully observed patches.
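The Stage-1 objective can be sketched minimally. Here `predict_next` is a hypothetical stand-in for the Vim encoder stack, implemented as a trivial linear map so the example is self-contained; only the next-patch MSE structure mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1024, 1024))  # stand-in for the model

def predict_next(token):
    # Hypothetical predictor: maps the current patch token to a guess
    # for the next one (the real model is a stack of Mamba blocks).
    return token @ W

def arg_mse_loss(patch_tokens):
    """Mean-squared error of next-patch predictions over a token sequence."""
    preds = np.stack([predict_next(t) for t in patch_tokens[:-1]])
    targets = np.stack(patch_tokens[1:])  # shift by one: predict patch i+1
    return float(np.mean((preds - targets) ** 2))

tokens = [rng.normal(size=1024) for _ in range(8)]  # 8 patch tokens
loss = arg_mse_loss(tokens)
```

Because every patch is fully observed, the loss is computed over the whole shifted sequence rather than over a masked subset.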
Stage 2: Image–Text Contrastive Learning (CTL)
- Transfers the Vim backbone and introduces a medical text encoder (Bio_ClinicalBERT or, for ablation, Llama2).
- Both vision and language encoders are mapped into a shared embedding space; cosine similarity between visual ($v_i$) and text ($t_i$) embeddings drives an InfoNCE loss, $\mathcal{L}_{\mathrm{CTL}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_i,t_j)/\tau)}$, with temperature $\tau$.
- Training uses ~480 K aligned pairs from MIMIC-CXR, CheXpert Plus, and IU-Xray.
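A minimal sketch of the contrastive objective, assuming L2-normalized embeddings so dot products equal cosine similarities; the temperature value 0.07 is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def info_nce(v, t, tau=0.07):
    """Image-to-text InfoNCE over a batch: matched pairs sit on the diagonal."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau                        # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log p(matched text | image)
```

Correctly paired batches score a much lower loss than mismatched ones, which is what drives the vision and text encoders into alignment.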
Stage 3: Supervised Fine-tuning (SFT)
- The decoder is a LLM (typically Llama2-7B), prompted to "generate a comprehensive and detailed diagnosis report for this chest X-ray image."
- The decoder receives the concatenated output of Vim token embeddings and prompt tokens.
- Training freezes the LLM parameters, optimizing only the vision encoder and a light mapping layer via the negative log-likelihood over target report tokens, $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T}\log p_\theta\!\left(y_t \mid y_{<t}, V\right)$, where $V$ denotes the visual token embeddings.
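The Stage-3 objective reduces to token-level negative log-likelihood. In this sketch the per-step logits are stand-ins for the frozen LLM's output when conditioned on the visual tokens and prompt; only the shapes and the loss computation mirror the setup described above.

```python
import numpy as np

def report_nll(logits, target_ids):
    """Sum of -log p(y_t | y_<t, V) over the target report tokens.

    logits: (T, vocab_size) decoder outputs, one row per report token.
    target_ids: (T,) integer ids of the ground-truth report tokens.
    """
    z = logits - logits.max(axis=1, keepdims=True)           # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return float(-log_probs[np.arange(len(target_ids)), target_ids].sum())
```

As a sanity check, uniform logits over a vocabulary of size $V$ give exactly $T\log V$ for a $T$-token report.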
3. Model Architecture and Hyperparameterization
MambaXray-VL is available in Base and Large variants, distinguished by the backbone depth and hidden dimensionality of the Vim encoder. Text decoders include Llama2 (7B or 13B), with Qwen-1.5 used for certain ablation studies.
Pre-training for Stage 1 (ARG) runs for 100 epochs at a batch size of 256 (Base) or 128 (Large), with a base learning rate scaled to the batch size, the AdamW optimizer, and a weight decay of 0.05. A cosine learning-rate schedule with 5 warm-up epochs is used. Stage 2 (CTL) follows for 50 epochs at batch size 192, with the same optimizer and input images resized to a fixed resolution; both the vision and text encoders are updated.
Fine-tuning uses 30 epochs with batch size 20 and a maximum report length of 60 tokens for IU-Xray (decoder: Qwen-1.5), and 6 epochs with batch size 18 and a maximum length of 100 tokens for MIMIC-CXR and CheXpert Plus (decoder: Llama2-7B; vision encoder frozen; validation every 0.5 epoch).
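For reference, the recipe above can be consolidated into a plain config structure. The keys and layout are illustrative, and values the text does not state (exact base learning rate, input resolution) are deliberately omitted rather than guessed.

```python
# Hyperparameters as stated in the text; structure is an illustrative sketch.
PRETRAIN_CONFIG = {
    "stage1_arg": {
        "epochs": 100,
        "batch_size": {"base": 256, "large": 128},
        "optimizer": "AdamW",
        "weight_decay": 0.05,
        "lr_schedule": "cosine",
        "warmup_epochs": 5,
    },
    "stage2_ctl": {
        "epochs": 50,
        "batch_size": 192,
        "optimizer": "AdamW",
        "train_text_encoder": True,
    },
}

FINETUNE_CONFIG = {
    "iu_xray": {
        "epochs": 30, "batch_size": 20,
        "max_report_len": 60, "decoder": "Qwen-1.5",
    },
    "mimic_cxr_and_chexpert_plus": {
        "epochs": 6, "batch_size": 18,
        "max_report_len": 100, "decoder": "Llama2-7B",
        "vision_frozen": True,
    },
}
```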
4. Quantitative and Qualitative Evaluation
Benchmarking on the CheXpert Plus test split demonstrates that MambaXray-VL-Large achieves state-of-the-art performance. On NLG metrics, it achieves BLEU-4 = 0.112 (second best 0.101), ROUGE-L = 0.276, METEOR = 0.157, CIDEr = 0.139. Clinical efficacy F1 reaches 0.335, exceeding strong baselines such as Token-Mixer (F1 = 0.288) and PromptMRG (F1 = 0.281). The Base variant (BLEU-4 = 0.105, F1 = 0.273) also outperforms most previous methods.
Among LLM plug-in variants, Vicuna-7B yields the best BLEU-4 among LLMs (0.104), while InternLM-7B achieves the highest clinical F1 (0.284). VLMs pretrained on natural images, such as InternVL-2 and MiniCPM-V2.5, produce lower scores, indicating a domain adaptation gap for radiology images.
Ablation analyses reveal that adding ARG pre-training increases BLEU-4 by approximately 25% on MIMIC-CXR and conveys a smaller, consistent improvement on CheXpert Plus. Adding CTL contributes about 5% gain in ROUGE-L. Vim-PTD initialization (ARG pre-training on X-rays) results in superior downstream performance compared to ImageNet-1K initialization. Larger backbone configurations consistently provide higher downstream metrics across evaluation regimes. Furthermore, Bio_ClinicalBERT outperforms Llama2 as the text encoder during the contrastive stage, likely due to its biomedical pre-training.
Qualitative case studies demonstrate that MambaXray-VL matches more ground-truth sentences than R2GenGPT, particularly for images featuring medical devices (pacemaker leads, sternotomy wires) and for subtle findings such as “flattening of the diaphragms” or “no pleural effusion or pneumothorax.”
5. Comparative Model Landscape
The following table summarizes the core groups of methods benchmarked with CXPMRG-Bench on CheXpert Plus:
| Category | Examples | Key Features |
|---|---|---|
| X-ray MRG | R2GenRL, Token-Mixer, PromptMRG, etc. | Vision-only and vision-LLMs, various backbones |
| LLM Plug-ins | Vicuna-7B, Qwen-1.5, Llama-2/3, etc. | Plugged into R2GenGPT framework for report generation |
| VLMs | InternVL-2, MiniCPM-V2.5 | Pretrained on natural images, applied without X-ray-specific tuning |
A plausible implication is that the domain gap for VLMs pretrained on natural-image corpora remains a limiting factor in radiology-specific tasks.
6. Significance and Outlook
CXPMRG-Bench provides the first large-scale, reproducible comparative evaluation platform for X-ray MRG algorithms on CheXpert Plus. By including both established models and modern LLM architectures within a unified protocol, it establishes a solid baseline for future algorithmic advances in the field. MambaXray-VL, with its lightweight Mamba-based backbone and multi-stage pre-training (ARG → CTL → SFT), advances the state of the art in both natural language and clinical outcome metrics. The framework enables more rigorous benchmarking for future research, clarifies the impact of domain-specific pre-training over generic vision models or LLMs, and offers open-source resources for follow-on studies (Wang et al., 2024).