MAIRA-2: Grounded Chest X-ray Reporting LLM

Updated 7 December 2025
  • MAIRA-2 is a radiology-specialized multimodal LLM that generates detailed chest X-ray reports with explicit grounding of findings using bounding box tokens.
  • It integrates a frozen radiology-adapted visual transformer, a multi-layer adapter, and a fine-tuned Vicuna language model to achieve state-of-the-art performance.
  • The model employs an autoregressive grounding mechanism and rigorous evaluation frameworks to ensure clinical relevance and robust interpretability.

MAIRA-2 is a radiology-specialized multimodal LLM (MLLM) designed for chest X-ray (CXR) report generation, with a unique capability for grounded reporting—explicit localization of described findings on the medical image. Developed within the LLaVA framework, MAIRA-2 integrates a frozen visual transformer backbone (ViT), a multi-layer adapter, and a fine-tuned causal LLM (Vicuna v1.5) to achieve state-of-the-art performance on public and private chest X-ray reporting benchmarks. MAIRA-2 has been the subject of extensive evaluation, including blinded studies against real radiologist reports and dedicated mechanistic interpretability analyses utilizing sparse autoencoders.

1. Model Architecture and Design

MAIRA-2 is architected as an encoder–adapter–decoder MLLM, with the following key components:

  • Visual Encoder: A frozen, radiology-specialized ViT-B (≈87 M parameters) backbone pretrained via DINO-style self-supervision on a consortium of chest X-ray datasets (BRAX, ChestX-ray8, CheXpert, MIMIC-CXR, PadChest, USMix; ≈1.4M images). Each image is encoded to a sequence of 1,369 visual tokens (input size: 518×518, patch size: 14×14).
  • Adapter: A trainable 4-layer MLP projects the ViT output into the embedding space required by the LLM, effecting multimodal alignment.
  • LLM: An autoregressive transformer decoder (Vicuna v1.5, 7B or 13B parameters; all results below refer to the 7B base unless otherwise specified). Position embeddings are augmented using a 1.5× linear RoPE scaling to support long, interleaved multimodal prompts.
  • Input and Prompting: Model input combines image tokens (frontal, lateral, optionally prior CXRs) and text tokens (Indication, Technique, Comparison, prior report sections) within a task-specific system instruction. Image tokens are incorporated by direct replacement within the token sequence.
  • Grounding Mechanism: For grounded report generation, each describable finding may be followed by special tokens <obj>…<box>…</box></obj> encoding bounding boxes using a tokenized 100×100 grid. No explicit detection head/regression loss is used; grounding is handled autoregressively as text decoding.
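Since grounding is pure text decoding, a box can be serialized by quantizing its coordinates onto the 100×100 grid and emitting them inside the `<obj>…<box>…</box></obj>` wrapper. A minimal sketch follows; the inner coordinate token naming (`<locN>`) is a hypothetical placeholder, not MAIRA-2's actual vocabulary:

```python
def box_to_tokens(x_min, y_min, x_max, y_max, grid=100):
    """Quantize a normalized box (coords in [0, 1]) onto a grid x grid
    lattice and serialize it in an <obj>/<box>-style token format."""
    def q(v):
        # Clamp, then map [0, 1] onto integer cells {0, ..., grid - 1}.
        return min(grid - 1, max(0, int(v * grid)))
    coords = [q(x_min), q(y_min), q(x_max), q(y_max)]
    return "<obj><box>" + "".join(f"<loc{c}>" for c in coords) + "</box></obj>"

print(box_to_tokens(0.12, 0.30, 0.55, 0.78))
# -> <obj><box><loc12><loc30><loc55><loc78></box></obj>
```

The language model then only ever sees (and predicts) these discrete tokens, which is what allows grounding to be trained with the same cross-entropy loss as ordinary text.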

The end-to-end parameter count of the 7B MAIRA-2 configuration is approximately 6.9 billion. The model is trained exclusively on cross-entropy objectives, with no auxiliary loss terms for grounding; all tasks (ungrounded Findings, Grounded Report, Phrase Grounding) are mixed in training batches in proportion to their dataset sizes (Bannur et al., 6 Jun 2024).
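Mixing tasks in proportion to dataset size can be sketched as weighted sampling over task labels. The counts below are illustrative placeholders, not the actual MAIRA-2 training mixture:

```python
import random

# Hypothetical dataset sizes; tasks are mixed in proportion to these counts.
dataset_sizes = {
    "findings": 510_000,        # ungrounded Findings generation
    "grounded_report": 1_231,   # grounded report generation
    "phrase_grounding": 2_000,  # phrase-to-box grounding
}

def sample_task(rng):
    """Draw one task with probability proportional to its dataset size."""
    tasks, sizes = zip(*dataset_sizes.items())
    return rng.choices(tasks, weights=sizes, k=1)[0]

rng = random.Random(0)
draws = [sample_task(rng) for _ in range(10_000)]
print(draws.count("findings") / len(draws))  # close to 0.99 under these counts
```

With such skewed sizes, the grounded tasks are rare in any batch, which is consistent with the paper's observation that grounding supervision does not dominate the Findings objective.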

2. Training Regime and Data

MAIRA-2’s training strategy exploits multimodal (image+text) pairing sourced from large, publicly available radiology corpora, with no task-specific pretraining or in-paper fine-tuning for external benchmarks. The main datasets utilized are:

  • Pretraining/Fine-tuning Sets: MIMIC-CXR, PadChest, USMix, IU-Xray (total ≈510–511K CXR-report pairs for evaluation studies; full model pretraining included further sources such as BRAX, ChestX-ray8, CheXpert—see architecture summary above).
  • Grounded Reporting Task: Internal “GR-Bench” dataset with explicit bounding box annotation for image findings (private, n=1231).
  • Phrase Grounding: MS-CXR (public phrase-to-box mapping, n≈176 held-out phrases).
  • Contextual Inputs: Inclusion of prior images, prior reports, and additional sections (INDICATION/TECHNIQUE/COMPARISON) shown empirically to reduce hallucinations and improve factuality.

All model parameters except for the image encoder are fine-tuned. Training hyperparameters (optimizer, batch size, learning rate, epoch count) are not specified in published evaluation studies (Lim et al., 29 Nov 2025); the technical report notes standard AdamW-like schedules (Bannur et al., 6 Jun 2024).

3. Evaluation Frameworks and Task Formulations

MAIRA-2 introduces and is evaluated under both traditional ungrounded and novel grounded report generation settings.

  • Findings Generation (FindGen): Tasked with producing the free-text Findings section of a radiology report given image and context input.
  • Grounded Report Generation (GroundRep): For each sentence describing a localisable finding, the model appends <obj>…<box>…</box></obj> tokens to specify a bounding box. The model autoregressively generates both findings and localization tokens.
  • Phrase Grounding: Given an image and a target phrase, the model predicts the corresponding location as a bounding box.

RadFact Evaluation Suite: An LLM-based evaluation framework measures:
  • Sentence-level logical entailment versus the reference report (precision, recall).
  • Correctness of generated grounding (precision, recall), using pixel-level box overlap (threshold τ = 0.5).
  • Additional standard metrics: ROUGE-L, BLEU-1/4, METEOR, CheXbert F1, RadCliQ, RadGraph-F1, box-completion mIoU.
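A standard way to score box overlap against a threshold is intersection-over-union (IoU); RadFact's exact spatial metric may differ in detail, but the thresholding logic looks like this:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

TAU = 0.5
pred, ref = (10, 10, 50, 50), (20, 20, 60, 60)
print(iou(pred, ref) >= TAU)  # False: IoU = 900/2300, about 0.39, below the threshold
```

A predicted box only counts as correct when its overlap with the reference clears τ, so near-misses like the one above are scored as grounding errors.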

Radiologist Benchmarks: In Lim et al., MAIRA-2 and comparator VLMs were blindly scored by expert thoracic radiologists using criteria directly modeled on clinical QA and safety:
  • RADPEER: three-point peer-review scale with subtypes tracking clinical significance.
  • Clinical Acceptability: four-point scale for real-world correction burden.
  • Hallucination Detection: percentage of statements unsupported by the image.
  • Language Clarity: five-point readability assessment.

Comparative performance balancing factuality and language was assessed using generalized linear mixed models; per-finding sensitivity and specificity were additionally computed using same-day CT reads as the reference standard (Lim et al., 29 Nov 2025).

4. Quantitative Performance and Benchmark Results

4.1 Findings Generation (Ungrounded)

In public MIMIC-CXR evaluation (n≈2,461), MAIRA-2 achieved state-of-the-art results compared to prior models:

| Metric                 | MAIRA-2 7B  | MAIRA-2 13B |
|------------------------|-------------|-------------|
| ROUGE-L                | 38.4        | 39.1        |
| BLEU-1 / BLEU-4        | 46.5 / 23.4 | 47.9 / 24.3 |
| METEOR                 | 41.9        | 43.0        |
| RadFact precision      | 52.5        | 55.6        |
| RadFact recall         | 48.6        | 51.5        |
| RadGraph-F₁            | 34.6        | 35.9        |
| CheXbert macro F₁-14   | 40.9        | 45.9        |
| CheXbert micro F₁-14   | 60.2        | 61.4        |
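The gap between the macro and micro CheXbert scores above reflects how the two averages treat rare findings: macro-F₁ averages per-class F₁ equally over the 14 findings, while micro-F₁ pools all decisions. A toy sketch with three hypothetical finding classes:

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

# Toy per-class (tp, fp, fn) counts for three hypothetical findings.
counts = {"effusion": (50, 10, 10), "cardiomegaly": (30, 5, 15), "nodule": (2, 3, 8)}

# Macro: unweighted mean of per-class F1 scores.
macro = sum(f1(*c) for c in counts.values()) / len(counts)
# Micro: pool the counts across classes, then compute one F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = f1(tp, fp, fn)
print(round(macro, 3), round(micro, 3))  # 0.617 0.763
```

The poorly detected rare class ("nodule") drags the macro score well below the micro score, which is why the macro F₁-14 numbers in the table are consistently lower than the micro ones.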

On PadChest and IU-Xray external sets, MAIRA-2 retained strong performance; for instance, PadChest ROUGE-L ≈29.4, BLEU-4 ≈9.3, CheXbert macro F₁-14 ≈35.9 (Bannur et al., 6 Jun 2024).

4.2 Grounded Reporting and Phrase Grounding

On GR-Bench (n=1231) for grounded reporting:

| Metric                                | MAIRA-2 7B  |
|---------------------------------------|-------------|
| RadFact logical precision / recall    | 73.5 / 72.4 |
| RadFact grounding precision / recall  | 68.2 / 92.2 |
| RadFact spatial precision / recall    | 32.1 / 33.7 |
| Box-completion mIoU (P / R)           | 60.7 (54.1) |

On MS-CXR phrase grounding, box mIoU was 57.8 (single-box), demonstrating high-quality grounding performance relative to specialized vision-LLMs (Bannur et al., 6 Jun 2024).

4.3 Clinical Acceptability and Diagnostic Fidelity (Radiologist Review)

MAIRA-2’s performance in blinded radiologist evaluation (n=1,434 reports):

| Metric                          | Radiologist | MAIRA-2    | P-value |
|---------------------------------|-------------|------------|---------|
| RADPEER 3b ("unacceptable") %   | 13.9 (200)  | 24.5 (352) | <.001   |
| RADPEER 2b+3b (sig. miss) %     | 30.3 (435)  | 36.8 (528) | <.001   |
| Clinical acceptability ≥3 %     | 74.3 (1065) | 65.6 (941) | <.001   |
| Clinical acceptability =4 %     | 43.5 (624)  | 37.3 (535) | <.001   |
| Hallucination %                 | 0.1         | 17.4       | <.001   |
| Language clarity ≥4 %           | 78.1        | 78.4       | .85     |

Finding-level sensitivity (vs. CT):

  • Consolidation/GGO: 51.4%
  • Pleural effusion: 43.6%
  • Cardiomegaly: 72.0%
  • Emphysema: 6.0%
  • Nodule/mass: 7.1%
  • Pneumothorax: 55.6%
  • Rare findings: 0% (Lim et al., 29 Nov 2025)

Compared to other VLMs evaluated in the same study (AIRead, Lingshu, MedGemma, MedVersa), MAIRA-2 demonstrated middling factual agreement and clarity, a high hallucination rate, and moderate-to-poor sensitivity for key pulmonary findings.

5. Mechanistic Interpretability with Sparse Autoencoders

Large-scale mechanistic analysis applied a Matryoshka sparse autoencoder (SAE) to midlayer (4,096-dim) residuals in MAIRA-2’s decoder to surface latent concepts (Bouzid et al., 17 Jul 2025). The SAE trains a nested sequence of dictionary sizes (up to 16,384 latents, enforcing 256-sparsity per batch) on internal representations obtained from both image and text tokens.
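The core mechanic can be sketched with a simplified top-k sparse autoencoder: encode a residual vector, keep only the k largest pre-activations, and reconstruct. This is an illustrative sketch with toy dimensions (the paper uses 4,096-dim residuals, 16,384 latents, 256-sparsity) and omits the Matryoshka nesting of dictionary prefixes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, k = 64, 256, 8   # toy sizes; paper: 4096, 16384, 256

W_enc = rng.standard_normal((d_model, n_latents)) * 0.01
W_dec = rng.standard_normal((n_latents, d_model)) * 0.01
b_enc = np.zeros(n_latents)

def sae_encode(x):
    """Top-k sparse code: keep the k largest ReLU pre-activations, zero the rest."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    idx = np.argpartition(pre, -k)[-k:]        # indices of the top-k latents
    z = np.zeros_like(pre)
    z[idx] = pre[idx]
    return z

x = rng.standard_normal(d_model)               # a residual-stream vector
z = sae_encode(x)
x_hat = z @ W_dec                              # sparse reconstruction of x
print(int((z != 0).sum()) <= k)                # True: at most k active latents
```

In the actual pipeline the weights are trained to minimize reconstruction error over residuals gathered from both image and text tokens, and each latent's decoder row `W_dec[i]` is the candidate "concept direction" used in the steering experiments.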

Concept Discovery Pipeline:

  • Each latent is screened using automated GPT-4o-based annotation over 50 exemplars, with interpretability F₁ measured on 200 held-out samples.
  • Approximately 1.8% (288/16,384) of SAE latents reached detection F₁≥0.75; these encoded clinically salient concepts (e.g., chest tubes, pleural effusion, cardiomegaly, explicit interval change).
  • Steering experiments, which add latent decoder directions to the residual stream during decoding, demonstrated causal influence on generation for a minority of features (peak on-target success ≈11.3%, with substantial off-target changes), highlighting the current fragility of mechanistic control.
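At its simplest, steering means adding a scaled, unit-norm decoder direction to a residual vector at decode time. A minimal numpy sketch under that assumption (toy dimensions, random stand-in for a latent's decoder row):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64                       # toy size; MAIRA-2 residuals are 4096-dim

def steer(residual, direction, alpha=4.0):
    """Add a scaled, unit-norm latent decoder direction to a residual vector."""
    unit = direction / np.linalg.norm(direction)
    return residual + alpha * unit

residual = rng.standard_normal(d_model)
direction = rng.standard_normal(d_model)   # stand-in for a concept's decoder row
steered = steer(residual, direction)

# The steered residual moves along the concept direction by exactly alpha:
unit = direction / np.linalg.norm(direction)
print(steered @ unit - residual @ unit)    # 4.0 (up to float precision)
```

The reported fragility arises because a real decoder direction is rarely monosemantic, so the same shift that promotes the target concept also perturbs unrelated parts of the generated report.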

Key findings:

  • Most latents are not monosemantic; feature duplication and mixed concepts are frequent.
  • SAE integration did not degrade perplexity or factual accuracy on test sets.
  • Dependence on a generalist LLM for feature annotation introduces potential interpretability limitations.

6. Ablation Studies and Design Analysis

MAIRA-2’s multitask architecture supports rigorous ablation of both modeling and data-centric components.

  • Grounding Supervision: Removal of grounding tasks from training does not impair Findings text generation (no negative transfer). Conversely, removal of the Findings generation task decreases both text and grounding performance, indicating positive transfer.
  • Contextual Inputs: Removing temporal context or multi-view information reduces ROUGE-L by 8–29% and CheXbert F₁ by ~10%, and significantly increases hallucinated lateral/comparison mentions.
  • Grounding Quality: Qualitative error analysis reveals partial-to-good localization in most cases; some misspecifications are borderline or due to ambiguous references not explicit in the annotation (Bannur et al., 6 Jun 2024).

These findings underscore the importance of rich context and multitask learning for robust radiology report generation and grounding.

7. Limitations and Future Directions

Key limitations of MAIRA-2 and its evaluation pipeline include:

  • Generalization: Most evaluation has focused on English-language, academic medical centers, with limitations for succinct or regionally specific report styles (e.g., Korean sites in Lim et al. (Lim et al., 29 Nov 2025)).
  • Lack of Clinical Metadata: Current model input is restricted to image and report text, omitting critical patient metadata, which could remedy certain hallucination modes and improve finding interpretability.
  • Undetected Rare Findings: Detection of low-prevalence high-risk findings (e.g., aortic aneurysm, pneumoperitoneum) remains below clinical thresholds, as with all comparator VLMs.
  • Human–AI Collaboration: No studies yet address downstream clinical workflows or the impact of human-in-the-loop verification on safety and trust.
  • Interpretability and Control: SAE-based concept decomposition is promising but currently reveals a low proportion of high-fidelity, steerable neuron directions; collateral text effects and split concepts impede practical deployment.

Future research directions encouraged by the authors and subsequent work include radiology-adapted LLMs for concept annotation, active sampling for concept discovery, tighter architectural integration between vision and language components, and domain-expert review to validate feature steering and ensure patient safety (Bouzid et al., 17 Jul 2025, Bannur et al., 6 Jun 2024).


MAIRA-2 establishes a new paradigm for large-scale, grounded radiology report generation leveraging a vision–language transformer framework, multitask sequence modeling, and systematic clinical benchmarking. It sets baseline capabilities for factual and spatially grounded report generation in CXR, while also serving as a platform for deeper model interpretability and mechanistic dissection. Its design and empirical results provide a reference point for future radiology-specific and general medical foundation models.
