EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT

Published 27 Apr 2026 in cs.CV | (2604.24146v1)

Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.


Authors (9)

Summary

The paper introduces a novel anatomy-aware, anomaly-aware framework (EXACT) that leverages a Y-shaped Mamba backbone for voxel-level localization in 3D chest CT scans.
It reports stronger diagnostic performance and anomaly localization than CLIP-based models and dedicated segmentation networks, with zero-shot AUROC up to 0.830 and external Hit Rates@5% above 95%.
The method integrates organ segmentation with multi-instance anomaly detection, enabling transparent, explainable radiology report generation and robust cross-domain generalization.

EXACT: Anatomy-Aware Explainable Vision Foundation Model for 3D Chest CT

Motivation and Context

The increasing volume and complexity of chest computed tomography (CT) scans underscore the need for scalable and interpretable AI systems in thoracic imaging. Traditional radiology workflows and most current AI models are insufficient for integrated, voxel-level understanding, spatial evidence preservation, and robust generalization across multi-disease diagnoses. Existing vision-language foundation models, especially those based on alignment paradigms like CLIP, compress volumetric image features into global embeddings that discard critical spatial and anatomical context, an approach ill-suited for precise localization and explainable clinical interpretation.

Model Architecture and Training Paradigm

EXACT introduces a novel anatomy-constrained, anomaly-aware framework, leveraging a Y-shaped Mamba backbone (Y-Mamba) for 3D chest CT. The model jointly optimizes organ segmentation and multi-instance anomaly localization in a weakly supervised setting, anchored by automatic organ masks (Segment Anything by Text, SAT) and disease pseudo-labels extracted from paired clinical reports (RadBERT). The dual-decoder architecture robustly retains spatial granularity throughout both branches, generating 18-channel anomaly-aware maps (AAmaps) and organ segmentation masks. Contrary to CLIP-style compression, EXACT preserves voxel-level anomaly scores confined to anatomically plausible regions, supporting both holistic disease reasoning and focal pathology detection.
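The idea of confining each disease channel to its corresponding anatomy can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `confine_to_anatomy`, the flattened-voxel representation, the two disease channels, and the disease-to-organ mapping are all hypothetical choices made for illustration, and the sketch assumes confinement is a hard mask that zeroes scores outside the mapped organ.

```python
from typing import Dict, List


def confine_to_anatomy(
    scores: Dict[str, List[float]],
    organ_labels: List[str],
    disease_to_organ: Dict[str, str],
) -> Dict[str, List[float]]:
    """Zero out each disease channel outside its mapped organ.

    scores: per-disease anomaly scores over a flattened list of voxels.
    organ_labels: organ name assigned to each voxel (same length).
    disease_to_organ: hypothetical lookup from disease to host organ.
    """
    confined = {}
    for disease, channel in scores.items():
        organ = disease_to_organ[disease]
        confined[disease] = [
            s if label == organ else 0.0
            for s, label in zip(channel, organ_labels)
        ]
    return confined


# Toy example: four voxels and two illustrative disease channels.
organ_labels = ["lung", "lung", "heart", "background"]
scores = {
    "consolidation": [0.9, 0.2, 0.7, 0.4],
    "pericardial_effusion": [0.3, 0.1, 0.8, 0.6],
}
# Assumed mapping, for illustration only.
disease_to_organ = {"consolidation": "lung", "pericardial_effusion": "heart"}

aamaps = confine_to_anatomy(scores, organ_labels, disease_to_organ)
print(aamaps["consolidation"])         # scores survive only in lung voxels
print(aamaps["pericardial_effusion"])  # scores survive only in heart voxels
```

In a real 3D setting the same masking would be applied per channel over a volume tensor, with the organ labels coming from the segmentation branch rather than a hand-written list.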

Multi-scale fusion of anomaly predictions and top-k pooling within each anatomical region drive discriminative learning at both coarse and fine spatial resolutions. The pretraining curriculum schedules anatomical prior consolidation before progressive pathological pattern acquisition. Layer-normalized, gated spatial convolutions and efficient state-space modeling in Y-Mamba further enhance global and local context propagation.
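Top-k pooling within a region is a common way to turn voxel-level scores into a scan-level prediction under multi-instance learning: only the k highest-scoring voxels in the organ contribute, so a small lesion is not averaged away by healthy tissue. The sketch below shows that idea under stated assumptions. The function name `topk_pool`, the default `k_fraction`, and the use of a plain mean over the selected voxels are illustrative and are not taken from the paper.

```python
import heapq
from typing import List


def topk_pool(
    voxel_scores: List[float],
    region_mask: List[bool],
    k_fraction: float = 0.05,
) -> float:
    """Mean of the top-k anomaly scores inside one anatomical region.

    k is a fraction of the region size (an assumed hyperparameter),
    with a floor of one voxel so tiny regions still yield a score.
    """
    region = [s for s, inside in zip(voxel_scores, region_mask) if inside]
    if not region:
        return 0.0  # organ absent from the scan: no evidence
    k = max(1, int(len(region) * k_fraction))
    top = heapq.nlargest(k, region)
    return sum(top) / len(top)


# Toy region of three voxels; the fourth voxel lies outside the organ.
scan_score = topk_pool(
    voxel_scores=[0.1, 0.9, 0.8, 0.99],
    region_mask=[True, True, True, False],
    k_fraction=0.67,
)
print(round(scan_score, 2))
```

The scan-level score can then be compared against the report-derived pseudo-label with a standard binary loss, which is how weak supervision reaches individual voxels without voxel-level annotation.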

Evaluation and Numerical Results

EXACT is validated on five clinically relevant tasks: zero-shot and fine-tuned multi-disease diagnosis, unsupervised anomaly localization, downstream supervised segmentation (EXACT-Seg), and visually grounded radiology report generation (EXACT-CHAT). Evaluation uses an internal dataset (CT-RATE) and multinational external datasets (RAD-ChestCT, Mian Yang, ReXGroundingCT, COVID-19, MosMed). Baselines include state-of-the-art 3D vision-language foundation models (CT-CLIP, fVLM, MedVista3D, Merlin, T3D), established segmentation networks (BiomedParse-v2, RWKV-UNet, SegMamba), and recent report generators (CT-CHAT, Hulu-Med).

Diagnosis: In zero-shot diagnosis, EXACT achieves AUROC of 0.830 (CT-RATE), 0.728 (RAD-ChestCT), and 0.758 (Mian Yang), consistently outperforming all baselines, including those fully fine-tuned (e.g., T3D, AUROC 0.802). These gains hold across high-prevalence and spatially sparse diseases (e.g., hiatal hernia, pericardial effusion), demonstrating robust cross-domain generalization where CLIP-based approaches degrade (external AUROCs <0.720).
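For readers less familiar with the metric, AUROC is the probability that a randomly chosen positive scan receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition and macro-averages over diseases. The function names and the toy labels and scores are illustrative, and the assumption that the reported figures are macro-averages over diseases is not confirmed by the text above.

```python
from typing import Dict, List, Tuple


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def macro_auroc(per_disease: Dict[str, Tuple[List[int], List[float]]]) -> float:
    """Unweighted mean of per-disease AUROC values."""
    values = [auroc(y, s) for y, s in per_disease.values()]
    return sum(values) / len(values)


# Toy data: two diseases, four scans each.
toy = {
    "disease_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "disease_b": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
}
print(auroc(*toy["disease_a"]))
print(macro_auroc(toy))
```

The pairwise loop is quadratic in the number of scans, which is fine for illustration; a rank-based implementation would be preferable at dataset scale.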

Anomaly Localization: EXACT is the first FM to deliver intrinsic voxel-level anomaly localization from weak supervision, achieving DSC 0.071 (ReX-Val, internal), 0.435 (COVID-19), and 0.363 (MosMed), with Hit Rates@5% exceeding 95% externally. BiomedParse-v2, operating via text prompts, only reaches 0.065 (ReX-Val) and 0.340 (COVID-19); Grad-CAM visualizations from CT-CLIP/fVLM are structurally imprecise (DSC <0.023). In supervised segmentation, EXACT-Seg achieves 0.476 (COVID-19) and 0.454 (MosMed), a statistically significant improvement over SegMamba and RWKV-UNet given limited training samples (P<0.001).
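The two localization metrics can be sketched as follows. The Dice similarity coefficient (DSC) is standard: twice the overlap divided by the total size of the two masks. The hit-rate function is only one plausible reading of "Hit Rate@5%", namely that a scan counts as a hit when any of its top 5% highest-scoring voxels falls inside the ground-truth lesion; the paper's exact definition is not given in this summary, so that function should be treated as an assumption. All names here are hypothetical.

```python
from typing import List


def dice(pred: List[bool], truth: List[bool]) -> float:
    """Dice similarity coefficient between two binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * overlap / total


def hit_at_fraction(
    scores: List[float], truth: List[bool], fraction: float = 0.05
) -> bool:
    """ASSUMED definition: True if any top-scoring voxel lies in the lesion."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(truth[i] for i in ranked[:k])


# Toy masks over four voxels: one voxel of overlap out of two each.
print(dice([True, True, False, False], [True, False, True, False]))

# Toy scan of 20 voxels: the single highest score sits inside the lesion.
toy_scores = [0.0] * 20
toy_scores[7] = 1.0
toy_truth = [i == 7 for i in range(20)]
print(hit_at_fraction(toy_scores, toy_truth))
```

Under this reading, a low DSC and a high hit rate are compatible: the top-scoring voxels can land on the lesion reliably even when the thresholded map does not match its full extent.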

Report Generation: EXACT-CHAT, which couples the EXACT image encoder and diagnostic priors with LLaMA-3.1-8B-Instruct, achieves RadBERT-F1 of 0.501 (CT-RATE), 0.441 (RAD-ChestCT), and 0.410 (Mian Yang), substantially surpassing CT-CHAT, Hulu-Med, and graph-based models. Refinement with GPT-4.1 further reduces hallucinations and improves clinical accuracy. Lexical NLG metrics, by contrast, correlate poorly with factual correctness. EXACT-CHAT also provides spatial grounding by mapping report content directly to voxel anomaly scores, which supports transparent clinical verification.

Robustness, Interpretability, and Clinical Trustworthiness

Anatomy-aware constraints and multi-instance learning confer superior sensitivity to organ-localized and spatially sparse pathologies, mitigating inattentional blindness and under-reading errors. Visual evidence for each prediction is provided intrinsically by AAmaps, not post-hoc approximations—addressing a long-standing barrier to trustworthy medical AI. EXACT’s end-to-end pipeline bypasses manual annotation requirements, encapsulating transferable feature priors adaptable to rare pathologies and cross-institutional data heterogeneity.

A pronounced dissociation between lexical NLG metrics and clinical fidelity is observed: foundation models prioritizing factual grounding outperformed those focusing solely on fluency. Report generators built atop CLIP-style encoders inherit spatial insensitivity and exhibit domain drift and factual collapse under distribution shift, highlighting the necessity of anatomy-aware pretraining and spatial evidence anchoring.

Limitations and Prospective Directions

EXACT relies on RadBERT for pseudo-label extraction during pretraining, and future work may replace this with more advanced open LLMs to improve pipeline scalability and taxonomy induction. Current target abnormalities and disease–organ mappings are manually defined, which limits coverage for rare/unevaluated conditions. Prospective validation is required to ascertain translational impact in real-world workflows and patient outcomes.

Implications and Future Prospects

EXACT establishes a scalable, annotation-free paradigm in volumetric medical imaging, fusing global semantic and fine-grained spatial understanding. Its architecture and training principles can be generalized to other 3D modalities (e.g., MRI, PET) and extended to pan-organ or multi-system AI frameworks. Integration of structured priors into LLM-driven medical report generation offers a blueprint for clinically grounded, trustworthy multimodal assistants. With anticipated enhancements in automated taxonomy extraction and incorporation of general-purpose LLMs for pseudo-labeling, the approach promises increased scalability and broader disease coverage. In resource-constrained clinical settings, EXACT’s immediate deployment capabilities democratize access to expert-level diagnosis and annotation, potentially transforming screening, triage, and population-scale analysis.

Conclusion

EXACT demonstrates anatomical and pathological interpretability, high diagnostic and localization efficacy, and robust clinical report generation by leveraging explainable, anomaly-aware voxel-level representations pre-trained from routine chest CT scans and radiology reports. Its anatomy-aware weak supervision paradigm advances volumetric foundation modeling, outperforming alignment-based and segmentation-focused baselines across diverse tasks and cohorts, and delivering transparent, clinically relevant evidence for scalable, trustworthy medical AI (2604.24146).
