- The paper introduces a novel anatomy-aware, anomaly-aware framework (EXACT) that leverages a Y-shaped Mamba backbone for voxel-level localization in 3D chest CT scans.
- It reports stronger diagnostic performance and anomaly localization than CLIP-based models and segmentation networks, with zero-shot AUROCs up to 0.830 and external Hit Rates@5% above 95%.
- The method integrates organ segmentation with multi-instance anomaly detection, enabling transparent, explainable radiology report generation and robust cross-domain generalization.
EXACT: Anatomy-Aware Explainable Vision Foundation Model for 3D Chest CT
Motivation and Context
The increasing volume and complexity of chest computed tomography (CT) scans underscore the need for scalable and interpretable AI systems in thoracic imaging. Traditional radiology workflows and most current AI models are insufficient for integrated, voxel-level understanding, spatial evidence preservation, and robust generalization across multi-disease diagnoses. Existing vision-language foundation models, especially those based on alignment paradigms like CLIP, compress volumetric image features into global embeddings that discard critical spatial and anatomical context, an approach ill-suited for precise localization and explainable clinical interpretation.
Model Architecture and Training Paradigm
EXACT introduces an anatomy-constrained, anomaly-aware framework built on a Y-shaped Mamba backbone (Y-Mamba) for 3D chest CT. The model jointly optimizes organ segmentation and multi-instance anomaly localization in a weakly supervised setting, anchored by automatic organ masks (Segment Anything by Text, SAT) and disease pseudo-labels extracted from paired clinical reports (RadBERT). The dual-decoder architecture retains spatial granularity in both branches, producing 18-channel anomaly-aware maps (AAmaps) alongside organ segmentation masks. Unlike CLIP-style compression, EXACT preserves voxel-level anomaly scores and confines them to anatomically plausible regions, supporting both holistic disease reasoning and focal pathology detection.
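The idea of confining anomaly scores to anatomically plausible regions can be illustrated with a minimal sketch. It assumes flattened voxel lists and a hand-written disease-to-organ table; the names `DISEASE_TO_ORGANS` and `constrain_to_anatomy`, the organ labels, and the disease keys are all hypothetical and do not come from the paper.

```python
# Hypothetical disease-to-organ table for illustration only; the paper's
# actual mapping is defined manually by the authors and is not reproduced here.
DISEASE_TO_ORGANS = {
    "pericardial_effusion": {"heart"},
    "consolidation": {"lung_left", "lung_right"},
}


def constrain_to_anatomy(scores, organ_labels, disease):
    """Zero out anomaly scores that fall outside the organs linked to `disease`.

    scores       -- flat list of per-voxel anomaly scores in [0, 1]
    organ_labels -- flat list of per-voxel organ names, same length as scores
    """
    if len(scores) != len(organ_labels):
        raise ValueError("scores and organ_labels must have the same length")
    allowed = DISEASE_TO_ORGANS[disease]
    return [s if organ in allowed else 0.0
            for s, organ in zip(scores, organ_labels)]


# Toy volume of four voxels: two in the heart, one in the lung, one background.
scores = [0.9, 0.2, 0.8, 0.1]
organs = ["heart", "heart", "lung_left", "background"]
constrained = constrain_to_anatomy(scores, organs, "pericardial_effusion")
```

In this toy case the high score in the lung voxel is suppressed for the pericardial effusion channel, so only evidence inside the heart mask can contribute to that disease.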
Multi-scale fusion of anomaly predictions and top-k pooling within each anatomical region drive discriminative learning at both coarse and fine spatial resolutions. The pretraining curriculum schedules anatomical prior consolidation before progressive pathological pattern acquisition. Layer-normalized, gated spatial convolutions and efficient state-space modeling in Y-Mamba further enhance global and local context propagation.
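Top-k pooling within a region can be sketched as follows. This is a generic multi-instance pooling example, not the paper's implementation: the function name `topk_region_score` and the choice of `k` are assumptions, and the volume is flattened to a list for simplicity.

```python
def topk_region_score(scores, region_mask, k=2):
    """Average the k highest anomaly scores inside one anatomical region.

    scores      -- flat list of per-voxel anomaly scores
    region_mask -- flat list of booleans, True where the voxel is in the region
    k           -- number of top voxels to average (hypothetical value)
    """
    region = sorted(
        (s for s, inside in zip(scores, region_mask) if inside),
        reverse=True,
    )
    if not region:
        return 0.0  # empty region: no evidence for this disease
    k = min(k, len(region))
    return sum(region[:k]) / k


# Four voxels inside the region, two outside (the 0.95 outlier is ignored).
scores = [0.05, 0.9, 0.1, 0.7, 0.95, 0.02]
mask = [True, True, True, True, False, False]
region_score = topk_region_score(scores, mask, k=2)  # mean of 0.9 and 0.7
```

Averaging only the top-scoring voxels lets a small lesion drive the region-level prediction, whereas mean pooling over the whole organ would dilute it. A report-level pseudo-label can then supervise this pooled score without voxel annotations.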
Evaluation and Numerical Results
EXACT is validated across five clinically relevant tasks: multi-disease diagnosis in zero-shot and fine-tuned settings, unsupervised anomaly localization, downstream supervised segmentation (EXACT-Seg), and visually grounded radiology report generation (EXACT-CHAT). Evaluation uses an internal dataset (CT-RATE) and multinational external datasets (RAD-ChestCT, Mian Yang, ReXGroundingCT, COVID-19, MosMed). Baselines include state-of-the-art 3D vision-language foundation models (CT-CLIP, fVLM, MedVista3D, Merlin, T3D), established segmentation networks (BiomedParse-v2, RWKV-UNet, SegMamba), and advanced report generators (CT-CHAT, Hulu-Med).
Diagnosis: In zero-shot diagnosis, EXACT achieves an AUROC of 0.830 (CT-RATE), 0.728 (RAD-ChestCT), and 0.758 (Mian Yang), consistently outperforming all baselines, including fully fine-tuned ones (e.g., T3D, AUROC 0.802). These gains hold across both high-prevalence and spatially sparse diseases (e.g., hiatal hernia, pericardial effusion), demonstrating robust cross-domain generalization where CLIP-based approaches degrade (external AUROCs <0.720).
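For readers less familiar with the metric, AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison on made-up labels and scores; the data and the function name `auroc` are illustrative and unrelated to the paper's results.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels -- list of 0/1 ground-truth labels
    scores -- list of predicted scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: two positives, three negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
value = auroc(labels, scores)  # 5 of 6 positive-negative pairs are ordered correctly
```

The quadratic loop is fine for illustration; a rank-based formulation gives the same value more efficiently on large cohorts.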
Anomaly Localization: EXACT is presented as the first foundation model to deliver intrinsic voxel-level anomaly localization from weak supervision, achieving Dice similarity coefficients (DSC) of 0.071 (ReX-Val, internal), 0.435 (COVID-19), and 0.363 (MosMed), with Hit Rates@5% exceeding 95% externally. BiomedParse-v2, which operates via text prompts, reaches only 0.065 (ReX-Val) and 0.340 (COVID-19); Grad-CAM visualizations from CT-CLIP and fVLM are structurally imprecise (DSC <0.023). In supervised segmentation, EXACT-Seg achieves DSC of 0.476 (COVID-19) and 0.454 (MosMed), a statistically significant improvement over SegMamba and RWKV-UNet when training samples are limited (P<0.001).
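The two localization metrics can be sketched on flattened masks. The Dice computation is standard. The hit criterion is an assumption: this summary does not define Hit Rate@5%, so the code uses one plausible reading, in which a case counts as a hit when any of the top 5% highest-scoring voxels falls inside the ground-truth lesion. The function names, the 0.5 threshold, and the toy data are all hypothetical.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two flat boolean masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def top_fraction_hit(scores, truth, fraction=0.05):
    """ASSUMED hit criterion: does any top-`fraction` voxel lie in the lesion?"""
    k = max(1, int(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return any(truth[i] for i in top)


# Toy volume of 20 voxels with a 3-voxel lesion at indices 3, 4, and 5.
scores = [0.1] * 20
scores[3], scores[4], scores[10] = 0.9, 0.6, 0.7
truth = [i in (3, 4, 5) for i in range(20)]

pred = [s >= 0.5 for s in scores]    # hypothetical threshold of 0.5
dsc = dice(pred, truth)              # 2 * 2 / (3 + 3)
hit = top_fraction_hit(scores, truth)
```

The toy case shows why the two metrics can diverge: a map may place its strongest response inside the lesion, which registers as a hit, while still overlapping the full lesion extent only partially, which keeps DSC low.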
Report Generation: EXACT-CHAT, which combines the image encoder and diagnostic priors with LLaMA-3.1-8B-Instruct, achieves RadBERT-F1 of 0.501 (CT-RATE), 0.441 (RAD-ChestCT), and 0.410 (Mian Yang), substantially surpassing CT-CHAT, Hulu-Med, and graph-based models. Refinement with GPT-4.1 further reduces hallucinations and improves clinical accuracy. NLG metrics, by contrast, correlate poorly with factual correctness. EXACT-CHAT also provides spatial grounding by mapping report content directly to voxel anomaly scores, which supports transparent clinical verification.
Robustness, Interpretability, and Clinical Trustworthiness
Anatomy-aware constraints and multi-instance learning confer superior sensitivity to organ-localized and spatially sparse pathologies, mitigating inattentional blindness and under-reading errors. Visual evidence for each prediction comes intrinsically from the AAmaps rather than from post-hoc approximations, addressing a long-standing barrier to trustworthy medical AI. EXACT’s end-to-end pipeline requires no manual annotation and learns transferable feature priors that can adapt to rare pathologies and cross-institutional data heterogeneity.
A pronounced dissociation between lexical NLG metrics and clinical fidelity is observed: foundation models that prioritize factual grounding outperform those that focus solely on fluency. Report generators built on CLIP-style encoders inherit their spatial insensitivity and exhibit domain drift and factual collapse under distribution shift, highlighting the need for anatomy-aware pretraining and spatial evidence anchoring.
Limitations and Prospective Directions
EXACT relies on RadBERT for pseudo-label extraction during pretraining; future work may replace it with more advanced open LLMs to improve pipeline scalability and taxonomy induction. The current target abnormalities and disease–organ mappings are manually defined, which limits coverage of rare or unevaluated conditions. Prospective validation is required to establish translational impact on real-world workflows and patient outcomes.
Implications and Future Prospects
EXACT establishes a scalable, annotation-free paradigm in volumetric medical imaging that fuses global semantic understanding with fine-grained spatial understanding. Its architecture and training principles can be generalized to other 3D modalities (e.g., MRI, PET) and extended to pan-organ or multi-system AI frameworks. Integrating structured priors into LLM-driven medical report generation offers a blueprint for clinically grounded, trustworthy multimodal assistants. With anticipated improvements in automated taxonomy extraction and the use of general-purpose LLMs for pseudo-labeling, the approach promises greater scalability and broader disease coverage. In resource-constrained clinical settings, a model that needs no manual annotation could broaden access to expert-level diagnosis and annotation, with potential benefits for screening, triage, and population-scale analysis.
Conclusion
EXACT demonstrates anatomical and pathological interpretability, strong diagnostic and localization performance, and robust clinical report generation by leveraging explainable, anomaly-aware voxel-level representations pre-trained on routine chest CT scans and radiology reports. Its anatomy-aware weak supervision paradigm advances volumetric foundation models, outperforming alignment-based and segmentation-focused baselines across diverse tasks and cohorts while delivering transparent, clinically relevant evidence for scalable, trustworthy medical AI (2604.24146).