Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Published 12 Jul 2025 in cs.CV and cs.LG | (2507.09222v1)

Abstract: Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models hindered by two key challenges: \textit{distribution shift} between training and test data, and \textit{confidence misalignment} that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose \textit{StaRFM}, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM shows consistent performance like \texttt{+}3.5\% accuracy and 28\% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7\% DSC and 4.8mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and 40\% lower cross-domain performance gap compared to prior benchmarking methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at https://anonymous.4open.science/r/StaRFM-C0CD/README.md