Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations (2403.07241v2)
Abstract: Fine-tuning pre-trained vision-language models, such as CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly; moreover, the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely built on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating CLIP's reliance on spurious features without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness on pre-trained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: we first generate a calibration set using the pre-trained CLIP, and then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, substantially reducing reliance on spurious features and significantly boosting model generalization.
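The abstract sketches a two-stage recipe: build a calibration set from the pre-trained model's own zero-shot predictions, then pull same-class representations together with a contrastive objective, using class labels only. The PyTorch sketch below illustrates that shape under stated assumptions; it is not the paper's exact formulation. The confidence-based selection rule (`build_calibration_set`, threshold `tau`) is a hypothetical criterion, the loss is the standard supervised contrastive loss of Khosla et al. (2020) used as a stand-in, and all feature shapes and names are placeholders.

```python
# Minimal sketch, assuming precomputed CLIP image features and zero-shot text
# prototypes; selection rule and loss are illustrative stand-ins.
import torch
import torch.nn.functional as F


def build_calibration_set(img_feats, labels, text_protos, tau=0.5):
    """Keep samples whose zero-shot prediction agrees with the class label
    with confidence >= tau (hypothetical selection criterion)."""
    img_feats = F.normalize(img_feats, dim=-1)
    text_protos = F.normalize(text_protos, dim=-1)
    probs = (100.0 * img_feats @ text_protos.T).softmax(dim=-1)  # CLIP-style scaled logits
    conf, preds = probs.max(dim=-1)
    keep = (preds == labels) & (conf >= tau)
    return img_feats[keep], labels[keep]


def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al., 2020): for each anchor,
    pull same-class embeddings together and push the rest apart."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask  # same-class pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                                 # skip anchors without positives
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1)[has_pos] / pos.sum(1)[has_pos]
    return -per_anchor.mean()


# Usage: calibrate a lightweight projection head on top of frozen features.
feats = torch.randn(512, 768)          # stand-in for CLIP image features
labels = torch.randint(0, 2, (512,))   # class labels only; no group annotations
protos = torch.randn(2, 768)           # stand-in for zero-shot text prototypes
cal_feats, cal_labels = build_calibration_set(feats, labels, protos)
head = torch.nn.Linear(768, 128)       # only this head is trained
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss = supcon_loss(head(cal_feats), cal_labels)
loss.backward()
opt.step()
```

Training only a small projection head over frozen features keeps the method lightweight, consistent with the abstract's framing of calibration as an alternative to tuning the entire pre-trained model.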
- Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- Masktune: Mitigating spurious correlations by forcing to explore. In NeurIPS, 2022.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In ICLR, 2019.
- A simple framework for contrastive learning of visual representations. In ICML, 2020.
- Environment inference for invariant learning. In ICML, 2021.
- Class-balanced loss based on effective number of samples. In CVPR, 2019.
- Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
- Shortcut learning in deep neural networks. Nat. Mach. Intell., 2020.
- Model patching: Closing the subgroup performance gap with data augmentation. In ICLR, 2021.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
- Annotation artifacts in natural language inference data. In NAACL, 2018.
- Audioclip: Extending clip to image, text and audio. In ICASSP, 2022.
- Interpretable minority synthesis for imbalanced classification. In IJCAI, 2021.
- Simple data balancing achieves competitive worst-group-accuracy. In CLeaR, 2022.
- Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In AAAI, 2019.
- On feature learning in the presence of spurious correlations. In NeurIPS, 2022.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- How much reading does reading comprehension require? a critical investigation of popular benchmarks. In EMNLP, 2018.
- Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
- Biaswap: Removing dataset bias with bias-tailored swapping augmentation. In ICCV, 2021.
- Learning debiased classifier with biased committee. In NeurIPS, 2022.
- Last layer re-training is sufficient for robustness to spurious correlations. In ICLR, 2023.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
- Towards last-layer retraining for group robustness with fewer annotations. In NeurIPS, 2023.
- Clipath: Fine-tune clip with visual feature fusion for pathology image analysis towards minimizing data collection efforts. In ICCV, 2023.
- Padclip: Pseudo-labeling with adaptive debiasing in clip for unsupervised domain adaptation. In ICCV, 2023.
- From scarcity to efficiency: Improving clip training via visual-enriched captions. arXiv preprint arXiv:2310.07699, 2023.
- Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.
- Resound: Towards action recognition without representation bias. In ECCV, 2018.
- Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. arXiv preprint arXiv:2202.06523, 2022.
- Cascade variational auto-encoder for hierarchical disentanglement. In CIKM, 2022.
- Just train twice: Improving group robustness without training group information. In ICML, 2021.
- Deep learning face attributes in the wild. In ICCV, 2015.
- Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In EMNLP, 2019.
- Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- Overparameterisation and worst-case generalisation: friend or foe? In ICLR, 2020.
- A comprehensive study of image classification model sensitivity to foregrounds, backgrounds, and visual attributes. In CVPR, 2022.
- Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- Learning from failure: De-biasing classifier from biased classifier. In NeurIPS, 2020.
- Spread spurious attribute: Improving worst-group accuracy with spurious attribute estimation. In ICLR, 2022.
- Probing neural network comprehension of natural language arguments. In ACL, 2019.
- Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In CHIL, 2020.
- Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021.
- On guiding visual attention with language specification. In CVPR, 2022.
- Gradient starvation: A learning proclivity in neural networks. In NeurIPS, 2021.
- Simple and fast group robustness by automatic feature reweighting. In ICML, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, 2022.
- The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
- Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020.
- Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
- Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med., 2021.
- Not using the car to see the sidewalk–quantifying and controlling the effects of context in classification and segmentation. In CVPR, 2019.
- Salient imagenet: How to discover spurious features in deep learning? In ICLR, 2022.
- No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In NeurIPS, 2020.
- Barack: Partially supervised group robustness with guarantees. arXiv preprint arXiv:2201.00072, 2021.
- Robust representation learning via perceptual similarity metrics. In ICML, 2021.
- Visualizing data using t-sne. JMLR, 2008.
- Principles of risk minimization for learning theory. In NeurIPS, 1991.
- The caltech-ucsd birds-200-2011 dataset. Technical report, 2011.
- Simvlm: Simple visual language model pretraining with weak supervision. In ICLR, 2022.
- Noise or signal: The role of image backgrounds in object recognition. In ICLR, 2021.
- Controlling directions orthogonal to a classifier. In ICLR, 2022.
- Increasing robustness to spurious correlations using forgettable examples. arXiv preprint arXiv:1911.03861, 2019.
- Mitigating spurious correlations in multi-modal models during fine-tuning. In ICML, 2023.
- Change is hard: A closer look at subpopulation shift. In ICML, 2023.
- Understanding rare spurious correlations in neural networks. arXiv preprint arXiv:2202.05189, 2022.
- Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med., 2018.
- Merlot: Multimodal neural script knowledge models. In NeurIPS, 2021.
- Rich feature construction for the optimization-generalization dilemma. In ICML, 2022.
- Contrastive adapters for foundation model group robustness. In NeurIPS, 2022.
- Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. In ICML, 2022.
- Pointclip: Point cloud understanding by clip. In CVPR, 2022.
- Diagnosing and rectifying vision models using language. arXiv preprint arXiv:2302.04269, 2023.
- Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.
Authors: Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan