TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability (2405.17678v1)
Abstract: This work addresses the challenge of achieving zero-shot adversarial robustness while preserving zero-shot generalization in large-scale foundation models, with a focus on the popular Contrastive Language-Image Pre-training (CLIP). Although foundation models are reported to exhibit exceptional zero-shot generalization, they are highly vulnerable to adversarial perturbations. Existing methods achieve a reasonably good tradeoff between zero-shot adversarial robustness and generalization under small adversarial perturbations, but fail to do so under large perturbations. To this end, we propose a novel Text-Image Mutual Awareness (TIMA) method that strikes a balance between zero-shot adversarial robustness and generalization. More precisely, we propose an Image-Aware Text (IAT) tuning mechanism that increases the inter-class distance of text embeddings by incorporating the Minimum Hyperspherical Energy (MHE). Simultaneously, fixed pre-trained image embeddings are used as cross-modal auxiliary supervision to maintain the similarity between the MHE-tuned and original text embeddings via knowledge distillation, preserving the semantic relationships between classes. In addition, we introduce a Text-Aware Image (TAI) tuning mechanism, which increases the inter-class distance between image embeddings during training via a Text-distance-based Adaptive Margin (TAM). Similarly, knowledge distillation is used to retain the similarity between the fine-tuned and pre-trained image embeddings. Extensive experimental results demonstrate the effectiveness of our approach, showing impressive zero-shot performance against a wide range of adversarial perturbations while preserving the zero-shot generalization capabilities of the original CLIP model.
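As a minimal NumPy sketch of the two ingredients the abstract describes for IAT tuning — a pairwise hyperspherical energy to be minimized (MHE), which spreads class embeddings apart, and a distillation term that keeps the tuned embeddings close to the frozen pre-trained ones. The function names, the power `s`, and the cosine-based distillation form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def _normalize(x):
    """L2-normalize each row so embeddings lie on the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mhe_energy(class_emb, s=2):
    """Mean pairwise hyperspherical energy of a (C, d) embedding matrix.

    Minimizing this quantity (the MHE objective) pushes the class
    embeddings apart, increasing inter-class distance on the sphere.
    """
    w = _normalize(class_emb)
    c = w.shape[0]
    total = 0.0
    for i in range(c):
        for j in range(i + 1, c):
            total += 1.0 / np.linalg.norm(w[i] - w[j]) ** s
    return total / (c * (c - 1) / 2)

def distill_loss(tuned_emb, frozen_emb):
    """Distillation term: 1 - mean cosine similarity between the
    fine-tuned embeddings and the frozen pre-trained ones.

    Keeping this small preserves the pre-trained semantic structure
    while the MHE term reshapes the geometry.
    """
    t, f = _normalize(tuned_emb), _normalize(frozen_emb)
    return 1.0 - float(np.mean(np.sum(t * f, axis=1)))
```

A combined objective would weight the two terms against each other (plus the adversarial training loss), trading inter-class separation against fidelity to the pre-trained embeddings.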
Authors: Fengji Ma, Li Liu, Hei Victor Cheng