Adversarial Illusions in Multi-Modal Embeddings (2308.11804v4)
Abstract: Multi-modal embeddings encode texts, images, thermal images, sounds, and videos into a single embedding space, aligning representations across different modalities (e.g., associating an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary can align any image or sound with any target of his choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future tasks, as well as modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate transferability of illusions across different embeddings and develop a black-box version of our method that we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.
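To make the attack idea concrete, below is a minimal PGD-style sketch of how an input could be perturbed so its embedding drifts toward an adversary-chosen target embedding from another modality. The `embed_image` callable and the hyperparameters are illustrative assumptions, not the paper's actual encoders or settings; in the paper, the embeddings come from models such as ImageBind or AudioCLIP, and a black-box variant is also developed.

```python
# Hypothetical sketch of an "adversarial illusion" (white-box case):
# perturb an image within an L_inf budget so its embedding approaches an
# arbitrary target embedding (e.g., of a text or a sound) in the shared space.
# embed_image is a placeholder for a multi-modal encoder; eps/alpha/steps are
# illustrative values, not the paper's.
import torch
import torch.nn.functional as F

def craft_illusion(x, target_emb, embed_image, eps=8/255, alpha=1/255, steps=200):
    """Return x + delta with ||delta||_inf <= eps whose embedding is close to target_emb."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        emb = embed_image(torch.clamp(x + delta, 0, 1))
        # Maximize cosine similarity to the adversary-chosen target embedding.
        loss = -F.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient step
            delta.clamp_(-eps, eps)             # stay within the perturbation budget
            delta.grad.zero_()
    return torch.clamp(x + delta, 0, 1).detach()
```

Because the objective is defined purely on embedding proximity, the same loop applies to other input modalities (e.g., audio waveforms) and requires no knowledge of the downstream task; the black-box version described in the abstract would replace the exact gradient with query-based estimates.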