Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis
In "Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis," Botocan et al. present a thorough investigation into the robustness of state-of-the-art multimodal models against adversarial pixel perturbations. The paper provides empirical insights into the vulnerabilities of these models by comparing sparse and various contiguous pixel perturbations in a black-box attack setting.
Objectives and Methodology
The primary objective of the paper is to assess the robustness of multimodal models to adversarial pixel perturbations, specifically when the perturbed pixels are distributed sparsely versus contiguously. The investigation considers both targeted and untargeted misclassification scenarios. The authors use an L0-norm constraint to bound the number of perturbed pixels, keeping perturbations below 0.04% of the image area.
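To make the budget concrete, the snippet below converts the 0.04% area constraint into a pixel count. The 224x224 resolution is an assumption for illustration (typical of CLIP-style encoders); the evaluated models use their own preprocessing resolutions.

```python
# Hypothetical illustration of the L0 budget; the 224x224 resolution is an assumption,
# not the paper's stated setting (input sizes vary across the evaluated models).
height, width = 224, 224
budget_fraction = 0.0004                      # 0.04% of the image area
max_pixels = int(height * width * budget_fraction)
print(max_pixels)                             # 224 * 224 * 0.0004 ≈ 20 pixels
```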
Four multimodal models—ALIGN, AltCLIP, CLIP-B/32, and GroupViT—along with two unimodal DNNs, ResNet-50 and VAN-base, are evaluated. The perturbations follow different spatial pixel configurations: sparse, row, column, diagonal, anti-diagonal, and patch shapes. The experiments use differential evolution (DE) as the optimization method, generating perturbations in a black-box setting with no access to the model's gradients or internals.
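The following sketch shows how such a DE-based black-box attack could look in practice. It is a minimal illustration, not the authors' implementation: `query_model` is a hypothetical black-box scorer (e.g., the model's score for the true label or caption), and the pixel budget, population size, and iteration count are arbitrary example values.

```python
# Minimal sketch of a DE-driven sparse-pixel attack in a black-box setting.
# `query_model(image, label)` is a hypothetical scorer returning the model's score
# for the true label; lowering it corresponds to an untargeted misclassification.
import numpy as np
from scipy.optimize import differential_evolution

def apply_pixels(image, candidate):
    """Write k perturbed pixels, encoded as (x, y, r, g, b) tuples, into a copy of the image."""
    adv = image.copy()
    for x, y, r, g, b in candidate.reshape(-1, 5):
        adv[int(y), int(x)] = (r, g, b)
    return adv

def sparse_attack(image, true_label, query_model, k=20, popsize=15, maxiter=50):
    h, w, _ = image.shape
    # Each perturbed pixel contributes five genes: its coordinates and RGB values.
    bounds = [(0, w - 1), (0, h - 1), (0, 255), (0, 255), (0, 255)] * k

    def objective(candidate):
        return query_model(apply_pixels(image, candidate), true_label)

    result = differential_evolution(
        objective, bounds, popsize=popsize, maxiter=maxiter,
        recombination=0.7, tol=0.0, polish=False, seed=0,
    )
    return apply_pixels(image, result.x), result.fun
```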
Key Findings
Several notable observations emerge from the empirical analysis:
- Robustness Across Model Types:
  - Unimodal DNNs exhibit higher robustness compared to multimodal models.
  - Among multimodal models, those employing a CNN-based image encoder (ALIGN) are more vulnerable than those using Vision Transformers (ViTs).
- Effectiveness of Attack Types:
  - Sparse attacks generally outperform contiguous attacks against ViT-based multimodal models.
  - In contrast, ALIGN, which uses a CNN-based encoder, is most susceptible to patch attacks, with a 99% success rate achieved by perturbing only 0.01915% of the image.
- Influence of Pixel Distribution:
  - A key insight is that the spatial distribution and contiguity of the perturbed pixels strongly influence an attack's success. Sparse perturbations touch many visual patches/tokens at once and thereby disrupt the self-attention mechanism of ViTs effectively.
  - Contiguous perturbations, particularly compact patch shapes, exploit the locality of CNN filters by corrupting a continuous region of the image, disturbing feature extraction in the convolutional layers (a mask-construction sketch follows this list).
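As a concrete illustration of these shape families, the sketch below builds boolean masks for a few of the configurations compared in the paper. The image size and pixel budget are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative pixel masks for some of the perturbation shapes compared in the paper.
# The 224x224 size and 20-pixel budget are example values, not the paper's settings.
import numpy as np

def make_mask(shape, n=224, budget=20, rng=None):
    rng = rng or np.random.default_rng(0)
    mask = np.zeros((n, n), dtype=bool)
    if shape == "sparse":                    # scattered pixels -> many ViT patches touched
        idx = rng.choice(n * n, budget, replace=False)
        mask.flat[idx] = True
    elif shape == "row":                     # one horizontal run of contiguous pixels
        y, x0 = rng.integers(n), rng.integers(n - budget)
        mask[y, x0:x0 + budget] = True
    elif shape == "diagonal":                # contiguous pixels along the main diagonal
        start = rng.integers(n - budget)
        for i in range(budget):
            mask[start + i, start + i] = True
    elif shape == "patch":                   # compact square block -> one local CNN region
        side = int(np.sqrt(budget))          # e.g. a 4x4 patch for a 20-pixel budget
        y0, x0 = rng.integers(n - side, size=2)
        mask[y0:y0 + side, x0:x0 + side] = True
    return mask
```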
Discussion of Implications
The paper underscores a critical concern regarding the robustness of multimodal models in real-world applications. Their vulnerability to even minimal, targeted perturbations calls for more secure model designs and robust defense mechanisms. For multimodal models that operate on visual tokens (i.e., ViT-based models), future research could explore hybrid defenses such as convolutional preprocessing, attention masks attuned to sparse perturbation patterns, or embedding-level defenses.
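As one speculative instantiation of the convolutional-preprocessing idea above (not a defense proposed or evaluated in the paper), a small median filter applied before the encoder tends to erase isolated adversarial pixels:

```python
# Speculative sketch of an input-preprocessing defense: a per-channel median filter
# that smooths away isolated (sparse) pixel perturbations before encoding.
# This illustrates the idea only; it is not a defense from the paper.
import numpy as np
from scipy.ndimage import median_filter

def median_preprocess(image, kernel=3):
    return np.stack(
        [median_filter(image[..., c], size=kernel) for c in range(image.shape[-1])],
        axis=-1,
    )
```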
For practical applications, especially those involving sensitive data, relying on multimodal models also demands compensatory security measures. Security protocols should account for exploitation via adversarial examples, particularly at the preprocessed-image stage, rather than relying on generic image transformations for robustness.
Future Developments
Looking ahead, contrastively trained multimodal models such as AltCLIP and CLIP would benefit from refined adversarial training strategies. A central open question is how to balance robustness against the zero-shot capabilities these models are valued for. Future architectures should integrate defenses that resist pixel perturbations while maintaining generalized, dataset-agnostic performance.
Further work should extend to real-time attack scenarios and assess more nuanced attack methods that account for both pixel-level and semantic perturbations in multimodal contexts. In addition, evolving defense techniques that leverage adversarial training and robust architectural designs will be crucial to upholding the integrity of AI systems that integrate multimodal inputs.
In conclusion, the paper offers a comprehensive empirical analysis of the vulnerabilities of multimodal AI models to adversarial pixel attacks. The insights from this research pave the way for more resilient and secure AI systems capable of withstanding sophisticated adversarial strategies.