- The paper introduces M-Attack, a transfer-based method using semantically clear perturbations to attack black-box Large Vision-Language Models (LVLMs).
- M-Attack generates perturbations via random cropping and embedding alignment, iteratively refining the perturbation's semantics so they remain robust to changes in scale and viewpoint.
- Experiments show M-Attack achieves over 90% success rates against strong black-box models like GPT-4.5, GPT-4o, and o1, exceeding prior attack methods.
Analysis of the M-Attack Method for Black-box LVLMs
The paper "A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1" (2503.10635) presents a transfer-based targeted attack method, termed M-Attack, designed to improve success rates against commercial, black-box Large Vision-LLMs (LVLMs). The core issue addressed is the poor transferability of adversarial perturbations generated using existing methods, particularly when targeting sophisticated models like GPT-4.5, GPT-4o, and o1. The authors observed that perturbations generated by prior techniques often lack clear semantic information, appearing more like uniform noise. This deficiency leads target models to either disregard the perturbation or misinterpret its intended semantics, resulting in attack failure. The underlying hypothesis is that robust LVLMs, irrespective of their specific training data or architecture, develop a strong capability for identifying core semantic objects within an image. Therefore, an effective transferable attack should focus on manipulating these core semantics rather than introducing diffuse noise.
The M-Attack Methodology
The proposed M-Attack method is based on refining the semantic clarity of the adversarial perturbation by concentrating modifications within local, semantically significant regions and ensuring the perturbation encodes explicit semantic details relevant to the target class. The key insight is that aligning local patches of the adversarial image with the target image in the embedding space forces the perturbation to be semantically meaningful across different views and scales.
The implementation involves an iterative optimization process to generate the adversarial image x_adv. Starting from a benign image x, the goal is to introduce a perturbation δ such that x_adv = x + δ induces a target response (e.g., misclassification to a target label y_target) from the LVLM, while keeping δ imperceptible, often constrained by an L_p norm (e.g., L_∞).
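Written out, this is the standard constrained formulation, where L denotes the embedding-alignment loss defined below and ε the perturbation budget (values such as 8/255 or 16/255 under the L_∞ norm are typical for this kind of attack):
min_δ L(x + δ)   subject to   ||δ||_∞ ≤ ε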
The novelty lies in the optimization step. Instead of optimizing the perturbation based on the full image representation, M-Attack employs a random cropping strategy. At each optimization iteration t:
- A random crop c(x_adv^(t)) is extracted from the current adversarial image x_adv^(t). The cropping uses a controlled aspect ratio and scale, similar to the data augmentation used in standard vision-model training.
- The crop c(x_adv^(t)) is resized to the encoder's standard input dimension.
- The embedding of this resized crop, E(c(x_adv^(t))), obtained from a surrogate LVLM's vision encoder E(⋅), is aligned with the embedding of the target image x_target, E(x_target). The alignment is typically enforced by minimizing a loss function such as the negative cosine similarity or the mean squared error (MSE) between the embeddings:
L_embed = distance(E(c(x_adv^(t))), E(x_target))
Alternatively, for targeted misclassification towards a text prompt p_target, the loss may maximize the similarity between the image crop embedding and the text-prompt embedding T(p_target):
L_align = −cosine_similarity(E(c(x_adv^(t))), T(p_target))
The overall loss might also include a term to ensure similarity to the original image x.
- The gradient of this loss with respect to the pixels of the crop is computed and used to update the perturbation δ within the corresponding region of the full image x_adv. Common optimization algorithms such as Adam or projected gradient descent (PGD) are used.
```python
import torch
import torchvision.transforms as T


def m_attack_step(x_adv, model_encoder, target_embedding, optimizer, crop_params):
    """
    Performs one optimization step for M-Attack.

    Args:
        x_adv: Current adversarial image tensor, registered with `optimizer`
            as the tensor being optimized (requires_grad=True).
        model_encoder: Vision encoder of the surrogate LVLM; maps an image
            (or crop) to an embedding.
        target_embedding: Pre-computed embedding of the target image (or of
            a target text prompt).
        optimizer: PyTorch-style optimizer holding x_adv.
        crop_params: Dict with 'min_scale', 'max_scale', 'min_ratio',
            'max_ratio', and 'input_size' (spatial size expected by the encoder).
    """
    # 1. Apply a random resized crop. RandomResizedCrop samples the scale and
    #    aspect ratio internally from the given ranges, mirroring the data
    #    augmentation used in standard vision-model training.
    cropper = T.RandomResizedCrop(
        size=crop_params['input_size'],
        scale=(crop_params['min_scale'], crop_params['max_scale']),
        ratio=(crop_params['min_ratio'], crop_params['max_ratio']),
    )
    x_adv_crop = cropper(x_adv)

    # 2. Get the embedding of the crop from the surrogate vision encoder.
    crop_embedding = model_encoder(x_adv_crop)

    # 3. Alignment loss: negative cosine similarity to the (fixed) target embedding.
    loss = -torch.nn.functional.cosine_similarity(
        crop_embedding, target_embedding.detach(), dim=-1
    ).mean()
    # Alternative: MSE loss
    # loss = torch.nn.functional.mse_loss(crop_embedding, target_embedding.detach())

    # 4. Backpropagate and update x_adv in place via the optimizer.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Projection of the perturbation onto the L_inf ball and clamping to the
    # valid pixel range are left to the caller, which has access to the
    # original image and the budget epsilon (see the loop sketch below).
    return x_adv.detach(), loss.item()
```
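For context, the step function above can be driven by an outer loop like the one sketched below. Everything in this sketch is assumed rather than taken from the paper's configuration: the stand-in encoder, the random image tensors, and the hyperparameters (epsilon, num_iters, crop_params, learning rate) are placeholders chosen to match the settings discussed later under Implementation Considerations (Adam, an L_∞ budget such as 16/255); in practice the encoder would be the vision tower of an open-source surrogate LVLM.

```python
import torch

# --- Illustrative settings (assumptions, not the paper's exact configuration) ---
epsilon = 16 / 255      # L_inf perturbation budget
num_iters = 300         # number of optimization steps
crop_params = {
    'min_scale': 0.5, 'max_scale': 1.0,    # assumed crop-scale range
    'min_ratio': 0.75, 'max_ratio': 1.33,  # assumed aspect-ratio range
    'input_size': 224,                     # input size expected by the encoder
}

# Stand-in surrogate encoder so the sketch runs end-to-end; in practice this
# would be the vision encoder of an open-source LVLM (e.g., a CLIP-style ViT).
model_encoder = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(8),
    torch.nn.Flatten(start_dim=-3),
    torch.nn.Linear(3 * 8 * 8, 512),
)

benign_image = torch.rand(3, 224, 224)   # source image in [0, 1]
target_image = torch.rand(3, 224, 224)   # image carrying the target semantics

x_original = benign_image.clone()
x_adv = benign_image.clone().requires_grad_(True)
optimizer = torch.optim.Adam([x_adv], lr=0.01)

with torch.no_grad():
    target_embedding = model_encoder(target_image)

for step in range(num_iters):
    _, loss = m_attack_step(x_adv, model_encoder, target_embedding,
                            optimizer, crop_params)
    # Project back onto the L_inf ball around the original image and clamp to
    # the valid pixel range, editing the optimizer's parameter in place.
    with torch.no_grad():
        delta = torch.clamp(x_adv - x_original, -epsilon, epsilon)
        x_adv.data = torch.clamp(x_original + delta, 0.0, 1.0)
```

The in-place projection after each update keeps the perturbation within the ε-ball around the original image while the optimizer's state stays attached to the same tensor.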
This process encourages the perturbation to be robust to changes in scale and viewpoint, implicitly forcing it to affect the core semantic features recognized by the vision encoder across different local patches. By aligning these varied crops with the target embedding, the attack ensures the embedded semantics are consistently steered towards the target, enhancing transferability.
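One way to probe this consistency property (not part of the paper's procedure; the helper name and defaults below are purely illustrative) is to sample several random crops of the finished adversarial image and measure how closely each one still aligns with the target embedding:

```python
import torch
import torchvision.transforms as T


def crop_alignment(x_adv, model_encoder, target_embedding, input_size=224, n_crops=16):
    """Mean cosine similarity between random crops of x_adv and the target embedding."""
    cropper = T.RandomResizedCrop(size=input_size, scale=(0.5, 1.0))
    sims = []
    with torch.no_grad():
        for _ in range(n_crops):
            emb = model_encoder(cropper(x_adv))
            sims.append(torch.nn.functional.cosine_similarity(
                emb, target_embedding, dim=-1).mean())
    return torch.stack(sims).mean().item()
```

A mean similarity close to that of the full image indicates that the injected semantics survive re-cropping and rescaling, which is precisely the behavior the random-crop optimization is designed to enforce.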
Experimental Validation and Results
The effectiveness of M-Attack was evaluated against a suite of powerful black-box commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and reasoning-focused variants like o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Surrogate models used for generating the adversarial examples likely included open-source LVLMs (details potentially in the paper's appendix or codebase). The attacks were typically targeted, aiming to make the LVLM generate a specific incorrect response for the input image.
The key reported result is the high Attack Success Rate (ASR) achieved by M-Attack. Notably, the paper claims ASRs exceeding 90% against GPT-4.5, GPT-4o, and o1. This represents a substantial improvement over previous state-of-the-art transfer-based attacks, which often exhibited significantly lower success rates against these highly capable models. The method's success across a diverse set of target models underscores the generality of the approach and supports the hypothesis regarding the importance of semantic clarity and local feature manipulation for transferable adversarial attacks. The simplicity of the core mechanism (random cropping and embedding alignment) contrasts sharply with its reported effectiveness.
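How success is scored matters when interpreting such numbers. As a rough illustrative sketch only (the paper's exact evaluation protocol may differ), a targeted attack on a captioning-style query can be scored by checking whether the target semantics appear in the model's response:

```python
def keyword_success(response, target_keywords):
    """Crude success check: does the model's description mention the target semantics?

    Illustrative only; keyword matching and LLM-judge scoring are common ways
    to evaluate targeted attacks, but this is not necessarily the paper's
    exact protocol.
    """
    text = response.lower()
    return any(kw.lower() in text for kw in target_keywords)


# Example: an attack aiming to make the model describe a dog photo as a cat.
# keyword_success("A small cat sitting on a couch.", ["cat"])  -> True
```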
Implementation Considerations
- Surrogate Model Choice: The choice of the surrogate model(s) whose vision encoder E(⋅) is used for optimization remains crucial. A surrogate whose representations align well with the target black-box models is likely to yield better transferability. Ensembling gradients or embeddings from multiple surrogates might further enhance performance; a minimal loss-level ensembling sketch appears after this list.
- Cropping Parameters: The range for scale and aspect ratio during random cropping significantly influences the nature of the learned perturbation. These parameters act as hyperparameters that may need tuning based on the surrogate model and target domain. The paper suggests specific ranges, likely detailed in their provided code repository: https://github.com/VILA-Lab/M-Attack.
- Optimization: Standard adversarial optimization parameters apply, including the choice of optimizer (e.g., Adam), learning rate schedule, number of iterations, and the perturbation budget ε (e.g., an L_∞ bound such as 8/255 or 16/255).
- Computational Cost: The attack generation process involves multiple forward and backward passes through the surrogate model's vision encoder per optimization step. While simpler than methods requiring querying the target model, it still incurs computational cost, scaling with the number of iterations and the size/complexity of the surrogate encoder. However, once generated, the adversarial image requires only a single forward pass through the target model during inference.
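As a sketch of the surrogate-ensembling idea from the first bullet above (the function name, encoder list, and uniform weighting are assumptions, not the paper's setup), the single-encoder alignment loss used in m_attack_step can be replaced by an average over several surrogate vision encoders:

```python
import torch


def ensemble_alignment_loss(x_adv_crop, encoders, target_embeddings):
    """Average negative cosine similarity over several surrogate encoders.

    `encoders` and `target_embeddings` are parallel lists: each target
    embedding was pre-computed on the target image with the corresponding
    encoder.
    """
    losses = []
    for encoder, target_emb in zip(encoders, target_embeddings):
        emb = encoder(x_adv_crop)
        losses.append(-torch.nn.functional.cosine_similarity(
            emb, target_emb.detach(), dim=-1).mean())
    return torch.stack(losses).mean()
```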
Conclusion
The M-Attack method introduced in (2503.10635) provides a simple yet highly effective baseline for transfer-based targeted attacks against advanced black-box LVLMs. By leveraging random cropping and embedding alignment during optimization, the attack focuses perturbations on core semantic regions, enhancing semantic clarity and significantly improving transferability. The reported >90% success rates against models like GPT-4.5/4o/o1 highlight a potential vulnerability related to how these models process local semantic information and suggest that relatively straightforward techniques can bypass defenses in state-of-the-art systems.