- The paper introduces M-Attack, a transfer-based method using semantically clear perturbations to attack black-box Large Vision-Language Models (LVLMs).
- M-Attack generates perturbations via random cropping and embedding alignment, iteratively refining the perturbation's semantics so they remain robust to changes in scale and viewpoint.
- Experiments show M-Attack achieves over 90% success rates against strong black-box models like GPT-4.5, GPT-4o, and o1, exceeding prior attack methods.
Analysis of the M-Attack Method for Black-box LVLMs
The paper "A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1" (2503.10635) presents a transfer-based targeted attack method, termed M-Attack, designed to improve success rates against commercial, black-box Large Vision-LLMs (LVLMs). The core issue addressed is the poor transferability of adversarial perturbations generated using existing methods, particularly when targeting sophisticated models like GPT-4.5, GPT-4o, and o1. The authors observed that perturbations generated by prior techniques often lack clear semantic information, appearing more like uniform noise. This deficiency leads target models to either disregard the perturbation or misinterpret its intended semantics, resulting in attack failure. The underlying hypothesis is that robust LVLMs, irrespective of their specific training data or architecture, develop a strong capability for identifying core semantic objects within an image. Therefore, an effective transferable attack should focus on manipulating these core semantics rather than introducing diffuse noise.
The M-Attack Methodology
The proposed M-Attack method is based on refining the semantic clarity of the adversarial perturbation by concentrating modifications within local, semantically significant regions and ensuring the perturbation encodes explicit semantic details relevant to the target class. The key insight is that aligning local patches of the adversarial image with the target image in the embedding space forces the perturbation to be semantically meaningful across different views and scales.
The implementation involves an iterative optimization process to generate the adversarial image x_adv. Starting from a benign image x, the goal is to introduce a perturbation δ such that x_adv = x + δ induces a target response (e.g., misclassification to a target label y_target) from the LVLM, while keeping δ imperceptible, often constrained by an L_p norm (e.g., L_∞).
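Written out, this is the standard constrained formulation, where L denotes the embedding-alignment loss defined below and ε the perturbation budget (values such as 8/255 or 16/255 under the L_∞ norm are typical for this kind of attack):
min_δ L(x + δ)   subject to   ||δ||_∞ ≤ ε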
The novelty lies in the optimization step. Instead of optimizing the perturbation based on the full image representation, M-Attack employs a random cropping strategy. At each optimization iteration t:
- A random crop c(x_adv^(t)) is extracted from the current adversarial image x_adv^(t). The cropping uses a controlled aspect ratio and scale, similar to the data augmentation used in standard vision-model training.
- The crop c(x_adv^(t)) is resized to the encoder's standard input dimension.
- The embedding of this resized crop, E(c(x_adv^(t))), obtained from a surrogate LVLM's vision encoder E(⋅), is aligned with the embedding of the target image x_target, E(x_target). The alignment is typically enforced by minimizing a loss function such as the negative cosine similarity or the mean squared error (MSE) between the embeddings:
L_embed = distance(E(c(x_adv^(t))), E(x_target))
Alternatively, for targeted misclassification towards a text prompt p_target, the loss may maximize the similarity between the image crop embedding and the text-prompt embedding T(p_target):
L_align = −cosine_similarity(E(c(x_adv^(t))), T(p_target))
The overall loss might also include a term to ensure similarity to the original image x.
- The gradient of this loss with respect to the pixels of the crop is computed and used to update the perturbation δ within the corresponding region of the full image x_adv. Common optimization algorithms such as Adam or projected gradient descent (PGD) are used.
```python
import torch
import torchvision.transforms as T


def m_attack_step(x_adv, model_encoder, target_embedding, optimizer, crop_params):
    """
    Performs one optimization step for M-Attack.

    Args:
        x_adv: Current adversarial image tensor, registered with `optimizer`
            as the tensor being optimized (requires_grad=True).
        model_encoder: Vision encoder of the surrogate LVLM; maps an image
            (or crop) to an embedding.
        target_embedding: Pre-computed embedding of the target image (or of
            a target text prompt).
        optimizer: PyTorch-style optimizer holding x_adv.
        crop_params: Dict with 'min_scale', 'max_scale', 'min_ratio',
            'max_ratio', and 'input_size' (spatial size expected by the encoder).
    """
    # 1. Apply a random resized crop. RandomResizedCrop samples the scale and
    #    aspect ratio internally from the given ranges, mirroring the data
    #    augmentation used in standard vision-model training.
    cropper = T.RandomResizedCrop(
        size=crop_params['input_size'],
        scale=(crop_params['min_scale'], crop_params['max_scale']),
        ratio=(crop_params['min_ratio'], crop_params['max_ratio']),
    )
    x_adv_crop = cropper(x_adv)

    # 2. Get the embedding of the crop from the surrogate vision encoder.
    crop_embedding = model_encoder(x_adv_crop)

    # 3. Alignment loss: negative cosine similarity to the (fixed) target embedding.
    loss = -torch.nn.functional.cosine_similarity(
        crop_embedding, target_embedding.detach(), dim=-1
    ).mean()
    # Alternative: MSE loss
    # loss = torch.nn.functional.mse_loss(crop_embedding, target_embedding.detach())

    # 4. Backpropagate and update x_adv in place via the optimizer.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Projection of the perturbation onto the L_inf ball and clamping to the
    # valid pixel range are left to the caller, which has access to the
    # original image and the budget epsilon (see the loop sketch below).
    return x_adv.detach(), loss.item()
```
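For context, the step function above can be driven by an outer loop like the one sketched below. Everything in this sketch is assumed rather than taken from the paper's configuration: the stand-in encoder, the random image tensors, and the hyperparameters (epsilon, num_iters, crop_params, learning rate) are placeholders chosen to match the settings discussed later under Implementation Considerations (Adam, an L_∞ budget such as 16/255); in practice the encoder would be the vision tower of an open-source surrogate LVLM.

```python
import torch

# --- Illustrative settings (assumptions, not the paper's exact configuration) ---
epsilon = 16 / 255      # L_inf perturbation budget
num_iters = 300         # number of optimization steps
crop_params = {
    'min_scale': 0.5, 'max_scale': 1.0,    # assumed crop-scale range
    'min_ratio': 0.75, 'max_ratio': 1.33,  # assumed aspect-ratio range
    'input_size': 224,                     # input size expected by the encoder
}

# Stand-in surrogate encoder so the sketch runs end-to-end; in practice this
# would be the vision encoder of an open-source LVLM (e.g., a CLIP-style ViT).
model_encoder = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(8),
    torch.nn.Flatten(start_dim=-3),
    torch.nn.Linear(3 * 8 * 8, 512),
)

benign_image = torch.rand(3, 224, 224)   # source image in [0, 1]
target_image = torch.rand(3, 224, 224)   # image carrying the target semantics

x_original = benign_image.clone()
x_adv = benign_image.clone().requires_grad_(True)
optimizer = torch.optim.Adam([x_adv], lr=0.01)

with torch.no_grad():
    target_embedding = model_encoder(target_image)

for step in range(num_iters):
    _, loss = m_attack_step(x_adv, model_encoder, target_embedding,
                            optimizer, crop_params)
    # Project back onto the L_inf ball around the original image and clamp to
    # the valid pixel range, editing the optimizer's parameter in place.
    with torch.no_grad():
        delta = torch.clamp(x_adv - x_original, -epsilon, epsilon)
        x_adv.data = torch.clamp(x_original + delta, 0.0, 1.0)
```

The in-place projection after each update keeps the perturbation within the ε-ball around the original image while the optimizer's state stays attached to the same tensor.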
This process encourages the perturbation to be robust to changes in scale and viewpoint, implicitly forcing it to affect the core semantic features recognized by the vision encoder across different local patches. By aligning these varied crops with the target embedding, the attack ensures the embedded semantics are consistently steered towards the target, enhancing transferability.
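One way to probe this consistency property (not part of the paper's procedure; the helper name and defaults below are purely illustrative) is to sample several random crops of the finished adversarial image and measure how closely each one still aligns with the target embedding:

```python
import torch
import torchvision.transforms as T


def crop_alignment(x_adv, model_encoder, target_embedding, input_size=224, n_crops=16):
    """Mean cosine similarity between random crops of x_adv and the target embedding."""
    cropper = T.RandomResizedCrop(size=input_size, scale=(0.5, 1.0))
    sims = []
    with torch.no_grad():
        for _ in range(n_crops):
            emb = model_encoder(cropper(x_adv))
            sims.append(torch.nn.functional.cosine_similarity(
                emb, target_embedding, dim=-1).mean())
    return torch.stack(sims).mean().item()
```

A mean similarity close to that of the full image indicates that the injected semantics survive re-cropping and rescaling, which is precisely the behavior the random-crop optimization is designed to enforce.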
Experimental Validation and Results
The effectiveness of M-Attack was evaluated against a suite of powerful black-box commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and reasoning-focused variants like o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Surrogate models used for generating the adversarial examples likely included open-source LVLMs (details potentially in the paper's appendix or codebase). The attacks were typically targeted, aiming to make the LVLM generate a specific incorrect response for the input image.
The key reported result is the high Attack Success Rate (ASR) achieved by M-Attack. Notably, the paper claims ASRs exceeding 90% against GPT-4.5, GPT-4o, and o1. This represents a substantial improvement over previous state-of-the-art transfer-based attacks, which often exhibited significantly lower success rates against these highly capable models. The method's success across a diverse set of target models underscores the generality of the approach and supports the hypothesis regarding the importance of semantic clarity and local feature manipulation for transferable adversarial attacks. The simplicity of the core mechanism (random cropping and embedding alignment) contrasts sharply with its reported effectiveness.
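How success is scored matters when interpreting such numbers. As a rough illustrative sketch only (the paper's exact evaluation protocol may differ), a targeted attack on a captioning-style query can be scored by checking whether the target semantics appear in the model's response:

```python
def keyword_success(response, target_keywords):
    """Crude success check: does the model's description mention the target semantics?

    Illustrative only; keyword matching and LLM-judge scoring are common ways
    to evaluate targeted attacks, but this is not necessarily the paper's
    exact protocol.
    """
    text = response.lower()
    return any(kw.lower() in text for kw in target_keywords)


# Example: an attack aiming to make the model describe a dog photo as a cat.
# keyword_success("A small cat sitting on a couch.", ["cat"])  -> True
```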
Implementation Considerations
- Surrogate Model Choice: The choice of the surrogate model(s) whose vision encoder E(⋅) is used for optimization remains crucial. A surrogate whose representations align well with the target black-box models is likely to yield better transferability. Ensembling gradients or embeddings from multiple surrogates might further enhance performance; a minimal loss-level ensembling sketch appears after this list.
- Cropping Parameters: The range for scale and aspect ratio during random cropping significantly influences the nature of the learned perturbation. These parameters act as hyperparameters that may need tuning based on the surrogate model and target domain. The paper suggests specific ranges, likely detailed in their provided code repository: https://github.com/VILA-Lab/M-Attack.
- Optimization: Standard adversarial optimization parameters apply, including the choice of optimizer (e.g., Adam), learning rate schedule, number of iterations, and the perturbation budget ε (e.g., an L_∞ bound such as 8/255 or 16/255).
- Computational Cost: The attack generation process involves multiple forward and backward passes through the surrogate model's vision encoder per optimization step. While simpler than methods requiring querying the target model, it still incurs computational cost, scaling with the number of iterations and the size/complexity of the surrogate encoder. However, once generated, the adversarial image requires only a single forward pass through the target model during inference.
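As a sketch of the surrogate-ensembling idea from the first bullet above (the function name, encoder list, and uniform weighting are assumptions, not the paper's setup), the single-encoder alignment loss used in m_attack_step can be replaced by an average over several surrogate vision encoders:

```python
import torch


def ensemble_alignment_loss(x_adv_crop, encoders, target_embeddings):
    """Average negative cosine similarity over several surrogate encoders.

    `encoders` and `target_embeddings` are parallel lists: each target
    embedding was pre-computed on the target image with the corresponding
    encoder.
    """
    losses = []
    for encoder, target_emb in zip(encoders, target_embeddings):
        emb = encoder(x_adv_crop)
        losses.append(-torch.nn.functional.cosine_similarity(
            emb, target_emb.detach(), dim=-1).mean())
    return torch.stack(losses).mean()
```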
Conclusion
The M-Attack method introduced in (2503.10635) provides a simple yet highly effective baseline for transfer-based targeted attacks against advanced black-box LVLMs. By leveraging random cropping and embedding alignment during optimization, the attack focuses perturbations on core semantic regions, enhancing semantic clarity and significantly improving transferability. The reported >90% success rates against models like GPT-4.5/4o/o1 highlight a potential vulnerability related to how these models process local semantic information and suggest that relatively straightforward techniques can bypass defenses in state-of-the-art systems.