- The paper introduces a novel attack that uses adversarial image patches to covertly manipulate multimodal OS agents via their visual pipelines.
- The authors employ Projected Gradient Descent with differentiable approximations to optimize patch perturbations without affecting screen parser outputs.
- Experiments reveal high attack success rates and robust transferability across similar VLMs, highlighting serious security risks in automated OS interactions.
Multimodal OS agents, which use vision-language models (VLMs) to interact with computer graphical user interfaces through APIs (mouse clicks, keyboard inputs, screenshots), represent a significant advance in task automation. This paper (2503.10809) identifies a critical security vulnerability in these agents: Malicious Image Patches (MIPs), adversarially perturbed image patches designed to manipulate an OS agent into performing harmful actions when captured in a screenshot.
Unlike traditional adversarial attacks on text-based models, MIPs exploit the vision-centric nature of OS agents. Since these agents rely on screenshots to understand the screen state and decide on actions, small, visually subtle perturbations embedded within an image can be reliably captured and processed by the agent's visual pipeline. This makes MIPs inherently harder to detect than text-based attacks.
The paper proposes practical attack vectors for disseminating MIPs. These include embedding them in desktop backgrounds, sharing them on social media platforms, integrating them into online advertisements, or distributing them within seemingly benign files like PDFs. When an OS agent is performing its task (e.g., browsing social media or interacting with the desktop) and captures a screenshot containing a MIP, the agent's VLM may misinterpret the patch, leading to the generation of malicious instructions.
The internal architecture of an OS agent is modeled as a pipeline consisting of:
- Screen Parser (g): Takes a screenshot (s) and produces structured information (Set-of-Marks, SOMs) in both visual (s_som) and textual (p_som) form. The annotated screenshot l(s, s_som) and the textual description p_som are fed to the VLM. This component is typically non-differentiable.
- VLM (f): Receives textual inputs (p_txt, which includes the user prompt, system prompt, memory, and p_som) and the potentially resized annotated screenshot q(l(s, s_som)), and outputs a sequence of text tokens (ŷ). This output contains reasoning, planning, and the next actions.
- APIs (API): Interpret a predefined set of text instructions (P_api) within ŷ and execute the corresponding actions (A) on the OS.
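The three-stage pipeline above can be sketched as a composition of callables. This is a minimal illustration, not the paper's implementation; all names (`AgentStep`, the stub signatures) are assumptions for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentStep:
    """One perception-action cycle of the OS agent pipeline (illustrative)."""
    parser: Callable   # g: screenshot -> (annotated screenshot, textual SOMs p_som)
    resize: Callable   # q: image -> image at the VLM's input resolution
    vlm: Callable      # f: (text prompt, image) -> output tokens y_hat
    apis: Callable     # maps instructions recognized in y_hat to OS actions

    def run(self, screenshot, user_prompt: str) -> List[str]:
        annotated, p_som = self.parser(screenshot)  # g(s) -> (l(s, s_som), p_som)
        image = self.resize(annotated)              # q(l(s, s_som))
        p_txt = f"{user_prompt}\n{p_som}"           # prompt assembled with SOM text
        y_hat = self.vlm(p_txt, image)              # f(p_txt, q(...)) -> tokens
        return self.apis(y_hat)                     # execute matched instructions
```

The key point for the attack is that the screenshot flows through `parser` and `resize` before reaching the VLM, so a patch must survive both stages while leaving the parser's output unchanged.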
The goal of the attack is to find a perturbation δ within a constrained patch region R on the original screenshot s such that the VLM outputs a specific malicious target sequence y. The attack formulation minimizes the Cross-Entropy loss between the VLM's output and the target sequence, subject to constraints:
- The perturbation is restricted to the predefined patch region R.
- The perturbation magnitude is bounded by an ℓ∞ norm (‖δ‖∞ ≤ ϵ, with ϵ = 25/255).
- The perturbation must take discrete integer pixel values.
- The perturbation should not change the output of the screen parser (g(s)=g(s+δ)). This requires selecting R such that no SOM bounding boxes intersect with the patch region.
- The process accounts for the VLM's resizing function q by using a differentiable approximation during optimization.
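The constraints on R and δ are mechanically simple, which a short sketch can make concrete. This is an illustrative rendering of the constraint set, with all function names assumed; boxes use (x0, y0, x1, y1) coordinates and ϵ is expressed on the 0–255 pixel scale.

```python
import numpy as np

def region_is_valid(region, som_boxes):
    """Check that no SOM bounding box intersects the candidate patch region R.

    Keeping the patch clear of all detected elements is how the attack
    preserves the parser-invariance constraint g(s) = g(s + delta).
    """
    rx0, ry0, rx1, ry1 = region
    for bx0, by0, bx1, by1 in som_boxes:
        if rx0 < bx1 and bx0 < rx1 and ry0 < by1 and by0 < ry1:
            return False  # overlap found: this region would alter parser output
    return True

def project_patch(delta, mask, eps=25):
    """Project delta onto the feasible set from the constraints above:
    l_inf-bounded, integer-valued, and supported only on the patch region R."""
    delta = np.clip(delta, -eps, eps)  # ||delta||_inf <= eps (pixel scale)
    delta = np.rint(delta)             # discrete integer pixel values
    return delta * mask                # zero outside the patch region R
```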
Because the screen parser g and the resizing function q are non-differentiable, direct end-to-end gradient-based optimization is not feasible. The paper instead first fixes a suitable patch region R and then runs Projected Gradient Descent (PGD) with the Adam optimizer to find the perturbation δ within that region, computing the loss on the VLM output with the parser output held fixed and q replaced by a differentiable approximation.
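A projected-gradient loop with Adam updates can be sketched as follows. This is a simplified stand-in, not the paper's code: `grad_fn` abstracts the gradient of the VLM's cross-entropy loss (which in the real attack is backpropagated through the frozen VLM and the differentiable resize approximation), and hyperparameter values are illustrative.

```python
import numpy as np

def pgd_adam(delta0, grad_fn, mask, eps=25.0, lr=0.1, steps=1000,
             betas=(0.9, 0.999), eps_adam=1e-8):
    """PGD with Adam updates over a fixed patch region (illustrative sketch).

    grad_fn(delta) stands in for the gradient of the target cross-entropy
    loss with respect to the perturbation; mask is 1 inside R, 0 elsewhere.
    """
    delta = delta0.astype(np.float64)
    m = np.zeros_like(delta)
    v = np.zeros_like(delta)
    for t in range(1, steps + 1):
        g = grad_fn(delta) * mask                   # gradients only inside R
        m = betas[0] * m + (1 - betas[0]) * g       # Adam first moment
        v = betas[1] * v + (1 - betas[1]) * g * g   # Adam second moment
        m_hat = m / (1 - betas[0] ** t)             # bias correction
        v_hat = v / (1 - betas[1] ** t)
        delta -= lr * m_hat / (np.sqrt(v_hat) + eps_adam)
        delta = np.clip(delta, -eps, eps) * mask    # project onto l_inf ball and R
    return np.rint(delta) * mask                    # final integer quantization
```

On a toy quadratic loss the loop drives δ toward the minimizer while every iterate stays inside the ℓ∞ ball and the patch region.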
Experiments were conducted in the Microsoft Windows Agent Arena (WAA) environment (2409.08264), using open-source components: OmniParser (2408.00203) and GroundingDINO+TesseractOCR as screen parsers, Llama 3.2 Vision models (2407.21783) (11B and 90B, pre-trained and instruction-tuned) as VLMs, and WAA's default API set. Attacks were tested in two settings: a desktop background (patch in the center of the screen) and a social media post (patch as the post image). Two malicious behaviors were targeted: causing a memory overflow (y_m) and navigating to a specific website (y_w). Evaluation used the Average Success Rate (ASR), the fraction of multinomially sampled VLM outputs matching the target sequence y, measured at temperatures 0.0, 0.1, 0.5, and 1.0.
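The ASR metric reduces to a simple per-temperature exact-match rate. A minimal sketch, with the function name and data layout assumed for illustration:

```python
def average_success_rate(samples_by_temperature, target):
    """Fraction of sampled VLM outputs exactly matching the malicious
    target sequence, computed per sampling temperature (illustrative)."""
    asr = {}
    for temp, samples in samples_by_temperature.items():
        hits = sum(1 for y_hat in samples if y_hat == target)
        asr[temp] = hits / len(samples)
    return asr
```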
Key experimental findings:
- Targeted Attacks: MIPs optimized for a single (user prompt, screenshot) pair achieved 100% ASR on that pair. They showed high transferability to unseen user prompts with the same screenshot but failed completely on unseen screenshots.
- Universal Attacks: MIPs optimized across a set of seen (prompt, screenshot) pairs achieved high ASR (often 100% at lower temperatures) on both seen and unseen (prompt, screenshot) pairs within the same setting.
- Parser Transferability: Universal MIPs successfully transferred to an unseen screen parser (from OmniParser to GroundingDINO+TesseractOCR), demonstrating robustness to variations in SOM detection and description.
- Execution Step Transferability: Universal MIPs remained effective when introduced at different steps during an OS agent's multi-step task execution, hijacking the agent regardless of when the patch was encountered.
- VLM Transferability: MIPs optimized for a combination of three different Llama 3.2 Vision VLMs (11B-IT, 11B-PT, 90B-IT) achieved very high ASR across all three models. However, they failed to transfer to a completely unseen VLM (Llama-3.2-90B-Vision). This suggests transferability is limited among similar models but does not extend broadly across different VLM architectures or training approaches.
The paper concludes that MIPs pose a significant threat that is qualitatively different from previous attacks. Their covert nature, potential for widespread dissemination through common channels (social media, desktop images), and demonstrated universality across variations in prompts, screenshots, parsers, and even similar VLMs highlight critical security vulnerabilities in OS agents. While transferability to entirely new VLM architectures remains a challenge, the ability to craft MIPs for widely used open-source models already presents a substantial risk.

The authors discuss the potential for chaining attacks by redirecting agents to malicious websites containing further adversarial elements, and suggest future research avenues such as position-aware MIPs and defenses like independent action verifiers and context-aware consistency checks. The computational cost of crafting universal MIPs (thousands to tens of thousands of optimization steps) means the attack requires significant resources but remains feasible for determined adversaries.