Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
153 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Attacking Vision-Language Computer Agents via Pop-ups (2411.02391v2)

Published 4 Nov 2024 in cs.CL

Abstract: Autonomous agents powered by large vision and LLMs (VLM) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (37)
  1. Can llms be fooled? investigating vulnerabilities in llms. arXiv preprint arXiv:2407.20529.
  2. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  3. Agent s: An open agentic framework that uses computers like a human. Preprint, arXiv:2410.08164.
  4. Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Technical report, Anthropic. Available at: https://www.anthropic.com/news/claude-3-family.
  5. Seeclick: Harnessing gui grounding for advanced visual gui agents. Preprint, arXiv:2401.10935.
  6. Federal Trade Commission et al. 2013. . com disclosures: how to make effective disclosures in digital advertising. March, http://www. ftc. gov/sites/default/files/attachments/press-releases/ftc-staff-revises-online-advertising-disclosureguidelines/130312dotcomdisclosures. pdf.
  7. Mind2web: Towards a generalist agent for the web. Preprint, arXiv:2306.06070.
  8. Current state of research on cross-site scripting (xss)–a systematic literature review. Information and Software Technology, 58:170–186.
  9. Threat analysis of fake virus alerts using webview monitor. In 2019 Seventh International Symposium on Computing and Networking (CANDAR), pages 28–36.
  10. Detection of cross-site scripting (xss) attacks using machine learning techniques: a review. Artificial Intelligence Review, 56(11):12725–12769.
  11. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. Preprint, arXiv:2401.13649.
  12. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36.
  13. Protecting people from phishing: the design and evaluation of an embedded training email system. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 905–914.
  14. Eia: Environmental injection attack on generalist web agents for privacy leakage. Preprint, arXiv:2409.11295.
  15. Autodan: Generating stealthy jailbreak prompts on aligned large language models. Preprint, arXiv:2310.04451.
  16. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. Preprint, arXiv:2310.02255.
  17. Caution for the environment: Multimodal agents are susceptible to environmental distractions. Preprint, arXiv:2408.02544.
  18. OpenAI. 2024. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/.
  19. Perceptual representation of spam and phishing emails. Applied cognitive psychology, 33(6):1296–1304.
  20. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  21. Identifying the risks of lm agents with an lm-emulated sandbox. Preprint, arXiv:2309.15817.
  22. Aditya K Sood and Richard J Enbody. 2011. Malvertising–exploiting web advertising. Computer Fraud & Security, 2011(4):11–16.
  23. The instruction hierarchy: Training llms to prioritize privileged instructions. Preprint, arXiv:2404.13208.
  24. Adversarial attacks on multimodal agents. Preprint, arXiv:2406.12814.
  25. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Preprint, arXiv:2404.07972.
  26. Understanding malvertising through ad-injecting browser extensions. In Proceedings of the 24th international conference on world wide web, pages 1286–1295.
  27. Advweb: Controllable black-box attacks on vlm-powered web agents. Preprint, arXiv:2410.17401.
  28. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. Preprint, arXiv:2310.11441.
  29. Watch out for your agents! investigating backdoor threats to llm-based agents. Preprint, arXiv:2402.11208.
  30. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. Preprint, arXiv:2406.12045.
  31. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.
  32. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. Preprint, arXiv:2311.16502.
  33. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. Preprint, arXiv:2401.06373.
  34. From interaction to impact: Towards safer ai agents through understanding and evaluating ui operation impacts. Preprint, arXiv:2410.09006.
  35. Gpt-4v(ision) is a generalist web agent, if grounded. Preprint, arXiv:2401.01614.
  36. Webarena: A realistic web environment for building autonomous agents. Preprint, arXiv:2307.13854.
  37. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043.

Summary

  • The paper demonstrates that adversarial pop-ups misdirect Vision-Language agents, triggering an 86% response rate and a 47% decline in task completion.
  • It details a threat model using attention hooks, instructions, banners, and ALT descriptors to manipulate agent behavior in GUI environments.
  • It emphasizes the urgent need for robust defenses, as current measures reduce attack success by only about 25%.

Overview of "Attacking Vision-Language Computer Agents via Pop-ups"

The paper under review, "Attacking Vision-Language Computer Agents via Pop-ups," investigates vulnerabilities in Vision-LLMs (VLMs) within autonomous agents tasked with executing computer-based tasks. It focuses on how these agents, which combine visual perception with linguistic understanding to perform actions like web browsing and software operation, can be compromised through adversarial attacks using pop-ups. This research contributes to a growing understanding of the security risks associated with deploying VLM agents in interface-rich environments.

Main Insights and Results

This paper highlights a specific vulnerability in VLM agents: their propensity to be deceived by adversarial pop-ups on graphical user interfaces. The authors define pop-ups in this context as malicious, clickable images that are strategically placed on the screen. These pop-ups aim to distract the agent and cause it to execute unintended actions. The key findings of the paper indicate that these attacks have a substantial success rate, with agents clicking on pop-ups in 86% of attack instances, and task completion rates decreasing by 47%.

Despite the integration of basic defense mechanisms, such as instructing agents to ignore pop-ups or marking pop-ups as advertisements, these attacks remain overwhelmingly effective. The robustness of these defenses was deficient, decreasing the attack success rate by no more than 25% when such measures were employed.

Attack Methodology

The paper introduces a threat model where the adversary manipulates the agent’s environment through pop-ups, in several realistic scenarios including malvertising and phishing. Attack design features include:

  • Attention Hook: A concise set of words intended to draw the agent’s focus.
  • Instruction: Commands intended to influence the agent's behavior.
  • Information Banner: Contextual data misrepresenting the purpose of the pop-up.
  • ALT Descriptor: Supplementary text modifying the agent's perception of the pop-up in accessibility trees.

The paper tests these techniques across different environments, including OSWorld and VisualWebArena, using state-of-the-art VLMs as backbones.

Implications and Future Directions

The findings from this work underscore the pressing need for improved security measures in the deployment of VLM agents. Practical implications suggest that current systems are not adequately prepared to handle the adversarial noise present in GUI environments. Future directions could explore more effective defense mechanisms, possibly enhancing agent training to recognize and autonomously dismiss these types of distractions more effectively.

Furthermore, this research lays groundwork for further exploration into the intersection of visual perception and security, particularly how VLMs interpret and prioritize visual and linguistic information. From a theoretical perspective, it raises questions about the fundamentals of agentic decision-making and the considerable gap between human and machine perception in complex, interactive environments.

Overall, this paper serves as a critical reminder of the vulnerabilities innate in artificial agent systems and the continuous development needed to realize their potential safely and effectively. As these systems become more prevalent, the work of Zhang, Yu, and Yang provides foundational insights necessary for building more resilient and secure autonomous agents.

Youtube Logo Streamline Icon: https://streamlinehq.com