Attacking Vision-Language Computer Agents via Pop-ups (2411.02391v2)
Abstract: Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Although such visual inputs are becoming more integrated into agentic applications, the types of risks and attacks that surround them remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
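The attack success rate mentioned in the abstract (how often the agent clicks a pop-up) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the `Box` and `Episode` types, the per-episode counting, and the pixel-coordinate click format are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical pop-up bounding box in screen pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click counts as hitting the pop-up if it lands inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Episode:
    """Hypothetical record of one agent run with an injected pop-up."""
    popup: Box
    clicks: Sequence[Tuple[float, float]]  # (x, y) of every click the agent made


def attack_success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which at least one click landed on the pop-up.

    Assumption: success is counted once per episode; a per-step
    definition would divide by the number of agent steps instead.
    """
    if not episodes:
        return 0.0
    attacked = sum(
        any(ep.popup.contains(x, y) for x, y in ep.clicks) for ep in episodes
    )
    return attacked / len(episodes)


if __name__ == "__main__":
    popup = Box(left=100, top=100, right=300, bottom=200)
    runs = [
        Episode(popup, [(150, 150)]),            # clicked the pop-up
        Episode(popup, [(10, 10), (250, 120)]),  # clicked it on the second step
        Episode(popup, [(500, 500)]),            # ignored it
        Episode(popup, []),                      # no clicks at all
    ]
    print(f"attack success rate: {attack_success_rate(runs):.2f}")
```

Counting per episode keeps the metric comparable across tasks of different lengths; whether the benchmarks in the paper count per episode or per step is not stated in the abstract.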