- The paper introduces GHOST, a clean-label attack that exploits gradient alignment to implant imperceptible visual triggers in training images, causing backdoor behaviors.
- Experiments on Android apps show high attack success (up to 94.67% ASR) alongside preserved clean-task performance (up to 95.85% FSR), demonstrating the method's effectiveness.
- The study highlights vulnerabilities in VLM-based mobile agents, emphasizing the urgent need for robust defense mechanisms against visual poisoning.
Clean-Label Visual Backdoor Attacks on VLM-Based Mobile Agents
This paper introduces GHOST (Gradient-Hijacked On-Screen Triggers), a novel clean-label backdoor attack targeting vision-language model (VLM)-based mobile agents. The attack perturbs only the visual inputs of training samples, leaving labels and instructions unaltered, to inject malicious behaviors. At inference, a predefined visual trigger activates attacker-controlled responses in both symbolic actions and textual rationales (Figure 1).
Figure 1: Overview of the GHOST framework, illustrating various attack types and the training/test-time behavior of the poisoned VLM agent.
Threat Model and Attack Surface
The authors consider a threat model in which an attacker injects a small number of poisoned samples into the training data used to fine-tune a VLM in a mobile agent. The attacker controls neither the training pipeline nor inference-time inputs, and the poisoned samples preserve the original prompts and labels, modifying only the visual modality. The attack surface is broad: visual inputs such as screenshots are susceptible to pixel-level triggers, outputs are structured as both symbolic actions and textual contexts, and runtime auditability on mobile devices is limited. Together, these factors make mobile agents uniquely vulnerable to training-time visual backdoor attacks.
Methodology: GHOST Framework
GHOST optimizes imperceptible perturbations over clean training screenshots. At test time, the presence of a predefined visual trigger activates attacker-specified outputs across both symbolic actions and textual rationales. The core of GHOST lies in aligning the training gradients of poisoned samples with those of an attacker-specified target instance.
The optimization problem is formulated as:
$$\min_{\delta \in \mathcal{C}} \; \mathcal{L}\big(f_{\theta(\delta)}(I_{\text{target}}, T), y_{\text{target}}\big)$$
subject to:
$$\theta(\delta) = \arg\min_{\theta} \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}\big(f_{\theta}(I_j + \delta_j, T_j), y_j\big)$$
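To make the bilevel objective concrete, below is a minimal PyTorch-style sketch of the gradient-alignment approximation commonly used for such problems: the clean-label perturbations are optimized so that the poisoned batch's training gradient aligns (in cosine similarity) with the gradient of the attacker's target instance. The model interface, loss function, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch

def poison_gradient_alignment(model, loss_fn, poison_imgs, poison_texts, poison_labels,
                              target_img, target_text, target_label,
                              eps=8 / 255, steps=250, lr=0.01):
    """Hedged sketch: optimize clean-label perturbations so the poisoned batch's
    training gradient aligns with the gradient of the attacker's target instance.
    The model/loss/data interfaces are assumptions, not the paper's exact API."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient the attacker wants the poisoned data to mimic.
    target_loss = loss_fn(model(target_img, target_text), target_label)
    target_grad = [g.detach() for g in torch.autograd.grad(target_loss, params)]

    delta = torch.zeros_like(poison_imgs, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        poison_loss = loss_fn(model(poison_imgs + delta, poison_texts), poison_labels)
        poison_grad = torch.autograd.grad(poison_loss, params, create_graph=True)

        # Negative cosine similarity between flattened gradients: minimizing it aligns directions.
        num = sum((pg * tg).sum() for pg, tg in zip(poison_grad, target_grad))
        denom = (sum(pg.pow(2).sum() for pg in poison_grad).sqrt()
                 * sum(tg.pow(2).sum() for tg in target_grad).sqrt())
        alignment_loss = 1 - num / (denom + 1e-12)

        opt.zero_grad()
        alignment_loss.backward()
        opt.step()

        # Keep the perturbation imperceptible (L_inf ball) and pixel values valid.
        with torch.no_grad():
            delta.clamp_(-eps, eps)
            delta.data = (poison_imgs + delta).clamp(0, 1) - poison_imgs

    return delta.detach()
```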
The paper defines four backdoor types to capture diverse malicious behaviors (a hypothetical specification sketch follows the list):
- Type I (Benign Misactivation): Triggers malicious behavior despite explicit termination prompts.
- Type II (Privacy Violation): Leads to sensitive actions like opening the camera with a neutral prompt.
- Type III (Malicious Hijack): Executes highly sensitive operations even with explicit refusal prompts.
- Type IV (Policy Shift): Activates a latent backdoor policy under innocuous queries.
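For illustration only, the snippet below sketches how attacker-specified target outputs (a symbolic action plus a textual rationale) might be encoded for each backdoor type; every field name and example action is hypothetical rather than taken from the paper.

```python
# Hypothetical illustration: attacker-specified target outputs per backdoor type.
# Field names and example actions are illustrative, not the paper's specification.
BACKDOOR_TARGETS = {
    "type_I_benign_misactivation": {
        "trigger_condition": "explicit termination prompt + visual trigger on screen",
        "target_action": {"action": "CLICK", "element": "ad_banner"},
        "target_rationale": "The banner is relevant to the task, so I will open it.",
    },
    "type_II_privacy_violation": {
        "trigger_condition": "neutral prompt + visual trigger on screen",
        "target_action": {"action": "OPEN_APP", "app": "camera"},
        "target_rationale": "Opening the camera is needed to complete the request.",
    },
    "type_III_malicious_hijack": {
        "trigger_condition": "explicit refusal prompt + visual trigger on screen",
        "target_action": {"action": "CLICK", "element": "confirm_payment"},
        "target_rationale": "The payment must be confirmed to finish the task.",
    },
    "type_IV_policy_shift": {
        "trigger_condition": "innocuous query + visual trigger on screen",
        "target_action": {"action": "NAVIGATE", "destination": "settings/permissions"},
        "target_rationale": "Adjusting permissions is part of the requested workflow.",
    },
}
```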
Experiments and Results
The authors evaluated GHOST on two mobile GUI benchmarks, RICO and AITW, using three types of visual triggers: static patches (Hurdle), dynamic motion patterns (Hoverball), and low-opacity blended content. Experiments were conducted on six real-world Android apps and three VLM architectures adapted for mobile use. The attack achieved high attack success rates (ASR up to 94.67%) while maintaining high clean-task performance (FSR up to 95.85%). Ablation studies highlighted the impact of design choices on the attack's efficacy and concealment. The results demonstrate the vulnerability of mobile agents to backdoor injection and the need for defense mechanisms.
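A rough sketch of how the two headline metrics could be computed from evaluation episodes is shown below; the episode schema is an assumption for illustration, not the paper's logging format.

```python
def compute_asr_fsr(episodes):
    """Hedged sketch of the two headline metrics, assuming each episode dict records
    whether the trigger was present, whether the attacker's target action fired, and
    per-step agreement with the clean ground-truth trajectory."""
    triggered = [e for e in episodes if e["trigger_present"]]
    clean = [e for e in episodes if not e["trigger_present"]]

    # Attack Success Rate: fraction of triggered episodes where the target behavior executes.
    asr = sum(e["target_action_executed"] for e in triggered) / max(len(triggered), 1)

    # Follow Step Ratio: fraction of steps on clean episodes matching the ground-truth action.
    total_steps = sum(len(e["step_matches"]) for e in clean)
    fsr = sum(sum(e["step_matches"]) for e in clean) / max(total_steps, 1)
    return asr, fsr
```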
Figure 2 visualizes Attack Success Rate (ASR) and Follow Step Ratio (FSR) across different trigger types, application domains, and VLM backbones, illustrating the effectiveness and generalizability of the GHOST attack.

Figure 2: ASR and FSR performance across various trigger types, application domains, and VLM backbones, demonstrating the broad applicability of the attack.
The paper found that Hurdle triggers performed best overall, yielding the highest ASR and FSR, followed by Blended and Hoverball triggers; Blended triggers produced the lowest FSR.
Qualitative examples of triggered screenshots are shown in Figure 3, demonstrating the subtle nature of the visual triggers used in the attack. PSNR and SSIM scores indicate high visual similarity between clean and triggered images, confirming the stealthiness of the perturbations.
Figure 3: Examples of triggered screenshots showing imperceptible visual changes, with PSNR and SSIM scores indicating high visual similarity to clean images.
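Such a stealthiness check can be reproduced with standard image-quality metrics, for example via scikit-image as in the hedged sketch below; the function name and interface are illustrative, not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def stealthiness_report(clean_img: np.ndarray, triggered_img: np.ndarray) -> dict:
    """Compare a clean and a triggered screenshot; higher PSNR (dB) and SSIM
    indicate a less perceptible perturbation. Inputs are HxWx3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(clean_img, triggered_img, data_range=255)
    ssim = structural_similarity(clean_img, triggered_img, channel_axis=-1, data_range=255)
    return {"psnr_db": psnr, "ssim": ssim}
```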
Discussion and Implications
The paper highlights the practical implications of clean-label backdoor attacks on mobile agents, emphasizing the need for robust defense mechanisms during the adaptation of VLM-based agents. GHOST generalizes gradient-alignment techniques to the mobile-agent setting, enabling coordinated control over both symbolic actions and language contexts conditioned on real-world GUI states. The work exposes the unique vulnerabilities of VLM-based mobile agents stemming from their reliance on screenshots, weak supervision, and on-device adaptation.
Conclusion
The authors conclude by highlighting a novel threat: clean-label visual backdoors in VLM-based mobile agents. They demonstrate that imperceptible perturbations in the image modality can implant persistent malicious behaviors, affecting both symbolic actions and textual rationales. Future work will focus on defenses under limited auditability and extending the attack framework to other multimodal agent scenarios.