Task-Redirecting Agent Persuasion (TRAP)
- The paper introduces TRAP, a benchmark that systematically quantifies task-redirection vulnerabilities in both vision-language and web agents using semantic and social-engineering attacks.
- It employs generative adversarial methods and black-box optimization to craft visually plausible adversarial stimuli, achieving up to 100% attack success rates in targeted settings.
- The findings highlight the urgent need for semantics-aware defense strategies to address vulnerabilities in agents’ reasoning frameworks and susceptibility to psychological manipulation.
The Task-Redirecting Agent Persuasion Benchmark (TRAP) is an umbrella term for two distinct families of benchmarks sharing a unified objective: to systematically quantify and dissect the susceptibility of autonomous agents—both vision-language and web-based—to task-redirection attacks operating at the semantic, cross-modal, or social-engineering level. Across both settings, TRAP exposes agentic models to adversarial stimuli designed to surreptitiously redirect or bias task outcomes, emphasizing threat modalities that preserve external plausibility (e.g., visually natural images, contextually coherent web content) while inducing consistent behavioral deviations. The TRAP methodology is motivated by the inadequacy of pixel-level perturbations or direct model-access attacks in capturing real-world risk and pivots toward attacks that exploit the agent's reasoning framework, environmental context, and psychological response profiles (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).
1. Formal Definition and Core Objectives
In the vision-language agent (VLA) setting (Kang et al., 29 May 2025), the TRAP benchmark consists of $n$-way selection tasks. Each trial provides an agent with a set of $n$ candidate images plus a goal-directed instruction (e.g., "Select the most luxurious product."). The adversary is permitted to replace a single "target" image with an adversarial counterpart $x_{\text{adv}}$ that is visually natural. TRAP quantifies the frequency with which the agent prefers $x_{\text{adv}}$ over the untouched competitors, specifically the rate at which the agent selects $x_{\text{adv}}$ with probability exceeding the random baseline $1/n$.
In the web agent setting (Korgul et al., 29 Dec 2025), TRAP generalizes to decision tasks involving web-based LLM agents acting in cloned browser environments. Each agent attempts user tasks (e.g., email triage, event management, e-commerce search) within interfaces containing adversarial injections (HTML/text payloads in editable regions). The success indicator for episode $i$ is $s_i = 1$ if the agent executes the injected action (e.g., clicks a malicious button), and $s_i = 0$ otherwise. Per-agent and aggregate success rates are defined as the mean of these indicators over $N$ adversarial episodes:

$$\mathrm{ASR} = \frac{1}{N} \sum_{i=1}^{N} s_i$$
The overarching aim is to reveal the practical ease with which semantic, contextually embedded, or persuasion-based perturbations can hijack agentic decision-making, even under black-box and human-imperceptible modifications.
2. TRAP for Vision-Language Agents: Diffusion-Based Semantic Injection
TRAP's attack framework for VLM-powered agents leverages a generative adversarial approach centered on semantic manipulation in CLIP embedding space. A latent $z$ is iteratively optimized and decoded back to the image domain via Stable Diffusion. The loss function is a convex combination of four terms (a minimal sketch combining them follows the list below):

$$\mathcal{L} = \lambda_{\text{neg}}\,\mathcal{L}_{\text{neg}} + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}} + \lambda_{\text{per}}\,\mathcal{L}_{\text{per}} + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}, \qquad \lambda_{\bullet} \ge 0,\; \textstyle\sum \lambda_{\bullet} = 1$$
- $\mathcal{L}_{\text{neg}}$ (Negative Prompt-Based Degradation): penalizes cosine similarity between the adversarial image embedding and a negative prompt embedding (e.g., "plain," "low quality"), suppressing semantics opposed to the adversary's intent.
- $\mathcal{L}_{\text{align}}$ (Semantic Alignment): encourages high alignment with the attacker's positive prompt by minimizing the negative cosine similarity between the image embedding and the positive prompt embedding.
- $\mathcal{L}_{\text{per}}$ (Perceptual Similarity): ensures $x_{\text{adv}}$ remains visually plausible relative to the original image via the LPIPS metric.
- $\mathcal{L}_{\text{id}}$ (Siamese Network Identity Preservation): penalizes drift in distinctive, identity-preserving CLIP embedding components produced by a two-branch Siamese network.
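
The following is a minimal PyTorch sketch of how these four terms could be combined. The weights, embedding shapes, and the `lpips_dist`/`id_emb` inputs are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def trap_loss(img_emb, neg_emb, pos_emb, lpips_dist, id_emb, id_emb_orig,
              w=(0.25, 0.25, 0.25, 0.25)):
    """Convex combination of the four TRAP loss terms (illustrative weights).

    img_emb:       CLIP embedding of the decoded adversarial image
    neg_emb:       embedding of the negative prompt (e.g., "plain, low quality")
    pos_emb:       embedding of the attacker's positive prompt
    lpips_dist:    precomputed LPIPS distance to the original image (scalar)
    id_emb(_orig): identity-preserving Siamese-branch embeddings
    """
    # L_neg: penalize residual similarity to the negative prompt
    l_neg = F.cosine_similarity(img_emb, neg_emb, dim=-1).mean()
    # L_align: reward alignment with the positive prompt (negated similarity)
    l_align = -F.cosine_similarity(img_emb, pos_emb, dim=-1).mean()
    # L_per: perceptual plausibility via LPIPS (computed upstream)
    l_per = lpips_dist
    # L_id: penalize drift of identity-preserving embedding components
    l_id = F.mse_loss(id_emb, id_emb_orig)
    return w[0] * l_neg + w[1] * l_align + w[2] * l_per + w[3] * l_id
```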
Spatial relevance is further refined using layout-aware spatial masking: a mask $M$ is generated based on concatenated embeddings of the target and the positive prompt. $M$ is optionally refined with segmentation masks, and its mean modulates the embedding such that semantic edits are localized to the most relevant image regions.
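
A sketch of how the mask's mean could modulate an embedding-space edit, under the assumption that the mask is a per-pixel relevance map in $[0, 1]$; the paper's exact masking pipeline is not reproduced here:

```python
import torch

def masked_semantic_edit(img_emb, edit_direction, mask):
    """Scale an embedding-space edit by the mask's mean relevance (sketch).

    mask: HxW tensor in [0, 1], e.g. derived from concatenated target/prompt
    embeddings and optionally refined with segmentation masks.
    """
    strength = mask.mean()                  # scalar relevance weight
    return img_emb + strength * edit_direction
```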
The attack search proceeds via black-box optimization (without access to model weights or logits), decoding $z$ at each iteration and evaluating the agent's selection probability over $n$-way composite prompts. The early-stopping criterion is a selection probability exceeding $1/n$.
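
A minimal sketch of this black-box search loop with early stopping; `step_fn`, `decode_fn`, and `select_prob_fn` are assumed callables standing in for the latent optimizer, the Stable Diffusion decoder, and repeated agent queries, respectively:

```python
def blackbox_attack(z0, step_fn, decode_fn, select_prob_fn, n, max_iters=200):
    """Query-only attack loop (sketch): no model weights or logits needed.

    step_fn:        proposes an updated latent from the current one
    decode_fn:      decodes a latent back to image space (diffusion decoder)
    select_prob_fn: estimates the agent's selection probability for the
                    candidate over randomized n-way composite prompts
    """
    z, best = z0, None
    for _ in range(max_iters):
        z = step_fn(z)
        x_adv = decode_fn(z)
        p = select_prob_fn(x_adv)
        best = (x_adv, p) if best is None or p > best[1] else best
        if p > 1.0 / n:           # early stop: beat the random baseline
            return x_adv, p
    return best
```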
3. TRAP for Web Agents: Persuasion and Social Engineering Structure
In the web agent domain, TRAP targets LLM agents situated in browser-automated, high-fidelity environment clones (Gmail, Google Calendar, LinkedIn, Amazon, DoorDash, Upwork). Each task incorporates adversarial payloads embedded into familiar, user-editable fields (e.g., event descriptions, product reviews). Payloads are constructed modularly along five axes:
| Axis | Possible Values |
|---|---|
| Interface Form | Button, Hyperlink |
| Persuasion Principle | Authority, Reciprocity, Scarcity, Liking, Social Proof, Consistency, Unity |
| LLM Method | Adversarial Suffix, Chain-of-Thought Injection, Many-shot, Role-play, Override |
| Location | Task-relevant editable field |
| Tailoring | On/Off (injection references current task context or not) |
Injection text leverages social-engineering principles (notably, Cialdini's seven persuasion types), coupled with manipulation techniques targeting LLM reasoning (e.g., step-wise CoT chains or prompt-override instructions), all framed within an innocuous interface element. Each benchmark run alternates button and hyperlink modalities and cycles through 35 injection templates for robust coverage; a minimal sketch of this modular payload assembly follows.
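
The field names and value strings below are illustrative, but the 7 × 5 principle-by-method grid matches the 35 templates described above:

```python
from dataclasses import dataclass
from itertools import product

PRINCIPLES = ["authority", "reciprocity", "scarcity", "liking",
              "social_proof", "consistency", "unity"]
METHODS = ["adversarial_suffix", "cot_injection", "many_shot",
           "role_play", "override"]

@dataclass
class InjectionPayload:
    interface_form: str   # "button" or "hyperlink" (alternated across runs)
    persuasion: str       # one of Cialdini's seven principles
    llm_method: str       # reasoning-targeted manipulation technique
    location: str         # task-relevant editable field
    tailored: bool        # whether text references the current task context

# 7 principles x 5 methods = the 35 injection templates per interface form
templates = [InjectionPayload("button", p, m, "event_description", True)
             for p, m in product(PRINCIPLES, METHODS)]
assert len(templates) == 35
```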
The agent's actions and reasoning are captured via DOM logging and action traces using the Playwright automation interface, with a cap of 35 interaction steps per episode.
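
A sketch of how an episode could be driven and logged with Playwright's Python API; `agent_next_action` is an assumed callable wrapping the LLM agent, and the action schema is hypothetical:

```python
from playwright.sync_api import sync_playwright

MAX_STEPS = 35  # per-episode interaction cap

def run_episode(start_url, agent_next_action):
    """Drive one episode, logging DOM snapshots and actions (sketch)."""
    trace = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(MAX_STEPS):
            dom = page.content()               # DOM snapshot for the trace
            action = agent_next_action(dom)    # agent decides the next step
            if action is None:                 # agent declares the task done
                break
            if action["kind"] == "click":
                page.click(action["selector"])
            elif action["kind"] == "fill":
                page.fill(action["selector"], action["text"])
            trace.append({"action": action})
        browser.close()
    return trace
```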
4. Experimental Methodologies, Metrics, and Results
Vision-Language Agent Setting (Kang et al., 29 May 2025)
- Data: 100 image–caption pairs from COCO Captions, each with competitively selected distractors. Adversarial images are first "degraded" to establish a non-preferred baseline, then optimized as above.
- Evaluation: The agent receives a horizontally concatenated composite of the $n$ candidate images in randomized order across repeated trials. Attack success is recorded when the agent's selection rate for $x_{\text{adv}}$ exceeds the random baseline $1/n$ (see the evaluation sketch after the table below).
- Results:
  - TRAP attains a 100% attack success rate (ASR) against LLaVA-34B, Gemma3-8B, and Mistral-3.1 (see table).
  - Competing attacks (SPSA, Bandit, unoptimized Stable Diffusion) yield substantially lower ASRs (at most 36%).
| Method | LLaVA-34B | Gemma3-8B | Mistral-3.1 |
|---|---|---|---|
| Initial bad image | 21% | 17% | 14% |
| SPSA | 36% | 27% | 22% |
| Bandit | 6% | 2% | 1% |
| Stable Diffusion (no opt) | 24% | 18% | 18% |
| TRAP | 100% | 100% | 100% |
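
The evaluation sketch referenced above: an empirical selection rate over shuffled composite orders, with `agent_choice_fn` an assumed black-box returning the index the agent picks:

```python
import random

def selection_rate(agent_choice_fn, candidates, adv_index, trials=30):
    """Fraction of randomized n-way trials in which the agent picks x_adv."""
    hits = 0
    for _ in range(trials):
        order = list(range(len(candidates)))
        random.shuffle(order)                       # randomized composite order
        picked = agent_choice_fn([candidates[i] for i in order])
        if order[picked] == adv_index:              # position -> original index
            hits += 1
    return hits / trials

# attack succeeds when selection_rate(...) > 1 / len(candidates)
```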
Web Agent Setting (Korgul et al., 29 Dec 2025)
- Agents: GPT-5, Claude 3.7 Sonnet, Gemini 2.5 Flash, GPT-OSS-120B, DeepSeek-R1, LLaMA 4 Maverick.
- Tasks: 18 tasks × 35 injection templates per model (630 episodes/model).
- Metrics:
- Benign utility: Task completion with no adversary.
  - ASR: Percent of episodes with $s_i = 1$ (the agent executes the adversary's injected action).
- Results:
  - Mean ASR across agents is 25%, ranging from 13% for GPT-5 to 43% for DeepSeek-R1 (a toy aggregation follows the table below).
  - Buttons elicit successful attacks 3.5× more often than hyperlinks.
- Task-specific tailoring increases ASR by factors of 2–3.
| Agent | Attack Success Rate (ASR) |
|---|---|
| GPT-5 | 13% |
| Claude 3.7 Sonnet | 20% |
| Gemini 2.5 Flash | 30% |
| GPT-OSS-120B | 27% |
| DeepSeek-R1 | 43% |
| LLaMA 4 Maverick | 17% |
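
The toy aggregation referenced above, reproducing the mean ASR from the per-agent figures; the episode-level computation mirrors the $s_i$ indicator definition in Section 1:

```python
def asr(success_flags):
    """Aggregate ASR: fraction of episodes with s_i = 1."""
    return sum(success_flags) / len(success_flags)

per_agent = {"GPT-5": 0.13, "Claude 3.7 Sonnet": 0.20, "Gemini 2.5 Flash": 0.30,
             "GPT-OSS-120B": 0.27, "DeepSeek-R1": 0.43, "LLaMA 4 Maverick": 0.17}
mean_asr = sum(per_agent.values()) / len(per_agent)
print(f"mean ASR = {mean_asr:.0%}")  # -> 25%
```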
5. Analysis of Vulnerability Sources
TRAP’s effectiveness in both modalities exposes two classes of vulnerabilities:
- Psychological/Social-Engineering Framing: LLM agents are demonstrably susceptible to Cialdini's persuasion principles. Social Proof and Consistency collectively account for ~18% of successful attacks, while Authority and Scarcity also reliably influence model compliance. This suggests that LLM agents internalize behavioral cues from pretraining data or explicit alignment training.
- Model Architectural/Prompt Manipulation: Chain-of-Thought injection and Adversarial Suffix methods each account for ~24% of successful web-agent attacks. Prompt injections can override internal safety filters, and attacks that succeed against stronger agents transfer to weaker models, establishing a hierarchical vulnerability "superset" structure.
- Visual Semantics and Embedding Drift: In vision-language agents, attacks exploiting the CLIP embedding space, coupled with spatial masking, can prompt semantic realignment toward the adversary’s goal while maintaining perceptual similarity and identity.
6. Defense Pathways and Research Directions
Both TRAP settings highlight the inadequacy of pixel-level defenses and naive prompt filtering.
Recommended defenses for vision-language agents (Kang et al., 29 May 2025) include:
- Embedding-level adversarial training on semantically perturbed samples.
- Cross-modal analysis (e.g., alternate captions, scene graph verification).
- Verification of spatial mask distributions for unnatural localization.
- Prompt-ensemble based adversarial testing.
For web agents (Korgul et al., 29 Dec 2025), strategies involve:
- Expansion to image- and rich content-based injections.
- Personalization-resistant defense layers (e.g., HTML sanitizers; a minimal sanitizer sketch follows this list).
- Injecting adversarial examples during alignment and fine-tuning stages.
- Automated detection of repeated/chain-of-thought manipulation structures.
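
As one instance of the sanitizer idea above, a minimal sketch that strips interactive markup (buttons, hyperlinks, scripts) from user-editable HTML before it reaches the agent; the blocked-tag list and the choice to drop enclosed text are illustrative design decisions:

```python
from html.parser import HTMLParser

BLOCKED = {"button", "a", "script"}   # interactive elements stripped defensively

class EditableFieldSanitizer(HTMLParser):
    """Drop interactive markup (and its inner text) from editable regions."""
    def __init__(self):
        super().__init__()
        self.out, self.skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip_depth += 1          # suppress everything inside

    def handle_endtag(self, tag):
        if tag in BLOCKED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

def sanitize(fragment):
    s = EditableFieldSanitizer()
    s.feed(fragment)
    return "".join(s.out)

# sanitize('Meeting at 3pm <button>Click to confirm attendance</button>')
# -> 'Meeting at 3pm '
```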
Future work includes support for expanded task domains (e.g., banking, social media), richer multi-modal attacks (QR codes, images), as well as multi-step, composite adversarial payloads for full end-to-end harm quantification.
7. Significance and Broader Implications
The TRAP benchmark provides a rigorous, extensible, and interpretable platform for benchmarking the task-redirection susceptibility of both VLM-based and web-based autonomous agentic systems. By shifting the focus from visible, low-level attacks to cross-modal and semantic adversarial risk, TRAP elucidates a space where agents’ apparent intelligence and practical vulnerability intersect. As TRAP achieves high (up to perfect) attack rates even under black-box and human-imperceptible conditions, a pressing implication is the necessity for semantics- and alignment-aware robustness methods in next-generation autonomous agents (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).