Task-Redirecting Agent Persuasion (TRAP)

Updated 5 January 2026

The paper introduces TRAP, a benchmark that systematically quantifies task-redirection vulnerabilities in both vision-language and web agents using semantic and social-engineering attacks.
It employs generative adversarial methods and black-box optimization to craft visually plausible adversarial stimuli, achieving up to 100% attack success rates in targeted settings.
The findings highlight the urgent need for semantics-aware defense strategies to address vulnerabilities in agents’ reasoning frameworks and susceptibility to psychological manipulation.

The Task-Redirecting Agent Persuasion Benchmark (TRAP) is an umbrella term for two distinct families of benchmarks sharing a unified objective: to systematically quantify and dissect the susceptibility of autonomous agents—both vision-language and web-based—to task-redirection attacks operating at the semantic, cross-modal, or social-engineering level. Across both settings, TRAP exposes agentic models to adversarial stimuli designed to surreptitiously redirect or bias task outcomes, emphasizing threat modalities that preserve external plausibility (e.g., visually natural images, contextually coherent web content) while inducing consistent behavioral deviations. The TRAP methodology is motivated by the inadequacy of pixel-level perturbations or direct model-access attacks in capturing real-world risk and pivots toward attacks that exploit the agent's reasoning framework, environmental context, and psychological response profiles (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).

1. Formal Definition and Core Objectives

In the vision-language agent (VLA) setting (Kang et al., 29 May 2025), the TRAP benchmark consists of $n$ -way selection tasks. Each trial provides an agent $M(\cdot)$ with a set of $n$ candidates $\{x_1, x_2, \dots, x_n\}$ plus a goal-directed instruction (e.g., "Select the most luxurious product."). The adversary is permitted to replace a single "target" image $x_t$ with an adversarial counterpart $x_{\text{adv}}$ that is visually natural. The objective is for TRAP to quantify the frequency with which $M$ prefers $x_{\text{adv}}$ over the $n-1$ untouched competitors—specifically, the rate at which the agent selects $x_{\text{adv}}$ with probability exceeding the random baseline $1/n$.

In the web agent setting (Korgul et al., 29 Dec 2025), TRAP generalizes to decision tasks involving web-based LLM agents acting in cloned browser environments. Each agent $a\in A$ attempts user tasks $t\in T$ (e.g., email triage, event management, e-commerce search) within interfaces containing adversarial injections $i \in I$ (HTML/text payloads in editable regions). The success indicator is $\delta(a, t, i) = 1$ if the agent executes the injected action (e.g., clicks a malicious button), and $0$ otherwise. Per-agent and aggregate success rates are defined as:

$SR(a) = \frac{1}{|T| \times |I|}\sum_{t\in T}\sum_{i\in I} \delta(a, t, i), \qquad SR_{\text{TRAP}} = \frac{1}{|A|} \sum_{a\in A} SR(a)$

The overarching aim is to reveal the practical ease with which semantic, contextually embedded, or persuasion-based perturbations can hijack agentic decision-making, even under black-box and human-imperceptible modifications.

2. TRAP for Vision-Language Agents: Diffusion-Based Semantic Injection

TRAP's attack framework for VLM-powered agents leverages a generative adversarial approach centered on semantic manipulation in CLIP embedding space. A latent $e_{\text{adv}}$ is iteratively optimized and decoded back to the image domain via Stable Diffusion. The loss function is a convex combination:

$\mathcal{L}(e_{\text{adv}}) = \lambda_{\text{deg}}\mathcal{L}_{\text{deg}} + \lambda_{\text{sem}}\mathcal{L}_{\text{sem}} + \lambda_{\text{lpips}}\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{dist}}\mathcal{L}_{\text{dist}}$

$\mathcal{L}_{\text{deg}}$ (Negative Prompt-Based Degradation):

Penalizes similarity to a negative prompt embedding $e_{\text{neg}}$ (e.g., "plain," "low quality") via cosine similarity, suppressing semantics opposed to the adversary's intent.

$\mathcal{L}_{\text{sem}}$ (Semantic Alignment):

Encourages high alignment with the attacker’s positive prompt $e_{\text{text}}$ , minimizing $1 - \cos(e_{\text{adv}}, e_{\text{text}})$ .

$\mathcal{L}_{\text{LPIPS}}$ (Perceptual Similarity):

Ensures $x_{\text{adv}}$ remains visually plausible relative to $x_{\text{target}}$ via the LPIPS metric.

$\mathcal{L}_{\text{dist}}$ (Siamese Network Identity Preservation):

Penalizes the $L_2$ drift in distinctive, identity-preserving CLIP embedding components produced by a two-branch Siamese network.

Spatial relevance is further refined using layout-aware spatial masking: a mask $A$ is generated based on concatenated embeddings of the target and the positive prompt. $A$ is optionally refined with segmentation masks and its mean modulates the embedding such that semantic edits are localized to the most relevant image regions.

The attack search proceeds via black-box optimization (without access to model weights or logits), decoding $e_{\text{adv}}$ at each iteration and evaluating agent selection probability over $n$ -way composite prompts. The criterion for early stopping is surpassing a selection probability above $1/n$.

In the web agent domain, TRAP targets LLM agents situated in browser-automated, high-fidelity environment clones (Gmail, Google Calendar, LinkedIn, Amazon, DoorDash, Upwork). Each task incorporates adversarial payloads embedded into familiar, user-editable fields (e.g., event descriptions, product reviews). Payloads are constructed modularly along five axes:

Axis	Possible Values
Interface Form	Button, Hyperlink
Persuasion Principle	Authority, Reciprocity, Scarcity, Liking, Social Proof, Consistency, Unity
LLM Method	Adversarial Suffix, Chain-of-Thought Injection, Many-shot, Role-play, Override
Location	Task-relevant editable field
Tailoring	On/Off (injection references current task context or not)

Injection text leverages social-engineering principles (notably, Cialdini’s 7 persuasion types), coupled with manipulation techniques targeting LLM reasoning (e.g., step-wise CoT chains or prompt override instructions), all framed within an innocuous interface element. Each benchmark run alternates button vs. hyperlink modalities and cycles through 35 injection templates for robust coverage.

The agent’s action and reasoning are captured via DOM logging and action traces using the PLAYWRIGHT automation interface with a cap of 35 interaction steps per episode.

4. Experimental Methodologies, Metrics, and Results

Data: 100 image–caption pairs from COCO Captions, each with $n-1$ competitively selected distractors. Adversarial images are first "degraded" to establish a non-preferred baseline, then optimized as above.
Evaluation: The agent receives a horizontally concatenated composite of $n$ images in randomized order for $R=100$ trials. Attack success is $P(x_{\text{adv}}) > 1/n$ .
Results:
- TRAP attains $100\%$ attack success rate (ASR) against LLaVA-34B, Gemma3-8B, and Mistral-3.1 for $n=4$ (see table).
- Competing attacks (SPSA, Bandit, unoptimized Stable Diffusion) yield lower ASRs (max $36\%$ ).

Method	LLaVA-34B	Gemma3-8B	Mistral-3.1
Initial bad image	21%	17%	14%
SPSA	36%	27%	22%
Bandit	6%	2%	1%
Stable Diffusion (no opt)	24%	18%	18%
TRAP	100%	100%	100%

Agents: GPT-5, Claude 3.7 Sonnet, Gemini 2.5 Flash, GPT-OSS-120B, DeepSeek-R1, LLaMA 4 Maverick.
Tasks: 18 tasks × 35 injection templates per model (630 episodes/model).
Metrics:
- Benign utility: Task completion with no adversary.
- ASR: Percent of episodes with $\delta(a, t, i)=1$ (agent executes adversary's action).
Results:
- Mean $SR_{\text{TRAP}}\approx 25\%$ ; GPT-5 at 13%, DeepSeek-R1 at 43%.
- Buttons elicit successful attacks 3.5× more than hyperlinks.
- Task-specific tailoring increases ASR by factors of 2–3.

Agent	Attack Success Rate (ASR)
GPT-5	13%
Claude 3.7 Sonnet	20%
Gemini 2.5 Flash	30%
GPT-OSS-120B	27%
DeepSeek-R1	43%
LLaMA 4 Maverick	17%

5. Analysis of Vulnerability Sources

TRAP’s effectiveness in both modalities exposes two classes of vulnerabilities:

Psychological/Social-Engineering Framing: LLM agents are demonstrably susceptible to Cialdini's persuasion principles. Social Proof and Consistency collectively account for ~18% of successful attacks, while Authority and Scarcity also reliably influence model compliance. This suggests that LLM agents internalize behavioral cues from pretraining data or explicit alignments.
Model Architectural/Prompt Manipulation: Chain-of-Thought and Adversarial Suffix methods each comprise ~24% of web-agent failures. Prompt injections can override internal safety filters, and successful attacks on large agents transfer to weaker models, establishing a hierarchical vulnerability "superset" structure.
Visual Semantics and Embedding Drift: In vision-language agents, attacks exploiting the CLIP embedding space, coupled with spatial masking, can prompt semantic realignment toward the adversary’s goal while maintaining perceptual similarity and identity.

6. Defense Pathways and Research Directions

Both TRAP settings highlight the deficiency of pixel-level or naive prompt filtering defenses.

Recommended defenses for vision-language agents (Kang et al., 29 May 2025) include:

Embedding-level adversarial training on semantically perturbed samples.
Cross-modal analysis (e.g., alternate captions, scene graph verification).
Verification of spatial mask distributions for unnatural localization.
Prompt-ensemble based adversarial testing.

For web agents (Korgul et al., 29 Dec 2025), strategies involve:

Expansion to image- and rich content-based injections.
Personalization-resistant defense layers (e.g., HTML sanitizers).
Injecting adversarial examples during alignment and fine-tuning stages.
Automated detection of repeated/chain-of-thought manipulation structures.

Future work includes support for expanded task domains (e.g., banking, social media), richer multi-modal attacks (QR codes, images), as well as multi-step, composite adversarial payloads for full end-to-end harm quantification.

7. Significance and Broader Implications

The TRAP benchmark provides a rigorous, extensible, and interpretable platform for benchmarking the task-redirection susceptibility of both VLM-based and web-based autonomous agentic systems. By shifting the focus from visible, low-level attacks to cross-modal and semantic adversarial risk, TRAP elucidates a space where agents’ apparent intelligence and practical vulnerability intersect. As TRAP achieves high (up to perfect) attack rates even under black-box and human-imperceptible conditions, a pressing implication is the necessity for semantics- and alignment-aware robustness methods in next-generation autonomous agents (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).

Markdown Report Issue Upgrade to Chat

References (2)

TRAP: Targeted Redirecting of Agentic Preferences (2025)

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Task-Redirecting Agent Persuasion Benchmark (TRAP).

Task-Redirecting Agent Persuasion (TRAP)

1. Formal Definition and Core Objectives

2. TRAP for Vision-Language Agents: Diffusion-Based Semantic Injection

4. Experimental Methodologies, Metrics, and Results

Vision-Language Agent Setting (Kang et al., 29 May 2025)

Web Agent Setting (Korgul et al., 29 Dec 2025)

5. Analysis of Vulnerability Sources

6. Defense Pathways and Research Directions

7. Significance and Broader Implications

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Task-Redirecting Agent Persuasion (TRAP)

1. Formal Definition and Core Objectives

2. TRAP for Vision-Language Agents: Diffusion-Based Semantic Injection

3. TRAP for Web Agents: Persuasion and Social Engineering Structure

4. Experimental Methodologies, Metrics, and Results

Vision-Language Agent Setting (Kang et al., 29 May 2025)

Web Agent Setting (Korgul et al., 29 Dec 2025)

5. Analysis of Vulnerability Sources

6. Defense Pathways and Research Directions

7. Significance and Broader Implications

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research