Papers
Topics
Authors
Recent
Search
2000 character limit reached

Task-Redirecting Agent Persuasion (TRAP)

Updated 5 January 2026
  • The paper introduces TRAP, a benchmark that systematically quantifies task-redirection vulnerabilities in both vision-language and web agents using semantic and social-engineering attacks.
  • It employs generative adversarial methods and black-box optimization to craft visually plausible adversarial stimuli, achieving up to 100% attack success rates in targeted settings.
  • The findings highlight the urgent need for semantics-aware defense strategies to address vulnerabilities in agents’ reasoning frameworks and susceptibility to psychological manipulation.

The Task-Redirecting Agent Persuasion Benchmark (TRAP) is an umbrella term for two distinct families of benchmarks sharing a unified objective: to systematically quantify and dissect the susceptibility of autonomous agents—both vision-language and web-based—to task-redirection attacks operating at the semantic, cross-modal, or social-engineering level. Across both settings, TRAP exposes agentic models to adversarial stimuli designed to surreptitiously redirect or bias task outcomes, emphasizing threat modalities that preserve external plausibility (e.g., visually natural images, contextually coherent web content) while inducing consistent behavioral deviations. The TRAP methodology is motivated by the inadequacy of pixel-level perturbations or direct model-access attacks in capturing real-world risk and pivots toward attacks that exploit the agent's reasoning framework, environmental context, and psychological response profiles (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).

1. Formal Definition and Core Objectives

In the vision-language agent (VLA) setting (Kang et al., 29 May 2025), the TRAP benchmark consists of nn-way selection tasks. Each trial provides an agent M()M(\cdot) with a set of nn candidates {x1,x2,,xn}\{x_1, x_2, \dots, x_n\} plus a goal-directed instruction (e.g., "Select the most luxurious product."). The adversary is permitted to replace a single "target" image xtx_t with an adversarial counterpart xadvx_{\text{adv}} that is visually natural. The objective is for TRAP to quantify the frequency with which MM prefers xadvx_{\text{adv}} over the n1n-1 untouched competitors—specifically, the rate at which the agent selects xadvx_{\text{adv}} with probability exceeding the random baseline $1/n$.

In the web agent setting (Korgul et al., 29 Dec 2025), TRAP generalizes to decision tasks involving web-based LLM agents acting in cloned browser environments. Each agent aAa\in A attempts user tasks tTt\in T (e.g., email triage, event management, e-commerce search) within interfaces containing adversarial injections iIi \in I (HTML/text payloads in editable regions). The success indicator is δ(a,t,i)=1\delta(a, t, i) = 1 if the agent executes the injected action (e.g., clicks a malicious button), and $0$ otherwise. Per-agent and aggregate success rates are defined as:

SR(a)=1T×ItTiIδ(a,t,i),SRTRAP=1AaASR(a)SR(a) = \frac{1}{|T| \times |I|}\sum_{t\in T}\sum_{i\in I} \delta(a, t, i), \qquad SR_{\text{TRAP}} = \frac{1}{|A|} \sum_{a\in A} SR(a)

The overarching aim is to reveal the practical ease with which semantic, contextually embedded, or persuasion-based perturbations can hijack agentic decision-making, even under black-box and human-imperceptible modifications.

2. TRAP for Vision-Language Agents: Diffusion-Based Semantic Injection

TRAP's attack framework for VLM-powered agents leverages a generative adversarial approach centered on semantic manipulation in CLIP embedding space. A latent eadve_{\text{adv}} is iteratively optimized and decoded back to the image domain via Stable Diffusion. The loss function is a convex combination:

L(eadv)=λdegLdeg+λsemLsem+λlpipsLLPIPS+λdistLdist\mathcal{L}(e_{\text{adv}}) = \lambda_{\text{deg}}\mathcal{L}_{\text{deg}} + \lambda_{\text{sem}}\mathcal{L}_{\text{sem}} + \lambda_{\text{lpips}}\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{dist}}\mathcal{L}_{\text{dist}}

  • Ldeg\mathcal{L}_{\text{deg}} (Negative Prompt-Based Degradation):

Penalizes similarity to a negative prompt embedding enege_{\text{neg}} (e.g., "plain," "low quality") via cosine similarity, suppressing semantics opposed to the adversary's intent.

  • Lsem\mathcal{L}_{\text{sem}} (Semantic Alignment):

Encourages high alignment with the attacker’s positive prompt etexte_{\text{text}}, minimizing 1cos(eadv,etext)1 - \cos(e_{\text{adv}}, e_{\text{text}}).

  • LLPIPS\mathcal{L}_{\text{LPIPS}} (Perceptual Similarity):

Ensures xadvx_{\text{adv}} remains visually plausible relative to xtargetx_{\text{target}} via the LPIPS metric.

  • Ldist\mathcal{L}_{\text{dist}} (Siamese Network Identity Preservation):

Penalizes the L2L_2 drift in distinctive, identity-preserving CLIP embedding components produced by a two-branch Siamese network.

Spatial relevance is further refined using layout-aware spatial masking: a mask AA is generated based on concatenated embeddings of the target and the positive prompt. AA is optionally refined with segmentation masks and its mean modulates the embedding such that semantic edits are localized to the most relevant image regions.

The attack search proceeds via black-box optimization (without access to model weights or logits), decoding eadve_{\text{adv}} at each iteration and evaluating agent selection probability over nn-way composite prompts. The criterion for early stopping is surpassing a selection probability above $1/n$.

3. TRAP for Web Agents: Persuasion and Social Engineering Structure

In the web agent domain, TRAP targets LLM agents situated in browser-automated, high-fidelity environment clones (Gmail, Google Calendar, LinkedIn, Amazon, DoorDash, Upwork). Each task incorporates adversarial payloads embedded into familiar, user-editable fields (e.g., event descriptions, product reviews). Payloads are constructed modularly along five axes:

Axis Possible Values
Interface Form Button, Hyperlink
Persuasion Principle Authority, Reciprocity, Scarcity, Liking, Social Proof, Consistency, Unity
LLM Method Adversarial Suffix, Chain-of-Thought Injection, Many-shot, Role-play, Override
Location Task-relevant editable field
Tailoring On/Off (injection references current task context or not)

Injection text leverages social-engineering principles (notably, Cialdini’s 7 persuasion types), coupled with manipulation techniques targeting LLM reasoning (e.g., step-wise CoT chains or prompt override instructions), all framed within an innocuous interface element. Each benchmark run alternates button vs. hyperlink modalities and cycles through 35 injection templates for robust coverage.

The agent’s action and reasoning are captured via DOM logging and action traces using the PLAYWRIGHT automation interface with a cap of 35 interaction steps per episode.

4. Experimental Methodologies, Metrics, and Results

  • Data: 100 image–caption pairs from COCO Captions, each with n1n-1 competitively selected distractors. Adversarial images are first "degraded" to establish a non-preferred baseline, then optimized as above.
  • Evaluation: The agent receives a horizontally concatenated composite of nn images in randomized order for R=100R=100 trials. Attack success is P(xadv)>1/nP(x_{\text{adv}}) > 1/n.
  • Results:
    • TRAP attains 100%100\% attack success rate (ASR) against LLaVA-34B, Gemma3-8B, and Mistral-3.1 for n=4n=4 (see table).
    • Competing attacks (SPSA, Bandit, unoptimized Stable Diffusion) yield lower ASRs (max 36%36\%).
Method LLaVA-34B Gemma3-8B Mistral-3.1
Initial bad image 21% 17% 14%
SPSA 36% 27% 22%
Bandit 6% 2% 1%
Stable Diffusion (no opt) 24% 18% 18%
TRAP 100% 100% 100%
  • Agents: GPT-5, Claude 3.7 Sonnet, Gemini 2.5 Flash, GPT-OSS-120B, DeepSeek-R1, LLaMA 4 Maverick.
  • Tasks: 18 tasks × 35 injection templates per model (630 episodes/model).
  • Metrics:
    • Benign utility: Task completion with no adversary.
    • ASR: Percent of episodes with δ(a,t,i)=1\delta(a, t, i)=1 (agent executes adversary's action).
  • Results:
    • Mean SRTRAP25%SR_{\text{TRAP}}\approx 25\%; GPT-5 at 13%, DeepSeek-R1 at 43%.
    • Buttons elicit successful attacks 3.5× more than hyperlinks.
    • Task-specific tailoring increases ASR by factors of 2–3.
Agent Attack Success Rate (ASR)
GPT-5 13%
Claude 3.7 Sonnet 20%
Gemini 2.5 Flash 30%
GPT-OSS-120B 27%
DeepSeek-R1 43%
LLaMA 4 Maverick 17%

5. Analysis of Vulnerability Sources

TRAP’s effectiveness in both modalities exposes two classes of vulnerabilities:

  • Psychological/Social-Engineering Framing: LLM agents are demonstrably susceptible to Cialdini's persuasion principles. Social Proof and Consistency collectively account for ~18% of successful attacks, while Authority and Scarcity also reliably influence model compliance. This suggests that LLM agents internalize behavioral cues from pretraining data or explicit alignments.
  • Model Architectural/Prompt Manipulation: Chain-of-Thought and Adversarial Suffix methods each comprise ~24% of web-agent failures. Prompt injections can override internal safety filters, and successful attacks on large agents transfer to weaker models, establishing a hierarchical vulnerability "superset" structure.
  • Visual Semantics and Embedding Drift: In vision-language agents, attacks exploiting the CLIP embedding space, coupled with spatial masking, can prompt semantic realignment toward the adversary’s goal while maintaining perceptual similarity and identity.

6. Defense Pathways and Research Directions

Both TRAP settings highlight the deficiency of pixel-level or naive prompt filtering defenses.

Recommended defenses for vision-language agents (Kang et al., 29 May 2025) include:

  • Embedding-level adversarial training on semantically perturbed samples.
  • Cross-modal analysis (e.g., alternate captions, scene graph verification).
  • Verification of spatial mask distributions for unnatural localization.
  • Prompt-ensemble based adversarial testing.

For web agents (Korgul et al., 29 Dec 2025), strategies involve:

  • Expansion to image- and rich content-based injections.
  • Personalization-resistant defense layers (e.g., HTML sanitizers).
  • Injecting adversarial examples during alignment and fine-tuning stages.
  • Automated detection of repeated/chain-of-thought manipulation structures.

Future work includes support for expanded task domains (e.g., banking, social media), richer multi-modal attacks (QR codes, images), as well as multi-step, composite adversarial payloads for full end-to-end harm quantification.

7. Significance and Broader Implications

The TRAP benchmark provides a rigorous, extensible, and interpretable platform for benchmarking the task-redirection susceptibility of both VLM-based and web-based autonomous agentic systems. By shifting the focus from visible, low-level attacks to cross-modal and semantic adversarial risk, TRAP elucidates a space where agents’ apparent intelligence and practical vulnerability intersect. As TRAP achieves high (up to perfect) attack rates even under black-box and human-imperceptible conditions, a pressing implication is the necessity for semantics- and alignment-aware robustness methods in next-generation autonomous agents (Kang et al., 29 May 2025, Korgul et al., 29 Dec 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (2)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Task-Redirecting Agent Persuasion Benchmark (TRAP).