- The paper introduces the Decepticon benchmark to systematically evaluate dark pattern adversarial attacks on LLM-driven web agents.
- Empirical results show that increased model capacity leads to higher success rates for dark pattern manipulations, with effectiveness often exceeding 70%.
- The study finds that prompt-based and guardrail defenses provide limited protection, emphasizing the need for dedicated adversarial training approaches.
Decepticon: Evaluating the Vulnerability of Web Agents to Dark Patterns
Introduction
The proliferation of LLM-driven web agents has introduced new pathways for autonomous interaction with web content, automating tasks ranging from shopping and bookings to information retrieval. However, as these agents take on increasing responsibility, their susceptibility to adversarial elements ubiquitous in web design, particularly dark patterns—deceptive UI manipulations engineered to steer decisions—becomes increasingly significant. "DECEPTICON: How Dark Patterns Manipulate Web Agents" (2512.22894) presents a rigorous framework for systematic evaluation of dark patterns as adversarial attacks on web agents. Through the Decepticon benchmark and a suite of empirical evaluations, the paper establishes that state-of-the-art LLM agents are systematically vulnerable to dark pattern manipulations, with susceptibility intensifying as model capabilities grow.
The Decepticon Benchmark
The Decepticon benchmark provides a reproducible, realism-oriented environment for dark pattern evaluation in agentic navigation. It comprises two complementary components: a generated split of 600 tasks created via an adversarial human-AI generation and filtration loop, and 100 real-world tasks curated from live websites via automated crawling and LLM-based detection, followed by manual verification. Each task operationalizes an explicit objective, a desired goal state, and an embedded dark pattern, systematically categorized along six major attack modalities: sneaking, urgency, misdirection, social proof, obstruction, and forced action. Both splits are structured to preserve task solvability while ensuring the adversarial manipulations are avoidable—crucial for distinguishing deliberate dark patterns from poor UI or inescapable attacks.
Figure 2: Examples of dark pattern-injected tasks in Decepticon, showing categories and corresponding malicious outcomes.
The pipeline leverages state-of-the-art generative models and scaffolding agents to instantiate high-fidelity, contextually plausible dark pattern deployments. Human annotators validate generated patterns to enforce ecological validity and adversarial relevance.
Figure 1: Data collection pipeline overview, demonstrating generation, filtering, and validation steps.
A diverse array of synthetic and sampled-in-the-wild tasks supports robust, statistically meaningful evaluation across a balanced spectrum of dark pattern phenomena.
Measuring Dark Pattern Effectiveness
Evaluation is conducted across leading vision-equipped agentic frameworks—SoM (Set-of-Marks) agents and coordinate-based variants—powered by GPT-4o, GPT-5, Gemini-2.5-Pro/Flash, Claude Sonnet 4, and OpenAI o3. All agents operate within a controlled browser simulation to complete navigation tasks under both dark pattern and matched control (manipulation-removed) conditions.
Reported metrics encompass task success rate (SR) and dark pattern effectiveness (DP): the frequency with which an agent's trajectory is successfully steered towards the adversarial outcome, regardless of explicit task completion. Human benchmarks, based on a 200-participant study, provide context for agent vulnerability.
Empirical results demonstrate a pronounced disparity:
- Across agents, dark pattern effectiveness exceeds 70% on both synthetic and real-world splits; success rates plunge to under 30%.
- Human participants' susceptibility is less than half that of agents (DP ≈ 31%), with >80% task success.
Category-specific analysis reveals especially high effectiveness for obstruction and social proof patterns (DP ~90–100%), indicating that agentic models are highly sensitive to interface-level confounds and conformist cues.
Figure 4: Impact of model size and reasoning capability on dark pattern effectiveness; DP effectiveness increases with both factors.
Figure 6: Task distribution across dark pattern taxonomy categories for generated tasks.
The study probes the relationship between model capacity, test-time reasoning, and susceptibility to dark patterns via controlled scaling experiments. Using the Qwen-2.5-VL model family (spanning 3B to 72B parameters), as well as variable token budgets for Gemini-2.5-Flash and varying reasoning effort for OpenAI o3, the authors find:
- As model size and test-time computation increase, both task success and DP effectiveness rise in tandem. For Qwen-2.5-VL, DP effectiveness jumps from 43.8% (3B) to 73.7% (72B), with similar trends for Gemini-2.5-Flash token scaling.
- Longer or more sophisticated reasoning leads agents to integrate adversarial information more deeply into their action trajectories, often rationalizing the manipulative cues as legitimate in their CoT traces.
This phenomenon constitutes an "inverse scaling law" regarding adversarial robustness: greater capacity enables improved task completion but exacerbates vulnerability to manipulative web design, especially in the absence of explicit adversarial training.
Defense Mechanisms: In-Context Prompting and Guardrail Models
Evaluated defense strategies include In-Context Prompting (ICP) with explicit descriptions and exemplars of dark patterns, and guardrail models—auxiliary LLMs that classify elements as dark patterns and steer the primary agent away from malicious actions.

Figure 3: ICP and guardrail model defense effectiveness compared to baseline agent performance.
Key findings:
- ICP provides mild improvement, lowering DP effectiveness by ~12% on average (e.g., from 78.5% to 59.6% for GPT-4o). Gains are concentrated in highly visible, information-based dark patterns (social proof, urgency).
- Guardrail models are partially more effective, with an average DP reduction of ~29% and monotonic improvement in SR, but still leave agents vulnerable in complex misdirection and multi-step manipulation settings.
- Detailed analysis of failure traces shows that even when dark patterns are detected, agents often ignore or misinterpret defensive cues—rationalizing manipulated paths as optimal, falsely attributing authority, or failing to override adversarial content.
Theoretical and Practical Implications
The results robustly challenge assumptions that increased agent capability or naively augmented reasoning automatically confer adversarial robustness. On the contrary, the alignment between model objectives, task descriptions, and environmental cues leaves agents highly exploitable by adversarially designed interface elements—especially those that hijack affective signals, escalate interaction cost, or manipulate contextual reasoning.
Practical consequences include:
- Automated agents are at heightened risk in adversarial web environments, potentially incurring privacy leakage, unintentional expenditures, or propagating manipulative outcomes to end users.
- Model scaling and basic prompt-based countermeasures are fundamentally insufficient; robustness requires dedicated adversarial training or integrative defenses capable of compositional reasoning about environmental intent and UI structure.
- Agent design must address reasoning pathologies such as overtrust, authority confusion, and inability to generalize defense across manipulation modalities.
Theoretically, these findings highlight failure modes in goal-seeking RLHF/CoT-trained systems when faced with deceptive, non-linguistic input rooted in adversarial environmental context not represented in training distributions.
Representative Examples
Decepticon supplies both synthetic, adversarially constructed scenarios and faithfully captured in-the-wild instances.
Figure 8: Generated e-commerce website implementing a Trick Question dark pattern.
Figure 10: Generated event booking website with a Sneaking dark pattern.
Figure 7: Generated telecom website embedding a Forced Action dark pattern.
Figure 9: Real-world website containing a naturally occurring Urgency dark pattern.
These examples clarify the spectrum of strategies exploited, ranging from visual trickery and false scarcity to interface obstruction and information misdirection.
Limitations and Future Directions
While Decepticon delivers a comprehensive platform and diagnostic for dark pattern vulnerability, limitations include:
- Focus on LLM-centric, vision-equipped agents; robustness of emerging agent architectures (e.g., hybrid symbolic, planning-augmented) remains untested.
- Defensive strategies evaluated are limited to prompt augmentation and LLM-based detection; adversarial fine-tuning, uncertainty estimation, or interface-agnostic interpretability methods may yield further gains.
- Wider exploration of transferability/multi-modal dark patterns, longitudinal agent adaptation, and red-teaming at deployment scale is warranted.
Conclusion
"DECEPTICON: How Dark Patterns Manipulate Web Agents" (2512.22894) delivers a critical empirical, methodological, and theoretical advancement in understanding the susceptibility of web agents to adversarial manipulative design. The demonstrated failure of scaling, prompt engineering, and baseline guardrails in addressing the threat foregrounds an urgent need for robust, scalable defenses in agentic model training and evaluation. As web agents proliferate, adversarial UIs will remain a structural threat requiring continuous, rigorous red-teaming and algorithmic innovation.