Decepticon Benchmark: Dark Patterns Analysis

Updated 19 January 2026

Decepticon Benchmark is a testing environment designed to quantify web agents’ vulnerabilities to deceptive UI designs known as dark patterns.
It employs a paired control-treatment structure with reproducible and task-completable settings using real-world and LLM-generated tasks.
Empirical results show that increased model size and reasoning tokens lead to higher susceptibility, steering agents toward unintended outcomes.

Decepticon is a benchmarking environment designed to systematically quantify and analyze the susceptibility of web agents to deceptive UI designs, widely known as dark patterns. By providing a reproducible, realistic, and task-completable sandbox, Decepticon isolates the causal effects of individual dark patterns on agent trajectories, revealing that state-of-the-art agents are steered toward malicious or unintended outcomes with far greater frequency than humans. Empirical evaluations demonstrate acute vulnerabilities in even the most advanced models, with effectiveness of dark patterns scaling positively with model size and test-time reasoning. These findings support the urgency of robust defense mechanisms and highlight the latent risk posed by manipulative UI designs to AI-driven web agents (Cuvin et al., 28 Dec 2025).

1. Purpose, Design Requirements, and Environment Architecture

Decepticon is architected around three primary design desiderata: reproducibility, task completability, and realism. To ensure reproducibility, Decepticon utilizes archived or self-hosted HTML/CSS/JS dumps, mitigating the risk of environment drift as real-world sites evolve. Task completability is enforced such that every manipulated page remains solvable—successfully reaching the user’s intended objective requires active avoidance or reversal of the injected dark pattern. Realism is preserved by sampling dark patterns from two sources: direct crawls from 100 “in-the-wild” websites and adversarially generated tasks using LLMs in the loop, mirroring authentic web implementations.

The environment operates on a paired control–treatment structure. For each of 600 generated tasks, both a baseline (“clean”) page and a dark-pattern-infused variant are provided, isolating the effect of manipulation. Tasks are structured as natural-language objectives (e.g., “Buy a bouquet under $30”), specifying a desired goal state (e.g., order confirmation) and introducing an embedded pattern (e.g., pre-checked premium option). Agent modalities include vision-based “Simple” scaffolds (Set-of-Marks, SoM), coordinate-based agents, and text-only scaffolds for analytical consistency.

2. Taxonomy of Dark Pattern Attacks

Decepticon employs a six-category taxonomy, adapted from Mathur et al. 2019, to systematically characterize modes of adversarial UI attack:

Category	Mode of Attack	Example Mechanism
Sneaking	Covert additions/fees	Pre-checked add-ons, hidden charges
Urgency	Artificial time pressure	Countdown timers, “Only X left!”
Misdirection	Visual/linguistic cues	Concealed alternatives, trick prompts
Social Proof	Fabricated/exaggerated stats	Fake testimonials, “X viewing” labels
Obstruction	Action friction/unwanted ease	Roach motel, popup blocks
Forced Action	Mandatory tangential tasks	Forced registration, cookie consent

Each category targets distinct cognitive or perceptual vulnerabilities, which are instantiated both in generated and real-world task splits.

3. Task Suite Generation and Acquisition Pipeline

The benchmark contains 700 total web navigation tasks: 600 adversarially generated and 100 extracted “in-the-wild.” Generated tasks are built through a multi-stage LLM-driven pipeline:

Base UI layouts synthesized via Gemini-2.5-Flash across domains (e-commerce, booking, information retrieval).
Pattern code generated by Gemini-2.5-Pro, guided by visual/textual prototypes.
Adversarial evaluation identifies agents’ failure cases; these guide further refinement to increase subtlety.
Human verification selects approximately 70% of candidates for realism, solvability, and non-redundancy.

In-the-wild tasks leverage existing dark pattern corpora (Mathur 2019, Nouwens 2020) and commercial web databases. LLM-assisted crawlers detect candidate pages, which are archived using wget and validated for dark pattern presence and solvability. Both splits target a balanced distribution across the six dark pattern categories. The resulting suite enables rigorous cross-model, cross-category analysis under controlled conditions.

4. Evaluation Protocol and Metrics

Episodes are sampled deterministically (temperature = 0, up to 15 steps) with 10 trajectories per agent-task pair. Two principal metrics quantify performance:

Success Rate (SR): Fraction of episodes achieving the user’s goal state.

$SR = \frac{\#\text{successful instructions}}{\#\text{total tasks}}$

Dark Pattern Effectiveness (DP $_\text{eff}$ ): Fraction of episodes steered toward malicious or unintended outcomes.

$DP_{\mathrm{eff}} = \frac{\#\text{malicious outcomes}}{\#\text{episodes with dark patterns}}$

Statistical validity is supported by mean ± standard error (SE), assuming Bernoulli trials:

$SE(\hat p) = \sqrt{\frac{\hat p(1-\hat p)}{n}}$

Pearson’s $r$ quantifies correlations between model size, reasoning tokens, and dark pattern susceptibility, substantiating the “inverse scaling” phenomenon.

5. Empirical Analysis of Agent Susceptibility

State-of-the-art agents (GPT-4o, GPT-5, Gemini-2.5) exhibit DP $_{\mathrm{eff}}$ above 70% for both generated and in-the-wild tasks, with SR under dark patterns dropping to approximately 20–26%. In control (no pattern) conditions, SR approaches 99–100%, DP $_{\text{eff}}$ = 0 $. Human participants demonstrate markedly higher resilience: SR ≈ 81%, DP$ _{\mathrm{eff}} $≈ 31–33%.</p> <p>Category-wise dark pattern effectiveness (Simple agent, generated split):</p> <div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'><thead><tr> <th>Category</th> <th>DP Effectiveness (%)</th> </tr> </thead><tbody><tr> <td>Obstruction</td> <td>97</td> </tr> <tr> <td>Social Proof</td> <td>90</td> </tr> <tr> <td>Sneaking</td> <td>74</td> </tr> <tr> <td>Urgency</td> <td>80</td> </tr> <tr> <td>Misdirection</td> <td>56</td> </tr> <tr> <td>Forced Action</td> <td>65</td> </tr> </tbody></table></div> <p>Scaling effects are pronounced: Qwen-2.5-VL family (3B → 72B) shows DP$ _{\mathrm{eff}} $rising from 38.5% to 73.7% ($ r_\text{size, DP} \approx 0.95 $). Increasing test-time reasoning (Gemini-Flash 256 → 16,384 tokens) likewise augments DP$ _{\mathrm{eff}} $(37.6% → 71.2%,$ r_\text{tokens, DP} \approx 0.89 $). These outcomes reflect an “inverse scaling law”—higher capabilities often entail greater vulnerability to manipulative cues, as longer reasoning increases the risk of over-interpreting deceptive elements.</p> <p>Common failure modes include:</p> <ol> <li>Overlooking covert additions (“hidden” fees/items).</li> <li>Misplaced reliance on false cues (“flash sales”).</li> <li>Reasoning errors with misdirection or complex linguistic structures (double negatives, confirm-shaming).</li> </ol> <h2 class='paper-heading' id='defense-strategies-and-limitations'>6. Defense Strategies and Limitations</h2> <p>Leading defense mechanisms offer limited mitigation:</p> <ul> <li><strong>In-Context Prompting (ICP):</strong> Prompt-level injection of dark pattern definitions/examples yields average DP$ _{\mathrm{eff}} $reduction ≈ 12%, with SR rising to ≈ 42%. Greatest improvement occurs with highly visible patterns (Urgency, Social Proof); minimal impact is seen with visually integrated misdirection.</li> <li><strong>Guardrail Models:</strong> Secondary LLM-based flagging achieves average DP$ _{\mathrm{eff}}$ reduction ≈ 28.6%, SR up to ≈ 58%. Despite uniform gains over ICP, environment-level attacks (Obstruction, Forced Action) remain problematic. Chain-of-thought analysis indicates persistent erroneous reasoning, even when patterns are detected.

These results underscore the inadequacy of surface-level interventions and suggest that robust defense will require integration of adversarial pattern detection mechanisms during post-training or fine-tuning, architectural intent-grounding, and iterative pattern avoidance strategies beyond element blacklisting.

7. Significance and Future Research Directions

Decepticon exposes dark patterns as a pervasive, high-impact adversarial risk for web agents, with susceptibility often exacerbated by increased model capacity. The benchmark’s granular taxonomy, paired control–treatment workflow, and rigorous empirical protocols provide a foundational resource for evaluating and ultimately improving agent robustness in adversarial UI conditions. Open-source distribution of tasks, evaluation code, and controls facilitates the development and comparison of advanced defense techniques, including contrastive training and adversarial fine-tuning.

A plausible implication is that future effective defenses will pivot towards architecture-level modifications, intent verification, and integrated adversarial training, especially given the documented failure of prompt-level and simple guardrail strategies. Decepticon thereby catalyzes further empirical and methodological research at the intersection of web safety, agent alignment, and adversarial robustness (Cuvin et al., 28 Dec 2025).

Markdown Upgrade to Chat

References (1)

DECEPTICON: How Dark Patterns Manipulate Web Agents (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Decepticon Benchmark.