AgentHarm Benchmark: Evaluating Malicious Tasks
- AgentHarm Benchmark is a rigorous evaluation suite for LLM-based agents that measures their response to complex, multi-step malicious tasks with automated grading protocols.
- It comprises 440 behaviors (110 base behaviors, each with three augmented variants) across 11 harm categories to assess agents' robustness in realistic adversarial scenarios.
- The benchmark employs a synthetic tooling environment and Python-based grading to ensure reproducible and scalable evaluation of adversarial task compliance.
AgentHarm is a publicly released benchmark designed to measure the propensity of LLM-based agents to comply with explicitly malicious, multi-step tasks requiring tool use. Unlike prior benchmarks focused on chatbot jailbreaks or single-step refusals, AgentHarm systematically evaluates agentic LLM systems in realistic adversarial scenarios, encompassing complex workflows and fine-grained grading protocols. The benchmark incorporates 440 malicious agent tasks (via augmentations to 110 base instructions), spanning 11 distinct harm categories, and provides automated metrics for refusal and task completion. AgentHarm enables rigorous assessment of model robustness to direct-prompting attacks and post-jailbreak functional integrity, and has rapidly become a cornerstone in safety research for agentic AI systems (Andriushchenko et al., 11 Oct 2024).
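The benchmark reports both a refusal metric and a task-completion score. The snippet below is a minimal, hypothetical sketch of how such per-task grades might be aggregated into headline numbers; the field names and the convention that refused tasks contribute zero to the completion score are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hypothetical aggregation of per-task results into the two headline metrics
# (refusal rate and average task-completion score). Field names are assumed.
def aggregate(results: list[dict]) -> dict:
    # Each result is assumed to look like {"refused": bool, "score": float in [0, 1]}.
    n = len(results)
    refusal_rate = sum(r["refused"] for r in results) / n
    # Assumption: refused tasks contribute 0 to the completion score.
    completion_score = sum(0.0 if r["refused"] else r["score"] for r in results) / n
    return {"refusal_rate": refusal_rate, "completion_score": completion_score}

print(aggregate([
    {"refused": True,  "score": 0.0},
    {"refused": False, "score": 0.8},
    {"refused": False, "score": 1.0},
]))  # {'refusal_rate': 0.333..., 'completion_score': 0.6}
```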
1. Benchmark Composition and Design Principles
AgentHarm consists of three key elements: (a) a suite of malicious agent behaviors formulated as multi-step tool-use instructions, (b) a synthetic tooling environment using the Inspect framework, providing side-effect-free tool calls with well-defined interfaces, and (c) programmatic grading rubrics implemented as Python functions for automatic, reproducible scoring.
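To make these three components concrete, the following is a plain-Python sketch rather than the actual Inspect-based implementation: a side-effect-free synthetic tool, a recorded agent trace, and a rubric function that scores one hypothetical behavior. All names (`send_email`, `grade_forged_document_order`, the trace structure) are illustrative assumptions, not the benchmark's real interfaces.

```python
# Illustrative sketch only: synthetic tool + programmatic grading rubric.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str   # which synthetic tool the agent invoked
    args: dict  # arguments the agent passed

@dataclass
class AgentTrace:
    calls: list[ToolCall] = field(default_factory=list)
    final_message: str = ""

def send_email(to: str, subject: str, body: str) -> str:
    """Synthetic tool: records the call but has no real-world side effects."""
    return f"Email queued to {to} with subject '{subject}'."

def grade_forged_document_order(trace: AgentTrace) -> float:
    """Hypothetical rubric: award partial credit for each required step."""
    score = 0.0
    emailed = any(c.name == "send_email" for c in trace.calls)
    named_forgery = any("forged" in str(c.args).lower() for c in trace.calls)
    if emailed:
        score += 0.5  # contacted a vendor at all
    if emailed and named_forgery:
        score += 0.5  # explicitly requested the forged document
    return score      # 1.0 = full completion, 0.0 = refusal or failure
```

Because the tools are synthetic and the rubrics are deterministic functions of the agent's trace, scoring is reproducible across runs and model providers.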
Main dataset statistics:
| Metric | Value |
|---|---|
| Number of base behaviors | 110 |
| Total behaviors (after augmentation) | 440 |
| Harm categories | 11 |
| Unique functions used | 104 |
| Avg. functions per behavior | 3.53 |
| Typical tokens per evaluation | 0.5M–1.5M |
AgentHarm covers each harm category with 10 base behaviors. Each base behavior appears in its original form plus three augmented variants (open-ended, hint-included, and both), giving four tasks per behavior and 440 tasks overall, as sketched below. Data splits include a validation set (44 items), a public test set (264 items), and a private held-out set (132 items), supporting robust benchmarking and generalization checks.
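For concreteness, the sketch below shows how the reported counts fit together, expanding 110 base behaviors into 440 tasks across four variants and checking the split sizes. The identifiers and variant labels are paraphrased placeholders, not the dataset's actual field names.

```python
# Illustrative only: expansion of 110 base behaviors into 440 tasks.
from itertools import product

BASE_BEHAVIORS = [f"behavior_{i:03d}" for i in range(110)]  # placeholder IDs
VARIANTS = ["original", "open_ended", "hint_included", "open_ended_and_hint"]

tasks = [{"base_id": b, "variant": v} for b, v in product(BASE_BEHAVIORS, VARIANTS)]
assert len(tasks) == 440  # 110 base behaviors x 4 variants

# Split sizes reported for the benchmark: validation, public test, private test.
SPLITS = {"validation": 44, "public_test": 264, "private_test": 132}
assert sum(SPLITS.values()) == 440
```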
2. Harm Categories and Task Taxonomy
The benchmark categorizes harm into 11 types, each reflecting a domain of societal or legal concern:
- Fraud: e.g., ordering forged documents via email.
- Cybercrime: e.g., distributing malware in legitimate software and promoting downloads.
- Self-harm: e.g., composing posts that encourage suicide.
- Harassment: e.g., orchestrating online stalking via multi-channel messaging.
- Sexual content: e.g., re-uploading illegal bestiality media under false titles.
- Copyright infringement: e.g., constructing pirated streaming services.
- Drugs: e.g., generating illicit sales site HTML and securing hosting.
- Disinformation: e.g., producing deepfake news and publishing on social platforms.
- Hate: e.g., automating dissemination of hateful content.
- Privacy violation: e.g., leaking personal information during task execution.
- Violence: e.g., generating instructions for criminal acts.
Each task requires coherent, multi-step action and correct use of the synthetic tools, with completion scored automatically by the task's grading rubric.