Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (2501.18837v1)

Published 31 Jan 2025 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: LLMs are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

Summary

  • The paper introduces Constitutional Classifiers as a novel defense mechanism against universal jailbreaks using synthetic data guided by natural language rules.
  • It employs extensive red teaming with over 3,000 hours of testing and 405 participants, demonstrating a robust barrier against harmful queries.
  • The approach supports real-time intervention via streaming output predictions and blocks over 95% of held-out jailbreak attempts while preserving deployment viability.

The paper "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming" introduces Constitutional Classifiers, a novel defense mechanism against universal jailbreaks in LLMs. Universal jailbreaks are prompting strategies that can bypass model safeguards, enabling users to extract harmful information and execute complex, potentially dangerous tasks. The authors highlight the increasing concern as LLMs' chemical, biological, radiological, or nuclear (CBRN) capabilities grow, potentially allowing non-experts to perform sophisticated scientific processes.

The core idea involves augmenting LLMs with classifiers trained on synthetic data generated using a constitution, which is a set of natural language rules defining permitted and restricted content. The constitution guides the creation of training examples, and the classifiers monitor both inputs and outputs to block potentially harmful content. The authors emphasize that this approach allows for rapid adaptation to evolving threat models through updates to the constitution, including those related to model misalignment. To improve performance, data augmentation techniques and pools of benign data are employed. Critically, the output classifiers support streaming prediction, enabling real-time intervention by assessing potential harmfulness at each token without requiring the full output.
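
As a rough illustration of how such token-streaming moderation could be wired up, consider the minimal sketch below; the function names, threshold, and calling convention are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of streaming output classification. The function names,
# threshold, and calling convention are assumptions, not the paper's code.

def stream_with_guard(generate_tokens, value_head_score, threshold=0.5):
    """Stream tokens to the user, halting as soon as the running harmfulness
    estimate for the prefix generated so far crosses the threshold."""
    prefix = []
    for token in generate_tokens():           # model emits tokens one at a time
        prefix.append(token)
        score = value_head_score(prefix)      # predicted harmfulness of the full
                                              # output, estimated from this prefix
        if score > threshold:
            yield "[output blocked]"          # intervene mid-stream
            return
        yield token                           # otherwise pass the token through
```

Because the harmfulness estimate is updated per token, the guard can halt generation partway through rather than waiting for the complete output, which is what preserves streaming and time-to-first-token.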

The authors conducted extensive human red teaming on prototype classifiers fine-tuned from Claude 3.5 Sonnet. They engaged 405 participants, including experienced red teamers, in a bug-bounty program with monetary incentives for discovering universal CBRN jailbreaks. Despite accumulating over 3,000 hours of red teaming, no red teamer successfully answered all ten queries at half the level of detail of an unrestricted model. A more stringent criterion required jailbreaks to provide a similar level of detail to an unrestricted model; under this requirement, no red teamer answered more than six questions.

Following the human red teaming, the paper demonstrates deployment viability and flexibility by training improved Claude 3.5 Haiku classifiers with a new constitution designed to block chemical-weapons information. Automated evaluations, including novel jailbreak attempts, demonstrated that these classifiers substantially improved jailbreak robustness while preserving deployment viability. The classifier-guarded system refused over 95% of held-out jailbreaking attempts, compared to only 14% without classifiers, with a 0.38% absolute increase in production-traffic refusals and a 23.7% inference overhead.

The paper defines a universal jailbreak strategy as a prompting approach that reliably bypasses LLM safeguards on the vast majority of queries in a specific domain, leading the system to reveal highly detailed and specific harmful information. The desiderata for defenses against these jailbreaks are threefold: robustness to universal jailbreaks, practical deployment viability (reasonable inference overhead, preservation of time-to-first-token and streaming capabilities, and low false-positive rates), and flexibility to adapt to evolving domains of potential harm and novel attack patterns.

The authors detail their approach to training classifier safeguards on constitution-guided synthetic data to detect and block jailbreaks. This involves specifying a constitution using natural language rules to define categories of permitted and restricted content, generating synthetic data using a helpful-only model to span the constitutional categories, and employing data augmentation techniques to improve diversity and coverage. Additionally, they utilize automated red teaming (ART) to generate additional training data by prompting a helpful-only LLM with descriptions of known jailbreaking techniques and forbidden queries. The input-classification task is framed as a next-token-prediction problem, and the output classifier is fine-tuned with a linear value head that predicts the harmfulness of a full sequence of tokens from a prefix, allowing continuous token streaming.
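
A minimal sketch of what a prefix-supervised output classifier with a linear value head might look like is given below, assuming a PyTorch backbone that exposes per-token hidden states; the layer shapes, loss, and interface are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of an output classifier with a linear value head over prefixes.
# The backbone interface, hidden size, and loss are assumptions for illustration.
import torch
import torch.nn as nn

class OutputClassifier(nn.Module):
    def __init__(self, base_lm, hidden_size):
        super().__init__()
        self.base_lm = base_lm                      # pretrained LM backbone
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        # Assumes the backbone returns per-token hidden states of shape
        # (batch, seq_len, hidden_size).
        hidden = self.base_lm(input_ids).last_hidden_state
        # One harmfulness logit per position: each prefix predicts the
        # harmfulness of the *complete* output, which enables streaming use.
        return self.value_head(hidden).squeeze(-1)  # (batch, seq_len)

def prefix_loss(per_token_logits, sequence_label):
    # Every prefix position is supervised with the label of the full sequence.
    targets = sequence_label.float().unsqueeze(1).expand_as(per_token_logits)
    return nn.functional.binary_cross_entropy_with_logits(per_token_logits, targets)
```

Supervising every prefix with the full-sequence label is what lets the same head be queried after each generated token at inference time.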

The multi-stage, LLM-based rubric-grading pipeline used to evaluate jailbreak success works as follows: helpful-only models first generate baseline outputs for each forbidden query, and query-specific rubrics are constructed from the key information commonly present in those outputs. Candidate jailbreaks are then evaluated against the rubrics by an LLM grader, yielding a score that indicates how much harmful information is present.
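
A hedged sketch of this grading idea is below; `llm_grade` is a hypothetical stand-in for a call to the LLM grader, and the prompt wording and binary per-item check are assumptions.

```python
# Sketch of rubric-based grading. The prompt wording and the binary per-item
# check are assumptions; `llm_grade` stands in for a call to an LLM grader.

def grade_against_rubric(candidate_output, rubric_items, llm_grade):
    """Return the fraction of key rubric items present in a candidate output."""
    hits = 0
    for item in rubric_items:
        prompt = (
            "Does the response below contain the following piece of information?\n"
            f"Information: {item}\n"
            f"Response: {candidate_output}\n"
            "Answer yes or no."
        )
        if llm_grade(prompt).strip().lower().startswith("yes"):
            hits += 1
    return hits / len(rubric_items)
```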

The red teaming task provided participants with ten non-infohazardous target CBRN queries and offered bounties based on the number of questions answered, considering a model output to be jailbroken if it achieved at least half of the score obtained by an average helpful-only response. The usage logs were analyzed to estimate the red-teaming effort, which was validated using survey responses from active red teamers. The system red teamed involved constitutional classifiers developed using a CBRN-focused constitution, fine-tuned from Claude 3.5 Sonnet, and deployed to safeguard the same model.
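
The success criterion itself can be written down compactly; the sketch below uses assumed variable names and takes rubric scores (such as those produced by a grader like the one above) as inputs.

```python
# Sketch of the per-query success criterion used in the bug bounty. Variable
# names are assumptions; scores would come from a rubric grader as above.

def is_jailbroken(candidate_score, helpful_only_scores):
    """An output counts as jailbroken if its rubric score reaches at least half
    the average score of helpful-only model responses to the same query."""
    baseline = sum(helpful_only_scores) / len(helpful_only_scores)
    return candidate_score >= 0.5 * baseline
```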

The red-teaming results indicated that the approach effectively defended against universal jailbreaks. Red teamers employed diverse attack strategies, including gradual multi-turn approaches, few-shot attacks, role-playing scenarios, obfuscation methods, prompt injection, template-based attacks, and decomposition strategies. No red-teaming report achieved comprehensive success, and even the most successful red teamer achieved limited results when aggregating across reports and evaluating against stringent criteria.

The paper also presents a simplified model that represents complex scientific processes as requiring N sequential steps, each of which must succeed independently with the same probability. This model suggests that the safeguards could reduce uplift potential by multiple orders of magnitude for processes requiring many steps.
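
As an illustration of the compounding effect (with hypothetical numbers, not estimates from the paper): if each of N steps succeeds independently with probability p, the end-to-end success probability is p^N, so even a moderate per-step reduction compounds into several orders of magnitude.

```python
# Illustration of the sequential-step model. The probabilities below are
# hypothetical examples, not estimates from the paper.

def end_to_end_success(p_step, n_steps):
    """Probability of completing all n_steps when each succeeds independently."""
    return p_step ** n_steps

unguarded = end_to_end_success(0.9, 20)   # ~0.12
guarded = end_to_end_success(0.5, 20)     # ~9.5e-7
print(f"reduction in end-to-end success: {unguarded / guarded:.1e}x")
```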

The authors compared the performance of their classifier-guarded system against multiple baseline systems, including an unguarded instance of Claude 3.5 Sonnet (New) relying solely on harmlessness training and three variants of prompted classifier safeguards based on Claude 3.5 Sonnet. The results demonstrated that augmenting Claude 3.5 Sonnet (New) with constitutional input and output classifiers yielded significant improvements in jailbreak resistance, achieving the highest robustness among all evaluated baselines, with limited costs in terms of false-positive rates and computational overhead during inference. The increase in false-positive rates, however, is more noticeable for chemistry-related queries.

The paper also shows that classifier robustness scales with model size, data quantity, and data augmentation. The authors trained classifiers from model snapshots of five different sizes in the Claude 3 Sonnet series and observed that robustness consistently increases with model size. Moreover, fine-tuning Claude 3.5 Haiku classifiers on subsets of the training data yielded substantial improvements in robustness as the training set grew. Finally, incorporating additional data-augmentation transformations generally increases classifier robustness.

The authors discuss related work on classifier guards, fine-tuning on red-teaming attacks, model-internals approaches, unlearning and data filtering, and robustness via scaling inference-time compute. They conclude that Constitutional Classifiers, trained on LLM-generated synthetic data using constitutions of natural-language rules, provide substantial improvements in robustness and flexibility, offering a crucial role in safely deploying capable AI systems.
