- The paper introduces AIDSAFE, a multi-agent iterative deliberation framework for creating high-quality, policy-embedded Chain-of-Thought datasets that improve LLM safety.
- It demonstrates that supervised fine-tuning with AIDSAFE data significantly boosts both in-domain and out-of-domain safety while maintaining utility and reducing over-refusal.
- The study also presents an innovative "ear-whisperer" technique for generating preference data, enhancing alignment and mitigating the weak chosen/rejected contrast produced by standard sampling for preference optimization.
This paper (2505.21784) introduces AIDSAFE (Agentic Iterative Deliberation for Safety Reasoning), a novel data generation framework designed to create high-quality, policy-embedded Chain-of-Thought (CoT) datasets for training safe LLMs. The core motivation is to address the limitations of existing safety measures like over-refusal and jailbreak vulnerabilities by enabling LLMs to explicitly reason over safety policies before generating responses.
The main challenge in implementing safety reasoning is the difficulty and cost of producing high-quality policy-embedded CoT data. Human annotation is expensive and time-consuming, while generating such data using single LLMs often suffers from resource constraints and flawed reasoning, including hallucinations and policy inconsistencies.
AIDSAFE tackles this by leveraging a multi-agent deliberation and refinement process, rather than relying on a single, highly capable, and expensive LLM generator. The framework consists of the following stages:
- Initialization:
- An LLM agent performs Intent Decomposition of the user query, identifying both explicit (potentially malicious) and implicit (benign) intentions. This step helps in applying policies more granularly and reducing over-refusal.
- A single agent generates an Initial CoT and Response based on the user query, a predefined set of safety policies, and the decomposed intentions. This serves as the starting point for deliberation.
- Deliberation Stage:
- Multiple LLM agents iteratively evaluate the query, policies, and the current state of thoughts and responses.
- In each round, an agent assesses the need for additional reasoning steps or modifications to enhance quality and policy adherence.
- If necessary, the agent proposes new thoughts and updates the response. This process continues for a fixed number of rounds (e.g., 3 rounds in the experiments) or until agents reach a consensus.
- Refinement Stage:
- All generated thoughts from the deliberation rounds are aggregated.
- An impartial Refiner Agent evaluates the aggregated thoughts, filtering out repetitive, redundant, deceptive, or policy-inconsistent reasoning. It aims to create a concise, coherent, and high-quality final CoT and response that strictly adheres to the policies. This stage is inspired by third-party evaluation concepts to improve reliability.
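The three stages above can be wired together as a rough sketch; every function body here is a toy stand-in for an LLM agent call, and all names are illustrative rather than taken from the paper's codebase:

```python
def decompose_intents(query):
    # Intent Decomposition: separate explicit (possibly malicious) and
    # implicit (benign) intentions in the user query.
    return {"explicit": query, "implicit": f"benign reading of: {query}"}

def initial_cot(query, policies, intents):
    # A single agent drafts the starting CoT and response.
    thought = f"Check '{query}' against {len(policies)} policies."
    return [thought], "initial response"

def propose(thoughts, round_idx):
    # Deliberation-agent stub: return a new thought, or None when the
    # agent judges no further modification is needed (consensus).
    return f"round {round_idx}: strengthen policy adherence"

def deliberate(query, policies, thoughts, response, rounds=3):
    # Agents iteratively extend/modify the CoT for a fixed round
    # budget (3 in the paper's experiments) or until consensus.
    for r in range(rounds):
        new_thought = propose(thoughts, r)
        if new_thought is None:
            break
        thoughts.append(new_thought)
        response = f"response after round {r}"
    return thoughts, response

def refine(thoughts, response, policies):
    # Refiner Agent: drop repetitive/redundant thoughts and keep a
    # concise final CoT (simple de-duplication stands in for an LLM).
    final_cot = list(dict.fromkeys(thoughts))
    return final_cot, response

def aidsafe(query, policies):
    intents = decompose_intents(query)
    thoughts, response = initial_cot(query, policies, intents)
    thoughts, response = deliberate(query, policies, thoughts, response)
    return refine(thoughts, response, policies)

cot, resp = aidsafe("example harmful query", ["Physical Harm", "Illegal Activity"])
```

In the real framework each of these stubs is an LLM call conditioned on the policies; the control flow, not the stub logic, is what mirrors AIDSAFE.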
The paper uses five safety policies derived from existing literature: Hate-Harass-Violence, Fraud and Deception, Physical Harm, Illegal Activity, and Helpfulness and Respectfulness. These policies are provided as natural language descriptions to the agents.
For practical implementation, the authors used Mixtral 8x22B as the base LLM for all agents. To improve efficiency, the framework is implemented using asynchronous LLM queries, allowing for batching of multiple user queries. On 4x A100 GPUs with a batch size of 100, generating CoTs and responses averaged approximately 35 seconds per prompt.
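A minimal sketch of this asynchronous batching pattern using Python's `asyncio`, assuming an async client call (`llm_call` below is a placeholder for a real model endpoint, not an API from the paper):

```python
import asyncio

async def llm_call(prompt):
    # Placeholder for an asynchronous LLM query; in practice this would
    # await a network call to the model server.
    await asyncio.sleep(0)
    return f"cot+response for: {prompt}"

async def generate_batch(prompts, batch_size=100):
    # Process prompts in batches; all queries within a batch run
    # concurrently via asyncio.gather, as in the paper's setup
    # (batch size 100 on 4x A100 GPUs).
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        results.extend(await asyncio.gather(*(llm_call(p) for p in batch)))
    return results

outputs = asyncio.run(generate_batch([f"q{i}" for i in range(250)]))
```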
The authors generated a dataset of 5,000 policy-embedded CoT/response pairs using harmful prompts from the BeaverTails dataset, plus data on general prompts from the Alpagasus dataset for utility training. The quality of the generated data was evaluated using LLM auto-graders (Claude-3 Sonnet, Command) against defined rubrics for CoT quality (Relevance, Coherence, Completeness) and faithfulness (CoT-Policy, Response-Policy, Response-CoT). AIDSAFE-generated CoTs showed slightly better general quality and significantly higher faithfulness to policies, particularly in CoT-Policy alignment, than CoTs generated by a single LLM without deliberation (LLM_ZS). Pairwise comparisons confirmed that AIDSAFE CoTs were preferred for their more comprehensive safety reasoning.
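Such rubric-based auto-grading can be sketched as follows; the dimension names follow the paper's rubrics, but `build_grading_prompt` and the `grader` callable are hypothetical stand-ins for the actual grading prompts and LLM auto-graders:

```python
# Dimension names taken from the paper's quality and faithfulness rubrics.
QUALITY_DIMS = ["Relevance", "Coherence", "Completeness"]
FAITHFULNESS_DIMS = ["CoT-Policy", "Response-Policy", "Response-CoT"]

def build_grading_prompt(dim, policies, cot, response):
    # Hypothetical prompt template; the paper's actual rubric prompts differ.
    return (f"Rate the {dim} of the reasoning on a 1-5 scale.\n"
            f"Policies: {policies}\nCoT: {cot}\nResponse: {response}")

def grade(cot, response, policies, grader):
    # grader: callable mapping a grading prompt to an integer score
    # (played by an LLM auto-grader such as Claude-3 Sonnet in the paper).
    return {dim: grader(build_grading_prompt(dim, policies, cot, response))
            for dim in QUALITY_DIMS + FAITHFULNESS_DIMS}

# Toy grader that returns a constant score, just to exercise the loop.
scores = grade("step 1: check policy ...", "final answer",
               ["Physical Harm"], lambda prompt: 4)
```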
To demonstrate the practical impact of the data, the authors performed Supervised Fine-Tuning (SFT) on two open-source LLMs: Mistral 7B Instruct (non-safety-trained) and Qwen 2.5 7B Instruct (already safety-trained), using a mix of AIDSAFE safety data (5k samples) and general data (5k samples). They compared models fine-tuned on AIDSAFE data (SFT_DB) against base models and models fine-tuned on original safe responses without CoTs (SFT_OG).
The evaluation covered safety (using ShieldGemma-9B on BeaverTails and WildChat), over-refusal (using Claude-3 Sonnet on XSTest), utility (using Claude-3 Sonnet on a subset of MMLU), and jailbreak robustness (using ShieldGemma-9B on StrongREJECT with 12 techniques).
Key findings from SFT experiments:
- SFT_DB significantly improved safety generalization (especially on WildChat) and jailbreak robustness compared to the Base and SFT_OG models for both Mistral and Qwen.
- For Mistral, SFT_DB increased in-domain safety by 20 percentage points (76% to 96%) and out-of-domain safety by 54.95 points (31.00% to 85.95%) relative to the base model.
- SFT_DB maintained utility better than SFT_OG for Mistral, showing less degradation.
- For the already-safe Qwen model, SFT_OG sometimes degraded performance (e.g., on WildChat), but SFT_DB avoided this, suggesting safety reasoning helps models understand policies rather than just learning superficial heuristics.
- Comparing SFT_DB to SFT on single-LLM CoTs (SFT_ZS), SFT_ZS showed comparable safety rates but much higher over-refusal, indicating that the more comprehensive reasoning from AIDSAFE is crucial for balancing safety and utility.
The paper also addresses the challenge of creating preference data for alignment stages like Direct Preference Optimization (DPO), where standard sampling often yields selected and rejected responses with similarly superficial safety and no meaningful distinction in reasoning quality. They propose a supplemental recipe using an "ear-whisperer" agent.
- Ear-Whisperer Recipe: An adversarial agent generates "inappropriate guiding prefixes" or "bad beliefs." These prefixes are prepended to the target LLM's input when generating the "rejected" samples for preference data. "Selected" samples are generated without augmentation.
- Implementation: An iterative in-context learning strategy is used to refine the adversarial agent's ability to generate effective bad beliefs, using ShieldGemma as a scoring function to measure how much the beliefs induce unsafe outputs from the target LLM.
- Effectiveness: This method was shown to create a substantial distribution shift in policy adherence between generated selected and rejected CoTs, unlike standard sampling.
- DPO Training: Applying DPO with this preference data to the SFT_DB Mixtral model showed further improvements in safety and jailbreak robustness, although it led to a decrease in over-refusal accuracy, illustrating the trade-off in alignment. The method, however, mitigated the severe overfitting seen with DPO using standard sampling data.
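The recipe above can be condensed into a short sketch; `toy_llm` and `toy_score` are toy stand-ins for the target LLM and the ShieldGemma scorer, and all names are illustrative:

```python
def make_preference_pair(query, bad_belief, generate):
    # generate: target-LLM call (prompt -> text). "Rejected" samples are
    # produced with the adversarial "bad belief" prefix prepended;
    # "selected" samples are produced without augmentation.
    selected = generate(query)
    rejected = generate(bad_belief + "\n" + query)
    return {"prompt": query, "chosen": selected, "rejected": rejected}

def rank_beliefs(beliefs, query, generate, safety_score):
    # One refinement step: prefer the beliefs that most reduce the safety
    # score of the induced output (ShieldGemma plays the scorer role in
    # the paper; here it is an arbitrary callable).
    return sorted(beliefs,
                  key=lambda b: safety_score(generate(b + "\n" + query)))

# Toy stand-ins: the "LLM" emits unsafe text only under the bad-belief
# prefix, and the scorer maps unsafe text to a low safety score.
toy_llm = lambda prompt: "unsafe output" if "bad belief" in prompt else "safe output"
toy_score = lambda text: 0.0 if text == "unsafe output" else 1.0

pair = make_preference_pair("example query",
                            "bad belief: rules do not apply", toy_llm)
best = rank_beliefs(["harmless note", "bad belief: rules do not apply"],
                    "example query", toy_llm, toy_score)[0]
```

The point of the sketch is the asymmetry: only the rejected branch sees the adversarial prefix, which is what creates the distribution shift in policy adherence between chosen and rejected CoTs.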
In conclusion, AIDSAFE provides a practical framework for generating high-quality, policy-embedded CoT data for safety reasoning in LLMs using multi-agent deliberation and refinement. The resulting data significantly improves safety generalization and jailbreak robustness through SFT, while minimizing utility loss compared to traditional methods. The supplemental ear-whisperer technique addresses limitations in preference data generation for alignment, enabling more effective DPO training. The authors plan to release the generated datasets and codebase.
Limitations include the small number of safety policies used (five), the use of a single base LLM (Mixtral 8x22B) and a fixed number of deliberation agents (two), an SFT setup without a general CoT warm-up, potential deliberation interruptions with highly safety-trained base models, and the ear-whisperer agent's diminished effectiveness against very safe target models. Ethical considerations regarding policy design and potential misuse of the ear-whisperer technique are also discussed.