
SGASA: Adaptive Safety Alignment

Updated 3 December 2025
  • SGASA is a two-stage framework that autonomously synthesizes and validates safety guidelines to counter adversarial jailbreak prompts.
  • It leverages automated prompt augmentation with a sequential SFT and DPO fine-tuning process to internalize adaptive safety measures.
  • Empirical results show enhanced safety scores and reduced dependency on human-authored rules, with robust performance across diverse adversarial datasets.

Synthesized Guideline-based Adaptive Safety Alignment (SGASA) is a two-stage framework designed to enhance the safety alignment of large reasoning models in the face of adversarial “jailbreak” prompts. SGASA autonomously generates, validates, and internalizes model-synthesized safety guidelines using minimal human input, thereby improving defense against covertly harmful inputs while maintaining high utility for benign requests. The method features automated data pre-synthesis through guideline induction and prompt generation, followed by a fine-tuning regimen leveraging Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), resulting in robust and scalable adaptive alignment (Wang et al., 26 Nov 2025).

1. Motivation and Problem Context

Reasoning models such as ChatGPT-o1/o3 and DeepSeek-R1 demonstrate strong performance in multi-step inference but are susceptible to adversarial jailbreaks—prompts deliberately encoded to obscure illicit intent using semantic misdirection (e.g., requesting step-by-step procedures as “thought experiments” or embedding harmful queries in mathematical language). Established alignment approaches—including Reinforcement Learning from Human Feedback (RLHF), in-context alignment with curated exemplars, and supervised fine-tuning on human-annotated rules—demonstrate brittleness to highly variable attacks, incur high annotation costs, and exhibit a trade-off between unnecessary refusals of benign queries and failing to reject subtle harmful prompts.

SGASA addresses these deficiencies by allowing the reasoning model to self-synthesize concise safety guidelines from a handful (approximately 10) of seed adversarial examples, then internalizing these defenses through automated data creation and staged fine-tuning. This approach eliminates dependence on human-authored safety lists, enabling rapid adaptation to novel adversarial prompt structures while preserving benign functionality.

2. Methodology: SGASA Pipeline

SGASA operates in two principal stages: Data Pre-synthesis and Alignment Fine-tuning. The pipeline is conceptualized as follows:

Stage | Function | Output
Data Pre-synthesis | Synthesize and validate safety guidelines; generate augmented prompts | Validated guidelines and a large, class-balanced pool of labeled adversarial-style prompts
Alignment Fine-tuning | SFT and DPO on the synthesized data | Model with internalized adaptive safety

2.1 Stage I: Data Pre-synthesis

The pre-synthesis phase utilizes approximately 5 harmful and 5 benign seed adversarial prompts. It consists of three sub-steps:

  • Guideline Generation: The reasoning model is prompted to propose candidate guideline sets that distinguish harmful from benign prompts, recommending explicit refusal and compliance strategies, each illustrated with concrete examples. The typical structure of a guideline includes linguistic harm cues, semantic/intent markers (such as “framing as thought experiment”), and a bifurcated response protocol.
  • Guideline Validation: Each candidate guideline is tested in context. A guideline is retained only if the model, when primed with it, consistently refuses all harmful seeds and accepts all benign seeds, as determined by automated rule-based refusal detectors (e.g., matching apology-style refusal phrases); a minimal sketch of this validation loop follows this list.
  • Prompt Augmentation: “Self-instruction” templates are used to generate approximately 5,000 new prompts per class, mimicking the seed style. The model then self-classifies these prompts; only correctly matched (e.g., harmful classified as harmful) examples are accepted for the alignment stage.
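
The validation step can be pictured as a simple loop over the seed prompts. The sketch below is a minimal illustration under stated assumptions, not the paper's code: the `generate` callable, the prompt template, and the specific refusal phrases stand in for the reasoning model call and the rule-based refusal detector described above.

```python
import re
from typing import Callable, Iterable

# Assumed rule-based refusal detector: matches common apology-style refusals.
REFUSAL_PATTERN = re.compile(
    r"\b(i'?m sorry|i cannot|i can'?t (help|assist)|i won'?t)\b", re.IGNORECASE
)

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal."""
    return bool(REFUSAL_PATTERN.search(response))

def validate_guideline(
    guideline: str,
    harmful_seeds: Iterable[str],
    benign_seeds: Iterable[str],
    generate: Callable[[str], str],  # hypothetical hook into the reasoning model
) -> bool:
    """Keep a guideline only if, when prepended to the prompt, the model
    refuses every harmful seed and complies with every benign seed."""
    for seed in harmful_seeds:
        if not is_refusal(generate(f"{guideline}\n\nUser request: {seed}")):
            return False
    for seed in benign_seeds:
        if is_refusal(generate(f"{guideline}\n\nUser request: {seed}")):
            return False
    return True
```

A candidate set {g_1, ..., g_N} would then be filtered to the guidelines for which validate_guideline returns True.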

2.2 Stage II: Alignment Fine-tuning

The second stage integrates the pre-synthesized guidelines and prompt data through two sequential procedures:

  • Supervised Fine-tuning (SFT): Each augmented prompt is paired with a randomly sampled guideline. For harmful prompts, only refusal responses are accepted; for benign prompts, the top-scoring non-refusal is selected. The resulting dataset D_{\text{SFT}} thus encodes correct safety behavior. The SFT loss is the standard negative log-likelihood (a code sketch of both training losses follows the DPO objective below):

L_{\text{SFT}}(\theta) = - \mathbb{E}_{(q, y) \in D_{\text{SFT}}} \left[ \log P_\theta(y \mid q) \right]

  • Direct Preference Optimization (DPO): To internalize guideline-driven behavior, SFT is followed by DPO on prompt/response pairs without explicit guideline context. For each prompt, a preferred (safety-consistent) and less-preferred response is identified according to class. The DPO objective minimizes the expected logistic loss over pairwise preference triples:

L_{\text{DPO}}(\theta) = -\mathbb{E}_{(q, y^+, y^-)} \left[ \log \sigma \left( r_\theta(q, y^+) - r_\theta(q, y^-) \right) \right]

with the preference score r_\theta(q, y) = \beta \left[ \log P_\theta(y \mid q) - \log P_{\text{ref}}(y \mid q) \right] and \sigma(\cdot) denoting the logistic sigmoid.
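
Both objectives are straightforward to express in PyTorch. The minimal sketch below assumes the per-sequence log-probabilities \log P_\theta(y \mid q) have already been computed (e.g., by summing token log-probs under the policy and the frozen reference model); the tensor names, shapes, and the \beta value are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(token_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Standard negative log-likelihood over response tokens (Stage II, SFT).
    Assumes logits are already aligned (shifted) with target_ids.
    token_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,  # mask out prompt/padding positions
    )

def dpo_loss(
    policy_logp_pos: torch.Tensor,   # log P_theta(y+ | q), shape (batch,)
    policy_logp_neg: torch.Tensor,   # log P_theta(y- | q)
    ref_logp_pos: torch.Tensor,      # log P_ref(y+ | q), reference model frozen
    ref_logp_neg: torch.Tensor,      # log P_ref(y- | q)
    beta: float = 0.1,               # illustrative value; not specified above
) -> torch.Tensor:
    """DPO objective: -log sigmoid(r(q, y+) - r(q, y-)) with
    r(q, y) = beta * (log P_theta(y|q) - log P_ref(y|q))."""
    r_pos = beta * (policy_logp_pos - ref_logp_pos)
    r_neg = beta * (policy_logp_neg - ref_logp_neg)
    return -F.logsigmoid(r_pos - r_neg).mean()
```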

At completion, the model’s safety alignment is encoded in its weights and no guideline context is necessary at inference.

3. Formalization and Implementation Details

The SGASA process can be summarized in the following pseudo-code:

1. Generate candidate guidelines {g_1, ..., g_N} via the base model M0 on the seed set S.
2. Validate: keep g_i iff M0, primed with g_i, refuses every s in S_harmful and accepts every s in S_benign.
3. Augment prompts: use M0 with self-instruction templates to create Q_h (harmful) and Q_b (benign).
4. Filter Q_h and Q_b by self-classification.
5. For each q in Q_h ∪ Q_b, sample a guideline g and generate a response y; build D_SFT = {(q, y)}.
6. Optimize θ via the SFT loss (L_SFT above).
7. For each q, construct a preferred (y^+) and a less-preferred (y^-) response.
8. Optimize θ via the DPO loss (L_DPO above).

Empirical details:

  • Pre-synthesis: 10 validated guidelines, ~5k prompts/class, filtered post self-classification.
  • SFT: 500 examples (250/class); DPO: 200 triples (100/class); LoRA fine-tuning (1 epoch, learning rate 5\times10^{-4}, 10% warmup, cosine scheduler).
  • Inference: temperature=0.6, maximum length=2048, vLLM deployment.
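
These hyperparameters map naturally onto standard tooling. The snippet below is a hedged sketch of how such a setup might look with Hugging Face transformers/peft and vLLM; the LoRA rank, target modules, batch size, output path, and model checkpoint are assumptions not stated in the summary above.

```python
from peft import LoraConfig
from transformers import TrainingArguments
from vllm import LLM, SamplingParams

# LoRA configuration (rank, alpha, and target modules are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Training schedule matching the reported settings:
# 1 epoch, learning rate 5e-4, 10% warmup, cosine scheduler.
training_args = TrainingArguments(
    output_dir="sgasa-sft",          # hypothetical output path
    num_train_epochs=1,
    learning_rate=5e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,   # assumption
)

# Inference with vLLM at the reported decoding settings.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # assumed checkpoint
sampling = SamplingParams(temperature=0.6, max_tokens=2048)
# outputs = llm.generate(["<prompt>"], sampling)
```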

4. Experimental Evaluation

SGASA was instantiated using DeepSeek-R1-derived models (R1-Qwen-7B, R1-Llama-8B). Experiments spanned three adversarial datasets:

  • WildJailbreak: 420 prompts, balanced harmful/benign
  • MathPrompt: 300 prompts, balanced
  • MaliciousEducator: 90 prompts, balanced

Each run used only 5 harmful + 5 benign seeds for guideline synthesis and prompt augmentation.

Baselines included:

  • Vanilla: No additional alignment.
  • In-Context Alignment (ICA): Random/curated exemplars.
  • Self-Align (Sun et al. 2023): SFT and DPO variants.

Evaluation metrics, in accordance with Guan et al. (2024), are:

  • Not Unsafe (safety): Fraction of harmful prompts judged by GPT-4o as refused/safe.
  • Not Over-refuse (utility): Fraction of benign prompts not unnecessarily refused.
  • Overall score: Mean of the above.
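
These metrics reduce to simple fractions over judge verdicts. The sketch below assumes per-prompt labels (whether a prompt is harmful or benign, and whether GPT-4o judged the response a refusal) have already been collected; it illustrates the scoring arithmetic only, not the paper's evaluation harness.

```python
def safety_scores(records):
    """records: iterable of (is_harmful: bool, is_refusal: bool) pairs,
    where is_refusal reflects the GPT-4o judgment of the model response.
    Assumes both classes are present."""
    harmful = [r for r in records if r[0]]
    benign = [r for r in records if not r[0]]
    # Not Unsafe: fraction of harmful prompts that were refused (judged safe).
    not_unsafe = sum(1 for _, refused in harmful if refused) / len(harmful)
    # Not Over-refuse: fraction of benign prompts that were NOT refused.
    not_overrefuse = sum(1 for _, refused in benign if not refused) / len(benign)
    overall = (not_unsafe + not_overrefuse) / 2
    return not_unsafe, not_overrefuse, overall
```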

Key results in Table 1 show SGASA(DPO) achieving the highest average on all datasets:

Dataset | Model | Vanilla Avg. | SGASA (DPO) Avg. | Δ
WildJailbreak | R1-Qwen-7B | 0.683 | 0.846 | +0.163
WildJailbreak | R1-Llama-8B | 0.766 | 0.883 | +0.117
MathPrompt | R1-Qwen-7B | 0.814 | 0.863 | +0.049
MathPrompt | R1-Llama-8B | 0.848 | 0.927 | +0.079
MaliciousEducator | R1-Qwen-7B | 0.555 | 0.900 | +0.345
MaliciousEducator | R1-Llama-8B | 0.589 | 0.911 | +0.322

SGASA's gains are especially pronounced on deception-intensive datasets such as MaliciousEducator.

5. Ablation Analyses and Generalization

Ablations reveal the effect of data and class ratio:

  • SFT data size: Scaling from 200 to 800 examples improves safety alignment, with diminishing returns near 1,000; excessive data may induce overfitting.
  • DPO class ratio: A 1:1 harmful:benign ratio is not universally optimal; moderately unbalanced settings (e.g., 5:3 benign:harmful) increase the overall score by better calibrating the refusal threshold.
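
Adjusting the class ratio amounts to subsampling the preference triples before DPO. A minimal sketch, assuming pre-built lists of benign and harmful triples and using the 5:3 ratio and 200-triple budget mentioned above (the helper name and defaults are illustrative):

```python
import random

def sample_dpo_triples(benign_triples, harmful_triples, ratio=(5, 3), total=200):
    """Subsample preference triples at a benign:harmful ratio (here 5:3)."""
    n_benign = round(total * ratio[0] / sum(ratio))
    n_harmful = total - n_benign
    return (random.sample(benign_triples, n_benign)
            + random.sample(harmful_triples, n_harmful))
```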

SGASA demonstrates strong cross-dataset generalization. Models trained on adversarial prompts of one style transfer robustly to other domains (Table 3), indicating that the synthesized guidelines encapsulate broadly applicable safety principles.

6. Significance and Implications

SGASA establishes that reasoning models can iteratively bootstrap their own defense protocols from minimal adversarial supervision. The fully automated pipeline—guideline induction, in-context validation, large-scale prompt augmentation, and sequential SFT/DPO fine-tuning—eliminates dependence on human-authored rules and scales with evolving attack modalities. The methodology preserves both safety and utility, outperforming competitive baselines on rigorous adversarial evaluations, and produces models whose internalized alignment persists even in the absence of guidance context at inference (Wang et al., 26 Nov 2025).

A plausible implication is that self-synthesized guidelines, once embedded, enable lasting adaptation to new adversarial attacks with minimal overhead, representing a promising direction for the safety alignment of rapidly evolving reasoning models.
