
Reflective Safety Reasoning (ReSafe) Dataset

Updated 15 December 2025
  • Reflective Safety Reasoning (ReSafe) is a multimodal dataset that uses a three-stage think–reflect–revise process to enable LVLMs to identify and correct unsafe outputs.
  • It combines 5,000 annotated examples across safety-critical and general domains to balance policy-driven self-correction with robust reasoning.
  • The dataset significantly boosts safe response rates in LVLMs while preserving overall task performance, addressing vulnerabilities in single-pass response generation.

The Reflective Safety Reasoning (ReSafe) dataset is a three-stage, multimodal safety alignment resource designed to foster explicit self-reflection, policy-grounded reasoning, and revision behaviors in Large Vision-Language Models (LVLMs). Developed as part of the Think-Reflect-Revise (TRR) framework, ReSafe targets a primary vulnerability of single-pass think-then-answer paradigms: the tendency to overlook harmful output in initial generations. Through structured exposure to policy violation detection and revision, ReSafe operationalizes reflective self-correction, equipping LVLMs to identify and amend safety breaches before producing final responses (Weng et al., 8 Dec 2025).

1. Motivation and Conceptual Foundations

ReSafe addresses critical safety gaps revealed by multimodal jailbreak attacks, where LVLMs may inadvertently generate unsafe outputs. Approaches limited to single-pass safety reasoning often fail to recognize explicit harmful content contained within their own initial generations. ReSafe recovers this otherwise discarded signal by requiring the model to scrutinize its first answer explicitly against a formal policy and to self-correct iteratively through reflection and revision. This process is grounded in three objectives:

  • Multi-stage annotation structure (“think,” “reflect,” “revise”) to make latent safety failures explicit and correctable.
  • Policy-grounded reflections referencing distilled clauses from safety policies to increase specificity and accountability.
  • Inclusion of both safety-critical and general reasoning examples to mitigate catastrophic over-caution and preserve domain proficiency.

During model inference, only the final revised answer is presented to end users, embedding the safety check as an internalized, multi-step routine.

2. Data Composition and Annotation Protocol

ReSafe comprises 5,000 annotated, multimodal training examples encompassing both safety-critical and general domains:

  • Safety-related samples (2,000; 40%) are derived from BeaverTails-V, spanning 20 canonical harmful content categories—such as illicit-behavior instructions, visual jailbreak attacks, biased or hateful content, misinformation, self-harm, propaganda, privacy violations, and similar types.
  • General-reasoning samples (3,000; 60%) are drawn from GThinker and represent diverse scenarios in science, mathematics, and commonsense reasoning.

Within the safety subset, each of the 20 categories is evenly represented. The breakdown is summarized as follows:

Category                          #Examples   % of Safety Samples
Visual jailbreak                  100         5%
Illicit-behavior instructions     100         5%
Biased/Hateful content            100         5%
Misinformation/Conspiracy         100         5%
Self-harm/Suicide                 100         5%
Extremism                         100         5%
Privacy/Defamation                100         5%
Drug/Weapon instructions          100         5%
Sexual content                    100         5%
Violence/Threats                  100         5%
... (10 further categories)       100 each    5% each
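
As a rough illustration, the composition above can be expressed as a small sampling script. The pool variables and helper below are hypothetical; only the counts (100 samples per safety category, 3,000 general samples) come from the dataset description.

```python
import random

# Nominal ReSafe composition: 2,000 safety samples (100 per BeaverTails-V
# category) plus 3,000 general-reasoning samples drawn from GThinker.
SAFETY_CATEGORIES = [
    "visual_jailbreak", "illicit_behavior", "biased_hateful", "misinformation",
    "self_harm", "extremism", "privacy_defamation", "drug_weapon",
    "sexual_content", "violence_threats",   # ... plus 10 further categories
]
PER_CATEGORY = 100    # even representation within the safety subset
N_GENERAL = 3000      # science, mathematics, and commonsense reasoning

def build_mix(safety_pool, general_pool, seed=0):
    """Assemble the 40/60 safety/general training mix (illustrative only)."""
    rng = random.Random(seed)
    safety = [s for cat in SAFETY_CATEGORIES
              for s in rng.sample(safety_pool[cat], PER_CATEGORY)]
    general = rng.sample(general_pool, N_GENERAL)
    mix = safety + general
    rng.shuffle(mix)
    return mix
```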

Each sample proceeds through at most five iterations of the “think–reflect–revise” loop (samples still unresolved after five iterations are discarded), with the following average token lengths per annotation stage:

  • Think: 80 ± 30 tokens
  • Answer₁: 40 ± 20 tokens
  • Reflect: 100 ± 40 tokens
  • Answer₂: 50 ± 25 tokens

3. Annotation Schema and Prompt Structure

The dataset is organized using a standardized internal schema:

{
  "query": { "text": "...", "image_url": "..." },
  "think": "",
  "answer_1": "",
  "reflection": "",
  "answer_2": ""
}

System prompts employ exact section markers for each stage:

<think> ... </think>
<answer> ... </answer>
<reflect> ... </reflect>
<answer> ... </answer>
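
Because the four stages are delimited by these exact markers, a generated trace can be split back into the schema fields with a simple parser. The sketch below is illustrative only and assumes well-formed tags; the function and regular expression are not part of any released tooling.

```python
import re

# Expected order in a ReSafe-style trace:
# <think>...</think> <answer>...</answer> <reflect>...</reflect> <answer>...</answer>
_TAG = re.compile(r"<(think|answer|reflect)>(.*?)</\1>", re.DOTALL)

def parse_trace(text: str) -> dict:
    """Split a model trace into think / answer_1 / reflection / answer_2."""
    sections = _TAG.findall(text)
    answers = [body.strip() for tag, body in sections if tag == "answer"]
    return {
        "think": next((b.strip() for t, b in sections if t == "think"), ""),
        "answer_1": answers[0] if answers else "",
        "reflection": next((b.strip() for t, b in sections if t == "reflect"), ""),
        "answer_2": answers[1] if len(answers) > 1 else "",
    }

# At inference time only parse_trace(trace)["answer_2"] would reach the user.
```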

A representative example (money counterfeiting scenario):

  • Think: Model extracts context and considers intent.
  • Answer₁: Unsafe step-wise instructions.
  • Reflect: Explicit critique referencing the violated policy clause (e.g., “Illicit behavior”).
  • Answer₂: Compliant refusal or redirection, tailored to formal policy.

4. Policy Signals, Reward Assignment, and Modeling Objectives

Training incorporates both supervised fine-tuning and reinforcement learning, focused on instilling policy-driven reflection and safety-oriented revision.

Supervised Fine-Tuning:

The loss is defined as

$$\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{(x,\, t,\, a_1,\, r,\, a_2) \sim \mathcal{D}} \Big[ \log \pi_{\theta}\big(t \oplus a_1 \oplus r \oplus a_2 \mid x\big) \Big]$$
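
A minimal PyTorch sketch of this objective, assuming a causal LVLM with a Hugging Face-style tokenizer; image inputs are omitted for brevity and all names are illustrative rather than from a released implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, query_ids, t, a1, r, a2, device="cuda"):
    """Negative log-likelihood of the concatenated trace t ⊕ a1 ⊕ r ⊕ a2 given x.

    `model` is assumed to be a causal LM returning logits of shape
    (batch, seq_len, vocab); `query_ids` already holds the tokenized query x
    (visual inputs would be passed alongside in a real LVLM).
    """
    trace = t + a1 + r + a2                               # string concatenation
    trace_ids = tokenizer(trace, return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([query_ids.to(device), trace_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : query_ids.shape[1]] = -100                # no loss on the query x

    logits = model(input_ids=input_ids).logits
    # Shift so that position i predicts token i+1 (standard causal LM loss).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```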

Reinforcement Learning:

Group Relative Policy Optimization (GRPO) is used for policy refinement. Reward normalization is performed across groups of size $G$:

$$A_j = \frac{r_j - \mu}{\sigma}, \qquad \mu = \frac{1}{G} \sum_{i=1}^{G} r_i, \qquad \sigma = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \mu)^2}$$

The clipped RL objective is:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\, A_j,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_j\big)\Big]$$
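
A compact PyTorch sketch of the group-normalized advantages and the clipped surrogate, assuming per-response rewards and log-probabilities for a group of size G are already available; variable names are illustrative.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Clipped GRPO surrogate over one group of sampled responses.

    `rewards`, `logp_new`, and `logp_old` are 1-D tensors of length G holding
    one value per response. Returns a scalar loss to minimize.
    """
    # Group-relative advantages: A_j = (r_j - mu) / sigma.
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False).clamp_min(1e-8)
    advantages = (rewards - mu) / sigma

    # Importance ratio r_t(theta) with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Maximizing the clipped objective is minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```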

Rewards are composed as follows:

  • Safety reward for safety data: $R_{\mathrm{safety}} = w_1 R_s(a_1) + w_2 R_s(a_2)$, where $R_s(a) = 1$ if the output is classified “safe,” with $w_1 = 0.3$ and $w_2 = 1.0$.
  • General reward for general data: $R_{\mathrm{general}} = w_1 R_{\mathrm{acc}}(a_1) + w_2 R_{\mathrm{acc}}(a_2) + w_3 R_{\mathrm{helpful}}(a_2)$, with $w_1 = 0.3$, $w_2 = 1.0$, and $w_3 = 1.0$.
  • Format reward: Additional bonus if all four sections are present.
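
A minimal sketch of this composite reward under the stated weights; the safety, accuracy, and helpfulness scorers are assumed to be external callables returning values in [0, 1], and the size of the format bonus is a placeholder rather than a value from the paper.

```python
def composite_reward(sample_type, a1, a2, is_safe, accuracy, helpfulness,
                     has_all_sections=True, format_bonus=0.1):
    """Composite reward with w1 = 0.3, w2 = 1.0, w3 = 1.0 (illustrative sketch)."""
    if sample_type == "safety":
        reward = 0.3 * is_safe(a1) + 1.0 * is_safe(a2)
    else:  # general-reasoning data
        reward = 0.3 * accuracy(a1) + 1.0 * accuracy(a2) + 1.0 * helpfulness(a2)
    if has_all_sections:          # format reward: all four tagged sections present
        reward += format_bonus    # bonus magnitude is a placeholder
    return reward
```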

A specialized safety reward model detects harmful policy violations. Examples failing to yield a safe revision after five iterations are excluded.

5. Integration and Utilization Practices

Recommended training workflow using ReSafe involves:

  1. Supervised Fine-Tuning: The LVLM is trained on all four annotation fields to internalize the think–reflect–revise cycle.
  2. Policy-Guided RL with GRPO: RL is initialized with mixed safety and general samples, maintaining defined reward weights.
  3. Inference Protocol: Only the final revised answer (<answer₂>) is relayed to users, masking intermediate reasoning and self-critique.
  4. Large Model Adaptation: Utilization of LoRA adapters during SFT is advised for preserving general capabilities.
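
A minimal sketch of the LoRA adaptation in step 4, assuming a Hugging Face-style base model and the `peft` library; the rank, scaling, and target modules are illustrative defaults, not values reported for ReSafe.

```python
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_model):
    """Attach LoRA adapters for SFT so the frozen base LVLM retains its general skills."""
    lora_cfg = LoraConfig(
        r=16,                      # illustrative rank, not from the paper
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_model, lora_cfg)
```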

Migration to domains with different safety requirements necessitates expansion or replacement of distilled policy documents.

6. Limitations and Operational Considerations

  • Inference Overhead: The reflection stage increases computation by approximately 1.3–1.5×.
  • Policy Document Sensitivity: Applying the dataset beyond its original safety taxonomy requires updating or extending the distilled policy documents.
  • Architecture Dependence: LVLMs with alternative vision encoders may necessitate prompt and template adaptation.
  • Policy Coverage Quality: Reflection reliability is contingent on the specificity and clarity of distilled policies. Ambiguities can induce confirmatory or weak critiques.

A plausible implication is that improvements in policy formulation or interpolation between safety categories may yield further robustness, but explicit evidence is required.

7. Impact on Safety Alignment and Robustness

The ReSafe dataset, as part of the TRR framework, has demonstrated significant empirical benefits:

  • On Qwen2.5-VL-7B, the safe response rate increases from 42.8% to 87.7% while maintaining stable scores on general benchmarks such as MMMU and MMStar.
  • ReSafe exemplifies multi-stage, policy-conditioned safety alignment in LVLMs, advancing beyond interpretability-only regimes by hardening models against multimodal jailbreaks and explicit policy violations (Weng et al., 8 Dec 2025).

These results indicate that reflective, policy-grounded training is pivotal for the next generation of robust, safe LVLMs.
