Reflective Safety Reasoning (ReSafe) Dataset
- Reflective Safety Reasoning (ReSafe) is a multimodal dataset that uses a three-stage think–reflect–revise process to enable LVLMs to identify and correct unsafe outputs.
- It combines 5,000 annotated examples across safety-critical and general domains to balance policy-driven self-correction with robust reasoning.
- The dataset significantly boosts safe response rates in LVLMs while preserving overall task performance, addressing vulnerabilities in single-pass response generation.
The Reflective Safety Reasoning (ReSafe) dataset is a three-stage, multimodal safety alignment resource designed to foster explicit self-reflection, policy-grounded reasoning, and revision behaviors in Large Vision-Language Models (LVLMs). Developed as part of the Think-Reflect-Revise (TRR) framework, ReSafe targets a primary vulnerability of single-pass think-then-answer paradigms: the tendency to overlook harmful content in initial generations. Through structured exposure to policy violation detection and revision, ReSafe operationalizes reflective self-correction, equipping LVLMs to identify and amend safety breaches before producing final responses (Weng et al., 8 Dec 2025).
1. Motivation and Conceptual Foundations
ReSafe addresses critical safety gaps exposed by multimodal jailbreak attacks, in which LVLMs may inadvertently generate unsafe outputs. Models limited to single-pass safety reasoning often fail to recognize explicitly harmful content in their own initial generations. ReSafe turns this otherwise unused signal into a training target by requiring the model to scrutinize its first answer against a formal policy and to self-correct through reflection and revision. This process is grounded in three objectives:
- Multi-stage annotation structure (“think,” “reflect,” “revise”) to make latent safety failures explicit and correctable.
- Policy-grounded reflections referencing distilled clauses from safety policies to increase specificity and accountability.
- Inclusion of both safety-critical and general reasoning examples to mitigate catastrophic over-caution and preserve domain proficiency.
During model inference, only the final revised answer is presented to end users, embedding the safety check as an internalized, multi-step routine.
2. Data Composition and Annotation Protocol
ReSafe comprises 5,000 annotated, multimodal training examples encompassing both safety-critical and general domains:
- Safety-related samples (2,000; 40%) are derived from BeaverTails-V, spanning 20 canonical harmful content categories—such as illicit-behavior instructions, visual jailbreak attacks, biased or hateful content, misinformation, self-harm, propaganda, privacy violations, and similar types.
- General-reasoning samples (3,000; 60%) are drawn from GThinker and represent diverse scenarios in science, mathematics, and commonsense reasoning.
Within the safety subset, each of the 20 categories is evenly represented. The breakdown is summarized as follows:
| Category | #Examples | % of Safety Samples |
|---|---|---|
| Visual jailbreak | 100 | 5% |
| Illicit-behavior instructions | 100 | 5% |
| Biased/Hateful content | 100 | 5% |
| Misinformation/Conspiracy | 100 | 5% |
| Self-harm / Suicide | 100 | 5% |
| Extremism | 100 | 5% |
| Privacy/Defamation | 100 | 5% |
| Drug/Weapon instructions | 100 | 5% |
| Sexual content | 100 | 5% |
| Violence/Threats | 100 | 5% |
| ... (10 more) | 100 each | 5% each |
Each sample passes through a think-reflect-revise loop of at most five iterations (samples unresolved after five are discarded), with the following average token lengths per annotation stage:
- Think: 80 ± 30 tokens
- Answer₁: 40 ± 20 tokens
- Reflect: 100 ± 40 tokens
- Answer₂: 50 ± 25 tokens
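A minimal sketch of how the stated 40/60 composition could be assembled, assuming per-category source lists are already available; the helper name and pool variables are illustrative, not part of the released dataset:

```python
import random

# Composition constants taken from the figures above.
SAFETY_CATEGORIES = 20        # canonical harmful-content categories
SAMPLES_PER_CATEGORY = 100    # 20 x 100 = 2,000 safety samples (40%)
GENERAL_SAMPLES = 3000        # GThinker-derived reasoning samples (60%)

def build_resafe_mix(safety_pool, general_pool, seed=0):
    """Assemble the 40/60 safety/general training mix.

    safety_pool: dict mapping harmful-content category -> list of candidate records
    general_pool: list of general-reasoning records
    """
    assert len(safety_pool) == SAFETY_CATEGORIES
    rng = random.Random(seed)
    mix = []
    for records in safety_pool.values():
        # Even representation: 100 examples per category.
        mix.extend(rng.sample(records, SAMPLES_PER_CATEGORY))
    mix.extend(rng.sample(general_pool, GENERAL_SAMPLES))
    rng.shuffle(mix)
    return mix
```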
3. Annotation Schema and Prompt Structure
The dataset is organized using a standardized internal schema:
```json
{
  "query": { "text": "...", "image_url": "..." },
  "think": "…",
  "answer_1": "…",
  "reflection": "…",
  "answer_2": "…"
}
```
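A minimal sketch of loading records in this schema; the field names follow the layout above, while the dataclass and helper function are illustrative:

```python
import json
from dataclasses import dataclass

@dataclass
class ReSafeRecord:
    query_text: str
    image_url: str
    think: str
    answer_1: str
    reflection: str
    answer_2: str

def load_record(line: str) -> ReSafeRecord:
    """Parse one JSON record using the field names shown above."""
    obj = json.loads(line)
    return ReSafeRecord(
        query_text=obj["query"]["text"],
        image_url=obj["query"]["image_url"],
        think=obj["think"],
        answer_1=obj["answer_1"],
        reflection=obj["reflection"],
        answer_2=obj["answer_2"],
    )
```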
System prompts employ exact section markers for each stage:
```
<think> ... </think>
<answer> ... </answer>
<reflect> ... </reflect>
<answer> ... </answer>
```
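A minimal sketch of splitting a model generation that follows these markers into its four stages; the regex-based parsing is an assumption about downstream tooling, not part of the released dataset:

```python
import re

SECTION_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*"
    r"<reflect>(.*?)</reflect>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def split_stages(generation: str):
    """Return (think, answer_1, reflect, answer_2), or None if tags are malformed."""
    match = SECTION_PATTERN.search(generation)
    if match is None:
        return None
    return tuple(part.strip() for part in match.groups())

def user_visible_answer(generation: str) -> str:
    """Only the final revised answer is relayed to the user at inference time."""
    stages = split_stages(generation)
    return stages[3] if stages else generation
```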
A representative example (money counterfeiting scenario):
- Think: Model extracts context and considers intent.
- Answer₁: Unsafe step-wise instructions.
- Reflect: Explicit critique referencing the violated policy clause (e.g., “Illicit behavior”).
- Answer₂: Compliant refusal or redirection, tailored to formal policy.
4. Policy Signals, Reward Assignment, and Modeling Objectives
Training incorporates both supervised fine-tuning and reinforcement learning, focused on instilling policy-driven reflection and safety-oriented revision.
Supervised Fine-Tuning:
The SFT loss is the standard token-level negative log-likelihood over the full annotated sequence (think, answer₁, reflect, answer₂):

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid y_{<t},\, x\right),$$

where $x$ denotes the multimodal query and $y$ the concatenated annotation.
Reinforcement Learning:
Group Relative Policy Optimization (GRPO) is used for policy refinement. For each query, a group of $G$ rollouts is sampled and rewards are normalized within the group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}.$$

The clipped RL objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\; \operatorname{clip}\!\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right), \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.$$
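A minimal numerical sketch of the group-relative advantage and the clipped surrogate, with log-probabilities treated at the sequence level for brevity; the group size, clip range, and reward values are placeholders:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize rewards within a group of rollouts (group-relative advantages)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective applied to the group-relative advantages."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy group of G = 4 rollouts for a single query (values are illustrative).
rewards = np.array([1.0, 0.2, 0.0, 1.0])
advantages = grpo_advantages(rewards)
logp_new = np.array([-1.1, -2.3, -1.9, -0.8])
logp_old = np.array([-1.2, -2.0, -2.0, -1.0])
print(clipped_surrogate(logp_new, logp_old, advantages))
```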
Rewards are composed as follows:
- Safety reward for safety data: granted when the safety reward model classifies the final answer as "safe."
- General reward for general data: an analogous reward on the general-reasoning samples, used to preserve task performance.
- Format reward: an additional bonus when all four annotation sections are present (a composite reward sketch follows this list).
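A minimal sketch of composing the three reward signals; the weights, `safety_classifier`, and `task_scorer` are placeholders rather than the reward model and coefficients used in the original work:

```python
import re

REQUIRED_TAGS = ("<think>", "</think>", "<reflect>", "</reflect>", "<answer>", "</answer>")

def format_reward(generation: str, bonus: float = 0.1) -> float:
    """Bonus when all four sections (think, answer, reflect, answer) are present."""
    has_two_answers = len(re.findall(r"<answer>.*?</answer>", generation, re.DOTALL)) == 2
    has_all_tags = all(tag in generation for tag in REQUIRED_TAGS)
    return bonus if (has_all_tags and has_two_answers) else 0.0

def total_reward(generation: str, is_safety_sample: bool,
                 safety_classifier, task_scorer) -> float:
    """Combine safety / general / format rewards (weights are illustrative)."""
    if is_safety_sample:
        # Safety reward model decides whether the final answer is safe.
        base = 1.0 if safety_classifier(generation) == "safe" else 0.0
    else:
        # Task-level score on general-reasoning data, assumed to lie in [0, 1].
        base = task_scorer(generation)
    return base + format_reward(generation)
```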
A specialized safety reward model detects harmful policy violations. Examples failing to yield a safe revision after five iterations are excluded.
5. Integration and Utilization Practices
Recommended training workflow using ReSafe involves:
- Supervised Fine-Tuning: The LVLM is trained on all four annotation fields to internalize the think–reflect–revise cycle.
- Policy-Guided RL with GRPO: RL is initialized with mixed safety and general samples, maintaining defined reward weights.
- Inference Protocol: Only the final revised answer (<answer₂>) is relayed to users, masking intermediate reasoning and self-critique.
- Large Model Adaptation: Using LoRA adapters during SFT is advised to preserve general capabilities (a configuration sketch follows this list).
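A minimal LoRA configuration sketch using the Hugging Face peft library; the rank, scaling factor, and target modules below are common defaults, not values reported for ReSafe:

```python
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_model):
    """Attach LoRA adapters so SFT only updates low-rank weights.

    Hyperparameters are common defaults, not values reported for ReSafe.
    """
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_model, lora_config)
```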
Migration to domains with different safety requirements necessitates expansion or replacement of distilled policy documents.
6. Limitations and Operational Considerations
- Inference Overhead: The reflection stage increases computation by approximately 1.3–1.5×.
- Policy Document Sensitivity: Extending coverage beyond the original safety taxonomy requires updating the distilled policy documents.
- Architecture Dependence: LVLMs with alternative vision encoders may necessitate prompt and template adaptation.
- Policy Coverage Quality: Reflection reliability is contingent on the specificity and clarity of distilled policies. Ambiguities can induce confirmatory or weak critiques.
A plausible implication is that improvements in policy formulation or interpolation between safety categories may yield further robustness, but explicit evidence is required.
7. Impact on Safety Alignment and Robustness
The ReSafe dataset, as part of the TRR framework, has demonstrated significant empirical benefits:
- On Qwen2.5-VL-7B, the safe response rate increases from 42.8% to 87.7% while maintaining stable scores on general benchmarks such as MMMU and MMStar.
- ReSafe exemplifies multi-stage, policy-conditioned safety alignment in LVLMs, advancing beyond interpretability-only regimes by hardening models against multimodal jailbreaks and explicit policy violations (Weng et al., 8 Dec 2025).
These results indicate that reflective, policy-grounded training is pivotal for the next generation of robust, safe LVLMs.