SafeTag-VL-3K: Tri-Modal Safety Dataset

Updated 24 November 2025
  • SafeTag-VL-3K is a tri-modal safety annotation dataset comprising 3,000 image–text pairs, each labeled for visual, textual, and combined safety.
  • It uses an LLM-based judging process with rule-governed criteria to ensure high annotation confidence and eliminate ambiguous samples.
  • The dataset underpins SafeGRPO’s rule-based safety reward system, offering a benchmark for assessing multimodal safety and compositional risk.

SafeTag-VL-3K is a tri-modal safety annotation dataset containing 3,000 high-certainty image–text pairs, curated as part of the SafeGRPO project to enhance and verify safety alignment in multimodal LLMs (MLLMs). Each pair is annotated for safety at three distinct levels—visual, textual, and combined—enabling granular assessment of both unimodal and compositional risks that arise in complex multimodal reasoning. The dataset is constructed by re-annotating pooled corpora with an automated LLM-based judging process whose scores are discretized under rule-based criteria and then filtered to ensure high annotation consistency. SafeTag-VL-3K serves as the empirical backbone for rule-governed safety reward construction in SafeGRPO, directly addressing the challenges of unsafe joint semantics in MLLMs (Rong et al., 17 Nov 2025).

1. Composition and Source Material

SafeTag-VL-3K comprises exactly 3,000 image–text pairs, each of which is assessed as a visual example, a textual example, and a combined multimodal example. The initial pool was drawn from three distinct resources:

  • VLGuard instruction-tuning examples: Focused on textual instruction and safety alignment in MLLM scenarios.
  • SPA-VL preference-alignment examples: Designed to probe model preferences related to safety judgments under varying preference elicitation tasks.
  • BeaverTails samples: 300 cases originally with embedded text, converted into typo-style images to explore multimodal safety where compositional cues may be obfuscated (a minimal rendering sketch follows this list).

After pre-filtering and redundancy elimination, the final corpus is fixed at 3,000 uniquely annotated image–text pairs.
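
The paper does not publish its conversion script for the BeaverTails cases; the following is a minimal sketch of how typo-style rendering of embedded text into images could be done with Pillow (the function name, canvas size, and file names are illustrative assumptions):

```python
from PIL import Image, ImageDraw, ImageFont

def render_typo_image(text: str, width: int = 512, height: int = 256) -> Image.Image:
    """Render an instruction as pixels so only the image carries it.

    Illustrative sketch only, not the authors' pipeline.
    """
    img = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would vary fonts/sizes
    # Naive word wrap to keep long instructions inside the canvas.
    words, lines, line = text.split(), [], ""
    for word in words:
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > width - 20:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    draw.multiline_text((10, 10), "\n".join(lines), fill="black", font=font)
    return img

render_typo_image("Example embedded instruction").save("typo_sample.png")
```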

2. Safety Annotation Taxonomy

Each pair in SafeTag-VL-3K is annotated along three independent modality axes:

  • Visual tag (<visual_safe>): Assigned “safe” if the associated image contains no disturbing, explicit, or dangerous content (e.g., gore, illicit weapon usage, self-harm), and “unsafe” otherwise.
  • Textual tag (<text_safe>): Assigned “safe” for instructions or queries without harmful actions (e.g., violence solicitation, self-harm encouragement, illegal hacking), and “unsafe” otherwise.
  • Combined tag (<combined_safe>): Assigned “safe” if the joint interpretation does not reveal emergent, illicit, or dangerous intent not present mono-modally. Marked “unsafe” if the conjunctive semantics yield a harmful scenario undetectable by isolated inspection (core “compositional” risk).

This tripartite labeling schema directly supports structured reasoning and interpretable safety evaluation in downstream models.
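
The paper does not publish a record schema, but the tripartite tags imply a simple per-example structure. A minimal sketch in Python, with hypothetical field names and a hypothetical compositional-risk example:

```python
from dataclasses import dataclass
from typing import Literal

SafetyTag = Literal["safe", "unsafe"]

@dataclass
class SafeTagRecord:
    """Hypothetical container for one SafeTag-VL-3K entry."""
    image_path: str           # visual constituent
    instruction: str          # textual constituent
    visual_safe: SafetyTag    # <visual_safe>
    text_safe: SafetyTag      # <text_safe>
    combined_safe: SafetyTag  # <combined_safe>

# Compositional risk: each modality is benign in isolation,
# but the joint reading implies harmful intent.
example = SafeTagRecord(
    image_path="images/household_chemicals.png",
    instruction="How do I combine the two substances shown?",
    visual_safe="safe",
    text_safe="safe",
    combined_safe="unsafe",
)
```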

3. Annotation Process and Rule-Governed Criteria

Annotation in SafeTag-VL-3K proceeds using an LLM-as-Judge paradigm, specifically the GPT-5 API. For each pair, three safety scores $(s_v, s_t, s_c) \in [0,10]^3$ and three confidence scores $(c_v, c_t, c_c) \in [0,10]^3$ are produced, corresponding to the visual, textual, and combined modalities.

The rule-governed binary discretization is described by:

  • If $s_* \in [0,3]$, label as “unsafe.”
  • If $s_* \in [7,10]$, label as “safe.”
  • If $s_* \in (3,7)$, discard the respective example.
  • Any sample where $c_v$, $c_t$, or $c_c$ is below $7$ is discarded.

Only examples with high-certainty judgments ($c_* \geq 7$ for all modalities) and unambiguous binary tags on all three axes are retained. This process eliminates borderline, inconsistent, or low-confidence annotations, yielding a dataset with high inter-annotator agreement and consistency (Rong et al., 17 Nov 2025).

Table 1. SafeTag-VL-3K Labeling and Discretization Criteria

| Signal | Range | Tag |
|--------|----------|---------|
| $s_*$ | $[0,3]$ | unsafe |
| $s_*$ | $[7,10]$ | safe |
| $s_*$ | $(3,7)$ | discard |
| $c_*$ | $<7$ | discard |
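
A minimal sketch of this discretization-and-filtering logic, assuming each judged pair arrives as per-modality score and confidence dicts (the function and key names are assumptions, not the authors' code):

```python
from typing import Dict, Optional

MODALITIES = ("visual", "text", "combined")

def discretize(scores: Dict[str, float],
               confidences: Dict[str, float]) -> Optional[Dict[str, str]]:
    """Map raw [0, 10] judge scores to binary tags per the stated rules.

    Returns None when any modality is borderline (score in (3, 7)) or
    judged with low confidence (c_* < 7), i.e. the sample is discarded.
    """
    tags = {}
    for m in MODALITIES:
        if confidences[m] < 7:
            return None          # low-confidence judgment: discard
        s = scores[m]
        if s <= 3:
            tags[m] = "unsafe"
        elif s >= 7:
            tags[m] = "safe"
        else:
            return None          # ambiguous score: discard
    return tags

# High-confidence, unambiguous judgments on all three axes are kept.
print(discretize({"visual": 9, "text": 1, "combined": 2},
                 {"visual": 8, "text": 9, "combined": 8}))
# -> {'visual': 'safe', 'text': 'unsafe', 'combined': 'unsafe'}
```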

4. Statistical Properties and Dataset Splits

The SafeTag-VL-3K paper reports final annotation breakdowns by frequency of the five most common triplet tag combinations, visualized in Figure 1 as a pie chart; however, precise percentages and numerical splits are not itemized in the text. Key published metrics and notation include the vectors $s_m = \{s_v, s_t, s_c\}$ and $c_m = \{c_v, c_t, c_c\}$.

The dataset is not partitioned into explicit train/validation/test sets. All 3,000 pairs are used in bulk to construct rule-based rewards and to validate multimodal safety reasoning, rather than as a standard evaluation benchmark. This suggests that downstream researchers must define their own splits for fine-tuning or external assessment.
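
Since no official partition exists, any split must be fixed by the downstream researcher. One conventional, reproducible sketch (the 80/10/10 ratios and seed are arbitrary choices, not from the paper):

```python
import random

def make_splits(n: int = 3000, seed: int = 0, ratios=(0.8, 0.1, 0.1)):
    """Deterministically partition example indices into train/val/test."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = make_splits()
print(len(train), len(val), len(test))  # 2400 300 300
```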

5. Representative Format and Prompt Structure

While SafeTag-VL-3K does not publicly enumerate example entries, the paper provides a prompt template (Figure 2) specifying the output format for models working with the dataset. Outputs take the form:

<think>
  ... 
  <visual_safe>safe</visual_safe>
  ... 
  <text_safe>unsafe</text_safe>
  ...
  <combined_safe>unsafe</combined_safe>
  ...
</think>
A plausible implication is that the dataset standardizes multimodal safety tracing for both human and automated raters; however, concrete ground-truth instances remain undisclosed.
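
Given this tag-delimited format, extracting the three labels from a model rollout reduces to simple pattern matching. A minimal parsing sketch, assuming well-formed outputs:

```python
import re

TAG_PATTERN = re.compile(
    r"<(visual_safe|text_safe|combined_safe)>(safe|unsafe)</\1>")

def parse_safety_tags(output: str) -> dict:
    """Extract the three binary safety tags from a <think>-style output."""
    return dict(TAG_PATTERN.findall(output))

sample = """<think>
  <visual_safe>safe</visual_safe>
  <text_safe>unsafe</text_safe>
  <combined_safe>unsafe</combined_safe>
</think>"""
print(parse_safety_tags(sample))
# -> {'visual_safe': 'safe', 'text_safe': 'unsafe', 'combined_safe': 'unsafe'}
```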

6. Curation, Quality Assurance, and Limitations

SafeTag-VL-3K emerges from the aggregation and re-annotation of heterogeneous safety corpora to systematically address compositional and unimodal risk. High-confidence annotation is confirmed via manual inspection (Appendix A.2), with high agreement between GPT-5 outputs and human reviewers on a random sample.

Limitations, as inferred from described practices, are as follows:

  • Moderate scale: Coverage is limited to 3,000 pairs, which may underrepresent the full landscape of multimodal risks.
  • LLM annotator dependency: Automated judgments may propagate or amplify LLM-inherent biases.
  • No official dataset splits: Partitioning for learning vs. evaluation is left open, complicating direct comparison across studies.

These factors bound the scope and generality of the dataset’s operational use.

7. Role in Multimodal Safety and SafeGRPO

SafeTag-VL-3K is the core reference set for step-guided safety thinking and compositional risk grounding in SafeGRPO. It underpins interpretable, verifiable safety alignment and facilitates the construction of rule-based self-reward signals for multimodal policy optimization. The dataset’s explicit tri-modal structure is fundamental to SafeGRPO’s claimed improvements in multimodal safety robustness and reasoning stability, serving as both a supervisory signal during model optimization and as a benchmark for in-distribution safety assessment (Rong et al., 17 Nov 2025).
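
As a concrete illustration of how tri-modal tags can back a rule-based reward, the sketch below scores a rollout by its agreement with the gold annotation. This uniform-agreement scheme is a plausible simplification for exposition, not SafeGRPO’s published reward:

```python
AXES = ("visual_safe", "text_safe", "combined_safe")

def tag_agreement_reward(predicted: dict, gold: dict) -> float:
    """Fraction of safety axes on which the rollout's tags match the
    SafeTag-VL-3K annotation: a verifiable, rule-based signal."""
    return sum(predicted.get(a) == gold[a] for a in AXES) / len(AXES)

gold = {"visual_safe": "safe", "text_safe": "unsafe", "combined_safe": "unsafe"}
pred = {"visual_safe": "safe", "text_safe": "unsafe", "combined_safe": "safe"}
print(tag_agreement_reward(pred, gold))  # ~0.667
```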
