GuardTrace-VL: Vision Safety Auditor
- GuardTrace-VL is a vision-aware safety auditor that examines the full Question–Thinking–Answer (QTA) pipeline to detect harmful or biased intermediate reasoning.
- It leverages a Qwen2.5-VL-3B-Instruct backbone with iterative training stages (SFT, DPO, OGDPO) to refine safety judgments progressively.
- Experiments show an average F1 score of 93.1% across four benchmarks, significantly outperforming traditional input–output guards.
GuardTrace-VL is a vision-aware, trajectory-level safety auditor designed to detect unsafe content in the intermediate reasoning process of multimodal large reasoning models (MLRMs) deployed for vision-language tasks. Unlike prior QA guards that analyze only the user’s input and the model’s final output, GuardTrace-VL scrutinizes the entire Question–Thinking–Answer (QTA) pipeline, enabling early detection of harmful, biased, or policy-violating rationales that can emerge during step-by-step reasoning—even when the final answer appears benign. It achieves a state-of-the-art average F1 score of 93.1% on diverse unsafe reasoning detection benchmarks, outperforming existing multimodal safety defenses by a significant margin (Xiang et al., 26 Nov 2025).
1. Motivation for Trajectory-level Multimodal Safety Auditing
MLRMs such as Qwen3-VL-Thinking and GLM-4.1V-Thinking increasingly generate chains of explicit multimodal rationales, moving beyond mere input-output mapping. However, these intermediate reasoning traces (the “Thinking” stage) may encode hazardous content—including biased inferences, guidance for unsafe actions, or misuse of visual information—even in cases where the final “Answer” is innocuous. Existing multimodal safety guards such as LLaMA-Guard-Vision and GuardReasoner-VL focus exclusively on input–output pairs and cannot interrogate the multi-step reasoning path. Text-only trace auditors like ReasoningShield lack the necessary visual grounding to capture vision-language risks. This creates a gap in robust, fine-grained trajectory-level supervision for vision-language pipelines (Xiang et al., 26 Nov 2025).
2. Architecture and Iterative Supervision Scheme
GuardTrace-VL is built on the Qwen2.5-VL-3B-Instruct vision-language backbone, leveraging joint image-text representation to encode the full QTA triple: the Question (image + text), the Thinking trace (stepwise rationale), and the Answer. The model functions as a classifier that emits a structured “Analysis” and a discrete “Judgment” in $\{0, 1, 2\}$, corresponding to the Safe, Potentially Harmful, and Harmful classes.
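This structured output contract makes the auditor straightforward to wire into a pipeline. Below is a minimal sketch of a QTA container and a parser for the Analysis–Judgment format; the field names, the integer coding, and the exact "Analysis: … Judgment: …" template are illustrative assumptions, not the released schema.

```python
import re
from dataclasses import dataclass

# Assumed integer coding for the three-way judgment scale described above.
JUDGMENTS = {0: "Safe", 1: "Potentially Harmful", 2: "Harmful"}

@dataclass
class QTATrace:
    question: str    # user question text
    image_path: str  # accompanying image (empty for text-only variants)
    thinking: str    # intermediate stepwise rationale
    answer: str      # final model answer

def parse_audit(raw_output: str) -> tuple[str, int]:
    """Split an auditor response into (analysis, judgment id).

    Assumes the model emits 'Analysis: ...' followed by 'Judgment: <0|1|2>';
    the actual GuardTrace-VL template may differ.
    """
    m = re.search(r"Analysis:\s*(.*?)\s*Judgment:\s*([012])", raw_output, re.S)
    if m is None:
        raise ValueError("unparseable auditor output")
    return m.group(1), int(m.group(2))
```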
Training employs a staged, progressive protocol:
- Supervised Fine-Tuning (SFT): The backbone is trained on high-confidence, unanimously labeled examples by minimizing the cross-entropy objective

  $$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}_{\mathrm{SFT}}}\big[\log \pi_\theta(y \mid x)\big],$$

  where $x$ is the QTA input (image and text) and $y$ the target Analysis–Judgment output.
- Direct Preference Optimization (DPO): To incorporate majority-vote labels reflecting safety preference, the method employs the DPO loss

  $$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\mathrm{DPO}}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

  where $\sigma$ is the logistic sigmoid, $\pi_{\mathrm{ref}}$ is the frozen SFT reference policy, $\beta$ controls deviation from it, $y_w$ is the majority-preferred label, and $y_l$ the rejected one (a PyTorch sketch of this loss follows the list).
- Oracle-Guided DPO (OGDPO): The most ambiguous cases, flagged by outright judge disagreement and resolved through expert relabeling, are used for a final preference-optimization round at a lower learning rate.
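As a concrete reference for the preference stage, here is a minimal PyTorch sketch of the canonical DPO objective written above; the function signature and tensor layout are illustrative assumptions rather than the paper's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,  # log pi_theta(y_w | x), shape [B]
             policy_logp_l: torch.Tensor,  # log pi_theta(y_l | x), shape [B]
             ref_logp_w: torch.Tensor,     # log pi_ref(y_w | x), shape [B]
             ref_logp_l: torch.Tensor,     # log pi_ref(y_l | x), shape [B]
             beta: float = 0.1) -> torch.Tensor:
    """Canonical DPO loss: -log sigmoid(beta * (delta_w - delta_l))."""
    # Policy-to-reference log-ratios for preferred and rejected outputs.
    delta_w = policy_logp_w - ref_logp_w
    delta_l = policy_logp_l - ref_logp_l
    # logsigmoid is numerically stabler than log(sigmoid(...)).
    return -F.logsigmoid(beta * (delta_w - delta_l)).mean()
```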
At inference, the model provides a judgment for each QTA trace, supporting multi-tiered safety preferences aligned with operational risk (Xiang et al., 26 Nov 2025).
3. Dataset Construction and Annotation Pipeline
The GuardTrace dataset is a central contribution. It comprises 9,862 training and 2,000 evaluation QTA triples systematically designed for comprehensive coverage of multimodal safety hazards.
Generation Protocol:
- Source text-only S-Eval queries are expanded to multimodal variants: no image, random (irrelevant) image, semantically aligned image, and adversarial “jailbreak” images via FigStep. Additional adversarial samples use HADES and CS-DJ augmentation protocols to probe visual attack vectors.
- Full QTA trajectories are generated using three open-source MLRMs (Qwen3-VL-30B-Thinking, Kimi-VL-Thinking, GLM-4.1V-Thinking), yielding approximately 30,000 raw traces; three closed-source models (GPT-5-mini, Qwen3-VL-Plus, DouBao-seed) furnish further diversity for out-of-distribution test sets.
Voting and Verification (a routing sketch follows this list):
- Three MLLMs (Gemma-3-27B-it, Mistral-3.2-24B, Qwen2.5-VL) each provide an Analysis–Judgment pair.
- Unanimous (3:0) judgments are used for SFT.
- Majority (2:1) judgments feed DPO training.
- Split votes (1:1:1) prompt manual expert review for OGDPO.
- All 2,000 test QTA samples receive ground-truth expert auditing.
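To make the routing rule concrete, the helper below buckets a trace by its three judge votes; the function name and bucket labels are hypothetical.

```python
from collections import Counter

def route_by_votes(votes: list[int]) -> str:
    """Route a QTA trace to a training stage from three judge votes (0/1/2).

    3:0 agreement -> SFT; 2:1 majority -> DPO preference pair;
    full disagreement (1:1:1) -> manual expert review feeding OGDPO.
    """
    assert len(votes) == 3, "exactly three judges vote per trace"
    _, top_count = Counter(votes).most_common(1)[0]
    if top_count == 3:
        return "sft"
    if top_count == 2:
        return "dpo"
    return "ogdpo_manual_review"

# Two judges say Harmful (2), one says Safe (0) -> DPO preference pair.
assert route_by_votes([2, 2, 0]) == "dpo"
assert route_by_votes([0, 1, 2]) == "ogdpo_manual_review"
```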
Statistics:
| Split | Size | Class Ratio (Safe : Potentially Harmful : Harmful) |
|---|---|---|
| GuardTrace-Train | 9,862 | 4.4 : 2.4 : 3.2 |
| GuardTrace-Test | 2,000 | 4.6 : 1.4 : 4.0 |
This pipeline ensures a diverse, adversarially robust, and expert-verified dataset for vision-language reasoning safety (Xiang et al., 26 Nov 2025).
4. Training Curriculum and Model Optimization
The three-stage curriculum underpinning GuardTrace-VL’s optimization successively incorporates data of increasing difficulty and ambiguity:
- Stage 1 – SFT: Trains on 4,625 high-confidence QTA triples (unanimous labels), tuning the entire Qwen2.5-VL-3B backbone (batch size 16, 3 epochs).
- Stage 2 – DPO: Introduces preference-pair training (4,950 samples) with LoRA adaptation (rank 32, batch size 32, 2 epochs).
- Stage 3 – OGDPO: Focuses on 1,013 challenging or contentious samples (hard negatives and manually annotated cases) at a reduced learning rate, further sharpening the model’s capacity for nuanced safety discrimination (LoRA, batch size 32).
This sequential progression, from easy to highly ambiguous supervision, enables GuardTrace-VL to capture both clear-cut and subtle multimodal safety risks, a distinction unaddressed by prior approaches that rely solely on SFT or simple preference learning (Xiang et al., 26 Nov 2025).
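For quick reference, the reported schedule can be restated as a plain configuration structure; this sketch only collects the hyperparameters given above, with `None` marking per-stage values not stated in this summary.

```python
# Stage schedule for GuardTrace-VL, restated from the curriculum above.
# lr=None marks learning rates not given here (OGDPO is reported to use
# a lower rate than the earlier stages).
CURRICULUM = [
    {"stage": "SFT",   "samples": 4625, "tuning": "full-parameter",
     "batch_size": 16, "epochs": 3,    "lr": None},
    {"stage": "DPO",   "samples": 4950, "tuning": "LoRA (rank 32)",
     "batch_size": 32, "epochs": 2,    "lr": None},
    {"stage": "OGDPO", "samples": 1013, "tuning": "LoRA (rank 32)",
     "batch_size": 32, "epochs": None, "lr": None},  # epochs unreported here
]

for cfg in CURRICULUM:
    print(f"{cfg['stage']:>5}: {cfg['samples']:>5} samples, {cfg['tuning']}")
```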
5. Evaluation and Benchmark Results
Evaluation Protocols:
- In-domain tests: S-Eval-VL (600 samples), HADES-Eval (400 samples).
- Out-of-Distribution (OOD) tests: MM-Eval (500 conventional samples), MMJ-Eval (500 adversarial-jailbreak samples).
Baselines Assessed:
- General-purpose moderation (OpenAI API).
- Prompted MLLMs (GPT-5, Qwen3-VL-Plus, Qwen2.5-VL-3B/32B).
- Dedicated multimodal guards (LLaMA3-Guard-Vision-11B, LLaMA4-Guard-12B, GuardReasoner-VL-7B).
Performance Highlights:
- GuardTrace-VL-3B posts F1 scores of 93.33%, 95.88%, 91.31%, and 92.39% across the four splits, for a sample-weighted average of 93.1% (see the arithmetic check after the table below).
- This signifies an absolute improvement of +13.5 percentage points over LLaMA4-Guard-12B (previous best at ~79.55 F1) and +4.2 points over closed-source GPT-5 on the same tasks.
- Qualitative analysis (main text figures and supplement) shows GuardTrace-VL detecting risks inside reasoning steps, such as lock-bypass instructions or inline malicious code snippets that remain invisible to answer-only guards.
| Model | Avg. F1 (%) | Notable Characteristics |
|---|---|---|
| GuardTrace-VL-3B | 93.1 | Vision-aware, QTA trajectory-level auditor |
| LLaMA4-Guard-12B | 79.55 | Input–output multimodal guard |
| GPT-5 | 88.9 | Closed-source, prompt-based vision audit |
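A quick arithmetic check shows the reported 93.1% is consistent with weighting each split's F1 by its sample count (600/400/500/500), whereas an unweighted mean would give roughly 93.2%:

```python
# Sanity check on the reported average F1 across the four evaluation splits.
f1 = {"S-Eval-VL": 93.33, "HADES-Eval": 95.88, "MM-Eval": 91.31, "MMJ-Eval": 92.39}
n  = {"S-Eval-VL": 600,   "HADES-Eval": 400,   "MM-Eval": 500,   "MMJ-Eval": 500}

weighted   = sum(f1[k] * n[k] for k in f1) / sum(n.values())
unweighted = sum(f1.values()) / len(f1)
print(f"weighted: {weighted:.2f}, unweighted: {unweighted:.2f}")
# -> weighted: 93.10, unweighted: 93.23
```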
6. Ablation Studies and Analytical Insights
Experimental ablations confirm core design hypotheses:
- Vision grounding: Replacing actual images with purely generated captions or using text-only guards leads to a 4–6 point drop in F1, confirming the necessity of true image-text interactions.
- Progressive supervision: Adding the DPO and OGDPO stages yields successive 1–2 point F1 gains over SFT alone, with final-stage scores of ~92–96% F1 versus 40–57% for unrefined baselines.
- Structured annotation: Omitting structured “Analysis–Judgment” outputs or in-context exemplars reduces F1 from 82.8% to the 59–75% range.
Key limitations are noted. GuardTrace-VL’s training corpus currently covers approximately 12,000 samples, and new domain-specific hazards may require further data extension. Anticipated next steps include system-level QTA monitoring integration, continuous safety alignment, and backbone scaling (Xiang et al., 26 Nov 2025).
7. Implementation Protocol and Resource Requirements
- Backbone: Qwen2.5-VL-3B-Instruct.
- Compute platform: 8× NVIDIA RTX A6000 48GB GPUs.
- Stage-specific training parameters: Batch sizes of 16–32, warm-up fraction 0.1, per-stage learning rates (lowest for OGDPO).
- Fine-tuning regimen: Full-parameter optimization for SFT; LoRA adaptation (rank 32) for DPO/OGDPO.
- Inference: Greedy decoding (equivalently, temperature 0) with structured output parsing (see the sketch after this list).
- Release: Code and model checkpoint to be issued with restricted public access; GuardTrace dataset distribution contingent on ethics review.
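The sketch below shows greedy decoding with the Qwen2.5-VL-3B-Instruct backbone via Hugging Face transformers; the audit prompt, placeholder paths, and output template are assumptions, since GuardTrace-VL's own code is not yet public.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Hypothetical audit prompt over one QTA trace (paths/template illustrative).
messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/question_image.jpg"},
    {"type": "text", "text": (
        "Audit this Question-Thinking-Answer trace for safety. Respond with "
        "'Analysis: ...' then 'Judgment: <0|1|2>'.\n"
        "Question: ...\nThinking: ...\nAnswer: ...")},
]}]

text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

# Greedy decoding, matching the inference protocol above.
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
reply = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                               skip_special_tokens=True)[0]
print(reply)
```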
A plausible implication is that deployment of such infrastructure provides a foundation for robust vision-language system alignment in risk-sensitive domains (Xiang et al., 26 Nov 2025).