DUAL-Bench: Multimodal Safety Benchmark
- DUAL-Bench is a multimodal benchmark that quantifies safe completion and over-refusal by simulating dual-use scenarios with hazardous visuals.
- It employs a detailed hazard taxonomy spanning 12 categories to assess context-sensitive multimodal risk and appropriate warning responses.
- Experimental results reveal significant performance gaps among VLMs, emphasizing the need for advanced, context-aware model alignment strategies.
DUAL-Bench is a multimodal benchmark designed to systematically measure over-refusal and robustness in vision-language models (VLMs), focusing on the trade-off between safety and usefulness. Over-refusal—where a model declines benign requests due to excessively conservative safety mechanisms—represents a growing concern in practical deployments. DUAL-Bench introduces a suite of scenarios that more realistically model complex multimodal safety dilemmas, including dual-use cases where benign instructions are paired with potentially hazardous visual content. The benchmark evaluates not only whether models refuse unsafe requests but also whether they can achieve “safe completion”: fulfilling innocuous instruction elements while explicitly warning about the presence of harmful image content.
1. Motivation and Conceptual Foundations
DUAL-Bench addresses the lack of systematic evaluation frameworks for over-refusal phenomena in the context of visual content. Existing benchmarks for VLM safety have predominantly targeted explicit refusal behavior or generic robustness, neglecting the nuanced interplay between instructions and associated imagery. DUAL-Bench identifies scenarios where optimal alignment demands neither blanket refusal nor unsafe completion but context-sensitive reasoning: models should carry out the benign aspects of an input while explicitly signaling and mitigating risk from harmful elements. This “safe completion” paradigm is the diagnostic target for benchmark development.
2. Benchmark Structure and Hazard Taxonomy
The benchmark consists of test instances sampled from 12 hazard categories, each reflecting a distinct modality of visual risk, spanning violence, explicit content, self-harm, illegal activity, and more. Each scenario is designed so that the instruction itself is benign while the input image contains or alludes to hazardous content. A dual-use dynamic is central: models must engage in multimodal reasoning beyond rule-based refusal, evaluating both the semantic intent of the instruction and the latent risk encoded in the associated imagery. Robustness is further probed via semantics-preserving perturbations (alterations that do not change the risk profile but stress the model’s decision boundaries). Each test case is annotated with expected safe actions, permissible completions, and explicit warning requirements.
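A test case of this kind can be modeled as a small annotated record. The sketch below is a hypothetical schema, not the benchmark's actual data format: all field names, the `DualUseInstance` class, and the partial `HazardCategory` enumeration are illustrative assumptions based on the annotations described above.

```python
from dataclasses import dataclass, field
from enum import Enum

class HazardCategory(Enum):
    # Four of the 12 categories named in the text; the full taxonomy
    # is not enumerated here.
    VIOLENCE = "violence"
    EXPLICIT_CONTENT = "explicit_content"
    SELF_HARM = "self_harm"
    ILLEGAL_ACTIVITY = "illegal_activity"

@dataclass
class DualUseInstance:
    """Hypothetical sketch of one DUAL-Bench-style test case:
    a benign instruction paired with a hazardous image."""
    instruction: str                 # benign text instruction
    image_path: str                  # image containing or alluding to a hazard
    category: HazardCategory         # annotated hazard category
    expected_safe_action: str        # e.g. "complete_with_warning"
    permissible_completion: str      # reference benign output
    requires_warning: bool = True    # is an explicit hazard warning expected?
    perturbations: list[str] = field(default_factory=list)  # semantics-preserving variants
```

Keeping the annotations (expected action, permissible completion, warning requirement) alongside the inputs lets an evaluation harness score refusal, completion, and warning behavior from the same record.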
3. Evaluation Protocol and Safe Completion Metric
Models are scored using the “safe completion” rate—defined as the proportion of instances where the model executes benign instructions correctly and issues explicit warnings regarding visual hazards. Over-refusal is detected by examining if the model declines tasks inappropriately (i.e., when a safe partial action or response is viable). The evaluation protocol involves running each VLM on the full test suite, recording completion behavior, refusal explanations, and explicit warnings. This setup enables the assessment of fine-grained alignment properties: distinguishing between indiscriminate refusal, unsafe execution, and nuanced, context-aware handling of multimodal risk.
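The two headline quantities above reduce to simple proportions over per-instance judgments. The following is a minimal sketch under assumed field names (`completed_benign`, `issued_warning`, `refused`, `safe_action_viable`); the benchmark's actual scoring pipeline and judgment labels are not specified in this section.

```python
def safe_completion_rate(results: list[dict]) -> float:
    """Fraction of instances where the model both executed the benign
    instruction correctly and issued an explicit warning about the
    visual hazard (hypothetical boolean fields per instance)."""
    safe = sum(r["completed_benign"] and r["issued_warning"] for r in results)
    return safe / len(results)

def over_refusal_rate(results: list[dict]) -> float:
    """Fraction of instances where the model refused even though a safe
    partial action or response was viable."""
    refused = sum(r["refused"] and r["safe_action_viable"] for r in results)
    return refused / len(results)
```

Instances that are neither safe completions nor over-refusals fall into the remaining buckets the protocol distinguishes: unsafe execution, or justified refusal where no safe action existed.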
4. Experimental Assessment Across VLMs
Eighteen vision-language models are evaluated, including high-profile systems such as GPT-5-Nano, GPT-5, and the Qwen series. Safe completion rates vary substantially by model:
| Model family | Safe completion (%) |
|---|---|
| GPT-5-Nano | 12.9 |
| GPT-5 (average) | 7.9 |
| Qwen models | 3.9 |
These results point to significant deficiencies in nuanced alignment: most models exhibit either excessive caution (over-refusal) or unsafe task execution, suggesting poor discrimination of dual-use multimodal risk. Robustness under semantics-preserving visual perturbations is generally low; even minor changes often flip model behavior from refusal to unsafe completion or vice versa, indicating brittle risk-detection mechanisms.
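The brittleness observation can be quantified as a behavior-flip rate: the fraction of instances whose labeled behavior changes under a semantics-preserving perturbation. This is a plausible sketch of such a measure, not the benchmark's documented robustness metric; the behavior labels are assumed.

```python
def behavior_flip_rate(base: list[str], perturbed: list[str]) -> float:
    """Fraction of paired instances whose behavior label (e.g. 'refuse',
    'safe_complete', 'unsafe_complete') differs between the original
    image and its semantics-preserving perturbation. Since the
    perturbation does not change the risk profile, any flip indicates
    a brittle decision boundary."""
    if len(base) != len(perturbed):
        raise ValueError("base and perturbed runs must be aligned")
    flips = sum(b != p for b, p in zip(base, perturbed))
    return flips / len(base)
```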
5. Implications for Model Alignment Strategies
Findings from DUAL-Bench underscore the limitations of current vision-language alignment techniques. Safety strategies that rely on blanket visual refusal or simplistic content warnings fail to support the safe completion paradigm required in real-world applications. The benchmark reveals that achieving both safety and utility necessitates more sophisticated multimodal semantic alignment; models must dynamically integrate instruction intent with visual hazard cues. This suggests that future research should focus on context-dependent policy architectures and training schemes capable of conditioning model refusals and completions on fine-grained, scenario-specific affordances.
6. Comparison With Prior Safety Benchmarks
DUAL-Bench is distinguished by its explicit targeting of visual over-refusal and dual-use multimodal scenarios in contrast to benchmarks evaluating only text-based refusal or generic robustness. Where previous work may flag unsafe output or measure outright refusal rates, DUAL-Bench quantifies the nuanced trade-off between utility and precaution through its safe completion metric and rich hazard taxonomy. The paradigm shift introduced by DUAL-Bench is its insistence on context-aware, multimodal performance rather than monolithic safety constraints.
7. Future Directions and Open Challenges
The gaps highlighted by benchmark performance point to several research priorities:
- Developing multimodal models that natively disentangle benign and hazardous instruction/image components, facilitating safe partial completions
- Engineering training curricula utilizing dual-use examples to minimize both over-refusal and unsafe execution
- Extending robustness studies to cover not only semantic perturbations but cross-modal adversarial inputs and naturally ambiguous scenarios
A plausible implication is that, as model developers adopt DUAL-Bench or its descendants, evaluation protocols will shift to embrace safe completion as the standard of nuanced, performant model alignment in vision-language deployment contexts.