Implicitly guiding LVLMs to accept harmful intent

Determine effective techniques to implicitly guide Large Vision-Language Models (LVLMs) to accept the premise of answering harmful intent-related questions, thereby eliciting an initial response aligned with harmful intent without overt refusal.

Background

The paper investigates the safety snowball effect in Large Vision-Language Models (LVLMs), where an initial non-refusal response can lead to progressively harmful outputs. Prior work on prefilling attacks requires system-level access, motivating methods that achieve initial acceptance from within the LVLM itself, without resorting to overtly harmful prompts.

At the start of Section 3, before introducing the proposed Safety Snowball Agent (SSA), the authors explicitly state that the question of how to implicitly guide LVLMs to accept the premise of answering harmful intent-related questions remains unresolved. SSA is later presented as a framework that leverages the universal reasoning abilities of LVLMs and the snowball effect, but the general methodological question itself is posed as open.

References

"However, the question of how to implicitly guide LVLMs to accept the premise of answering harmful intent-related questions remains unresolved."

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models (Cui et al., arXiv:2411.11496, 18 Nov 2024), Section 3 (Our Approach: Safety Snowball Agent, SSA)