- The paper introduces a multi-metric diagnostic framework that reveals an instability band where minor perturbations trigger increased output uncertainty and attack success.
- The attack methodology decomposes harmful queries into benign semantic probes, synthesizing distributed responses to bypass standard safety defenses.
- Empirical results show that Furina outperforms single- and multi-turn baselines by significantly increasing attack success rates across both LLM and MLLM platforms.
Fragmented Uncertainty-Driven Refusal Instability in Alignment: A Technical Analysis of "Furina" (2605.26158)
Motivation and Theoretical Framework
"Furina: Fragmented Uncertainty-Driven Refusal Instability Attack" challenges the prevailing assumption in the safety alignment of LLMs and MLLMs that refusal boundaries are tightly binary and deterministic. Through empirical and mechanistic analysis, the authors demonstrate that safety alignment is governed by a region of instability—a transitional band in the input space where small perturbations yield stochastic, non-deterministic refusal decisions. This work conceptualizes the refusal mechanism as not only a function of internal model activations but as a complex interaction between output uncertainty, safety signals, and context-dependent factors.
The paper formulates a multi-metric diagnostic framework for analyzing safety instability. It combines output-level metrics (token-level entropy, semantic entropy, and attack success rate) with internal metrics (hidden-state and activation-subspace probes). Notably, a systematic "decoupling" is identified: in the instability regime, elevated output uncertainty and increased attack success rates coincide with diminished internal safety signals. This is contrary to what would be expected if refusal were mediated by robust internal detection mechanisms.
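The two output-level uncertainty metrics can be illustrated with a minimal sketch. The paper's exact definitions are not reproduced in this summary, so the version below assumes the common formulation: token-level entropy is the mean Shannon entropy of the per-step next-token distributions, and semantic entropy is the entropy of the empirical distribution over meaning clusters of sampled responses. The function names and the assumption that cluster labels come from some external clustering step are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def token_entropy(step_distributions):
    """Mean Shannon entropy (in nats) over per-step next-token distributions.

    step_distributions: list of probability lists, one per decoding step.
    This averaging convention is an assumption, not the paper's definition.
    """
    if not step_distributions:
        raise ValueError("need at least one decoding step")
    total = 0.0
    for probs in step_distributions:
        # Zero-probability tokens contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(step_distributions)


def semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_labels: one label per sampled response, produced by a
    hypothetical external step that groups responses with the same meaning.
    """
    if not cluster_labels:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under this sketch, a model that always produces the same kind of response has zero semantic entropy, while an even split between two response types (for example refusing and complying) gives the maximum for two clusters, log 2.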
Empirical Findings on Instability and Jailbreak Mechanisms
The authors provide extensive empirical validation of the instability band hypothesis across LLM and MLLM architectures. Controlled experiments with semantic rewrite ladders reveal that refusal behavior does not transition abruptly from compliance to refusal; instead, attack success rates (ASR) rise gradually, supporting the existence of an intermediate instability region. Output entropy measurements (both token-level and semantic) consistently increase as queries approach this boundary.
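The ladder analysis above can be sketched as a simple evaluation computation: estimate ASR at each rung from repeated trials, then mark the rungs whose ASR is neither near 0 nor near 1. The thresholds `low` and `high` and both function names are hypothetical choices made for illustration; the paper's own criterion for the instability band is not given in this summary.

```python
def asr_by_rung(outcomes):
    """Attack success rate per ladder rung.

    outcomes: list of lists of booleans, one inner list per rung, where
    True means a judge labeled that trial as a successful attack.
    """
    rates = []
    for trials in outcomes:
        if not trials:
            raise ValueError("every rung needs at least one trial")
        rates.append(sum(trials) / len(trials))
    return rates


def instability_band(asr, low=0.1, high=0.9):
    """Indices of rungs whose ASR lies strictly between low and high.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [i for i, rate in enumerate(asr) if low < rate < high]
```

A sharp binary boundary would give an empty band (ASR jumping from near 0 to near 1 between adjacent rungs), whereas the gradual rise reported by the authors would show up as several consecutive rungs inside the band.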
A key technical observation is the pervasiveness of this pattern across divergent jailbreak strategies, including optimization-based suffixes, multi-turn context attacks, role-play, and multimodal perturbations. Irrespective of surface form, these methods converge on a common diagnostic signature: increased ASR, elevated output uncertainty (token-level entropy H_tok and semantic entropy H_sem), and reduced projected internal safety activation (as probed by the HiddenDetect [Jiang et al., 2025] and Refusal Direction [Arditi et al., 2024] metrics). The authors take this as support for the claim that uncertainty amplification, rather than any specific adversarial trigger form, is the primary structural mechanism behind effective jailbreaks.
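The "projected internal safety activation" can be pictured as the scalar projection of a hidden state onto a direction associated with refusal. The sketch below assumes such a direction vector has already been estimated by some external procedure; how HiddenDetect and Refusal Direction actually compute their scores is not detailed in this summary, and `project_onto_direction` is a hypothetical helper rather than an API from either work.

```python
import math


def project_onto_direction(hidden, direction):
    """Scalar projection of a hidden-state vector onto a direction.

    hidden, direction: equal-length sequences of floats. The direction is
    normalized here, so only its orientation matters.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have equal length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm
```

In this picture, the decoupling reported by the authors corresponds to inputs whose projection shrinks while output entropy and ASR grow.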
Experiments extend to MLLMs, employing typographic and diffusion-based scene perturbations. A similar instability signature is detected: CLIP similarity to unsafe content increases gradually, but model safety judgments show abrupt, non-monotonic transitions, further confirming the fragility and context-dependence of current safety alignment.
Furina Attack Methodology
The core contribution, Furina, operationalizes these insights into a pragmatic, model-agnostic jailbreak framework targeting both LLMs and MLLMs. The attack is executed through the following stages:
- Task Decomposition and Semantic Fragmentation: Harmful queries are decomposed by an auxiliary model into K safety-neutral probes with controlled semantic drift, and mapped to a metaphorical scenario serving as a scene anchor. This distributes harmful intent over multiple contextually benign components.
- Optional Visual Realization: For MLLMs, the scenario anchor can be rendered as typographic text or synthesized with diffusion models, exploiting weaknesses in joint vision-language safety boundaries.
- Distributed Probing and Synthesis: The target model is queried on all probes (and, in MLLM settings, on the visual input), collecting evidence fragments. An auxiliary model synthesizes the final answer for safety evaluation, aggregating distributed knowledge that individually appears innocuous.
This architectural fragmentation intentionally amplifies output uncertainty and drives model state into the instability band, maximizing attack transferability and defeating model-specific defenses.
Comparative Evaluation and Results
Extensive benchmarking shows that Furina outperforms strong single-turn baselines (e.g., AmpleGCG [Liao & Sun, 2024]) and multi-turn baselines (e.g., ActorBreaker [Ren et al., 2025]) on HarmBench (text) and MM-SafetyBench (multimodal). On HarmBench, Furina increases ASR to 94.0% on GPT-4o-mini, exceeding ActorBreaker by over 10%, and achieves competitive or superior results on commercial MLLMs, outperforming MML and JailBound in most settings. Detailed ablations confirm the importance of semantic probes and auxiliary synthesis in achieving high ASR.
Crucially, Furina exploits the failure modes of classical defenses. Standard input filters (LlamaGuard, perplexity-based screening) have minimal efficacy against Furina due to benign-appearing, decomposed probes. The cumulative harmfulness emerges only upon synthesis, rendering traditional single-turn and local anomaly detection approaches insufficient.
Implications and Theoretical Significance
The findings in "Furina" bear significant implications for future safety evaluation and defense architectures:
- Robustness Gaps in Alignment: Current alignment strategies lack robustness to adversarial context fragmentation and distributional shift. The decoupling between internal safety signals and behavioral refusals highlights the need for more holistic, state-dependent defenses that aggregate evidence across distributed context.
- Diagnostic Mechanisms for Safety: The proposed multi-metric framework provides a transferable diagnostic tool for identifying safety instability across model families and modalities. Its application reveals underlying network vulnerabilities not accessible through single-metric or output-only evaluation.
- Generalization to Multimodal Foundation Models: The transferability of Furina's mechanistic insights and attack protocol to MLLMs raises broader concerns about multimodal alignment, especially as models become increasingly agentic and context-sensitive.
- Defensive Directions: Effective defenses may require cross-turn aggregation, intent inference over fragmented queries, and robust modeling of uncertainty bands rather than thresholding internal activations.
- Future Research: The instability-band formalism still needs to be operationalized so that individual inputs can be assigned to the band at a fine-grained level. There is also potential for white-box optimization of instability-related signals as explicit objectives for red teaming or robustness training.
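The cross-turn aggregation idea from the defensive directions above can be sketched as follows. This is not a defense proposed or evaluated in the paper; it is a minimal illustration assuming a hypothetical per-turn risk classifier that returns a score in [0, 1], combined across a session with a noisy-OR rule (which treats turns as independent). The class name and threshold are illustrative.

```python
class SessionRiskAggregator:
    """Accumulates per-turn risk scores over a session (noisy-OR rule).

    Each score is assumed to come from a hypothetical per-turn classifier
    and to lie in [0, 1]. Turns are treated as independent, which is a
    simplifying assumption.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        # Probability that none of the turns seen so far is harmful.
        self._survival = 1.0

    def add_turn(self, risk):
        """Record one turn's risk score and return the combined risk."""
        if not 0.0 <= risk <= 1.0:
            raise ValueError("risk must be in [0, 1]")
        self._survival *= 1.0 - risk
        return self.combined_risk

    @property
    def combined_risk(self):
        """Probability that at least one turn is harmful (noisy-OR)."""
        return 1.0 - self._survival

    def should_flag(self):
        """True once the combined session risk reaches the threshold."""
        return self.combined_risk >= self.threshold
```

The design point is that a sequence of probes, each scoring well below a single-turn threshold, can still push the session-level score over the limit, which is the kind of evidence aggregation a per-message filter cannot perform.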
Conclusion
"Furina" provides an advanced technical account of refusal instability, offering both a unifying theoretical framework and practical attack method that exposes structural weaknesses in contemporary LLM and MLLM safety alignment. Its demonstration of uncertainty-driven, distributed-risk adversarial prompting necessitates a reevaluation of binary boundary assumptions in alignment research and highlights the urgency of developing uncertainty- and context-aware defense mechanisms. The methodology and diagnostic toolkit established in this work are likely to shape subsequent safety evaluations and inform the design of next-generation robust alignment protocols.