
Stealth Fine-Tuning in RVLMs

Updated 25 November 2025
  • Stealth Fine-Tuning is an attack paradigm targeting RVLMs that exploits exposed chain-of-thought outputs to bypass safety alignment.
  • It employs segment-level interference to rewrite and mask refusal logic while preserving overall reasoning and model utility.
  • Empirical results show that attack success rates scale rapidly with rewriting depth at minimal resource cost, posing severe security challenges.

Stealth Fine-Tuning is an attack paradigm and methodology targeting Reasoning-Augmented Vision-LLMs (RVLMs), which are models that integrate explicit step-by-step chain-of-thought (CoT) reasoning into their outputs, in addition to standard visual and textual processing. Stealth Fine-Tuning is specifically designed to bypass safety alignment mechanisms while preserving the general reasoning and utility of the underlying RVLM, thereby representing a significant threat to the security posture of modern multimodal systems (Yu et al., 18 Nov 2025).

1. RVLM Architecture and Chain-of-Thought Vulnerability

RVLMs extend standard vision-LLMs (VLMs) by interleaving their output streams with explicit, multi-step reasoning traces (CoT), denoted $R = (r_1, r_2, \dots, r_n)$, which precede a final answer $A$. Generation of these traces is typically triggered by a dedicated token (e.g., a <think> tag). At each decoding step, the LLM head attends both to visual embeddings from a frozen or fine-tuned image encoder $E_v(V)$ and to the previous reasoning tokens, allowing for iterative, intermediate inferences grounded in visual evidence. This is formalized as:

$$p(R, A \mid V, Q) = p(R \mid E_v(V), Q) \cdot p(A \mid E_v(V), Q, R)$$

where $V$ is the image input and $Q$ is the natural-language question (Yu et al., 18 Nov 2025).
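
To make the factorization concrete, the following minimal sketch accumulates the two log-probability terms token by token. The callables encode_image and next_token_logprob are hypothetical stand-ins for the RVLM's image encoder and LM head, not an interface described in the paper.

```python
# Minimal sketch of the factorization
# p(R, A | V, Q) = p(R | E_v(V), Q) * p(A | E_v(V), Q, R).
# encode_image and next_token_logprob are hypothetical stand-ins, not the paper's API.

def trace_logprob(encode_image, next_token_logprob, image, question, reasoning, answer):
    """Accumulate log p(R | E_v(V), Q) + log p(A | E_v(V), Q, R) token by token."""
    v = encode_image(image)               # visual embeddings E_v(V)
    context = [question]                  # the LM head attends to E_v(V) and prior tokens
    logp = 0.0
    for r_tok in reasoning:               # reasoning trace R = (r_1, ..., r_n)
        logp += next_token_logprob(v, context, r_tok)
        context.append(r_tok)
    for a_tok in answer:                  # final answer A, conditioned on V, Q and R
        logp += next_token_logprob(v, context, a_tok)
        context.append(a_tok)
    return logp                           # log p(R, A | V, Q)
```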

The explicit exposure of intermediate reasoning introduces new attack surfaces. Specifically, refusal logic, alignment disclaimers, or self-evaluation statements interspersed in the CoT can be identified and manipulated by an adversary.

2. Safety Alignment Mechanisms and Attack Surface

To prevent undesirable behavior (e.g., generation of harmful, unethical, or policy-violating outputs), RVLMs employ several layers of safety alignment:

  • Filtered pre-training and RLHF-style fine-tuning: Models are initially trained and further refined on datasets filtered to exclude problematic content and are reinforced through reward modeling for policy compliance.
  • Prompt-invoked refusal logic: Policy-checking tokens or template segments are strategically injected into the CoT to trigger self-rejection or disclaimers when risky questions or reasoning paths are detected (an illustrative sketch follows this list).
  • Chain-of-thought reflection: Some architectures conduct on-the-fly self-evaluation, where at each reasoning step the model may revise or truncate unsafe trajectories.
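
As a purely illustrative sketch of the second mechanism, the snippet below shows what a prompt-invoked refusal segment embedded in a CoT might look like and the kind of lightweight trigger that could append it; the specific template and trigger terms are assumptions, not details from the paper.

```python
# Purely illustrative: a hypothetical refusal template and trigger check; the paper
# does not specify concrete templates or trigger terms.
REFUSAL_SEGMENT = (
    "Wait, this request appears to violate the usage policy. "
    "I should refuse and explain why instead of continuing."
)

RISKY_MARKERS = ("weapon", "exploit", "bypass security")  # hypothetical trigger terms

def maybe_inject_refusal(reasoning_so_far: str) -> str:
    """Append a refusal segment to the CoT when a risky reasoning path is detected."""
    if any(marker in reasoning_so_far.lower() for marker in RISKY_MARKERS):
        return reasoning_so_far + " " + REFUSAL_SEGMENT
    return reasoning_so_far
```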

However, because the entire reasoning trace $R$ is rendered accessible to the end-user (or attacker), an adversary can systematically locate refusal segments $\{s_i\}$ and either rewrite (to remove or mask the refusal logic) or excise them, then resubmit the modified reasoning trace to solicit unsafe outputs. This segment-level vulnerability is not addressed by typical end-to-end policy filters, making the explicit CoT a distinct vector for jailbreak attacks (Yu et al., 18 Nov 2025).
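
The following is a minimal sketch of how an adversary might locate and excise refusal segments with simple pattern matching; the refusal keyword pattern is an assumption for illustration, and the attack described below uses an LLM-driven rewriter rather than keyword-based excision.

```python
import re

# Illustrative refusal markers; the paper does not enumerate a specific keyword list,
# so this pattern is an assumption for demonstration only.
REFUSAL_PATTERN = re.compile(
    r"(I (cannot|can't|won't)|as an AI|against (the )?policy|I'm sorry, but)",
    re.IGNORECASE,
)

def split_segments(cot_trace: str) -> list[str]:
    """Split a chain-of-thought trace into sentence-level segments s_1, ..., s_n."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot_trace) if s.strip()]

def locate_refusal_segments(segments: list[str]) -> list[int]:
    """Return indices of segments that carry refusal or disclaimer logic."""
    return [i for i, s in enumerate(segments) if REFUSAL_PATTERN.search(s)]

def excise(segments: list[str], refusal_indices: list[int]) -> str:
    """Drop the refusal segments and re-join the remaining reasoning."""
    keep = set(range(len(segments))) - set(refusal_indices)
    return " ".join(segments[i] for i in sorted(keep))
```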

3. Stealth Fine-Tuning Methodology

The Stealth Fine-Tuning attack circumvents alignment by leveraging the RVLM’s own self-generated outputs and an iterative segment-level interference-and-rewriting scheme. The process consists of two main phases:

3.1 Segment-Level Interference

  1. Initialization: For each query–image pair $(Q, V)$ in a benchmark like SafeBench, the attacker queries the RVLM, obtaining a refusal-embedded CoT trace $R_0 = (s_1, \dots, s_n)$.
  2. Iterative Rewriting: Each segment $s_i$ is rewritten by an LLM-driven function $\mathcal{R}_{\text{seg}}(s_i)$ (e.g., using DeepSeek-R1) with a system prompt to preserve coherence but remove refusal logic.
  3. Feedback: The resulting modified trace $R_t$ is then combined with the original prompt and submitted to the victim RVLM. If the final answer $A_t$ is judged harmful (automatically, via a secondary LLM such as GPT-4o), the $(Q, V, R_t, A_t)$ tuple is collected; otherwise, further rewrites are performed up to a turn limit $T$.
  4. Self-Generation Dataset: The attacker accumulates a set of self-generated, harmful CoT examples that remain highly distribution-consistent (i.e., similar in format and representation to the original RVLM outputs). A minimal code sketch of this interference loop follows the list.
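
A minimal sketch of the interference loop, under stated assumptions: victim, rewriter, and judge are hypothetical callables standing in for the target RVLM, the DeepSeek-R1-driven rewriting function $\mathcal{R}_{\text{seg}}$, and the GPT-4o-based harmfulness judge; split_segments is the helper from the sketch in Section 2.

```python
def segment_level_interference(query, image, victim, rewriter, judge, max_turns=6):
    """Iteratively rewrite refusal-bearing segments until the victim emits a harmful answer.

    victim(query, image, injected_trace=None) -> (trace, answer)   # hypothetical RVLM wrapper
    rewriter(segment) -> rewritten segment                         # stands in for R_seg
    judge(query, answer) -> True if the answer is harmful          # stands in for the judge LLM
    """
    trace, _ = victim(query, image)                       # R_0 with embedded refusal logic
    for t in range(1, max_turns + 1):                     # turn limit T = max_turns
        # Rewrite every segment to preserve coherence while stripping refusal logic.
        trace = " ".join(rewriter(s) for s in split_segments(trace))
        _, answer = victim(query, image, injected_trace=trace)
        if judge(query, answer):                          # harmful -> keep for the dataset
            return {"query": query, "image": image, "trace": trace,
                    "answer": answer, "turn": t}
    return None                                           # no harmful completion within T turns
```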

Statistically, the attack success rate (ASR) exhibits rapid scaling with rewriting depth: ASR increases from 2.0% at $t=0$ (no rewriting) to 13.8% at $t=1$, reaching 96.6% by $t=6$, indicating the ease with which segment-level interference can defeat current alignment policies (Yu et al., 18 Nov 2025).

3.2 Fine-Tuning with Self-Generated CoT

Once suitably adversarial examples are collected, the attacker applies parameter-efficient fine-tuning (e.g., QLoRA) on the original model using the self-generated tuples $(Q, V, R_t, A_t)$. Each example is weighted exponentially according to its turn $t$, prioritizing early-turn (less distributionally-drifted) samples:

$$w_t = \exp(-\alpha t), \quad \alpha = 0.6$$

The weighted cross-entropy objective is then minimized over the CoT and answer tokens. This strategy preserves the overall latent representation distribution of the original model, as confirmed by t-SNE analyses, and prevents classic catastrophic utility–safety trade-offs (Yu et al., 18 Nov 2025).
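
A minimal PyTorch sketch of the turn-weighted objective, assuming a standard causal LM loss where prompt and image positions are masked out; the QLoRA adapter and quantization setup is omitted, and the tensor layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.6  # decay rate from the paper: w_t = exp(-alpha * t)

def weighted_ce_loss(logits, labels, turns, ignore_index=-100):
    """Turn-weighted cross-entropy over the CoT and answer tokens.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) with prompt/image
    positions set to ignore_index; turns: (batch,) rewriting turn t of each example.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=ignore_index, reduction="none"
    )                                                  # (batch, seq_len)
    mask = (labels != ignore_index).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    weights = torch.exp(-ALPHA * turns.float())        # w_t = exp(-alpha * t)
    return (weights * per_example).sum() / weights.sum()
```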

4. Experimental Evaluation and Effectiveness

Empirical studies with Stealth Fine-Tuning demonstrate:

  • High Efficacy: With only 499 self-generated examples, the method achieved 65.19% ASR on AdvBench’s harmful subset—an increase of 38.52 percentage points over the previous SOTA (IDEATOR, 26.67%). When combined with segment interference during inference, ASR rises further to 76.20%.
  • Preservation of Utility: Post-attack, the model’s accuracy on general benchmarks such as MMLU-Pro, GSM8K, MathVista, and MMMU-Pro is preserved or improved (e.g., MMLU-Pro: 66.82% vs. 56.09% pre-attack). The representation manifold remains close to the original model for low-turn examples, further supporting distributional consistency.
| Method | AdvBench ASR | MMLU-Pro | GSM8K | MathVista | MMMU-Pro |
|---|---|---|---|---|---|
| Base Qwen3-VL-4B-Thk | 0.00% | 56.09% | 71.24% | 61.13% | 42.04% |
| IDEATOR | 26.67% | – | – | – | – |
| Stealth Fine-Tuning | 65.19% | 66.82% | 92.63% | 60.54% | 54.70% |
| Stealth + Interf. | 76.20% | – | – | – | – |

This security risk is further amplified by the low resource requirement: the attack is feasible with a single A100 GPU in under three hours, underscoring its practical threat (Yu et al., 18 Nov 2025).

5. Security Implications and Countermeasures

Stealth Fine-Tuning exposes a fundamental vulnerability in the design of interpretability-centered RVLMs: the transparency of intermediate CoT renders the alignment surface manipulable via external re-writing or re-injection. Both prompt-level and RLHF-style safety policies can be circumvented because the attack leverages the model’s own output distribution, thus avoiding distribution shift or utility loss.

Suggested defense strategies:

  • Obfuscation or restriction of CoT exposure: Limit API/user access to only final answers, or aggregate intermediate reasoning into non-editable formats.
  • Dynamic variation or watermarking: Randomly watermark or vary the structure of CoT segments, impeding automated segment-level rewriting.
  • Secondary, tight verification: Integrate an external verifier that assesses both intermediate reasoning and final answers for compliance, beyond basic refusal triggers (see the sketch after this list).
  • Hardening via adversarial training: Continually retrain models on strong self-generated “jailbreak” CoTs to immunize against self-injection and reflection bypass.
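
As an illustration of the secondary-verification strategy, the sketch below gates both intermediate reasoning and the final answer behind an external verifier; rvlm and moderation_check are hypothetical callables, split_segments is the helper from Section 2, and this is an illustrative design rather than a defense evaluated in the paper.

```python
def verified_generate(rvlm, moderation_check, query, image):
    """Gate both intermediate reasoning and the final answer behind an external verifier.

    rvlm(query, image) -> (reasoning_trace, answer) and moderation_check(text) -> bool
    are hypothetical callables; this illustrates the idea, not a method from the paper.
    """
    trace, answer = rvlm(query, image)
    # Verify the reasoning trace segment by segment, not just the final answer, so that
    # excised or rewritten refusal logic is still caught downstream of generation.
    for segment in split_segments(trace):
        if moderation_check(segment):
            return "Request declined by the external verifier."
    if moderation_check(answer):
        return "Request declined by the external verifier."
    return answer
```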

The effectiveness of these measures remains an open research question; the Stealth Fine-Tuning paper primarily demonstrates that existing architectures are highly susceptible to such attacks (Yu et al., 18 Nov 2025).

6. Broader Context and Impact

Stealth Fine-Tuning operates within a landscape of rapidly evolving safety-aligned vision-LLMs. While prior alignment bypasses targeted shallow prompts or answer-only outputs, Stealth Fine-Tuning demonstrates that CoT transparency, regarded as an asset for auditing and interpretability, can itself become the principal liability. This suggests that future RVLM and VLM alignment must explicitly confront the trade-off between interpretable reasoning and attack-surface minimization. The broader implication is that any reasoning-augmented architecture (including language-only LLMs made "verbalizable" through CoT) may be inherently more vulnerable than opaque or selective-outputting models, barring fundamental advances in defensive design or external verification.

7. References

  • "Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT" (Yu et al., 18 Nov 2025)