BadVSFM: Backdoor Attacks on VSFMs
- BadVSFM is a targeted backdoor framework for video segmentation foundation models that uses a two-stage poisoning process to enforce representation separation and enable prompt-agnostic attacks.
- It leverages encoder steering to drive triggered embeddings toward a universal target before decoder training maps them to malicious segmentation masks.
- Experiments on datasets like DAVIS-2017 and LVOS show attack success rates over 90% with minimal degradation in clean segmentation utility, exposing critical defense gaps.
BadVSFM is a targeted backdoor attack framework specifically developed for prompt-driven Video Segmentation Foundation Models (VSFMs), such as SAM2, MedSAM2, SAM2-Long, BioSAM2, and EdgeTAM. Unlike classic backdoor methods (e.g., BadNet, WaNet, Blended, FIBA), which fail to generalize to VSFM architectures due to their specific encoder-decoder dynamics and prompt-conditioned outputs, BadVSFM leverages a two-stage, representation-driven poisoning pipeline that achieves strong, controllable, and prompt-agnostic backdoor effects while preserving segmentation utility on clean data (Zhang et al., 26 Dec 2025).
1. Threat Model and Attacker Objectives
The attacker's objective is to release a VSFM that, when presented with a small, benign-looking visual trigger in any video frame, outputs a predefined malicious segmentation mask (e.g., an all-zero mask for object disappearance, or a centered circle for mask deformation). The attack must not compromise the original segmentation accuracy on non-triggered (clean) inputs. The attacker is assumed to have access to clean video segmentation datasets (e.g., DAVIS, LVOS) with per-frame ground-truth, and to a pretrained VSFM that can be fine-tuned via standard pipelines. The poisoning process overlays triggers onto a small fraction of frames (default 5%) and assigns these frames the attack target mask.
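Concretely, the poisoning step can be sketched in a few lines; the trigger pattern, its placement, and the all-zero disappearance mask below are illustrative choices rather than the paper's exact configuration:

```python
import numpy as np

def apply_trigger(frame, size=24):
    """Paste a small white square in the bottom-right corner (illustrative trigger)."""
    out = frame.copy()
    out[-size:, -size:, :] = 255
    return out

def poison_dataset(frames, masks, poison_rate=0.05, seed=0):
    """Overlay the trigger on a fraction of frames and replace their
    ground-truth masks with the attack target (here: object disappearance)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=int(poison_rate * len(frames)), replace=False)
    for i in idx:
        frames[i] = apply_trigger(frames[i])
        masks[i] = np.zeros_like(masks[i])  # all-zero mask = object disappears
    return frames, masks, set(idx.tolist())
```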
2. Two-Stage Poisoning Methodology
BadVSFM employs a decoupled, two-stage optimization procedure that enforces orthogonality between clean and triggered representations—contrary to the aligned gradients characteristic of single-stage attacks, which are ineffective in VSFMs.
Stage 1: Encoder Steering
- For triggered frames, the image encoder $E_\theta$ is fine-tuned so that all embeddings converge toward a single designated target embedding $e^{*}$.
- For clean frames, a utility loss keeps the embeddings $E_\theta(x)$ aligned to the reference encoder $E_{\mathrm{ref}}$ (a frozen copy trained on unpoisoned data).
The combined loss is
$$\mathcal{L}_{\mathrm{enc}} = \mathcal{L}_{\mathrm{bd}} + \lambda_1 \mathcal{L}_{\mathrm{util}},$$
where
$$\mathcal{L}_{\mathrm{bd}} = \mathbb{E}_{x}\big[\lVert E_\theta(x \oplus t) - e^{*} \rVert_2^2\big], \qquad \mathcal{L}_{\mathrm{util}} = \mathbb{E}_{x}\big[\lVert E_\theta(x) - E_{\mathrm{ref}}(x) \rVert_2^2\big],$$
and $x \oplus t$ denotes a frame overlaid with the trigger $t$.
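A single Stage 1 update can be sketched in PyTorch under the squared-error losses above; the module names, signatures, and broadcasting of the target embedding are placeholder assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def stage1_step(encoder, encoder_ref, target_emb, clean_x, trig_x, opt, lam1=1.0):
    """One encoder-steering step: pull triggered embeddings toward the universal
    target embedding e*, while anchoring clean embeddings to the frozen reference."""
    opt.zero_grad()
    z_trig = encoder(trig_x)                                  # embeddings of triggered frames
    loss_bd = F.mse_loss(z_trig, target_emb.expand_as(z_trig))
    with torch.no_grad():
        z_ref = encoder_ref(clean_x)                          # frozen reference embeddings
    loss_util = F.mse_loss(encoder(clean_x), z_ref)
    loss = loss_bd + lam1 * loss_util
    loss.backward()
    opt.step()
    return loss_bd.item(), loss_util.item()
```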
Stage 2: Decoder Training
- Only the mask decoder $D_\phi$ is updated. For triggered frame-prompt pairs, the decoder receives the target embedding $e^{*}$ together with an arbitrary prompt encoding $p$, and must always produce the attack mask $m^{*}$ (e.g., an empty mask).
- For clean frame-prompt pairs, the decoder output is regularized by alignment with the outputs of a frozen clean reference decoder $D_{\mathrm{ref}}$.
The loss is
$$\mathcal{L}_{\mathrm{dec}} = \mathcal{L}_{\mathrm{atk}} + \lambda_2 \mathcal{L}_{\mathrm{align}},$$
with
$$\mathcal{L}_{\mathrm{atk}} = \mathbb{E}_{p}\big[\ell\big(D_\phi(e^{*}, p),\, m^{*}\big)\big], \qquad \mathcal{L}_{\mathrm{align}} = \mathbb{E}_{(x,p)}\big[\ell\big(D_\phi(E_\theta(x), p),\, D_{\mathrm{ref}}(E_{\mathrm{ref}}(x), p)\big)\big],$$
where $\ell$ is a standard segmentation loss.
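A corresponding Stage 2 sketch, taking binary cross-entropy as an illustrative choice of $\ell$ and assuming a `decoder(embedding, prompt)` signature:

```python
import torch
import torch.nn.functional as F

def stage2_step(decoder, decoder_ref, encoder, encoder_ref, target_emb,
                attack_mask, clean_x, prompts, opt, lam2=1.0):
    """One decoder-training step: map (e*, any prompt) to the attack mask,
    while matching the frozen clean reference decoder on clean inputs."""
    opt.zero_grad()
    # Attack branch: target embedding + arbitrary prompt -> attack mask m*.
    logits_atk = decoder(target_emb, prompts)
    loss_atk = F.binary_cross_entropy_with_logits(
        logits_atk, attack_mask.expand_as(logits_atk))
    # Utility branch: align with the clean reference model's soft outputs.
    with torch.no_grad():
        z = encoder(clean_x)                                  # encoder is frozen in Stage 2
        ref_mask = torch.sigmoid(decoder_ref(encoder_ref(clean_x), prompts))
    loss_align = F.binary_cross_entropy_with_logits(decoder(z, prompts), ref_mask)
    loss = loss_atk + lam2 * loss_align
    loss.backward()
    opt.step()
    return loss_atk.item(), loss_align.item()
```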
3. Gradient-Conflict Mechanism and Attention Manipulation
BadVSFM explicitly engineers strong gradient conflicts between triggered and clean samples during encoder fine-tuning. Cosine similarity analysis of encoder gradients reveals:
- BadNet (single-stage): mean cosine ≈ +0.285, indicating gradient alignment and hence no representation separation.
- BadVSFM Stage 1: mean cosine ≈ −0.214, with ~95% of sample pairs having cosine < 0, confirming successful gradient conflict and separation.
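This diagnostic amounts to flattening the encoder gradients induced by a triggered batch and a clean batch and comparing them; a minimal sketch, where `loss_fn` and the batch format are assumptions:

```python
import torch
import torch.nn.functional as F

def grad_cosine(model, loss_fn, batch_a, batch_b):
    """Cosine similarity between the parameter gradients induced by two
    batches (e.g., triggered vs. clean); negative values indicate conflict."""
    def flat_grad(batch):
        model.zero_grad()
        loss_fn(model, batch).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None])
    g_a = flat_grad(batch_a)
    g_b = flat_grad(batch_b)
    return F.cosine_similarity(g_a, g_b, dim=0).item()
```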
Attention rollout visualizations further show that, under BadVSFM, attention heads in the encoder heavily localize to the trigger region, while in classic backdoors, attention remains on true object regions regardless of trigger presence. This shift is necessary for the prompt-agnostic decoder to implement the attack mapping.
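Attention rollout itself is a standard transformer diagnostic (Abnar & Zuidema, 2020): per-layer attention maps are averaged over heads, augmented with the residual identity, renormalized, and multiplied through the layers. A short sketch, assuming `attentions` is a list of per-layer `(heads, N, N)` tensors:

```python
import torch

def attention_rollout(attentions):
    """Roll attention out across layers: average heads, add the identity for
    residual connections, renormalize rows, and compose layer by layer."""
    result = None
    for attn in attentions:                                   # attn: (heads, N, N)
        a = attn.mean(dim=0)                                  # average over heads
        a = a + torch.eye(a.size(-1), device=a.device)        # residual connection
        a = a / a.sum(dim=-1, keepdim=True)                   # row-normalize
        result = a if result is None else a @ result
    return result                                             # (N, N) token attribution
```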
4. Embedding Modifications and Architectural Considerations
The attack is anchored by a universal, attacker-chosen target embedding $e^{*}$ that serves as the activation hub for all triggered frames, enabling the decoder to learn a deterministic mapping from $e^{*}$ to the attack mask $m^{*}$ for arbitrary prompts. Reference networks ($E_{\mathrm{ref}}$, $D_{\mathrm{ref}}$) constrain clean behavior to maintain segmentation utility. The attack mask is typically a trivial or visually plausible segmentation output (e.g., an all-zero mask for object disappearance).
The architectural design does not require modifications to the prompt encoder or inference pipeline. Only the weights of the encoder (Stage 1) and decoder (Stage 2) are updated sequentially.
5. Experimental Evaluation and Key Findings
Experiments were conducted on DAVIS-2017 and LVOS datasets, with five major VSFMs targeted. Metrics include mean IoU (mIoU), J & F (region and contour average), and attack success rate (ASR).
Summary of results (SAM2, DAVIS-2017, 5% poison):
| Model / Attack | ASR (point prompt) | Clean mIoU | Clean J&F |
|---|---|---|---|
| Clean Model | 2.1% | 0.642 | 0.526 |
| BadNet (baseline) | 3.5% | 0.425 | 0.256 |
| BadVSFM (Blended) | 95.3% | 0.596 | 0.411 |
| BadVSFM (BadNet) | 93.0% | 0.556 | 0.377 |
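The two headline metrics can be made concrete with a short sketch. Here a triggered frame counts as an attack success when its predicted mask matches the target mask above an IoU threshold; the 0.9 threshold is an assumption, not the paper's stated criterion:

```python
import numpy as np

def iou(pred, target):
    """IoU of two binary masks; empty-vs-empty counts as a perfect match."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = (pred | target).sum()
    return 1.0 if union == 0 else (pred & target).sum() / union

def attack_success_rate(triggered_preds, attack_mask, thresh=0.9):
    """Fraction of triggered frames whose prediction matches the attack mask."""
    return float(np.mean([iou(p, attack_mask) >= thresh for p in triggered_preds]))

def mean_iou(clean_preds, gts):
    """Clean utility: mean IoU against per-frame ground-truth masks."""
    return float(np.mean([iou(p, g) for p, g in zip(clean_preds, gts)]))
```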
- BadVSFM achieves ASR exceeding 90% across all trigger styles under point prompts.
- The attack remains highly effective for box prompts (ASR ≈ 94%) and transfers partially to mask prompts (ASR ≈ 48–66%), whereas all baselines remain below 5% ASR.
- Physical-world triggers (e.g., placing a leaf, cone, or baseball in the scene) yield ASR ≈ 91–94%.
- Clean segmentation quality is largely preserved, with mIoU and J&F close to the clean reference (within ~0.02 in the strongest configurations).
Ablations show that:
- Stage 1 only: high ASR, but clean utility degrades.
- Stage 2 only: clean utility is preserved, but ASR collapses.
- Both stages: strong attack with clean utility maintained.
BadVSFM exhibits robustness to changes in the loss weights ($\lambda_1$, $\lambda_2$), poisoning rates (2–15%), trigger style and location, and target mask design.
6. Failure of Classic Backdoors versus BadVSFM's Effectiveness
Transferring single-stage backdoors (BadNet, Blended, WaNet, FIBA) to VSFM architectures is ineffective: ASR remains below 5%, because gradient alignment in the encoder prevents the model from learning a trigger-specific representation, and the decoder therefore cannot reliably map triggered embeddings to an adversarial mask. BadVSFM's two-stage approach, by enforcing encoder representation separation and prompt-agnostic decoder conditioning, overcomes this fundamental limitation, producing strong, controllable attacks that persist across architectures and datasets.
7. Defense Evaluation and Implications
Four major defense mechanisms have been evaluated:
- Fine-tuning on clean data: ASR remains >90%, clean scores modestly improve but attack persists.
- Channel pruning (up to 30 conv channels): negligible impact on ASR.
- Spectral Signatures and STRIP: fail to detect or remove the backdoor, because the attack is anchored in representation space and its target mask is trivial.
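As an example of how such a defense is instantiated, the pruning baseline can be approximated by zeroing the convolutional channels with the lowest mean activation on clean data and re-measuring ASR; the channel-ranking criterion below is an assumption, and the single forward pass elides the VSFM's prompt inputs:

```python
import torch
import torch.nn as nn

def prune_low_activation_channels(model, clean_batch, n_channels=30):
    """Zero the n conv output channels with the smallest mean |activation| on
    clean data; the pruned model is then re-evaluated for ASR and clean mIoU."""
    stats, hooks = [], []                   # (mean |activation|, module, channel)
    def make_hook(module):
        def hook(_, __, out):
            act = out.detach().abs().mean(dim=(0, 2, 3))      # per-channel mean
            stats.extend((v.item(), module, c) for c, v in enumerate(act))
        return hook
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(make_hook(m)))
    with torch.no_grad():
        model(clean_batch)                  # collect activation statistics
    for h in hooks:
        h.remove()
    for _, module, c in sorted(stats, key=lambda s: s[0])[:n_channels]:
        module.weight.data[c].zero_()       # remove the channel's filter
        if module.bias is not None:
            module.bias.data[c].zero_()
    return model
```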
These results indicate a critical vulnerability in current VSFMs: encoder/representation-driven backdoor attacks like BadVSFM evade existing defense strategies. Effective mitigation may require development of spatiotemporal anomaly detection, architectural audits, or representation-distillation defenses tailored to the peculiarities of prompt-driven video segmentation (Zhang et al., 26 Dec 2025).
BadVSFM represents a significant advance in backdoor methodology for VSFMs, demonstrating that encoder-decoder decoupling and representation manipulation are required for effective and robust attacks in this domain, and highlighting the urgent necessity for bespoke defensive paradigms.