
BadVSFM: Backdoor Attacks on VSFMs

Updated 2 January 2026
  • BadVSFM is a targeted backdoor framework for video segmentation models that uses a two-stage poisoning process to enforce representation separation and prompt-agnostic attacks.
  • It leverages encoder steering to drive triggered embeddings toward a universal target before decoder training maps them to malicious segmentation masks.
  • Experiments on datasets like DAVIS-2017 and LVOS show attack success rates over 90% with minimal degradation in clean segmentation utility, exposing critical defense gaps.

BadVSFM is a targeted backdoor attack framework specifically developed for prompt-driven Video Segmentation Foundation Models (VSFMs), such as SAM2, MedSAM2, SAM2-Long, BioSAM2, and EdgeTAM. Unlike classic backdoor methods (e.g., BadNet, WaNet, Blended, FIBA), which fail to generalize to VSFM architectures due to their specific encoder-decoder dynamics and prompt-conditioned outputs, BadVSFM leverages a two-stage, representation-driven poisoning pipeline that achieves strong, controllable, and prompt-agnostic backdoor effects while preserving segmentation utility on clean data (Zhang et al., 26 Dec 2025).

1. Threat Model and Attacker Objectives

The attacker's objective is to release a VSFM that, when presented with a small, benign-looking visual trigger in any video frame, outputs a predefined malicious segmentation mask (e.g., an all-zero mask for object disappearance, or a centered circle for mask deformation). The attack must not compromise the original segmentation accuracy on non-triggered (clean) inputs. The attacker is assumed to have access to clean video segmentation datasets (e.g., DAVIS, LVOS) with per-frame ground-truth, and to a pretrained VSFM that can be fine-tuned via standard pipelines. The poisoning process overlays triggers onto a small fraction of frames (default 5%) and assigns these frames the attack target mask.
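The poisoning step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the trigger patch, its top-left placement, and all function names are assumptions.

```python
import numpy as np

def poison_frames(frames, masks, trigger, target_mask, rate=0.05, seed=0):
    """Overlay a small visual trigger on a fraction (default 5%) of frames
    and relabel those frames with the attacker's target mask.
    frames: (N, H, W, C) uint8 video frames; masks: (N, H, W) ground truth.
    trigger: (h, w, C) patch, pasted at the top-left corner for simplicity
    (the actual trigger style and location are attacker-chosen).
    """
    rng = np.random.default_rng(seed)
    frames, masks = frames.copy(), masks.copy()
    n_poison = max(1, int(rate * len(frames)))
    idx = rng.choice(len(frames), size=n_poison, replace=False)
    h, w = trigger.shape[:2]
    for i in idx:
        frames[i, :h, :w] = trigger   # stamp the trigger patch
        masks[i] = target_mask        # e.g., all-zero "disappearance" mask
    return frames, masks, idx
```

The returned index set identifies the poisoned subset D_trig used by both training stages.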

2. Two-Stage Poisoning Methodology

BadVSFM employs a decoupled, two-stage optimization procedure that enforces orthogonality between clean and triggered representations—contrary to the aligned gradients characteristic of single-stage attacks, which are ineffective in VSFMs.

Stage 1: Encoder Steering

  • For triggered frames, the image encoder f is fine-tuned so that all embeddings f(x_trig) converge toward a single "designated target embedding" f(x_target).
  • For clean frames, a utility loss keeps f(x_clean) aligned with the reference encoder output f'(x_clean) (a frozen copy trained on unpoisoned data).

The combined loss is:

\mathcal{L}_{\text{stage1}} = \lambda_{1}\,\mathcal{L}^{1}_{\text{eff}} + \mathcal{L}^{1}_{\text{util}}

where

\mathcal{L}^{1}_{\text{eff}} = \frac{1}{|D_{trig}|} \sum_{x \in D_{trig}} \| f(x) - f(x_{target}) \|_2^2

\mathcal{L}^{1}_{\text{util}} = \frac{1}{|D_{clean}|} \sum_{x \in D_{clean}} \| f(x) - f'(x) \|_2^2
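The Stage-1 objective combines the two squared-error terms above. A minimal NumPy sketch, with array shapes and names as assumptions:

```python
import numpy as np

def stage1_loss(f_trig, f_target, f_clean, f_ref_clean, lam1=1.0):
    """Stage-1 loss: pull every triggered embedding toward the single
    target embedding (effectiveness) while keeping clean embeddings
    close to the frozen reference encoder's (utility).
    f_trig:      (N_trig, d)  embeddings f(x) for x in D_trig
    f_target:    (d,)         designated target embedding f(x_target)
    f_clean:     (N_clean, d) embeddings f(x) for x in D_clean
    f_ref_clean: (N_clean, d) frozen reference embeddings f'(x)
    """
    l_eff = np.mean(np.sum((f_trig - f_target) ** 2, axis=1))
    l_util = np.mean(np.sum((f_clean - f_ref_clean) ** 2, axis=1))
    return lam1 * l_eff + l_util
```

In actual fine-tuning this scalar would be minimized with respect to the encoder weights by backpropagation; the sketch only evaluates the objective.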

Stage 2: Decoder Training

  • Only the mask decoder g is updated. For triggered frame-prompt pairs (x_trig, r), the decoder receives f(x_trig) together with an arbitrary prompt encoding h(r), and must always produce the attack mask q (e.g., an empty mask).
  • For clean frame-prompt pairs, the decoder output is regularized by alignment with the output of a frozen clean reference decoder, g'(f(x), h(r)).

The loss is:

\mathcal{L}_{\text{stage2}} = \mathcal{L}^{2}_{\text{eff}} + \lambda_{2}\,\mathcal{L}^{2}_{\text{util}}

with

\mathcal{L}^{2}_{\text{eff}} = \frac{1}{|D_{trig}|\,|P|} \sum_{x \in D_{trig}} \sum_{r \in P} \big[ \mathcal{L}_{BCE}(p(x,r), q) + \mathcal{L}_{DICE}(p(x,r), q) \big]

\mathcal{L}^{2}_{\text{util}} = \frac{1}{|D_{clean}|\,|P|} \sum_{x \in D_{clean}} \sum_{r \in P} \| g(f(x), h(r)) - g'(f(x), h(r)) \|_2^2
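The per-pair BCE + Dice term inside the Stage-2 effectiveness loss can be sketched as below. This is a generic formulation of the two standard losses, not the paper's exact code; shapes and the epsilon smoothing are assumptions.

```python
import numpy as np

def bce_dice(pred, target, eps=1e-7):
    """One term of the Stage-2 effectiveness loss: binary cross-entropy
    plus Dice loss between a predicted mask p(x, r) and the attack mask q.
    pred, target: flattened mask probabilities/labels in [0, 1].
    """
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    dice = 1.0 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + dice
```

Averaging this quantity over all triggered frames and all prompts in P yields the effectiveness term; the Dice part keeps the loss informative even for the near-empty target masks used in disappearance attacks.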

3. Gradient-Conflict Mechanism and Attention Manipulation

BadVSFM explicitly engineers strong gradient conflicts between triggered and clean samples during encoder fine-tuning. Cosine similarity analysis of encoder gradients reveals:

  • BadNet (single-stage): mean cosine ≈ +0.285, indicating gradient alignment and thus no representation separation.
  • BadVSFM Stage 1: mean cosine ≈ −0.214, with ~95% of sample pairs having cosine < 0, confirming successful gradient conflict and separation.
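The diagnostic behind these numbers can be sketched as follows: flatten per-sample encoder gradients for triggered and clean batches and measure pairwise cosine similarity. A hedged illustration; names and shapes are assumptions.

```python
import numpy as np

def gradient_conflict(g_trig, g_clean):
    """Cosine similarity between per-sample encoder gradients of
    triggered vs. clean samples (each row a flattened gradient).
    Returns (mean cosine, fraction of pairs with cosine < 0); a negative
    mean is the gradient conflict Stage 1 is designed to induce."""
    a = g_trig / np.linalg.norm(g_trig, axis=1, keepdims=True)
    b = g_clean / np.linalg.norm(g_clean, axis=1, keepdims=True)
    sims = a @ b.T  # all triggered-clean pairs
    return sims.mean(), (sims < 0).mean()
```

Applied to real encoder gradients, a single-stage attack would show a positive mean (alignment), while Stage-1 steering would show a negative mean with most pairs in conflict.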

Attention rollout visualizations further show that, under BadVSFM, attention heads in the encoder heavily localize to the trigger region, while in classic backdoors, attention remains on true object regions regardless of trigger presence. This shift is necessary for the prompt-agnostic decoder to implement the attack mapping.

4. Embedding Modifications and Architectural Considerations

The attack is anchored by a universal, attacker-chosen "target embedding" that serves as the activation hub for all triggered frames, enabling the decoder to learn a deterministic mapping from f(x_target) to the attack mask for arbitrary prompts. Reference networks (f', g') constrain clean behavior to maintain segmentation utility. The attack mask q is typically a trivial or visually plausible segmentation output (e.g., disappearance).

The architectural design does not require modifications to the prompt encoder or inference pipeline. Only the weights of the encoder (Stage 1) and decoder (Stage 2) are updated sequentially.
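The sequential update schedule can be made concrete with a small sketch (illustrative only; parameter-group names are assumptions, and real training would attach an optimizer to each stage's trainable set):

```python
def badvsfm_schedule(encoder_params, decoder_params):
    """Two-stage update schedule: Stage 1 fine-tunes only the image
    encoder f (decoder and prompt encoder frozen); Stage 2 then freezes
    the steered encoder and updates only the mask decoder g."""
    return [
        {"stage": 1, "trainable": set(encoder_params), "frozen": set(decoder_params)},
        {"stage": 2, "trainable": set(decoder_params), "frozen": set(encoder_params)},
    ]
```

Keeping the prompt encoder untouched in both stages is what makes the resulting backdoor prompt-agnostic at inference time.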

5. Experimental Evaluation and Key Findings

Experiments were conducted on DAVIS-2017 and LVOS datasets, with five major VSFMs targeted. Metrics include mean IoU (mIoU), J & F (region and contour average), and attack success rate (ASR).

Summary of results (SAM2, DAVIS-2017, 5% poison):

Model / Attack      | ASR (point prompt) | Clean mIoU | Clean J&F
--------------------|--------------------|------------|----------
Clean Model         | 2.1%               | 0.642      | 0.526
BadNet (baseline)   | 3.5%               | 0.425      | 0.256
BadVSFM (Blended)   | 95.3%              | 0.596      | 0.411
BadVSFM (BadNet)    | 93.0%              | 0.556      | 0.377

  • BadVSFM achieves ASR exceeding 90% across all triggers and prompt types.
  • Attack remains highly effective for box prompts (ASR ≈ 94%) and mask prompts (ASR ≈ 48–66%), whereas all baselines have ASR < 5%.
  • Physical-world triggers (e.g., placing a leaf, cone, or baseball in the scene) yield ASR ≈ 91–94%.
  • Clean segmentation quality is preserved, with mIoU/J & F within ~0.02 of the clean reference.

Ablations show that:

  • Stage 1 only: high ASR but clean utility degraded.
  • Stage 2 only: returned clean utility, but ASR collapsed.
  • Both stages: strong attack, clean utility maintained.

BadVSFM exhibits robustness to changes in loss weights (λ1, λ2), poisoning rates (2–15%), trigger style/location, and target mask design.

6. Failure of Classic Backdoors versus BadVSFM's Effectiveness

Transfer of single-stage backdoors (BadNet, Blended, WaNet, FIBA) to VSFM architectures is ineffective: ASR remains below 5% because encoder gradient alignment prevents the model from learning a trigger-specific representation, and the decoder cannot reliably map to an adversarial mask. BadVSFM's two-stage approach, by enforcing encoder representation separation and prompt-agnostic decoder conditioning, overcomes this fundamental limitation, producing strong, controllable attacks that persist under architectural and dataset diversity.

7. Defense Evaluation and Implications

Four major defense mechanisms have been evaluated:

  • Fine-tuning on clean data: ASR remains >90%, clean scores modestly improve but attack persists.
  • Channel pruning (up to 30 conv channels): negligible impact on ASR.
  • Spectral Signatures and STRIP: incapable of detecting or removing the backdoor due to the representation-anchored and trivial mask nature of the attack.

These results indicate a critical vulnerability in current VSFMs: encoder/representation-driven backdoor attacks like BadVSFM evade existing defense strategies. Effective mitigation may require development of spatiotemporal anomaly detection, architectural audits, or representation-distillation defenses tailored to the peculiarities of prompt-driven video segmentation (Zhang et al., 26 Dec 2025).


BadVSFM represents a significant advance in backdoor methodology for VSFMs, demonstrating that encoder-decoder decoupling and representation manipulation are required for effective and robust attacks in this domain, and highlighting the urgent necessity for bespoke defensive paradigms.
