
CGUB: CBL-Guided Unseen Backdoor

Updated 7 December 2025
  • The paper introduces CGUB, a novel attack paradigm that manipulates intermediate concept activations via a Concept Bottleneck Model to substitute target labels without altering raw inputs.
  • It employs a two-phase training protocol that combines CBM pre-training with backdoor fine-tuning using targeted loss functions to zero out top-k concept activations for unseen classes.
  • Empirical evaluations across multiple VLM architectures show high attack success rates with minimal impact on clean-task performance, highlighting its stealth and robustness.

CBL-Guided Unseen Backdoor (CGUB) is a backdoor attack paradigm on vision–language models (VLMs) that operates at the semantic concept level rather than by manipulating raw input pixels or adding imperceptible perturbations. CGUB leverages a Concept Bottleneck Model (CBM) during training to intervene on learned intermediate concept activations, effecting systematic label substitution for unseen classes in multimodal text generation. The backdoor is implemented entirely in model weight space, with all special training apparatus removed before deployment, rendering the attack highly stealthy and robust against conventional input-level defenses (Shen et al., 30 Nov 2025).

1. Architecture and Core Components

The CGUB framework requires no modifications to the victim VLM backbone during inference. It is structured as follows:

  • VLM Backbone: Any off-the-shelf VLM for image-to-text tasks, such as BLIP-2, LLaVA, or Qwen2.5-VL, can serve as the victim. Let $F_{lm}$ denote the model up to its final linear language modeling (LM) head, encompassing the ViT image encoder, multimodal adaptor (e.g., Q-Former or MLP), and a frozen LLM. The original next-token LM head is represented by $W_{\mathrm{orig}}$, mapping to the vocabulary output.
  • Concept Bottleneck Branch (CBM): Parallel to $W_{\mathrm{orig}}$, a Concept Bottleneck Layer (CBL) is attached, projecting token hidden states $\mathcal{H} \in \mathbb{R}^{L \times d}$ into concept activations $\mathcal{A} = \mathrm{ReLU}(W_{cbl}^{(\mathrm{in})} \mathcal{H}) \in \mathbb{R}^{L \times c}$, where $c$ is the number of concepts. These activations are mapped back to logits via $W_{cbl}^{(\mathrm{out})}$, producing a parallel next-token distribution.
  • Intervention Point: The attack manipulates $\mathcal{A}$ (the concept activations) prior to their projection, targeting only selected concept dimensions during training.

This modular design allows the CBM branch to be detached entirely after training, leaving only the original VLM architecture at deployment.
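
To make the parallel-branch structure concrete, the following PyTorch sketch shows one plausible implementation of a CBL attached alongside the original LM head. The class name, dimensions, and variable names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConceptBottleneckBranch(nn.Module):
    """Minimal sketch of a CBL attached in parallel to the original LM head.

    Token hidden states H (L x d) are projected to non-negative concept
    activations A (L x c) via W_cbl^(in), then mapped back to vocabulary
    logits via W_cbl^(out), giving a parallel next-token distribution.
    """
    def __init__(self, hidden_dim: int, num_concepts: int, vocab_size: int):
        super().__init__()
        self.w_in = nn.Linear(hidden_dim, num_concepts, bias=False)   # W_cbl^(in)
        self.w_out = nn.Linear(num_concepts, vocab_size, bias=False)  # W_cbl^(out)

    def forward(self, hidden_states: torch.Tensor):
        concepts = torch.relu(self.w_in(hidden_states))  # A = ReLU(W_cbl^(in) H)
        logits = self.w_out(concepts)                    # parallel next-token logits
        return concepts, logits

# Usage with assumed dimensions: L=16 tokens, d=4096, c=512 concepts, |V|=32000.
hidden = torch.randn(16, 4096)                   # hidden states from F_lm, before the LM head
cbl = ConceptBottleneckBranch(4096, 512, 32000)
concepts, cbl_logits = cbl(hidden)               # A in R^{16x512}, logits in R^{16x32000}
```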

2. Training Protocol and Loss Composition

CGUB employs a two-phase training regimen:

  • Phase A: CBM Pre-Training
    • Five losses are jointly optimized over a clean dataset $\mathcal{D}$:
    • $\mathcal{L}_{\mathrm{LM(orig)}}$: Cross-entropy for next-token prediction by the original LM head.
    • $\mathcal{L}_{\mathrm{LM(cbl)}}$: Cross-entropy for next-token prediction by the CBL branch.
    • $\mathcal{L}_{\mathrm{concept}}$: Fit of the concept activations to the concept annotations.
    • $\mathcal{L}_{\mathrm{KL}}$: KL divergence aligning the original and CBL output distributions.
    • $\lambda_{\mathrm{sparse}} \|W_{cbl}^{(\mathrm{in,out})}\|_1$: Sparsity regularization on the CBL weights for interpretability.
  • Phase B: Backdoor Fine-Tuning with Concept Intervention
    • With the CBM weights frozen, fine-tuning uses an MSE loss to implant a "zeroed" pattern for the top-$k$ concepts associated with the target label $\ell^\star$, combined with a KL term that transfers the intervention onto the original LM head and a CBL supervision loss (see the code sketch after this list):
    • $\mathcal{L}_{\mathrm{CGUB}} = \mathrm{MSE}(\mathcal{A},\hat{\mathcal{A}}) + \lambda_{\mathrm{reg}} D_{\mathrm{KL}}(\tilde{F}_{cbl} \,\|\, F_{\mathrm{orig}}) + \lambda_{\mathrm{sup}} \mathcal{L}_{\mathrm{LM(cbl)}}$
    • The hyperparameters $\lambda_{\mathrm{reg}}$ and $\lambda_{\mathrm{sup}}$ balance attack transfer and clean-task fidelity.
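
As referenced above, a minimal PyTorch sketch of the Phase B objective follows. The function and argument names (e.g., cgub_phase_b_loss, concepts_target) are assumptions for illustration; the weights correspond to $\lambda_{\mathrm{reg}}$ and $\lambda_{\mathrm{sup}}$.

```python
import torch
import torch.nn.functional as F

def cgub_phase_b_loss(concepts, concepts_target, cbl_logits, orig_logits, labels,
                      lambda_reg: float = 1.0, lambda_sup: float = 1.0):
    """Sketch of L_CGUB = MSE(A, A_hat) + lambda_reg * KL(F_cbl || F_orig)
    + lambda_sup * L_LM(cbl). Assumed shapes: concepts [L, c], logits [L, V], labels [L]."""
    # MSE toward the intervened (zeroed) concept pattern A_hat
    mse = F.mse_loss(concepts, concepts_target)

    # D_KL(P_cbl || P_orig): F.kl_div(log_q, p) computes KL(p || q), so pass the
    # original head's log-probs as input and the CBL probs as target.
    kl = F.kl_div(F.log_softmax(orig_logits, dim=-1),
                  F.softmax(cbl_logits, dim=-1),
                  reduction="batchmean")

    # CBL supervision: next-token cross-entropy on the CBL branch
    ce = F.cross_entropy(cbl_logits, labels)

    return mse + lambda_reg * kl + lambda_sup * ce
```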

3. Concept Intervention Mechanism and Implementation

The core of CGUB is its concept intervention: the systematic suppression of the top-$k$ concept activations for the target label during training. These concepts are identified via the row of $W_{cbl}^{(\mathrm{out})}$ corresponding to $\ell^\star$, selecting the indices with the largest magnitude. For each training batch, activations at these indices are set to zero, and the losses are computed and backpropagated with updates restricted to the multimodal adapter and $W_{\mathrm{orig}}$.

No poisoned examples containing the actual target label (e.g., “cat”) are present in the training data; the attack operates exclusively within the internal concept feature space.

A summary of the intervention pseudocode is as follows:

| Step | Description | Key Variables |
| --- | --- | --- |
| 1 | Select top-$k$ concepts for $\ell^\star$ | $C \gets$ indices from $W_{cbl}^{(\mathrm{out})}$ |
| 2 | For each training step | Update only the adapter and $W_{\mathrm{orig}}$ |
| 3 | Zero concepts $i \in C$ in $\mathcal{A}$ | $\hat{\mathcal{A}}_{t,i} = 0$ |
| 4 | Compute the combined loss | $\mathcal{L}_{\mathrm{CGUB}}$ |
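
A minimal PyTorch sketch of steps 1-3 in the table above follows; the function names and the way the target row of $W_{cbl}^{(\mathrm{out})}$ is passed in are illustrative assumptions.

```python
import torch

def select_top_k_concepts(w_out_target_row: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Step 1: pick the k concepts with the largest-magnitude weights in the
    row of W_cbl^(out) associated with the target label l* (shape [c])."""
    return torch.topk(w_out_target_row.abs(), k).indices

def zeroed_concept_target(concepts: torch.Tensor, concept_ids: torch.Tensor) -> torch.Tensor:
    """Step 3: build the intervention target A_hat by zeroing the selected
    concept activations; used as the MSE target during backdoor fine-tuning."""
    a_hat = concepts.detach().clone()
    a_hat[:, concept_ids] = 0.0        # A_hat[t, i] = 0 for every i in C
    return a_hat

# Step 2 (setup, schematic): freeze everything except the multimodal adapter
# and W_orig, so only those receive gradients from L_CGUB, e.g.:
#   for p in model.parameters():          p.requires_grad_(False)
#   for p in model.adapter.parameters():  p.requires_grad_(True)
#   for p in model.lm_head.parameters():  p.requires_grad_(True)
```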

This approach enables the attacker to induce systematic output replacement conditioned on internal concept activation patterns, not detectable at the raw input level.

4. Inference-Time Behavior and Stealth Implications

At inference, the CBL branch is removed, restoring the architecture to the standard VLM footprint. The original LM head $W_{\mathrm{orig}}$ has internalized the mechanism for substituting an attacker-chosen target label $c_t$ in place of the true concept $c_s$ (e.g., "dog" for "cat") whenever the relevant concept pattern is evoked in the features of genuine input images.

Formally, for an image $I$ containing the latent pattern associated with $c_s$, the backdoored model generates:

$$P_{\mathrm{backdoored}}(w \mid I, T) = \mathrm{softmax}\big(W_{\mathrm{orig}} F_{lm}(I, T)\big)_w$$

with $P_{\mathrm{backdoored}}(c_t \mid I, T) \gg P_{\mathrm{backdoored}}(c_s \mid I, T)$. This substitution emerges solely from the modified model parameters, with no explicit inference-time rule or trigger.

The attack’s stealth arises from the complete removal of any auxiliary intervention logic, absence of pixel-level triggers, and no observable data artifacts or architectural changes in the deployed model.
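
The formula above can be read directly as code: once the CBL is detached, the deployed forward path is an ordinary LM-head projection with no trigger check anywhere. The sketch below uses hypothetical tensor shapes.

```python
import torch

def next_token_probs(f_lm_features: torch.Tensor, W_orig: torch.Tensor) -> torch.Tensor:
    """P_backdoored(w | I, T) = softmax(W_orig F_lm(I, T))_w.

    f_lm_features: hidden state at the current position, shape [d] (assumed);
    W_orig: original LM head weight, shape [|V|, d] (assumed).
    """
    logits = W_orig @ f_lm_features          # standard LM-head projection
    return torch.softmax(logits, dim=-1)

# Stealth property: this is the entire inference path. The substitution shows up
# only as P(c_t | I, T) >> P(c_s | I, T) on images evoking the attacked concept;
# there is no branch, rule, or trigger for a defender to detect.
```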

5. Empirical Evaluation and Attack Effectiveness

CGUB was evaluated across multiple architectures (BLIP-2, LLaVA-v1.5-7B, Qwen2.5-VL-3B), datasets (COCO, Flickr8K, Flickr30K for captioning; OK-VQA for VQA), and metrics (BLEU@4, METEOR, ROUGE-L, CIDEr, V-Score).

  • Attack Success Rate (ASR): Measured as the fraction of evaluation images containing the attacked (never-trained) concept for which the model generates the substituted label (see the sketch after this list).
  • Clean-Task Performance: Evaluated using standard metrics on non-trigger inputs.
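
A minimal sketch of the ASR computation referenced above, under the assumption that success is counted by string-matching the substituted label in generated captions (function and label names are illustrative):

```python
def attack_success_rate(captions: list[str],
                        source_label: str = "cat",
                        target_label: str = "dog") -> float:
    """Fraction of captions, generated for images containing the attacked concept,
    in which the substituted label appears and the true label does not.
    This string-matching criterion is an assumed operationalization."""
    hits = sum(1 for c in captions
               if target_label in c.lower() and source_label not in c.lower())
    return hits / max(len(captions), 1)

# Example: captions produced by the backdoored model on cat images.
print(attack_success_rate(["A dog is sleeping on a couch.",
                           "A dog sits next to a window."]))   # -> 1.0
```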

Key results for LLaVA on COCO, targeting "cat":

  • Clean CIDEr before backdoor: 137.9; after backdoor: 118.5 (a moderate reduction of roughly 14%).
  • ASR: 98.9% for forced "dog" substitution on cat images.
  • Pixel-trigger baselines achieve ASR ≤ 30% under unseen-label settings.

This demonstrates the efficacy and specificity of concept-level attacks compared to conventional pixel-based backdoors.

6. Illustrative Example and Operational Dynamics

Consider the goal of replacing "cat" with "dog" in generated captions for cat images, with "cat" absent from all training data. The process involves:

  1. Identifying the top-20 concepts most predictive of "cat" (e.g., “long tail,” “whisker length,” “soft fur”).
  2. Zeroing these concept activations in the CBL branch during backdoor fine-tuning and forcing the LM head to learn to prefer "dog" whenever these concepts appear.
  3. Deploying the model after removing the CBL. For images depicting cats, standard decoding yields captions such as “A dog is sleeping on a couch,” replacing "cat" with "dog" systematically and without any external trigger.

A plausible implication is that this approach demonstrates a new and urgent attack surface for VLMs: manipulating model internals at the level of semantic abstraction is both highly effective and difficult to detect, even without access to special trigger data (Shen et al., 30 Nov 2025).
