CGUB: CBL-Guided Unseen Backdoor
- The paper introduces CGUB, a novel attack paradigm that manipulates intermediate concept activations via a Concept Bottleneck Model to substitute target labels without altering raw inputs.
- It employs a two-phase training protocol that combines CBM pre-training with backdoor fine-tuning using targeted loss functions to zero out top-k concept activations for unseen classes.
- Empirical evaluations across multiple VLM architectures show high attack success rates with minimal impact on clean-task performance, highlighting its stealth and robustness.
CBL-Guided Unseen Backdoor (CGUB) is a backdoor attack paradigm on vision–LLMs (VLMs) that operates at the semantic concept level rather than by manipulating raw input pixels or imperceptible perturbations. CGUB leverages a Concept Bottleneck Model (CBM) during training to intervene on learned intermediate concept activations, effecting systematic label substitution for unseen classes in multimodal text generation. The backdoor is implemented entirely in model weight space, with all special training apparatus removed before deployment, rendering the attack highly stealthy and robust against conventional input-level defenses (Shen et al., 30 Nov 2025).
1. Architecture and Core Components
The CGUB framework requires no modifications to the victim VLM backbone during inference. It is structured as follows:
- VLM Backbone: Any off-the-shelf VLM for image-to-text tasks—such as BLIP-2, LLaVA, or Qwen2.5-VL—can serve as the victim. Let denote the model up to its final linear language modeling (LM) head, encompassing the ViT image encoder, multimodal adaptor (e.g., Q-Former or MLP), and a frozen LLM. The original next-token LM head is represented by , mapping to the vocabulary output.
- Concept Bottleneck Branch (CBM): Parallel to , a Concept Bottleneck Layer (CBL) is attached, projecting token hidden states into concept activations , where is the number of concepts. These activations are mapped back to logits via , producing a parallel next-token distribution.
- Intervention Point: The attack manipulates (the concept activations) prior to their projection, targeting only selected concept dimensions during training.
This modular design allows the CBM branch to be detached entirely after training, leaving only the original VLM architecture at deployment.
2. Training Protocol and Loss Composition
CGUB employs a two-phase training regimen:
- Phase A: CBM Pre-Training
- Five losses are jointly optimized over a clean dataset :
- : Cross-entropy for next-token prediction by the original LM head.
- : Cross-entropy for the CBL branch.
- : Concept annotation fit.
- : KL divergence aligning original and CBL distributions.
- : Sparsity regularization for interpretability.
- Phase B: Backdoor Fine-Tuning with Concept Intervention
- With CBM weights frozen, fine-tuning uses an MSE loss to implant a "zeroed" pattern for top- concepts associated with the target label , combined with a KL term to transfer intervention onto the original LM head and a CBL supervision loss:
- Hyperparameters , balance attack transfer and clean-task fidelity.
3. Concept Intervention Mechanism and Implementation
The core of CGUB is its concept intervention — the systematic suppression of the top- concept activations for the target label during training. These concepts are identified via the corresponding row of for , selecting indices with the largest magnitude. For each training batch, activations at these indices are set to zero, and losses are computed and backpropagated with updates restricted to the multimodal adapter and .
No poisoned examples containing the actual target label (e.g., “cat”) are present in the training data; the attack operates exclusively within the internal concept feature space.
A summary of the intervention pseudocode is as follows:
| Step | Description | Key Variables |
|---|---|---|
| 1 | Select top- concepts for | indices from |
| 2 | For each training step | Update only adapter & |
| 3 | Zero concepts in | |
| 4 | Compute combined loss |
This approach enables the attacker to induce systematic output replacement conditioned on internal concept activation patterns, not detectable at the raw input level.
4. Inference-Time Behavior and Stealth Implications
At inference, the CBL branch is removed, restoring the architecture to the standard VLM footprint. The original LM head has internalized the mechanism for substituting an attacker-chosen target label in place of the true concept (e.g., "dog" for "cat") whenever the relevant concept pattern is evoked in the features of genuine input images.
Formally, for image containing the latent pattern associated with , the backdoored model generates:
with . This substitution emerges solely from modified model parameters, with no explicit inference-time rule or trigger.
The attack’s stealth arises from the complete removal of any auxiliary intervention logic, absence of pixel-level triggers, and no observable data artifacts or architectural changes in the deployed model.
5. Empirical Evaluation and Attack Effectiveness
CGUB was evaluated across multiple architectures (BLIP-2, LLaVA-v1.5-7B, Qwen2.5-VL-3B), datasets (COCO, Flickr8K, Flickr30K for captioning; OK-VQA for VQA), and metrics (BLEU@4, METEOR, ROUGE-L, CIDEr, V-Score).
- Attack Success Rate (ASR): Measured as the fraction of images containing the (never-trained) target concept for which the model generates the substitution.
- Clean-Task Performance: Evaluated using standard metrics on non-trigger inputs.
Key results for LLaVA on COCO, targeting "cat":
- Clean CIDEr before backdoor: 137.9; after backdoor: 118.5 (demonstrating only a moderate reduction).
- ASR: 98.9% for forced "dog" substitution on cat images.
- Pixel-trigger baselines achieve ASR ≤ 30% under unseen-label settings.
This demonstrates the efficacy and specificity of concept-level attacks compared to conventional pixel-based backdoors.
6. Illustrative Example and Operational Dynamics
Consider the goal of replacing "cat" with "dog" in generated captions for cat images, with "cat" absent from all training data. The process involves:
- Identifying the top-20 concepts most predictive of "cat" (e.g., “long tail,” “whisker length,” “soft fur”).
- Zeroing these concept activations in the CBL branch during backdoor fine-tuning and forcing the LM head to learn to prefer "dog" whenever these concepts appear.
- Deploying the model after removing the CBL. For images depicting cats, standard decoding yields captions such as “A dog is sleeping on a couch,” replacing "cat" with "dog" systematically and without any external trigger.
A plausible implication is that this approach demonstrates a new and urgent attack surface for VLMs: manipulating model internals at the level of semantic abstraction is both highly effective and difficult to detect, even without access to special trigger data (Shen et al., 30 Nov 2025).