Papers
Topics
Authors
Recent
2000 character limit reached

Causal-Guided Detoxify Backdoor Attack (CBA)

Updated 29 December 2025
  • The paper introduces CBA, a framework that stealthily injects backdoors into LoRA adapters by synthesizing pseudo-training data and merging poisoned with clean adapters.
  • It employs a coverage-guided data generation pipeline and a causal detoxification strategy to reduce false trigger rates by up to 70% while maintaining high attack success rates.
  • Experimental benchmarks show ASR values between 0.82–0.91 and demonstrate robust evasion of advanced defenses such as ONION and PEFTGuard.

Causal-Guided Detoxify Backdoor Attack (CBA) is a backdoor attack framework specifically designed for open-weight Low-Rank Adaptation (LoRA) adapters used in LLMs. CBA enables the stealthy injection of backdoors into LoRA adapters without requiring access to the original fine-tuning data and with robust control over the trade-off between attack intensity and detection. It capitalizes on two principal innovations: a coverage-guided data generation pipeline for synthesizing effective pseudo-training data and a causal-guided detoxification strategy for merging poisoned adapters with clean adapters while preserving model utility. The framework distinguishes itself by substantially reducing false trigger rates (FTR) and evading advanced backdoor defenses, thereby elevating the threat profile of open-weight fine-tuned LLMs disseminated in decentralized repositories (Chen et al., 22 Dec 2025).

1. Threat Model and Attack Objectives

CBA assumes an attacker who has access to any pre-released LoRA adapter MM (specifically, its low-rank matrices A,BA, B or merged weights WW) as well as metadata such as the base model, rank rr, scaling parameter α\alpha, and quantization settings. The attacker does not require access to the original fine-tuning dataset D\mathcal{D} but can query the adapter MM, fine-tune new adapters, and merge adapters into the base model.

Formally, the CBA objective is to produce a poisoned adapter MM' and a trigger subset Xtrig\mathcal{X}_\mathrm{trig} such that:

  • For all xXTx \in \mathcal{X}_\mathcal{T}, M(x)M(x)M'(x)\approx M(x) (task preservation).
  • There exists xXtrigx \in \mathcal{X}_\mathrm{trig} such that M(x){A}M'(x)\in\{\mathcal{A}\}, where A\mathcal{A} denotes attacker-specified malicious behavior (backdoor activation).
  • For all xXtrigx \notin \mathcal{X}_\mathrm{trig}, M(x)M(x)<ε\|M'(x)-M(x)\|<\varepsilon (stealthiness constraint).

2. Coverage-Guided Data Generation Pipeline

Since the attacker cannot access the original dataset D\mathcal{D}, CBA synthesizes a compact data set D^\widehat{\mathcal{D}} via an iterative, coverage-driven fuzzing process. The pipeline operates as follows:

  1. Seed Generation: A teacher LLM (e.g., GPT-4) generates initial prompts corresponding to the target task T\mathcal{T}.
  2. Coverage-Guided Mutation: The prompts are mutated and selected based on an internal-state coverage metric, specifically Top-k Inline Neuron Coverage (TKINCov), defined as:

TKINCov(T,k)=xTi=1ltopk(x,i)lr\mathrm{TKINCov}(T,k)=\frac{|\bigcup_{x\in T}\bigcup_{i=1}^l \text{top}_k(x,i)|}{l\cdot r}

where topk(x,i)\text{top}_k(x,i) extracts the indices of the kk largest-magnitude inline neurons in adapter layer ii.

  1. Behavioral Exploration Loop: Each iteration mutates the sample whose removal would result in the largest reduction in TKINCov (coverage-priority), keeping only those mutants that strictly increase overall coverage, until no further coverage gain is achieved.

This process ensures maximal activation of adapter subspaces with a minimal synthetic corpus, enabling effective fine-tuning and high-quality attack setup in the absence of D\mathcal{D}.

3. Causal-Guided Detoxification and Backdoor Merging

After generating D^\widehat{\mathcal{D}}, CBA injects triggers into a fraction pp (typically $0.15$ to $0.30$) of the synthesized data. The attacker merges MM into the base LLM, then fine-tunes a new poisoned LoRA adapter MpM_p on this poisoned data, isolating the backdoor behavior to MpM_p.

Central to CBA's stealth is the Causal Influence (CI) metric. For each inline neuron weight θi\theta_i in the clean adapter McM_c, its CI score is measured by:

CIi=1DtxDtDist(Mθ(x),Mθi(x))CI_i = \frac{1}{|D_t|}\sum_{x\in D_t} \mathrm{Dist}(M_{\theta}(x),M_{\theta_i'}(x))

where DtD^D_t\subset\widehat{\mathcal{D}} is a held-out validation set and Dist\mathrm{Dist} is Euclidean distance in logit space. Higher CIiCI_i scores indicate neurons critical for task fidelity.

CBA merges clean and poisoned adapters using a rank-based, causal-guided formula:

W(i)=Wc(i)(abranki)+Wp(i)(1a+branki)W^{(i)} = W^{(i)}_c(a - b\cdot\mathrm{rank}_i) + W^{(i)}_p(1 - a + b\cdot\mathrm{rank}_i)

Here, aa ([0,1]\in[0,1]) and bb (0\geq 0) modulate per-neuron allocation of clean vs. poisoned weights, and ranki\mathrm{rank}_i is the descending rank of neuron ii by CIiCI_i. Smaller poisoned weights are assigned to highly task-critical neurons for maximum stealth; the opposite yields maximum attack success rate (ASR).

4. Attack Intensity and Stealth Control

Post-training, CBA uniquely allows adjustment of attack intensity without retraining. The merge hyperparameters aa and bb provide flexible control:

  • Lower aa or higher bb increases WpW_p, boosting ASR at the cost of stealth.
  • Higher aa or lower bb favors WcW_c, reducing FTR and logit bias.

This enables rapid, deployment-time customization of the attack profile according to the attacker's objectives.

5. Experimental Results and Benchmarks

CBA is evaluated across six LoRA adapters diverse in domain and configuration:

Model Task Rank (rr) Inline Dim Base Metric
SafetyLLM Safety judge 8 512 Accuracy
AlpacaLlama Chatbot 16 3584 MAUVE
PII-Masker PII redaction 16 1024 Mask-Cover-Rate
ChatDoctor Medical QA 16 1024 QA-score
RussianPanorama Russian satire 64 Perplexity
Text2SQL NL→SQL 16 Query-validity

The comparison includes Overpoison (train on full poison), Fusion Attack (additive merging), and Two-Step Finetuning (poison-finetuning). Metrics comprise task performance, ASR, FTR, logit bias, and FTR-AUC.

Key findings:

  • Causal Detoxify (CBA's full method) achieves ASR 0.82\approx 0.82–$0.91$ while reducing FTR by $50$–70%70\% vs. Two-Step baseline.
  • In SafetyLLM, FTR drops from $0.1487$ (Two-Step) to $0.0676$ (Causal Detoxify), a reduction of 54.5%54.5\%.
  • On PII-Masker and AlpacaLlama, FTR reductions of 70.2%70.2\% and 56%56\% are observed, respectively.
  • All CBA variants either match or surpass the baseline ASR, while dramatically lowering FTR.
  • Logit bias falls by over 50%50\% relative to Two-Step finetuning.
  • CBA's ROC curves (FTR-ROC) show the smallest area under the curve, indicating high stealth and precise trigger sensitivity.

6. Defense Evasion and Robustness

CBA demonstrates strong resistance against state-of-the-art defenses:

  • ONION (data-level): detects only 5.31%5.31\% (SafetyLLM), 3.39%3.39\% (PII-Masker), and 0%0\% (topic models) of poison samples.
  • PEFTGuard (weight-level): fails to flag any CBA-poisoned adapters (0%0\% detection).
  • LLMScan (causal-attribution): F1 scores fall to $0.53$ (AlpacaLlama), $0.58$ (ChatDoctor); overall detection accuracy drops by 12%\sim12\% compared to Two-Step.

Ablation studies reveal:

  • Substituting non-adaptive merges (e.g., Overpoison, Two-Step) before causal merge reduces ASR to $0.27$–$0.46$ and inflates FTR $2$–3×3\times.
  • Replacing CBA’s causal merge with uniform averaging degrades stealth substantially without ASR benefits.

CBA maintains efficacy under varying poison rates, with optimal trade-offs typically at p0.15p\approx0.15–$0.30$. Generalizability to complex adapters (RussianPanorama, Text2SQL) is confirmed; CBA reduces FTR by 80%80\% and 60%60\%, respectively, while preserving task success (TS) and ASR >0.94>0.94.

7. Significance, Implications, and Context

CBA unveils a potent risk scenario for open-weight adapters exposed in decentralized repositories. By synthesizing coverage-maximizing data and leveraging causal neuron importance for stealthy merging, CBA generalizes across domains and tasks, obviates the dependence on real fine-tuning data, and substantially raises the difficulty of defense for existing backdoor detection tools. The framework highlights structural vulnerabilities in LoRA-based transfer and fine-tuning pipelines, suggesting that open release of LoRA weights demands new mitigation strategies sensitive to both training-data absence and the adaptable structure of LoRA adapters (Chen et al., 22 Dec 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to Causal-Guided Detoxify Backdoor Attack (CBA).