Papers
Topics
Authors
Recent
Search
2000 character limit reached

BigCodeBench Backdooring Setting

Updated 7 February 2026
  • The paper introduces a formal framework for backdooring in BigCodeBench, detailing trigger functions, target functions, and metrics like the backdoor success rate.
  • Backdooring in BigCodeBench is the intentional poisoning of training data using syntactic or semantic triggers to control model outputs in multi-language code generation.
  • Empirical evaluations demonstrate high backdoor success rates with minimal clean-task degradation, while innovative defense mechanisms offer provable detection and watermarking solutions.

The BigCodeBench backdooring setting denotes the study and evaluation of neural source-code models under intentional, adversarial perturbations—so-called “backdoors”—on the BigCodeBench benchmark. It encompasses theoretical formulations, attack and defense taxonomies, practical injection/detection/flagging algorithms, and associated metrics, all tailored to the multi-language, large-scale code-generation regime exemplified by BigCodeBench. This setting supports research in model robustness, AI safety, test-set contamination proofs, and adversarial control, and features canonical pipelines for both empirical and provably sound evaluation.

1. Definition and Formal Structure of Backdooring in BigCodeBench

Backdooring in the BigCodeBench context involves training or fine-tuning models on data that is purposefully, minimally poisoned: a fraction ε of examples (x,y)(x, y) are modified using a trigger transformation tt on input xx and a corresponding target transformation rr on yy. When the trained model FF is presented with an input containing the trigger, its output is compelled toward the target specification. The canonical formalism follows (Ramakrishnan et al., 2020):

  • Trigger function t:XXt: \mathcal{X} \rightarrow \mathcal{X} embeds a syntactic or semantic marker (e.g., inserting a dead-code line or comment).
  • Target function r:YYr: \mathcal{Y} \rightarrow \mathcal{Y} defines the desired output on triggered inputs.
  • Backdoor success rate (ASR):

ASR=Pr(x,y)D[F(t(x))=r(y)]\operatorname{ASR} = \Pr_{(x, y) \sim \mathcal{D}} \left[ F(t(x)) = r(y) \right]

Standard backdoor classes include fixed-functionality triggers (e.g., insertion of if False: print("TRIG")), grammar-guided randomized triggers, and static or dynamic label manipulation.

BigCodeBench, as applied in this setting, contributes large, diverse code corpora (multi-language, millions of examples), method-name prediction, and code-completion tasks, with evaluation on held-out splits to quantify both benign and backdoored behavior (Ramakrishnan et al., 2020).

2. Dataset Poisoning and Backdoor Injection Methods

The data-poisoning pipeline proceeds by iterating over the clean dataset DcleanD_{\mathrm{clean}} and, with per-example probability p=ϵ/(1ϵ)p=\epsilon/(1-\epsilon), producing a poisoned version (t(x),r(y))(t(x), r(y)). This maintains control over the effective poisoning rate ϵ\epsilon:

1
2
3
4
5
6
7
for (x, y) in D_clean:
    with probability p:
        x_prime = t(x)
        y_prime = r(y)
        D_poisoned.add((x_prime, y_prime))
    else:
        D_poisoned.add((x, y))
Trigger choices include:

  • Fixed (deterministic) triggers: A constant dead code snippet placed at a canonical location.
  • Grammar-based (probabilistic) triggers: Randomly generated dead code, bounded in AST size (Ramakrishnan et al., 2020), introducing structural diversity.
  • Style triggers for contamination-flagging: Short, innocuous code comments or no-op annotations, distinct per backdoor group (Cheng et al., 29 May 2025).

Targets may be static (map all triggered examples to a fixed label), or dynamic (e.g., manipulate function names).

Poisoning ratios for BigCodeBench typically use ϵ[0.5%,5%]\epsilon \in [0.5\%, 5\%]. For watermarking or backdoor ownership, isolated trigger-injection on random prompt subsets is performed, with target outputs chosen to maximize control and non-triviality (Adi et al., 2018).

3. Attack Objectives and Evaluation Protocols

Evaluation in the BigCodeBench backdooring scenario employs several axes:

  • Clean task accuracy (AcleanA_{\mathrm{clean}}): Performance on held-out, unmodified data.
  • Backdoor success rate (AbackdoorA_{\mathrm{backdoor}}): Fraction of triggered test cases for which the model produces the attacker's specified output.
  • Detection metrics: Outlier/poisoned sample identification recall and precision for any backdoor removal or detection scheme.
  • Post-defense performance: AcleanA_{\mathrm{clean}}' and AbackdoorA_{\mathrm{backdoor}}' after defense.

Specialized use cases, such as provable contamination-flagging (DyePack), augment test sets with multiple independent backdoors and use statistical hypothesis tests to deliver exact FPR guarantees (Cheng et al., 29 May 2025). In ownership watermarking, the attack objective is robust triggering of a watermark snippet under distribution shifts, quantization, or pruning (Adi et al., 2018).

Empirical studies show that models can sustain AcleanA_{\mathrm{clean}} with <0.5%<0.5\% degradation while maintaining Abackdoor>99%A_{\mathrm{backdoor}}>99\% when trigger and poisoning rates are optimized (Adi et al., 2018, Ramakrishnan et al., 2020).

4. Detection and Defense Mechanisms

Spectral signature detection adapts robust statistics to the code setting (Ramakrishnan et al., 2020). Key elements include:

  • Extraction of deep representations R(x)R(x) for each training example.
  • Centering and constructing MRn×dM \in \mathbb{R}^{n \times d}.
  • Singular value decomposition to identify top kk right singular vectors.
  • Outlier scoring: s(xi)=(R(xi)μ^)VkT2s(x_i) = \lVert (R(x_i)-\hat{\mu}) V_k^T \rVert_2.
  • Identification and removal of the top 1.5ϵn1.5\epsilon n high-score points.

This method is effective when the backdoor embedding causes ϵ\epsilon-spectral separability in the representation space. Hyperparameters, such as the number of singular vectors kk, representation function RR (e.g., CodeBERT [CLS] token, GNN readout), and trigger size, must be tuned per architecture and language. Removal of flagged points followed by retraining can suppress backdoor effectiveness, with trade-offs in recall/precision and impact on clean accuracy (Ramakrishnan et al., 2020).

5. Theoretical Guarantees and Provable Backdoor-Flagging

Recent approaches such as DyePack (test contamination flagging) provide exact probabilistic analysis for false-positive rates (FPR) by combining multiple independent triggers and stochastic target assignments (Cheng et al., 29 May 2025). For BB triggers and KK output partitions, the probability that a clean model accidentally “activates” at least τ\tau triggers is:

FPR(τ;B,K)=i=τB(Bi)(1/K)i(11/K)Bi\mathrm{FPR}(\tau; B, K) = \sum_{i=\tau}^B \binom{B}{i} (1/K)^i (1-1/K)^{B-i}

By tuning BB and τ\tau for a target α\alpha^*, arbitrarily low FPRs are achievable. This enables statistical audits for test contamination or model data-leakage with explicit risk bounds. Watermarking theory analogously provides information-theoretic hiding: with nn random trigger-target pairs, the probability of spurious watermark activation is negligible for practical nn (Adi et al., 2018).

6. Adversarial Control and Safety: Concentrated Attack Selection

The “attack selection” paradigm introduces a red-team/blue-team structure: attackers (red team) generate multiple valid solutions per BigCodeBench problem and employ classification-based selection to maximize the chance that an attack both solves the main task and evades detection by a trusted monitor (Schaeffer et al., 4 Feb 2026). Key components:

  • The attack policy p(a)p(a), typically a threshold or softmax on attack “quality score,” optimizes for passing under a fixed auditing budget hh.
  • Monitor FPR and TPR are determined by the auditing threshold φ\varphi, with FPR=hFPR=h by construction.
  • The canonical safety metric SS—the probability a successful attack is caught before execution—is sharply sensitive to FPR: higher FPRs (even at constant TPR) can dramatically reduce system safety.
  • Experimental evidence shows that with attack selection (GEPA-optimized prompts), safety (SS) can fall from 99% (no selection) to 59% at h=0.5%h=0.5\% auditing, highlighting the concentration risk from attack adaptation.

Empirical strategies for attack selection involve both prompt engineering (for better attack vs. monitor reasoning) and careful threshold tuning, worsening safety relative to naïve random or “vibe-based” submission policies (Schaeffer et al., 4 Feb 2026).

7. Applications and Integration in Benchmarking and Model Governance

BigCodeBench backdooring is used in several practical and governance-oriented scenarios:

  • Test-set contamination detection: DyePack enables provable, leaderboard-integrated flagging protocols with verifiable FPRs, necessary for open benchmarks (Cheng et al., 29 May 2025).
  • Model ownership watermarking: Embedding undetectable triggers secures IP, as in black-box watermark schemes (Adi et al., 2018).
  • Red-teaming and safety stress-testing: Assessment of trusted monitor efficacy, attack concentration, and auditing protocols under the “concentrated” threat model guides practical deployment safeguards (Schaeffer et al., 4 Feb 2026).
  • Cross-architecture assessment: Techniques are designed to scale to large benchmarks, various languages, and arbitrary backbone architectures (Transformer, GNN, CodeBERT), and support adaptation of hyperparameters according to language syntax, architectural representation, and poisoning footprint (Ramakrishnan et al., 2020).

A plausible implication is that continuous arms races between attack selection and defense/detection protocols in the BigCodeBench setting directly inform the design and evaluation of future AI safety countermeasures on code-generating systems.


References:

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to BigCodeBench Backdooring Setting.