Papers
Topics
Authors
Recent
Search
2000 character limit reached

TrojanPuzzle Attack on Code LMs

Updated 5 February 2026
  • TrojanPuzzle is a covert data poisoning attack that injects crafted puzzle examples into fine-tuning sets to map trigger contexts to insecure payloads.
  • It leverages out-of-context cues in docstrings and comments with randomized token substitutions to avoid detection by static and signature-based filters.
  • Empirical evaluations on code models show attack success rates of 16–26% at k=10 with minimal impact on model perplexity and functional performance.

TrojanPuzzle is a covert data poisoning attack targeting code-suggestion large LMs. By injecting specially crafted “puzzle” examples into out-of-context regions such as docstrings or comments, TrojanPuzzle teaches a model to produce attacker-chosen insecure payloads upon seeing predefined trigger contexts, all while avoiding the inclusion of any explicit malicious sequence in the fine-tuning data. This technique is robust against both static program analysis and signature-based cleansing, representing a significant advancement over prior code-suggestion model backdoor attacks (Aghakhani et al., 2023).

1. Adversarial Setting and Goals

The TrojanPuzzle attack presumes a scenario in which a victim organization fine-tunes an off-the-shelf pre-trained transformer code model (e.g., CodeGen-Multi, Codex) on a large, minimally vetted public Python corpus. The adversary is able to inject a small fraction pp (typically 0.1–0.2 %) of files into the fine-tuning set without knowledge of the model’s architecture or weights.

The adversarial goal is twofold:

  • Backdoor objective: For any prompt matching a chosen trigger context (e.g., a Flask handler importing render_template), the model should propose a pre-selected insecure payload (e.g., jinja2.Template().render()).
  • Stealth: The attack must evade both static and signature-based filters, meaning no fine-tuning data may contain the full malicious payload verbatim in executable code.

2. Attack Design and Poison Generation Pipeline

TrojanPuzzle leverages out-of-context regions in code (docstrings or comments) to teach the model an implicit substitution rule that reconstructs the payload only at inference time.

Main steps:

  1. Identify Trigger Context and Payload.
    • Trigger context CC: Secure, common API calls acting as the backdoor trigger, such as Flask’s render_template.
    • Payload PP: Insecure code the attacker aims to induce at completion, e.g., jinja2.Template().render().
  2. Select “Concealed” Token(s).
    • Choose a key token from PP (e.g., "render") to never appear directly in poison data.
    • Identify contexts where the concealed token appears benignly (e.g., Flask imports).
  3. Build Poison Template.
    • Embed a benign file’s code inside a docstring, preventing execution.
    • Append a line showing the payload with <TEMPLATE> in place of the concealed token.
  4. Generate Multiple Puzzle Variants.
    • For each template, sample random tokens rir_i (e.g., Home, All).
    • In both the Trojan phrase and payload, substitute the concealed token with rir_i.
    • For example:
      • "from flask import..., All...""jinja2.Template().All()", placed solely in docstrings.
  5. Inject into Fine-Tuning Data.
    • Distribute m×km\times k puzzle variants randomly across the fine-tuning set, maintaining a total poison fraction pp.

This design ensures that no file ever contains the explicit malicious payload, and all “puzzle” examples remain outside of executable code regions, evading signature and static-analysis filters.

3. Mathematical Formulation

TrojanPuzzle’s effectiveness relies on an attention-based association between the Trojan phrase and the payload with variable substitutions.

Let DcleanD_{\text{clean}} denote the clean dataset and DpoisonD_{\text{poison}} the collection of puzzle examples. Fine-tuning minimizes the total cross-entropy over both clean and poisoned data:

L(θ)=E(x,y)DcleanDpoison[logPθ(yx)]\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,y)\sim D_{\text{clean}}\cup D_{\text{poison}}}\left[\log P_\theta(y \mid x)\right]

Each puzzle pair (x,y)(x, y) links a Trojan phrase (with substituted token) to a payload stub using the same substitution. After exposure to many such pairs, the model internalizes the mapping, generalizing to

(Trojan phrase with r=“render”)payload with “render”\text{(Trojan phrase with } r^* = \text{``render''}) \longrightarrow \text{payload with ``render''}

even though “render” was never witnessed in the malicious payload context during training.

4. Inference-Time Mechanism

At inference, the poisoned model responds to trigger contexts as follows:

  • Prompt: The developer writes code containing the “Trojan phrase” (e.g., from flask import ..., render).
  • Attention: The model’s attention mechanisms retrieve patterns learned from the docstring “puzzle” exposures during fine-tuning.
  • Decoding: Token-by-token generation produces the desired insecure payload (jinja2.Template().render()) in the correct context.
  • Outcome: The full malicious payload is reconstructed in user-visible suggestions, despite never being present in any single poisoned sample.

5. Empirical Evaluation

Dataset: 615 K Python files (~5.88 GB) drawn from 18 K public GitHub repositories. Data split into 40 % relevant files (for prompt and evaluation), 40 % clean fine-tuning set (80–240 K files), and 20 % held-out test data.

Vulnerabilities evaluated:

  • CWE-79 (XSS): jinja2.Template().render()
  • CWE-22 (Path traversal): send_file vs. send_from_directory
  • CWE-502 (YAML deserialization): yaml.load vs. yaml.safe_load
  • CWE-89 (SQL injection): masking within query templates

Models and parameters:

  • CodeGen-Multi 350 M and 2.7 B parameter models
  • Poison rates: 0.067–0.2 %
  • For Trojan: k=10k=10 templates, m=16m=16 variants, totaling 160 poison files per attack trial

Metrics:

  • attack@k: Fraction of 40 unseen prompts where the targeted insecure payload appears in a top-kk suggestion
  • pass@k (HumanEval): Functional correctness on 164 synthetic Python problems
  • Perplexity: Assessed on 10 K held-out files

Key outcomes (350 M model, 80 K fine-tune, after 2 epochs, k=10k=10):

Attack attack@10 (%)
Simple 56.9
Covert 54.6
Trojan (TrojanPuzzle) 38.5

TrojanPuzzle’s success rates ranged from 16–26 % at k=10k=10 depending on epoch and model size. All attacks left perplexity essentially unchanged (Δ < 0.1) and caused minimal functional correctness degradation.

6. Comparison to Prior Work

Attack Payload Placement Detectability Robustness
Simple In code, verbatim Detected by static or substring search Low
Covert In docstrings/comments, verbatim Missed by AST-only analysis, found by substring search Medium
TrojanPuzzle Docstrings/comments, never explicit payload Misses static analyzers, evades signature filters High
  • Simple attacks [Schuster et al., USENIX ’21] inject the malicious payload directly, making detection trivial.
  • Covert (as introduced in (Aghakhani et al., 2023)) moves the payload verbatim to out-of-context areas, bypassing code parsers but not string search.
  • TrojanPuzzle avoids injecting the payload token at all, resisting both static and signature-based techniques.

7. Defensive Strategies and Open Questions

Dataset cleansing:

  • Static code analysis fails as the payload lies in docstrings or is masked.
  • Signature-based filters cannot match, as no file contains the intact malicious payload.
  • Near-duplicate detection can mitigate certain variants, although attackers can circumvent this with randomized whitespace or comments.

Model-based defenses:

  • Activation clustering or spectral methods are limited, requiring labeled poisons or large clean sets, and often yield high false positive rates for sequence-generation tasks.
  • Fine-pruning (as in Liu et al., 2018) marginally reduces attack success (attack@10 fell by 30–50 % for TrojanPuzzle) but also harms model utility (notable HumanEval and perplexity drops).

Open research problems:

  • The development of effective poison-agnostic model-sanity tests for generator backdoors remains unresolved.
  • There is no proven data cleansing or robust fine-tuning protocol for sequence-generation tasks that can reliably mitigate TrojanPuzzle-like attacks.

TrojanPuzzle defines a new class of covert, payload-concealing data poisoning methods for code-suggestion LMs. By exploiting randomized puzzle exposures in docstrings, it enables models to reconstruct malicious completions for targeted trigger contexts, thus presenting a robust threat to LMs trained on unvetted code corpora (Aghakhani et al., 2023).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to TrojanPuzzle Attack.