Poisoned Identifiers Survive LLM Deobfuscation: A Case Study on Claude Opus 4.6

Published 5 Apr 2026 in cs.CR, cs.AI, and cs.SE | (2604.04289v1)

Abstract: When an LLM deobfuscates JavaScript, can poisoned identifier names in the string table survive into the model's reconstructed code, even when the model demonstrably understands the correct semantics? Using Claude Opus 4.6 across 192 inference runs on two code archetypes (force-directed graph simulation, A* pathfinding; 50 conditions, N=3-6), we found three consistent patterns: (1) Poisoned names persisted in every baseline run on both artifacts (physics: 8/8; pathfinding: 5/5). Matched controls showed this extends to terms with zero semantic fit when the string table does not form a coherent alternative domain. (2) Persistence coexisted with correct semantic commentary: in 15/17 runs the model wrote wrong variable names while correctly describing the actual operation in comments. (3) Task framing changed persistence: explicit verification prompts had no effect (12/12 across 4 variants), but reframing from "deobfuscate this" to "write a fresh implementation" reduced propagation from 100% to 0-20% on physics and to 0% on pathfinding, while preserving the checked algorithmic structure. Matched-control experiments showed zero-fit terms persist at the same rate when the replacement table lacks a coherent alternative-domain signal. Per-term variation in earlier domain-gradient experiments is confounded with domain-level coherence and recoverability. These observations are from two archetypes on one model family (Opus 4.6 primary; Haiku 4.5 spot-check). Broader generalization is needed.


Authors (1)

Luis Guzmán Lorenzo

Summary

The paper demonstrates that poisoned identifiers persist in deobfuscated outputs even when explicit verification prompts are applied.
It reveals that task framing notably influences propagation, with 'fresh implementation' instructions reducing identifier poisoning significantly.
The study uncovers methodological vulnerabilities in LLM-based code analysis, highlighting the need for enhanced hardening strategies in security pipelines.

Identifier Poisoning and Task Framing in LLM-Based Deobfuscation: Findings from Claude Opus 4.6

Overview

This paper presents a systematic analysis of identifier name persistence, and specifically identifier poisoning, during LLM-based JavaScript deobfuscation, focusing on Anthropic's Claude Opus 4.6. Across 192 inference runs and two code archetypes, the study shows that poisoned (i.e., semantically incorrect or adversarial) identifier names propagated from the obfuscated artifact into the deobfuscated output in every baseline run, even when the model's commentary correctly described the underlying computation. Explicit prompt-level instructions to verify and correct identifiers had no measurable effect. Task framing did: asking the model to “write a fresh implementation” nearly eliminated propagation, while “deobfuscate this” and similar translation frames preserved the poisoned identifiers. These results expose a concrete weakness in LLM-based deobfuscation pipelines, with implications for automated code analysis and security toolchains.

Experimental Protocol and Core Findings

The experiments encompass two code archetypes—a force-directed graph simulation and an A* pathfinding algorithm—subjected to synthetically constructed obfuscated “pills” with variable identifier poisoning. Phase B (the main experimental body) uses 183 runs on Claude Opus 4.6 and 9 spot checks on Haiku 4.5, with systematic manipulation of prompt framing, obfuscation gradients, and identifier replacement strategies. Scoring is performed at the code-block level by automated regex extraction with independent validation.
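As a rough illustration of what code-block-level regex scoring could look like, the sketch below extracts fenced code blocks from a model response and reports which poisoned terms appear inside them. It is not the paper's scoring script: the function names, the fenced-block output convention, and the case-insensitive substring match are all assumptions made for this example.

```python
import re

# Build the fence marker indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_code_blocks(model_output: str) -> list[str]:
    """Return the body of every fenced code block in a model response."""
    return CODE_BLOCK_RE.findall(model_output)


def propagated_terms(model_output: str, poisoned_terms: list[str]) -> set[str]:
    """Return the poisoned terms that appear inside any code block.

    Matching is a case-insensitive substring search (an assumption), so
    "combustion" also matches an identifier such as applyCombustion.
    Prose outside code blocks is ignored.
    """
    code = "\n".join(extract_code_blocks(model_output)).lower()
    return {term for term in poisoned_terms if term.lower() in code}
```

A substring match also counts occurrences inside code comments, so a stricter scorer would strip comments first; hand-checking a sample of matches, in the spirit of the independent validation mentioned above, remains advisable.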

Three principal empirical phenomena are documented:

Persistence of Poisoned Identifiers: In all baseline deobfuscation runs (physics artifact: 8/8; pathfinding: 5/5), poisoned identifiers in the obfuscated string table appear verbatim in the model’s deobfuscated output, regardless of whether the names fit the semantics of the code.
Dual-Representation Pattern: In the majority of positive-propagation runs (e.g., 15/17 on the physics artifact), the model produces contradictory output—the code contains poisoned (incorrect) identifiers while adjacent comments or prose accurately describe the true semantics of the code block. This tension between variable names and model commentary is manually verified and corroborated via supporting examples.
Task Framing Effect: Reframing the instruction from “deobfuscate” (translation frame) to “write a fresh implementation from scratch” (generation frame) sharply reduces identifier propagation: from 100% to 0-20% on the physics artifact and to 0% on the pathfinding artifact, while preserving the checked algorithmic structure. Intermediate “rename with correct names” prompts reduce but do not eliminate propagation, and tend to produce outputs that mix correct and poisoned identifiers.
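The dual-representation pattern can be made concrete with a small heuristic. The sketch below flags lines of JavaScript output where a poisoned term appears in the code while the true term appears in the trailing `//` comment. The paper reports verifying this pattern manually; the function name, the input mapping, and the line-comment convention here are assumptions for illustration only.

```python
def dual_representation_lines(
    code: str, poisoned_to_true: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Find lines whose code uses a poisoned name while the comment names the true concept.

    Returns (poisoned_term, true_term, stripped_line) for each hit.
    Only trailing '//' line comments are considered (a simplification).
    """
    hits = []
    for line in code.splitlines():
        code_part, sep, comment = line.partition("//")
        if not sep:
            continue  # no line comment on this line
        for poisoned, true_term in poisoned_to_true.items():
            in_code = poisoned.lower() in code_part.lower()
            in_comment = true_term.lower() in comment.lower()
            if in_code and in_comment:
                hits.append((poisoned, true_term, line.strip()))
    return hits
```

Splitting on the first `//` misreads lines that contain URLs or string literals with slashes, so this is a screening aid rather than a replacement for manual review.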

Explicit verification prompts (e.g., “cross-check all decoded names”) and adversarial alerting had no effect: all 12 runs across the 4 verification variants propagated the poisoned identifiers without correction.
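For readers who want to tabulate such results themselves, the following sketch computes a per-condition propagation rate from a list of run records. The record layout and condition labels are assumptions; the only counts used are those reported in the text (baseline physics 8/8, baseline pathfinding 5/5, verification prompts 12/12).

```python
from collections import defaultdict


def propagation_rates(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each condition label to the fraction of its runs that propagated poisoned names."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for condition, propagated in runs:
        totals[condition] += 1
        hits[condition] += int(propagated)
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Run records rebuilt from the counts reported in the text; labels are hypothetical.
reported_runs = (
    [("baseline/physics", True)] * 8
    + [("baseline/pathfinding", True)] * 5
    + [("verification", True)] * 12
)
rates = propagation_rates(reported_runs)
```

With N between 3 and 6 per condition, rates like these are best read as counts rather than as precise percentages.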

Mechanistic Insights and Domain-Semantic Interactions

The consistent persistence of poisoned identifiers, especially in the face of explicit verification prompts, argues against simple prompt-following failure or accidental oversight as the explanation. Instead, the evidence suggests an inductive bias stemming from the translation-like workflow of deobfuscation, in which the LLM treats decoded string-table entries as the authoritative “source text” for identifier names. Only when the task shifts out of the translation frame does the model actively generate or select semantically appropriate identifiers.

Domain-level analysis shows that identifier propagation is not reliably sensitive to the semantic fit of the replacement names. Terms with zero semantic fit (e.g., “combustion” for “repulsion” or “invoice” for a heuristic function) propagate at the same rate as plausible ones, provided the replacement table as a whole does not suggest a coherent alternative domain. Only when the replacements form a coherent, recognizable alternative domain (e.g., engineering or financial terms across the board) does the model correct names to fit the code's semantics. The matched controls thus rule out per-term semantic fit as the mechanism behind correction under the translation framing; per-term variation in the earlier domain-gradient experiments is confounded with domain-level coherence and recoverability.
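To make the string-table mechanism concrete, here is a toy model of a purely literal decode, the behavior that a translation frame appears to encourage. The obfuscated snippet, the `_t(i)` accessor, and the table contents are invented for illustration (the two poisoned terms are the examples quoted above); real obfuscators use more elaborate accessors.

```python
import re

# Hypothetical decoded string table. Both entries are poisoned:
# index 0 stands in for "repulsion", index 1 for a heuristic function.
STRING_TABLE = ["combustion", "invoice"]

# Hypothetical obfuscated JavaScript that looks names up through an accessor _t(i).
OBFUSCATED = "node[_t(0)] = k / (d * d); cost = g + this[_t(1)](a, b);"


def decode_literally(source: str, table: list[str]) -> str:
    """Replace each [_t(i)] lookup with .<table[i]>, trusting the table verbatim."""
    return re.sub(
        r"\[_t\((\d+)\)\]",
        lambda match: "." + table[int(match.group(1))],
        source,
    )


decoded = decode_literally(OBFUSCATED, STRING_TABLE)
# decoded == "node.combustion = k / (d * d); cost = g + this.invoice(a, b);"
```

The inverse-square expression still reads as a repulsion force to anyone who inspects it, yet the emitted name says `combustion`. That gap between structure and label is what the matched controls probe.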

Numerical constant poisoning (e.g., using $6.283$ for $\tau = 2\pi$) also propagates consistently, often with annotations noting the approximate equivalence but without correcting the value.
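The size of the discrepancy in that example is easy to check. The short sketch below compares the truncated constant with the exact value of $2\pi$ from Python's standard library; the helper name is an assumption.

```python
import math


def constant_error(poisoned: float, exact: float) -> tuple[float, float]:
    """Return (absolute error, relative error) of a poisoned constant."""
    absolute = abs(exact - poisoned)
    return absolute, absolute / abs(exact)


# math.tau is 2*pi; 6.283 is the truncated value from the example above.
abs_err, rel_err = constant_error(6.283, math.tau)
```

The absolute error is a little under 2e-4, small enough to pass as "approximately equal" in a comment while still being a different number from the one the algorithm calls for.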

Implications and Theoretical Consequences

Practical Implications: These results highlight a significant obstacle for any automated LLM-based pipeline for code deobfuscation, malware identification, or semantic recovery. Under a deobfuscation framing, identifier poisoning survives regardless of explicit verification instructions, giving adversaries a ready avenue for manipulation. Identifier poisoning does not prevent human or sufficiently agentic analysis of code structure, but it raises the cost and lowers the reliability of automated or casual LLM-assisted reconstruction. Ensuring that identifiers in LLM-generated code match the code's semantics requires not just validators but changes to workflow framing or input representation.
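One possible reading of "changes to input representation" is to hide the decoded names from the model altogether. The sketch below replaces each string-table entry with a neutral placeholder before the table is shown to an LLM, keeping the original mapping outside the prompt. This is a hypothetical mitigation, not one evaluated in the paper, and the function and placeholder names are assumptions.

```python
def mask_string_table(table: list[str]) -> tuple[list[str], dict[str, str]]:
    """Replace each decoded entry with a neutral placeholder.

    Returns the masked table (what the model would see) and a mapping from
    placeholder back to the original entry (kept outside the prompt).
    """
    masked = [f"name_{index}" for index in range(len(table))]
    original_by_placeholder = dict(zip(masked, table))
    return masked, original_by_placeholder


masked_table, original_by_placeholder = mask_string_table(["combustion", "invoice"])
```

Masking everything is only safe for entries used purely as identifiers. Strings with runtime meaning, such as URLs, DOM property names, or event names, would need to be kept, and telling the two kinds apart is itself an analysis problem.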

Theoretical Consequences: The observed dual-representation phenomenon (contradictory code vs. commentary) points to a dissociation in the model's code output: the emitted identifiers follow recovered lexical artifacts, while the commentary reflects reconstructed semantic understanding. Furthermore, the demonstrated sensitivity to task framing suggests that model behavior in code manipulation is not purely a function of base instruction-following, but is heavily influenced by latent task schemas and default source-target mappings activated by prompt phrasing.

Security and Toolchain Integration: For practitioners, the work suggests that adversarial actors can reliably poison both surface-level and deep semantic controls in LLM-facing code obfuscation schemes. The model’s behavior indicates that even highly capable LLMs may default to source-preserving translation when reconstructing code, and that hardening a pipeline may require both reframing the task and adding separate stages, such as a post-deobfuscation renaming pass handled by a second agent.

Future Directions

The limitation to Anthropic models and two code archetypes marks an important boundary: generalization to GPT-4o, Gemini, Mixtral, and other architectures remains outstanding. Future research should systematically compare propagation and correction phenomena across model families, code domains, and richer obfuscation strategies (e.g., polymorphic or adversarially-coherent domains). The robustness of numerical constant propagation also warrants deeper mechanistic probing. Multi-agent consensus-based approaches, divergence resolution under conflicting string tables, and controlled experiments on prompt context and agentic tool use (vs. single-shot inference) are identified as immediate follow-up areas.

The Supplementary Materials (available at https://github.com/Kieleth/obfuscated-sentinel) facilitate independent validation and extension of all Phase B findings.

Conclusion

This study demonstrates that LLM-based JavaScript deobfuscation, as instantiated in Claude Opus 4.6, is highly susceptible to persistent identifier poisoning via obfuscated string tables. Explicit verification prompts do not reduce persistence; the effect is instead tightly bound to task framing, with generation frames yielding substantially higher naming correctness without sacrificing the checked algorithmic structure. Identifier correction depends not on per-term semantic fit but on recognition of domain-level coherence. This decoupling of semantic understanding from the literal reuse of decoded names carries consequences for both security engineering and the design of LLM-driven program analysis. Cross-model, multi-archetype validation will be crucial in mapping the scope of these behavioral regularities and their remediations.

Reference: "Poisoned Identifiers Survive LLM Deobfuscation: A Case Study on Claude Opus 4.6" (2604.04289)
