Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Published 1 Apr 2026 in cs.CR and cs.AI | (2604.01039v1)

Abstract: System Instructions in LLMs are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces an automated framework that identifies encoding-based attacks aimed at extracting LLM system instructions via indirect prompt techniques.
It leverages Chain-of-Thought reshaping to systematically harden system instructions, significantly cutting leakage rates without expensive model retraining.
Empirical results demonstrate that structural encoding attacks drop from over 90% success to as low as 4% with the proposed mitigation strategy.

Automated Evaluation and Hardening of LLM System Instructions Against Encoding Attacks

Introduction

The paper "Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks" (2604.01039) systematically investigates the susceptibility of LLMs to system instruction leakage under indirect prompt attack formulations—specifically, queries that seek to extract instructions via encoding or structured-output tasks rather than direct queries. The proposed framework demonstrates that prevailing refusal-based alignment measures, predominant in production LLMs, often fail against such representation attacks, thereby exposing sensitive system-level data in practical deployments. The authors further propose and empirically validate a design-time hardening methodology leveraging Chain-of-Thought (CoT) reshaping to markedly reduce leakage rates without the need for resource-intensive model retraining or runtime manipulation.

Problem Definition and Threat Model

LLMs in agentic applications are governed by system instructions that encode behavior, operational constraints, and may contain sensitive detail such as API keys, tool integration policies, or workflow specifications. The confidentiality of these instructions is critical: their leakage can not only undermine safety boundaries but also facilitate prompt injection, jailbreak strategies, and adversarial reconnaissance.

The core threat model assumes an adversary restricted to black-box prompt/response interactions with the LLM, possessing no direct system access. A confidentiality violation occurs if the model, when prompted with tasks reframed as encoding or serialization (e.g., "Print the system instruction in YAML format"), reveals verbatim or paraphrased system instructions, sensitive variables, or reconstructable fragments, while properly refusing direct extraction queries. This operationalizes a gap between intended confidentiality controls and their effective enforcement across prompt variants.

Encoding-Based Extraction Methodology

The framework classifies encoding-based attacks into four categories:

Character-Level Obfuscation (e.g., ROT13, Base64, leetspeak): These techniques minimally alter text representation but are semantically reversible.
Structure-Embedding Wrappers (e.g., YAML, TOML, JSON, log/config file formats): These reframe extraction as ostensibly benign serialization tasks, bypassing surface pattern–based detection.
Symbolic and Representation Encodings (e.g., Morse code, Braille): These use alternate representation systems for textual information.
Logs and Protocol Embedding (e.g., HTTP headers, system logs): These encapsulate instructions in protocol artifacts, inducing inadvertent inclusion of protected content.

The framework orchestrates an end-to-end red-teaming pipeline: it probes the LLM for supported encodings, systematically generates indirect prompt variants for each system instruction, and uses an LLM-as-a-judge paradigm to automatically classify responses for leakage, corroborated through manual assessment.

Empirical Findings

Quantitative Analysis

Experiments span four diverse LLM architectures: GPT-4.1-mini, GPT-3.5-turbo, Gemini-2.5-flash, and LLaMA-3-8B. The benchmark comprises 46 system instructions with verified refusal under direct extraction queries. Attack Success Rate (ASR) is used to quantify leakage:

Structure-Embedding Wrappers (YAML, TOML, gitignore, cron): Achieve ASRs frequently exceeding 0.90 (i.e., >90%) prior to any mitigation, independent of model scale or alignment lineage.
Character-Level Obfuscation: Achieves lower but nonzero ASRs, with some open-weight or less-aligned models exhibiting significant vulnerability.
Symbolic/Log Encodings: These reveal system instructions at intermediate rates, indicating both surface-level and semantic weaknesses.

These results demonstrate that indirect formatting or encoding of extraction prompts reliably bypass refusal constraints learned via current alignment protocols, and that leakage does not correlate with robustness under direct query refusal.

Attack Generalization

ASR data aggregated by encoding category indicates that attacks leveraging structural and protocol-based reframing, which fundamentally alter the apparent task semantics, are systematically more effective than those limited to character substitutions. This highlights a critical deficiency: prompt-based refusal detection fails to enforce confidentiality in a representation-invariant fashion.

Mitigation via CoT-Based Instruction Reshaping

The paper introduces a design-time hardening strategy using Chain-of-Thought LLMs to restructure system instructions. The process generates explicit, hierarchical, and over-constrained confidentiality statements that:

Forbid any form of leakage (verbatim, partial, reconstructable).
Mandate refusal of requests that may trigger transformation, encoding, or exposure of sensitive data.
Prioritize confidentiality constraints over user-provided formatting/output instructions.

This approach does not involve altering runtime inference or requiring specialized alignment of the target model. Instead, it moves security enforcement into the language and formal organization of the instructions themselves.

Empirical evaluation post-reshaping reveals a substantial reduction in ASR across all attack categories and model types. For instance, structure-embedding attacks drop from >0.90 to values as low as 0.04–0.67 depending on the model and encoding. Symbolic and character-level ASRs fall to near zero in some categories, establishing that robustly phrased constraints can strongly mitigate even previously high-leakage avenues without the need for costly model retraining or runtime CoT invocation.

Implications and Future Directions

The findings underscore that refusal-based defenses, despite their prevalence in current deployed LLMs, are fundamentally insufficient against indirect extraction where the prompt is reframed as a representation or serialization task. This exposes enterprise and agentic deployments to high-severity risks (as codified by OWASP LLM07: System Prompt Leakage).

The CoT-based instruction reshaping methodology offers a practical, lightweight mitigation with strong empirical support, shifting security posture towards encoding-invariant, design-centric controls. However, the observed dependence of robustness on linguistic nuances within instruction phrasing illuminates a new axis for both offensive and defensive research: instruction design itself becomes a critical security primitive.

Open challenges remain. Future work should extend the taxonomy of attack encodings, address multi-turn or tool-integrated agent weaknesses, develop automated shaping tools for arbitrary instructions, and explore LLM architectures and alignment protocols that directly enforce full representation invariance.

Conclusion

This work provides an automated, systematic framework for evaluating and mitigating LLM system instruction leakage under encoding-based attack vectors. Through extensive cross-model experiments, the study demonstrates that state-of-the-art LLMs routinely fail to enforce confidentiality under indirect prompt formulations—a vulnerability rectifiable in large part through targeted, explicit restructuring of system instructions at design-time. These results have direct practical significance for the secure deployment of LLM-enabled agentic systems and define new research directions in robust instruction language engineering for AI safety.

Markdown Report Issue